Senior Site Reliability Engineer - Remote

San Francisco, CA

Role Mission: Theorem's Site Reliability Engineer will be responsible for accelerating the delivery of working, reliable software into production through the systematic application of software automation to the firm’s IT operations.  

Level: Senior engineer with prior career experience and domain expertise in engineering operations, focused directly on cloud infrastructure. 

What You'll Be Doing

Serve as the technical operations lead for the following systems:

  • Cloud configuration (AWS + IONOS / Terraform)
    • Complete a turndown of Theorem's IONOS datacenter presence.
    • Codify Theorem's cloud environment entirely in Terraform so that it can be reconfigured via a CD pipeline.
    • Reduce the privileges of technical employees so that production infrastructure must be created via Terraform.
  • Container orchestration (our Kubernetes clusters)
    • Create an access audit log for our production Kubernetes clusters.
    • Perform cluster software upgrades.
    • Provision a high performance Ceph storage cluster.
  • Continuous integration & deployment (currently Jenkins)
    • Build a continuous delivery pipeline for deploying changes to alerting infrastructure.
    • Explore alternative cloud-native CI/CD offerings that could replace Jenkins.
  • Monitoring & alerting (Prometheus / Alertmanager / PagerDuty)
    • Redefine our alert-handling policies to significantly reduce the frequency of unhandled alerts.
    • Create a cost-allocation dashboard that associates cloud costs with business programs and shows how Theorem spends its money on compute resources (see the sketch after this list).
    • Create an alert SLO dashboard that shows how much time we spend on operational toil and which systems are being operationally neglected.
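As a concrete starting point for the cost-allocation dashboard mentioned above, here is a minimal sketch, assuming resources carry a hypothetical "program" cost-allocation tag; the tag key, date range, and use of the boto3 Cost Explorer API are illustrative assumptions, not a prescribed implementation:

    # Minimal sketch: pull monthly AWS costs grouped by a hypothetical
    # "program" cost-allocation tag using the Cost Explorer API.
    # The tag key and date range below are examples only.
    import boto3

    ce = boto3.client("ce")

    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},  # example period
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "program"}],  # assumes a "program" tag
    )

    # Print cost per business program for each period in the range.
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            tag_value = group["Keys"][0]  # e.g. "program$data-pipeline"
            amount = group["Metrics"]["UnblendedCost"]["Amount"]
            print(f'{period["TimePeriod"]["Start"]}  {tag_value}: ${float(amount):.2f}')

A report like this could feed the dashboard directly, or be ingested into the data warehouse alongside other systems-operations data.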

Support and assist the following data infrastructure objectives:

  • Complete a database migration from AWS Redshift to Parquet in S3.
  • Deploy a distributed dataflow engine (e.g. Apache Spark) for use in the data pipeline (a minimal sketch follows this list).
  • Streamline the data pipeline deployment process.
  • Ingest systems-operations data into the data warehouse.
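Because the data pipeline work spans both the Parquet-in-S3 migration and a distributed dataflow engine, the following minimal sketch shows how migrated data might be read and aggregated with Spark; the bucket, prefix, and column names are hypothetical, and S3 credentials plus the s3a connector are assumed to be configured elsewhere:

    # Minimal sketch: read migrated Parquet data from S3 with Spark and run
    # a simple aggregation. Bucket, prefix, and column names are hypothetical;
    # S3 credentials and the hadoop-aws/s3a packages are configured separately.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("warehouse-sketch").getOrCreate()

    # Hypothetical location of the data formerly stored in Redshift.
    events = spark.read.parquet("s3a://example-warehouse/events/")

    # Example query: daily event counts, written back to S3 as Parquet.
    daily_counts = events.groupBy("event_date").count()
    daily_counts.write.mode("overwrite").parquet(
        "s3a://example-warehouse/reports/daily_counts/"
    )

    spark.stop()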

Competencies

  • The ability to work efficiently and effectively while remote (must be located within the US).
  • Experience with and knowledge of public cloud offerings such as AWS (preferred), GCP, Azure, etc.
  • Previous experience with infrastructure automation tools such as Terraform (preferred), CloudFormation, Ansible, Chef, Puppet, etc.
  • Experience with various flavors of the Linux operating system, including shell scripting in bash or zsh.
  • Working knowledge of Python.
  • Previous experience working with container orchestration systems, such as Kubernetes (preferred), Borg, Twine, Apache Mesos, etc.
  • Has a deep appreciation for continuous integration and deployment (CI/CD).
  • Is familiar with standard practices in software monitoring and alerting.
  • Has a working knowledge of SQL, data warehousing, and ETL pipelines.
  • Understanding of staging environments, integration and release testing.
  • Attention to detail
  • Hardworking and gritty
  • Sharp and fast learner
  • Transparency & intellectual honesty
  • Welcomes and adapts behavior to feedback
  • Collaborative and team-oriented
  • Ability to communicate and escalate issues

Training & Experience

  • Previous experience with, and subject-matter expertise in, aspects of build engineering. Experience using modern build systems such as Bazel, Buck, Pants, Please, etc. is desired but not a hard requirement. Candidates with no previous experience with any of these tools should signal enthusiasm for learning to use and extend Bazel.
  • Has a deep appreciation for continuous integration and deployment (CI/CD). All engineers contribute to CI/CD pipelines, but our hire will help us organize and streamline our CI pipelines to make them simple for other engineers to contribute to.
  • Has previous experience managing containerized workloads, ideally with Kubernetes, though that is not required. Our hire will eventually become a primary owner of our Kubernetes clusters.
  • Is familiar with standard practices in software monitoring and alerting. They will be expected to create software monitors and configure alert routing so that the appropriate engineers are notified of failures.
  • Understands how to use staging environments and integration testing to create system release tests. All engineers are expected to create release tests; our hire will help define standard testing environments that make it easier for others to write release tests.
  • Bachelor's degree or higher in computer science, software engineering, or related discipline.