Senior Site Reliability Engineer (Remote, US-based)

Posted 3 years ago
Stack Overflow

My client is a VC-backed company with a SaaS-based solution that is changing the way life science companies bring their products to market. They are going through a phase of high growth and are looking to bring on board a talented Senior Site Reliability Engineer to help build and scale their growing SaaS platform.

As a member of their Product Engineering Team, you'll be the resident expert on their cloud-based hosting environment, working as part of a cross-functional team that designs, builds, deploys, and runs the product. You will be responsible for reliability, security, and compliance.

The platform was designed as cloud-native from the outset and takes advantage of current technologies including Docker, Kubernetes, AWS (EKS, S3, Aurora RDS, IAM, KMS, auto-scaling groups, and load balancers), NGINX, MongoDB, Kafka, APM instrumentation, and an automated CI/CD pipeline. Everything runs on Linux. Many of the tools are built using Bash and Python.

Role responsibilities

  • Develop, test, and deploy secure, reliable, highly available, and scalable infrastructure using AWS technologies with Terraform or AWS CDK
  • Navigate application and infrastructure stack fluently
  • Build, configure, and maintain CI/CD pipelines to optimize application build, test, deployment, and operations
  • Collaborate closely with technology and product teams
  • Develop, deploy, and maintain tools to drive efficiency, repeatability, security, and compliance
  • Define, measure, and improve operational key performance indicators (e.g., MTTR, lead time, change failure rate)
  • Participate in on-call rotations with engineering
  • Deploy production code and configuration updates
  • Implement a complete backup and disaster recovery strategy for all test and production systems
  • Help drive DevOps as an integral part of a strategic software team

Candidate requirements

  • You have at least 3 years of hands-on experience in site reliability, DevOps, or similar roles
  • When it comes to infrastructure, you always think of automation first: you avoid manual changes in production at all costs
  • You live and breathe AWS: services and acronyms such as EKS, S3, RDS Aurora, ELB, VPC, MSK, WAF, EBS, CloudFormation, and CloudWatch are part of your daily vocabulary, and you can't wait to read about new services that could make your life even easier
  • You strive to be the first to know about production issues, even before they occur, through carefully designed monitoring, and you work with the team to resolve them before anyone else notices
  • You practice and preach good security: you would never send credentials in cleartext, you use two-factor authentication whenever possible, you never put credentials in code, and you make sure the latest security patches are always applied
  • You make sure there are always current backups of everything, and you know they are reliable because you routinely test restores
  • You are passionate about maintaining consistent staging and testing environments, because you wouldn’t dream of deploying something to production untested
  • You love working with developers and testers and making their lives even easier by setting up super streamlined, reliable continuous integration and deployment systems
  • You can install and configure Linux in your sleep, and troubleshoot network issues just as easily