Lead Site Reliability Engineer - remote

Posted 3 years ago  • San Mateo, CA
Stack Overflow

**Considering candidates in San Mateo, Santa Barbara, or US-based remote**

We are looking for a Lead Site Reliability Engineer who is passionate about highly available systems and the processes it takes to get there. You will work with multiple teams at Envidation and be responsible for how code is deployed, configured, and monitored, as well as for the availability, latency, change management, emergency response, and capacity management of services in production. Experience with cloud architecture, cloud security, continuous integration, continuous delivery, and infrastructure as code, along with a strong operational background, is a must. More than a set of skills, we are looking for someone who is curious, collaborative, great at communicating, and always willing to learn. If you are ready to help build a great SRE organization, this is the perfect opportunity.

RESPONSIBILITIES

  • Partner with our engineering teams to properly manage and respond to production issues.
  • Ensure that proper logging, monitoring, and alerting are set up.
  • Work with teams when incidents happen to fix issues in a timely manner, understand the root cause, and drive action items so the issues do not recur.
  • Work with each team on its disaster recovery plans, including leading tabletop exercises.
  • Partner with DevOps, Test Automation, IT, Engineers, Project Managers, Quality, and leadership to identify opportunities for improvement.

QUALIFICATIONS

Minimum Qualifications:

  • Experience in site reliability or DevOps
  • Experience in leading and building out an SRE function
  • Hands-on experience managing and supporting Linux production environments
  • Experience with AWS
  • Experience with Incident Management including
  • Experience with Kubernetes
  • Experience with CI/CD tools
  • Strong written and verbal communication, including the ability to quickly synthesize and analyze inputs from a variety of sources

Preferred Qualifications:

  • Experience with AWS EKS, ECS, Fargate
  • Experience with the Atlassian stack (Jira, Confluence, Opsgenie)
  • Experience with Datadog