Site Reliability Engineer (SRE) - remote
About the Role
The Senior Site Reliability Engineer (SRE) will bring deep expertise in designing and supporting highly scalable, highly available infrastructure and applications in Kubernetes, and in promoting microservice design patterns in complex cloud environments. This role will serve as a subject matter expert on all aspects of our containerized deployments, including configuration, scaling, and upgrades. The ideal candidate will be passionate about mentoring other team members and departments on the adoption of new technologies and design principles, and about promoting DevOps culture and collaboration. This role will also work closely with other teams to ensure deployments are successful in both production and non-production environments.
What You’ll Do
- Troubleshoot complicated, cross-platform issues spanning OS, AWS, networking, and databases
- Work closely with Development, QA and Production Support teams to make sure releases are on time and successful
- Ensure the reliability and security of the infrastructure while building proactive, dynamic monitoring, alerting, and metrics solutions to make sure each environment meets its SLA requirements
- Build infrastructure in both AWS and GCP using Terraform
- Minimize or eliminate manual hand-offs and link all automated workflows
- Support the Kubernetes application/infrastructure in both production and non-production environments
- Establish and test disaster recovery policies and procedures
- Own the resiliency and scalability of the infrastructure
- Track and apply all required patches
- Create and maintain technical documentation
Skills & Qualifications
- BA in Computer Science or Information Systems, or a combination of education and related work experience
- 5 years of Site Reliability Engineering (SRE) experience
- 5 years of DevOps experience
- 2 years of Kubernetes experience
- 3 years of cloud platform experience with AWS and GCP
- 5 years of production infrastructure experience
- Strong coding experience in Ruby, Python or similar languages
- Proven experience automating routine, repeatable tasks
- Strong sense of ownership, ability to work independently, and a proven track record of driving products and changes
- Strong experience in production support and operations
- Strong experience in monitoring application/infrastructure performance and availability while creating metrics for management use
- Strong experience in Terraform, Ansible, Jenkins, Linux, Docker, Helm, Elasticsearch, Prometheus
- Strong automation and problem-solving skills, and the ability to follow through to completion
- Ability to wear multiple hats and multitask effectively in a fast-paced environment
- Capable of working independently as well as part of a group