Site Reliability Engineer - remote

YouGov
Posted 4 years ago
Stack Overflow

Role: 

Crunch.io, part of YouGov PLC, is looking for an energetic, passionate Site Reliability Engineer. You will join a talented team responsible for expanding the Crunch platform and its operational excellence. We are inviting you to join our small, fully remote team of developers and operators helping make our platform faster, more secure, and more reliable.

Our Tech Stack:

We currently run our in-house production Python code against Redis, MongoDB, and Elasticsearch services. We proxy API requests through NGINX, load balance with ELBs, and deploy our React web application to the AWS CloudFront CDN. We use EFS for persistent storage. Our current CI/CD process is built around GitHub, Jenkins, and Blue Ocean, and includes unit, integration, and end-to-end tests and automated system deployments. We deploy to Auto Scaling groups using Ansible and cloud-init.

In the future (and to some degree currently), all or part of our platform will include Kubernetes, Helm, Flux v2, and Spinnaker.

Experience required:

  • 5+ years of experience as an on-call senior DevOps, SRE, or Cloud Operations engineer
  • Experience implementing Terraform best practices for infrastructure in AWS 
  • Proven track record of designing, building, sizing, optimizing, and maintaining cloud infrastructure, especially in AWS
  • Proven experience writing automation glue code and managing production infrastructure in AWS
  • Proven track record of designing, implementing, and maintaining full build/release pipelines in a cloud environment (Jenkins experience preferred)
  • Experience with containers and container orchestration tools (Docker / Kubernetes / Helm production experience required; Spinnaker experience preferred)
  • Experience with improving developer experience with desktop tooling and scripts
  • Expertise in Linux system administration (2 years) and networking technologies (IPv6 nice to have)
  • Knowledge of NoSQL database operations and concepts
  • Experience with MongoDB, Elasticsearch, and Redis (at least 1 year)
  • Capability to write programs/scripts to solve both short-term systems problems and to automate repetitive workflows (Python and Bash preferred)
  • Exceptional English communication and troubleshooting skills
  • Understanding of, and experience implementing, security best practices in AWS, Linux, Kubernetes, and the other listed services, including pen testing, internal vulnerability analysis, and incident response
  • Experience in monitoring, system performance data collection and analysis, and reporting

What will I be doing?

  • Monitor and detect emerging customer-facing incidents on the Crunch platform, assist in their proactive resolution, and work to prevent them from occurring
  • Coordinate and participate in a weekly on-call rotation, where you will handle short-term customer incidents (from direct surveillance or through alerts via our Support Engineers)
  • Diagnose live incidents; differentiate between platform issues and usage issues across the entire stack (hardware, software, application, and network) in physical datacenter and cloud-based environments; take the first steps towards resolution; and see the problem through to resolution
  • Automate routine monitoring and troubleshooting tasks
  • Provide consistent, high-quality feedback and recommendations to our product managers and development teams regarding product defects or recurring performance issues
  • Drive improvements and advancements to the platform in areas such as container orchestration, service mesh, and request/retry strategies
  • Build frameworks and tools to empower safe, developer-led changes, automate manual steps, and provide insight into our complex system
  • Work directly with the team to enhance the performance, scalability, and observability of resources across multiple applications, ensure that production handoff requirements are met, and escalate issues
  • Embed into SRE projects to stay close to the operational workflows and issues
  • Evangelize the adoption of best practices in relation to performance and reliability across the organization
  • Maintain project and operational workload statistics
  • Promote a healthy and functional work environment
  • Work with the Team Lead and/or external security contractors to carry out periodic penetration testing, and drive resolution of any issues discovered
  • Administer a large portfolio of SaaS tools used throughout the company