Site Reliability Engineer - (100% remote)

Posted 3 years ago
We are looking for a Site Reliability Engineer

You will be a key member of a tight-knit group of talented engineers who are responsible for keeping our own and our customers’ Kubernetes clusters operational and healthy. You’ll also have a key role in the development of the product itself, working together with our Platform Engineers to deliver the best Kubernetes service possible.

Giant Swarm is a fast-growing open-source infrastructure management platform used by modern enterprises. Our vision is to empower developers around the world to ship great products. We are a diverse, experienced and growing team that has been fully remote since 2014 and is spread across Europe, with headquarters in Cologne.

YOUR JOB

  • You maintain, operate and upgrade our own and our customers’ Kubernetes clusters.
  • You will design, configure, build, and maintain our core infrastructure, from kernel parameters to the cloud provider templates.
  • You understand how servers and systems work and you tweak their behavior to your needs.
  • You will be responsible for our monitoring, logging and alerting.
  • You will help resolve incidents on our own and our customers’ clusters.
  • You participate in the on-call support schedule.
  • You are a go-to person in case our developers need advice regarding infrastructure.
  • You will automate all the things, and the thought of Terraform doesn’t make you cry.
  • We (and the majority of our customers) are currently mostly distributed around Europe (around UTC), so your main time zone should be somewhere between UTC-2 and UTC+2 to make communication easier.

REQUIREMENTS

  • You must have deep, hands-on knowledge of Kubernetes from both the end-user and the operational side.
  • You’re comfortable debugging systems at all levels, from kernel fundamentals right up to workloads running on Kubernetes.
  • You’re happy troubleshooting a wide variety of issues and you’re not afraid to parse thousands of lines of logs in pursuit of an answer.
  • You have good coding skills (preferably Go, but Python or similar is fine as well).
  • You have experience with maintaining infrastructure as code and you know the pros and cons of various automation tools (we use Terraform & Ansible, but Chef, Puppet and the like are also a good start).
  • You are fluent in the Cloud Native tools that run on top of Kubernetes (Prometheus, Grafana, ingress controllers, …), and you know how to use and configure them.
  • You automate all the things by writing code. Using bash scripts makes you sad :)