Site Reliability Engineer - remote

URBANTZ

Posted 3 years ago

About the product and your role:

The Urbantz solution is critical to our customers' business and sits at its core. Reliability is our daily concern, and we need a Site Reliability Engineer to take responsibility for the overall health and performance of our platform.

You'll be the first SRE at Urbantz. Part of your responsibility will be to help us hire people to grow the team and, if you are interested in such a position, to lead this team in the future.

Today we are facing the common challenges of a fast-growing company:

Our product evolves quickly

Scaling

Transforming from one to multiple stream-aligned teams

Our mindset is "You build it, you run it". But at this stage, we lack "Ops" knowledge. Our goal is to avoid silos, and creating an Ops team was never an option.
Following DevOps principles, we want to bring an Ops mindset into our Stream-aligned teams. You will first act as a doer by implementing good practices and processes, but one of your key focus areas will be to act as an enabler.

How we are organized:

We follow the principles of the book Team Topologies (https://teamtopologies.com/) with autonomous and cross-functional teams (called Stream-aligned teams), and a Platform Team that builds a digital platform (https://martinfowler.com/articles/talk-about-platforms.html) for the other teams.
We have two Stream-aligned Teams composed of ~5 Software Engineers, 1 QA, 1 PM and 1 EM
We have a Platform Team of 3 Platform Engineers

What you will do:

Define SLOs in collaboration with the entire Engineering Team
Improve observability of our systems through monitoring and alerting
Actively contribute to a culture of blameless post-mortems by conducting post-incident reviews
Improve and document our release process, service setup, teardown and failover
Create an operational playbook/runbook
Put in place disaster recovery testing, run at least annually
Optimize on-call rotations and processes
Teach engineers in stream-aligned teams about SRE practices

Your profile:

Understanding of and experience in managing cloud infrastructure and platforms, such as AWS and Azure
Experience with production system administration and web operations
Experience with Terraform and Kubernetes
Programming experience with JavaScript and Node.js
Good understanding of TCP/IP, DNS, and load balancer setup and troubleshooting
Experience in massive-scale web operations
MongoDB and general NoSQL database knowledge, including performance and optimization
Experience with monitoring tools (Grafana)
Excellent information management practices, such as detailed documentation, usage of wikis, and other collaboration tools
Strong understanding of continuous integration and continuous deployment methodologies
Excellent written and verbal communication skills

What’s in it for you?

Join a winning team: great people who work hard and have fun doing it.
A fast-growing company where you are given a lot of autonomy and trust.
Enter the promising, ever-growing world of last-mile logistics.
A competitive package
You can make a huge impact and grow with the company.
If you want to just “work” somewhere, we probably aren’t the right place. If you want to make a serious difference with positive, real-world implications, then we want to see you!
