About the product and your role:
The Urbantz solution is critical to our customers and sits at the core of their business. Reliability is our daily concern, and we need a Site Reliability Engineer responsible for the overall health and performance of our platform.
You'll be the first SRE at Urbantz. Part of your responsibility will be to help us hire people to grow the team and, if you are interested in such a position, to lead this team in the future.
Today we are facing common challenges of a fast-growing company:
- Our mindset is "You build it, you run it". But at this stage, we lack "Ops" knowledge. Our goal is to avoid silos, and creating an Ops team was never an option.
- Following the DevOps principles, we want to bring an Ops mindset into our Stream-aligned Teams. You will first act as a doer by implementing good practices and processes, but one of your key focus areas will be to act as an enabler.
How we are organized:
- We follow the principles of the book Team Topologies (https://teamtopologies.com/) with autonomous and cross-functional teams (called Stream-aligned Teams), and a Platform Team that builds a digital platform (https://martinfowler.com/articles/talk-about-platforms.html) for the other teams.
- We have two Stream-aligned Teams composed of ~5 Software Engineers, 1 QA, 1 PM and 1 EM
- We have a Platform Team of 3 Platform Engineers
What you will do:
- Define SLOs in collaboration with the entire Engineering Team
- Improve observability of our systems through monitoring and alerting
- Actively contribute to a culture of blameless post-mortems by conducting post-incident reviews
- Improve and document our release process, service setup, teardown and failover
- Create an operational playbook/runbook
- Put in place disaster recovery testing, run at least annually
- Optimize on-call rotations and processes
- Teach engineers in stream-aligned teams about SRE practices
Your profile:
- Understanding and experience in managing cloud infrastructure and platforms, such as AWS and Azure
- Experience with production system administration and web operations
- Experience with Terraform and Kubernetes
- Programming experience with JavaScript and Node.js
- Good understanding of TCP/IP, DNS and load balancer setup and troubleshooting
- Experience in massive-scale web operations
- MongoDB and general NoSQL database knowledge, including performance and optimization
- Experience with monitoring tools (Grafana)
- Excellent information management practices, such as detailed documentation, usage of wikis, and other collaboration tools
- Strong understanding of continuous integration and continuous deployment methodologies
- Excellent written and verbal communication skills
What’s in it for you?
- Join a winning team. Great people who work hard and have fun doing it.
- A fast-growing company where you are given a lot of autonomy and trust.
- Enter the promising, ever-growing world of last-mile logistics.
- A competitive package
- You can make a huge impact, and grow with the company.
- If you want to just “work” somewhere, we probably aren’t the right place. If you want to make a serious difference with positive, real-world implications, then we want to see you!