As a Site Reliability Engineer (SRE) at Upbound, you’ll play a vital role in the production services the company is building its business on. You’ll apply engineering principles to design and build highly reliable, scalable infrastructure and services; deployment pipelines and processes for releasing updates frequently and safely; and monitoring and alerting systems to ensure it all stays healthy.
In this role, you will be…
- Taking ownership of the health and reliability of the live production service and infrastructure, ensuring that SLOs/SLAs are consistently met
- Designing, building, and automating critical portions of the Upbound Cloud service infrastructure
- Troubleshooting and problem-solving effectively to remediate infrastructure-related issues that affect service health
- Reporting and fixing bugs in private and public projects
- Providing routine maintenance and support of Kubernetes-based infrastructure, including extending the Kubernetes API and its functionality via CRD/Controller applications
- Making technology decisions on behalf of the business, procuring the right technology, and designing and implementing a self-service solution for the teams that consume Upbound infrastructure
- Collaborating with the development teams to assess and recommend technologies that support the company's organizational needs
- Balancing tradeoffs between enterprise and open source technologies to better serve Upbound
- Supporting the full project lifecycle: discovery, analysis, architecture, design, documentation, building, migration, automation, and production-readiness