Site Reliability Engineer (SRE) - Asia Pacific (APAC) - remote

Canonical

Posted 4 years ago

What is Canonical?

Canonical is a growing international software company that works with the open-source community to deliver Ubuntu, “the world’s best free software platform”. Our mission is to release the potential of free software in the lives of individuals and organisations. Our services are helping individuals and businesses worldwide to reduce costs, improve efficiency and enhance security with Ubuntu.

Job Summary:

The IS Team at Canonical supports and maintains all of Canonical’s production services. Members of the team use real-life operational experiences to contribute to product improvements. The team is in charge of running services used by over 60 million Ubuntu users.

As an SRE, you’ll be in a unique position to improve Canonical products and the open-source technologies they’re based on. You’ll do this by providing critical feedback to developers on how their products operate at scale, as well as by writing code, submitting bugs, and working with other teams within the company. You will also be encouraged to develop and submit fixes and enhancements directly and to collaborate with development teams during the design and implementation phases.

You’ll be part of a global team of SREs that work together and support each other to provide the best possible services to our company, Canonical’s customers and the Ubuntu Community.

As a Site Reliability Engineer you will:

Understand and operate cloud and container technology (OpenStack and Kubernetes) from kernel to dashboard, both for Canonical and for its clients
Maintain operational responsibility for all of Canonical’s core services, networks, and infrastructure
Develop new features and improve the resilience and scalability of the existing cloud and container portfolio at Canonical
Automate operations for reuse across the world’s largest companies, taking into consideration the complexities of distributed systems
Develop skills in troubleshooting, capacity planning, and performance analysis
Collaborate with development teams to design service architecture, documentation, playbooks, policies and operational procedures
Assist and collaborate with globally distributed engineering, operations, and support peers
Be given uninterrupted software development time to collaborate on larger coding projects and automate manual tasks
Carry final responsibility for time-critical escalations

The successful Site Reliability Engineer candidate will have:

Bachelor's degree or higher, preferably in computer science or a related engineering field
Python software development experience on large projects
Strong modern engineering background (peer-review, unit testing, SCM, CI/CD, Agile)
Preference for treating configuration as code and for automating to solve problems reliably
Extensive knowledge of cloud computing concepts and technologies
Practical knowledge of Linux networking, routing, and firewalls
Hands-on experience administering Linux servers for personal or professional use
Able to communicate clearly and effectively in English over email, IRC, video or voice calls, and in person
Self-driven, able to troubleshoot from kernel to web, and willing to ask others when appropriate
Willingness to be flexible and the ability to learn new things quickly
Enthusiasm for the challenges of fast-changing environments
Happy to work within distributed teams
Passion for and familiarity with open source, especially Ubuntu or Debian
