Site Reliability Engineer/Systems Administrator - remote

OnTheGoSystems
Posted 3 years ago
We Work Remotely
We are looking for a Site Reliability Engineer (SRE) who will be responsible for the day-to-day operation of our servers, as well as load analysis and long-term optimization. We want to work with a person who deeply understands servers, can accurately analyze what they are doing, debug issues in real time and work proactively to anticipate and prevent problems.
About us and the job.

OnTheGoSystems is a software development company that creates and sells WordPress plugins. Our sites serve over 250,000 clients who regularly log in, get support, download updates and read the content.
 
Our servers run on AWS and are built to handle high load. We run a combination of our own code alongside WordPress and other third-party code. Most of our code is written in PHP, but we also use JavaScript (Node.js) and Python.
 
We use a number of services and technologies from AWS, including CloudWatch, Elastic Load Balancer, EC2, RDS, VPC, Network ACLs, Security Groups, S3, Route53 and CloudFront.
 
You will be joining a talented team of developers and systems engineers who design, build and run our infrastructure. As we grow, we are looking for a passionate SRE who specialises in load analysis, server cost and uptime. You will help us design, develop and analyse our infrastructure, implement monitoring and alerting systems, and perform data analysis that supports capacity planning, continuous improvement and incident response.
 
What responsibilities you will have
- Manage network servers and technology tools.
- Monitor performance and maintain systems according to requirements.
- Troubleshoot issues and outages.
- Ensure security through access controls, backups and firewalls.
- Establish and enforce SRE best practices through platform constraints and high-fidelity system modelling.
- Upgrade systems with new releases and models.
- Build internal technical documentation, manuals and IT policies.
- Take responsibility for automated processes and possibly write a few of your own.
- Join our 24/7 on-call rotation to respond to availability incidents.

What is required for this role
- Expertise in AWS systems administration: in-depth knowledge of monitoring and reporting, as well as deployment and provisioning for highly available web applications.
- Ability to manage a team of IT professionals.
- Strong sense of ownership demonstrated through shipping production-quality code and infrastructure equipped with testing, monitoring and documentation.
- Experience administering Linux servers that run WordPress websites with Nginx.
- Familiarity with monitoring platforms; we use NewRelic.
- Solid knowledge of automating processes and scripting for infrastructure. We use CI/CD with GitLab.
- A good understanding of VPNs, with an emphasis on security, is a plus.

Tools that you must master:
AWS
NewRelic
Git
Terraform
Ansible
Kibana
Bash
Python
MySQL

What we offer: 
This is a 100% remote position. Candidates must be self-motivated, focused and organized to succeed. 
 
- Be part of a team of smart, creative, and like-minded individuals 
- Work on exciting, high-impact projects 
- Learn and improve your skills to grow as a professional 
- Freedom to create and implement innovative ideas 
- Meet and collaborate with team members across the globe 
- Full-time and steady position with national holidays and vacation days 

Most of our development team is located in Europe. We are looking for candidates who can work during Europe, Middle East or Africa working hours.
 
If you’re interested in joining us, please send your application and let’s talk.