Site Reliability Engineer - remote

Granicus
Posted 3 years ago
We Work Remotely
About this role:
  • Hiring Manager: Amit Behal, Director, Cloud Operations

  • Salary Range: $100,000 - $120,000 + bonus (starting salary may differ by experience and/or location)

  • Interview process: 5-6 steps that can be done in 2 weeks (calendars permitting)
The following is a profile or persona of who we are looking for. If you have many of the characteristics below, please apply so we can start a conversation and learn more about all your skills.

Site Reliability Engineers (SREs) work to provide customer value by ensuring the service delivery, health, and reliability of Granicus applications. This is achieved by identifying problems and opportunities for improvement and then developing fixes, tooling and automated solutions to address these findings. The engineer will have a work stream that consists of operational support tasks that inform current or future development work to accelerate and automate recurring tasks. The engineer’s time will involve both manual intervention and development of automation and other fixes. The ideal candidate has solid written and verbal communication skills, a strong technical mindset, a knack for solving problems and attention to detail.

They partner with other operational technology teams to resolve infrastructure issues and complete strategic projects, helping ensure Granicus meets established SLAs and maximizes uptime. The engineer is a key member of our On-Call operations staff, handling monitoring and alerting for our applications 24x7x365. You will work with our ticketing application to proactively review, update and resolve issues and work assigned to the team.

You will be expected to use tools and practices including logging, monitoring, event management, notification, runbook automation, and root cause analysis.

You will use your expertise to tune and push our systems beyond their normal limits. You will troubleshoot issues across the entire stack: hardware, software, application and network. You will identify and drive opportunities to improve automation for the company, and scope and create automation for management and visibility of our services.

You will need to spend 50% of your time on and around production support. You will represent the SRE organization in design reviews and operational exercises for new and existing services, and participate in the on-call rotation and periodic conference calls with specialists in other time zones.

Essential Functions:
  • Diagnose and develop solutions for problems related to software, configuration and infrastructure
  • Understand application code, scripts and SQL statements to troubleshoot production issues
  • Develop and support the automation of routine operational activities
  • Escalate urgent problems to On-Call and Incident Manager
  • Provide documentation of processes involved with support duties and contribute to the knowledge database
  • Contribute in writing and in person to root cause analysis meetings as part of process improvement
  • Work closely with peer teams to deliver high availability and optimum performance for customers
  • Tackle complex and varied issues on systems ranging from the archaic to the cutting-edge
  • Keep your assigned application or service up and running, or get it back up and running quickly when a failure occurs
  • Work closely with internal partners and teams to ensure that our infrastructure meets security, SLA, and performance requirements
  • Write, update, and use documentation, including runbooks/playbooks
  • Automate work, including infrastructure needs, testing, failover solutions, failure mitigation, and much more
  • Debug complex problems across the entire stack and create solid solutions
  • Persistently test application and infrastructure resiliency under a variety of error conditions
  • Partner with security engineers to develop plans and automation to respond aggressively and safely to new risks and vulnerabilities

On a given day, you may:
  • Pair with engineers and review code to ensure a service degrades gracefully during expected failure modes
  • Improve our infrastructure-as-code practices to make it easier for engineers to launch well-architected services
  • Run load testing to ensure services meet our performance and capacity expectations
  • Facilitate a blameless learning review
Sounds like a lot? Fortunately, you will be surrounded by a collaborative team with expertise in all these areas.

Security Requirement:

Responsible for Granicus information security by appropriately preserving the Confidentiality, Integrity, and Availability (CIA) of Granicus information assets in accordance with the company's information security program.

Skills & Requirements

Who You Are:
  • System Administrator, Software Engineer or operational/product support engineer in a dynamic 24x7 environment
  • Familiar with multi-tiered web application architecture
  • Background with the following:
      • 2+ years operating a public cloud environment (Azure, etc.)
      • 3+ years of engineering and SRE experience
      • Experience managing Microsoft Azure environments (VMs, NSGs, Resource Groups)
      • Supporting and troubleshooting applications and related issues
      • Experience troubleshooting internet routing issues
      • Writing effective shell or other scripts to automate procedures and workflows
      • Relational database software and concepts, specifically MS SQL
      • Hands-on experience with monitoring tools such as Nagios, New Relic, LogicMonitor, ELK
  • You're hungry to learn new things. You're interested in learning about technologies both new and old, and helping your team learn about those technologies as well.
  • You like solving problems practically. Sometimes it makes sense to build something new. A lot of times it makes sense to make it good enough. Once in a while you just have to leave a comment apologizing to future SREs.
  • You are a proactive communicator. You strive to be articulate and empathetic in your interactions and believe in “working out loud” to share work early and helpfully.

Examples of Likely Performance Metrics
  • Documentation – The creation of useful documentation that can be used by colleagues and peers to resolve incidents and execute changes.
  • Ticket Quality – The creation of Change and Incident tickets that accurately and completely document the body of work required to resolve incidents and execute changes.
  • On Call Coverage – The management of Alert-based Incidents generated during a given week, and the complete documentation of these Alerts, Incidents and Problems.
  • Customer Satisfaction – The overall satisfaction of internal and external customers with the quality of completed work.
  • Problem Management and Incident Review Process – Participation in the Incident Review and Problem Management process, and the completion of assigned tasks.
  • Communication – The ability to successfully communicate ideas, Incidents, Changes and other operational elements both in writing and verbally.