Site Reliability Engineers at UKG Technology and Innovation are hybrid software/system engineers who have a breadth of knowledge encompassing all aspects of service delivery. They develop software solutions to enhance, harden, and support our service delivery processes. This can include building and managing CI/CD deployment pipelines, automated testing, capacity planning, performance analysis, monitoring, alerting, chaos engineering, and auto-remediation. Site Reliability Engineers must have a passion for learning and evolving with current technology trends. They strive to innovate and are relentless in their pursuit of a flawless customer experience. They have an “automate everything” mindset, helping our company deploy services with incredible speed, consistency, and availability.
Primary/Essential Duties and Key Responsibilities:
- Engage in and improve the whole lifecycle of services, from conception to inception, including system design consulting and capacity planning
- Define and implement standards and best practices related to system architecture, deployment, metrics, and operational tasks
- Support services through activities such as monitoring availability, system health, and incident response
- Improve system performance, application delivery, and efficiency through automation, process refinement, postmortem reviews, and in-depth configuration analysis
- Engage in communications across all areas of the organization
Required Qualifications:
- Engineering degree, or a related technical discipline, or equivalent work experience
- Experience with cloud-based applications
- Experience with containerization technologies
- Experience with Microsoft and Linux technologies
- Experience with VMware or other virtual server software
- Experience coding in higher-level languages (e.g., Python, JavaScript, C++, or Java)
- Experience in configuration and maintenance of applications such as web servers, load balancers, relational databases, storage systems and messaging systems
- Experience with MongoDB, MySQL, Elasticsearch, RabbitMQ, or similar technologies
- Experience with operating systems and TCP/IP network fundamentals
- Experience learning software, frameworks and APIs
- Ability and willingness to work evenings or nights on occasion
- Ability to lead and contribute to projects
- Experience as a Site Reliability Engineer, Production Engineer, or equivalent
- Experience with distributed system design and architecture
- Experience building and managing CI/CD pipelines
- Experience with public or private cloud platforms (e.g., GCP, Kubernetes, or OpenStack)
- Experience with production-level monitoring and alerting using tools such as Prometheus, Grafana, and Datadog
Travel Requirements:
- 0-5%