Senior Site Reliability Engineer (SRE) - remote

Posted 3 years ago
Stack Overflow

Cribl Inc. is looking for a Senior Site Reliability Engineer (SRE). We are a fast-growing, remote-first company with a mission to unlock the value of all observability data. At our core, we believe in shipping phenomenal products and doing good by our customers and communities. We provide our customers with a new and unprecedented level of observability, intelligence, and control over their real-time data. We're backed by Sequoia and CRV, and our products are deployed in some of the largest organizations in the world, processing hundreds of terabytes and petabytes of IT & Security data, and managed by Site Reliability Engineers, System Engineers, and Technical Operations teams.

Responsibilities:

  • You will work with engineers to ensure that the designed solution responds to non-functional requirements such as availability, performance, security, and maintainability.
  • Improve the reliability of our systems by working with engineers to ensure that the software delivery pipeline is as efficient as possible.
  • Mentor our engineers to achieve more than they thought possible. You enjoy making other teams successful and are fulfilled through the success of others.
  • You will write and update documentation, including runbooks and playbooks.
  • You will automate work including infrastructure needs, testing, failover solutions, failure mitigation, and much more.
  • You will debug complex problems across an entire stack and create solid solutions.

Minimum Requirements:

  • 7+ years of experience with software engineering, software development, or system operations
  • Experience building and operating large-scale production systems
  • Knowledge of container technologies, Python, Go, Java/JS/TS, and source control (Git, GitHub)
  • Experience working with container deployment and orchestration technologies with knowledge of fundamentals including service discovery, deployments, monitoring, scheduling, and load balancing.
  • Understanding of systems programming (network stack, file system, OS services) and networking (L2 vs. L3, network architecture, VLANs)
  • Experience identifying performance bottlenecks, identifying anomalous system behavior, and resolving the root cause of service issues.
  • You have the skills to work across teams and functions to influence the design, operations, and deployment of available software.

Bonus Points/Preferred Skills:

  • Experience with development and deployment in a hosted cloud environment, preferably AWS & GCP.
  • Experience with running containerized environments and understanding of multi-tenancy and security implications.
  • Experience with optimized and scalable software that operates on a large number of nodes.
  • Experience with monitoring and observability tools and applications, such as Splunk, Datadog, or Elasticsearch.
  • Experience automating infrastructure, testing, and deployments using tools like CloudFormation, Chef, or Terraform, and the ability to explain the Infrastructure as Code paradigm.

What we offer:

  • Competitive Salary
  • Stock Options
  • Medical, dental, and vision insurance
  • Flexible spending account (FSA)
  • 401(k) plan
  • Parental Leave
  • Professional Development and Career Growth
  • Generous Vacation and Holiday Policy, including 2 Floating Holidays to use for holidays you observe
  • Social Responsibility Employee Group that reflects our value-driven company culture

Diversity drives innovation, enables better decisions to support our customers, and inspires change for the better. We’re building a culture where differences are valued and welcomed. We work together to bring out the best in each other. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other applicable legally protected characteristics in the location in which the candidate is applying.