Fleet Reliability Engineer (Remote)

Posted 3 years ago
Stack Overflow

What you will do

The ‘balena fleet’ is ever-growing and heterogeneous, with hundreds of thousands of devices of different types and architectures distributed across the globe. The mission of our Fleet Reliability Engineers is to enable our users to safely deploy, monitor, and manage the health of their IoT devices and ensure they continue to scale their own fleets and succeed with balena.

As a member of the team, you will be at the cutting edge of support-driven development. You will be part operator and part product engineer. You will investigate issues, assist users with solving immediate problems, and work at all levels of the stack to help us build compatibility between previous and new versions of our components and sustainably scale both the devices connected to our backend and the backend itself.

You will also develop solutions to high-impact, high-complexity challenges affecting the entire meta-fleet and contribute to the platform roadmap with data from the field. On-device metrics, monitoring, data visualization, and debugging are all common territory. Examples of past projects include balenahup (our solution for managing host OS updates) and configizer (a solution for safely adjusting on-device configuration remotely).

Responsibilities

  • Identify user needs and patterns in feedback and understand the root causes of friction while keeping a global view of all of our customers' fleets
  • Lead the shift away from reactive support to preventative maintenance by making existing tools more robust and scalable and building new ones
  • Help brainstorm long-term solutions and own the implementation of new features and products for balena fleet owners including development, testing, deployment, and maintenance
  • Contribute to documentation and user-facing guides for your implementations
  • Be a source of advice for peers, learning and teaching how to best help users and customers monitor and debug their fleets of devices
  • Participate in customer support: educate balena users on best practices for going to production and for scaling and managing their fleets

Requirements

  • Background in software development, infrastructure, and/or system operations
  • Experience writing high-quality, production-ready code and debugging complex issues
  • Working knowledge of Linux operating system internals and scripting
  • Ability to manage ambiguity, make critical trade-off decisions, and push projects to completion
  • Continuous improvement mindset and a desire to make yourself and others more effective
  • Willingness to constantly build on your product knowledge (through projects, tutorials, support shifts, etc.)
  • Excellent verbal and written communication skills, and fluency in English

Bonus points

  • Firm grasp of technologies like TypeScript, Node.js, Bash, Go, and Docker
  • Strong understanding of networking concepts (load balancers, routers, etc.)
  • Experience developing internal tooling and automation
  • Familiarity with IoT, embedded computing, or the balena platform as a user/contributor
  • Contributions to OSS projects and community involvement
  • Background in leading projects and working across functions to build reliable products

Make sure to let us know if any of these items apply to you!