Being a Head of Reliability at balena
As a Head of Reliability, you will work with a team of SREs to ensure our services are available, resilient, and efficient. You will take an “Infrastructure as Product” approach towards enabling self-service for our developers and optimizing the experience for our end-users.
You will learn how our complex interdependent systems are built and run. You will review architecture for new features, refine designs, facilitate frictionless deployments to production, monitor availability, manage outages, and hold retrospectives. As you grow in the role, you will be empowered to implement innovative solutions for automating and streamlining the operation of the infrastructure powering the “balena fleet” and influence strategic decisions impacting the direction of our platform and company.
Responsibilities
Identify bottlenecks in services and failure patterns in production, and develop automated solutions to streamline operations
Define high-quality metrics for our infrastructure and continuously drive their improvement
Implement monitoring systems to collect health data, set error alerts, and increase app behavior visibility
Own the incident response process and leverage postmortem learnings to prevent similar future issues
Support balena developers with seamless, fault-tolerant deployments and production debugging
Conduct load tests to ensure applications are ready to handle projected traffic
Participate in the on-call rotation and be a key resource for peers on support
Requirements
Strong technical background in software development, infrastructure and/or platform operations
Experience working with Docker containers and running production-grade Kubernetes clusters
Knowledge of modern software practices, such as instrumentation of applications for observability
Ability to manage ambiguity, push through friction, and independently make critical trade-off decisions
Drive to make yourself and others more effective through documentation and automation
Willingness to constantly build on your knowledge of the balena platform and new technologies
Excellent communication skills and fluency in English
Bonus points
Proficiency in at least one high-level scripting language (like TypeScript or JavaScript)
Familiarity with distributed systems, server load balancing, and high-availability architectures
Experience with cloud automation, APM, and log management (we use Grafana, Prometheus, and Loki)
Good understanding of networking protocols (TCP/IP, HTTP, TLS), common failures, and mitigations
Background in leading teams and working across functions to build robust products
Experience with IoT, embedded software, dev tools, or the balena platform as a user/contributor
Contributions to OSS projects and community involvement
Make sure to let us know if any of these items apply to you! If possible, please also share a sample of your work or examples of projects (URL or attachment).
