Senior Site Reliability Engineer

last updated April 8, 2021 9:11 UTC

HQ: Remote

Full-Time
Full-Stack Programming

Bitnami is at the forefront of innovation that scales up to the largest production clouds and down to laptop development environments. Millions of applications are launched every month with Bitnami technologies.

Our Site Reliability Engineering (SRE) team deploys microservices to clouds using modern practices such as containers, Kubernetes and immutable infrastructure. The SRE team is responsible for the availability and performance of the production infrastructure, and it partners with the other engineering teams to successfully build, deploy and manage Bitnami’s services. We are all about tools and automation, not toil and firefighting. If you enjoy working with the cloud, containers, automation and instrumentation, you should join our mission to bring awesome software to everyone.

You must bring an understanding of the IT business (typically gained by having built or worked extensively with a private or public cloud); a broad perspective of the cloud industry and where it is headed; and experience in building solutions that scale. You will be collaborating with engineers around the world to bring cutting-edge solutions to market. Working with all of the significant cloud providers and container infrastructures will provide you with challenges and opportunities rarely found elsewhere.

Responsibilities

Design and execute our Kubernetes cluster strategy to help our development teams deliver faster and more reliably
Drive adoption of Kubernetes and Kubernetes best practices across the company and industry
Create and/or provision reliable tools and infrastructure that enable rapid iteration amongst the product, research and development teams
Automate our infrastructure following the Infrastructure as Code pattern
Monitor, measure and troubleshoot infrastructure and services
Optimize business continuity capabilities and drive down incident recovery times
Plan and manage capacity
Provide support during office hours
Mentor other members of the team (both inside and outside the SRE team)

Requirements

At least 5 years of experience deploying, monitoring and troubleshooting multi-tier SOA applications and distributed systems at scale
Software development experience with any or all of these programming languages: Ruby, Go, Java, JavaScript, Python
Experience with instrumentation for status and trend monitoring (CloudWatch, Prometheus, Graphite, etc.)
Experience with modern application and system log management (Syslog, Sumo Logic, Fluentd, Loggly, Splunk, etc.)
Container or cloud orchestration experience with at least one scheduler (Kubernetes, Docker Swarm, Mesos, etc.)
Highly developed cloud literacy with strong knowledge of AWS, GCE and Azure
Broad experience with Linux kernel and shell, TCP/IP and HTTP
Experience designing networks and systems for security, encryption, performance and agility
Experience with backup and restoration automation, business continuity planning and testing

Nice to Haves

Database administration experience with MySQL replication and high availability
Knowledge of networking and security best practices for software-defined networks
Experience with big data, streaming and search systems like Cassandra, Hadoop, Spark, Kafka and Elasticsearch

Benefits/Perks

Competitive salary and stock options
Flexible time off policy; we believe everyone needs to recharge
Your choice of operating system and hardware
Annual trips to Spain (if working remotely)
Benefits vary based on location