Senior Site Reliability Engineer

last updated April 8, 2021 9:11 UTC

Bitnami

HQ: Remote

Bitnami is at the forefront of innovation that scales up to the largest production clouds and down to laptop development environments. Millions of applications are launched every month with Bitnami technologies.

Our Site Reliability Engineering (SRE) team deploys microservices to clouds leveraging modern practices such as containers, Kubernetes and immutable infrastructure. The SRE team is responsible for the availability and performance of the production infrastructure as well as partnering with the other engineering teams to successfully build, deploy and manage Bitnami’s services. We are all about tools and automation, not toil and firefighting. If you enjoy working with the cloud, containers, automation and instrumentation, you should join our mission to bring awesome software to everyone.

You must bring an understanding of the IT business (typically gained by having built or worked extensively with a private or public cloud); a broad perspective of the cloud industry and where it is headed; and experience in building solutions that scale. You will be collaborating with engineers around the world to bring cutting-edge solutions to market. Working with all of the significant cloud providers and container infrastructures will provide you with challenges and opportunities rarely found elsewhere.

Responsibilities

  • Design and execute our Kubernetes cluster strategy to help our development teams deliver faster and more reliably

  • Drive adoption of Kubernetes and Kubernetes best practices across the company and industry

  • Create and/or provision reliable tools and infrastructure that enable rapid iteration amongst the product, research and development teams

  • Automate our infrastructure following the Infrastructure as Code pattern

  • Monitor, measure and troubleshoot infrastructure and services

  • Optimize business continuity capabilities and drive down incident recovery times

  • Plan and manage capacity

  • Provide support during office hours

  • Mentor other members of the team (both inside and outside the SRE team)

Requirements

  • At least 5 years of experience deploying, monitoring and troubleshooting multi-tier SOA applications and distributed systems at scale

  • Software development experience in any or all of these programming languages: Ruby, Go, Java, JavaScript, Python

  • Experience with instrumentation for status and trend monitoring (CloudWatch, Prometheus, Graphite, etc.)

  • Experience with modern application system log management (Syslog, SumoLogic, Fluentd, Loggly, Splunk, etc.)

  • Container or cloud orchestration experience with at least one scheduler (Kubernetes, Docker Swarm, Mesos, etc.)

  • Highly developed cloud literacy with strong knowledge of AWS, GCE and Azure

  • Broad experience with the Linux kernel and shell, TCP/IP and HTTP

  • Designing networks and systems for security, encryption, performance and agility

  • Backup and restoration automation, business continuity planning and testing

Nice to Haves

  • Database administration experience with MySQL replication and high availability

  • Knowledge of networking and security best practices with software-defined networks

  • Experience with big data, streaming and search systems like Cassandra, Hadoop, Spark, Kafka and Elasticsearch

Benefits/Perks

  • Competitive salary and stock options

  • Flexible time off policy; we believe everyone needs to recharge

  • Your choice of operating system and hardware

  • Annual trips to Spain (if working remotely)

  • Benefits vary based on location
