POSITION SUMMARY
We are looking for a motivated and talented Site Reliability Engineer to join our remote European team and help us monitor, develop, and scale the Cordial platform. Our goal is to provide our clients with a delightful experience in their day-to-day interaction with the platform and to build trust that expected jobs and background processes will run without issue. You will work with our DevOps and Product teams to ensure that bugs are squashed, performance is optimized, and blind spots are revealed through comprehensive monitoring. This position is fully remote; Cordial has no physical office in Portugal.
YOU WILL
Use your knowledge of Web, App, Network, Server, Storage and Security technologies to administer, monitor and troubleshoot application and network components in our cloud-based environment
Actively contribute to Infrastructure Design and Implementation discussions
Provide production support for the Product Development teams
Participate in an on-call rotation
Work with the team to develop and deploy monitoring and alerting architecture, and implement monitoring/logging solutions
Troubleshoot complex issues promptly to maintain the performance and stability of our Production Application environment
Help build out SLOs and document and monitor SLAs
ABOUT YOU
3+ years of UNIX/Linux systems and network administration experience (DNS, IPsec, VPN, load balancing, process tracing)
Experience with AWS (we use EC2, EKS)
Experience with monitoring, logging and alerting tools
Previous experience in an SRE and/or DevOps role
Software development experience
Experience with Docker/containers & Kubernetes
Comfortable working in a globally distributed team across time zones
Strong teamwork and communication skills
A genuine desire to learn new technologies and grow
Fluent in verbal and written English
BONUS
Experience with MongoDB
Experience deploying and/or maintaining Kubernetes/EKS clusters
Experience with Prometheus/Grafana/Datadog
Experience implementing SLOs, reliability targets, error budgets
