About the job
Join Salla as a Senior Site Reliability Engineer (SRE). You will lead initiatives that improve system reliability, manage complex incident response, optimize platform performance, and mentor engineering teams in building resilient systems. The role also includes participating in our on-call rotation to uphold our commitment to platform reliability.
Key Responsibilities
Incident Response
- Lead the response to high-severity incidents and facilitate post-incident analyses.
- Troubleshoot intricate issues spanning applications, infrastructure, and networks.
- Reduce Mean Time to Recovery (MTTR) through improved monitoring, alerting, and diagnostic tools.
- Engage in the on-call rotation to support our production systems.
Performance & Scalability
- Identify and address performance bottlenecks and scaling obstacles.
- Conduct load testing and strategic capacity planning for high-traffic scenarios.
Infrastructure & Operations
- Improve cloud-native infrastructure, deployment practices, and automation.
- Strengthen resilience, fault tolerance, and recovery mechanisms across systems.
Observability
- Create and enhance dashboards, alerts, metrics, logs, and traces.
- Establish Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to improve system visibility.
Tooling & Automation
- Build tools that reduce operational toil and improve reliability.
- Contribute to infrastructure-as-code practices, CI/CD pipelines, and GitOps workflows.
Collaboration
- Work closely with engineering teams to ensure services are robust and production-ready.
- Mentor engineers in reliability, troubleshooting, and operational best practices.
Bonus Skills
- Experience with large-scale, high-traffic systems.
- Familiarity with fault-tolerant design, disaster recovery (DR), and high availability (HA) patterns.
- Knowledge of SLIs, SLOs, and error budgets.
Location Preference
- We prefer candidates located in time zones between GMT+0 and GMT+6 to facilitate team collaboration and on-call coverage.
Requirements
- Extensive experience with Kubernetes, service mesh technologies, and cloud platforms (AWS, GCP, or Azure).
- In-depth knowledge of Linux, networking, distributed systems, and load balancing.
- Practical experience with Terraform or similar Infrastructure-as-Code tools.
- Proficiency with observability platforms such as Prometheus, Grafana, Loki, Mimir, or equivalent.
- Strong skills in scripting or programming languages such as Python, Go, or Java.

