Senior Site Reliability Engineer (SRE)
Position Overview
The Senior Site Reliability Engineer (SRE) will play a vital role within our Site Reliability Engineering Center of Excellence (CoE). This position calls for a proactive engineer who is adept at developing monitoring and observability solutions, diagnosing production issues, and participating in 24/7 on-call operations.
This role emphasizes applying reliability practices, deploying observability tools, and reducing Mean Time to Recovery (MTTR) and Mean Time to Detection (MTTD) through automation. The SRE will work closely with Principal and Senior Staff SREs, adopting best practices and frameworks established by the CoE while directly contributing to the organization’s reliability objectives. This position reports to the Senior Manager of SRE.
Key Responsibilities
Execution & CoE Alignment
- Implement SRE frameworks, best practices, and playbooks provided by the CoE.
- Act as a hands-on engineer, contributing to observability, reliability, and incident response initiatives.
- Collaborate with senior SREs and leadership to maintain consistency in monitoring and incident processes.
- Engage in automation projects that improve reliability and reduce manual intervention.
Observability & Monitoring
- Develop and maintain monitoring solutions using tools such as New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, and Graylog.
- Design and optimize dashboards, metrics, and alerts for proactive anomaly detection.
- Broaden observability coverage across infrastructure, applications, APIs, and databases.
Reliability Engineering & Automation
- Establish Service Level Indicators (SLIs), Service Level Objectives (SLOs), Service Level Agreements (SLAs), and error budgets in collaboration with product and platform teams.
- Contribute to reducing MTTD and MTTR through improved instrumentation and automation.
- Participate in capacity planning, resiliency testing, and scaling reviews.
- Support chaos engineering and reliability validation activities.
Incident & Problem Management
- Engage in incident response, including on-call rotations for 24/7 coverage.
- Assist with root cause analysis (RCA) and implement corrective actions.
- Ensure alignment with ITSM processes for incident, problem, and change management.
