About the job
Groupon is a vibrant marketplace that connects customers with new experiences and services every day and empowers local businesses to thrive. With over a million merchant partners worldwide, we have connected more than 16 million customers to deals across various categories. In a market often dominated by e-commerce giants, we take pride in being one of the few platforms dedicated to driving the success of local businesses on a performance basis.
Groupon is undergoing a transformative journey, fueled by our relentless pursuit of results. Although we have thousands of employees across multiple continents, we foster a culture that inspires innovation, encourages risk-taking, and celebrates success. Here, your impact can be immediate, building on our scale and the speed of our transformation. We embody the "best of both worlds" ethos: large enough to provide resources and scale, yet small enough for individuals to have significant autonomy and make a meaningful impact.
Role Overview:
Are you ready to elevate your expertise and make a significant impact on the reliability and scalability of mission-critical systems? As a Lead/Principal Site Reliability Engineer (SRE Level V), you will be pivotal in ensuring the performance, availability, and resilience of our platforms. This role transcends mere system maintenance; you will spearhead initiatives that redefine operational excellence. Collaborating with diverse teams, you will implement cutting-edge technologies and best practices, nurturing a culture of reliability while mentoring fellow engineers. This is a unique opportunity for individuals passionate about solving complex challenges and shaping the future of platform reliability in a high-impact role.
Your Responsibilities:
- Design and maintain fault-tolerant systems that meet uptime SLAs of 99.9% or higher.
- Lead the automation of infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools.
- Develop and optimize CI/CD pipelines to guarantee reliable, secure, and efficient software delivery.
- Create and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems utilizing Prometheus, Grafana, and the ELK stack.
- Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs.
- Lead incident response during on-call rotations, driving rapid resolution and thorough root cause analysis for critical issues.
- Design and execute performance testing, capacity planning, and scalability strategies for evolving workloads.

