About the job
Location: NYC Global HQ (Hybrid: 3 days in office)
DoubleVerify delivers digital performance solutions for advertisers and agencies, enabling independent verification, campaign optimization, and measurement of business impact. Since 2008, DV has partnered with Fortune 500 brands, agencies, publishers, and digital ad platforms to bring greater transparency and improved outcomes to digital advertising. More details are available at www.doubleverify.com.
Role overview
The Senior Site Reliability Engineer I will focus on strengthening the reliability, scalability, and performance of DoubleVerify's digital media measurement platforms. This hybrid position is based at the NYC Global HQ, with an expectation of three days per week in the office.
What you will do
- Enhance reliability, scalability, and performance for digital media measurement systems.
- Establish and refine observability practices, including setting up metrics, dashboards, and alerting to enable proactive reliability improvements.
- Reduce Mean Time to Recovery (MTTR) for critical incidents by automating processes, improving observability, and advancing monitoring capabilities.
- Lead incident response for high-severity (Sev1 and Sev2) events and drive them to resolution.
- Maintain high availability across infrastructure and services in GCP, AWS, OCI, and on-premises environments.
- Guide technical projects from planning through deployment, collaborating with teams and keeping stakeholders informed.
- Design and deploy automation, including deployment workflows, validation scripts, and self-service tooling, to reduce manual work and improve efficiency.
- Use AI-assisted development tools to speed up automation and troubleshooting. Build integrations and Model Context Protocol (MCP) servers that connect monitoring platforms to AI-driven analysis.
- Apply Infrastructure-as-Code practices using Terraform, Helm charts, Python scripts, and configuration management tools for consistent, version-controlled deployments.
- Develop and maintain documentation, runbooks, and Standard Operating Procedures (SOPs) in Confluence to support consistent incident response.
