About the job
The Site Reliability Engineering (SRE) team at Pendo plays a crucial role in provisioning and overseeing cloud infrastructure throughout the development and production lifecycle for all product initiatives. We collaborate closely with developers and product managers to guarantee that our products are not only reliable and high-performing but also cost-efficient. Our platform leverages Google Kubernetes Engine (GKE) alongside various Google technologies including Memorystore, Cloud Datastore, Pub/Sub, Cloud Functions, BigQuery, and Vertex AI, in addition to third-party services such as Amazon SES.
In the development phase, SREs ensure that developers have stable and efficient continuous integration and release pipelines, as well as development environments enabling swift delivery of new features. In production, SREs handle Tier 1 on-call duties and incident management, supporting a high-throughput platform that processes over 35 billion events daily. To maintain reliability for our customers, SREs work in tandem with developers and product managers to define service level objectives, analyze failure scenarios, and design systems that effectively balance cost with reliability. Additionally, SREs partner with the Information Security team to secure our cloud infrastructure, ensuring compliance with industry standards like SOC 2.
Key Responsibilities
- Develop high-quality infrastructure-as-code that automates the provisioning, deployment, scaling, and monitoring of Pendo’s infrastructure to ensure reliability and performance.
- Create maintainable code focused on product functionality, with an emphasis on operations, scalability, resilience, and monitoring.
- Collaborate with fellow engineers to ensure that new services are well-designed, properly monitored, and accompanied by clear SLIs and achievable SLOs.
- Troubleshoot production issues, quickly identify mitigation strategies, and implement preventive measures.
- Maintain runbooks for manual tasks and automate those tasks wherever feasible.
- Proactively monitor our capacity, quotas, and other performance limits to plan for growth effectively.
- Participate in a 24/7 on-call rotation to manage product availability issues and urgent customer support escalations.

