About the job
About the Team
The Scaling team at OpenAI is dedicated to designing, constructing, and managing essential infrastructure that propels research forward. Our mission is straightforward: to expedite the advancement of research toward Artificial General Intelligence (AGI). We achieve this by developing foundational systems that our researchers depend on, which range from fundamental infrastructure components to tailored applications for research. These systems are designed to scale with the growing complexity and volume of our workloads while maintaining reliability and user-friendliness.
About the Role
We are looking for a skilled Site Reliability Engineer to take end-to-end ownership of our production-critical infrastructure. This role focuses on managing data-intensive, low-latency workloads, particularly large-scale ClickHouse clusters, high-throughput Kafka pipelines, and dependable integrations with Snowflake. You will turn ambiguous operational challenges into actionable plans, deliver practical solutions quickly, and refine them iteratively based on production feedback.
The ideal candidate can independently establish and raise operational standards across teams while remaining hands-on with production systems.
Key Responsibilities
Own the full infrastructure lifecycle, including provisioning, upgrades, scaling, and decommissioning, using an Infrastructure as Code (IaC) approach.
Manage and scale ClickHouse clusters, focusing on sharding, replication, capacity planning, performance tuning, and maintenance.
Operate Kafka as the data ingestion backbone, improving throughput, lag management, backpressure handling, and failure recovery.
Reduce end-to-end latency and improve reliability for data-heavy serving and querying workloads.
Develop and sustain robust monitoring and alerting systems: SLIs/SLOs, dashboards, alert policies, and actionable runbooks.
Establish, implement, and continuously refine incident response protocols, on-call practices, and postmortem reviews.
Manage backup/restore and disaster recovery strategies, including regular recovery drills.
Plan and execute safe rollouts across various environments (development, staging, production), including canary deployments and rollback strategies.
Collaborate daily with software engineers to embed reliability within design, implementation, and release processes.
Set the benchmark for operational readiness and runbook standards, driving their adoption across teams.
Improve CI/CD pipelines and the developer experience to make releases faster and safer.
