Site Reliability Engineer - Infrastructure for Analytics Platform

OpenAI, San Francisco
On-site, Full-time


Experience Level

Mid to Senior

Qualifications

We are looking for candidates who possess a strong background in site reliability engineering or a related field, with significant experience in managing data-heavy applications. Familiarity with ClickHouse and Kafka is essential, as is a solid understanding of cloud infrastructure and automation tools. Ideal candidates will have:

  • Proven expertise in managing large-scale distributed systems.

  • Experience with Infrastructure as Code (IaC) practices.

  • Strong problem-solving skills and the ability to work independently.

  • Excellent communication skills, both verbal and written.

  • A passion for optimizing performance and reliability in production environments.

About the job

About the Team

The Scaling team at OpenAI is dedicated to designing, constructing, and managing essential infrastructure that propels research forward. Our mission is straightforward: to expedite the advancement of research toward Artificial General Intelligence (AGI). We achieve this by developing foundational systems that our researchers depend on, which range from fundamental infrastructure components to tailored applications for research. These systems are designed to scale with the growing complexity and volume of our workloads while maintaining reliability and user-friendliness.

About the Role

We are looking for a skilled Site Reliability Engineer to take end-to-end ownership of our production-critical infrastructure. This role focuses on managing data-intensive, low-latency workloads, particularly large-scale ClickHouse clusters, high-throughput Kafka pipelines, and dependable integrations with Snowflake. You will transform unclear operational challenges into actionable plans, deliver practical solutions swiftly, and refine them based on production feedback.

The ideal candidate will have the ability to independently establish and elevate operational standards across teams while remaining actively engaged with production systems.

Key Responsibilities

  • Oversee the lifecycle management of infrastructure, including provisioning, upgrades, scaling, and decommissioning with an Infrastructure as Code (IaC) approach.

  • Manage and scale ClickHouse clusters, focusing on sharding, replication, capacity planning, performance tuning, and maintenance.

  • Operate Kafka as the data ingestion backbone, improving throughput, lag management, backpressure handling, and failure recovery.

  • Reduce end-to-end latency and improve reliability for data-heavy serving and querying workloads.

  • Develop and sustain robust monitoring and alerting systems: SLIs/SLOs, dashboards, alert policies, and actionable runbooks.

  • Establish, implement, and continuously refine incident response protocols, on-call practices, and postmortem evaluations.

  • Manage backup/restore and disaster recovery strategies, including regular recovery drills.

  • Plan and execute safe rollouts across various environments (development, staging, production), including canary deployments and rollback strategies.

  • Collaborate daily with software engineers to embed reliability within design, implementation, and release processes.

  • Set the benchmark for operational readiness and runbook standards, driving their adoption across teams.

  • Improve CI/CD pipelines and the developer experience for greater speed and safety.

About OpenAI

OpenAI is at the forefront of artificial intelligence research and development. Our commitment to creating safe and beneficial AI technologies drives our innovative approaches and solutions. We empower researchers and engineers to push the boundaries of what is possible, fostering a collaborative environment that prioritizes ethical AI advancement.
