Site Reliability Engineer - Infrastructure for Analytics Platform

OpenAI · San Francisco

On-site Full-time

Experience Level

Mid to Senior

Qualifications

We are looking for candidates who possess a strong background in site reliability engineering or a related field, with significant experience in managing data-heavy applications. Familiarity with ClickHouse and Kafka is essential, as is a solid understanding of cloud infrastructure and automation tools. Ideal candidates will have:

Proven expertise in managing large-scale distributed systems.
Experience with Infrastructure as Code (IaC) practices.
Strong problem-solving skills and the ability to work independently.
Excellent communication skills, both verbal and written.
A passion for optimizing performance and reliability in production environments.

About the job

About the Team

The Scaling team at OpenAI is dedicated to designing, constructing, and managing essential infrastructure that propels research forward. Our mission is straightforward: to expedite the advancement of research toward Artificial General Intelligence (AGI). We achieve this by developing foundational systems that our researchers depend on, which range from fundamental infrastructure components to tailored applications for research. These systems are designed to scale with the growing complexity and volume of our workloads while maintaining reliability and user-friendliness.

About the Role

We are in search of a skilled Site Reliability Engineer to take ownership of our production-critical infrastructure from start to finish. This role focuses on managing data-intensive, low-latency workloads, particularly involving large-scale ClickHouse clusters, high-throughput Kafka pipelines, and dependable integrations with Snowflake. You will transform unclear operational challenges into actionable plans, deliver practical solutions swiftly, and refine them based on production feedback and iterations.

The ideal candidate will have the ability to independently establish and elevate operational standards across teams while remaining actively engaged with production systems.

Key Responsibilities

Oversee the lifecycle management of infrastructure, including provisioning, upgrades, scaling, and decommissioning with an Infrastructure as Code (IaC) approach.
Manage and scale ClickHouse clusters, focusing on sharding, replication, capacity planning, performance tuning, and maintenance.
Operate Kafka as the data ingestion backbone, enhancing throughput, lag management, backpressure handling, and failure recovery.
Enhance end-to-end latency and reliability for data-heavy serving and querying workloads.
Develop and sustain robust monitoring and alerting systems: SLIs/SLOs, dashboards, alert policies, and actionable runbooks.
Establish, implement, and continuously refine incident response protocols, on-call practices, and postmortem evaluations.
Manage backup/restore and disaster recovery strategies, including regular recovery drills.
Plan and execute safe rollouts across various environments (development, staging, production), including canary deployments and rollback strategies.
Collaborate daily with software engineers to embed reliability within design, implementation, and release processes.
Set the benchmark for operational readiness and runbook standards, driving their adoption across teams.
Enhance CI/CD pipelines and developer experience for improved speed and safety.

About OpenAI

OpenAI is at the forefront of artificial intelligence research and development. Our commitment to creating safe and beneficial AI technologies drives our innovative approaches and solutions. We empower researchers and engineers to push the boundaries of what is possible, fostering a collaborative environment that prioritizes ethical AI advancement.

Similar jobs

Site Reliability Engineer - AI Infrastructure

Andromeda Cluster

Full-time|Remote|Global Remote / San Francisco, CA

Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster, established by Nat Friedman and Daniel Gross, aims to democratize access to advanced AI infrastructure for early-stage startups, previously exclusive to hyperscalers. Our journey began with a single managed cluster that quickly re…

Nov 6, 2025

Senior Site Reliability Engineer for AI Infrastructure

Andromeda

Full-time|Remote|Global Remote / San Francisco, CA

Join Andromeda as a Senior Site Reliability Engineer specializing in AI Infrastructure. In this pivotal role, you will be responsible for ensuring the reliability, scalability, and performance of our cutting-edge AI systems. Collaborate with cross-functional teams to design and implement robust infrastructure solutions that support our innovative AI initiatives. Your expertise will play a crucial role in maintaining optimal service availability and improving system performance.

Apr 9, 2026

Product Infrastructure Engineer - Site Reliability

Zyphra

Full-time|On-site|San Francisco

Zyphra is a cutting-edge artificial intelligence firm located in the heart of San Francisco, California.

The Opportunity:

As a Product Infrastructure Engineer specializing in Site Reliability, your primary focus will be on architecting and sustaining the frameworks that ensure Zyphra's infrastructure remains strong, observable, secure, and scalable. Your contributions will be pivotal in guaranteeing the dependability and reproducibility of machine learning workloads, managing deployment safety, and ensuring the long-term viability of our computational environments.

Your Responsibilities:

Enhancing and developing observability systems (monitoring, logging, alerting)
Creating resilient build and deployment systems across both research and production settings
Establishing secure release protocols with comprehensive audit trails and rollback capabilities
Collaborating closely with ML engineers, DevOps, and infrastructure teams to optimize system reliability and performance
Leading incident response efforts, conducting root-cause analysis, and facilitating postmortems with a strong emphasis on learning and prevention

This position is perfect for individuals who are passionate about creating systems that empower other teams to be faster, safer, and more efficient.

Qualifications:

Proven experience in high-performance computing environments, such as machine learning clusters or GPU farms
Strong background in infrastructure as code tools (e.g., Ansible, Terraform)
Familiarity with software release engineering tailored for ML/AI systems is advantageous
Experience in designing reliable environments for experimental workloads and reproducible executions
Understanding of compliance and auditing standards related to deployment and system security
Experience with load testing, fault injection, and chaos engineering to strengthen systems under pressure
A passion for developing tools that render infrastructure seamless and reliable for end users

Preferred Qualifications:

Experience with infrastructure as code (e.g., Ansible, Terraform)
Previous experience supporting ML/AI infrastructure, including GPU management and workload optimization
Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)

Aug 22, 2025

Site Reliability Engineer (SRE)

Baseten

Full-time|On-site|San Francisco Office

ABOUT BASETEN

Baseten is at the forefront of powering mission-critical AI inference for some of the most innovative companies globally, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. We integrate cutting-edge applied AI research with a flexible infrastructure and intuitive developer tools to empower companies at the leading edge of AI to deploy sophisticated models effectively. With our recent $300M Series E funding round, supported by prominent investors such as BOND, IVP, Spark Capital, Greylock, and Conviction, we are rapidly expanding. Join our dynamic team and contribute to creating an essential platform for engineers to launch AI products with ease.

THE ROLE

As a Site Reliability Engineer, you will design and implement resilient systems and processes that ensure our infrastructure is scalable, reliable, and efficient. Your responsibilities will encompass everything from automating deployments and monitoring systems to enhancing performance and managing incidents effectively.

Collaboration is key; you will work closely with our users to understand their challenges in operationalizing machine learning, facilitating their onboarding onto our platform, and leveraging these insights to inform improvements to Baseten.

EXAMPLE INITIATIVES

As part of our Infrastructure team, you will engage in exciting projects such as:

Innovative multi-cloud capacity management
Optimizing inference on B200 GPUs
Implementing multi-node inference
Utilizing fractional H100 GPUs for efficient model serving

RESPONSIBILITIES

Design and maintain scalable infrastructures to support the deployment and operational needs of machine learning models.
Establish standards and best practices to enhance reliability and performance across the infrastructure.
Proactively identify and resolve reliability issues using monitoring and alerting systems.
Collaborate with cross-functional teams to apply best practices in infrastructure management and incident response.
Create automation scripts to streamline processes and reduce manual intervention.

Oct 9, 2025

Senior Site Reliability Engineer - Infrastructure Security

MongoDB, Inc.

Full-time|On-site|Austin; San Francisco; Seattle; United States

Join MongoDB as a Senior Site Reliability Engineer specializing in Infrastructure Security. In this pivotal role, you'll be at the forefront of ensuring the reliability and security of our cloud infrastructure. Your expertise will help us to design and maintain systems that are robust, efficient, and secure, providing critical support to our engineering teams.

Your responsibilities will include monitoring system performance, implementing security protocols, and troubleshooting incidents to maintain high availability. You will collaborate with cross-functional teams to enhance our security posture, ensuring that our services are resilient and secure.

Mar 26, 2026

Software Engineer, Site Reliability (SRE)

Sierra

Full-time|On-site|San Francisco, CA

About Us

At Sierra, we are pioneering a transformative platform that empowers businesses to forge authentic customer experiences through AI technology. Headquartered in the vibrant city of San Francisco, we also boast a dynamic presence in Atlanta, New York, London, France, Singapore, and Japan.

Our operations are anchored in core values that shape our culture: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These principles guide our actions and are integral to our mission.

Our visionary founders, Bret Taylor and Clay Bavor, bring unparalleled expertise. Bret, currently the Board Chair of OpenAI, previously co-led Salesforce and served as CTO at Facebook, while Clay led numerous initiatives at Google, including AR/VR projects and Google Workspace.

Your Role

In your capacity as a Software Engineer on the Site Reliability team, you will play a crucial role in establishing and enhancing the reliability, observability, and scalability of Sierra’s AI-centric infrastructure. Collaborating closely with our engineering and product teams, your goal is to ensure our systems remain highly available, efficient, and primed for growth.

Lead the development of Sierra’s observability stack (including monitoring, alerting, logging, and tracing) to provide engineers with critical insights into system health and performance.
Collaborate with product and platform engineers to architect systems that prioritize reliability and scalability from the outset, not as an afterthought.
Design and implement robust, scalable, and secure cloud infrastructure on AWS, employing Terraform and cutting-edge DevOps tools.
Enhance the reliability and scalability of our LLM deployments, ensuring they operate efficiently and cost-effectively.
Drive improvements in deployment pipelines, CI/CD tooling, and incident management processes to minimize downtime and accelerate response times.
Define and cultivate SRE practices within Sierra, shaping culture, tooling, and best practices across the engineering organization.

Qualifications

Bachelor's degree in Computer Science or a related field, or equivalent experience.
Proven experience in Site Reliability Engineering or a similar role, with a strong understanding of cloud infrastructure (AWS).
Proficiency in Terraform and modern DevOps practices.
Experience with observability tools and techniques: monitoring, alerting, logging, and tracing.
Strong problem-solving skills with a focus on scalability and performance optimization.
Excellent collaboration and communication skills, with the ability to work effectively in a team environment.

Oct 21, 2025

Site Reliability Engineer - Infrastructure for Analytics Platform

OpenAI

Full-time|On-site|San Francisco

Apr 28, 2026

Senior Site Reliability Engineer at Hyperbolic | San Francisco

Hyperbolic Labs

Full-time|On-site|San Francisco, CA

Who We Are

At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is only limited by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects.

In anticipation of our growth following our Series A funding, our team, guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science, is set to transform the computing landscape.

About the Role

We are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining service level objectives, developing resilient incident response protocols, managing capacity across our extensive GPU network, and implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.

In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience. Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.

Mar 26, 2026

Site Reliability Engineer at Latent | San Francisco

Latent

Full-time|On-site|San Francisco

Location: San Francisco, CA (5 Days In-Office)

As a Site Reliability Engineer at Latent, you will be the backbone of our infrastructure, ensuring the exceptional stability and performance of our cutting-edge clinical AI platform that serves major health systems. Your role is pivotal in enhancing operational excellence, directly impacting patient access to critical treatments.

What Makes a Great Engineer at Latent

We seek individuals who are not just technically skilled but also passionate about ownership and high standards. You will thrive in our dynamic, in-office culture where teamwork and a winning mentality are key.

Tool Proficiency: You are highly adept with your tools, fluent in command line operations, and skilled in keyboard shortcuts.
Ownership: You take pride in managing complex systems and have a successful history of scaling mission-critical deployments.
Automation Drive: You have a passion for automation, consistently seeking innovative methods to enhance efficiency and establish operational excellence.
Problem Solver: You proactively address challenges, stepping in to resolve issues without waiting for others.

Your Responsibilities

As our SRE, you will take full ownership of the production environment and enhance the developer experience:

Infrastructure Ownership: Design, implement, and maintain a robust production environment, having experience with over 500 machine deployments.
Kubernetes Mastery: Utilize your expertise in Kubernetes and Helm to manage our containerized infrastructure, ensuring optimal deployment, scalability, and operational health.
CI/CD & Deployment Optimization: Streamline the deployment pipelines for TypeScript and Python/ML, supporting rapid feature releases while upholding top-notch reliability.
DevX Support: Enhance developer workflows by supporting Developer Experience (DevX) initiatives to improve tool proficiency and CI/CD systems.
Infrastructure as Code (IaC): Manage infrastructure definitions using Terraform.

Dec 5, 2025

Senior/Staff Site Reliability Engineer

fal

Full-time|On-site|San Francisco

Join our dynamic team at fal as a Senior/Staff Site Reliability Engineer. In this key role, you will leverage your expertise to enhance our systems' reliability and performance. If you are passionate about building scalable systems and enjoy working in a collaborative environment, we want to hear from you!

Feb 23, 2026

Senior Site Reliability Engineer - Supply

Mithril

Full-time|$170K/yr - $230K/yr|On-site|San Francisco Bay Area

Mithril is actively seeking skilled professionals for the position of Senior Site Reliability Engineer in our Supply division. We are open to candidates ranging from Senior to Principal level, with specific leveling contingent upon individual experience and proven expertise. Our ideal candidate will possess profound technical insight, strategic acumen, and a consistent record of delivering impactful results. We customize roles to align with each candidate’s distinct strengths and aspirations.

About Mithril

At Mithril, we are revolutionizing the way AI enterprises access computational power. Our mission is to orchestrate global compute capacity, simplifying usage and optimizing it for AI workloads. We are developing a pioneering public cloud specifically designed for AI applications, making high-performance compute as accessible and reliable as flipping a switch.

Founded just over a year ago from a Stanford PhD lab, Mithril has already attracted the support of premier investors, including Sequoia, Lightspeed, Jeff Dean, and Eric Schmidt, amassing $80 million in funding. We have already secured a strong customer base and are generating revenue.

Current machine learning infrastructure is excessively complicated, often forcing engineers to grapple with hardware and capacity challenges when their focus should be on innovation and addressing significant problems. Our Omnicloud Platform simplifies this by abstracting hardware management and offering seamless access to computing resources, enabling engineers to concentrate on developing transformative AI solutions.

We are not merely making compute accessible; we are ensuring it is versatile, scalable, and tailored to meet the unique needs of advanced AI. Our infrastructure marketplace combines state-of-the-art hardware with Mithril’s Omnicloud Platform and software capabilities, empowering AI companies to innovate at an accelerated pace.

Working at Mithril

As a Senior Site Reliability Engineer at Mithril, you will play a crucial role in building the infrastructure that drives the future of AI. Your contributions will be vital not only for scaling our systems but also for ensuring their reliability and security at every level.

You will be instrumental in developing and maintaining solutions that harness global compute resources, making them available to clients within the evolving AI/ML ecosystem, from high-performance clusters to foundational models, fine-tuning, inferencing, and agentic application workflows. You will tackle world-class technical challenges that enable cutting-edge AI workloads to utilize humanity's collective knowledge through planet-scale computing, helping our clients leverage AI's potential to enhance the world.

If you are inspired by the challenge of scaling and securing systems while driving innovation in AI, we encourage you to apply.

Nov 20, 2025

Senior Site Reliability Engineer at Crusoe | San Francisco, CA

Crusoe

Full-time|$172K/yr - $209K/yr|On-site|San Francisco, CA - US

At Crusoe, our mission is to drive the future of energy and intelligence. We are developing the infrastructure that empowers ambitious AI creations without compromising on scale, speed, or sustainability.

Join us in leading the AI revolution through sustainable technology. At Crusoe, you will be at the forefront of innovation, contributing to impactful projects and collaborating with a team dedicated to transforming cloud infrastructure responsibly.

About This Role:

As a Senior Site Reliability Engineer, you will play a crucial role in ensuring the operational excellence of Crusoe’s energy-efficient, AI-optimized GPU cloud. Your focus will be on maintaining stability, resilience, and performance, driving initiatives that enhance our cloud platform.

This position is perfect for engineers who thrive in dynamic environments, relish the challenge of solving operational issues, and seek to advance their technical careers while enhancing incident response and reliability for a large-scale distributed platform.

You will collaborate closely with senior SREs, infrastructure engineers, and platform teams to bolster reliability, minimize operational toil, and refine our incident management processes.

What You’ll Be Working On:

Work with cross-functional teams to establish and enhance availability metrics for our cloud infrastructure, including the development, tracking, and improvement of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Assist in incident response by diagnosing and resolving service disruptions, while supporting post-incident processes through root cause analysis documentation and participation in reviews.
Build, maintain, and monitor the health of our infrastructure using Crusoe’s observability tools (Prometheus, Grafana, Alertmanager, OpenTelemetry).
Identify and communicate reliability risks and performance bottlenecks, along with early indicators of potential incidents that may impact service availability.
Develop automation and tools to reduce operational toil, minimize manual processes, and improve service recovery and self-healing capabilities.
Collaborate with compute, network, storage, and platform teams to enhance service resilience and strengthen disaster recovery preparedness.
Engage in knowledge sharing and contribute to the development of operational best practices across the organization.

Dec 5, 2025

Site Reliability Engineer at Blaxel | San Francisco

Blaxel

Full-time|On-site|San Francisco

Join Our Team as a Site Reliability Engineer

Blaxel is seeking a highly skilled Site Reliability Engineer to enhance the reliability, performance, and scalability of our cutting-edge AI infrastructure platform.

In this role, you will develop and manage the essential systems that support scalable agentic AI. Your primary goal: maintain our ultra-low-latency, stateful, serverless compute engine, ensuring it remains robust as we handle billions of agent requests from the world's most advanced AI teams.

This position is deeply technical and execution-oriented. You will take charge of our reliability framework, encompassing observability, performance optimization, incident management, infrastructure health, and the automation processes that ensure seamless operations. We are looking for innovators who can design new reliability systems, advance automation capabilities, and continuously adapt the platform to accommodate next-generation AI workloads. If you are a builder who excels in managing critical infrastructure at scale, we want to hear from you.

Your Responsibilities

Working closely with our founders, infrastructure team, and development team, and leveraging AI for maximum efficiency, you will architect and manage the systems that keep Blaxel fast, resilient, and secure.

Design, operate, and iteratively enhance the core infrastructure that drives our 25ms cold-start compute engine.
Develop and refine our observability stack (metrics, traces, logs), ensuring proactive issue detection.
Establish, monitor, and drive SLOs/SLIs across vital system components to ensure world-class reliability.
Lead incident response with precision: conduct root cause analyses, post-mortems, and implement systemic solutions.
Design and deploy self-healing, automated operational systems to minimize manual work and scale operations.
Collaborate across compute, networking, storage, and sandboxed execution layers to optimize performance under intense workloads.
Create automation tools, often utilizing AI agents, to enhance operations, debugging, capacity planning, and failure predictions.
Test and stress our systems to their limits: engage in load testing, chaos engineering, and performance benchmarking.
Champion security best practices at the infrastructure level, from sandboxed compute to network isolation.
Collaborate with platform engineers to ensure reliability is an integral part of new features from inception.

Who You Are

Extensive technical expertise in site reliability engineering, with a passion for building scalable systems.

Mar 3, 2026

Site Reliability Engineer (SRE) at Mithril | San Francisco

Mithril

Full-time|$170K/yr - $230K/yr|On-site|Palo Alto / San Francisco Bay Area

Mithril develops AI infrastructure aimed at making GPU computing more accessible and affordable for enterprises, AI startups, and researchers. Clients include LG AI Research, Saronic, and the Broad Institute. The company was founded by a former Google DeepMind research scientist and a Stanford CS PhD. Mithril has secured $80M in seed and Series A funding from Sequoia Capital and Lightspeed Venture Partners. Over the past year, platform revenue has grown more than sixfold. Fast Company recognized Mithril as the 8th Most Innovative Company in Artificial Intelligence for 2026. The engineering team at Mithril is small, with each member making a significant impact. This Site Reliability Engineer (SRE) position is a foundational role focused on shaping how the platform scales across a multi-cloud environment.

Role overview

This SRE will play a central role in keeping Mithril's global GPU orchestration platform stable and high-performing. The responsibilities extend beyond day-to-day maintenance. The primary focus is on designing and building automation, observability, and tooling to help manage advanced compute resources across multiple cloud providers. The goal is to ensure customers have fast and dependable access to infrastructure. Collaboration with Mithril's founding team is central to this job. The SRE will help set service level objectives (SLOs), orchestrate capacity, and make influential infrastructure decisions, gaining visibility into both technical and commercial aspects of the business.

What makes this SRE role unique

This position differs from many early-stage SRE roles that focus mainly on on-call rotations and incident response. Here, the emphasis is on building infrastructure that actively shapes Mithril's marketplace. The systems developed will determine how supply is sourced, allocated, and monitored across providers, directly affecting customer experience and company revenue. The role offers genuine ownership, a fast feedback loop with leadership, and the opportunity to define how infrastructure engineering evolves as Mithril grows.

Core responsibilities

About 70–75% of the work centers on platform reliability and infrastructure automation.

Reliability & SLOs

Implement and manage service level indicators (SLIs) and service level objectives (SLOs) for Mithril's API layer and internal orchestration services to maintain high reliability and performance.

Apr 22, 2026

Site Reliability Engineer

Cognition

Full-time|On-site|San Francisco Bay Area

Join Our Team

At Cognition, we are at the forefront of applied AI innovation, developing cutting-edge software agents that redefine the engineering landscape. Our flagship products, Devin, the pioneering AI software engineer, and Windsurf, an AI-native IDE, embody our commitment to creating AI that collaborates with engineers as a true partner.

Our team is composed of elite talent including competitive programming champions, visionary founders, and researchers from top AI institutions such as Scale AI, Palantir, Cursor, Google DeepMind, and more.

Your Mission

As a Site Reliability Engineer, you will play a crucial role in ensuring the reliability of our user-focused products, which are utilized by hundreds of thousands of developers daily. Your mission is to preemptively address potential issues and swiftly resolve any incidents that may arise, maintaining a seamless experience for our users.

You will be responsible for overseeing production reliability and enhancing our platform engineering practices, encompassing SLOs, incident response, and on-call duties, alongside CI/CD pipelines, deployment infrastructure, and developer tools. At Cognition, we believe in integrating reliability into our systems rather than treating it as an afterthought, and we strive to cultivate a culture that reflects this philosophy.

Your Achievements

Production Reliability: Establish and manage SLOs, SLIs, and error budgets for our products. Develop robust monitoring, alerting, and observability systems to maintain a transparent view of service health.
Incident Management: Spearhead incident response with precision and promptness. Conduct blameless postmortems to derive actionable insights from outages, and create effective runbooks and tools to enhance on-call sustainability.
Platform Engineering: Oversee deployment pipelines and internal developer tools, ensuring rapid, reliable shipping of code while minimizing unnecessary toil for engineers.
Infrastructure as Code: Manage cloud infrastructure via code, creating reproducible, auditable environments that can scale with product demands and mitigate configuration drift.
Capacity Planning: Analyze growth trends, anticipate resource requirements, and ensure our infrastructure is always ahead of user demand, optimizing system performance proactively.
Security and Reliability: Integrate security protocols with reliability practices to create a robust framework that safeguards our infrastructure.

Oct 13, 2025

Site Reliability Engineer, Frontier Systems Infrastructure

OpenAI

Full-time|On-site|San Francisco

About Our Team

The Frontier Systems team at OpenAI is at the forefront of technological innovation, responsible for designing, deploying, and maintaining state-of-the-art supercomputers that power our most advanced model training initiatives. We transform innovative data center designs into fully functional systems and develop the necessary software to support extensive frontier model training.

Our mission is to ensure the stability and efficiency of these hyperscale supercomputers, providing an uninterrupted environment for the training of frontier models.

About the Opportunity

We are seeking passionate engineers to manage the next generation of compute clusters that fuel OpenAI’s leading-edge research. This role merges distributed systems engineering with practical infrastructure expertise across our expansive data centers. You will be tasked with scaling Kubernetes clusters to unprecedented levels, automating bare-metal deployments, and creating software solutions that simplify interactions across a multitude of nodes in various data centers.

You will operate at the confluence of hardware and software, where speed and reliability are of utmost importance. Prepare to oversee dynamic operations, swiftly diagnose and resolve critical issues, and continuously enhance automation and system uptime.

Key Responsibilities:

Deploy and scale substantial Kubernetes clusters, implementing automation for provisioning, bootstrapping, and lifecycle management.

Create software abstractions that integrate multiple clusters, delivering a seamless interface for training workloads.

Oversee node deployment from bare metal to firmware upgrades, ensuring swift and repeatable processes at scale.

Enhance operational metrics, striving to minimize cluster restart times (e.g., reducing from hours to minutes) and expedite firmware or OS upgrades.

Integrate networking and hardware health systems to ensure comprehensive reliability across servers, switches, and data center infrastructure.

Develop monitoring and observability systems that proactively identify issues and maintain cluster stability under peak loads.

Be prepared to perform at the level of a software engineer in execution and problem-solving.

You May Be a Great Fit If You:

Possess extensive experience in operating or scaling Kubernetes clusters or similar container orchestration systems.

Nov 3, 2025

Senior Site Reliability Engineer

alembic

Full-time|On-site|San Francisco HQ

About the Role

Join alembic as a Senior Site Reliability Engineer (SRE) and become an integral part of our mission to enhance platform reliability, observability, and operational excellence. In this pivotal role, you will collaborate with engineers and data scientists to architect, automate, and maintain the robust infrastructure that drives our platform, including data pipelines, machine learning workloads, and real-time analytics systems.

This hands-on position offers significant visibility across the technology stack and provides you with the opportunity to shape the future of our infrastructure and operations.

Dec 22, 2025

Senior Site Reliability Engineer

Hive

Full-time|On-site|San Francisco

About Hive

Hive stands at the forefront of cloud-based AI innovation, providing cutting-edge solutions that enable organizations to understand, search, and generate content. Our platform is relied upon by some of the world's most prestigious and forward-thinking companies. We empower developers with an extensive suite of state-of-the-art, pre-trained AI models that handle billions of API requests each month. In addition to our robust model offerings, we deliver comprehensive software applications backed by proprietary AI models and datasets, unlocking transformative applications in various sectors such as content moderation, brand protection, sponsorship measurement, and context-based advertising.

With over $120 million in funding from esteemed investors like General Catalyst, 8VC, Glynn Capital, Bain & Company, and Visa Ventures, Hive has cultivated a vibrant global team of over 250 employees across our San Francisco, Seattle, and Delhi offices. If you’re passionate about shaping the future of AI, we invite you to join our dynamic team!

DevOps and Systems Team

In response to our distinctive machine learning demands, we have developed our own data centers focusing on distributed high-performance computing with GPU integration. While we harness the power of these data centers, our infrastructure remains hybrid, leveraging public cloud solutions when advantageous. As we scale our machine learning models for commercial use, we are expanding our DevOps and Site Reliability team to ensure the reliability of our enterprise SaaS offerings. Our ideal candidate thrives in dynamic environments, embraces automation, and believes that every task can be automated and every server can scale. You take pride in enhancing performance across all layers of our stack and are committed to never performing the same task manually twice.

Apr 20, 2022

Infrastructure & Site Reliability Engineer at Atomic Semi | San Francisco

Atomic Semi

Full-time|$125K/yr - $195K/yr|On-site|San Francisco Office

About Atomic Semi

Atomic Semi is pioneering the development of a compact and agile semiconductor fabrication facility.

With today’s technology, alongside a few innovative simplifications, we are capable of realizing this vision. We will create our own tools, allowing for rapid iterations and enhancements.

Our goal is to assemble a small, exceptional team of hands-on engineers to drive this initiative forward. Our team is composed of experts in mechanical, electrical, hardware, computer, and process engineering. We will manage the entire stack, from atoms to architecture, with a forward-thinking approach that pushes the boundaries of technology.

Our philosophy emphasizes that smaller, faster, and self-built systems are superior.

We are confident that our team and lab can create anything we envision. Equipped with 3D printers, diverse microscopes, e-beam writers, and general fabrication tools, we are committed to inventing whatever tools we may need along the way.

Founded by Sam Zeloof and Jim Keller, Atomic Semi combines Sam's garage chip-making prowess with Jim's extensive 40-year leadership in the semiconductor industry.

About the Role

We are in search of an Infrastructure & Site Reliability Engineer to design, construct, deploy, and oversee the on-premises backend infrastructure that drives our rapid semiconductor fabrication process.

This multifaceted role encompasses all elements of backend infrastructure and services.

Our infrastructure philosophy prioritizes minimalism, clarity, on-site operations, and proximity to hardware. Expect a focus on bare-metal Linux, systemd, and single-file binaries rather than extensive use of Docker, cloud services, or Kubernetes. Proficiency in Rust, Go, and Python will be beneficial.

We welcome candidates from various experience levels, ranging from outstanding early-career engineers to seasoned professionals. We are not fixed on a specific background; what is paramount is your proven ability to build real systems, enthusiasm for hands-on engineering, and a strong display of engineering excellence. If you are passionate about performance engineering, developing complex features from the ground up, and swiftly mastering new domains, this is an exciting opportunity for you.

A portfolio or GitHub account is generally required to apply: demonstrate the projects you’ve undertaken!

Feb 13, 2026

Staff/Lead Site Reliability Engineer (SRE)

HeartFlow, Inc.

Full-time|$200.8K/yr - $250.9K/yr|On-site|San Francisco, California

About HeartFlow

HeartFlow, Inc. is a medical technology company focused on improving the diagnosis and management of coronary artery disease. Our flagship product, the AI-powered HeartFlow FFRCT Analysis, provides a non-invasive, color-coded 3D view of a patient’s coronary arteries. Clinicians use our platform to identify blockages, assess blood flow, and analyze atherosclerosis, all in alignment with ACC/AHA Chest Pain Guidelines. HeartFlow’s technology supports care teams in the US, UK, Europe, Japan, and Canada, and has already impacted over 500,000 patients worldwide. As a publicly traded company (NASDAQ: HTFL), HeartFlow continues to expand its product line and modernize its platform to support the next generation of life-saving medical technologies.

Role Overview: Staff/Lead Site Reliability Engineer (SRE)

HeartFlow is searching for an experienced Site Reliability Engineer to join the cloud-native infrastructure team in San Francisco, California. This role works closely with Platform engineers and development teams to maintain and improve the reliability, scalability, observability, and performance of critical systems.

What You Will Do

Collaborate with Platform and development teams to ensure system reliability and performance

Automate complex operational processes and reduce manual work

Establish and promote standards for production excellence

Support ongoing Platform Modernization initiatives

Who We’re Looking For

Extensive experience as a Site Reliability Engineer or in a similar role

Strong background in cloud-native infrastructure

Interest in automation, reliability, and scalable systems

Comfort working with cross-functional engineering teams

Location

This position is based in San Francisco, California.

Apr 14, 2026
