Software Engineer Agent Infrastructure jobs in San Francisco – Page 3 | RoboApply Jobs


Results 41–60 of 5,802 for “Software Engineer Agent Infrastructure” in San Francisco.

Fable Security
Full-time|$160K/yr - $225K/yr|Hybrid|San Francisco, CA

About Fable Security
In today’s digital landscape, AI-driven threats and human errors represent the most significant risks to enterprise security. Cybercriminals exploit human behavior, contributing to 70% of security breaches. At Fable, we empower individuals to transform from potential targets to active defenders with innovative tools. Fable is at the forefr…

Apr 6, 2026
Asana
Full-time|$202K/yr - $230K/yr|Hybrid|San Francisco

At Asana, we are committed to proactively mitigating security risks by developing core libraries, platforms, and frameworks that ensure robust security across our organization. We are seeking a highly skilled Senior Software Engineer to become a vital member of our newly established Security Development team. This team is dedicated to crafting resilient, secure-by-default solutions aimed at safeguarding Asana's infrastructure and products. Rather than acting as gatekeepers, we create guardrails that empower our engineering teams to operate swiftly and securely.

Your role will involve engineering scalable preventative controls and developing platforms and frameworks that eliminate systemic risks.

This position is based in our San Francisco office with a hybrid work schedule, requiring in-office attendance on Mondays, Tuesdays, and Thursdays, while Wednesdays offer a work-from-home option. Fridays may also allow for remote work, depending on your projects and team dynamics. If you are selected for an interview, your recruiter will provide additional insights on the in-office expectations.

Feb 23, 2026
Julius
Full-time|On-site|San Francisco, CA

Compensation: Competitive base salary + substantial equity
Benefits: Health & dental insurance, gym reimbursement, daily team lunches, 401(k)

About Julius
At Julius, we're pioneering advancements in applied AI by developing cutting-edge coding agents. Our platform executes approximately 1 million lines of code every 36 hours, serving over 1 million users and generating 3 million+ visualizations. We manage all code in isolated remote containers. As a revenue-generating company, we are backed by AI Grant and founders with remarkable backgrounds from companies like Vercel, Notion, Perplexity, Palantir, Replit, Zapier, Intercom, and Dropbox.

The Role
Join us in building and scaling the robust code-execution platform that powers Julius, across both cloud and on-prem environments. We orchestrate over 500,000 containers per month, and demand is growing rapidly. You will take ownership of reliability, performance, and security within our multi-tenant compute environment.

Your Responsibilities
• Design and manage a secure, multi-tenant container infrastructure that ensures quick startup and intelligent autoscaling.
• Implement on-prem/private cloud deployments using Helm and Terraform, integrating SSO, network controls, and audit logging.
• Enhance observability (metrics, traces, logs) with well-defined SLOs and lead incident response initiatives.
• Optimize images, scheduling, networking, and costs, while developing fair-use and rate-limiting controls.

Your Qualifications
• Strong experience with production Kubernetes and container internals (Docker/containerd); solid understanding of networking principles.
• Familiarity with cloud environments (AWS/GCP/Azure) and Infrastructure as Code (Terraform/Helm).
• Proficiency in monitoring and logging tools (Prometheus, Grafana, OpenTelemetry, ELK/Vector).
• Understanding of security best practices for containerized, multi-tenant systems.

Preferred Qualifications
• Experience with gVisor, Kata, or Firecracker; Cilium/eBPF; GPU scheduling; serverless autoscaling (KEDA/Knative/Karpenter).
• Proven experience delivering on-prem or air-gapped enterprise software solutions.
• A passion for AI, with experience building side projects involving LLMs.

Why Join Julius?
Be part of a small, senior team where your contributions will have a massive impact. Tackle challenging infrastructure problems at meaningful scale.

Aug 11, 2025
OpenAI
Full-time|On-site|San Francisco

About the Team
Join the Frontier Systems team at OpenAI, where we design, implement, and maintain the world's largest supercomputers, essential for advancing our most groundbreaking model training initiatives. We transform data center blueprints into operational systems while crafting the software necessary for executing large-scale frontier model training. Our mission is to establish, stabilize, and ensure the reliability and efficiency of these hyperscale supercomputers throughout the training of our frontier models.

About the Role
We are seeking passionate engineers to manage the next generation of compute clusters that underpin OpenAI's frontier research. This position merges distributed systems engineering with practical infrastructure work across our expansive data centers. You will scale Kubernetes clusters to unprecedented levels, automate bare-metal setups, and create the software layer that simplifies the complexity of numerous nodes across various data centers. Your work will sit at the crossroads of hardware and software, where speed and reliability are paramount. Be prepared to oversee dynamic operations, swiftly identify and resolve pressing issues, and constantly raise the bar for automation and uptime.

In this role, you will:
• Provision and scale extensive Kubernetes clusters, including automation for deployment, bootstrapping, and lifecycle management
• Create software abstractions that integrate multiple clusters and provide a cohesive interface for training workloads
• Oversee node deployment from bare metal to firmware upgrades, ensuring rapid, repeatable setups at scale
• Improve operational metrics by reducing cluster restart times (e.g., from hours to minutes) and expediting firmware and OS upgrade cycles
• Integrate networking and hardware health systems to ensure end-to-end reliability across servers, switches, and data center infrastructure
• Develop monitoring and observability systems to identify issues early and maintain cluster stability under high load

You might thrive in this role if you:
• Have extensive experience operating or scaling Kubernetes clusters or similar container orchestration systems in high-growth or hyperscale environments
• Possess strong programming skills in languages relevant to cloud and infrastructure management

Nov 7, 2024
Ivo
Full-time|On-site|San Francisco, California

Join the Crew of Ivo!
At Ivo, we are more than just engineers; we are pioneers of the digital seas! Our crew has set sail with groundbreaking innovations that have reshaped the landscape of legal tech:
• An AI agent that seamlessly integrates with MS Word to enhance your documents [2023]
• Transitioning from traditional embedding models to agentic RAG for superior performance [2023]
• Advancing large-scale LLM-driven legal fact extraction [2024]
• A legal assistant capable of accurately searching vast contract databases [2024]
• Clustering legal documents from the same lineage [2025]
• Implementing automatic deviation analysis to uncover hidden risks in extensive contract databases [2025]
• Merging contracts with amendments to create comprehensive “composite” contracts (one of our clients shed tears of joy upon seeing this) [2025]

The Role of an Infrastructure Engineer
As an Infrastructure Engineer, you will be the architect of Ivo's platform, ensuring its robustness and scalability. Your mission includes:
• Taking ownership of our environment's future, with ample room for creative system design.
• Managing numerous customer deployments: every client deserves a unique setup, from containers to databases.
• Instrumenting our systems to identify performance bottlenecks and errors.
• Aggregating metrics, logs, and health checks into user-friendly dashboards and alerts.
• Leading the charge during infrastructure incidents.
• Accelerating our CI/CD system (currently a sluggish ~12 minutes; let's speed that up!).

If you share our passion for LLMs and thrive in a dynamic environment, we want you to help us push the boundaries of DevOps:
• Innovating real-time LLM evaluations to ensure the accuracy of our outputs.
• Building upon our existing infrastructure to enhance performance and reliability.

Set sail with us at Ivo, where your technical skills will help chart the course for the future of legal technology!

Mar 5, 2026
Benchling
Full-time|$148K/yr - $200K/yr|On-site|San Francisco, CA

Biotechnology is transforming our world, influencing everything from the medicines we consume to the crops we cultivate and the materials we use daily. To keep pace with the rapid advancements in science, we require cutting-edge technology. At Benchling, our mission is to harness the potential of biotechnology. The most pioneering biotech firms rely on Benchling's R&D Cloud to facilitate the creation of innovative products and accelerate their journey to milestones and market readiness. Join us in bringing state-of-the-art software solutions to the forefront of modern science.

ROLE OVERVIEW
We are seeking a talented Backend Software Engineer to join our Infrastructure Engineering team, where you will build and maintain the foundational platform that powers our product offerings. This role spans various infrastructure disciplines, including cloud infrastructure on AWS, Kubernetes-based services, and the operational tools necessary to ensure system reliability. Collaboration is key: you will work closely with product engineering teams to better understand their requirements and enhance the developer experience throughout Benchling. We are looking for a motivated early-career engineer with strong foundational skills and a desire to grow. The ideal candidate will be enthusiastic about learning, eager to contribute across diverse areas, and ready to take on increasing responsibility. Because we operate in a regulated environment, your work will focus on building reliable, secure, and auditable systems.

RESPONSIBILITIES
• Develop, sustain, and enhance core infrastructure and platform services used by our product engineering teams.
• Collaborate with product teams to establish requirements, design efficient pathways, and minimize deployment and operational challenges.
• Contribute to our Kubernetes-based platform, including service configuration, traffic management, and platform tooling.
• Design and manage AWS infrastructure and automation, emphasizing scalability, cost efficiency, and resilience.
• Enhance observability and operational readiness through metrics, logging, tracing, dashboards, and alerting.
• Participate in an on-call rotation, manage incident response, and drive process improvements to prevent recurrences.
• Produce clear technical documentation, engage in design discussions and reviews, and make incremental system improvements for maintainability and reliability.

Feb 24, 2026
OpenAI
Full-time|On-site|San Francisco

About the Team
Join OpenAI's Privacy Engineering team, where we operate at the vital crossroads of Security, Privacy, Legal, and Core Infrastructure. Our mission is to develop cutting-edge data infrastructure and systems that empower our privacy, legal, and security teams to operate securely, swiftly, and at scale. We adhere to principles of defensibility by default, enabling impactful research, and fostering a robust security culture in preparation for transformative technologies.

About the Role
We are seeking a talented Software Engineer to design and implement technical systems that facilitate legal compliance workflows, including secure data processing and document review. In this role, you will collaborate closely with Legal, Security, IT, and engineering teams to translate legal processes into actionable technical workflows. This position is ideal for an engineer who is passionate about large-scale data challenges and understands the meticulousness required to ensure compliance. The role is located in San Francisco, and we offer relocation assistance to qualified candidates.

Key Responsibilities:
• Design and maintain scalable data storage pipelines.
• Develop search and discovery services (e.g., Spark/Databricks, index layers, metadata catalogs) tailored to partner team requirements.
• Automate secure data transfers, including encryption, checksumming, and auditing exports to reviewers.
• Establish secure compute environments that balance usability with stringent security controls.
• Implement monitoring and KPIs to ensure accountability of data holds and productions.
• Work cross-functionally to document SOPs, threat models, and chain-of-custody documentation that can withstand scrutiny.

Ideal Candidates Will:
• Possess practical experience building or operating large-scale data-lake or backup systems (Azure, AWS, GCP).
• Be proficient with Terraform or Pulumi and CI/CD processes, and be capable of converting ad-hoc legal requests into repeatable pipelines.
• Be comfortable working with discovery workflows (legal holds, enterprise document collections, secure review), or eager to quickly gain expertise.
• Effectively communicate technical concepts (from storage governance to block-ID APIs) to interdisciplinary teams such as Legal and Engineering.

Apr 24, 2025
Andromeda Cluster
Full-time|Remote|North America Remote / San Francisco, CA

Join Our Team as a Software Engineer - AI Infrastructure
Location: North America Remote / San Francisco · Full-Time

At Andromeda Cluster, we are dedicated to democratizing access to advanced AI infrastructure that was once available only to hyperscalers. Founded by industry leaders Nat Friedman and Daniel Gross, we have evolved from a single managed cluster into a global platform that connects top AI labs, data centers, and cloud providers around the world. Our orchestration layer efficiently manages training and inference tasks globally, enhancing flexibility and efficiency in this rapidly expanding sector. We aim to create a global marketplace for AI computing, empowering AGI with the same fluidity as global financial markets. As we continue to grow, we are on the lookout for talented individuals in AI infrastructure, research, and engineering.

Your Role
As an Infrastructure Product Engineer, you will be integral in constructing the foundational framework of Andromeda's platform. Your challenge will be to simplify complex, real-world infrastructure issues into scalable product solutions that benefit our customers.

Key Responsibilities
• Architect and develop essential platform components, focusing on infrastructure orchestration, provisioning, and lifecycle management.
• Create robust APIs, services, and control planes that abstract diverse infrastructure types, including VMs, Kubernetes, bare metal, and schedulers.
• Convert customer usage patterns into actionable product requirements, delivering impactful features and enhancements.
• Design automation and internal tools to reduce manual and ad-hoc operational tasks.
• Improve platform reliability, performance, and observability, favoring sustainable enhancements over quick fixes.
• Collaborate with other teams to establish clear ownership boundaries between platform features and customer-specific solutions.
• Write clean, maintainable, and well-documented code with a focus on long-term sustainability.
• Engage in technical design discussions and contribute to the architectural advancement of our platform.

Feb 18, 2026
Parafin
Full-time|On-site|San Francisco, CA

Join our dynamic team at Parafin as a Senior Software Engineer specializing in Infrastructure. In this pivotal role, you will design, develop, and maintain robust infrastructure solutions that support our scalable applications. Your expertise will help us enhance system performance, reliability, and security.We are looking for innovative thinkers who thrive in a collaborative environment. You will work closely with cross-functional teams to implement cutting-edge technologies that drive our product forward.

Apr 3, 2026
Figma
Full-time|$153K/yr - $376K/yr|Remote|San Francisco, CA • New York, NY • United States

At Figma, we are expanding our team of dedicated creatives and innovators committed to making design accessible for everyone. Our platform empowers teams to transform ideas into reality, whether you're brainstorming, prototyping, converting designs into code, or utilizing AI for enhancements. From concept to product, Figma enables teams to optimize workflows, accelerate processes, and collaborate in real-time from anywhere in the world. If you're passionate about shaping the future of design and teamwork, we invite you to join us!

The Data Platform team at Figma is responsible for constructing and managing the essential systems that drive analytics, AI/ML initiatives, and data-informed decision-making across our organization. We cater to a wide array of stakeholders, including AI researchers, machine learning engineers, data scientists, product engineers, and business teams that depend on data for insights and strategic planning. Our team owns and scales critical platforms such as the Snowflake data warehouse, ML Datalake, orchestration and pipeline infrastructure, and extensive data ingestion and processing systems, overseeing all data transactions that occur within these platforms.

Despite our small size, we tackle significant, high-impact challenges. In the upcoming years, we are focused on developing the data infrastructure layer for Figma's AI-driven products, enhancing cost and performance efficiencies across our data stack, scaling our ingestion and reverse ETL capabilities for new product applications, and reinforcing data quality, reliability, and compliance at every level. If you are enthusiastic about creating scalable, high-performance data platforms that empower teams across Figma, we would love to connect with you!

This is a full-time role that can be performed from one of our US hubs or remotely within the United States.

Apr 7, 2026
Sierra
Full-time|On-site|San Francisco, CA

Join Sierra as a Software Engineer specializing in Agent Architecture. In this role, you will be responsible for designing and developing innovative software solutions that empower our agents and enhance their capabilities. Collaborate with cross-functional teams to ensure seamless integration of our software with existing systems.

Mar 27, 2026
Speechify
Full-time|On-site|San Francisco, CA, USA

Join Speechify as a Software Engineer specializing in Data Infrastructure and Acquisition. In this role, you will be critical in designing, developing, and optimizing data systems to support our innovative applications that enhance the learning experience for users worldwide. Collaborate with cross-functional teams to create robust data solutions that drive decision-making and improve overall product performance.

Apr 30, 2026
Databricks
Full-time|On-site|San Francisco, California

Databricks is looking for a Senior Software Engineer focused on Compute Infrastructure in San Francisco, California. This position centers on building and improving compute architecture to support greater performance and scalability across Databricks' platform.

What you will do
• Develop and optimize compute infrastructure to handle demanding data processing and analytics workloads.
• Work closely with teams from different disciplines to deliver reliable, high-quality solutions for customers.

Impact
Your contributions will help define how data processing and analytics evolve at Databricks. The work directly supports customers' ability to scale and perform complex tasks in the cloud.

Who we're looking for
• Strong background in cloud technologies and compute systems.
• Enjoys tackling complex technical challenges.
• Collaborative approach to problem-solving with cross-functional teams.

Apr 28, 2026
Hover
Full-time|$194K/yr - $239K/yr|On-site|San Francisco, CA • New York, NY

At Hover, we empower individuals to conceptualize, enhance, and safeguard the spaces they cherish. Utilizing proprietary AI and over a decade's worth of real property data, we provide answers to pivotal questions such as 'What will it look like?' and 'What will it cost?' Our platform offers homeowners, contractors, and insurance professionals accurately measured, interactive 3D models of properties, all achievable from a smartphone scan in mere minutes.

Driven by curiosity and purpose, we maintain a strong commitment to our customers, communities, and one another. We believe that diverse perspectives foster the best ideas, and we take pride in nurturing an inclusive, high-performance culture that encourages growth, accountability, and excellence. Supported by premier investors like Google Ventures and Menlo Ventures, and trusted by industry leaders such as Travelers, State Farm, and Nationwide, we are revolutionizing how individuals perceive and interact with their environments.

About the Role
As a Senior Software Engineer specializing in Infrastructure, you will tackle cloud infrastructure challenges unique to a company focused on 3D data, computer vision, and machine learning. Your enthusiasm for building internal tools and your talent for crafting elegant solutions to complex problems will be crucial in this role.

Our Infrastructure team is responsible for everything beyond the application binary, serving as a critical partner to the rest of the engineering department. Through automation, we aim to streamline processes, ensuring that the simplest path is also the fastest and most secure. We manage and optimize all cloud infrastructure components, including our Kubernetes environment, databases, networks, storage, and caching systems. Collaborating with engineering peers, we establish consistent solutions to common architectural challenges, particularly those involving rich geospatial and machine learning workloads. We are well-versed in best practices for cloud architecture and CI/CD, leveraging application development as a means to implement these practices.

Your Contributions
You will play a pivotal role in developing straightforward solutions to intriguing problems, enhancing the foundation upon which our engineering teams build. Collaborating closely with engineers across the organization, you will help make their applications faster, easier to manage, and more reliable in production. Your work will span frontend, backend, computer vision, data, security, and machine learning teams to scale new ideas into production effectively. Given the small and highly collaborative nature of our team, you can expect a varied and impactful workload, which may include:
• Designing scalable cloud architecture
• Enhancing CI/CD pipelines and developer tooling

Mar 11, 2026
Ivo, Inc.
Full-time|$325K/yr - $405K/yr|On-site|San Francisco

About Ivo, Inc.
Ivo, Inc. is based in San Francisco and builds advanced tools for the legal and document management space. The team has delivered recent projects such as:
• An AI agent for MS Word that streamlines document editing (2023)
• Agentic RAG for improved embedding model precision (2023)
• Large-scale LLMs for legal fact extraction (2024)
• A legal assistant for searching extensive contract databases with accuracy (2024)
• Clustering techniques for related legal documents (2025)
• Automatic deviation analysis to uncover risks in large contract sets (2025)
• Innovative contract merging to create composite contract series for clients (2025)

Role Overview: Infrastructure Staff Software Engineer
This role shapes the foundation of Ivo's platform. The Infrastructure Engineer will design, build, and maintain the systems that power our products and support our engineering team.

What You Will Do
• Design and build scalable infrastructure for Ivo's platform
• Manage multiple customer deployments, ensuring each client has dedicated containers, databases, and VPCs
• Instrument systems to identify and resolve performance bottlenecks and errors
• Aggregate metrics, logs, and health checks into dashboards and alerting systems
• Lead response to infrastructure incidents and participate in on-call rotations as needed
• Optimize CI/CD pipelines to reduce deployment times from approximately 12 minutes

DevOps and LLM Innovation
Ivo values engineers who are eager to experiment and improve. Areas of exploration include:
• Building real-time LLM evaluation tools to monitor output accuracy
• Developing autonomous agents to detect and fix production issues before they escalate
• Contributing new ideas that advance our mission and platform reliability

Apr 14, 2026
Thinking Machines Lab
Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, our vision is to enhance human potential by advancing collaborative general intelligence. We are dedicated to creating a future where individuals have the resources and knowledge to harness AI for their specific objectives and aspirations. Our team comprises scientists, engineers, and innovators who have developed some of the most popular AI products, including ChatGPT and Character.ai, as well as influential open-weight models like Mistral, along with highly regarded open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role
We are seeking a talented engineer to enhance our data infrastructure. You will join a dynamic, high-impact team tasked with designing and scaling the foundational infrastructure for distributed training pipelines, multimodal data catalogs, and sophisticated processing systems that manage petabytes of data. Our infrastructure is pivotal; it serves as the foundation for every groundbreaking achievement. You will collaborate directly with researchers to expedite experiments, develop novel datasets, optimize infrastructure efficiency, and derive essential insights from our data repositories. If you are passionate about distributed systems, large-scale data mining, and open-source tools such as Spark, Kafka, Beam, Ray, and Delta Lake, and enjoy building innovative solutions from scratch, we encourage you to apply.

Note: This is an evergreen role that we keep open continuously for expressions of interest. We receive a high volume of applications, and while there may not always be an immediate position that aligns perfectly with your skills and experience, we encourage you to apply. We regularly review applications and reach out as new opportunities arise. You are welcome to reapply after gaining more experience, but please refrain from applying more than once every six months. We may also post specific roles for particular projects or team needs; in those cases, you are welcome to apply directly in addition to this evergreen role.

Nov 27, 2025
OpenAI
Full-time|Hybrid|San Francisco

About Our Team
At OpenAI, our Data Platform team is at the heart of our innovative approaches to data management, powering essential product, research, and analytics workflows. We manage some of the largest Spark compute fleets in production, architect data lakes and metadata systems on Iceberg and Delta, and envision exabyte-scale architectures. Our high-throughput streaming platforms utilize Kafka and Flink, while our orchestration is powered by Airflow. We also support machine learning feature engineering tools such as Chronon. Our mission is to provide secure, reliable, and efficient data access at scale, enhancing intelligent, AI-assisted data workflows. Join us in building and maintaining these core platforms, which are foundational to OpenAI's products, research, and analytics capabilities. We are not just scaling infrastructure; we are transforming the way people engage with data. Our vision includes intelligent interfaces and AI-powered workflows that make data interactions faster, more reliable, and more intuitive.

About the Position
In this role, you will focus on constructing and managing data infrastructure that supports extensive compute fleets and storage systems optimized for high performance and scalability. You will be instrumental in designing, developing, and operating the next generation of data infrastructure at OpenAI. Your responsibilities will encompass scaling and securing big data compute and storage platforms, building and maintaining high-throughput streaming systems, ensuring low-latency data ingestion, and facilitating secure, governed data access for machine learning and analytics. You will also prioritize reliability and performance at extreme scale. You will own the full lifecycle: from architecture to implementation, production operations, and on-call responsibilities. You should be experienced with platforms such as Spark, Kafka, Flink, Airflow, Trino, or Iceberg. Familiarity with infrastructure tools like Terraform, along with expertise in debugging large-scale distributed systems, is essential. A passion for addressing data infrastructure challenges in the AI domain is a must. This role is based in San Francisco, CA. We offer a hybrid work model requiring three days in the office each week and provide relocation assistance for new hires.

Responsibilities:
• Design, build, and maintain data infrastructure systems, including distributed compute, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure, ensuring they are scalable, reliable, and secure.
• Ensure our data platform can scale significantly while maintaining reliability and efficiency.
• Enhance company productivity by empowering your fellow engineers and teammates through innovative data solutions.

Jun 27, 2024
OpenAI
Full-time|Hybrid|San Francisco

Join the Fleet Infrastructure team at OpenAI, where you will play a pivotal role in managing and enhancing one of the world's largest and most efficient GPU fleets, dedicated to powering OpenAI's advanced model training and deployment initiatives. Your contributions will range from:
• Developing user-friendly scheduling and quota systems to maximize GPU utilization.
• Creating automated solutions for seamless Kubernetes cluster provisioning and upgrades, ensuring a robust and low-maintenance platform.
• Building service frameworks and deployment systems that support diverse research workflows.
• Enhancing model startup times through high-performance snapshot delivery, leveraging advanced blob storage and hardware caching techniques.
• And much more!

About the Role
As a Software Engineer in Fleet Infrastructure, you will design, develop, deploy, and maintain essential infrastructure systems that facilitate model training and deployment on a massive GPU fleet. This role presents an exciting opportunity to influence a critical system that supports OpenAI's mission to responsibly advance AI capabilities, all while working in a fast-paced environment with tight deadlines. The role is based in San Francisco, CA; we embrace a hybrid work model, encouraging three days in the office each week, and offer relocation assistance for new hires.

In this role, you will:
• Design, implement, and manage components of our compute fleet, focusing on job scheduling, cluster management, snapshot delivery, and CI/CD systems.
• Collaborate closely with research and product teams to understand and meet workload requirements effectively.
• Work alongside hardware, infrastructure, and business teams to deliver a service characterized by high utilization and reliability.

Feb 13, 2025
OpenAI
Full-time|On-site|San Francisco

About the Team
At OpenAI, we are on a mission to develop safe and beneficial artificial general intelligence. Our models are integrated into innovative products such as ChatGPT and various APIs. To ensure these systems are swift, reliable, and economically viable, we require top-tier infrastructure that stands out in the industry.

The Caching Infrastructure team plays a pivotal role by creating a robust caching layer that supports numerous critical applications at OpenAI. Our goal is to deliver a high-availability, multi-tenant caching platform capable of auto-scaling with workload demands, reducing tail latency, and accommodating a wide array of use cases.

We seek an experienced engineer who can design and scale this essential infrastructure. The ideal candidate will possess extensive experience with distributed caching systems (e.g., Redis, Memcached), a solid understanding of networking fundamentals, and expertise in Kubernetes-based service orchestration.

Jul 18, 2025
OpenAI
Full-time|On-site|San Francisco

Team and Platform Focus
The Compute Infrastructure team at OpenAI designs, builds, and maintains the systems that support AI research at scale. This work brings together accelerators, CPUs, networking, storage, data centers, orchestration software, agent infrastructure, developer tools, and observability. The aim is to create a reliable, unified experience for researchers and product teams across the company. Projects span the full stack: capacity planning, cluster lifecycle management, bare-metal automation, and distributed systems. The team manages Kubernetes scheduling, system optimization, high-performance networking, storage, fleet health, reliability, workload profiling, benchmarking, and improvements to the developer experience. Even small improvements in communication, scheduling, hardware efficiency, or debugging can significantly accelerate research. OpenAI matches engineers to areas within Compute Infrastructure that align with their skills and interests.

Role Overview
This Software Engineer role centers on building and evolving the compute platform that supports OpenAI's research and products. Candidates may bring expertise in low-level systems, high-performance computing, distributed infrastructure, reliability, CaaS, agent infrastructure, developer platforms, tooling, or infrastructure user experience. The most important qualities are strong analytical skills, the ability to write resilient code, and a collaborative approach that helps colleagues move faster and with more confidence.

What You Will Work On
• Working close to hardware or at the user interaction layer
• Developing CaaS and agent infrastructure
• Managing control and data planes that connect the system
• Bringing new supercomputing capabilities online
• Optimizing training workloads through profiler traces and benchmarks
• Improving NCCL and collective communication
• Analyzing GPUs, NICs, topology, firmware, thermal dynamics, and failure modes
• Designing abstractions to unify diverse clusters into a single platform

Areas of Expertise
No one is expected to cover every area listed. Some engineers focus on system performance, kernel or runtime behavior, large-scale networking protocols, RDMA, NCCL, GPU hardware, benchmarking, scheduling, or hardware reliability. Others improve the platform's usability through APIs, tools, workflows, and developer experience. The team values strong engineering judgment and a drive to advance the field.

Apr 27, 2026
