Staff Technical Program Manager Infrastructure As A Service jobs in San Francisco – Browse 5,409 openings on RoboApply Jobs


Open roles matching “Staff Technical Program Manager Infrastructure As A Service” in San Francisco. 5,409 active listings on RoboApply Jobs.

Showing 1-20 of 5,409 jobs
Crusoe
Full-time|On-site|San Francisco, CA - US

Crusoe is looking for a Staff Technical Program Manager focused on Infrastructure as a Service (IaaS) in San Francisco, CA. This position takes a central role in guiding cross-functional teams as they build and launch technical solutions for the company.

Role overview
The Staff Technical Program Manager leads efforts across departments, making sure projects …

Apr 28, 2026
Crusoe
Full-time|Hybrid|San Francisco, CA - US

Join Crusoe as a Senior Staff Technical Program Manager and play a pivotal role in driving our technical initiatives. In this influential position, you will lead cross-functional teams, oversee project timelines, and ensure the successful delivery of innovative solutions. Your expertise will guide our technical strategy, enabling us to achieve our mission of revolutionizing the energy landscape through cutting-edge technology.

Mar 31, 2026
OpenAI
Full-time|Hybrid|San Francisco

About Our Team
The Compute Infrastructure Team at OpenAI manages a robust fleet of GPUs and extensive compute clusters that support the models powering ChatGPT and our API. This team also accommodates the training demands for our upcoming models. We specialize in operating a state-of-the-art GPU fleet, offering a cohesive platform for various OpenAI teams to effortlessly execute production-level Applied AI and research training tasks. Our mission is to harness the potential of AI responsibly, ensuring its benefits are shared while prioritizing safety over unrestrained growth.

Role Overview
As a Technical Program Manager on our engineer-centric TPM team, you will take charge of the comprehensive delivery of large-scale GPU clusters, collaborating closely with engineers to initiate clusters across external providers and partners. You will manage a diverse portfolio that encompasses hardware, networking, power, and cooling, steering execution, managing risk, and establishing clear alignment from operational teams to leadership, all aimed at delivering scalable, production-ready capacity. This position is located in San Francisco, CA, operating under a hybrid work model requiring three days in the office weekly. We also provide relocation assistance for new hires.

Key Responsibilities
- Oversee the complete delivery of new Compute SKUs and large-scale GPU clusters within an external partner network while aiding capacity planning for both training and inference workloads.
- Drive multi-threaded program initiatives involving hardware, networking, power, and cooling, taking ownership of plans, interdependencies, and critical paths.
- Collaborate with chip providers to mitigate risks associated with long-term onboarding to new hardware platforms, engaging with teams across kernels, communications, hardware, and scheduling.
- Develop and implement program mechanisms such as roadmaps, milestones, risk registers, and runbooks to ensure predictable delivery at scale.
- Work alongside engineering teams to enhance cluster turn-up reliability, repeatability, and automation, decreasing the time-to-serve for new capacity.
- Facilitate cross-functional readiness involving security, finance, operations, and product/research stakeholders to ensure the launch of production-ready compute capabilities.
- Manage integrations and transitions among teams and partners to guarantee seamless execution, transparent communication, and prompt issue resolution.
- Identify operational bottlenecks and systemic deficiencies, driving sustainable improvements across tooling, processes, and partner interactions.

Mar 12, 2026
Pinterest, Inc.
Full-time|$145.7K/yr - $300.1K/yr|Remote|San Francisco, CA, US; Remote, US

Pinterest’s Infrastructure Governance team guides the company’s infrastructure strategy to match product growth. This group manages cost, capacity, and long-term sustainability, collaborating with engineering, product teams, and leaders responsible for key investments. The team develops governance frameworks, increases transparency around infrastructure and AI spending, and redesigns complex operational workflows using AI-driven methods.

Role overview
The Senior Technical Program Manager, Infrastructure FinOps, is an individual contributor role for someone experienced in both infrastructure governance and financial operations (FinOps). This position focuses on building scalable processes, improving visibility into infrastructure and AI investments, and managing workflows that support Pinterest’s expansion.

What you will do
- Create and refine mechanisms for infrastructure cost management and operational efficiency
- Drive transparency and reporting around infrastructure and AI investments
- Oversee and improve workflows that help Pinterest scale sustainably
- Collaborate with cross-functional partners in engineering, product, and business teams

What we look for
- Strong background in infrastructure governance and FinOps
- Ability to manage ambiguity and influence partners across functions
- Technical judgment and experience creating scalable solutions

Location
This role can be based in San Francisco, CA or remote within the United States.

Apr 22, 2026
Vapi
Full-time|On-site|San Francisco

About Vapi
At Vapi, we are revolutionizing communication by making voice the primary interface for human interaction. Our platform offers unparalleled configurability for deploying voice agents. In just two years, we have attracted over 600,000 developers, with more than 2,000 joining daily. Experience Vapi now!

Why We Need You
We handle millions of calls daily, with thousands occurring concurrently. Every call generates a new audio packet every 20 milliseconds, requiring responses in under 1 second. We are scaling this operation to manage hundreds of millions of calls. This challenge is exciting and incredibly rewarding.

Your Responsibilities
- 30 Days: Get acquainted with our multi-cluster, multi-cloud infrastructure.
- 60 Days: Launch a new service such as Anycast Global Router.
- 90 Days: Take ownership of a domain, such as GPU inference clusters.

Your Profile
- You have experience from Series B to F funding stages.
- You have successfully scaled large, resilient, and high-performance systems.
- Bonus points if you've founded your own startup!

Why Choose Vapi
- Generational Impact: Create the human interface for every business.
- Ownership Culture: 70% of our team are previous founders.
- Supportive Team: Our founders, Jordan and Nikhil, bring that friendly Canadian spirit.
- Top Investors: Backed by Y Combinator, KP Seed, and Bessemer Series A.

What We Provide
- Equity Ownership: Competitive salary with excellent equity options.
- Health Coverage: Comprehensive medical, dental, and vision plans.
- Team Bonding: We enjoy spending time together, including quarterly off-site events.
- Flexible Time Off: Take the time you need to recharge.

Jul 29, 2025
Waymo LLC
Full-time|$226K/yr - $278K/yr|Hybrid|Mountain View, CA, US; San Francisco, CA, US

Waymo is at the forefront of autonomous driving technology, dedicated to becoming the world's most trusted driver. Originating from the Google Self-Driving Car Project in 2009, our focus has been on developing the Waymo Driver—The World’s Most Experienced Driver™—aimed at enhancing mobility access and significantly reducing traffic-related fatalities. The Waymo Driver is the backbone of our fully autonomous ride-hailing service and can be adapted for various vehicle platforms and applications. With over ten million rider-only trips and more than 100 million miles driven autonomously on public roads, alongside extensive simulation across more than 15 U.S. states, we are poised for the future.

Waymo’s Technical Program Managers are pivotal in executing our roadmap through strategic cross-functional planning, clarity, and proactive risk management. We thrive in navigating complex technical and operational challenges without predefined guidelines, acting with thoughtful urgency to drive meaningful conversations and outcomes. Our team collaborates closely with all facets of Waymo to structure, oversee, and propel the work necessary for real-world deployments of the Waymo Driver across diverse platforms and regions. This position follows a hybrid work model and reports directly to the Director of Technical Program Management.

Your Responsibilities
- Lead the management of international expansion programs, establishing strong and scalable frameworks to introduce Waymo in new countries.
- Advance the new market framework to support future growth into additional regions.
- Own and drive critical cross-functional activities necessary for market entry and launch in new countries, focusing on vehicle, software, regulatory, compliance, policy, external engagement, and partner readiness.
- Coordinate deliverables across both internal and external teams, optimizing workflows and processes as necessary.
- Proactively identify and mitigate business-critical risks and dependencies.
- Keep stakeholders informed and aligned, representing programs to senior leadership and escalating issues as needed to eliminate obstacles.

Your Qualifications
- Over 10 years of experience managing complex cross-functional programs.
- Proven track record in driving international expansion initiatives with a comprehensive understanding of market-entry functions (e.g., Product, Engineering, Legal, External Engagement).
- Exceptional leadership capabilities, with strong communication skills and the ability to inspire and align teams.

Apr 10, 2026
magic.dev
Full-time|On-site|San Francisco

At Magic, our mission is to create safe AGI that propels humanity forward in addressing the world’s most critical challenges. We believe that the key to achieving safe AGI lies in automating research and code generation to enhance models and resolve alignment issues more effectively than humans alone. Our approach integrates frontier-scale pre-training, domain-specific reinforcement learning, ultra-long context, and inference-time computation to realize this vision.

Role Overview
As a member of our Supercomputing Platform & Infrastructure team, you will be instrumental in designing, constructing, and managing the extensive GPU infrastructure that underpins Magic’s model training and inference processes. A key aspect of your role will involve leveraging Terraform-driven infrastructure-as-code methodologies to build and maintain our infrastructure, ensuring reproducibility, reliability, and operational clarity across clusters comprising thousands of GPUs.

Magic’s long-context models place continuous demands on compute, networking, and storage systems. The infrastructure must support long-running distributed jobs, high-throughput data movement, and stringent availability requirements, necessitating designs that are automated, observable, and resilient. You will take ownership of the systems and IaC foundations that enable these capabilities. This position has the potential to expand into broader responsibilities encompassing supercomputing platform architecture, influencing how Magic scales GPU clusters and improves infrastructure reliability as model workloads grow.

Key Responsibilities
- Design and manage large-scale GPU clusters for model training and inference.
- Build and maintain infrastructure using Terraform across both cloud and hybrid environments.
- Develop modular, scalable IaC frameworks for provisioning compute, networking, and storage resources.
- Improve deployment reproducibility, maintain environment consistency, and ensure operational safety.
- Optimize networking and storage architectures for high-throughput AI workloads.
- Automate fault detection and recovery mechanisms across distributed clusters.
- Diagnose complex cross-layer issues involving hardware, drivers, networking, storage, operating systems, and cloud environments.
- Improve observability, monitoring, and reliability of essential platform systems.

Qualifications
- Strong foundation in systems engineering principles.
- Extensive hands-on experience with Terraform, including module design, state management, environment isolation, and large-scale implementations.

Jan 25, 2024
Reflection AI
Full-time|On-site|San Francisco

Our Mission
At Reflection AI, our goal is to develop open superintelligence and make it universally accessible. We are pioneering open weight models tailored for individuals, agents, enterprises, and even entire nations. Our team comprises talented AI researchers and industry veterans from organizations such as DeepMind, OpenAI, Google Brain, Meta, Character.AI, and Anthropic.

Role Overview
- Construct and enhance distributed training systems that drive the pre-training of cutting-edge models.
- Collaborate with research teams to design and execute extensive training runs for foundational models.
- Create infrastructure that facilitates efficient training across thousands of GPUs using contemporary distributed training frameworks.
- Improve training throughput, stability, and efficiency for large-scale model training tasks.
- Work closely with pre-training researchers to convert experimental concepts into scalable, production-ready training systems.
- Boost performance of distributed training jobs by optimizing communication, memory management, and GPU utilization.
- Develop and maintain training pipelines that accommodate large-scale datasets, checkpointing, and iterative experiments.
- Identify and resolve performance bottlenecks within distributed training systems, including model parallelism, GPU communication, and training runtime environments.
- Contribute to systems that promote swift experimentation and iteration on novel training methods.

Mar 24, 2026
Reflection AI
Full-time|On-site|San Francisco

About the Role
Reflection AI is hiring a Member of Technical Staff focused on Infrastructure Security in San Francisco. This position plays a key part in protecting the company’s infrastructure from security threats.

What You Will Do
- Work with teams across the company to design, implement, and monitor security protocols and systems
- Help safeguard digital assets by maintaining the integrity and security of infrastructure

Apr 16, 2026
Parallel
Full-time|On-site|San Francisco or Palo Alto

About Us
At Parallel, we are a pioneering web infrastructure company dedicated to empowering businesses across various sectors, including sales, marketing, insurance, and software development. Our innovative products enable organizations to create cutting-edge AI agents with robust and flexible programmatic access to the web. Having raised $130 million from investors such as Kleiner Perkins, Index Ventures, and Spark Capital, our mission is to reshape the web for AI applications. We are assembling a talented team of engineers, designers, marketers, and operational experts to help us achieve this vision.

Job Overview
As a member of our technical staff, you will play a crucial role in building, operating, and scaling our infrastructure, particularly around large language models. Your responsibilities will include ensuring system reliability and cost-efficiency as we expand, anticipating potential bottlenecks, evolving our architecture to meet growing demands, and developing the tools that enhance engineering productivity.

About You
You possess a deep understanding of distributed systems, cloud platforms, performance optimization, and scalable architecture. You are adept at balancing trade-offs between cost, reliability, and speed, and you are passionate about enabling teams to innovate rapidly and confidently while supporting products that serve millions of users seamlessly.

Aug 14, 2025
OpenAI
Full-time|On-site|San Francisco

Role overview
OpenAI seeks a Technical Program Manager to focus on Token-as-a-Service initiatives in San Francisco. This position guides technical projects that shape the future of AI technology. Managing complex programs and fostering coordination among multiple teams are central to this role.

What you will do
- Oversee technical programs that support Token-as-a-Service efforts
- Collaborate with engineering, product, and other departments to keep projects on track
- Contribute to product strategies that align with OpenAI’s commitment to responsible AI development

Apr 21, 2026
Reflection AI
Full-time|On-site|San Francisco

Our Vision
At Reflection AI, we are on a mission to develop open superintelligence and democratize access to it for everyone. Our team, hailing from organizations like DeepMind, OpenAI, Google Brain, Meta, Character.AI, and Anthropic, is dedicated to creating open weight models that serve individuals, enterprises, and even nations.

Role Overview
- Design, build, and manage state-of-the-art GPU infrastructure for high-throughput model inference and mid-training processes.
- Develop systems that facilitate synthetic data generation and reinforcement learning pipelines at scale.
- Create high-performance inference platforms capable of serving and evaluating models across thousands of GPUs.
- Optimize throughput, latency, and GPU utilization for large language model inference and deployment tasks.
- Build infrastructure that supports reinforcement learning pipelines, including large-scale rollout generation, evaluation, and policy improvement loops.
- Collaborate closely with research teams to support distributed reinforcement learning workloads and large-scale model evaluation infrastructure.
- Improve model execution performance through kernel-level optimization, model parallelism strategies, and GPU runtime improvements.
- Develop distributed systems that enable large-scale synthetic data generation and reinforcement learning-driven training workflows.
- Identify and address performance bottlenecks across inference runtimes, GPU kernels, networking, and distributed computing systems.

Mar 24, 2026
OpenAI
Full-time|On-site|San Francisco

About the Team
The Stargate team at OpenAI is dedicated to constructing the physical infrastructure that drives our most advanced AI systems. We design, deploy, and manage cutting-edge data center infrastructure, expanding rapidly to meet the growing needs of AI technology. Our efforts bring together hardware, networking, facilities, supply chain, and deployment execution, ensuring seamless integration and functionality. Our mission is to convert compute requirements into reliable, scalable, and deployable systems that can manage the complexities of frontier AI workloads.

About the Role
We are looking for a passionate Hardware Operations Technical Program Manager to lead the execution of AI infrastructure hardware programs throughout their lifecycle. In this pivotal role, you will take ownership of cross-functional program execution, which includes hardware readiness, supplier coordination, deployment planning, rack-level integration, manufacturing operations, logistics, field deployment, and operational handoff. You will collaborate closely with teams in hardware engineering, data center engineering, networking, supply chain, manufacturing, deployment, and operations to ensure that critical infrastructure programs transition smoothly from design to production readiness. This position is suited to someone who can navigate both technical and programmatic aspects: understanding hardware systems, identifying operational hurdles, fostering accountability across teams, and establishing scalable processes for high-volume infrastructure deployment.

Key Responsibilities
- Lead end-to-end Hardware Operations readiness initiatives for AI infrastructure systems, encompassing servers, racks, networking hardware, power and cooling interfaces, and related data center infrastructure.
- Create and implement scalable hardware operations processes, workflows, and support models covering deployment, repair operations, diagnostics, break/fix, escalation management, and ongoing operations.
- Oversee cross-functional execution of Hardware Operations readiness initiatives, ensuring that operational capabilities, tooling, documentation, staffing models, and workflows are established ahead of production deployment and operational handoff.
- Collaborate with Hardware Engineering, Manufacturing, Supply Chain, Data Center Operations, Network Operations, Deployment, Reliability Engineering, and external suppliers to ensure alignment on operational requirements, supportability, and readiness milestones.
- Develop operational scorecards, reporting frameworks, and metrics to track progress and success.

Apr 30, 2026
Databricks
Full-time|On-site|Mountain View, California; San Francisco, California

Join Databricks as a Senior Staff Technical Program Manager specializing in Reliability, where you will play a pivotal role in enhancing our product's reliability and performance. In this position, you will lead cross-functional teams, ensuring successful project delivery and managing technical challenges. You will collaborate closely with engineering, product management, and operations teams to define project scopes, timelines, and deliverables, while continuously improving processes and workflows. Your expertise will drive our commitment to delivering high-quality products that meet customer needs.

Mar 2, 2026
Airbnb, Inc.
Full-time|On-site|United States

Join Airbnb as a Senior Technical Program Manager specializing in our Hosting Services. In this pivotal role, you will lead cross-functional initiatives, driving strategic projects from conception to execution. You will collaborate with engineering teams, product managers, and stakeholders to ensure alignment and successful delivery of high-impact programs. We seek a results-oriented leader who excels in problem-solving and is passionate about improving user experiences. Your ability to manage complex projects and foster a collaborative environment will be crucial in shaping the future of our hosting solutions.

Apr 1, 2026
Discord Inc.
Full-time|$248K/yr - $279K/yr|On-site|San Francisco Bay Area

At Discord, we welcome over 200 million users each month, primarily drawn by our vibrant gaming community. With over 90% of our users engaging in various games, they collectively spend an astounding 1.5 billion hours on our platform monthly. Discord is poised to play a pivotal role in the future of gaming, and we are dedicated to enhancing the experience of our users before, during, and after their gaming sessions. As a Staff Technical Program Manager within our Engineering organization, you will leverage your technical expertise to oversee company-wide initiatives, facilitate cross-team collaborations, and expedite the execution of our roadmap. You will play a crucial role in the Consumer Revenue and Revenue Infra teams, which are central to delivering premium experiences through products like Nitro, Shop, and Orbs. Your contributions will be essential in providing valuable premium services to our users while ensuring compliance and maintaining the quality of our free offerings. This position reports to the Senior Manager of TPM, and you will be responsible for steering revenue projects and compliance initiatives. If you are an innovative thinker with a knack for building strong partnerships, we invite you to apply!

Mar 10, 2026
Decagon
Full-time|On-site|San Francisco

Role overview
Decagon seeks a Technical Program Manager based in San Francisco to coordinate work across multiple teams and deliver new technical solutions. This position guides projects from initial planning through completion, ensuring schedules stay on track and technical milestones are achieved.

What you will do
- Lead cross-functional teams to meet project goals
- Track project progress and adjust plans when necessary
- Share updates and collect feedback from stakeholders
- Clarify and verify technical requirements throughout each project
- Help advance Decagon’s strategic programs through strong program management

Apr 24, 2026
Databricks
Full-time|On-site|Mountain View, California; San Francisco, California

Role overview
The Staff Technical Program Manager for the Unity Catalog team at Databricks will guide and coordinate complex technical projects. This position is based in Mountain View or San Francisco, California. The focus is on supporting the Unity Catalog team’s mission through effective project leadership and organization.

What you will do
- Lead cross-functional teams to deliver technical projects for Unity Catalog
- Coordinate project timelines and keep efforts aligned with Databricks’ strategic goals
- Encourage strong communication and collaboration among engineering, product, and other partner groups

Impact
Strong leadership in this role will help drive innovation and uphold high standards throughout the Unity Catalog organization.

Apr 23, 2026
Chroma
Full-time|On-site|San Francisco, CA

At Chroma, we are at the forefront of AI data infrastructure, providing top-tier retrieval solutions that empower developers worldwide. Join us as we navigate the nascent stages of AI technology, and become part of a team that values curiosity and dedication to mastering your craft. There is significant work ahead, and we invite you to contribute to our mission.

Sep 9, 2024
Postman, Inc.
Full-time|$256K/yr - $276K/yr|On-site|San Francisco, California, United States

Who Are We?
Postman is the leading API platform worldwide, empowering over 45 million developers and 500,000 organizations, including 98% of the Fortune 500. We're committed to fostering an API-first world by simplifying the API lifecycle and enhancing collaboration, enabling users to create superior APIs with increased speed. Headquartered in San Francisco, we have expanded our presence with offices in Boston, New York, Austin, Tokyo, London, and Bangalore, the birthplace of Postman. As a privately held company, we are backed by esteemed investors such as Battery Ventures, BOND, Coatue, CRV, Insight Partners, and Nexus Venture Partners. Discover more about us at postman.com or connect with us on X via @getpostman. P.S.: We highly encourage you to explore The "API-First World" graphic novel for insights into our vision and the larger narrative.

The Opportunity
As a Member of Technical Staff focusing on AI Infrastructure, you'll be instrumental in developing and maintaining the core systems and distributed infrastructure crucial for AI model post-training, inference, and data pipelines. Your role will involve close collaboration with engineering and research teams to ensure the performance, scalability, and reliability of our essential AI systems.

What You’ll Do
- Design and implement large-scale, distributed AI infrastructure and services.
- Enhance performance for GPU/xPU accelerators and cloud environments.
- Develop tools for observability, reliability, and scalability of AI workloads.
- Collaborate with cross-functional teams to define AI infrastructure requirements and roadmap.
- Contribute to architectural design and ensure system longevity.

About You
- Experience with GenAI infrastructure systems, distributed systems, cloud computing, and high-performance infrastructure.
- Proficient in programming languages such as Python, Go, or equivalent.
- Understanding of scaling challenges specific to AI workloads and accelerators.

Mar 19, 2026
