Experience Level: Entry Level
About the job
OpenAI is seeking a Performance Modeling Engineer based in San Francisco. This role centers on building and improving models that enhance the performance and efficiency of AI systems. The work directly supports the technical backbone of OpenAI’s products.
Key responsibilities
Develop and refine models aimed at optimizing the performance of AI systems.
Collaborate with engineers and data scientists to tackle technical challenges as they arise.
Contribute to projects that improve the efficiency of large-scale AI infrastructure.
Role overview
This position offers the chance to work on foundational technology that underpins OpenAI’s products. The focus is on practical improvements and close teamwork with technical colleagues to advance the capabilities and efficiency of AI at scale.
Full-time|Remote|San Francisco, CA, US; Remote, US
Join Pinterest as a Principal Engineer for our Compute Platform, where you'll play a crucial role in driving the architecture and implementation of scalable systems that power our services. You will lead a team of talented engineers, guiding them in building innovative solutions that enhance our platform's performance and reliability.
In this position, you will have the opportunity to collaborate with cross-functional teams, mentor junior engineers, and contribute to the development of best practices and high-quality code. If you are passionate about technology and eager to make an impact, we would love to hear from you!
Join Crusoe as a Senior Engineering Manager in Compute, where you will play a pivotal role in leading cutting-edge engineering teams. You will be responsible for overseeing the development and execution of our innovative computing solutions, ensuring performance and reliability across various platforms.
Your leadership will guide teams toward achieving engineering excellence, fostering a collaborative environment, and driving strategic initiatives. This is an opportunity to make a significant impact within a rapidly growing company at the forefront of technology.
Role Overview OpenAI is hiring a ChatGPT Performance Engineer in San Francisco. This role focuses on improving the performance and efficiency of ChatGPT’s advanced AI models. The position works closely with cross-functional teams to identify and implement solutions that make ChatGPT faster and more reliable for users around the world. What You Will Do Optimize the speed, reliability, and scalability of ChatGPT’s platforms. Collaborate with engineers and other teams to solve technical challenges. Develop and refine systems to support a seamless user experience globally. Impact This work directly shapes the future of AI at OpenAI, helping deliver a dependable and efficient ChatGPT experience to millions of users.
Full-time|$225K/yr - $315K/yr|Remote|San Francisco
About Us
At Lavendo, we are pioneering an infrastructure that most engineers only dream of. We operate an AI-centric cloud platform that integrates expansive GPU clusters, high-speed networking, and cloud-native tools, catering to enterprises, innovative startups, and leading research teams. Our mission is straightforward: empower our clients to efficiently train and execute complex AI and simulation workloads without the need to construct their own supercomputers.
As a publicly traded company, we are rapidly expanding, with R&D centers across North America, Europe, and the Middle East. Our culture emphasizes engineering excellence: minimal bureaucracy, significant ownership, and a focus on tackling challenging infrastructure problems while witnessing the impact of our work on real customer operations.
Your Role as HPC Specialist Solutions Architect
In this pivotal role, you will be the go-to expert for customers looking to establish or enhance advanced GPU and HPC environments in the cloud. This includes multi-rack clusters, high-speed interconnects, intricate scheduling, and strict SLAs regarding throughput and latency.
You will design and optimize cutting-edge platforms for AI training, extensive simulations, and data-intensive workloads. You will work closely with NVIDIA's latest hardware (Hopper, Blackwell, and future generations), NVLink/NVSwitch topologies, and InfiniBand/RoCE fabrics, having a substantial influence on the evolution of our platform and reference architectures. If you thrive on translating workloads into optimized clusters and maximizing performance, this is the ideal position for you.
Your Responsibilities
Cluster Design: Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm. Your considerations will include node types, GPU topology, queues, partitions, and failure scenarios.
Infrastructure Optimization: Integrate NVIDIA Hopper and Blackwell-class GPUs with NVLink/NVSwitch and InfiniBand/RoCE, ensuring the hardware layout aligns with the communication patterns of the workloads.
Automation: Deploy and manage GPU and Network Operators to standardize drivers, CUDA, firmware, and high-speed networking across extensive fleets, rather than managing on a box-by-box basis.
Supercomputer Cloud Functionality: Design and validate cloud-native HPC environments that emulate supercomputer capabilities.
Full-time|$83K/yr - $104K/yr|On-site|San Francisco
DigitalOcean is looking for a Hardware Sustaining Engineer in San Francisco to help improve and maintain the hardware that powers our global cloud platform. This role supports the Infra::Machines::Design team, focusing on the server fleet and data center infrastructure. What you will do Optimize and troubleshoot large-scale data center hardware to keep systems running smoothly for our customers. Enhance and support the hardware and firmware that form the backbone of DigitalOcean’s server infrastructure. Work with a collaborative team to address new challenges as we grow our data center presence and expand cloud capabilities. Explore and evaluate advanced technologies to strengthen our hardware support. Who we’re looking for Experience with data center hardware maintenance and troubleshooting. Interest in scaling infrastructure and supporting cloud services. Comfort working in a collaborative, growth-focused environment. Enthusiasm for learning and applying new technologies. About DigitalOcean DigitalOcean builds simple, scalable cloud solutions for developers and businesses. Our teams value learning, teamwork, and making a real impact for customers worldwide.
We are seeking a talented Performance Engineer to join our dynamic team at usm2. This is an exciting opportunity for local professionals who are passionate about optimizing system performance and enhancing user experience. As a Performance Engineer, you will play a crucial role in analyzing performance metrics, identifying bottlenecks, and implementing solutions to ensure our applications run smoothly and efficiently.
Full-time|Hybrid|Hybrid - San Francisco, California
About the Role Oura Health Inc. is looking for an Engineering Program Manager focused on Hardware Sensing. This hybrid position is based in San Francisco, California. The role centers on guiding cross-functional teams as they develop and deliver new hardware sensing technologies. Success depends on clear coordination, strong project oversight, and attention to both timelines and quality. What You Will Do Manage hardware sensing projects from planning through execution. Work closely with engineering teams to clarify technical requirements and ensure deliverables meet expectations. Keep stakeholders informed and aligned by facilitating regular communication and updates. Spot risks early and put plans in place to reduce or avoid them. Location This is a hybrid role based in San Francisco, California.
At Rylo, we are revolutionizing the way you capture and share your experiences. Our state-of-the-art camera is designed to record your surroundings with breathtaking clarity and stability, eliminating the hassle of traditional video capture. Created by a team of visionary engineers from Instagram and Apple, our innovative stabilization software and user-friendly smartphone app ensure that every shot you take is a masterpiece. With Rylo, you can focus on enjoying the moment while we handle the technicalities of creating stunning videos.
As a Software Engineer specializing in Computational Photography, you will play a crucial role in enhancing the core algorithms that power the Rylo camera and future products. Your work will fundamentally enhance the photography and cinematography experience, focusing on improving image quality and developing groundbreaking computational photography features. You will engage in the complete lifecycle of algorithm development, from design and implementation to quality evaluation and performance optimization, culminating in successful deployment.
Your collaboration with software engineers, hardware engineers, and designers will allow you to push the boundaries of consumer camera technology.
Lumafield develops X-Ray CT scanners designed to make advanced imaging more accessible and affordable. The company’s cloud software provides engineers with detailed visualization tools, helping them analyze complex products and make informed decisions. Role overview This full-time, on-site Hardware Systems Engineer position is based in San Francisco. The role centers on leading hardware development for industrial CT scanners. Collaboration with researchers and designers is a key part of the job, with a focus on improving product development for a range of industries. What you will do Lead hardware systems development for industrial CT scanners Design and manage electrical architecture Develop firmware and oversee system integration Work hands-on to transform concepts into working products Collaborate with cross-functional teams to address customer needs Team and collaboration The engineering team includes experienced researchers and designers who value curiosity and rigor. The group is impact-driven and backed by leading venture capital firms. Location This role requires working on-site at Lumafield’s San Francisco office.
About Our Team
At OpenAI, our Hardware organization is pioneering the development of cutting-edge silicon and system-level solutions tailored to meet the distinctive needs of advanced AI workloads. We are dedicated to building the next generation of AI silicon, collaborating closely with software engineers and research partners to co-design hardware that integrates seamlessly with our AI models. Our mission includes not only delivering high-quality, production-grade silicon for OpenAI's supercomputing infrastructure but also creating custom design tools and methodologies that foster innovation and enable hardware optimized specifically for AI applications.
About the Role
We are looking for a talented Research Hardware Co-Design Engineer to operate at the intersection of model research and silicon/system architecture. In this role, you will play a critical part in shaping the numerics, architecture, and technology strategies for the future of OpenAI's silicon, in collaboration with both Research and Hardware teams.
Your responsibilities will include diagnosing discrepancies between theoretical performance and real-world measurements, writing quantization kernels, assessing numerics risks through model evaluations, quantifying system architecture trade-offs, and implementing innovative numeric RTL. This is a hands-on position for individuals who are passionate about tackling challenging problems, seeking practical solutions, and driving them to production. Strong prioritization and transparent communication skills are vital for success in this role.
Location: San Francisco, CA (Hybrid: 3 days/week onsite). Relocation assistance available.
Key Responsibilities:
Enhance our roofline simulator to track evolving workloads and deliver analyses that quantify the impact of architectural decisions, supporting technology exploration.
Identify and resolve discrepancies between performance simulations and actual measurements; effectively communicate root causes, bottlenecks, and incorrect assumptions.
Develop emulation kernels for low-precision numerics and lossy compression techniques, equipping Research with the insights needed to balance efficiency with model quality.
Prototype numeric modules by advancing RTL through synthesis; either hand off innovative numeric solutions cleanly or occasionally take ownership of an RTL module from start to finish.
Proactively engage with new ML workloads, prototype them using rooflines and/or functional simulations, and initiate evaluations of new opportunities or risks.
Gain a holistic understanding of the transition from ML science to hardware optimization, breaking this comprehensive objective down into actionable short-term deliverables.
Foster collaborative relationships across diverse teams with varying goals and expertise, ensuring that progress remains unimpeded.
Clearly articulate design trade-offs with explicit assumptions and rationale.
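For context on the roofline simulator this listing references: a roofline model bounds a kernel's attainable throughput by the lesser of the compute roof and what memory bandwidth can deliver at the kernel's arithmetic intensity. A minimal sketch in Python; the peak numbers used below are illustrative placeholders, not any specific chip:

```python
def attainable_tflops(arithmetic_intensity, peak_tflops, peak_bw_tb_s):
    """Roofline bound: min(compute roof, bandwidth roof).
    arithmetic_intensity is FLOPs per byte moved; TB/s * FLOP/B = TFLOP/s."""
    return min(peak_tflops, arithmetic_intensity * peak_bw_tb_s)

# A kernel at 0.5 FLOP/B on a hypothetical 100 TFLOP/s, 3 TB/s device is
# bandwidth-bound at 0.5 * 3 = 1.5 TFLOP/s; at 200 FLOP/B it is compute-bound.
print(attainable_tflops(0.5, 100.0, 3.0))    # 1.5
print(attainable_tflops(200.0, 100.0, 3.0))  # 100.0
```

Plotting this bound against measured kernel throughput is what surfaces the simulation-versus-measurement discrepancies the responsibilities describe.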
At Orbital Industries, we are pioneering an AI-driven industrial revolution by crafting innovative hardware from the atomic level. Our mission is to spearhead advancements in critical technologies that will safeguard our planet for future generations.
Our initial focus involves developing essential hardware for AI data centers, enhancing both their performance and sustainability. Every product we create is designed in conjunction with our AI platform, merging AI-driven hardware engineering with AI-optimized material science to achieve groundbreaking real-world performance.
With an ambitious vision ahead, we are on the lookout for exceptional individuals across diverse teams including AI research, operations, advanced materials, mechanical engineering, chemical engineering, and manufacturing.
Joining Orbital means being part of closely-knit, vertically integrated teams. We seek candidates who are passionate about physical technology, curious about AI, and eager to expand their knowledge.
Role Overview
We are in search of a Talent Partner who will oversee hardware and engineering recruitment across our North American locations. This role is centered on sourcing exceptional technical talent while establishing scalable recruiting processes to support the ongoing growth of our technology division.
The position involves formulating talent strategies across Orbital's key domains and executing them effectively. A robust background in in-house recruitment is essential for this role.
Key Responsibilities
Technical Recruiting
Manage the complete recruiting cycle in complex hardware and engineering environments.
Identify the qualities that define an elite engineer and uphold Orbital's exceptional hiring standards.
Capture and engage the top 1% of technical talent across the US, Canada, and beyond.
Stakeholder Management
Forge strong relationships with senior stakeholders including C-suite executives, founders, and hiring managers.
Ensure a world-class experience for both hiring managers and candidates.
Talent Identification
Develop talent strategies aimed at graduates at all levels (BSc, MSc, PhD) and cultivate partnerships that promote long-term technical growth.
Foster engaged communities of technical talent that contribute to our innovative ecosystem.
Full-time|$130K/yr - $190K/yr|On-site|San Francisco
Job Category: AI & Robotics
About Avala AI
Avala AI is a pioneering AI Data Infrastructure company at the forefront of real-world AI and its integration with the labor economy. We excel in delivering high-quality data labeling, comprehensive dataset management, and insightful data visualization, providing 4D labeling solutions tailored for autonomous vehicles, humanoid robots, and drone applications. Our mission is to empower AI-driven sectors, from AV companies to robotics innovators and drone enterprises, by equipping them with the essential data infrastructure to propel the next generation of intelligent systems while offering dignified digital employment opportunities globally.
The Role
As a 3D Computer Vision Engineer at Avala AI, you will be responsible for designing and implementing cutting-edge solutions for both offline and online 3D reconstruction and scene understanding, ensuring robustness, accuracy, and performance. You will collaborate on a world-class spatial computing platform deployed extensively in autonomous vehicles, advanced robotic systems, and drone technologies. Your contributions will advance the capabilities of real-world AI while utilizing the latest advancements in deep learning and 3D computer vision techniques.
What You'll Do
Spatial Computing & Reconstruction: Innovate through the application of NeRFs, Diffusion Models, Gaussian Splatting, Multiview Stereo, TSDF Fusion, Structure from Motion, and SLAM methodologies.
Mission-Critical Perception: Develop robust 3D perception systems and scene understanding frameworks that enhance safety and operational performance across various robotics and AV applications.
4D Data Labeling & Visualization: Work collaboratively with cross-functional teams to enhance and expand Avala's 4D labeling platform for automobiles, humanoid robots, and drones.
Software Engineering Best Practices: Apply strong coding, testing, and deployment methodologies to ensure rapid, safe, and efficient development of innovative solutions.
Boundary-Pushing Innovation: Actively explore new methodologies and technologies that advance the field of 3D vision, neural rendering, and large-scale data processing.
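Of the reconstruction methods the listing names, TSDF Fusion is the simplest to illustrate: each depth observation writes a truncated signed distance into a voxel grid as a running weighted average, and the surface is recovered at the zero crossing. A toy one-dimensional sketch; the grid spacing, truncation width, and unit weights are illustrative choices, not Avala's implementation:

```python
def fuse_observation(tsdf, weights, voxel_centers, surface_depth, trunc=0.1):
    """Fuse one depth measurement along a single ray into the voxel grid."""
    for i, c in enumerate(voxel_centers):
        sdf = surface_depth - c
        if sdf <= -trunc:              # voxel far behind the surface: skip it
            continue
        d = max(-trunc, min(trunc, sdf)) / trunc  # truncate, normalize to [-1, 1]
        tsdf[i] = (tsdf[i] * weights[i] + d) / (weights[i] + 1.0)
        weights[i] += 1.0
    return tsdf, weights

voxels = [i / 10 for i in range(11)]   # voxel centers along one camera ray
tsdf, w = [0.0] * 11, [0.0] * 11
for depth in (0.52, 0.48, 0.50):       # three noisy views of a surface at 0.5
    tsdf, w = fuse_observation(tsdf, w, voxels, depth)
# The fused TSDF changes sign near index 5, i.e. the true surface depth 0.5.
```

Production systems do the same update per voxel over millions of voxels and many camera poses, typically on the GPU.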
At Genmo, we are at the forefront of advancing artificial intelligence through innovative research in video generation. Our mission is to construct open, cutting-edge models that will ultimately contribute to the realization of Artificial General Intelligence (AGI). As part of our dynamic team, you will play a pivotal role in redefining the future of AI and expanding the horizons of video creation.
We are looking for a skilled GPU Performance Engineer who can extract maximum performance from our H100 infrastructure and fine-tune our model serving stack to achieve unparalleled efficiency. If you are passionate about optimizing performance, particularly at the microsecond level, and thrive on pushing hardware to its limits, this is the perfect opportunity for you.
Key Responsibilities
Utilize advanced profiling tools such as Nsight Systems and nvprof to analyze and enhance GPU workloads.
Develop high-performance CUDA and Triton kernels to optimize essential model functions.
Reduce cold start latency from seconds to mere milliseconds in our serving infrastructure.
Optimize memory access patterns, implement kernel fusion, and maximize GPU utilization.
Collaborate closely with machine learning engineers to optimize model implementations.
Diagnose and resolve performance issues throughout the application and hardware stack.
Implement custom memory pooling and allocation strategies to enhance performance.
Promote performance optimization techniques and foster a culture of excellence across teams.
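The custom memory pooling mentioned in these responsibilities usually means a free-list pool: pre-allocate fixed-size blocks once, then recycle them so steady-state allocation never touches the underlying allocator or driver. A hedged sketch of the idea in Python; the block size and capacity are made-up values, and a real CUDA pool would hand out device pointers rather than indices:

```python
class BlockPool:
    """Fixed-size-block pool with a free list; recycles blocks on release."""
    def __init__(self, block_size, n_blocks):
        self.block_size = block_size
        self.free = list(range(n_blocks))  # indices of free blocks (the free list)
        self.in_use = set()

    def alloc(self):
        if not self.free:
            raise MemoryError("pool exhausted")
        idx = self.free.pop()              # O(1): reuse the most recently freed block
        self.in_use.add(idx)
        return idx                         # stands in for a device pointer

    def release(self, idx):
        self.in_use.remove(idx)
        self.free.append(idx)              # recycle without touching the OS/driver

pool = BlockPool(block_size=1 << 20, n_blocks=4)
a, b = pool.alloc(), pool.alloc()
pool.release(a)
c = pool.alloc()   # reuses a's block: no new allocation happened
```

Popping the most recently freed block also tends to reuse warm memory, which is part of why pools help serving latency.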
Full-time|On-site|San Francisco, Seattle, New York, Toronto
Join Stripe as a Staff Software Engineer in our Stream Compute team, where you will play a pivotal role in building scalable solutions that power the financial infrastructure of the internet. As a member of our innovative engineering team, you will leverage your expertise to design and implement robust software solutions that enhance the performance and reliability of our streaming data capabilities.
Team and Platform Focus The Compute Infrastructure team at OpenAI designs, builds, and maintains the systems that support AI research at scale. This work brings together accelerators, CPUs, networking, storage, data centers, orchestration software, agent infrastructure, developer tools, and observability. The aim is to create a reliable, unified experience for researchers and product teams across the company. Projects span the full stack: capacity planning, cluster lifecycle management, bare-metal automation, and distributed systems. The team manages Kubernetes scheduling, system optimization, high-performance networking, storage, fleet health, reliability, workload profiling, benchmarking, and improvements to the developer experience. Even small improvements in communication, scheduling, hardware efficiency, or debugging can significantly accelerate research. OpenAI matches engineers to areas within Compute Infrastructure that align with their skills and interests. Role Overview This Software Engineer role centers on building and evolving the compute platform that supports OpenAI’s research and products. Candidates may bring expertise in low-level systems, high-performance computing, distributed infrastructure, reliability, CaaS, agent infrastructure, developer platforms, tooling, or infrastructure user experience. The most important qualities are strong analytical skills, the ability to write resilient code, and a collaborative approach that helps colleagues move faster and with more confidence. 
What You Will Work On
Working close to hardware or at the user interaction layer
Developing CaaS and agent infrastructure
Managing control and data planes that connect the system
Bringing new supercomputing capabilities online
Optimizing training workloads through profiler traces and benchmarks
Improving NCCL and collective communication
Analyzing GPUs, NICs, topology, firmware, thermal dynamics, and failure modes
Designing abstractions to unify diverse clusters into a single platform
Areas of Expertise
No one is expected to cover every area listed. Some engineers focus on system performance, kernel or runtime behavior, large-scale networking protocols, RDMA, NCCL, GPU hardware, benchmarking, scheduling, or hardware reliability. Others improve the platform's usability through APIs, tools, workflows, and developer experience. The team values strong engineering judgment and a drive to advance the field.
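As a flavor of the collective-communication work this listing mentions: NCCL-style all-reduce is commonly implemented as a ring, a reduce-scatter pass followed by an all-gather, so each rank exchanges only 2*(N-1) chunks with its neighbors regardless of cluster size. A toy synchronous simulation of the data movement, with one chunk per rank for simplicity (real NCCL pipelines these transfers over the actual fabric):

```python
def ring_allreduce(data):
    """data: one list per rank, each of length n (one chunk per rank).
    Returns buffers where every rank holds the element-wise sum of all inputs."""
    n = len(data)
    bufs = [list(d) for d in data]
    # Reduce-scatter: after n-1 rounds, rank r owns the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, bufs[r][(r - step) % n]) for r in range(n)]
        for r, c, val in sends:        # each rank sends one chunk to its ring neighbor
            bufs[(r + 1) % n][c] += val
    # All-gather: circulate each completed chunk once more around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, bufs[r][(r + 1 - step) % n]) for r in range(n)]
        for r, c, val in sends:
            bufs[(r + 1) % n][c] = val
    return bufs

print(ring_allreduce([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
# every rank ends with [6, 6, 6]
```

Profiling where real collectives deviate from this idealized schedule (stragglers, topology mismatches, NIC contention) is much of the optimization work described above.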
About Our Team
The Training Runtime team is at the forefront of developing a cutting-edge distributed machine learning training runtime, enabling everything from pioneering research to large-scale model deployments. Our mission is to empower researchers while facilitating growth into frontier-scale operations. We are crafting a cohesive, modular runtime that adapts to researchers' evolving needs as they progress along the scaling curve.
Our focus is anchored in three key areas: optimizing high-performance, asynchronous data movement that is aware of tensor and optimizer states; building robust, fault-tolerant training frameworks that incorporate comprehensive state management, resilient checkpointing, deterministic orchestration, and advanced observability; and managing distributed processes for enduring, job-specific, and user-defined workflows.
We aim to seamlessly integrate proven large-scale capabilities into a developer-friendly runtime, enabling teams to iterate rapidly and operate reliably across various scales. Our success is gauged by both training throughput (the speed of model training) and researcher throughput (the pace at which ideas transform into experiments and products).
About the Role
As a Training Performance Engineer, you will be instrumental in driving efficiency enhancements throughout our distributed training architecture. Your responsibilities will include analyzing extensive training runs, pinpointing utilization gaps, and engineering optimizations that maximize throughput and system uptime. This position merges a profound understanding of systems with practical performance engineering: analyzing GPU kernel performance and collective communication throughput, investigating I/O bottlenecks, and implementing model sharding techniques for large-scale training.
Your efforts will ensure our clusters operate at peak performance, enabling OpenAI to develop larger and more sophisticated models within existing compute budgets.
This position is located in San Francisco, CA, utilizing a hybrid work model with three days in the office each week, and we offer relocation assistance for new hires.
Key Responsibilities:
Analyze end-to-end training runs to detect performance bottlenecks across computation, communication, and storage.
Enhance GPU utilization and throughput for large-scale distributed model training.
Collaborate with runtime and systems engineers to boost kernel efficiency, scheduling, and collective communication performance.
Implement model graph transformations to enhance overall throughput.
Develop tools for monitoring and visualizing metrics such as MFU, throughput, and uptime across clusters.
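For context on MFU (model FLOPs utilization), one of the metrics named in the responsibilities: it is conventionally the model's useful FLOPs per second divided by the hardware's aggregate peak FLOP rate. A minimal sketch; the per-step FLOP count and peak rate below are hypothetical numbers for illustration, not any specific cluster:

```python
def mfu(model_flops_per_step, step_time_s, n_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization: achieved useful FLOP/s over aggregate peak FLOP/s."""
    achieved = model_flops_per_step / step_time_s
    return achieved / (n_gpus * peak_flops_per_gpu)

# e.g. 6e17 model FLOPs per step, 2 s per step, 1000 GPUs at 1e15 peak FLOP/s:
print(mfu(6e17, 2.0, 1000, 1e15))  # 0.3, i.e. 30% MFU
```

Because only model FLOPs count in the numerator, time lost to communication, I/O stalls, or recomputation shows up directly as lower MFU, which is what makes it a useful cluster-wide dashboard metric.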
Role overview The Hardware Program Manager at meter leads complex hardware initiatives from initial planning through final delivery. This position ensures each project phase aligns with company objectives and remains on schedule. The role is based in San Francisco. What you will do Work closely with teams across disciplines to guide hardware products from concept to launch Manage project plans, set timelines, and track deliverables Monitor project progress and make adjustments to keep deadlines on track Contribute to the development of hardware solutions that enhance meter’s product offerings Location This role requires working onsite in San Francisco.
About Our Team
At OpenAI, the Fleet team is integral to maintaining the robust computing environment that fuels our groundbreaking research and innovative product development. We manage extensive systems encompassing data centers, GPUs, networking, and more, ensuring optimal performance, availability, and efficiency. Our efforts empower OpenAI's models to function seamlessly at scale, supporting both internal R&D and external offerings like ChatGPT. We emphasize safety, reliability, and responsible AI deployment over unrestrained growth.
About the Role
As a Software Engineer on the Fleet Hardware team, you will play a crucial role in ensuring the reliability and uptime of OpenAI's compute fleet. Minimizing hardware failures is essential for research training progress and service stability, since even minor disruptions can lead to significant setbacks. With the increasing complexity of supercomputers, the pressure to maintain operational integrity has never been higher.
This is a unique opportunity to be at the forefront of technology, pioneering solutions for troubleshooting advanced systems on a large scale. You will work with cutting-edge technologies and innovate solutions to ensure the health and efficiency of our supercomputing infrastructure.
Our team empowers skilled engineers with a significant degree of autonomy and ownership, enabling them to drive impactful change. This role requires a keen focus on comprehensive system investigations and the development of automated solutions. We seek individuals who dive deep into problems, conduct thorough investigations, and create automation for large-scale detection and remediation.
In this Role, You Will:
Design and maintain automation systems for provisioning and managing server fleets.
Develop tools to monitor server health, performance, and lifecycle events.
Collaborate with teams across clusters, networking, and infrastructure.
Partner with external operators to uphold high-quality standards.
Identify and resolve performance bottlenecks and inefficiencies.
Continuously enhance automation to minimize manual tasks.
At Rylo, we are revolutionizing the way video is captured and shared. Our innovative camera technology records stunning, immersive video effortlessly, eliminating the hassle of traditional filming and editing. Built by a talented team of former engineers from Instagram and Apple, our advanced stabilization software and intuitive smartphone app allow users to create beautiful videos without the stress of perfect framing or stability. Simply hit record and enjoy the magic of producing flawless video content after the fact.
Join us at Rylo and become part of a dynamic team focused on developing cutting-edge consumer camera technology. You will have a unique chance to drive software advancements and influence hardware design, paving the way for innovative computational photography solutions.
OpenAI is seeking a Performance Modeling Engineer based in San Francisco. This role centers on building and improving models that enhance the performance and efficiency of AI systems. The work directly supports the technical backbone of OpenAI’s products. Key responsibilities Develop and refine models aimed at optimizing the performance of AI systems. Collab…
Full-time|Remote|San Francisco, CA, US; Remote, US
Join Pinterest as a Principal Engineer for our Compute Platform, where you'll play a crucial role in driving the architecture and implementation of scalable systems that power our services. You will lead a team of talented engineers, guiding them in building innovative solutions that enhance our platform's performance and reliability.In this position, you will have the opportunity to collaborate with cross-functional teams, mentor junior engineers, and contribute to the development of best practices and high-quality code. If you are passionate about technology and eager to make an impact, we would love to hear from you!
Join Crusoe as a Senior Engineering Manager in Compute where you will play a pivotal role in leading cutting-edge engineering teams. You will be responsible for overseeing the development and execution of our innovative computing solutions, ensuring performance and reliability across various platforms.Your leadership will guide teams toward achieving engineering excellence, fostering a collaborative environment, and driving strategic initiatives. This is an opportunity to make a significant impact within a rapidly growing company at the forefront of technology.
Role Overview OpenAI is hiring a ChatGPT Performance Engineer in San Francisco. This role focuses on improving the performance and efficiency of ChatGPT’s advanced AI models. The position works closely with cross-functional teams to identify and implement solutions that make ChatGPT faster and more reliable for users around the world. What You Will Do Optimize the speed, reliability, and scalability of ChatGPT’s platforms. Collaborate with engineers and other teams to solve technical challenges. Develop and refine systems to support a seamless user experience globally. Impact This work directly shapes the future of AI at OpenAI, helping deliver a dependable and efficient ChatGPT experience to millions of users.
Full-time|$225K/yr - $315K/yr|Remote|San Francisco
About Us
At Lavendo, we are pioneering an infrastructure that most engineers only dream of. We operate an AI-centric cloud platform that integrates expansive GPU clusters, high-speed networking, and cloud-native tools, catering to enterprises, innovative startups, and leading research teams. Our mission is straightforward: empower our clients to efficiently train and execute complex AI and simulation workloads without the need to construct their own supercomputers.
As a publicly traded company, we are rapidly expanding, with R&D centers across North America, Europe, and the Middle East. Our culture emphasizes engineering excellence: minimal bureaucracy, significant ownership, and a focus on tackling challenging infrastructure problems while witnessing the impact of our work on real customer operations.
Your Role as HPC Specialist Solutions Architect
In this pivotal role, you will be the go-to expert for customers looking to establish or enhance advanced GPU and HPC environments in the cloud. This includes multi-rack clusters, high-speed interconnects, intricate scheduling, and strict SLAs for throughput and latency.
As an HPC Specialist Solutions Architect, you will design and optimize cutting-edge platforms for AI training, extensive simulations, and data-intensive workloads. You will work closely with NVIDIA's latest hardware (Hopper, Blackwell, and future generations), NVLink/NVSwitch topologies, and InfiniBand/RoCE fabrics, having a substantial influence on the evolution of our platform and reference architectures. If you thrive on translating workloads into optimized clusters and maximizing performance, this is the ideal position for you.
Your Responsibilities
Cluster Design: Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm. Your considerations will include node types, GPU topology, queues, partitions, and failure scenarios.
Infrastructure Optimization: Integrate NVIDIA Hopper- and Blackwell-class GPUs with NVLink/NVSwitch and InfiniBand/RoCE, ensuring the hardware layout aligns with the communication patterns of the workloads.
Automation: Deploy and manage GPU and Network Operators to standardize drivers, CUDA, firmware, and high-speed networking across extensive fleets, rather than managing on a box-by-box basis.
Supercomputer Cloud Functionality: Design and validate cloud-native HPC environments that emulate supercomputer capabilities.
Full-time|$83K/yr - $104K/yr|On-site|San Francisco
DigitalOcean is looking for a Hardware Sustaining Engineer in San Francisco to help improve and maintain the hardware that powers our global cloud platform. This role supports the Infra::Machines::Design team, focusing on the server fleet and data center infrastructure.
What you will do
Optimize and troubleshoot large-scale data center hardware to keep systems running smoothly for our customers.
Enhance and support the hardware and firmware that form the backbone of DigitalOcean’s server infrastructure.
Work with a collaborative team to address new challenges as we grow our data center presence and expand cloud capabilities.
Explore and evaluate advanced technologies to strengthen our hardware support.
Who we’re looking for
Experience with data center hardware maintenance and troubleshooting.
Interest in scaling infrastructure and supporting cloud services.
Comfort working in a collaborative, growth-focused environment.
Enthusiasm for learning and applying new technologies.
About DigitalOcean
DigitalOcean builds simple, scalable cloud solutions for developers and businesses. Our teams value learning, teamwork, and making a real impact for customers worldwide.
We are seeking a talented Performance Engineer to join our dynamic team at usm2. This is an exciting opportunity for local professionals who are passionate about optimizing system performance and enhancing user experience. As a Performance Engineer, you will play a crucial role in analyzing performance metrics, identifying bottlenecks, and implementing solutions to ensure our applications run smoothly and efficiently.
Full-time|Hybrid|Hybrid - San Francisco, California
About the Role
Oura Health Inc. is looking for an Engineering Program Manager focused on Hardware Sensing. This hybrid position is based in San Francisco, California. The role centers on guiding cross-functional teams as they develop and deliver new hardware sensing technologies. Success depends on clear coordination, strong project oversight, and attention to both timelines and quality.
What You Will Do
Manage hardware sensing projects from planning through execution.
Work closely with engineering teams to clarify technical requirements and ensure deliverables meet expectations.
Keep stakeholders informed and aligned by facilitating regular communication and updates.
Spot risks early and put plans in place to reduce or avoid them.
Location
This is a hybrid role based in San Francisco, California.
At Rylo, we are revolutionizing the way you capture and share your experiences. Our state-of-the-art camera is designed to record your surroundings with breathtaking clarity and stability, eliminating the hassle of traditional video capture. Created by a team of visionary engineers from Instagram and Apple, our innovative stabilization software and user-friendly smartphone app ensure that every shot you take is a masterpiece. With Rylo, you can focus on enjoying the moment while we handle the technicalities of creating stunning videos.
As a Software Engineer specializing in Computational Photography, you will play a crucial role in enhancing the core algorithms that power the Rylo camera and future products. Your work will fundamentally improve the photography and cinematography experience, focusing on raising image quality and developing groundbreaking computational photography features. You will engage in the complete lifecycle of algorithm development, from design and implementation through quality evaluation and performance optimization, culminating in successful deployment.
Your collaboration with software engineers, hardware engineers, and designers will allow you to push the boundaries of consumer camera technology.
Lumafield develops X-Ray CT scanners designed to make advanced imaging more accessible and affordable. The company’s cloud software provides engineers with detailed visualization tools, helping them analyze complex products and make informed decisions.
Role overview
This full-time, on-site Hardware Systems Engineer position is based in San Francisco. The role centers on leading hardware development for industrial CT scanners. Collaboration with researchers and designers is a key part of the job, with a focus on improving product development for a range of industries.
What you will do
Lead hardware systems development for industrial CT scanners
Design and manage electrical architecture
Develop firmware and oversee system integration
Work hands-on to transform concepts into working products
Collaborate with cross-functional teams to address customer needs
Team and collaboration
The engineering team includes experienced researchers and designers who value curiosity and rigor. The group is impact-driven and backed by leading venture capital firms.
Location
This role requires working on-site at Lumafield’s San Francisco office.
About Our Team
At OpenAI, our Hardware organization is pioneering the development of cutting-edge silicon and system-level solutions tailored to meet the distinctive needs of advanced AI workloads. We are dedicated to building the next generation of AI silicon, collaborating closely with software engineers and research partners to co-design hardware that integrates seamlessly with our AI models. Our mission includes not only delivering high-quality, production-grade silicon for OpenAI's supercomputing infrastructure but also creating custom design tools and methodologies that foster innovation and enable hardware optimized specifically for AI applications.
About the Role
We are looking for a talented Research Hardware Co-Design Engineer to operate at the intersection of model research and silicon/system architecture. In this role, you will play a critical part in shaping the numerics, architecture, and technology strategy for the future of OpenAI's silicon, in collaboration with both Research and Hardware teams.
Your responsibilities will include diagnosing discrepancies between theoretical performance and real-world measurements, writing quantization kernels, assessing numerics risk through model evaluations, quantifying system architecture trade-offs, and implementing innovative numeric RTL. This is a hands-on position for individuals who are passionate about tackling challenging problems, finding practical solutions, and driving them to production.
Strong prioritization and transparent communication skills are vital for success in this role.
Location: San Francisco, CA (Hybrid: 3 days/week onsite). Relocation assistance available.
Key Responsibilities:
Enhance our roofline simulator to monitor evolving workloads and deliver analyses that quantify the impact of architectural decisions, supporting technology exploration.
Identify and resolve discrepancies between performance simulations and actual measurements; effectively communicate root causes, bottlenecks, and incorrect assumptions.
Develop emulation kernels for low-precision numerics and lossy compression techniques, equipping Research with the insights needed to balance efficiency with model quality.
Prototype numeric modules by advancing RTL through synthesis; either hand off innovative numeric solutions cleanly or occasionally take ownership of an RTL module from start to finish.
Proactively engage with new ML workloads, prototype them using rooflines and/or functional simulations, and initiate evaluations of new opportunities or risks.
Gain a holistic understanding of the transition from ML science to hardware optimization, breaking this comprehensive objective down into actionable short-term deliverables.
Foster collaborative relationships across diverse teams with varying goals and expertise, ensuring that progress remains unimpeded.
Clearly articulate design trade-offs with explicit assumptions and rationale.
At Orbital Industries, we are pioneering an AI-driven industrial revolution by crafting innovative hardware from the atomic level. Our mission is to spearhead advancements in critical technologies that will safeguard our planet for future generations.
Our initial focus involves developing essential hardware for AI data centers, enhancing both their performance and sustainability. Every product we create is designed in conjunction with our AI platform, merging AI-driven hardware engineering with AI-optimized material science to achieve groundbreaking real-world performance.
With an ambitious vision ahead, we are on the lookout for exceptional individuals across diverse teams including AI research, operations, advanced materials, mechanical engineering, chemical engineering, and manufacturing.
Joining Orbital means being part of closely-knit, vertically integrated teams. We seek candidates who are passionate about physical technology, curious about AI, and eager to expand their knowledge.
Role Overview
We are in search of a Talent Partner who will oversee hardware and engineering recruitment across our North American locations. This role is centered on sourcing exceptional technical talent while establishing scalable recruiting processes to support the ongoing growth of our technology division.
The position involves formulating talent strategies across Orbital's key domains and executing them effectively.
A robust background in in-house recruitment is essential for this role.
Key Responsibilities
Technical Recruiting
Manage the complete recruiting cycle in complex hardware and engineering environments.
Identify the qualities that define an elite engineer and uphold Orbital’s exceptional hiring standards.
Capture and engage the top 1% of technical talent across the US, Canada, and beyond.
Stakeholder Management
Forge strong relationships with senior stakeholders including C-suite executives, founders, and hiring managers.
Ensure a world-class experience for both hiring managers and candidates.
Talent Identification
Develop talent strategies aimed at graduates at all levels (BSc, MSc, PhD) and cultivate partnerships that promote long-term technical growth.
Foster engaged communities of technical talent that contribute to our innovative ecosystem.
Full-time|$130K/yr - $190K/yr|On-site|San Francisco
Job Category: AI & Robotics
About Avala AI
Avala AI is a pioneering AI Data Infrastructure company at the forefront of real-world AI and its integration with the labor economy. We excel in delivering high-quality data labeling, comprehensive dataset management, and insightful data visualization, providing 4D labeling solutions tailored for autonomous vehicles, humanoid robots, and drone applications. Our mission is to empower AI-driven sectors, ranging from AV companies to robotics innovators and drone enterprises, by equipping them with the essential data infrastructure to propel the next generation of intelligent systems while offering dignified digital employment opportunities globally.
The Role
In your capacity as a 3D Computer Vision Engineer at Avala AI, you will be responsible for designing and implementing cutting-edge solutions for both offline and online 3D reconstruction and scene understanding, ensuring robustness, accuracy, and performance. You will collaborate on a world-class spatial computing platform deployed extensively in autonomous vehicles, advanced robotic systems, and drone technologies.
Your contributions will advance the capabilities of real-world AI while utilizing the latest advancements in deep learning and 3D computer vision techniques.
What You’ll Do
Spatial Computing & Reconstruction: Innovate through the application of NeRFs, Diffusion Models, Gaussian Splatting, Multiview Stereo, TSDF Fusion, Structure from Motion, and SLAM methodologies.
Mission-Critical Perception: Develop robust 3D perception systems and scene understanding frameworks that enhance safety and operational performance across various robotics and AV applications.
4D Data Labeling & Visualization: Work collaboratively with cross-functional teams to enhance and expand Avala’s 4D labeling platform for automobiles, humanoid robots, and drones.
Software Engineering Best Practices: Apply strong coding, testing, and deployment methodologies to ensure rapid, safe, and efficient development of innovative solutions.
Boundary-Pushing Innovation: Actively explore new methodologies and technologies that advance the field of 3D vision, neural rendering, and large-scale data processing.
At Genmo, we are at the forefront of advancing artificial intelligence through innovative research in video generation. Our mission is to construct open, cutting-edge models that will ultimately contribute to the realization of Artificial General Intelligence (AGI). As part of our dynamic team, you will play a pivotal role in redefining the future of AI and expanding the horizons of video creation.
We are looking for a skilled GPU Performance Engineer who can extract maximum performance from our H100 infrastructure and fine-tune our model serving stack to achieve unparalleled efficiency. If you are passionate about optimizing performance, particularly at the microsecond level, and thrive on pushing hardware to its limits, this is the perfect opportunity for you.
Key Responsibilities
Utilize advanced profiling tools such as Nsight Systems and nvprof to analyze and enhance GPU workloads.
Develop high-performance CUDA and Triton kernels to optimize essential model functions.
Reduce cold start latency from seconds to mere milliseconds in our serving infrastructure.
Optimize memory access patterns, implement kernel fusion, and maximize GPU utilization.
Collaborate closely with machine learning engineers to optimize model implementations.
Diagnose and resolve performance issues throughout the application and hardware stack.
Implement custom memory pooling and allocation strategies to enhance performance.
Promote performance optimization techniques and foster a culture of excellence across teams.
Full-time|On-site|San Francisco, Seattle, New York, Toronto
Join Stripe as a Staff Software Engineer in our Stream Compute team, where you will play a pivotal role in building scalable solutions that power the financial infrastructure of the internet. As a member of our innovative engineering team, you will leverage your expertise to design and implement robust software solutions that enhance the performance and reliability of our streaming data capabilities.
Team and Platform Focus
The Compute Infrastructure team at OpenAI designs, builds, and maintains the systems that support AI research at scale. This work brings together accelerators, CPUs, networking, storage, data centers, orchestration software, agent infrastructure, developer tools, and observability. The aim is to create a reliable, unified experience for researchers and product teams across the company.
Projects span the full stack: capacity planning, cluster lifecycle management, bare-metal automation, and distributed systems. The team manages Kubernetes scheduling, system optimization, high-performance networking, storage, fleet health, reliability, workload profiling, benchmarking, and improvements to the developer experience. Even small improvements in communication, scheduling, hardware efficiency, or debugging can significantly accelerate research. OpenAI matches engineers to areas within Compute Infrastructure that align with their skills and interests.
Role Overview
This Software Engineer role centers on building and evolving the compute platform that supports OpenAI’s research and products. Candidates may bring expertise in low-level systems, high-performance computing, distributed infrastructure, reliability, CaaS, agent infrastructure, developer platforms, tooling, or infrastructure user experience. The most important qualities are strong analytical skills, the ability to write resilient code, and a collaborative approach that helps colleagues move faster and with more confidence.
What You Will Work On
Working close to hardware or at the user interaction layer
Developing CaaS and agent infrastructure
Managing control and data planes that connect the system
Bringing new supercomputing capabilities online
Optimizing training workloads through profiler traces and benchmarks
Improving NCCL and collective communication
Analyzing GPUs, NICs, topology, firmware, thermal dynamics, and failure modes
Designing abstractions to unify diverse clusters into a single platform
Areas of Expertise
No one is expected to cover every area listed. Some engineers focus on system performance, kernel or runtime behavior, large-scale networking protocols, RDMA, NCCL, GPU hardware, benchmarking, scheduling, or hardware reliability. Others improve the platform’s usability through APIs, tools, workflows, and developer experience. The team values strong engineering judgment and a drive to advance the field.
About Our Team
The Training Runtime team is at the forefront of developing a cutting-edge distributed machine learning training runtime, enabling everything from pioneering research to large-scale model deployments. Our mission is to empower researchers while facilitating growth into frontier-scale operations. We are crafting a cohesive, modular runtime that adapts to researchers’ evolving needs as they progress along the scaling curve.
Our focus is anchored in three key areas: optimizing high-performance, asynchronous data movement that is aware of tensor and optimizer states; building robust, fault-tolerant training frameworks that incorporate comprehensive state management, resilient checkpointing, deterministic orchestration, and advanced observability; and managing distributed processes for enduring, job-specific, and user-defined workflows.
We aim to seamlessly integrate proven large-scale capabilities into a developer-friendly runtime, enabling teams to iterate rapidly and operate reliably across various scales. Our success is gauged by both training throughput (the speed of model training) and researcher throughput (the pace at which ideas transform into experiments and products).
About the Role
As a Training Performance Engineer, you will be instrumental in driving efficiency enhancements throughout our distributed training architecture. Your responsibilities will include analyzing extensive training runs, pinpointing utilization gaps, and engineering optimizations that maximize throughput and system uptime.
This position merges a profound understanding of systems with practical performance engineering: analyzing GPU kernel performance and collective communication throughput, investigating I/O bottlenecks, and implementing model sharding techniques for large-scale training. Your efforts will ensure our clusters operate at peak performance, enabling OpenAI to develop larger and more sophisticated models within existing compute budgets.
This position is located in San Francisco, CA, with a hybrid work model of three days in the office each week; relocation assistance is available for new hires.
Key Responsibilities:
Analyze end-to-end training runs to detect performance bottlenecks across computation, communication, and storage.
Enhance GPU utilization and throughput for large-scale distributed model training.
Collaborate with runtime and systems engineers to boost kernel efficiency, scheduling, and collective communication performance.
Implement model graph transformations to enhance overall throughput.
Develop tools for monitoring and visualizing metrics such as MFU, throughput, and uptime across clusters.
Role overview
The Hardware Program Manager at meter leads complex hardware initiatives from initial planning through final delivery. This position ensures each project phase aligns with company objectives and remains on schedule. The role is based in San Francisco.
What you will do
Work closely with teams across disciplines to guide hardware products from concept to launch
Manage project plans, set timelines, and track deliverables
Monitor project progress and make adjustments to keep deadlines on track
Contribute to the development of hardware solutions that enhance meter’s product offerings
Location
This role requires working onsite in San Francisco.
About Our Team
At OpenAI, the Fleet team is integral to maintaining the robust computing environment that fuels our groundbreaking research and innovative product development. We manage extensive systems encompassing data centers, GPUs, networking, and more, ensuring optimal performance, availability, and efficiency. Our efforts empower OpenAI’s models to function seamlessly at scale, supporting both internal R&D and external offerings like ChatGPT. We emphasize safety, reliability, and responsible AI deployment over unrestrained growth.
About the Role
As a Software Engineer on the Fleet Hardware team, you will play a crucial role in ensuring the reliability and uptime of OpenAI’s compute fleet. Minimizing hardware failures is essential for research training progress and service stability, since even minor disruptions can lead to significant setbacks. With the increasing complexity of supercomputers, the pressure to maintain operational integrity has never been higher.
This is a unique opportunity to be at the forefront of technology, pioneering solutions for troubleshooting advanced systems on a large scale. You will work with cutting-edge technologies and innovate solutions to ensure the health and efficiency of our supercomputing infrastructure.
Our team empowers skilled engineers with a significant degree of autonomy and ownership, enabling them to drive impactful change. This role requires a keen focus on comprehensive system investigations and the development of automated solutions.
We seek individuals who dive deep into problems, conduct thorough investigations, and create automation for large-scale detection and remediation.
In this Role, You Will:
Design and maintain automation systems for provisioning and managing server fleets.
Develop tools to monitor server health, performance, and lifecycle events.
Collaborate with teams across clusters, networking, and infrastructure.
Partner with external operators to uphold high-quality standards.
Identify and resolve performance bottlenecks and inefficiencies.
Continuously enhance automation to minimize manual tasks.
At Rylo, we are revolutionizing the way video is captured and shared. Our innovative camera technology records stunning, immersive video effortlessly, eliminating the hassle of traditional filming and editing. Built by a talented team of former engineers from Instagram and Apple, our advanced stabilization software and intuitive smartphone app allow users to create beautiful videos without the stress of perfect framing or stability. Simply hit record and enjoy the magic of producing flawless video content after the fact.
Join us at Rylo and become part of a dynamic team focused on developing cutting-edge consumer camera technology. You will have a unique chance to drive software advancements and influence hardware design, paving the way for innovative computational photography solutions.