Performance Reliability Engineer jobs in Sunnyvale – Browse 576 openings on RoboApply Jobs

Performance & Reliability Engineer

Cerebras Systems | Sunnyvale, CA; Toronto, Ontario, Canada

On-site Full-time

About the job

Cerebras Systems is at the forefront of AI technology, developing the world’s largest AI chip, which is 56 times larger than conventional GPUs. Our innovative wafer-scale architecture delivers the computational power of dozens of GPUs within a single chip, simplifying programming and enhancing performance. This unique capability enables Cerebras to provide unparalleled training and inference speeds, allowing machine learning practitioners to execute large-scale ML applications seamlessly without the complexities of managing extensive GPU or TPU infrastructures.

Cerebras serves a diverse clientele, including top-tier model labs, global enterprises, and pioneering AI-native startups. OpenAI has recently partnered with Cerebras to leverage 750 megawatts of power, significantly enhancing key workloads through ultra high-speed inference.

Our cutting-edge wafer-scale architecture has made Cerebras Inference the fastest Generative AI inference solution globally, achieving speeds over ten times faster than GPU-based hyperscale cloud inference services. This revolutionary speed is transforming the user experience of AI applications, facilitating real-time iteration and boosting intelligence through enhanced computational capabilities.

About The Role

We invite you to join Cerebras as a Performance & Reliability Engineer within our dynamic Co-Design and Next Generation Team. Our groundbreaking CS-3 system has established benchmarks for high-performance ML training and inference solutions, utilizing a chip the size of a dinner plate with 44GB of on-chip memory that exceeds traditional hardware capabilities. In this role, you will focus on characterizing and optimizing the performance and reliability of state-of-the-art AI models operating on Cerebras' revolutionary hardware.

Responsibilities

Characterize and enhance the performance and reliability of advanced ML hardware/software systems, focusing on minimizing power and thermal fluctuations.
Analyze ML workloads, software kernels, and hardware architecture for their power and performance impacts, synthesizing high-level insights across these layers.
Develop innovative software solutions to enhance system performance and efficiency.

Similar jobs

Browse all companies, explore by city & role, or SEO search pages. View directory listings: all jobs, search results, location & role pages.

1 - 20 of 576 Jobs

Performance & Reliability Engineer

Cerebras Systems

Full-time|On-site|Sunnyvale, CA; Toronto, Ontario, Canada

Feb 17, 2026

Fleet Reliability Engineer

Applied Intuition

Full-time|On-site|Sunnyvale, California, United States

As a Fleet Reliability Engineer at Applied Intuition, you will be at the forefront of ensuring the reliability and performance of our advanced fleet systems. Your expertise will play a crucial role in the development and deployment of our cutting-edge technology, optimizing fleet operations to guarantee safety and efficiency.

Mar 25, 2026

Senior Site Reliability Engineer

Illumio

Full-time|On-site|Sunnyvale, California - HQ

Illumio’s Senior Site Reliability Engineer role is based at the company’s Sunnyvale, California headquarters. This is an on-site position, requiring presence in the office five days a week.

Role overview

This position focuses on building and maintaining reliable, scalable infrastructure for Illumio’s applications and services, with an emphasis on Azure cloud solutions. The Senior SRE supports both SaaS and on-premises offerings, working closely with engineering teams to ensure operational resilience and security across hybrid environments.

What you will do

Design, deploy, and maintain highly available infrastructure on Azure for Illumio’s products.
Automate provisioning and configuration management using Infrastructure as Code tools such as Terraform or ARM templates.
Develop and manage CI/CD pipelines to improve software delivery and deployment processes.
Monitor system and application health using Azure monitoring and logging tools, and optimize for performance and availability.
Lead incident response, perform root cause analysis, and document findings to drive continuous improvement.
Collaborate with development teams to design scalable, reliable architectures and provide guidance on cloud-native best practices.

Engineering at Illumio

The engineering team values autonomy, ownership, and collaboration. Work centers on advancing cybersecurity with scalable SaaS services and solutions for on-premises environments. The team emphasizes disciplined engineering, quality, and a supportive culture.

Apr 22, 2026

Engineering Manager, Kernel Reliability

Cerebras Systems

Full-time|On-site|Sunnyvale CA or Toronto Canada

Cerebras Systems is at the forefront of AI technology, having developed the world's largest AI chip, which is 56 times larger than traditional GPUs. Our innovative wafer-scale architecture delivers the AI computing power equivalent to dozens of GPUs on a single chip, simplifying programming to a single device. This revolutionary design enables Cerebras to provide unmatched training and inference speeds, empowering machine learning practitioners to seamlessly execute large-scale ML applications without the complexities of managing multiple GPUs or TPUs.

Our clientele includes elite model labs, global corporations, and pioneering AI-native startups. Notably, OpenAI recently entered into a multi-year partnership with Cerebras to deploy 750 megawatts of scale, significantly enhancing key workloads with ultra high-speed inference.

Thanks to our groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution globally, achieving speeds over 10 times faster than GPU-based hyperscale cloud inference services. This substantial speed boost is transforming user experiences in AI applications by enabling real-time iterations and enhancing intelligence through additional agentic computation.

The Role

We are seeking a highly technical and hands-on Engineering Manager to lead our on-field Kernel Reliability team. You will guide a high-performing team in addressing a critical challenge: enhancing the reliability of our advanced compute clusters along with the associated inference, training, and internal production services. In this influential role, you will define the technical vision while remaining closely engaged with the code, crafting scalable solutions for our rapidly expanding system production and software service offerings. If you possess proven expertise in software or hardware reliability, diagnostic tool development, or failure analysis and debugging, we invite you to connect with us.

Responsibilities

Provide hands-on technical leadership, owning the technical vision and roadmap for kernel-centric reliability concerning both internal and customer-facing systems.

Feb 17, 2026

Software Engineer - Kernel Reliability

Cerebras Systems

Full-time|On-site|Sunnyvale CA or Toronto Canada

Cerebras Systems is revolutionizing the AI landscape with the world's largest AI chip, which is 56 times larger than traditional GPUs. Our innovative wafer-scale architecture delivers the computational power of multiple GPUs on a single chip, simplifying programming and enabling unparalleled training and inference speeds. This technology allows our users to run extensive machine learning applications seamlessly, eliminating the complexities associated with managing numerous GPUs or TPUs.

Our clientele includes leading model labs, global corporations, and pioneering AI startups. Recently, OpenAI announced a multi-year collaboration with Cerebras, aiming to deploy 750 megawatts of power, significantly enhancing their workloads with ultra-fast inference capabilities.

With our groundbreaking wafer-scale architecture, Cerebras Inference provides the fastest Generative AI inference solution globally, outperforming GPU-based hyperscale cloud services by over tenfold. This remarkable speed enhancement is transforming user experiences in AI applications, facilitating real-time iterations and amplifying intelligence through advanced computational capabilities.

About The Role

We are in search of a highly technical and hands-on Software Engineer to join our Kernel Reliability team. In this pivotal role, you will address the crucial task of enhancing the reliability of our advanced compute clusters, along with the inference, training, and internal production services. You will work closely with the code to develop solutions that scale alongside our rapidly evolving production systems and software services. If you possess strong foundations in systems, debugging, and failure analysis and have a passion for creating tools and solving complex reliability challenges, we would love to connect with you. New graduates are encouraged to apply.

Mar 5, 2026

Site Reliability Engineer II at Illumio | Sunnyvale, California

Illumio

Full-time|On-site|Sunnyvale, California - HQ

Join Us on Our Mission!

At Illumio, we are pioneering the way organizations combat ransomware and data breaches. Our innovative breach containment platform, driven by the Illumio AI Security Graph, enables businesses to effectively identify and mitigate threats across hybrid multi-cloud environments, preventing attacks from escalating into severe crises.

As a recognized leader in the Forrester Wave™ for Microsegmentation, Illumio's solutions empower organizations to adopt Zero Trust models, enhancing cyber resilience for the critical infrastructure that sustains the global economy.

On-Site Work: This position requires 5 days a week on-site presence at our Sunnyvale, CA headquarters.

Our Vision: Our Engineering team thrives on a culture of visionary leadership, autonomy, and ownership, fostering an innovative environment that propels us through the dynamic landscape of cybersecurity.

By joining our team, you will contribute to the forefront of Zero Trust Segmentation, utilizing an advanced technology stack that encompasses diverse operating systems, distributed applications, and cutting-edge UI/visualization tools.

Together, we are shaping the future of cybersecurity, committed to developing world-class products guided by diverse perspectives and a shared dedication to innovation amidst unprecedented cyber threats.

Your Role: As a Site Reliability Engineer II, you will oversee our multi-cloud infrastructure on platforms such as Azure, AWS, and/or GCP. Your responsibilities will include designing new cloud services and applications, collaborating closely with Engineering, SRE/OPS, and Security teams to transition these projects from development to production.

Daily tasks will involve enhancing the reliability and scalability of Illumio's SaaS products while driving continuous improvement initiatives.

We seek candidates with a strong passion for cloud technology, automation, and collaboration, as well as a solid understanding of the Azure cloud platform and related DevOps practices.

Feb 7, 2026

Site Reliability Engineer II at Illumio | Sunnyvale, California

Illumio

Full-time|On-site|Sunnyvale, California - HQ

Join Us in Securing the Future!

At Illumio, we are pioneers in ransomware and breach containment, transforming how organizations defend against cyberattacks and fortifying operational resilience. Our innovative Illumio AI Security Graph powers a breach containment platform that swiftly identifies and neutralizes threats across hybrid multi-cloud environments, preventing minor issues from escalating into catastrophic events.

As a recognized leader in the Forrester Wave™ for Microsegmentation, we enable Zero Trust, bolstering the cyber resilience of the infrastructures, systems, and organizations that keep the world functioning smoothly.

Location: This role requires on-site presence in our Sunnyvale, CA headquarters five days a week.

Vision of Our Team: Our Engineering team flourishes within a culture that champions visionary leadership, autonomy, and ownership. This dynamic synergy propels us forward in the constantly evolving realm of cybersecurity.

As a member of our team, you will be at the forefront of Zero Trust Segmentation, working with an advanced technology stack that encompasses operating systems, distributed applications, and immersive UI/visualization tools.

We're not just shaping the future of cybersecurity; we’re committed to developing world-class products led by diverse perspectives, backgrounds, and an unwavering commitment to innovation amidst unprecedented cybersecurity challenges.

Your Role: As a Site Reliability Engineer II, you will oversee and optimize our multi-cloud infrastructure across Azure, AWS, and/or GCP. You will have the opportunity to design new services and applications in the cloud, guiding them from development to production while collaborating closely with Engineering, SRE/Operations, and Security teams.

Your daily responsibilities will include enhancing the reliability and scalability of Illumio's SaaS offerings and spearheading continuous improvement initiatives.

The ideal candidate is driven by a passion for cloud technology, automation, and collaboration, coupled with a solid foundation in Azure cloud platforms and relevant DevOps practices.

Design, deploy, and maintain robust cloud infrastructure solutions on Azure, AWS, and/or GCP to support our applications and services.
Implement Infrastructure as Code (IaC) principles using tools such as Terraform, ARM templates, or CloudFormation to automate provisioning and configuration management.
Develop and maintain CI/CD pipelines for automated software delivery and deployment, utilizing tools like Azure DevOps, AWS CodePipeline, or Jenkins.
Monitor system performance and availability, ensuring optimal operational efficiency.

Mar 23, 2026

Senior Site Reliability Engineer for AI/ML Innovations

Intuitive Surgical, Inc.

Full-time|On-site|Sunnyvale

Join our dynamic team as a Senior Site Reliability Engineer focused on AI/ML solutions. In this role, you will leverage your expertise to enhance the reliability, scalability, and performance of our cutting-edge AI-driven products. You will work collaboratively with cross-functional teams to design, implement, and maintain robust systems that support our mission to revolutionize surgical technology.

Dec 25, 2025

Staff ML Performance Engineer - Training Efficiency

Wayve Technologies

Full-time|On-site|Sunnyvale

Join Wayve Technologies as a Staff Machine Learning Performance Engineer, specializing in Training Efficiency. In this pivotal role, you will be responsible for enhancing the performance of our machine learning models and algorithms, ensuring they operate at peak efficiency. You will collaborate with cross-functional teams to develop innovative solutions that improve training processes, optimize model performance, and drive impactful results in autonomous vehicle technology.

Feb 27, 2026

Software Engineer - System Performance for Robot Software

Wayve

Full-time|On-site|Sunnyvale

Join Wayve, a pioneering company at the forefront of robotic software development, as a Software Engineer specializing in System Performance. In this role, you will be instrumental in optimizing our advanced robotic systems to enhance their efficiency and reliability. Collaborate with a talented team to push the boundaries of what is possible in the field of robotics.

Mar 30, 2026

High Performance Computing Software Engineer - Supercomputing

Institute of Foundation Models

Full-time|On-site|Sunnyvale, CA

Join Our Innovative Team at the Institute of Foundation Models

At IFM, we are pioneers in developing, understanding, and managing foundation models. Our mission is to advance research, cultivate the next generation of AI innovators, and contribute significantly to a knowledge-driven economy. As a member of our esteemed team, you will engage in the forefront of cutting-edge foundation model training, collaborating with top-tier researchers, data scientists, and engineers. Together, we will address the most significant and impactful challenges in AI development. You will play a crucial role in creating revolutionary AI solutions that have the potential to transform entire industries. Your strategic and innovative problem-solving abilities will be essential in establishing MBZUAI as a global leader in high-performance computing for deep learning, facilitating discoveries that will inspire future AI pioneers.

The Role

IFM is developing the foundational compute infrastructure that will drive future breakthroughs in AI and computational science. We are seeking a High Performance Computing Software Engineer to collaborate in designing, developing, and operating the software systems that manage our extensive AI workloads. In this position, you will work at the crossroads of high-performance computing and machine learning. You will be part of a dedicated team focused on creating the software stack that supports the training of advanced ML models using over 1000 GPUs, while ensuring our infrastructure remains robust, efficient, and user-friendly.

Apr 3, 2026

Software Engineer - High Performance Computing at SpaceX | Sunnyvale, CA

Space Exploration Technologies Corp.

Full-time|On-site|Sunnyvale, CA

Join SpaceX as a Software Engineer specializing in High Performance Computing (HPC) and contribute to pioneering advancements in space technology. Your expertise will play a crucial role in optimizing computational resources for our innovative satellite communications platform, Starlink. This position requires a collaborative mindset and a passion for problem-solving as you work alongside a talented team to enhance the efficiency and performance of our systems.

Apr 30, 2026

Sr GPU Performance Software Engineer II

CoreWeave

On-site|On-site|Sunnyvale, CA / Bellevue, WA

Join CoreWeave as a Senior GPU Performance Software Engineer II, where you will take the lead in transforming our GPU performance testing platform. As an influential member of our engineering team, you'll design and implement scalable solutions that enhance the reliability and performance of our global infrastructure. Collaborate with cross-functional teams to deliver measurable improvements in latency and throughput, ensuring an exceptional experience for our customers.

Feb 10, 2026

Senior Software Engineer, High Performance Computing at SpaceX | Sunnyvale, CA

Space Exploration Technologies Corp.

Full-time|On-site|Sunnyvale, CA

Join the SpaceX team as a Senior Software Engineer specializing in High Performance Computing (HPC) for our innovative Starlink project. You will be responsible for developing and optimizing software solutions that enhance the performance and efficiency of our satellite internet constellation. Your role will involve collaborating with cross-functional teams to design, implement, and maintain high-performance computing systems that meet the rigorous demands of our satellite operations. This is an opportunity to work at the forefront of technology and contribute to a mission that aims to revolutionize global internet access.

Apr 30, 2026

Reliability Manager for Photonic Integrated Components

dstaff

Full-time|On-site|Sunnyvale

We are seeking a talented and motivated Reliability Manager to join our team at dstaff, focusing on Photonic Integrated Components. In this role, you will lead initiatives to enhance the reliability of our products and ensure that they meet the highest quality standards. You will collaborate with cross-functional teams to conduct reliability testing, analyze data, and implement improvements.

May 14, 2015

Senior Performance Analyst - Inference at Cerebras Systems | Sunnyvale, CA

Cerebras Systems

Full-time|On-site|Sunnyvale, CA

Cerebras Systems is at the forefront of AI innovation, creating the world's largest AI chip that is 56 times larger than traditional GPUs. Our unique wafer-scale architecture delivers the computational power of numerous GPUs on a single chip, simplifying programming while providing unparalleled training and inference speeds. This revolutionary approach enables users to run extensive machine learning applications effortlessly, eliminating the complexity of managing multiple GPUs or TPUs.

Cerebras serves a diverse clientele, including leading model labs, major global enterprises, and pioneering AI-native startups. Recently, OpenAI announced a multi-year partnership with Cerebras, aiming to deploy 750 megawatts of scale that will redefine key workloads with ultra-high-speed inference.

Our groundbreaking wafer-scale architecture ensures that Cerebras Inference provides the fastest Generative AI inference solution globally, achieving speeds that are over ten times faster than GPU-based hyperscale cloud services. This significant enhancement in performance is transforming the user experience of AI applications, facilitating real-time iteration and boosting intelligence through enhanced computational capabilities.

About The Role

We are seeking a Senior Performance Analyst to join our dynamic Product team. As a specialist in state-of-the-art inference performance, you will be the go-to expert on how Cerebras measures up against alternative inference providers in terms of pricing and performance. This role combines performance benchmarking from foundational principles with competitive intelligence. The position revolves around two key pillars:

Performance Benchmarking

You will develop, execute, and sustain reproducible benchmarks that assess Cerebras inference performance for actual customer workloads. This includes metrics such as tokens per second, time to first token, latency under concurrency, and total cost of ownership (TCO).

Competitive Analysis

You will analyze market trends and competitor offerings to position Cerebras effectively within the inference landscape.

Apr 13, 2026

Manager of Wi-Fi Performance and Simulation

dstaff

Full-time|On-site|Sunnyvale

We are seeking a dynamic and experienced Manager of Wi-Fi Performance and Simulation to join our team at dstaff. In this pivotal role, you will be responsible for leading our Wi-Fi performance and simulation initiatives, ensuring optimal network performance and customer satisfaction.

May 3, 2015

Senior Distributed Systems Engineer

Institute of Foundation Models

Full-time|On-site|Sunnyvale, CA

About the Institute of Foundation Models

The Institute of Foundation Models (IFM) specializes in designing and operating large-scale GPU supercomputing systems aimed at training cutting-edge foundation models. Our philosophy places emphasis on the interdependence of performance, fault tolerance, and scalability across various components, including model architecture, communication systems, runtime, and hardware topology.

This position is pivotal to our mission: enhancing communication performance, distributed reliability, and cross-layer optimization for extensive training workloads.

The Mission

We seek a highly skilled engineer to collaboratively design and optimize the communication stack for large-scale distributed training, with a focus on hybrid parallelism and Mixture-of-Experts (MoE) workloads. This is a systems-level engineering role centered on performance enhancement, distributed debugging, and communication-runtime co-design.

· Design and optimize expert-parallel and hybrid-parallel communication patterns
· Drive high-performance hierarchical collectives for MoE workloads
· Co-design runtime orchestration with communication topology awareness
· Mitigate tail latency and enhance determinism across thousands of GPUs
· Architect fault-tolerant distributed execution that withstands real-world cluster failures

Core Technical Scope

· Communication-compute overlap and topology-aware collective optimization
· In-depth debugging of NCCL, RDMA, and custom communication layers
· Implementing hybrid expert parallel strategies in modern large-scale MoE systems
· Developing elastic and resilient distributed job orchestration concepts
· Conducting congestion analysis and routing optimization across InfiniBand/RoCE fabrics
· Executing microbenchmarking and performance modeling for communication-intensive workloads

Expected Technical Depth

· Expertise in hybrid expert parallel communication strategies

Mar 3, 2026

Principal Engineer, AI Inference Reliability

Cerebras Systems

Full-time|Remote|Remote Office; Sunnyvale CA or Toronto Canada

Cerebras Systems is at the forefront of AI innovation, manufacturing the largest AI chip in the world, which is 56 times bigger than conventional GPUs. Our cutting-edge wafer-scale architecture provides the computational power equivalent to dozens of GPUs on a single chip, simplifying programming to the level of a single device. This pioneering approach enables us to offer unmatched training and inference speeds, allowing machine learning practitioners to smoothly execute large-scale ML applications without the complexity of managing numerous GPUs or TPUs.

Our clientele includes leading model laboratories, major global corporations, and innovative AI-native startups. Notably, OpenAI has recently partnered with Cerebras to leverage 750 megawatts of scale, revolutionizing critical workloads with ultra-high-speed inference.

Our advanced wafer-scale architecture makes Cerebras Inference the fastest Generative AI inference solution available, outperforming GPU-based hyperscale cloud inference services by over tenfold. This remarkable speed enhancement is reshaping the user experience of AI applications, enabling real-time iterations and enhanced intelligence through additional agentic computation.

In late 2024, we launched Cerebras Inference, setting a new standard for Generative AI inference speed. Since its launch, we have rapidly scaled our services to meet the rising demand from AI labs, enterprises, and a vibrant developer community.

In October 2025, we celebrated our Series G funding round, successfully raising $1.1 billion USD to accelerate the growth of our product offerings and services to satisfy global AI demand.

About the Team

The Cerebras Inference team is dedicated to delivering the most efficient, secure, and reliable enterprise-grade AI service. We design and manage expansive distributed systems that facilitate AI inference with unparalleled speed and efficiency. Join us in scaling our inference capabilities to new heights!

Feb 17, 2026

Sr. Software Engineer - Perf and Benchmarking

CoreWeave

On-site|On-site|Sunnyvale, CA / Bellevue, WA

Join CoreWeave as a Senior Engineer on our Benchmarking & Performance team, where you will play a vital role in our expansive performance data warehouse. You will be responsible for ingesting, storing, transforming, and analyzing performance events across our global infrastructure. Your work will contribute to publishing industry-leading end-to-end performance benchmarks like MLPerf. As an owner of your projects, you will drive designs, elevate engineering standards, and deliver tangible enhancements in latency, throughput, and reliability across numerous services. Collaborating with product, orchestration, and hardware teams, you will help evolve our Kubernetes-native platform to meet stringent P99 SLAs at scale.

Feb 10, 2026
