About the job
Cerebras Systems revolutionizes the AI landscape with the world’s largest AI chip, which is 56 times larger than conventional GPUs. Our innovative wafer-scale architecture delivers the computational power of numerous GPUs on a single chip, simplifying programming for users. This approach enables Cerebras to achieve unparalleled training and inference speeds, so machine learning practitioners can run large-scale ML applications without the complexity of managing hundreds of GPUs or TPUs.
Our clientele includes leading model laboratories, global enterprises, and pioneering AI-native startups. Notably, OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of scale, significantly enhancing key workloads with ultra-high-speed inference.
Thanks to our groundbreaking wafer-scale architecture, Cerebras Inference provides the fastest Generative AI inference solution globally, exceeding the performance of GPU-based hyperscale cloud inference services by over ten times. This significant speed enhancement transforms the user experience of AI applications, facilitating real-time iterations and augmented intelligence through additional agentic computation.
About the role
We are on the lookout for a highly skilled and experienced AI Infrastructure Operations Engineer to oversee and manage our state-of-the-art machine learning compute clusters. In this role, you will have the unique opportunity to work with the world’s largest computer chip, the Wafer-Scale Engine (WSE), and the systems that leverage its extraordinary power.
You will play a pivotal role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our expanding AI initiatives. This position requires an in-depth understanding of Linux-based systems, expertise in containerization technologies, and experience in monitoring and troubleshooting complex distributed systems. The ideal candidate is a reliable, proactive problem-solver with a strong background in large-scale compute infrastructure and a commitment to customer success.
