About the job
Join Our Team as a Site Reliability Engineer
Blaxel is seeking a highly skilled Site Reliability Engineer to enhance the reliability, performance, and scalability of our cutting-edge AI infrastructure platform.
In this role, you will develop and manage the essential systems that support scalable agentic AI. Your primary goal: maintain our ultra-low-latency, stateful, serverless compute engine, ensuring it remains robust as we handle billions of agent requests from the world's most advanced AI teams.
This position is deeply technical and execution-oriented. You will take charge of our reliability framework, encompassing observability, performance optimization, incident management, infrastructure health, and the automation processes that ensure seamless operations. We are looking for innovators who can design new reliability systems, advance automation capabilities, and continuously adapt the platform to accommodate next-generation AI workloads. If you are a builder who excels in managing critical infrastructure at scale, we want to hear from you.
Your Responsibilities
Working closely with our founders, infrastructure team, and development team—leveraging AI for maximum efficiency—you will architect and manage the systems that keep Blaxel fast, resilient, and secure.
Design, operate, and iteratively enhance the core infrastructure that drives our 25ms cold-start compute engine.
Develop and refine our observability stack (metrics, traces, logs), ensuring proactive issue detection.
Establish, monitor, and drive SLOs/SLIs across vital system components to ensure world-class reliability.
Lead incident response with precision: conduct root cause analyses and post-mortems, and implement systemic solutions.
Design and deploy self-healing, automated operational systems to minimize manual work and scale operations.
Collaborate across compute, networking, storage, and sandboxed execution layers to optimize performance under intense workloads.
Create automation tools, often utilizing AI agents, to enhance operations, debugging, capacity planning, and failure prediction.
Push our systems to their limits through load testing, chaos engineering, and performance benchmarking.
Champion security best practices at the infrastructure level, from sandboxed compute to network isolation.
Collaborate with platform engineers to ensure reliability is an integral part of new features from inception.
Who You Are
Extensive technical expertise in site reliability engineering, with a passion for building scalable systems.

