About the job
At Crusoe, we're on a mission to revolutionize the relationship between energy and intelligence. Our aim is to build robust infrastructure that empowers people to pursue ambitious AI projects while ensuring sustainability, speed, and scalability.
Join us at the forefront of the AI revolution, leveraging sustainable technology to drive significant innovations. Be part of a team that is redefining responsible cloud infrastructure and making a real impact.
About the Role:
As a member of our Site Reliability Engineering team, you will be responsible for the reliability and scalability of Crusoe's AI-optimized cloud platform. We are seeking a Senior Staff Site Reliability Engineer with a strong background in distributed systems and extensive hands-on experience with large language models to build and operate managed AI services at scale. This role is essential to delivering highly available, efficient, and cost-effective AI infrastructure that supports compute-intensive, latency-sensitive workloads for our clients.
Key Responsibilities:
Design and operate reliable managed AI services focused on scaling LLM workloads.
Create automation and reliability tools to enhance distributed AI pipelines and inference services.
Establish, evaluate, and enhance SLIs/SLOs across AI workloads to ensure performance and reliability targets are consistently achieved.
Collaborate with AI, platform, and infrastructure teams to refine large-scale training and inference clusters.
Automate observability by developing telemetry and performance-tuning strategies for latency-sensitive AI services.
Investigate and resolve reliability challenges in distributed AI systems using telemetry, logs, and profiling.
Contribute to the architecture of next-generation distributed systems specifically designed for AI-first environments.

