About the job
At Crusoe, our mission is to drive the future of energy and intelligence. We are developing the infrastructure that empowers ambitious AI creations without compromising on scale, speed, or sustainability.
Join us in leading the AI revolution through sustainable technology. At Crusoe, you will be at the forefront of innovation, contributing to impactful projects and collaborating with a team dedicated to transforming cloud infrastructure responsibly.
About This Role:
As a Senior Site Reliability Engineer, you will play a crucial role in ensuring the operational excellence of Crusoe’s energy-efficient, AI-optimized GPU cloud. Your focus will be on maintaining stability, resilience, and performance, driving initiatives that enhance our cloud platform.
This position is perfect for engineers who thrive in dynamic environments, relish the challenge of solving operational issues, and seek to advance their technical careers while enhancing incident response and reliability for a large-scale distributed platform.
You will collaborate closely with senior SREs, infrastructure engineers, and platform teams to bolster reliability, minimize operational toil, and refine our incident management processes.
What You’ll Be Working On:
Work with cross-functional teams to establish and enhance availability metrics for our cloud infrastructure, including the development, tracking, and improvement of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Assist in incident response by diagnosing and resolving service disruptions, and support post-incident processes by documenting root cause analyses and participating in reviews.
Build and maintain our infrastructure and monitor its health using Crusoe’s observability tools (Prometheus, Grafana, Alertmanager, OpenTelemetry).
Identify and communicate reliability risks and performance bottlenecks, along with early indicators of potential incidents that may impact service availability.
Develop automation and tools to reduce operational toil, minimize manual processes, and improve service recovery and self-healing capabilities.
Collaborate with compute, network, storage, and platform teams to enhance service resilience and strengthen disaster recovery preparedness.
Engage in knowledge sharing and contribute to the development of operational best practices across the organization.
