Senior Site Reliability Engineer at Crusoe | San Francisco, CA

Crusoe | San Francisco, CA - US
On-site | Full-time | $172K/yr - $209K/yr

Experience Level

Senior

Qualifications

The ideal candidate will have a strong background in site reliability engineering, operations, or a related field, with experience in monitoring and maintaining large-scale distributed systems. Proficiency in programming and scripting languages, as well as knowledge of cloud infrastructure and automation tools, is crucial for success in this role.

About the job

At Crusoe, our mission is to drive the future of energy and intelligence. We are developing the infrastructure that empowers ambitious AI creations without compromising on scale, speed, or sustainability.

Join us in leading the AI revolution through sustainable technology. At Crusoe, you will be at the forefront of innovation, contributing to impactful projects and collaborating with a team dedicated to transforming cloud infrastructure responsibly.

About This Role:

As a Senior Site Reliability Engineer, you will play a crucial role in ensuring the operational excellence of Crusoe’s energy-efficient, AI-optimized GPU cloud. Your focus will be on maintaining stability, resilience, and performance, driving initiatives that enhance our cloud platform.

This position is perfect for engineers who thrive in dynamic environments, relish the challenge of solving operational issues, and seek to advance their technical careers while enhancing incident response and reliability for a large-scale distributed platform.

You will collaborate closely with senior SREs, infrastructure engineers, and platform teams to bolster reliability, minimize operational toil, and refine our incident management processes.

What You’ll Be Working On:

  • Work with cross-functional teams to establish and enhance availability metrics for our cloud infrastructure, including the development, tracking, and improvement of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

  • Assist in incident response by diagnosing and resolving service disruptions, while supporting post-incident processes through root cause analysis documentation and participation in reviews.

  • Build, maintain, and monitor the health of our infrastructure using Crusoe’s observability tools (Prometheus, Grafana, Alertmanager, OpenTelemetry).

  • Identify and communicate reliability risks and performance bottlenecks, along with early indicators of potential incidents that may impact service availability.

  • Develop automation and tools to reduce operational toil, minimize manual processes, and improve service recovery and self-healing capabilities.

  • Collaborate with compute, network, storage, and platform teams to enhance service resilience and strengthen disaster recovery preparedness.

  • Engage in knowledge sharing and contribute to the development of operational best practices across the organization.

About Crusoe

Crusoe is at the forefront of harnessing energy and intelligence to create a sustainable future. We are committed to building the most reliable and innovative cloud platform that empowers users to leverage AI technology responsibly and efficiently.
