About the job
Crusoe is on a mission to revolutionize the way we access and utilize energy and intelligence. We are building the infrastructure that empowers a future where ambitious AI-driven projects can thrive without compromising on scale, speed, or sustainability.
Join us at Crusoe and be part of the AI revolution through sustainable technology. Here, you will spearhead significant innovations, create a lasting impact, and collaborate with a team committed to delivering responsible and transformative cloud infrastructure.
About This Role:
As a Site Reliability Engineer (SRE) at Crusoe, you will be integral to maintaining the reliability and performance of our cutting-edge infrastructure. Our SRE team focuses on identifying, analyzing, and mitigating issues so that we meet our Service Level Agreements (SLAs), using well-defined Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure and track reliability. By automating processes and proactively addressing potential problems, you will help ensure that our systems run seamlessly, and you will advise engineering teams on best practices for writing resilient code. Your role will involve anticipating issues before they affect our customers, conducting comprehensive post-mortems, and promoting continuous improvement to maintain the highest reliability standards for Crusoe’s AI platform. The ideal candidate possesses a solid foundation in SRE practices, distributed systems, networking, and Linux, along with a passion for automation and problem-solving. This is a full-time position.
What You’ll Be Working On:
Automation and Tool Development: Streamline routine processes and enhance Crusoe’s internal infrastructure platform, allowing software teams to operate effectively without needing in-depth knowledge of the operating system, hardware, or network.
Collaboration and Planning: Join daily stand-up meetings with the team to review projects, recent incidents, and priorities. Collaborate on strategies for launching new data centers or upgrading existing ones. Work closely with software engineers to ensure the adoption of resilient coding practices and review changes before deployment.
System Monitoring and Alerting: Review overnight alerts and performance metrics to confirm that systems are operating as expected. Evaluate system logs and develop new tools to enhance our monitoring capabilities.
Incident Response and Problem Solving: Participate in incident response simulations, post-mortems, and root cause analysis sessions to extract valuable lessons from past issues.
