About the job
At Crusoe, we are on a mission to revolutionize the accessibility of energy and intelligence. Join us in shaping a future where ambitious AI creations flourish without compromising on scale, speed, or sustainability.
As a key player in the AI revolution, you will engage with cutting-edge sustainable technologies at Crusoe. Here, you will foster meaningful innovation, generate significant impact, and collaborate with a forward-thinking team dedicated to advancing responsible cloud infrastructure.
About This Position:
As part of Crusoe Energy Systems, our Site Reliability Engineering (SRE) team holds a pivotal role in ensuring the performance and reliability of our AI-optimized cloud infrastructure. This Storage-focused SRE position is crucial for guaranteeing the availability, performance, and scalability of Crusoe’s cloud storage solutions, which serve compute-intensive and latency-sensitive workloads in AI and High-Performance Computing (HPC) contexts. Your contributions will directly enhance our vertically integrated sustainable cloud platform by developing and fine-tuning distributed, fault-tolerant storage systems at scale.
Your Responsibilities Will Include:
Create automation and self-healing tools to oversee and maintain Crusoe’s distributed cloud storage infrastructure, spanning block, file, and object storage systems.
Lead reliability initiatives focused on data replication, encryption, backup and recovery strategies, and effective failover mechanisms.
Work closely with storage engineers to implement and maintain high-performance NVMe- and SSD-backed volumes that support large-scale AI compute clusters.
Ensure that user-facing storage services meet their performance and availability targets and stay within their error budgets.
Investigate and resolve storage-related incidents using in-depth telemetry, logs, and performance profiling.
Partner with hardware and kernel teams to identify low-level I/O issues and optimize I/O paths, caching policies, and file systems.
Assist in architecting fault-tolerant, scalable storage backends designed specifically for AI-first cloud environments.
Qualifications:
A minimum of 5 years of professional experience in Site Reliability Engineering, systems engineering, or storage engineering.
Hands-on experience with distributed storage systems (e.g., Ceph, GlusterFS, OpenEBS) along with a comprehensive understanding of object, block, and file storage paradigms.
Proficiency in programming languages such as Python, Go, Java, or C.

