About the job
About Periodic Labs
Periodic Labs is at the forefront of scientific innovation, combining artificial intelligence with advanced physical sciences to drive breakthroughs in materials, energy, and other fields. As a rapidly growing company backed by leading investors, we operate with the urgency and agility required to tackle the challenges of tomorrow. Our team is characterized by deep expertise and an unrelenting commitment to pushing the boundaries of scientific achievement.
About the Role
As an HPC Engineer, you will play a pivotal part in designing, building, and managing the high-performance computing infrastructure that drives our AI and scientific research initiatives. Our models require extensive computational power: large GPU and CPU clusters, high-speed interconnects, low-latency parallel storage, and sophisticated workload schedulers to maximize efficiency. You will collaborate closely with researchers and infrastructure engineers to ensure that our computing environment is both fast and reliable, optimized for cutting-edge scientific discovery.
In this hands-on position, you will architect and fine-tune systems, automate provisioning, troubleshoot performance bottlenecks, and design resilient solutions at scale. You will work alongside research and machine learning teams to understand their computational needs and create an HPC environment that enhances productivity and accelerates scientific progress.
What You’ll Do
Design, deploy, and manage large-scale GPU and CPU clusters tailored for AI training, scientific simulation, and research workloads
Optimize high-speed interconnect fabrics (InfiniBand, RoCE) and parallel filesystems (Lustre, GPFS, WEKA, or similar) for peak performance
Oversee workload scheduling and resource management using Slurm, Kubernetes, or comparable systems, focusing on throughput and fairness
Implement and maintain automated cluster provisioning and configuration management using tools like Ansible and Terraform
Monitor cluster health and performance; develop dashboards and alerts to proactively address issues
Collaborate with research and ML teams to analyze workloads, resolve performance challenges, and optimize systems
Create and manage backup, disaster recovery, and fault-tolerance strategies for critical research data and computational infrastructure
Assess and integrate new hardware, including GPUs and accelerators, to enhance computational capabilities
