About Liquid AI
Spun out of MIT CSAIL, Liquid AI builds general-purpose AI systems designed to run seamlessly across platforms, from data center accelerators to on-device hardware. Our focus is on low latency, efficient memory use, privacy, and reliability. We work with organizations across sectors including consumer electronics, automotive, life sciences, and financial services. As we grow rapidly, we are looking for outstanding people to join our mission.
The Opportunity
The Training Infrastructure team builds the distributed systems that power our next-generation Liquid Foundation Models. As our training runs scale, we need to design, build, and improve the infrastructure critical to large-scale training.
This role centers on high ownership of training systems, with an emphasis on runtime, performance, and reliability, rather than a typical platform or SRE function. You will work within a small, agile team, building critical systems from the ground up rather than operating pre-existing infrastructure.
While San Francisco and Boston are preferred, we are open to other locations.
What We're Looking For
We are seeking an individual who:
- Embraces the complexity of distributed systems: Our team keeps long training runs stable, debugs training failures across GPU clusters, and improves end-to-end performance.
- Is passionate about building: We value team members who take pride in developing robust, efficient, and reliable infrastructure.
- Excels amid uncertainty: Our systems must support evolving model architectures. You will make decisions with incomplete information and iterate quickly.
- Aligns with team goals and delivers results: The best engineers on our team commit to collective priorities while pushing back with data when something isn't working.
The Work
- Design and build core systems that keep large training runs fast and reliable.
- Create scalable distributed training infrastructure for GPU clusters.
- Implement and refine parallelism and sharding strategies for evolving architectures.
- Optimize distributed efficiency through topology-aware collectives, communication/compute overlap, and straggler mitigation.
- Develop data loading systems to eliminate I/O bottlenecks for multimodal datasets.

