About the Role
We are seeking a dedicated Machine Learning Engineer specializing in training optimization to join our team at Featherless AI. In this role, you will play a pivotal part in improving and scaling large-scale model training. You will bridge research and production, optimizing training pipelines for efficiency, speed, and cost, while working closely with our research team to advance model architecture and capabilities.
This position offers significant impact and ownership; your contributions will directly influence our iteration speed, scalability, and the efficiency of our model deployments.
What You’ll Do
- Enhance large-scale model training pipelines, focusing on throughput, convergence, stability, and cost.
- Refine distributed training strategies, including data, model, and pipeline parallelism.
- Tune optimizers, schedulers, batch sizing, and precision settings (bf16 / fp16 / fp8).
- Minimize training duration and computational costs through profiling, bottleneck analysis, and system-level enhancements.
- Collaborate with researchers to implement architecture-aware training methods.
- Develop and maintain robust training infrastructure, ensuring checkpointing, fault tolerance, and reproducibility.
- Evaluate and adopt new training techniques, such as gradient checkpointing, ZeRO, FSDP, and custom kernels.
- Track training performance metrics and drive continuous improvement.
What We’re Looking For
- Extensive experience training large neural networks, particularly LLMs or models of comparable scale.
- Hands-on expertise in optimizing training itself, not just in applying existing models.
- A solid foundation in backpropagation, optimization algorithms, and training dynamics.
- Knowledge of distributed systems relevant to ML training.
- Proficiency with PyTorch is essential.
- Comfort working close to the hardware, including GPU, memory, and networking constraints.
- The ability to move fluidly between research ideas and production-ready implementations.
Nice to Have
- Experience with large-scale distributed training setups, including multi-node and multi-GPU configurations.
- Familiarity with tools like DeepSpeed, FSDP, Megatron, or bespoke training stacks.
- Background in optimizing training processes for high-performance computing environments.

