About the Role
We are seeking an AI Researcher specializing in training optimization to improve the efficiency, stability, and scalability of large-scale model training. The role sits at the intersection of research and systems: you will develop techniques that lower training costs, speed up convergence, and improve model quality, and you will validate them through rigorous experiments and publications.
This position suits people who enjoy turning research insights into concrete training improvements and who have a proven track record of, or a strong interest in, publishing applied machine learning research.
Key Responsibilities
- Design and assess training optimization techniques for large-scale models, including optimization algorithms, schedulers, normalization methods, and curriculum strategies.
- Improve training efficiency and stability for long training runs and very large datasets.
- Research and implement advanced methods such as:
  - Innovations in optimizers and schedulers
  - Mixed-precision, low-precision, and memory-efficient training solutions
  - Gradient noise reduction, scaling laws, and convergence analysis
  - Training-time regularization and robustness techniques
- Conduct large-scale experiments, analyze the results, and convert findings into practical improvements.
- Author or co-author research papers, technical reports, or blog articles.
- Collaborate closely with infrastructure and inference teams to ensure that training decisions yield real-world performance benefits.
Qualifications
- Strong foundation in machine learning research, particularly in training dynamics and optimization.
- Hands-on experience training large neural networks (LLMs, multimodal models, or other large sequence models).
- Publications in reputable ML venues (e.g., NeurIPS, ICML, ICLR, ACL, EMNLP, COLM, arXiv) or equivalent high-quality open research.
- Solid understanding of:
  - Optimization theory and its practical applications
  - Backpropagation, gradient flow, and training stability
  - Distributed and large-batch training processes

