About the Role
We are seeking a dedicated Machine Learning Engineer specializing in training optimization to join our team at Featherless AI. In this role, you will play a pivotal part in improving and scaling large-scale model training. You will bridge research and production, optimizing training pipelines for efficiency, speed, and cost, while working closely with our research team to advance model architecture and capabilities.
This position offers significant impact and ownership; your contributions will directly influence our iteration speed, scalability, and the efficiency of our model deployments.
What You’ll Do
- Enhance large-scale model training pipelines, focusing on throughput, convergence, stability, and cost.
- Refine distributed training strategies, including data, model, and pipeline parallelism.
- Tune optimizers, schedulers, batch sizing, and precision settings (bf16 / fp16 / fp8).
- Minimize training duration and computational costs through profiling, bottleneck analysis, and system-level enhancements.
- Collaborate with researchers to implement architecture-aware training methods.
- Develop and maintain robust training infrastructure, ensuring checkpointing, fault tolerance, and reproducibility.
- Evaluate and adopt new training techniques, such as gradient checkpointing, ZeRO, FSDP, and custom kernels.
- Track training performance metrics and drive continuous improvement.
What We’re Looking For
- Extensive experience training large neural networks, particularly LLMs or models of comparable scale.
- Hands-on expertise in optimizing training itself, not just in applying existing models.
- A solid foundation in backpropagation, optimization algorithms, and training dynamics.
- Knowledge of distributed systems relevant to ML training.
- Proficiency with PyTorch is essential.
- Comfort working close to the hardware, including GPU, memory, and networking constraints.
- The ability to move fluidly between research ideas and production-ready implementations.
Nice to Have
- Experience with large-scale distributed training setups, including multi-node and multi-GPU configurations.
- Familiarity with tools like DeepSpeed, FSDP, Megatron, or bespoke training stacks.
- Background in optimizing training processes for high-performance computing environments.

