AI Research Engineer - Scaling at 1X | Palo Alto, CA
1X · Palo Alto, California, United States · On-site · Full-time · $180K/yr - $300K/yr
Key Responsibilities
- Lead the scaling of distributed training and inference systems
- Optimize compute resources to prioritize data as the primary constraint
- Facilitate extensive training runs (1,000+ GPUs) using robot-generated data, ensuring robust fault tolerance and effective experiment tracking
- Enhance inference throughput for datacenter applications, including world models and diffusion engines
- Minimize latency and improve performance for on-device robot policies using techniques such as quantization, scheduling, and distillation

Essential Qualifications
- Proficient programming skills in Python and/or C++
- In-depth understanding of training and inference performance bottlenecks and scaling principles
- A foundational belief in the importance of scalability in humanoid robotics
- Bachelor's degree in Computer Science or a related field
- Experience with distributed training frameworks (e.g., TorchTitan, DeepSpeed, FSDP/ZeRO) and multi-node debugging
- Demonstrated ability to optimize inference performance using graph compilers, batching/scheduling, and serving systems such as TensorRT
- Familiarity with quantization methods (PTQ, QAT, INT8/FP8) and associated tools
- Experience developing or optimizing CUDA or Triton kernels, with a focus on hardware-level optimization techniques
About the job
AI Research Engineer, Scaling | Infrastructure
Location: Palo Alto, CA (on-site)
At 1X, we are pioneering the development of humanoid robots designed to collaborate with humans, addressing labor shortages and fostering abundance across various sectors.
The Role: As an AI Research Engineer specializing in Scaling, you will architect and implement robust infrastructure for large-scale training, evaluation, and deployment across our fleet of robots. Your contributions will be essential in transitioning experimental systems into production-ready platforms optimized for throughput, latency, and overall performance in both datacenter and edge environments. This role directly shapes the efficiency of learning and inference, and with it the capabilities of our general-purpose humanoid robots.