About the job
Join us at Moonlake, where we leverage AI to craft immersive world simulations that push the boundaries of technology.
Role Overview
Enhancing Training Efficiency
Implement advanced dataloaders, fusion techniques, activation rematerialization, and gradient checkpointing strategies.
Utilize FSDP/ZeRO/tensor+pipeline parallelism and fine-tune NCCL settings for optimal performance.
Boosting GPU and Kernel Performance
Conduct Nsight profiling and develop Triton/CUDA kernels along with fused operations.
Implement flash-attention-style optimizations, sequence packing, and KV-cache improvements.
Optimizing Inference Processes
Build low-latency serving with continuous batching and speculative decoding.
Apply quantization (GPTQ/AWQ), model distillation, and pruning.
Infrastructure and Reliability Enhancements
Manage SLURM/K8s multi-node jobs and ensure checkpoint hygiene.
Focus on determinism, environment pinning, and effective GPU failure management.
We pride ourselves on being an on-site, collaborative team located in San Mateo.
