About the job
About Cantina:
Cantina Labs is an innovative social AI company dedicated to developing cutting-edge real-time models that redefine expression, personality, and realism. Our mission is to animate characters and revolutionize storytelling, connections, and creativity. Our flagship platform, Cantina, is just the beginning of a transformative journey in social AI.
If you are passionate about harnessing AI to enhance human creativity and social interactions, we invite you to join us in shaping the future!
About the Role:
We are seeking an Applied Machine Learning Engineer with extensive hands-on experience in building large-scale video generation models, from data collection and training to distillation and acceleration into production-ready models. Our models are designed to be human-centric and product-oriented: envision interactive characters that can respond to text, audio, and image inputs while generating video with minimal latency.
This role combines applied research and engineering: you will focus on training runs, data management, model optimization, and the crucial process of transforming a capable research model into a real-time experience.
Typical time allocation (approximately):
60–75% dedicated to training, fine-tuning, and distillation of large video models.
15–25% focused on inference optimization (latency, memory, cost) and model runtime enhancements.
10–15% allocated to prototyping and product integration (transitioning demos into shipped features).
Your Responsibilities:
Train and scale video generation models: Execute large-scale training and fine-tuning in multi-GPU (and, as needed, multi-node) environments; own the training loop, maintain training stability and checkpointing, and improve iteration speed.
Manage video modeling data: Develop and enhance video datasets and pipelines (including decoding, sampling, filtering, quality control, conditioning alignment, and storage formats), ensuring the pipeline remains efficient and reliable at scale.
Distill and compress large models into efficient ones: Implement teacher-student distillation, reduce sampling steps, simplify architectures, and balance quality against speed to meet real-time constraints.
Achieve real-time model performance: Profile models, optimize memory usage, apply quantization-aware techniques where appropriate, improve kernels and runtimes, and deliver practical throughput and latency gains.
