About the job
Join us at Featherless AI as an AI Researcher specializing in inference optimization. In this pivotal role, you will design, evaluate, and implement state-of-the-art inference systems for large-scale machine learning models. Your expertise will bridge model architecture, systems engineering, and hardware-aware optimization, with a focus on reducing latency, increasing throughput, and improving cost efficiency in real-world production settings.
Key Responsibilities
Investigate and develop strategies to improve inference performance for large neural networks.
Reduce latency, memory footprint, and cost per inference while increasing throughput.
Design and assess model-level optimizations such as quantization, pruning, KV-cache optimization, and architecture-aware simplifications.
Implement systems-level optimizations such as dynamic batching, kernel fusion, multi-GPU inference, and prefill vs. decode optimization.
Benchmark inference workloads across diverse hardware accelerators.
Collaborate with engineering teams to launch optimized inference pipelines.
Transform research findings into production-ready enhancements.
Required Qualifications
Extensive background in machine learning, deep learning, or AI systems.
Proven experience in optimizing inference for large-scale models.
Proficiency in Python and contemporary ML frameworks (e.g., PyTorch).
Familiarity with inference tools such as Triton, TensorRT, vLLM, or ONNX Runtime.
Ability to design experiments and communicate results effectively.
Preferred Qualifications
Experience in deploying production inference systems at scale.
Understanding of distributed and multi-GPU inference.
Contributions to open-source ML or inference frameworks are a plus.
Authorship or co-authorship of peer-reviewed papers in machine learning, systems, or related domains.
