About the Role
We are seeking a passionate Machine Learning Engineer to lead the effort to improve model inference performance at scale. In this role, you will bridge the gap between research and production, turning cutting-edge models into efficient, scalable, user-facing systems.
This position is a great fit for engineers who thrive on technical challenge, enjoy profiling systems down to the kernel and GPU level, and excel at turning research ideas into production-ready performance gains.
What You’ll Do
Improve inference latency, throughput, and cost efficiency for large-scale ML models deployed in production
Analyze and troubleshoot GPU/CPU inference pipelines, focusing on memory, kernels, batching, and I/O performance
Implement and optimize techniques including:
Quantization strategies (fp16, bf16, int8, fp8); a minimal code sketch follows this list
KV-cache optimization and reuse
Speculative decoding, batching, and streaming
Model pruning and architectural simplifications for faster inference
Collaborate closely with research engineers to transition novel model architectures to production
Build and maintain inference-serving systems using frameworks such as Triton, custom runtimes, or bespoke serving stacks
Benchmark performance across various hardware setups (NVIDIA / AMD GPUs, CPUs) and cloud configurations; see the timing sketch after this list
Improve the reliability, observability, and cost efficiency of serving systems under real production workloads
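To give a flavor of the quantization work above, here is a minimal sketch using stock PyTorch APIs on a hypothetical toy MLP (the model and sizes are illustrative, not ours). It covers fp16/bf16 weight casting and dynamic int8 quantization; fp8 typically requires vendor-specific support (e.g., NVIDIA Transformer Engine or TensorRT) rather than this built-in path.

    import copy
    import torch
    import torch.nn as nn

    # Toy stand-in for a transformer MLP block (illustrative only).
    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
    ).eval()

    # fp16 / bf16: casting weights halves memory traffic; .half() and .to()
    # mutate the module in place, hence the deep copies.
    model_fp16 = copy.deepcopy(model).half()
    model_bf16 = copy.deepcopy(model).to(torch.bfloat16)

    # int8: dynamic quantization swaps nn.Linear for an int8-weight version
    # (the CPU inference path in stock PyTorch).
    model_int8 = torch.ao.quantization.quantize_dynamic(
        copy.deepcopy(model), {nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(8, 1024)
    with torch.inference_mode():
        y_fp16 = model_fp16(x.half())
        y_bf16 = model_bf16(x.to(torch.bfloat16))
        y_int8 = model_int8(x)  # dynamic quant accepts fp32 input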
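Similarly, a minimal sketch of the kind of latency/throughput benchmarking mentioned above, assuming a CUDA device; the helper name, model, and iteration counts are illustrative.

    import time
    import torch
    import torch.nn as nn

    def benchmark(model, x, iters=100, warmup=10):
        """Rough steady-state latency/throughput on a CUDA device."""
        model, x = model.cuda().eval(), x.cuda()
        with torch.inference_mode():
            for _ in range(warmup):       # warm up kernels and the allocator
                model(x)
            torch.cuda.synchronize()      # GPU work is async; fence before timing
            start = time.perf_counter()
            for _ in range(iters):
                model(x)
            torch.cuda.synchronize()
            elapsed = time.perf_counter() - start
        latency_ms = elapsed / iters * 1e3
        throughput = x.shape[0] * iters / elapsed   # samples per second
        return latency_ms, throughput

    # Example: compare an fp16 cast against the fp32 baseline.
    model = nn.Linear(1024, 1024)
    x = torch.randn(32, 1024)
    print(benchmark(model, x))
    print(benchmark(model.half(), x.half()))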
What We’re Looking For
Significant experience in ML inference optimization or high-performance machine learning systems
Strong grasp of deep learning fundamentals (attention mechanisms, memory architecture, compute graphs)
Practical experience with PyTorch (or similar frameworks) and model deployment techniques
Familiarity with GPU performance work (CUDA, ROCm, Triton, or kernel-level optimization)
Proven ability to scale inference systems for real-world users, beyond research benchmarks
Comfort in a fast-paced startup environment, with a bias for ownership and ease navigating ambiguity
Preferred Qualifications
Experience with LLM or long-context model inference
Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton Inference Server)
