
Machine Learning Engineer - Inference Optimization

Featherless AI
Remote (worldwide) · Full-time



About the Role

We are seeking a passionate Machine Learning Engineer to lead the improvement of model inference performance at scale. In this role, you will bridge theoretical research and practical application, turning cutting-edge models into efficient, scalable, and user-centric systems.

This position is perfect for individuals who thrive in a technically challenging environment, enjoy in-depth system profiling down to the kernel and GPU levels, and excel at converting innovative research ideas into production-ready performance improvements.

What You’ll Do

  • Improve inference latency, throughput, and cost-efficiency for large-scale ML models deployed in production

  • Analyze and troubleshoot GPU/CPU inference pipelines, focusing on memory, kernel, batching, and I/O performance

  • Implement and optimize techniques including:

    • Quantization strategies (fp16, bf16, int8, fp8)

    • KV-cache optimization and reuse

    • Speculative decoding, batching, and streaming

    • Model pruning and architectural simplifications for optimized inference

  • Collaborate closely with research engineers to transition novel model architectures to production

  • Build and maintain inference-serving systems using frameworks such as Triton, custom runtimes, or bespoke stacks

  • Benchmark performance across various hardware setups (NVIDIA / AMD GPUs, CPUs) and cloud configurations

  • Enhance the reliability, observability, and cost efficiency of systems under real workload conditions
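To give a flavor of the quantization work listed above, here is a minimal sketch of symmetric int8 quantization in plain Python, round-tripping a set of weights through 8-bit integers. The function names are illustrative only, not part of any Featherless stack:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.0, -0.99]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding error is bounded by half a quantization step (scale / 2)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

In production, per-channel scales, calibration data, and hardware-specific int8 kernels replace this toy per-tensor scheme, but the core trade-off (memory and bandwidth savings versus bounded rounding error) is the same.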

What We’re Looking For

  • Significant experience in ML inference optimization or high-performance machine learning systems

  • Strong grasp of deep learning fundamentals (attention mechanisms, memory architecture, compute graphs)

  • Practical experience with PyTorch (or similar frameworks) and model deployment techniques

  • Familiarity with GPU performance enhancements (CUDA, ROCm, Triton, or kernel-level optimizations)

  • Proven capability in scaling inference systems for real-world users beyond research benchmarks

  • Comfort working in a fast-paced startup environment, taking ownership and navigating ambiguity

Preferred Qualifications

  • Experience with LLM or long-context model inference

  • Knowledge of various inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)

About Featherless AI

Join Featherless AI, a company working at the forefront of machine learning, where we push the boundaries of what's possible in AI. Our mission is to build impactful solutions that enhance user experiences and drive efficiency across industries.
