Staff Software Engineer, ML Performance & Systems

fal, San Francisco

On-site Full-time $180K/yr - $250K/yr

Experience Level

Mid to Senior

Qualifications

Key Responsibilities:
- Support fal in maintaining its leading position in model performance for generative media models.
- Design and implement cutting-edge approaches to model serving architecture on our in-house inference engine, emphasizing throughput maximization while minimizing latency and resource use.
- Develop tools for performance monitoring and profiling to identify bottlenecks and areas for optimization.
- Work closely with our Applied ML team and media sector clients to ensure their workloads benefit from our accelerator.

Requirements:
- Solid foundation in systems programming with a keen ability to identify and resolve bottlenecks.
- In-depth knowledge of advanced ML infrastructure, including technologies such as PyTorch, TensorRT, TransformerEngine, and Nsight, encompassing model compilation, quantization, and serving architectures.
- Strong understanding of underlying hardware (currently Nvidia-based systems), with the ability to delve deeper into the stack to fix issues, including custom GEMM kernels with CUTLASS for common shapes.
- Proficiency in Triton or a willingness to learn, along with comparable experience in lower-level accelerator programming.
- Experience with multi-dimensional model parallelism, integrating various parallelism techniques such as tensor parallelism and context/sequence parallelism.
- Familiarity with the internals of Ring Attention, FA3, and FusedMLP implementations.

About the job

Join fal in our pursuit to maintain a leading edge in model performance for generative media models. You'll be instrumental in designing and implementing innovative solutions for model serving architecture, built on our proprietary inference engine. Your focus will be on maximizing throughput while minimizing latency and resource consumption. You will also create performance monitoring and profiling tools to identify bottlenecks and optimization opportunities, and you will collaborate closely with our Applied ML team and clients in the media sector to ensure their workloads use our accelerator effectively.

About fal

fal is at the forefront of innovation in generative media models, continually advancing our technologies to deliver exceptional model performance. We pride ourselves on fostering a collaborative environment where creative minds can thrive and contribute to groundbreaking projects.
