
Staff Software Engineer - GenAI Performance and Kernel

Databricks · San Francisco, California
On-site · Full-time · $190.9K/yr - $232.8K/yr



Experience Level

Mid to Senior

Qualifications

What We Look For

  • BS/MS/PhD in Computer Science or a related field.
  • Substantial hands-on experience writing and tuning compute kernels (CUDA, Triton, OpenCL, LLVM IR, assembly, or similar) for machine learning workloads.
  • In-depth understanding of GPU and accelerator architecture, including warp structure, the memory hierarchy (global, shared, register, L1/L2 caches), tensor cores, scheduling, and SM occupancy.
  • Experience with advanced optimization techniques, including tiling, blocking, software pipelining, vectorization, fusion, loop transformations, and auto-tuning.
  • Familiarity with machine learning-specific kernel libraries (cuBLAS, cuDNN, etc.) is preferred.
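To make the "tiling/blocking" qualification above concrete: the idea is to process a computation in sub-blocks sized to fit fast memory (shared memory or cache) before moving on, rather than streaming the whole operand. A minimal conceptual sketch in plain NumPy (an illustration added for clarity, not part of the posting; a real GPU kernel would express the same loop structure in CUDA or Triton):

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: accumulate C one tile x tile sub-block
    at a time, so each working set of A, B, and C sub-blocks can stay
    resident in fast memory (shared memory / L1 on a GPU)."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Accumulate one output tile from one pair of input tiles.
                # NumPy slicing handles ragged edge tiles automatically.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C
```

The result is bit-for-bit the same computation as `A @ B`; only the traversal order changes, which is exactly what tiling buys on real hardware: reuse of loaded operands before eviction.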

About the job

P-1285

About This Role

Join our team at Databricks as a Staff Software Engineer specializing in GenAI Performance and Kernel. In this pivotal role, you will design, implement, and optimize the high-performance GPU kernels that drive our GenAI inference stack. You will lead the development of finely tuned, low-level compute paths that balance hardware efficiency with versatility, while mentoring fellow engineers in the intricacies of kernel-level performance engineering. Collaborating closely with machine learning researchers, systems engineers, and product teams, you will advance the state of the art in inference performance at scale.

What You Will Do

  • Lead the design, implementation, benchmarking, and maintenance of essential compute kernels (such as attention, MLP, softmax, layernorm, memory management) tailored for diverse hardware backends (GPU, accelerators).
  • Steer the performance roadmap for kernel-level enhancements, focusing on areas like vectorization, tensorization, tiling, fusion, mixed precision, sparsity, quantization, memory reuse, scheduling, and auto-tuning.
  • Integrate kernel optimizations seamlessly with higher-level machine learning systems.
  • Develop and uphold profiling, instrumentation, and verification tools to identify correctness, performance regressions, numerical discrepancies, and hardware utilization inefficiencies.
  • Conduct performance investigations and root-cause analyses to address inference bottlenecks, such as memory bandwidth, cache contention, kernel launch overhead, and tensor fragmentation.
  • Create coding patterns, abstractions, and frameworks to modularize kernels for reuse, cross-backend compatibility, and maintainability.
  • Influence architectural decisions to enhance kernel efficiency (including memory layout, dataflow scheduling, and kernel fusion boundaries).
  • Guide and mentor fellow engineers focused on lower-level performance, conducting code reviews and establishing best practices.
  • Collaborate with infrastructure, tooling, and machine learning teams to implement kernel-level optimizations in production and assess their impacts.
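Several of the responsibilities above (fusion, memory reuse, avoiding intermediate traffic) come down to the same pattern: combining several elementwise passes into one so intermediates never leave registers or shared memory. A hedged NumPy sketch of a numerically stable row-wise softmax, one of the kernels the role names explicitly (illustrative only; on a GPU the max, exp, and sum would be fused into a single kernel to avoid writing intermediates to global memory):

```python
import numpy as np

def softmax_rowwise(x):
    """Numerically stable row-wise softmax. Subtracting the row max keeps
    exp() from overflowing; a fused GPU kernel performs the max, exp, and
    sum in one pass per row, keeping intermediates in registers rather
    than materializing them in global memory."""
    m = x.max(axis=-1, keepdims=True)   # pass 1: row maxima
    e = np.exp(x - m)                   # shifted exponentials stay in range
    return e / e.sum(axis=-1, keepdims=True)
```

Without the max subtraction, inputs on the order of a few hundred already overflow float32; the stable form returns finite probabilities for arbitrarily large logits, which is why verification tooling (also listed above) typically checks fused kernels for exactly this kind of numerical discrepancy.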

About Databricks

Databricks is at the forefront of innovation, enabling organizations to harness the power of data and artificial intelligence. Our cutting-edge platform integrates data engineering, machine learning, and analytics, empowering teams to collaborate and drive transformational results.
