Research Engineer for LLM Training and Performance

JetBrains s.r.o.
Amsterdam, Netherlands; Berlin, Germany; Limassol, Cyprus; London, United Kingdom; Munich, Germany; Paphos, Cyprus; Prague, Czech Republic; Warsaw, Poland; Yerevan, Armenia
On-site, full-time

Experience Level

Mid to Senior

About the job

At JetBrains, we are passionate about code. Since our inception in 2000, we have dedicated ourselves to creating the most powerful and effective developer tools available. By automating routine checks and corrections, our tools accelerate production processes, allowing developers to innovate and create freely.

We are seeking a Research Engineer to take ownership of the training stack and model architecture for our Mellum LLM family. The role is challenging but rewarding: your mission is to make large-scale training faster, cheaper, and more stable. You will profile, design, and implement changes to the training pipeline, from the model architecture down to custom GPU kernels where necessary.

As a valuable member of our team, you will:

  • Enhance end-to-end performance for multi-node LLM pre-training and post-training pipelines.
  • Identify performance bottlenecks using tools like Nsight Systems/Compute and resolve them through compute/communication overlap, kernel fusion, scheduling, etc.
  • Design and assess architectural choices, including depth/width, attention variants such as GQA/MQA/MLA/Flash-style, RoPE scaling/NTK, and MoE routing/load-balancing.
  • Develop custom operations with Triton and/or CUDA, integrate them as PyTorch extensions, and contribute upstream when feasible (see the Triton sketch after this list).
  • Extract memory and throughput gains using FSDP/ZeRO, activation checkpointing, FP8 via Transformer Engine, tensor/pipeline/sequence/expert parallelism, and NCCL tuning.
  • Ensure robustness in large-scale runs by constructing elastic and fault-tolerant training setups, strengthening checkpointing, enhancing reproducibility, and improving resistance to preemption.
  • Maintain a fast data path through streaming and sharded data loaders as well as tokenizer pipelines, while improving overall throughput and cache efficiency.
  • Establish appropriate metrics, create dashboards, and consistently drive improvements.
  • Efficiently run both pre-training and post-training processes (including SFT, RLHF, and GRPO-style methods) across substantial clusters.
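
To give a concrete flavor of the kernel work described above, here is a minimal sketch of a custom Triton op integrated with PyTorch. The fused bias+ReLU operation, the kernel name, and the block size are illustrative assumptions, not JetBrains code:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def fused_bias_relu_kernel(x_ptr, b_ptr, out_ptr, n_elements, n_cols,
                               BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        # Broadcast the per-column bias across rows of the flattened tensor.
        b = tl.load(b_ptr + (offsets % n_cols), mask=mask)
        # Fuse the bias add and ReLU into a single memory pass.
        tl.store(out_ptr + offsets, tl.maximum(x + b, 0.0), mask=mask)

    def fused_bias_relu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
        assert x.is_cuda and x.is_contiguous() and bias.numel() == x.shape[-1]
        out = torch.empty_like(x)
        n = x.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        fused_bias_relu_kernel[grid](x, bias, out, n, x.shape[-1], BLOCK_SIZE=1024)
        return out

    # Sanity check against the eager PyTorch reference:
    #   x = torch.randn(4096, 4096, device="cuda"); b = torch.randn(4096, device="cuda")
    #   torch.testing.assert_close(fused_bias_relu(x, b), torch.relu(x + b))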

We’re excited to welcome you on board if you possess:

  • Expertise in PyTorch and PyTorch Distributed, with experience running multi-node jobs using dozens to hundreds of GPUs.
  • Hands-on experience with Megatron-LM/Megatron-Core/NeMo, DeepSpeed, or substantial FSDP/ZeRO knowledge.
  • Strong profiling skills (Nsight Systems/Compute, nvprof) and familiarity with NVTX-instrumented workflows (a minimal sketch follows this list).
  • Proficiency in GPU programming with Triton and/or CUDA, including the ability to write, test, and debug kernels.
  • A solid grasp of NCCL collectives, as well as topology and fabric effects (IB/RoCE), and their implications in performance traces.
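
As a minimal illustration of the NVTX-instrumented workflows mentioned above, the sketch below brackets the phases of a generic PyTorch training step with named ranges so they appear as labeled spans in an Nsight Systems timeline. The model, batch, and optimizer are placeholders, not the Mellum stack:

    import torch

    def train_step(model, batch, optimizer):
        torch.cuda.nvtx.range_push("forward")
        loss = model(batch)  # placeholder forward pass returning a scalar loss
        torch.cuda.nvtx.range_pop()

        torch.cuda.nvtx.range_push("backward")
        loss.backward()
        torch.cuda.nvtx.range_pop()

        torch.cuda.nvtx.range_push("optimizer")
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.nvtx.range_pop()
        return loss

    # Bracket the steps of interest with torch.cuda.profiler.start()/stop() and
    # capture only that window with:
    #   nsys profile --capture-range=cudaProfilerApi python train.py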

About JetBrains s.r.o.

JetBrains is a leading software development company known for creating advanced tools for software developers. With a commitment to improving productivity through innovative solutions, JetBrains has established itself as a pioneer in developer tooling.
