About the job
At JetBrains, we are passionate about code. Since our inception in 2000, we have dedicated ourselves to creating the most powerful and effective developer tools available. By automating routine checks and corrections, our tools accelerate production processes, allowing developers to innovate and create freely.
We are seeking a Research Engineer to take ownership of the training stack and model architecture for our Mellum LLM family. This role is challenging yet rewarding: your mission is to enhance the speed, cost-efficiency, and stability of large-scale training. You will profile, design, and implement modifications to the training pipeline, from architecture to custom GPU kernels as necessary.
As a member of our team, you will:
- Enhance end-to-end performance for multi-node LLM pre-training and post-training pipelines.
- Identify performance bottlenecks using tools like Nsight Systems/Compute and resolve them through compute/communication overlap, kernel fusion, scheduling, etc.
- Design and assess architectural choices, including depth/width trade-offs, attention variants (GQA/MQA/MLA) and Flash-style kernels, RoPE scaling/NTK, and MoE routing/load-balancing.
- Develop custom operations with Triton and/or CUDA, integrate with PyTorch extensions, and contribute upstream when feasible.
- Improve memory efficiency and throughput using FSDP/ZeRO, activation checkpointing, FP8 with Transformer Engine, tensor/pipeline/sequence/expert parallelism, and NCCL tuning.
- Ensure robustness in large-scale runs by constructing elastic and fault-tolerant training setups, strengthening checkpointing, enhancing reproducibility, and improving resistance to preemption.
- Maintain a fast data path through streaming and sharded data loaders as well as tokenizer pipelines, while improving overall throughput and cache efficiency.
- Establish appropriate metrics, create dashboards, and consistently drive improvements.
- Efficiently run both pre-training and post-training processes (including SFT, RLHF, and GRPO-style methods) across substantial clusters.
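To give a flavor of the architectural work above, here is a minimal sketch of grouped-query attention (GQA), where several query heads share each key/value head to shrink the KV cache. This is an illustrative toy, not JetBrains' Mellum code; shapes and head counts are arbitrary, and a production kernel would fuse this rather than materialize the expanded K/V tensors.

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, n_query_heads, n_kv_heads):
    """Grouped-query attention: each K/V head serves a group of query heads.

    q:    (batch, n_query_heads, seq, head_dim)
    k, v: (batch, n_kv_heads,   seq, head_dim)
    """
    assert n_query_heads % n_kv_heads == 0
    group = n_query_heads // n_kv_heads
    # Expand each K/V head across its group of query heads.
    # (A fused kernel would index instead of materializing these copies.)
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v)

batch, seq, head_dim = 2, 16, 64
nq, nkv = 8, 2  # 8 query heads share 2 K/V heads; MQA is the nkv=1 case
q = torch.randn(batch, nq, seq, head_dim)
k = torch.randn(batch, nkv, seq, head_dim)
v = torch.randn(batch, nkv, seq, head_dim)
out = gqa_attention(q, k, v, nq, nkv)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```

With nkv=2 instead of 8, the KV cache shrinks 4x while the output shape matches full multi-head attention.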
We’re excited to welcome you on board if you possess:
- Expertise in PyTorch and PyTorch Distributed, with experience running multi-node jobs using dozens to hundreds of GPUs.
- Hands-on experience with Megatron-LM/Megatron-Core/NeMo, DeepSpeed, or substantial FSDP/ZeRO knowledge.
- Strong profiling skills (Nsight Systems/Compute, nvprof) and familiarity with NVTX-instrumented workflows.
- Proficiency in GPU programming with Triton and/or CUDA, including the ability to write, test, and debug kernels.
- A solid grasp of NCCL collectives, as well as topology and fabric effects (IB/RoCE), and their implications in performance traces.
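One of the memory techniques mentioned above, activation checkpointing, can be sketched in a few lines of plain PyTorch. This is a toy CPU-sized example under assumed shapes, not the actual training stack: activations inside each block are dropped after the forward pass and recomputed during backward, trading extra compute for lower peak memory.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """A toy residual feed-forward block standing in for a transformer layer."""
    def __init__(self, dim):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

dim = 32
blocks = torch.nn.ModuleList(Block(dim) for _ in range(4))
x = torch.randn(8, dim, requires_grad=True)

h = x
for blk in blocks:
    # checkpoint() discards blk's intermediate activations after forward
    # and recomputes them during backward, lowering peak memory.
    h = checkpoint(blk, h, use_reentrant=False)

loss = h.pow(2).mean()
loss.backward()
print(x.grad.shape)  # torch.Size([8, 32])
```

In a real multi-node run, checkpointing is typically applied per transformer layer and combined with FSDP/ZeRO sharding; the recompute cost is one extra forward pass over the checkpointed blocks.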
