About the job
Join Nebius in Shaping the Future of AI Cloud Computing!
Nebius is at the forefront of a revolutionary shift in cloud computing, dedicated to empowering the global AI economy. We develop innovative tools and resources that enable our customers to address real-world challenges and transform industries, all while avoiding hefty infrastructure costs and the need for expansive in-house AI/ML teams. Our team operates at the cutting edge of AI cloud infrastructure, collaborating with some of the most knowledgeable and inventive leaders and engineers in this field.
Our Global Presence
Nebius is headquartered in Amsterdam and is publicly listed on Nasdaq. We have a robust global presence, with research and development hubs across Europe, North America, and Israel. Our diverse team of over 1,400 professionals includes more than 400 highly skilled engineers with extensive expertise in both hardware and software engineering, complemented by our dedicated in-house AI R&D team.
The Position
We are on the lookout for a Technical Product Manager to lead the product strategy for Soperator, our advanced Slurm-on-Kubernetes control plane designed for GPU clusters. In this pivotal role, you will define how machine learning engineers and research teams operate, scale, and optimize distributed workloads in production environments. If you are passionate about crafting systems that harmonize performance, reliability, and developer experience in AI infrastructure, we want to hear from you!
Your Key Responsibilities:
• Oversee the complete user journey across Soperator clusters, including Slurm workflows, dashboards, alerts/notifications, node lifecycle, and training/inference capacity management.
• Define the end-to-end product direction: problem discovery → solution design → delivery → adoption.
• Conduct in-depth customer discovery through interviews, usage analytics, and workload analysis to identify high-impact opportunities.
• Drive execution across platform teams, including compute, networking, storage, observability, and IAM.
• Translate cutting-edge ML and infrastructure concepts into tangible product capabilities for real-world GPU clusters.
• Establish success metrics, prioritize features, and maintain a consistent feedback loop with stakeholders.
