About the job
Our Vision
At Reflection AI, we are on a mission to develop open superintelligence and make it universally accessible.
We are pioneering open-weight models designed for individuals, agents, organizations, and even nation states. Our exceptional team of AI researchers and innovators hails from leading organizations such as DeepMind, OpenAI, Google Brain, Meta, Character.AI, and Anthropic.
About the Position
As a Research Program Manager at Reflection AI, you will play a pivotal role in enhancing our research and infrastructure teams, driving the acceleration of cutting-edge model development. This is not a role focused merely on tracking projects; rather, it’s about being a catalyst for clarity in complex situations, facilitating decision-making when uncertainty arises, and ensuring seamless collaboration across multiple teams.
Your primary focus will be on scaling our research infrastructure to support frontier-scale training operations throughout the pre-training, mid-training, and post-training phases. Collaborating closely with teams that use training libraries like Megatron, you will spearhead initiatives that transform raw computing clusters into efficient, high-performance training environments. You will be responsible for ensuring that our infrastructure operates effectively from end to end, removing obstacles for teams, and enabling our ambitious growth plans with confidence.
You possess a proactive mindset; when challenges arise, you don’t wait for direction. Instead, you take initiative, assess situations, streamline communication, align teams, and drive resolutions.
Your Responsibilities
Lead cross-functional initiatives enhancing training infrastructure and cluster reliability across all phases of training.
Facilitate comprehensive coordination as we scale our training stack in collaboration with engineering leads and external partners.
Engage in active incident management: triage issues, coordinate responses, and drive resolution across teams. Advocate for a culture of constructive post-mortems and continuous improvement, transforming incidents into systemic enhancements.
Collaborate with infrastructure and research engineering leads to identify bottlenecks, prioritize tasks, and ensure that our infrastructure investments are closely aligned with research productivity.
Establish and maintain transparency regarding training run health, cluster reliability, and infrastructure performance, providing leadership and teams with the context necessary for swift, informed decision-making.
