About the Role
We are seeking experienced Machine Learning Infrastructure Engineers who excel at designing, building, and maintaining robust training and serving infrastructure for machine learning research.
Key Responsibilities
- Deliver comprehensive infrastructure support for our machine learning research and product development.
- Develop tools for diagnosing cluster issues and addressing hardware failures effectively.
- Oversee deployments, manage experiments, and provide ongoing support for our research activities.
- Optimize GPU allocation and utilization for both training and serving environments.
Qualifications
- At least 4 years of experience supporting machine learning infrastructure.
- Proven experience in creating diagnostic tools for ML infrastructure issues.
- Familiarity with cloud services and tooling, such as Compute Engine, Kubernetes, and Cloud Storage.
- Hands-on experience working with GPUs.
Preferred Qualifications
- Experience managing large GPU clusters and high-performance computing/networking.
- Knowledge of infrastructure support for large language model training.
- Familiarity with machine learning frameworks like PyTorch, TensorFlow, or JAX.
- Experience in GPU kernel development.
