About the job
Your Responsibilities:
The Data Science team is building a state-of-the-art reliability platform. The platform covers data ingestion, derivation of meaningful metrics, anomaly detection, forecasting of potential problems, identification of slow processes in distributed environments, and automated root-cause analysis. We work with internal teams such as Fleet, Infrastructure, and AI Platform to improve system stability, optimize resource utilization, shorten resolution times, and sustain service availability and financial performance.
Role Overview:
As a Senior Data & MLOps Engineer, you will architect and scale the infrastructure that underpins the GPU Intelligence Platform. You will develop pipelines for data handling, feature engineering, model training, and the delivery of insights and predictions about system health and optimization. You will lead the transition of the system from initial prototypes to a production-ready environment that operates across the fleet. The work emphasizes scalability: separating real-time services from periodic processing, and managing resources dynamically based on system load and data frequency. You will design and deploy scalable distributed services using orchestration technologies.
Key Responsibilities:
- Create and implement scalable data ingestion pipelines.
- Develop feature processing and baseline computation systems.
- Productionize models for predictive analysis and anomaly detection.
- Establish and manage low-latency services and robust offline workflows.
- Architect horizontally scalable services with a clear separation between components, using orchestration for distribution.
