About Us:
Zefr stands as the global frontrunner in brand suitability targeting and measurement, operating across the largest platforms worldwide. Our innovative technology empowers advertisers to control their content adjacencies, aligning with their unique brand safety and suitability preferences. As an official partner of the YouTube Measurement Program, Meta for Business, and TikTok for Business, we utilize our patented machine learning and AI technology (Cognition AI) to deliver precise and transparent brand safety solutions. Headquartered in Los Angeles, California, Zefr has a global presence to better serve our clients.
Your Role:
We are seeking a Lead Machine Learning Operations Engineer to spearhead our MLOps team, driving the infrastructure, tooling, and processes that keep our machine learning systems running at scale. You will be responsible for deploying, monitoring, and optimizing ML models that analyze vast amounts of data from platforms like TikTok, YouTube, Facebook, Instagram, and Snap. In this pivotal role, you will lead a team of engineers to build and maintain robust ML pipelines, ensure model reliability in production, and implement best practices for model lifecycle management. By collaborating with ML Engineers and Data Scientists, you will bridge the gap between research and production and foster a culture of excellence and scalability in ML infrastructure.
Key Responsibilities:
- Lead, mentor, and develop a talented team of Machine Learning Engineers, promoting innovation and continuous improvement.
- Design and implement scalable ML infrastructure for model training, deployment, and serving.
- Establish and enforce best practices for ML model lifecycle management, including versioning, testing, and monitoring.
- Develop and maintain CI/CD pipelines tailored for machine learning workflows.
- Optimize model inference performance while minimizing latency and cost across production systems.
- Collaborate with ML Engineers and Data Scientists to move models from research into production efficiently.
- Implement comprehensive monitoring, alerting, and observability solutions for ML systems.
- Drive technical decisions regarding MLOps tooling, infrastructure, and architecture.
- Ensure high availability and reliability of ML services at scale.
