About the job
Due to project requirements, candidates must be Singapore citizens or hold Singapore Permanent Residency (PR) at the time of application.
As a Senior MLOps Engineer within the DAMO service line, you will play a vital role in ensuring the reliability, security, performance, and continuous enhancement of large-scale machine learning and AI systems in production, spanning both generative AI and traditional ML applications such as computer vision and recommendation systems. You will engage throughout the entire software delivery lifecycle, contributing to design, implementation, deployment, and ongoing operational excellence.
Your expertise will promote engineering best practices, emphasizing clean and maintainable code, test-driven development, continuous delivery, robust observability, and collaborative development through pairing and code reviews. Staying hands-on, you will actively contribute to codebases and implement modern practices as outlined in the Thoughtworks Technology Radar.
Your role will involve crafting practical solutions that consider technical limitations, cost efficiency, performance, and system safety. Collaborating closely with developers, data scientists, platform engineers, and product teams, you will facilitate the delivery of production-ready AI capabilities that align with business objectives and maintain a high standard of quality.
Additionally, you will contribute to fostering a collaborative and inclusive team culture, encouraging feedback and supporting the professional growth of team members.
Key Responsibilities
- Design, implement, and maintain monitoring and alerting systems for ML and AI operational signals, such as model performance degradation across various model types (e.g., computer vision, recommendation, GenAI), data drift, latency issues, and anomalies. This includes GenAI-specific monitoring for prompt failures, hallucination trends, guardrail violations, and overall agent workflow health.
- Develop and manage robust evaluation and testing pipelines for all ML and AI systems, including automated regression tests for models (e.g., accuracy, precision, recall for traditional ML), prompts, workflows, tools, and model versions, ensuring new releases meet or exceed established baselines.
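To illustrate the kind of regression gate the evaluation-pipeline responsibility describes, here is a minimal sketch: a release passes only if its accuracy, precision, and recall meet or exceed the previous release's baselines. The metric thresholds, labels, and function names are illustrative assumptions, not taken from this posting or any specific Thoughtworks tooling.

```python
# Minimal sketch of an automated regression gate for a binary classifier.
# Baseline values and the evaluation data below are hypothetical.

def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall from parallel 0/1 label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def meets_baseline(metrics, baseline, tolerance=0.0):
    """The release passes only if every metric meets or exceeds its baseline."""
    return all(metrics[k] >= baseline[k] - tolerance for k in baseline)

# Hypothetical held-out evaluation set and baselines from the prior release.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
baseline = {"accuracy": 0.70, "precision": 0.70, "recall": 0.70}

metrics = evaluate(y_true, y_pred)
print(meets_baseline(metrics, baseline))  # True: all metrics clear baseline
```

In a real pipeline this check would run in CI against a versioned evaluation dataset, blocking deployment when any metric regresses below its baseline; GenAI systems would add analogous gates over prompt and workflow evaluation suites.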

