Senior Machine Learning Engineer - Model Evaluations for Public Sector

Scale AI · San Francisco, CA; St. Louis, MO; New York, NY; Washington, DC
On-site · Full-time · $216.3K/yr - $300.3K/yr

Experience Level

Senior

Qualifications

Preferred Qualifications:

  • Experience in computer vision, deep learning, reinforcement learning, or natural language processing in production environments.
  • Proficiency in Python; familiarity with TensorFlow or PyTorch.
  • Solid foundation in algorithms, data structures, and object-oriented programming.
  • Experience with LLM pipelines, simulation environments, or automated evaluation frameworks.
  • Ability to translate research insights into quantifiable evaluation criteria.

Additional Qualifications:

  • Advanced degree in Computer Science, Machine Learning, or Artificial Intelligence.
  • Experience with cloud platforms (AWS, GCP) and model deployment.
  • Familiarity with LLM evaluation, computer vision robustness, or reinforcement learning validation.
  • Knowledge of interpretability, adversarial robustness, or AI safety frameworks.
  • Experience in regulated, classified, or mission-critical AI domains.

About the job

Senior Machine Learning Engineer - Model Evaluations for the Public Sector

The Public Sector Machine Learning team at Scale AI pioneers the deployment of cutting-edge AI systems, including Large Language Models (LLMs), agentic models, and comprehensive multimodal pipelines, within critical government operations. We establish robust evaluation frameworks that ensure these models function reliably, safely, and effectively in real-world scenarios. As a Senior Machine Learning Engineer, you will architect, implement, and enhance automated evaluation pipelines that empower our clients to trust and effectively utilize advanced AI systems in defense, intelligence, and federal missions.

Your Responsibilities Include:

  • Creating and maintaining automated evaluation pipelines for machine learning models, focusing on functional, performance, robustness, and safety metrics, including evaluations based on LLM judges (see the sketch after this list).
  • Designing test datasets and benchmarks to assess generalization, bias, explainability, and potential failure modes.
  • Building evaluation frameworks for LLM agents, including the infrastructure for scenario-based and environment-based testing.
  • Conducting comparative analyses of model architectures, training procedures, and evaluation results.
  • Implementing tools for continuous monitoring, regression testing, and quality assurance of machine learning systems.
  • Designing and executing stress tests and red-teaming workflows to identify vulnerabilities and edge cases.
  • Collaborating with operations teams and subject matter experts to generate high-quality evaluation datasets.
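
To give a flavor of the first responsibility, below is a minimal sketch of an LLM-judge evaluation loop with a pass-rate gate of the kind used for regression testing. Every name in it (EvalCase, score_with_judge, run_suite, the stub judge, the 0.7 threshold) is an illustrative assumption, not Scale AI's actual framework.

    # Minimal sketch of an automated LLM-judge evaluation pipeline.
    # All names here are illustrative assumptions, not a real framework.
    from dataclasses import dataclass

    @dataclass
    class EvalCase:
        prompt: str
        model_output: str
        rubric: str  # criteria the judge scores against

    def score_with_judge(case: EvalCase, judge) -> float:
        """Ask a judge model to grade one output on a 0-1 scale.

        `judge` is any callable mapping a grading prompt to a text
        completion; in practice it would wrap an LLM API client.
        """
        grading_prompt = (
            f"Rubric: {case.rubric}\n"
            f"Prompt: {case.prompt}\n"
            f"Response: {case.model_output}\n"
            "Return only a score between 0 and 1."
        )
        reply = judge(grading_prompt)
        try:
            return max(0.0, min(1.0, float(reply.strip())))
        except ValueError:
            return 0.0  # unparseable judge replies count as failures

    def run_suite(cases, judge, pass_threshold=0.7):
        """Score every case and report the aggregate pass rate,
        which a CI job could gate on as a regression check."""
        scores = [score_with_judge(c, judge) for c in cases]
        pass_rate = sum(s >= pass_threshold for s in scores) / len(scores)
        return pass_rate, scores

    if __name__ == "__main__":
        # Stub judge for demonstration; a real pipeline calls a model API.
        cases = [EvalCase("What is 2+2?", "4", "Answer must be correct.")]
        rate, _ = run_suite(cases, judge=lambda p: "1.0")
        print(f"pass rate: {rate:.2f}")

Two design points the sketch illustrates: judge replies are parsed defensively (an unparseable grade counts as a failure rather than crashing the run), and the suite reports an aggregate pass rate so that continuous monitoring can flag regressions against a fixed threshold.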

This position requires an active security clearance or the ability to obtain one.

About Scale AI

At Scale AI, we are committed to advancing artificial intelligence technology for the public sector, ensuring that our AI systems are not only innovative but also reliable and safe for crucial government applications. Our team is dedicated to building frameworks that enhance trust in AI systems deployed in the defense and intelligence sectors.
