About the job
LILT is building an extensive global network of domain experts who deliver high-quality AI evaluations across training, benchmarking, red-teaming, and ongoing model monitoring. We are seeking talented software engineering and DevOps professionals to apply their expert judgment to the human-in-the-loop AI evaluation workflows used by leading enterprises and hyperscalers.
This position is designed for individuals with a deep understanding of software systems, infrastructure, and development practices in real-world production environments. Your expertise will play a crucial role in evaluating and improving multilingual AI systems.
Your contributions will directly impact the quality, safety, and deployment readiness of multilingual AI models.
This position offers two distinct expert tracks, differentiated by experience level and scope of responsibility.
Track A: Software Engineering & DevOps AI Rater
Raters will carry out structured evaluation tasks following clearly defined rubrics and instructions.
Responsibilities
- Evaluate AI outputs related to software engineering, DevOps, and infrastructure topics.
- Conduct structured scoring, comparison, classification, and judgment tasks.
- Assess technical correctness, completeness, security implications, and adherence to best practices.
- Identify hallucinations, incorrect code, unsafe recommendations, or misleading system guidance.
- Consistently apply domain-specific engineering and DevOps guidelines across tasks.
Ideal Background
- Software engineers, site reliability engineers, DevOps engineers, or platform engineers.
- Experience with production systems, CI/CD pipelines, cloud infrastructure, or distributed systems.
- Exceptional attention to detail and comfort working with structured evaluation criteria.
Track B: Software Engineering & DevOps AI Evaluator (Senior Track)
Evaluators provide advanced technical oversight and help shape evaluation processes.
Responsibilities
- Validate and refine evaluation rubrics and edge-case handling.
- Adjudicate disagreements among raters.
- Conduct error analysis and qualitative assessments of model behavior.

