
AI Benchmark Engineer | Native Language Specialist - Mandarin Chinese - Remote

Lilt, Inc. | China (Remote)
Remote Contract



Qualifications

Required Qualifications

  • Experience: A minimum of 5 years of professional experience in software engineering.
  • Background: A proven record of accomplishment at leading technology firms and/or a degree from a prestigious engineering university.
  • Language proficiency: Native or near-native fluency in Mandarin Chinese, with a comprehensive grasp of its grammar, register, and phrasing. A strong command of English is also required.
  • Technical skills: Solid expertise in Python, shell scripting, and data processing techniques.
  • Familiarity: Experience with AI language models and software benchmarking is preferred.

About the job

Lilt is developing a set of Terminal-Bench tasks to assess large language models in multilingual software settings. This work examines how AI systems manage prompts, data, and encoding challenges in terminal environments, with a strong focus on non-English contexts. The aim is to better measure and enhance AI performance in languages beyond English, especially Mandarin Chinese.
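The encoding challenges mentioned above are concrete: Mandarin characters occupy multiple bytes in UTF-8, so byte counts and character counts diverge, and mixing codecs corrupts text silently. A minimal Python sketch of both pitfalls (the sample string and codec pair are chosen purely for illustration):

```python
# Illustrative only: two common encoding pitfalls with Mandarin text
# in terminal pipelines. Sample string and codecs are invented here.

text = "你好，世界"  # 5 characters: "Hello, world" in Mandarin

# Pitfall 1: byte length != character length. Each CJK character
# (and the full-width comma) takes 3 bytes in UTF-8, so naive
# byte-based truncation can split a character mid-sequence.
utf8_bytes = text.encode("utf-8")
assert len(text) == 5
assert len(utf8_bytes) == 15

# Pitfall 2: decoding with the wrong codec (GBK bytes read as UTF-8)
# yields mojibake or replacement characters instead of an error.
gbk_bytes = text.encode("gbk")
garbled = gbk_bytes.decode("utf-8", errors="replace")
assert garbled != text  # the round-trip silently lost the original
```

Benchmarks in this space would need to exercise exactly these failure modes without falling back to English data.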

This is a freelance, remote position for native Mandarin Chinese-speaking software engineers based in China. The main responsibility is designing, building, and validating benchmarks that evaluate AI models' abilities in Mandarin, without relying on English translations.

What you will do

  • Task engineering: Design tasks to evaluate the capabilities of AI coding agents.
  • Asset development: Build realistic task environments using datasets and files written in Mandarin Chinese, ensuring all materials remain in the target language.
  • Prompting and translation analysis: Identify where AI models have difficulty processing or understanding Mandarin prompts and data.
  • Implementation and verification: Help develop reference solutions and verifier scripts, using rubric-based judging when necessary.
  • Calibration and execution: Review execution logs and fine-tune task difficulty, from Easy to Very Hard, applying standard Terminal-Bench settings for various model levels.
  • Quality assurance: Take part in a four-step human review process (creation, human review, calibration review, audit) and automated checks to ensure fairness, grammatical accuracy, and benchmark reliability.
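To make the verifier-script responsibility concrete, here is one hypothetical sketch of a rubric-style check in Python: it confirms a task's output file exists, decodes as valid UTF-8, and contains required Mandarin terms. The file name, rubric terms, and function signature are all invented for illustration, not Lilt's actual tooling:

```python
# Hypothetical verifier sketch: pass/fail check on a task's output file.
# All names here are illustrative, not part of any real benchmark suite.
from pathlib import Path

def verify(output_path: str, required_terms: list[str]) -> bool:
    """Return True if the output exists, is valid UTF-8, and
    contains every term in the rubric."""
    path = Path(output_path)
    if not path.exists():
        return False
    try:
        content = path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        return False  # reject outputs that are not valid UTF-8
    return all(term in content for term in required_terms)
```

A real Terminal-Bench verifier would run inside the task's container and might combine checks like this with rubric-based judging for open-ended outputs, as the bullet above describes.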

Position details

  • Freelance, remote role
  • Based in China
  • Native Mandarin Chinese proficiency required

About Lilt, Inc.

At Lilt, we are dedicated to advancing the capabilities of AI through innovative evaluation techniques and a commitment to multilingual excellence. Our team is passionate about pushing the boundaries of language models to meet real-world challenges.
