About the job
Lilt is developing a set of Terminal-Bench tasks to assess large language models in multilingual software settings. This work examines how AI systems manage prompts, data, and encoding challenges in terminal environments, with a strong focus on non-English contexts. The aim is to better measure and enhance AI performance in languages beyond English, especially Mandarin Chinese.
This is a freelance, remote position for native Mandarin Chinese-speaking software engineers based in China. The main responsibility is designing, building, and validating benchmarks that evaluate AI models' abilities in Mandarin, without relying on English translations.
What you will do
- Task engineering: Design tasks to evaluate the capabilities of AI coding agents.
- Asset development: Build realistic task environments using datasets and files written in Mandarin Chinese, ensuring all materials remain in the target language.
- Prompting and translation analysis: Identify where AI models have difficulty processing or understanding Mandarin prompts and data.
- Implementation and verification: Help develop reference solutions and verifier scripts, using rubric-based judging when necessary.
- Calibration and execution: Review execution logs and fine-tune task difficulty, from Easy to Very Hard, applying standard Terminal-Bench settings for various model levels.
- Quality assurance: Take part in a four-step quality process (creation, human review, calibration review, audit), supplemented by automated checks, to ensure fairness, grammatical accuracy, and benchmark reliability.
Position details
- Freelance, remote role
- Based in China
- Native Mandarin Chinese proficiency required
