About the job
Join our innovative team at Lilt, where we are building a comprehensive suite of Terminal-Bench evaluation tasks designed to push the boundaries of large language models on multilingual software challenges. Our mission is to accurately assess multilingual robustness, focusing on prompt-language effects, non-English data processing, and intricate locale/encoding edge cases within terminal workflows.
We are looking for skilled native-speaking software engineers to design, build, and validate these benchmarks. You will create high-quality, impactful tasks that authentically evaluate a model's ability to navigate multilingual settings without relying on English translations.
This is a remote, freelance position.
Target Languages: Spanish, German, Czech, Turkish, Arabic (Egyptian), Korean, Japanese, Hausa, Hindi, Marathi.
Key Responsibilities:
- Task Engineering: Assess and help improve coding agents through well-designed benchmark tasks.
- Asset Creation: Develop realistic task scenarios using datasets and files in your native language. It is essential that these assets remain in the target language to adequately evaluate multilingual capabilities.
- Prompting & Translation: Identify failure points in model performance when prompting in your native language.
- Implementation & Verification: Assist in developing reliable solutions (reference implementations) and write highly accurate, deterministic verification scripts, resorting to rubric-based judging only when absolutely necessary.
- Calibration & Execution: Review execution logs and calibrate task difficulty (from Easy to Very Hard) using standard Terminal-Bench configurations against various model tiers (Haiku, Sonnet, Opus).
- Quality Assurance: Engage in a rigorous, four-layer human quality control process (creation, human review, calibration review, and audit) alongside automated LLM-based checks to ensure fairness, grammatical accuracy, and the integrity of benchmarks.
