About the job
Join our team at Lilt, where we are building a suite of Terminal-Bench tasks that rigorously assess how well large language models handle multilingual software challenges. Our goal is to evaluate multilingual robustness across prompt languages, non-English data processing, and locale/encoding edge cases in terminal workflows.
We are looking for skilled native-speaking software engineers to design, build, and validate these benchmarks. You will create high-value, high-quality tasks that authentically measure a model's ability to work in multilingual environments without the aid of English translations.
This is a remote, freelance position.
Target Languages: Spanish, German, Czech, Turkish, Arabic (Egyptian), Korean, Japanese, Hausa, Hindi, Marathi.
Key Responsibilities:
- Task Engineering: Design and build Terminal-Bench tasks that evaluate coding agents on multilingual terminal workflows.
- Asset Creation: Develop realistic task environments using datasets and files in your native language. These assets must remain in the target language so the tasks accurately measure multilingual handling.
- Prompting & Translation: Identify failure points in AI performance, in your native language.
- Implementation & Verification: Assist in the creation of robust solutions (reference implementations) and develop highly reliable, deterministic verifier scripts (using rubric-based judging only when absolutely necessary).
- Calibration & Execution: Analyze execution logs and adjust task difficulty (Easy to Very Hard) utilizing standard Terminal-Bench run configurations against diverse model tiers (Haiku, Sonnet, Opus).
- Quality Assurance: Participate in a thorough, four-layer human quality-control process (creation, human review, calibration review, and audit), alongside automated LLM-based checks, to ensure fairness, grammatical accuracy, and benchmark integrity.
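
To give a sense of what "deterministic verifier script" means here, below is a minimal sketch in a pytest style. It assumes a hypothetical task whose agent must write a UTF-8 JSON report to /app/informe.json; the file path, keys, and expected values are illustrative only, not taken from an actual Lilt or Terminal-Bench task.

```python
# Minimal sketch of a deterministic verifier. The task assets here are
# hypothetical: /app/informe.json, its keys, and the expected values
# are illustrative, not from a real task.
import json
from pathlib import Path

REPORT = Path("/app/informe.json")
# Expected contents stay in the target language (Spanish, as an example).
EXPECTED = {"ciudad": "Córdoba", "población": 322000}

def test_report_exists():
    # The agent must have produced the report file.
    assert REPORT.exists()

def test_report_contents():
    # Read as UTF-8 so non-ASCII keys and values round-trip correctly.
    data = json.loads(REPORT.read_text(encoding="utf-8"))
    # Exact-match assertions keep the check deterministic:
    # no rubric-based or LLM judging is involved.
    assert data["ciudad"] == EXPECTED["ciudad"]
    assert data["población"] == EXPECTED["población"]
```

Checks like these run after the agent finishes and pass or fail on the resulting filesystem state alone, which is what keeps results reproducible across model tiers.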
