About the job
At Lilt, we are innovating the way large language models are evaluated through our advanced Terminal-Bench tasks. Our evaluation suite is designed to push the boundaries of multilingual software capabilities: it assesses robustness in handling prompt-language variation, non-English data processing, and complex locale and encoding scenarios within terminal workflows.
We are looking for seasoned native-speaking software engineers who can conceptualize, develop, and validate these benchmarks. Your role will involve crafting high-quality, impactful tasks that accurately evaluate a model's proficiency in navigating multilingual contexts without the aid of English translations.
Please note that this position is a remote, freelance opportunity.
Target Languages: Spanish, German, Czech, Turkish, Arabic (Egyptian), Korean, Japanese, Hausa, Hindi, Marathi.
Key Responsibilities:
- Task Engineering: Design and build Terminal-Bench tasks that assess the performance of coding agents.
- Asset Creation: Develop realistic task scenarios utilizing datasets and files in your native language, ensuring that these remain in the target language for authentic multilingual evaluation.
- Prompting & Translation: Identify and analyze failure points in AI performance within your native language.
- Implementation & Verification: Aid in crafting robust solutions (reference implementations) and write dependable, deterministic verifier scripts (utilizing rubric-based judging only when absolutely necessary).
- Calibration & Execution: Examine execution logs and adjust task difficulty levels (from Easy to Very Hard) using standard Terminal-Bench configurations across various model tiers (Haiku, Sonnet, Opus).
- Quality Assurance: Engage in a meticulous, 4-layer human quality control process (creation, human review, calibration review, and audit) in conjunction with automated LLM-based checks to maintain fairness, grammatical precision, and benchmark integrity.
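
To give a concrete sense of the verifier work, here is a minimal sketch in Python of a deterministic check for a hypothetical task that asks an agent to write UTF-8 output to a known path. The file path, expected string, and NFC normalization are illustrative assumptions, not part of the Terminal-Bench API; the point is an exact, reproducible pass/fail exit code rather than subjective judging.

```python
#!/usr/bin/env python3
"""Minimal sketch of a deterministic verifier for a multilingual task.

Assumptions (hypothetical, for illustration only): the task instructs
the agent to write a normalized result to /app/resultado.txt.
"""
import sys
import unicodedata
from pathlib import Path

OUTPUT = Path("/app/resultado.txt")   # hypothetical path the task would specify
EXPECTED = "café, müller, čeština"    # hypothetical expected content

def main() -> int:
    if not OUTPUT.exists():
        print(f"FAIL: {OUTPUT} not found")
        return 1
    # Read explicitly as UTF-8 so encoding mistakes surface as failures,
    # and NFC-normalize so visually identical Unicode forms compare equal.
    actual = unicodedata.normalize("NFC", OUTPUT.read_text(encoding="utf-8")).strip()
    expected = unicodedata.normalize("NFC", EXPECTED)
    if actual != expected:
        print(f"FAIL: expected {expected!r}, got {actual!r}")
        return 1
    print("PASS")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Because the check is deterministic, repeated runs produce identical verdicts, which is what makes difficulty calibration across model tiers comparable.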
