About the job
Join us in building a robust, verifiable evaluation suite of Terminal-Bench tasks that pushes large language models on multilingual software challenges. Our goal is to measure multilingual resilience: how the prompt language affects performance, how well models process non-English data, and how they handle complex locale and encoding scenarios in terminal workflows.
We are looking for skilled native-speaking software engineers to design, build, and validate these benchmarks. You will create high-quality, impactful tasks that evaluate a model's proficiency in multilingual contexts without relying on English translations.
Note: This is a remote, freelance opportunity.
Key Responsibilities
- Task Engineering: Design and build Terminal-Bench tasks that assess coding agents in multilingual terminal environments.
- Asset Creation: Develop realistic task environments using datasets and files in your native language. These assets must remain in the target language to evaluate multilingual capability accurately (see the first sketch after this list).
- Prompting & Translation: Craft prompts in your native language and identify the failure points where models struggle with it.
- Implementation & Verification: Help create robust reference solutions and write highly reliable verifier scripts, resorting to rubric-based judging only when absolutely necessary (see the second sketch after this list).
- Calibration & Execution: Analyze execution logs and adjust task complexity (ranging from Easy to Very Hard) using standard Terminal-Bench run configurations across various model tiers (Haiku, Sonnet, Opus).
- Quality Assurance: Engage in a meticulous, four-layer human quality control process (creation, human review, calibration review, and audit) combined with automated LLM-based checks to uphold fairness, grammatical precision, and benchmark integrity.
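To give a concrete sense of the asset-creation work, here is a minimal, hypothetical sketch of how a task author might generate a small Spanish-language dataset whose accented characters and decimal-comma numbers exercise locale and encoding handling. The file name, columns, and values are invented for illustration and are not part of any existing task.

```python
# Hypothetical asset-generation sketch: writes a small Spanish-language CSV
# whose accented characters and decimal-comma numbers exercise locale handling.
# All names and figures below are illustrative only.
import csv

rows = [
    ("ciudad", "población", "superficie_km2"),
    ("Málaga", "578.460", "398,25"),     # Spanish thousands dot / decimal comma
    ("Córdoba", "322.071", "1.254,25"),
    ("Cádiz", "111.811", "12,10"),
]

# Keep the asset in the target language and in UTF-8, so the task genuinely
# tests multilingual handling rather than an English translation.
with open("ciudades.csv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)
```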
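And a minimal, hypothetical sketch of a deterministic verifier, assuming a pytest-style check over files the agent produces inside the task environment; the output path, column names, and expected row count are assumptions made for this example, not a prescribed Terminal-Bench interface.

```python
# Hypothetical verifier sketch: checks that the agent produced a UTF-8 CSV report
# with the expected target-language header and row count. Paths and expected
# values are illustrative only.
import csv
import pathlib

OUTPUT = pathlib.Path("/app/output/report.csv")  # hypothetical task output path


def test_report_exists():
    assert OUTPUT.exists(), "agent did not produce the expected report"


def test_report_is_valid_utf8_csv():
    # Decoding explicitly as UTF-8 catches mojibake introduced by wrong locale settings.
    with OUTPUT.open(encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f))
    assert rows, "report is empty"
    assert rows[0] == ["ciudad", "población"], "unexpected header (target-language columns)"


def test_row_count():
    with OUTPUT.open(encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f))
    assert len(rows) - 1 == 10, "expected exactly 10 data rows"
```

Deterministic checks like these keep verification reproducible; as noted in the responsibilities above, rubric-based judging remains a last resort.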
