
AI Benchmark Engineer - Native Language Specialist | Serbian

lilt-production | Serbia (Remote)
Remote Contract




Required Qualifications

- Experience: Over 5 years of professional experience in software engineering.
- Background: Established history at leading tech firms and/or graduation from a top-tier engineering university.
- Language: Native or near-native fluency in Serbian, with a comprehensive understanding of its grammar, stylistic nuances, and phrasing conventions. High proficiency in English is also essential.
- Technical Skills: Proficient in Python, standard shell scripting, and data processing techniques.
- Workflow Familiarity: Extensive experience with Terminal/CLI-based development workflows and a solid understanding of coding agents.

About the job

Join us in developing a robust and verifiable evaluation suite of Terminal-Bench tasks aimed at pushing the boundaries of large language models in addressing multilingual software challenges. Our mission is to assess multilingual resilience by investigating prompt language influences, processing non-English data, and navigating intricate locale and encoding scenarios in terminal workflows.

We are looking for skilled native-speaking software engineers to conceptualize, construct, and validate these benchmarks. Your role will involve designing high-quality, impactful tasks that effectively evaluate a model's proficiency in multilingual contexts without depending on English translations.
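To make the "intricate locale and encoding scenarios" concrete, here is a minimal, self-contained Python sketch (illustrative only, not one of the actual benchmark tasks) of the kind of pitfall such tasks probe: Serbian Cyrillic text decoded with the wrong codec silently mojibakes rather than raising an error.

```python
# Illustrative sketch: an encoding pitfall a terminal task might probe.
text = "Здраво, свете"  # "Hello, world" in Serbian Cyrillic
raw = text.encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 never raises: every byte is a valid
# Latin-1 code point, so the text is garbled silently.
garbled = raw.decode("latin-1")
assert garbled != text

# Decoding with the correct codec round-trips cleanly.
assert raw.decode("utf-8") == text
```

A model that only ever sees ASCII data will never hit this failure mode, which is exactly why task assets must stay in the target language.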

Note: This is a remote, freelance opportunity.

Key Responsibilities

- Task Engineering: Design and build Terminal-Bench tasks that assess coding agents' ability to handle multilingual software work.

- Asset Creation: Develop realistic task environments utilizing datasets and files in your native language. Importantly, these assets must remain in the target language to accurately evaluate multilingual capability.

- Prompting & Translation: Identify failure points where AI models struggle in your native language.

- Implementation & Verification: Help create robust solutions (reference implementations) and craft highly dependable verifier scripts (resorting to rubric-based judging only when absolutely necessary).

- Calibration & Execution: Analyze execution logs and adjust task complexity (ranging from Easy to Very Hard) using standard Terminal-Bench run configurations across various model tiers (Haiku, Sonnet, Opus).

- Quality Assurance: Engage in a meticulous, four-layer human quality control process (creation, human review, calibration review, and audit) combined with automated LLM-based checks to uphold fairness, grammatical precision, and benchmark integrity.
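As a rough idea of what a "dependable verifier script" can look like, here is a hedged Python sketch (the file path, expected string, and task layout are hypothetical, not Lilt's actual harness): a deterministic check that the agent produced the expected Serbian-language output, with Unicode normalization so visually identical strings compare equal.

```python
import unicodedata
from pathlib import Path


def verify(output_path: str, expected: str) -> bool:
    """Return True iff the agent's output file matches the expected text.

    Deterministic check: no LLM judging involved. NFC normalization makes
    visually identical Serbian strings compare equal regardless of their
    Unicode composition form; strip() ignores trailing newlines.
    """
    p = Path(output_path)
    if not p.exists():
        return False
    actual = p.read_text(encoding="utf-8")

    def norm(s: str) -> str:
        return unicodedata.normalize("NFC", s).strip()

    return norm(actual) == norm(expected)
```

A real verifier would typically end with something like `sys.exit(0 if verify(...) else 1)` so the harness can read pass/fail from the exit code; the exact convention depends on the benchmark runner.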

About lilt-production

Lilt works at the forefront of multilingual software challenges and is committed to enhancing the capabilities of large language models in diverse linguistic environments. Our team is dedicated to creating innovative solutions that ensure linguistic precision and cultural relevance across various platforms.
