
AI Benchmark Engineer - Native Language Specialist | Czech

Lilt, Inc. · Czech Republic (Remote)
Remote Contract



Experience Level

Mid to Senior

Qualifications

Required Qualifications:

- Experience: Minimum of 5 years in the software engineering field.

- Background: A proven history of accomplishment with leading technology companies and/or a degree from a prestigious engineering institution.

- Language: Native or near-native proficiency in Czech, with a strong grasp of its grammar, register, and phrasing rules. Excellent proficiency in English.

- Technical Skills: Strong command of Python, standard shell scripting, and data processing.

About the job

Join our innovative team at Lilt, where we are developing a comprehensive evaluation suite of Terminal-Bench tasks to rigorously assess the capabilities of large language models in tackling multilingual software challenges. Our objective is to evaluate multilingual robustness across various prompt languages, non-English data processing, and intricate locale/encoding edge cases in terminal workflows.

We are on the lookout for skilled native-speaking software engineers to craft, construct, and validate these benchmarks. You will design high-value, high-quality tasks that authentically assess a model's proficiency in handling multilingual environments without the aid of English translations.

This is a remote, freelance position.

Target Languages: Spanish, German, Czech, Turkish, Arabic (Egyptian), Korean, Japanese, Hausa, Hindi, Marathi.

Key Responsibilities:

- Task Engineering: Design, build, and validate Terminal-Bench tasks that evaluate coding agents in multilingual terminal environments.

- Asset Creation: Develop realistic task environments utilizing datasets and files in your native language. It is vital that these assets remain in the target language to accurately measure multilingual handling.

- Prompting & Translation: Craft task prompts in your native language and identify points where AI performance breaks down.

- Implementation & Verification: Assist in the creation of robust solutions (reference implementations) and develop highly reliable, deterministic verifier scripts (using rubric-based judging only when absolutely necessary).

- Calibration & Execution: Analyze execution logs and adjust task difficulty (Easy to Very Hard) utilizing standard Terminal-Bench run configurations against diverse model tiers (Haiku, Sonnet, Opus).

- Quality Assurance: Engage in a thorough, four-layer human quality control process (creation, human review, calibration review, and audit) alongside automated LLM-based checks to guarantee fairness, grammatical precision, and benchmark integrity.
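To give a flavor of the verifier work described above, here is a minimal sketch of a deterministic verifier script. It is illustrative only, not from the posting: the file name `vystup.txt` and the expected contents are hypothetical, and the exit-code convention (0 for pass, nonzero for fail) is the usual one for automated checks.

```python
# Illustrative sketch of a deterministic verifier (hypothetical file
# name and expected data): exits 0 if the task output matches exactly,
# 1 otherwise, so results are reproducible across runs.
import sys
from pathlib import Path

# Hypothetical expected output in the target language (Czech city names).
EXPECTED = ["Praha", "Brno", "Ostrava"]


def verify(output_path: str = "vystup.txt") -> bool:
    path = Path(output_path)
    if not path.exists():
        return False
    # Decode explicitly as UTF-8 so locale/encoding edge cases fail
    # loudly instead of passing by accident.
    lines = path.read_text(encoding="utf-8").splitlines()
    return lines == EXPECTED


if __name__ == "__main__":
    sys.exit(0 if verify() else 1)
```

A check like this is deterministic because it compares exact, fully specified output; rubric-based LLM judging would only enter the picture for tasks whose results cannot be pinned down this precisely.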

About Lilt, Inc.

Lilt is at the forefront of creating advanced language solutions that empower organizations to thrive in a multilingual world. By leveraging cutting-edge technology and a talented team, we deliver innovative tools that enhance communication and understanding across diverse languages.
