
AI Benchmark Engineer - Japanese Language Specialist

Lilt, Inc. · Japan (Remote)
Remote Contract



Qualifications

Required Qualifications:

  • Experience: A minimum of 5 years in software engineering.
  • Background: Demonstrated success at leading technology firms, or graduation from a prestigious engineering institution.
  • Language: Native or near-native fluency in Japanese, with a profound understanding of its grammatical structure, register, and phrasing conventions. High proficiency in English is also required.
  • Technical Skills: Strong expertise in Python, standard shell scripting, and data processing.

About the job

At Lilt, we are innovating the way large language models are evaluated through our advanced Terminal-Bench tasks. Our comprehensive evaluation suite is designed to push the boundaries of multilingual software capabilities, focusing on assessing multilingual robustness in handling prompt language variations, non-English data processing, and intricate locale/encoding scenarios within terminal workflows.
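To give a flavor of the locale/encoding scenarios mentioned above, here is a small invented illustration (the data and encoding choice are ours, not drawn from any actual benchmark task): a terminal workflow might hand an agent Japanese text stored as Shift_JIS bytes rather than UTF-8, and a robust solution must detect and decode it correctly.

```python
# Hypothetical example: Japanese text arrives as Shift_JIS-encoded bytes.
data = "日本語のデータ".encode("shift_jis")

# A naive UTF-8 decode fails, because Shift_JIS byte sequences are not
# valid UTF-8; falling back to the correct codec recovers the text.
try:
    recovered = data.decode("utf-8")
except UnicodeDecodeError:
    recovered = data.decode("shift_jis")

print(recovered)  # 日本語のデータ
```

Tasks built around cases like this probe whether a model reasons about byte-level encodings instead of assuming UTF-8 everywhere.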

We are looking for seasoned native-speaking software engineers who can conceptualize, develop, and validate these benchmarks. Your role will involve crafting high-quality, impactful tasks that accurately evaluate a model's proficiency in navigating multilingual contexts without the aid of English translations.

Please note that this position is a remote, freelance opportunity.

Target Languages: Spanish, German, Czech, Turkish, Arabic (Egyptian), Korean, Japanese, Hausa, Hindi, Marathi.

Key Responsibilities:

  • Task Engineering: Design and build terminal-based tasks that assess the performance of coding agents.
  • Asset Creation: Develop realistic task scenarios utilizing datasets and files in your native language, ensuring that these remain in the target language for authentic multilingual evaluation.
  • Prompting & Translation: Identify and analyze failure points in AI performance within your native language.
  • Implementation & Verification: Aid in crafting robust solutions (reference implementations) and write dependable, deterministic verifier scripts (utilizing rubric-based judging only when absolutely necessary).
  • Calibration & Execution: Examine execution logs and adjust task difficulty levels (from Easy to Very Hard) using standard Terminal-Bench configurations across various model tiers (Haiku, Sonnet, Opus).
  • Quality Assurance: Engage in a meticulous, 4-layer human quality control process (creation, human review, calibration review, and audit) in conjunction with automated LLM-based checks to maintain fairness, grammatical precision, and benchmark integrity.
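As a minimal sketch of what a "dependable, deterministic verifier script" might look like (the function name, file path, and hash-comparison approach here are illustrative assumptions, not the actual Terminal-Bench harness): a verifier can pass or fail an agent's output by comparing a cryptographic digest against a precomputed reference, which yields the same verdict on every run with no judgment calls.

```python
import hashlib
from pathlib import Path


def verify(output_path: str, expected_sha256: str) -> bool:
    """Deterministically check an agent-produced file against a reference.

    Returns True only if the file exists and its SHA-256 digest matches
    the precomputed expected value; byte-for-byte comparison leaves no
    room for flaky or subjective grading.
    """
    p = Path(output_path)
    if not p.is_file():
        return False
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    return digest == expected_sha256
```

Rubric-based (LLM-judged) checks, as the bullet above notes, would be reserved for the rare outputs that cannot be reduced to an exact comparison like this.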

About Lilt, Inc.

Lilt, Inc. is at the forefront of machine translation and AI benchmarking, dedicated to enhancing the capabilities of language models through innovative evaluation methods. Our goal is to drive advancements in multilingual software solutions, ensuring an equitable and efficient experience for users across diverse languages.
