About the job
About Us:
TransPerfect is a leading provider of translation software with a dynamic and innovative start-up culture. We are on the lookout for a passionate and inventive Backend Developer to join our groundbreaking Artificial Intelligence (AI) team. As a member of this division, you will play a pivotal role in shaping the future of AI within a globally recognized organization. Our AI team, which has evolved over the past decade since the inception of our first machine translation models, is a key driver of innovation in machine translation, generative AI, natural language processing, and automation.
We seek an experienced backend developer who is enthusiastic about exploring the limits of technology and making a meaningful impact in the AI domain. You will collaborate with a diverse, global team of professionals from the USA, Spain, Portugal, and India. If you have a strong passion for building robust and scalable solutions that deliver AI capabilities to users, this is the perfect opportunity for you.
About the Role:
As a Backend Developer, you will address the challenges of document processing, specifically converting complex, unstructured PDFs into well-formatted, editable .docx files. Your objective will go beyond simple text extraction; you will aim to faithfully recreate the visual and structural intent of the original documents, including nested tables, multi-column layouts, font hierarchies, and styling.
In this hybrid role, you will lead the research and implementation of our document conversion pipeline. The position combines strategic decision-making, which means keeping abreast of existing tools, with hands-on development, blending engineering and AI expertise.
You will be responsible for:
- Comparative Analysis: Conducting comprehensive evaluations of commercial (e.g., ABBYY, Adobe, AWS Textract) versus open-source/AI-native (e.g., Mistral OCR, Docling, Nougat, LlamaParse) solutions.
- Benchmarking: Establishing metrics for "format fidelity" to objectively assess how effectively a tool reproduces headers, footers, tables, and styles.
- Pipeline Development: Developing a Python-based workflow that integrates OCR engines with document generation libraries (such as python-docx or Pandoc).
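
To give a flavor of the benchmarking responsibility above, here is a minimal sketch of one way a "format fidelity" score could be computed. It assumes each document has already been reduced to an ordered list of blocks. The Block type, the block kinds, the style names, and the equal weighting are all hypothetical choices for illustration, not an established metric or our production method.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass(frozen=True)
class Block:
    """Hypothetical unit of document structure used only for this sketch."""
    kind: str   # e.g. "header", "footer", "heading", "paragraph", "table"
    style: str  # e.g. "Heading 1", "Normal", "Table Grid"
    text: str   # plain text content of the block


def structure_fidelity(reference: List[Block], candidate: List[Block]) -> float:
    """Similarity of the (kind, style) sequences, in the range 0.0 to 1.0."""
    ref_seq = [(b.kind, b.style) for b in reference]
    cand_seq = [(b.kind, b.style) for b in candidate]
    return SequenceMatcher(None, ref_seq, cand_seq, autojunk=False).ratio()


def text_fidelity(reference: List[Block], candidate: List[Block]) -> float:
    """Character-level similarity of the concatenated block text."""
    ref_text = "\n".join(b.text for b in reference)
    cand_text = "\n".join(b.text for b in candidate)
    return SequenceMatcher(None, ref_text, cand_text, autojunk=False).ratio()


def format_fidelity(
    reference: List[Block],
    candidate: List[Block],
    structure_weight: float = 0.5,  # assumed weighting, tune per benchmark
) -> float:
    """Weighted blend of structural and textual similarity."""
    if not 0.0 <= structure_weight <= 1.0:
        raise ValueError("structure_weight must be between 0 and 1")
    structure = structure_fidelity(reference, candidate)
    text = text_fidelity(reference, candidate)
    return structure_weight * structure + (1.0 - structure_weight) * text


if __name__ == "__main__":
    reference = [
        Block("heading", "Heading 1", "Quarterly Report"),
        Block("paragraph", "Normal", "Revenue grew in all regions."),
        Block("table", "Table Grid", "Region | Revenue"),
    ]
    # A converter that flattened the table into a plain paragraph.
    candidate = [
        Block("heading", "Heading 1", "Quarterly Report"),
        Block("paragraph", "Normal", "Revenue grew in all regions."),
        Block("paragraph", "Normal", "Region Revenue"),
    ]
    print(f"structure: {structure_fidelity(reference, candidate):.3f}")
    print(f"overall:   {format_fidelity(reference, candidate):.3f}")
```

In practice the block lists would come from parsing the source PDF and the generated .docx, and a real benchmark would likely score headers, footers, tables, and styles separately rather than folding them into one number.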

