About the job
Your Mission
Data at Scale: Take ownership of the robust pipelines and storage systems that empower petabyte-scale multimodal datasets for model training.
Sustainable Platforms: Develop automated and efficient tooling and systems that facilitate processing at scale while managing numerous small heterogeneous datasets.
Essential Qualifications
Data Engineering Expertise: Proficiency in Python ETL pipelines, infrastructure, data formats, and scalable storage systems.
ML Data Operations: Experience in managing datasets, annotations, and data versioning for model training purposes.
Fundamental ML Knowledge: A solid understanding of machine learning principles essential for effective collaboration with researchers and sound data platform decision-making.
Agentic Engineering: Ability to articulate high-quality specifications for AI agents while ensuring effective human review of AI-generated outputs.
Key Responsibilities
Design, automate, maintain, and optimize Python ETL pipelines (using Spark/Ray) for extensive multimodal data.
Develop and oversee data cataloging, lineage, quality assurance tools, integrity checks, access controls, and lifecycle management systems.
Provide mentorship, internal tools, and documentation to team members regarding data best practices.
Act as the custodian of the organization’s datasets, ensuring data health, quality, and discoverability.
Challenges You Will Face
Implement high-performance multimodal data pipelines capable of processing petabyte-scale datasets across thousands of CPUs and hundreds of GPUs.
Adapt data formats, storage, and processing methods to align with the latest AI advancements while ensuring backward compatibility.
Scale data infrastructure to accommodate substantial growth.
Ensure the data platform remains agile enough to handle numerous small heterogeneous datasets and ad hoc analytics queries.
Ideal Candidate Traits
Proactive Ownership: Takes initiative in prioritizing and managing tasks effectively while escalating issues promptly when priorities shift.
