
Data Engineer - Data Infrastructure at Runway ML | Remote

Runway ML | Remote | Full-time

Qualifications

To thrive in this role, candidates should possess:

Proficiency in data engineering and experience building scalable data infrastructure.
Hands-on experience with data pipeline construction, data warehousing, and machine learning datasets.
Familiarity with tools and technologies such as LanceDB, ClickHouse, BigQuery, Ray, Kubernetes, dbt, Prometheus, Grafana, and Terraform.
Strong problem-solving skills and the ability to tackle complex data challenges.
Excellent communication and teamwork skills.

About the job

At Runway, we are at the forefront of merging art and science to develop advanced AI technologies that simulate the world around us.

We believe that creating world models is essential for advancing artificial intelligence, as traditional language models cannot tackle society's most pressing challenges in areas such as robotics, health, and scientific discovery. By using simulation to accelerate learning through trial and error, we can achieve unprecedented advances in these fields.

Our vision is to transform storytelling, scientific exploration, and humanity's future through the power of world models.

We pride ourselves on our team of innovative, empathetic, and driven individuals who are committed to making a significant impact. If you share our passion for pushing boundaries and achieving the impossible, we invite you to join us.

Role Overview

We are seeking a skilled Data Engineer to enhance and expand the data infrastructure that underpins our AI research and business intelligence efforts. In this pivotal role, you will oversee crucial data pipelines that connect production databases, analytics warehouses, and extensive machine learning training datasets. Your expertise will bridge data engineering, ML infrastructure, and analytics, empowering both cutting-edge research and data-driven business strategies.

You will tackle complex challenges at scale, including managing billions of rows of multimodal training data, constructing CDC streams from production systems, optimizing vector databases for ML workflows, and laying down the foundational data layer that supports our entire organization.

Technical Stack Insights

Our data infrastructure incorporates a variety of specialized systems, including LanceDB for vector storage and dataset versioning with multimodal training data, ClickHouse as our analytics warehouse that receives CDC streams from production Postgres through AWS Kinesis, and BigQuery for training run logs and evaluation results. We utilize Ray for large-scale distributed data processing on managed Kubernetes clusters, which handle preprocessing, feature generation, and dataset curation at scale.

We are actively enhancing our data platform by implementing dbt for standardized transformations, refining dataset versioning and data lineage tracking, scaling data sourcing pipelines, and establishing robust data quality practices. We employ Prometheus and Grafana for monitoring, alongside Terraform for infrastructure management. This role offers the chance to introduce best practices and cutting-edge technologies into our evolving data landscape.

About Runway ML

Runway ML is dedicated to revolutionizing the intersection of AI, art, and science. Our mission is to create advanced models that can simulate real-world experiences, pushing the boundaries of what is possible in technology and creativity. Join us as we strive to innovate and transform various fields through the power of AI.
