About the job
At Runway, we are at the forefront of merging art and science to develop advanced AI technologies that simulate the world around us.
We believe that creating world models is essential for advancing artificial intelligence, because language models alone cannot address society's most pressing challenges in areas such as robotics, healthcare, and scientific discovery. By leveraging simulation to accelerate learning through trial and error, we can achieve unprecedented advances in these fields.
Our vision is to transform storytelling, scientific exploration, and humanity's future through the power of world models.
We pride ourselves on our team of innovative, empathetic, and driven individuals who are committed to making a significant impact. If you share our passion for pushing boundaries and achieving the impossible, we invite you to join us.
Role Overview
We are seeking a skilled Data Engineer to enhance and expand the data infrastructure that underpins our AI research and business intelligence efforts. In this pivotal role, you will oversee crucial data pipelines that connect production databases, analytics warehouses, and extensive machine learning training datasets. Your expertise will bridge data engineering, ML infrastructure, and analytics, empowering both cutting-edge research and data-driven business strategies.
You will tackle complex challenges at scale, including managing billions of rows of multimodal training data, constructing CDC streams from production systems, optimizing vector databases for ML workflows, and laying down the foundational data layer that supports our entire organization.
Technical Stack Insights
Our data infrastructure incorporates several specialized systems: LanceDB for vector storage and dataset versioning of multimodal training data; ClickHouse as our analytics warehouse, fed by CDC streams from production Postgres via AWS Kinesis; and BigQuery for training-run logs and evaluation results. We use Ray for large-scale distributed data processing on managed Kubernetes clusters, handling preprocessing, feature generation, and dataset curation at scale.
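To give a flavor of the CDC replication work described above, here is a minimal, illustrative sketch of folding a stream of Postgres change events into the latest state per row. The event shape and field names ("op", "id", "row") are hypothetical; a real pipeline would consume records from AWS Kinesis and write to ClickHouse rather than an in-memory dict.

```python
# Minimal sketch of applying CDC change events to a keyed table.
# Field names ("op", "id", "row") are hypothetical stand-ins; a real
# pipeline would read from AWS Kinesis and write to ClickHouse.

def apply_cdc_events(table: dict, events: list[dict]) -> dict:
    """Fold insert/update/delete events into a table keyed by
    primary key, keeping only the latest state of each row."""
    for event in events:
        op, key = event["op"], event["id"]
        if op in ("insert", "update"):
            table[key] = event["row"]
        elif op == "delete":
            table.pop(key, None)
    return table

events = [
    {"op": "insert", "id": 1, "row": {"name": "ada"}},
    {"op": "update", "id": 1, "row": {"name": "ada lovelace"}},
    {"op": "insert", "id": 2, "row": {"name": "alan"}},
    {"op": "delete", "id": 2},
]
state = apply_cdc_events({}, events)
# state holds only row 1, in its most recent version
```

The core design point this illustrates is that downstream state depends on event order, which is why CDC pipelines care about per-key ordering guarantees in the stream.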
We are actively enhancing our data platform by implementing dbt for standardized transformations, refining dataset versioning and data lineage tracking, scaling data sourcing pipelines, and establishing robust data quality practices. We employ Prometheus and Grafana for monitoring, alongside Terraform for infrastructure management. This role offers the chance to introduce best practices and cutting-edge technologies into our evolving data landscape.
