About Us
At Matterworks, we are at the forefront of developing advanced AI tools designed to extract critical insights from the vast and expanding universe of biological data. Our mission is to unlock new possibilities in therapeutic discovery, development, and manufacturing through the innovative application of large-scale deep learning models that predict the phenotype and behavior of biological systems.
Position Overview
We are looking for a talented Data Manager specializing in Bioinformatics and Cheminformatics to spearhead our data management initiatives. In this pivotal role, you will take ownership of the strategy, processes, and daily operations required to transform complex and untidy chemical and biological datasets into high-quality, well-governed training corpora and product-ready data assets.
You will serve as a crucial link among the applied sciences, AI, and product teams and our data platform/engineering team, addressing questions about data value, onboarding efficiency, and long-term data quality.
This position starts as an individual contributor role with comprehensive ownership and a clear pathway to leading a function as our organization scales.
Key Responsibilities
Data Strategy & Ownership: Collaborate with scientific, machine learning, and product stakeholders to define a strategic data roadmap that identifies impactful datasets, establishes refresh cycles, and clarifies quality standards for each use case. Set clear success metrics for onboarding speed, dataset quality, and downstream usability, with the aim of reducing training and data failures and improving match rates.
Dataset Sourcing and Integration: Actively seek and incorporate public and client datasets, relevant literature, and reference materials to ensure our corpora remain current and comprehensive. Create a standardized dataset intake workflow encompassing provenance, source tracking, and refresh cadence.
Data Curation and Governance: Establish curation standards that harmonize data across various sources and modalities, manage compound identities, standardize biological/sample metadata, and facilitate schema and convention mappings. Develop a scalable framework for integrating metabolomics and other omics data without starting from scratch each time. Implement practical quality control and assurance frameworks that combine scientific judgment with repeatable checks.
Cross-Functional Collaboration: Engage closely with leadership across engineering, AI, product, and scientific domains to optimize data management practices.

