About the job
Wynd Labs creates infrastructure that enables organizations to access large-scale web data for advanced AI model training. The team operates Grass, a bandwidth-sharing network powering a distributed web crawler that gathers high-quality public data from around the globe. Wynd Labs also manages data pipelines that process and annotate billions of videos, transcripts, and audio files, supporting research labs in building datasets.
The company emphasizes a lean structure and rapid iteration, keeping bureaucracy to a minimum to focus on progress in open web data and AI.
Role overview
The Research Crawling Engineer (Remote) focuses on designing and running systems for large-scale web data acquisition to support research and model development. This position involves working across distributed systems, scraping frameworks, and complex data pipelines.
Main responsibilities
- Develop and maintain web crawlers that operate across multiple domains at scale.
- Build high-throughput, fault-tolerant systems capable of collecting data from millions to billions of URLs each day.
- Address anti-bot protections, rate limiting, and the challenges of dynamic or JavaScript-heavy websites.
- Design data pipelines for cleaning, deduplication, filtering, and normalization.
- Create and manage datasets tailored for research and model training.
- Monitor crawl performance, coverage, and data quality, iterating quickly based on feedback.
- Collaborate with research teams to ensure data collection aligns with modeling needs.
- Optimize infrastructure for cost efficiency, low latency, and reliability.
Requirements
- Proficiency in at least one of these languages: Go, Rust, Python, Java, or C++.
- Direct experience building web crawlers or large-scale data pipelines.
- Strong understanding of HTTP, networking, and browser behavior.
- Familiarity with distributed systems and parallel processing.
- Experience working with large datasets (terabyte to petabyte scale preferred).
- Comfort troubleshooting in unstable or adversarial environments.
Preferred skills
- Experience with NLP techniques and frameworks.
