
Research Crawling Engineer

Wynd Labs · Remote · Full-time



About the job

Wynd Labs creates infrastructure that enables organizations to access large-scale web data for advanced AI model training. The team operates Grass, a bandwidth-sharing network powering a distributed web crawler that gathers high-quality public data from around the globe. Wynd Labs also manages data pipelines that process and annotate billions of videos, transcripts, and audio files, supporting research labs in building datasets.

The company emphasizes a lean structure and rapid iteration, keeping bureaucracy to a minimum to focus on progress in open web data and AI.

Role overview

The Research Crawling Engineer designs and operates systems for large-scale web data acquisition in support of research and model development, working across distributed systems, scraping frameworks, and complex data pipelines.

Main responsibilities

  • Develop and maintain web crawlers that operate across multiple domains at scale.
  • Build high-throughput, fault-tolerant systems capable of collecting data from millions to billions of URLs each day.
  • Address anti-bot protections, rate limiting, and the challenges of dynamic or JavaScript-heavy websites.
  • Design data pipelines for cleaning, deduplication, filtering, and normalization.
  • Create and manage datasets tailored for research and model training.
  • Monitor crawl performance, coverage, and data quality, iterating quickly based on feedback.
  • Collaborate with research teams to ensure data collection aligns with modeling needs.
  • Optimize infrastructure for cost efficiency, low latency, and reliability.
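To make the first few responsibilities concrete, here is a minimal single-process sketch of the kind of crawl-frontier logic involved: URL canonicalization, hash-based deduplication, and a per-domain politeness delay. The class and method names are illustrative, not part of Wynd Labs' stack; a production system would shard this scheduler across many workers.

```python
import hashlib
import time
from collections import deque
from urllib.parse import urlsplit, urlunsplit

class Frontier:
    """Tiny in-memory crawl frontier: canonicalizes URLs, deduplicates
    them by hash, and enforces a per-domain politeness delay."""

    def __init__(self, per_domain_delay=1.0):
        self.per_domain_delay = per_domain_delay
        self.queue = deque()
        self.seen = set()        # SHA-1 digests of canonical URLs
        self.next_allowed = {}   # domain -> earliest allowed fetch time

    @staticmethod
    def canonicalize(url):
        # Lowercase scheme and host, strip fragments and default ports.
        parts = urlsplit(url.strip())
        host = parts.netloc.lower().removesuffix(":80").removesuffix(":443")
        path = parts.path or "/"
        return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

    def add(self, url):
        canon = self.canonicalize(url)
        digest = hashlib.sha1(canon.encode()).hexdigest()
        if digest in self.seen:
            return False         # duplicate: already queued or fetched
        self.seen.add(digest)
        self.queue.append(canon)
        return True

    def next_url(self, now=None):
        """Return the next URL whose domain is out of its politeness
        window, or None if every queued domain must still wait."""
        now = time.monotonic() if now is None else now
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            domain = urlsplit(url).netloc
            if self.next_allowed.get(domain, 0.0) <= now:
                self.next_allowed[domain] = now + self.per_domain_delay
                return url
            self.queue.append(url)   # rotate and try another domain
        return None
```

Canonicalizing before hashing is what lets deduplication catch `https://Example.com/a#frag` and `https://example.com/a` as the same page; the domain rotation in `next_url` keeps one slow or rate-limited site from stalling the whole queue.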

Requirements

  • Proficiency in at least one of these languages: Go, Rust, Python, Java, or C++.
  • Direct experience building web crawlers or large-scale data pipelines.
  • Strong understanding of HTTP, networking, and browser behavior.
  • Familiarity with distributed systems and parallel processing.
  • Experience working with large datasets (terabyte to petabyte scale preferred).

Comfort troubleshooting in unstable or adversarial environments is important for this role.
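A common building block for such environments is retrying transient failures (rate limits, gateway errors) with exponential backoff and jitter. The sketch below is a generic version of that policy, not a Wynd Labs API; the `fetch`, `sleep`, and `rng` parameters are injected so the logic can be tested without a network.

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}  # rate-limit / transient statuses

def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=0.5,
                       sleep=time.sleep, rng=random.random):
    """Call `fetch(url) -> (status, body)` up to `max_attempts` times,
    sleeping with exponential backoff plus full jitter between retries."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            # Full jitter: random fraction of the doubling delay cap,
            # which spreads retries out instead of synchronizing them.
            sleep(rng() * base_delay * (2 ** attempt))
    return status, body
```

The jitter matters at crawl scale: without it, thousands of workers that hit the same 429 retry in lockstep and trip the rate limiter again.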

Preferred skills

  • Experience with NLP techniques and frameworks is a plus.

About Wynd Labs

Wynd Labs is at the forefront of web data infrastructure, enabling organizations to access and utilize vast amounts of information for AI model training. Our innovative solutions and rapid development processes allow us to stay ahead in the evolving landscape of data acquisition and AI technologies.
