3 Types of Data Collection Every ML Engineer Should Know

Aimproved.com
Nov 27, 2023

If you're building machine learning models and still treating data collection as an afterthought, it's time to rethink your process.

In 2023, as foundation models dominate headlines and AI systems become part of daily life, data collection has become a strategic competitive advantage.

Here are three core types of data collection every ML practitioner should understand — and when to use them.

1. Manual Labeling (Human-in-the-Loop)

Still one of the most reliable — and expensive — methods, manual labeling involves human annotators tagging images, sentences, or audio clips with relevant labels.

It’s crucial for:

Medical datasets (e.g., radiology image labeling)
Self-driving car perception
Sentiment analysis in domain-specific corpora

While costly, human supervision helps avoid errors that automation can’t catch, especially in high-risk fields.

Pro tip: Use platforms like Labelbox, Scale AI, or Amazon SageMaker Ground Truth for efficient pipeline management.

2. Web Scraping (and Its Risks)

The internet remains one of the largest and most dynamic data sources. NLP projects often rely on web scraping to extract content from forums, news sites, Wikipedia, and Reddit.

Scraping is fast, but messy:

You’ll get noise, redundancy, and bias.
Legal and ethical boundaries are murky.
Anti-bot protections can disrupt collection efforts.
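To make the noise and redundancy point concrete, here is a minimal cleanup sketch in Python. It assumes the scraped pages have already been reduced to plain-text snippets. The function names and the `min_words` threshold are illustrative choices, not part of any particular scraping library.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_scraped(docs: list[str], min_words: int = 5) -> list[str]:
    """Drop very short snippets and exact duplicates (after normalization).

    Keeps the first occurrence of each snippet, in original order.
    min_words is an assumed cutoff; tune it for your own corpus.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # likely navigation text or other page noise
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already kept
        seen.add(digest)
        kept.append(doc.strip())
    return kept


raw = [
    "Great phone, the battery lasts two full days.",
    "Great  phone, the battery lasts   two full days.\n",
    "Login | Share",
    "The camera struggles badly in low light.",
]
cleaned = clean_scraped(raw)
print(cleaned)
```

This only catches exact duplicates after whitespace and case normalization. Near-duplicates and biased sources need separate handling, and nothing here addresses the legal questions above.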

2023 trend: More teams are shifting from scraping to partnership-based APIs to source structured data (e.g., Reddit’s API for language data).

3. User-Generated Feedback (Reinforcement Learning)

A newer but increasingly important source, user feedback is used to fine-tune models like ChatGPT through Reinforcement Learning from Human Feedback (RLHF).

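Users interact with a model → rate outputs → engineers use that feedback to fine-tune the model. In RLHF, the ratings typically train a reward model, which then guides the fine-tuning. This feedback loop helps align the model with human preferences, though it does not guarantee alignment.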
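As a rough illustration of the "rate outputs" step, the sketch below turns logged ratings into preference pairs that a reward model could be trained on. The `Feedback` record, the 1 to 5 rating scale, and the `prompt`/`chosen`/`rejected` keys are assumptions made for this example. The key names follow a layout commonly used for preference datasets, but check the documentation of whichever training library you use.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Feedback:
    prompt: str
    response: str
    rating: int  # hypothetical 1-5 user rating


def build_preference_pairs(feedback: list[Feedback]) -> list[dict[str, str]]:
    """Pair up responses to the same prompt; the higher-rated one is 'chosen'."""
    by_prompt: dict[str, list[Feedback]] = defaultdict(list)
    for item in feedback:
        by_prompt[item.prompt].append(item)

    pairs: list[dict[str, str]] = []
    for prompt, items in by_prompt.items():
        for a, b in combinations(items, 2):
            if a.rating == b.rating:
                continue  # a tie carries no preference signal
            chosen, rejected = (a, b) if a.rating > b.rating else (b, a)
            pairs.append(
                {
                    "prompt": prompt,
                    "chosen": chosen.response,
                    "rejected": rejected.response,
                }
            )
    return pairs


log = [
    Feedback("Explain overfitting.", "The model memorizes noise in the training data.", 5),
    Feedback("Explain overfitting.", "It is when a model is bad.", 2),
    Feedback("Explain overfitting.", "Overfitting is a kind of error.", 2),
    Feedback("Define recall.", "True positives divided by all actual positives.", 4),
]
pairs = build_preference_pairs(log)
print(len(pairs))
```

Prompts with only one rated response produce no pairs, which is one reason feedback collection often shows users two candidate answers side by side.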

Key tools: OpenAI’s Prompt + Feedback APIs, Hugging Face’s TRL library.
