Data Labeling
Definition
The process of adding informative tags or annotations to raw data (images, text, audio) so that supervised machine learning models can learn from labeled examples.
Data labeling is the human-intensive process that makes supervised learning possible. Labelers annotate data by classifying images (cat vs. dog), drawing bounding boxes around objects, transcribing speech, marking named entities in text, rating the quality of AI outputs for RLHF, and many other tasks. Data labeling is a multibillion-dollar industry, with companies like Scale AI, Labelbox, and Appen providing labeling services and platforms. Quality control is essential: inter-annotator agreement measures quantify how consistently different labelers annotate the same examples, and consensus approaches assign multiple labelers to each example and combine their answers. The rise of LLMs has introduced AI-assisted labeling (an AI model pre-labels the data and humans correct it) and fully synthetic labeling, but human annotation remains critical for complex tasks and for generating the preference data used in RLHF. The working conditions and compensation of data labelers have become an important ethical topic.
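The two quality-control ideas mentioned above can be illustrated with a minimal sketch. It assumes Cohen's kappa as the agreement measure for two labelers and a simple majority vote as the consensus rule; these are common choices, not the only ones, and the function names and sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of examples where both labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same class
    # if each annotator labeled independently at their own class rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    classes = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical class throughout; kappa is
        # formally undefined here, and returning 1.0 is a convention
        # chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def majority_vote(votes):
    """Return the most common label, or None when the top count is tied."""
    counts = Counter(votes)
    top_count = max(counts.values())
    winners = [label for label, count in counts.items() if count == top_count]
    return winners[0] if len(winners) == 1 else None


# Hypothetical labels from two annotators on four images.
annotator_a = ["cat", "cat", "dog", "dog"]
annotator_b = ["cat", "cat", "dog", "cat"]
kappa = cohens_kappa(annotator_a, annotator_b)  # 0.5

# Hypothetical votes from three labelers on one image.
consensus = majority_vote(["cat", "dog", "cat"])  # "cat"
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. Returning None on a tied vote leaves room for sending the example to an additional labeler or an expert reviewer, which is one way a labeling pipeline might handle disagreement.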
Related Terms
Supervised Learning
A machine learning approach where models are trained on labeled data, learning to map inputs to know...
RLHF (Reinforcement Learning from Human Feedback)
A training technique where a language model is aligned with human preferences by training a reward m...
Training Data
The dataset used to teach a machine learning model, consisting of examples from which the model lear...
Annotation
The specific labels, tags, or metadata added to data elements during the data labeling process, prov...