
Data Labeling

Definition

The process of adding informative tags or annotations to raw data (images, text, audio) so that machine learning models can learn from labeled examples in supervised learning.

Data labeling is the human-intensive process that makes supervised learning possible. Labelers annotate data by classifying images (cat vs. dog), drawing bounding boxes around objects, transcribing speech, marking named entities in text, and rating the quality of AI outputs for RLHF (reinforcement learning from human feedback), among many other tasks. The data labeling market is a multibillion-dollar industry, with companies like Scale AI, Labelbox, and Appen providing labeling services and platforms. Quality control is essential: inter-annotator agreement metrics measure how consistently different labelers annotate the same examples, and consensus approaches assign multiple labelers to each example and combine their labels. The rise of LLMs has introduced AI-assisted labeling (a model pre-labels the data and humans correct it) and fully synthetic labeling, but human annotation remains critical for complex tasks and for generating the preference data used in RLHF. The working conditions and compensation of data labelers have become an important ethical topic.
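As a rough illustration of the quality-control ideas above, the sketch below computes Cohen's kappa (one common inter-annotator agreement metric for two labelers) and a simple majority-vote consensus. The function names `cohen_kappa` and `majority_vote` and the cat/dog labels are hypothetical, chosen only for this example; real labeling platforms may use different metrics and tie-breaking rules.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None if the top labels are tied."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels).most_common()
    # A tie for first place means there is no consensus.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers who annotated the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both labelers must annotate the same non-empty set")
    n = len(labels_a)
    # Observed agreement: fraction of examples with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each labeler's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both labelers used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations from two labelers on four images.
labeler_a = ["cat", "cat", "dog", "dog"]
labeler_b = ["cat", "dog", "dog", "dog"]
kappa = cohen_kappa(labeler_a, labeler_b)

# Consensus label for one image annotated by three labelers.
consensus = majority_vote(["cat", "dog", "cat"])
```

A kappa of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance. Returning `None` on a tie is one possible design choice; in practice a tied example might instead be routed to an additional labeler or an expert reviewer.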
