
Training Data

Definition

The dataset used to teach a machine learning model, consisting of examples from which the model learns patterns, relationships, and representations.

Training data is the foundation on which all ML models are built: its quality, diversity, and scale largely determine a model's capabilities and limitations. For large language models, training data typically consists of trillions of tokens drawn from web pages, books, code repositories, academic papers, and other text sources. Curating high-quality training data involves web crawling, deduplication, filtering out toxic and low-quality content, handling personally identifiable information, and balancing representation across languages, topics, and perspectives. The provenance of training data has become a major legal and ethical issue, with content creators and publishers filing lawsuits over its use. Companies increasingly invest in proprietary high-quality data as a competitive moat. Data quality often matters more than data quantity: a carefully curated smaller dataset can outperform a larger, noisier one.
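The curation steps named above (deduplication, quality filtering, and handling of personally identifiable information) can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how any production pipeline works: the names `curate`, `MIN_WORDS`, and `BLOCKLIST` are hypothetical, the quality filter is a simple word-count and blocklist check, and PII handling is reduced to masking email addresses with a regular expression. Real pipelines generally rely on near-duplicate detection, trained classifiers, and much broader PII coverage.

```python
import hashlib
import re

# Hypothetical thresholds and word list, chosen only for illustration.
MIN_WORDS = 5
BLOCKLIST = {"badword"}

# Simplified email pattern; real PII detection covers far more than emails.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def curate(docs: list[str]) -> list[str]:
    """Return documents that survive exact dedup and basic filters, with emails masked."""
    seen_hashes = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)

        # Exact deduplication on the normalized text.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)

        # Crude quality filter: drop very short documents.
        words = norm.split()
        if len(words) < MIN_WORDS:
            continue

        # Crude content filter: drop documents containing a blocklisted word.
        if any(word in BLOCKLIST for word in words):
            continue

        # Mask email addresses in the original (non-normalized) text.
        kept.append(EMAIL_RE.sub("[EMAIL]", doc))
    return kept
```

Calling `curate` on a small list of strings returns only the documents that pass every step, in their original order. The stages run cheapest-first here so that duplicates are discarded before any further checks, which is a design choice of this sketch rather than a requirement.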
