Dataset
Definition
A structured collection of data organized for training, evaluating, or testing machine learning models, often curated with specific tasks or research goals in mind.
Datasets are the currency of machine learning research and development. Landmark datasets have driven major advances: ImageNet (14M labeled images) catalyzed the deep learning revolution in computer vision, SQuAD enabled progress in reading comprehension, and The Pile and Common Crawl provide web-scale text for LLM training. Datasets are typically split into training (for learning), validation (for hyperparameter tuning), and test (for final evaluation) sets. Important dataset considerations include size, quality, diversity, bias, licensing, and documentation. Model Cards and Datasheets for Datasets are frameworks for documenting model and dataset characteristics. The Hugging Face Hub hosts over 100,000 datasets, making them accessible to the research community. Data governance, licensing, and ethical collection practices have become increasingly important as AI systems scale.
Related Terms
Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models on specific...
Training Data
The dataset used to teach a machine learning model, consisting of examples from which the model lear...
Data Labeling
The process of adding informative tags or annotations to raw data (images, text, audio) so that mach...
Annotation
The specific labels, tags, or metadata added to data elements during the data labeling process, prov...