Back to GlossaryData

Dataset

Definition

A structured collection of data organized for training, evaluating, or testing machine learning models, often curated with specific tasks or research goals in mind.

Datasets are the currency of machine learning research and development. Landmark datasets have driven major advances: ImageNet (14M labeled images) catalyzed the deep learning revolution in computer vision, SQuAD enabled progress in reading comprehension, and The Pile and Common Crawl provide web-scale text for LLM training. Datasets are typically split into training (for learning), validation (for hyperparameter tuning), and test (for final evaluation) sets. Important dataset considerations include size, quality, diversity, bias, licensing, and documentation. Model Cards and Datasheets for Datasets are frameworks for documenting model and dataset characteristics. The Hugging Face Hub hosts over 100,000 datasets, making them accessible to the research community. Data governance, licensing, and ethical collection practices have become increasingly important as AI systems scale.

Companies in Data

View Data companies →