Benchmark
Definition
A standardized test or dataset used to evaluate and compare the performance of AI models on specific tasks, providing consistent metrics across different systems.
Benchmarks are the yardsticks of AI progress, offering objective measurements that allow fair comparison between different models and approaches. Major LLM benchmarks include MMLU (broad knowledge), HumanEval (coding), GSM8K (math), HellaSwag (commonsense reasoning), and the LMSYS Chatbot Arena (head-to-head human preference). For computer vision, ImageNet and COCO remain standard. Benchmarks drive research priorities — when a benchmark is "saturated" (models achieve near-perfect scores), new, harder benchmarks are needed. Critics note that benchmark performance doesn't always reflect real-world capability, and models can be optimized specifically for benchmarks without improving general ability (an instance of Goodhart's Law). Despite these limitations, benchmarks remain essential for tracking progress and guiding model development.
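At its core, benchmark evaluation is just scoring a model's answers against a fixed answer key. A minimal sketch, assuming a hypothetical `model_answer` function standing in for a real model call, over a toy MMLU-style multiple-choice set:

```python
# Benchmark scoring sketch: exact-match accuracy over a toy multiple-choice
# set. `model_answer` is a hypothetical stand-in for a real model call,
# and `toy_set` is illustrative data, not a real benchmark.
def model_answer(question: str, choices: list[str]) -> str:
    # Trivial placeholder model: always picks the first choice.
    return choices[0]

def evaluate(dataset: list[dict]) -> float:
    """Return the fraction of questions answered correctly (exact match)."""
    correct = sum(
        model_answer(ex["question"], ex["choices"]) == ex["answer"]
        for ex in dataset
    )
    return correct / len(dataset)

toy_set = [
    {"question": "2 + 2 = ?", "choices": ["4", "5"], "answer": "4"},
    {"question": "Capital of France?", "choices": ["Rome", "Paris"], "answer": "Paris"},
]
print(evaluate(toy_set))  # 0.5 with this trivial model
```

Real harnesses add details this sketch omits — prompt templates, answer extraction, and aggregation across subjects — but the scoring loop is the same shape.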
Related Terms
Accuracy
The proportion of correct predictions out of total predictions made by a model, the simplest and mos...
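The accuracy definition above is a single ratio, which a short sketch makes concrete:

```python
# Accuracy: correct predictions divided by total predictions.
def accuracy(preds, labels):
    assert len(preds) == len(labels), "predictions and labels must align"
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75 — 3 of 4 correct
```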
MMLU
Massive Multitask Language Understanding — a benchmark testing AI models across 57 academic subjects...
Perplexity
A metric that measures how well a language model predicts text, calculated as the exponential of the...
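Perplexity is the exponential of the average negative log-likelihood the model assigns to the true tokens. A minimal sketch, assuming per-token probabilities are already available:

```python
import math

# Perplexity from the probabilities a model assigned to the true next
# tokens: exp of the average negative log-likelihood.
def perplexity(token_probs):
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every true token has
# perplexity 4 — equivalent to choosing among 4 equally likely options.
print(perplexity([0.25, 0.25, 0.25]))  # ≈ 4.0
```

Lower perplexity means the model is less "surprised" by the text; a perplexity of k corresponds to the uncertainty of a uniform choice among k options.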
HumanEval
A benchmark for evaluating AI code generation by testing whether models can write correct Python fun...
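HumanEval-style benchmarks score functional correctness: a generated completion passes if the benchmark's unit tests run without error. A minimal sketch, where the `completion` string is a hypothetical model output rather than real HumanEval data:

```python
# Functional-correctness check in the style of HumanEval: execute the
# generated code together with its unit tests and see if anything fails.
# `completion` below is a hypothetical model output, not benchmark data.
completion = """
def add(a, b):
    return a + b
"""

def passes(code: str, test: str) -> bool:
    """Return True if the code plus its test runs without raising."""
    namespace = {}
    try:
        # Caution: exec runs arbitrary code; real harnesses sandbox this.
        exec(code + "\n" + test, namespace)
        return True
    except Exception:
        return False

print(passes(completion, "assert add(2, 3) == 5"))  # True
```

Real evaluations sandbox execution and report pass@k (the chance at least one of k samples passes) rather than a single pass/fail.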