
Benchmark

Definition

A standardized test or dataset used to evaluate and compare the performance of AI models on specific tasks, providing consistent metrics across different systems.

Benchmarks are the yardsticks of AI progress, providing objective measurements that allow fair comparison between different models and approaches. Major LLM benchmarks include MMLU (broad knowledge), HumanEval (coding), GSM8K (math), HellaSwag (commonsense reasoning), and the LMSYS Chatbot Arena (head-to-head human preference). For computer vision, ImageNet and COCO remain standard. Benchmarks also drive research priorities: when a benchmark is "saturated" (models achieve near-perfect scores), new, harder benchmarks must be created. Critics note that benchmark performance doesn't always reflect real-world capability, and that models can be optimized specifically for benchmarks without gaining general ability (an instance of Goodhart's Law). Despite these limitations, benchmarks remain essential for tracking progress and guiding model development.
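The core idea above can be sketched in a few lines: a benchmark is just a fixed set of (prompt, expected answer) pairs plus a scoring rule applied identically to every model. The sketch below uses hypothetical toy data and a stand-in model, with exact-match accuracy as the scoring rule (as in GSM8K-style evaluation); it is illustrative, not a real harness.

```python
# Minimal sketch of benchmark evaluation (toy data, hypothetical model).
BENCHMARK = [
    {"prompt": "2 + 2 = ?", "answer": "4"},
    {"prompt": "Capital of France?", "answer": "Paris"},
    {"prompt": "5 * 3 = ?", "answer": "15"},
]

def toy_model(prompt: str) -> str:
    """Stand-in for a real model; answers two of the three items correctly."""
    canned = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "I don't know")

def evaluate(model, benchmark) -> float:
    """Exact-match accuracy: the same scoring rule for every model under test."""
    correct = sum(model(item["prompt"]) == item["answer"] for item in benchmark)
    return correct / len(benchmark)

print(f"accuracy = {evaluate(toy_model, BENCHMARK):.2f}")  # 2 of 3 correct
```

Because the dataset and scoring rule are frozen, any two models evaluated this way get directly comparable scores, which is exactly what makes benchmarks useful and also what makes them gameable.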
