HumanEval
Definition
A benchmark for evaluating AI code generation by testing whether models can write correct Python functions from docstrings, measured by the pass@k metric.
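To make the task format concrete, here is a sketch of what a HumanEval-style problem looks like: the model receives the signature and docstring, generates the body, and is graded by hidden unit tests. `is_palindrome` is an illustrative example, not one of the actual 164 problems.

```python
# Prompt given to the model: signature + docstring only.
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards.
    >>> is_palindrome("level")
    True
    >>> is_palindrome("hello")
    False
    """
    # Model-generated completion; judged solely on passing the tests.
    return s == s[::-1]

# Held-out unit tests determine functional correctness.
assert is_palindrome("level") is True
assert is_palindrome("hello") is False
```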
HumanEval was released by OpenAI in 2021 alongside the Codex model and has become the standard benchmark for code generation. It contains 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model must generate a complete function that passes all test cases.

Performance is measured by pass@k: the probability that at least one of k generated solutions passes all tests. When introduced, Codex achieved 28.8 percent pass@1; by 2025, leading models exceed 90 percent.

The benchmark has been extended to multiple languages (MultiPL-E) and harder problems (HumanEval+, and SWE-bench for real-world software engineering). HumanEval measures functional correctness but not code quality, efficiency, or real-world engineering skills, which has motivated these complementary benchmarks.
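The pass@k numbers above are computed with the unbiased estimator introduced in the HumanEval paper: draw n samples per problem, count the c that pass, and estimate pass@k as 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated for a problem
    c: number of samples that passed all unit tests
    k: budget of attempts being evaluated
    """
    if n - c < k:
        # Fewer failures than k draws: at least one success is guaranteed.
        return 1.0
    # Probability that all k drawn samples come from the n - c failures.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With 2 samples of which 1 passes, pass@1 is 0.5.
print(pass_at_k(2, 1, 1))  # 0.5
```

A model's benchmark score is then the mean of this estimate over all 164 problems. Sampling n > k solutions and using this estimator gives lower variance than literally drawing k samples per problem.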
Related Terms
Code Generation
The use of AI models to automatically write, complete, translate, or debug programming code based on...
Accuracy
The proportion of correct predictions out of total predictions made by a model, the simplest and mos...
Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models on specific...