
HumanEval

Definition

A benchmark for evaluating AI code generation by testing whether models can write correct Python functions from docstrings, measured by the pass@k metric.

HumanEval was released by OpenAI in 2021 alongside the Codex model and quickly became the standard benchmark for code generation. It contains 164 hand-written Python programming problems, each consisting of a function signature, a docstring, and a set of unit tests. The model must generate a complete function that passes all of the tests. Performance is measured by pass@k: the probability that at least one of k generated solutions passes every test. When the benchmark was introduced, Codex scored 28.8 percent pass@1; by 2025, leading models exceed 90 percent. The benchmark has since been extended to other programming languages (MultiPL-E) and to more demanding evaluation (HumanEval+, which adds more rigorous test cases, and SWE-bench, which targets real-world software engineering tasks). HumanEval measures functional correctness only, not code quality, efficiency, or broader engineering skill, which has motivated these complementary benchmarks.
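The pass@k metric is typically computed with the unbiased estimator from the Codex paper: generate n samples per problem, count the c that pass all tests, and estimate the probability that a random subset of k samples contains at least one correct solution. A minimal sketch (the function name and example numbers are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Codex paper):
    n = total samples generated for a problem,
    c = number of samples that pass all unit tests,
    k = evaluation budget.
    Returns the probability that at least one of k
    randomly drawn samples is correct."""
    if n - c < k:
        # Fewer incorrect samples than k: some draw must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples, 60 of which pass all tests.
print(pass_at_k(200, 60, 1))   # for k=1 this equals c/n = 0.3
print(pass_at_k(200, 60, 10))  # larger k gives a higher pass rate
```

The subtraction form (1 minus the probability that all k draws are incorrect) is numerically preferable to sampling k-subsets directly, and for k=1 it reduces to the simple fraction c/n. The benchmark-level score is this value averaged over all 164 problems.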
