
BLEU Score

Definition

Bilingual Evaluation Understudy — a metric that evaluates machine-generated text by measuring the overlap of n-grams (word sequences) between the generated output and reference translations.

BLEU was introduced in 2002 and became the standard metric for machine translation evaluation. It computes precision of n-grams (unigrams, bigrams, trigrams, 4-grams) between the generated text and one or more reference translations, with a brevity penalty for overly short outputs. Scores range from 0 to 1 (often reported as 0-100), with higher scores indicating more overlap with reference text.

While BLEU enabled automated evaluation that accelerated MT research, it has significant limitations: it cannot capture semantic similarity (two valid translations with different word choices may score poorly), it ignores fluency and grammatical correctness, and it correlates imperfectly with human judgments. Modern NLP evaluation increasingly uses model-based metrics like BERTScore and human evaluation, but BLEU remains widely reported for translation tasks.
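The mechanics described above can be sketched in a few lines of Python. This is a minimal, simplified implementation (whitespace tokenization, no smoothing) meant to illustrate the clipped n-gram precisions, the geometric mean, and the brevity penalty; the function name and tokenization scheme are illustrative assumptions, not a reference implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Simplified BLEU sketch: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty.
    Tokens are naively whitespace-split; no smoothing is applied."""
    cand = candidate.split()
    refs = [r.split() for r in references]

    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        # Clip each candidate n-gram count by its maximum count
        # in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(len(cand) - n + 1, 0)
        if total == 0 or clipped == 0:
            # Without smoothing, any zero precision drives the
            # geometric mean (and the score) to zero.
            return 0.0
        precisions.append(clipped / total)

    # Brevity penalty: compare candidate length c to the closest
    # reference length r; penalize only short candidates.
    c = len(cand)
    r = min((len(ref) for ref in refs), key=lambda rl: (abs(rl - c), rl))
    bp = 1.0 if c > r else math.exp(1 - r / c)

    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An exact match scores 1.0, while a candidate differing by one word scores somewhere between 0 and 1, driven down most by the higher-order n-gram precisions.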
