BLEU Score
Definition
Bilingual Evaluation Understudy — a metric that evaluates machine-generated text by measuring the overlap of n-grams (word sequences) between the generated output and reference translations.
BLEU was introduced in 2002 and became the standard metric for machine translation evaluation. It computes modified (clipped) precision of n-grams (unigrams through 4-grams) between the generated text and one or more reference translations, combines them as a geometric mean, and applies a brevity penalty to overly short outputs. Scores range from 0 to 1 (often reported as 0-100), with higher scores indicating more overlap with the reference text. While BLEU enabled the automated evaluation that accelerated MT research, it has significant limitations: it cannot capture semantic similarity (two valid translations with different word choices may score poorly), it largely ignores fluency and grammatical correctness, and it correlates imperfectly with human judgments. Modern NLP evaluation increasingly uses model-based metrics such as BERTScore alongside human evaluation, but BLEU remains widely reported for translation tasks.
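The computation described above can be sketched in a few lines of Python. This is a minimal, unsmoothed sentence-level BLEU for illustration only (the function and helper names are mine, not from any library); production code would typically use an established implementation such as sacreBLEU or NLTK, which also handle smoothing and corpus-level aggregation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        # Clip each n-gram's count by its maximum count in any reference,
        # so repeating a matched word cannot inflate precision.
        max_ref = Counter()
        for ref in refs:
            for gram, c in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: compare against the reference length closest
    # to the candidate length (ties broken toward the shorter one).
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

# An exact match scores 1.0; a near match scores in between.
print(bleu("the cat is on the mat", ["the cat is on the mat"]))  # → 1.0
print(bleu("the cat sat on the mat", ["the cat is on the mat"], max_n=2))
```

Note that without smoothing, a single zero n-gram precision drives the whole score to 0, which is one reason sentence-level BLEU is considered unreliable compared to corpus-level BLEU.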
Related Terms
Machine Translation
AI systems that automatically translate text or speech from one natural language to another.
Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models on specific...
F1 Score
The harmonic mean of precision and recall, providing a single metric that balances both the complete...