
Perplexity

Definition

A metric that measures how well a language model predicts text, calculated as the exponential of the average negative log-likelihood per token — lower perplexity indicates better prediction.
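Written out as a formula (a standard formulation matching the definition above), for a sequence of $N$ tokens $x_1, \dots, x_N$ and model probabilities $p$:

```latex
\mathrm{PPL}(x) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)
```

The inner sum is the average negative log-likelihood per token; exponentiating it converts that average back into an effective branching factor.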

Perplexity is the most fundamental intrinsic evaluation metric for language models. Intuitively, it measures how "surprised" the model is by real text: a perplexity of 10 means the model is, on average, as uncertain as if it were choosing among 10 equally likely options for each next token. Lower perplexity means the model assigns higher probability to the actual text and therefore models its patterns better. Perplexity is useful for comparing models during development (provided they share a tokenizer, since per-token probabilities are not comparable across different vocabularies), detecting distribution shift between training and test data, and tracking improvement across model generations. However, perplexity alone does not fully capture model quality: a model can have low perplexity yet still generate repetitive or unhelpful text. Modern evaluation therefore increasingly pairs perplexity with task-specific benchmarks and human preference ratings.
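The intuition above can be checked with a few lines of code. This is a minimal sketch (the `perplexity` helper and its inputs are illustrative, not any particular library's API): given the probabilities a model assigned to each actual next token, it computes the exponential of the average negative log-likelihood.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token.

    token_probs: the probability the model assigned to each token that
    actually occurred in the text (one value per token, each in (0, 1]).
    """
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.1 to every observed token is exactly
# as uncertain as choosing among 10 equally likely options per token,
# so its perplexity is 10 (up to floating-point error).
print(perplexity([0.1, 0.1, 0.1, 0.1]))

# Assigning higher probability to the actual text lowers perplexity.
print(perplexity([0.5, 0.5, 0.5, 0.5]))  # lower than the case above
```

Note that averaging log-probabilities (rather than multiplying raw probabilities) keeps the computation numerically stable for long sequences.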
