
Distillation

Definition

A technique where a smaller "student" model is trained to replicate the behavior of a larger "teacher" model, compressing knowledge into a more efficient form.

Knowledge distillation, introduced by Hinton et al. in 2015, transfers the knowledge captured by a large, expensive model into a smaller, faster model suitable for deployment. The student model learns not only from the hard labels (correct answers) but also from the teacher's soft probability distributions, which carry richer information about the relationships between classes. For example, a teacher's output for a cat image might assign a small probability to "dog" but almost zero to "car" — this relative structure helps the student generalize better than hard labels alone.

Distillation has become crucial for deploying LLMs efficiently: models like Gemma, Phi, and many specialized models are distilled from larger models. The technique enables running capable AI on mobile devices, edge hardware, and in latency-sensitive applications.
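As a minimal sketch of how training on soft targets works, the snippet below implements a Hinton-style distillation loss in pure Python: the teacher's logits are softened with a temperature, and the student is penalized with a blend of a KL-divergence term against the teacher and a standard cross-entropy term on the hard label. The temperature and `alpha` values are illustrative, not prescribed by any particular model.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the
    # teacher's relative preferences among the wrong classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    """Blend of soft (teacher) and hard (true label) terms.

    alpha weights the soft term; the T^2 factor keeps gradient
    magnitudes comparable as the temperature changes.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    # KL(teacher || student) over the softened distributions
    soft = sum(p * math.log(p / q) for p, q in zip(t, s) if p > 0)
    # Standard cross-entropy on the true label, at T = 1
    hard = -math.log(softmax(student_logits)[hard_label])
    return alpha * (temperature ** 2) * soft + (1 - alpha) * hard

# Classes: [cat, dog, car]. The teacher leaves a little mass on
# "dog" and almost none on "car" — the signal hard labels discard.
teacher = [5.0, 2.0, -3.0]
student = [2.0, 1.0, 0.0]
loss = distillation_loss(teacher, student, hard_label=0)
```

In practice the same loss is computed over batches with autodiff (e.g. `torch.nn.KLDivLoss` plus cross-entropy), but the structure is the same: minimizing it pulls the student's softened distribution toward the teacher's while still fitting the ground-truth labels.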
