Distillation
Definition
A technique in which a smaller "student" model is trained to replicate the outputs of a larger "teacher" model, compressing its knowledge into a more efficient form.
Knowledge distillation, introduced by Hinton et al. in 2015, transfers the knowledge captured by a large, expensive model into a smaller, faster model suitable for deployment. The student model learns not just from the hard labels (correct answers) but from the teacher's soft probability distributions, which contain richer information about relationships between classes. For example, a teacher model's output might indicate that a cat image has a small probability of being a dog but almost zero probability of being a car — this relative information helps the student learn better. Distillation has become crucial for deploying LLMs efficiently: models like Gemma, Phi, and many specialized models are distilled from larger models. The technique enables running capable AI on mobile devices, edge hardware, and in latency-sensitive applications.
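The soft-label idea above can be sketched as a simple loss function. This is a minimal illustration, not any particular library's API: following Hinton et al.'s formulation, the student is trained on a weighted sum of a KL-divergence term against the teacher's temperature-softened distribution and an ordinary cross-entropy term against the hard label. The class names, logit values, and hyperparameters (temperature, alpha) are illustrative assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # relative beliefs about classes ("dark knowledge").
    z = logits / temperature
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    # Soft-target term: KL divergence between the softened teacher and
    # student distributions, scaled by T^2 as in Hinton et al. (2015).
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_loss = np.sum(p_teacher * np.log(p_teacher / p_student))
    # Hard-target term: standard cross-entropy against the true label.
    hard_loss = -np.log(softmax(student_logits)[true_label])
    return alpha * (temperature ** 2) * soft_loss + (1 - alpha) * hard_loss

# Illustrative logits over classes [cat, dog, car]: the teacher is
# confident the image is a cat, sees a dog as faintly plausible, and
# a car as nearly impossible -- exactly the relative information the
# student cannot get from the hard label alone.
teacher = np.array([9.0, 4.0, -3.0])
student = np.array([6.0, 1.0, 0.5])
loss = distillation_loss(student, teacher, true_label=0)
print(loss)
```

In practice the same structure appears in training loops for distilled LLMs, with the KL term computed per token over the vocabulary rather than over a handful of image classes.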
Related Terms
Fine-Tuning
The process of taking a pre-trained model and further training it on a smaller, task-specific datase...
Inference
The process of using a trained AI model to generate predictions or outputs on new data, as opposed t...
Large Language Model
A neural network with billions of parameters trained on massive text datasets, capable of understand...
Model Serving
The infrastructure and systems for deploying trained AI models in production to handle real-time req...