Inference
Definition
The process of using a trained AI model to generate predictions or outputs on new data, as opposed to the training process where the model learns from data.
Inference is when a model does its actual job: answering questions, generating images, classifying emails, or making predictions. While training happens once (or periodically), inference runs continuously in production. Optimizing inference is critical because it directly affects user experience (latency), cost (compute expenses), and scalability (how many users can be served concurrently). Techniques for faster inference include quantization (reducing the numerical precision of weights and activations), pruning (removing unnecessary weights), distillation (training a smaller model to mimic a larger one), batching (processing multiple requests together), and speculative decoding (using a small model to draft tokens that the large model then verifies). For LLMs, inference is often the dominant ongoing cost, and the inference-to-training cost ratio is a key business metric.
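As a concrete illustration of one of these techniques, the sketch below shows symmetric int8 quantization in plain Python: float weights are mapped to integers in [-127, 127] using a single scale factor, shrinking storage roughly 4x versus float32 at the cost of a small rounding error. The function names and the simple per-tensor scheme are illustrative assumptions for this glossary, not the API of any particular library (real systems typically quantize per-channel or per-group).

```python
def quantize_int8(weights):
    """Map float weights to int8 range [-127, 127] with one scale factor.

    Illustrative per-tensor symmetric quantization, not a library API.
    """
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.12, -0.48, 0.95, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value differs from the original by at most half a
# quantization step (scale / 2), which is the rounding error introduced.
```

At inference time only the int8 values and the scale need to be stored and moved through memory, which is where most of the latency and cost savings come from on memory-bound workloads.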
Related Terms
Latency
The time delay between sending a request to an AI model and receiving the first response, typically ...
Throughput
The number of inference requests or tokens an AI system can process per unit of time, measuring the ...
Model Serving
The infrastructure and systems for deploying trained AI models in production to handle real-time req...
Inference Cost
The computational expense of running a trained AI model to generate predictions or outputs, typicall...