
Inference

Definition

The process of using a trained AI model to generate predictions or outputs on new data, as opposed to the training process where the model learns from data.

Inference is when a model does its actual job: answering questions, generating images, classifying emails, or making predictions. While training happens once (or periodically), inference runs continuously in production. Inference optimization is critical because it directly affects user experience (latency), cost (compute expenses), and scalability (how many users can be served). Common techniques for faster inference include:

- Quantization — reducing the numerical precision of weights and activations
- Pruning — removing unnecessary weights
- Distillation — training a smaller model to mimic a larger one
- Batching — processing multiple requests together
- Speculative decoding — using a small model to draft tokens that a large model then verifies

For LLMs, inference is often the dominant ongoing cost, and the inference-to-training cost ratio is a key business metric.
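To make one of these techniques concrete, here is a minimal sketch of symmetric int8 quantization, the idea behind reducing numerical precision. The function names and example weight values are hypothetical, chosen for illustration; real systems quantize per-channel or per-group and handle activations as well.

```python
def quantize_int8(weights):
    """Symmetric quantization: map float weights to the int8 range [-127, 127]
    using a single scale factor derived from the largest magnitude."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

# Hypothetical weights: each int8 value occupies 1 byte instead of 4,
# at the cost of rounding error bounded by half a quantization step.
weights = [0.82, -1.27, 0.03, 0.51]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
```

Because memory bandwidth is usually the bottleneck in LLM inference, shrinking weights from 32 or 16 bits to 8 (or fewer) speeds up serving roughly in proportion, which is why quantization is often the first optimization applied.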
