Inference Cost
Definition
The computational expense of running a trained AI model to generate predictions or outputs, typically the dominant ongoing cost of operating AI systems in production.
Inference cost is a critical business metric for any organization deploying AI. For LLM APIs, costs are measured per token (e.g., $3 per million input tokens, $15 per million output tokens for frontier models). For self-hosted models, costs include GPU hardware or cloud rental, electricity, networking, and engineering staff. Total inference cost depends on model size, hardware efficiency, request volume, and optimization techniques.

Cost reduction strategies include using smaller models for simpler tasks (model routing), caching frequent queries, prompt optimization to reduce token count, quantization, and distillation.

As AI adoption grows, inference costs increasingly dominate IT budgets — some companies spend millions of dollars per month on API calls alone. The race to reduce inference costs drives hardware innovation, model architecture research, and the competitive dynamics of the AI industry.
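The per-token pricing above can be turned into a simple cost estimate. This is a minimal sketch, assuming the example rates quoted here ($3 per million input tokens, $15 per million output tokens); the function name and defaults are illustrative, not any provider's actual API.

```python
def api_cost_usd(input_tokens: int, output_tokens: int,
                 input_rate_per_m: float = 3.0,
                 output_rate_per_m: float = 15.0) -> float:
    """Estimate the cost of one request given per-million-token rates."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000

# A request with 2,000 input tokens and 500 output tokens:
# 2000 * 3 / 1e6 + 500 * 15 / 1e6 = 0.006 + 0.0075 = $0.0135
cost = api_cost_usd(2_000, 500)

# Scaled to production volume: 10 million such requests per month
monthly = cost * 10_000_000  # $135,000/month
```

Note that output tokens are typically priced several times higher than input tokens, so strategies that shorten model responses often save more than trimming prompts of the same length.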
Related Terms
GPU
Graphics Processing Unit — a specialized processor originally designed for rendering graphics but now widely used for AI training and inference due to its massively parallel architecture.
Inference
The process of using a trained AI model to generate predictions or outputs on new data, as opposed to training, in which the model learns from data.
Model Serving
The infrastructure and systems for deploying trained AI models in production to handle real-time requests at scale.
Token
The basic unit of text that language models process, typically representing a word, subword, or character sequence.