Throughput
Definition
The number of inference requests or tokens an AI system can process per unit of time, measuring the system's capacity and efficiency.
Throughput measures how much work an AI system can handle, typically expressed as tokens per second, requests per second, or images per minute. High throughput is essential for serving many users simultaneously and for keeping costs low. Throughput and latency often involve trade-offs — batching more requests together increases throughput but may increase latency per request. For LLMs, throughput depends on model size, hardware, batch size, sequence length, and optimization techniques such as continuous batching and PagedAttention (used in vLLM). A single H100 GPU might generate 1,000-3,000 tokens per second for a 7B model but only 100-300 tokens per second for a 70B model. Throughput optimization is a major area of systems research, as it directly impacts the economics of running AI services at scale.
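The tokens-per-second metric described above can be measured with a simple timing harness. Below is a minimal sketch in Python; `generate_fn`, the mock generator, and the token counts are illustrative assumptions, not a real model API — in practice you would call an inference server (e.g. a vLLM endpoint) in place of the mock.

```python
import time


def measure_throughput(generate_fn, prompts, batch_size):
    """Run prompts through generate_fn in batches; return tokens per second.

    generate_fn takes a list of prompts and returns a list of token
    sequences (one per prompt). Larger batch_size typically raises
    throughput at the cost of higher per-request latency.
    """
    start = time.perf_counter()
    total_tokens = 0
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        outputs = generate_fn(batch)
        total_tokens += sum(len(tokens) for tokens in outputs)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def mock_generate(batch):
    """Stand-in for a model: pretend each prompt yields 16 tokens."""
    return [[0] * 16 for _ in batch]


if __name__ == "__main__":
    tps = measure_throughput(mock_generate, ["prompt"] * 8, batch_size=4)
    print(f"{tps:.0f} tokens/sec")
```

With a mock generator the absolute number is meaningless; against a real endpoint, sweeping `batch_size` makes the throughput/latency trade-off visible directly.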
Related Terms
GPU
Graphics Processing Unit — a specialized processor originally designed for rendering graphics but no...
Inference
The process of using a trained AI model to generate predictions or outputs on new data, as opposed t...
Latency
The time delay between sending a request to an AI model and receiving the first response, typically ...
Model Serving
The infrastructure and systems for deploying trained AI models in production to handle real-time req...