Throughput

Definition

The number of inference requests or tokens an AI system can process per unit of time, measuring the system's capacity and efficiency.

Throughput measures how much work an AI system can handle, typically expressed as tokens per second, requests per second, or images per minute. High throughput is essential for serving many users simultaneously and for keeping costs low. Throughput and latency often involve trade-offs: batching more requests together increases throughput but may increase latency per request. For LLMs, throughput depends on model size, hardware, batch size, sequence length, and optimization techniques like continuous batching and PagedAttention (used in vLLM). A single H100 GPU might generate 1,000-3,000 tokens per second for a 7B model but only 100-300 tokens per second for a 70B model. Throughput optimization is a major area of systems research, as it directly impacts the economics of running AI services at scale.
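The throughput/latency trade-off from batching can be sketched with a toy cost model. This is an illustrative back-of-the-envelope calculation, not a benchmark: the per-step timing coefficients below are made-up, and real decode-step costs depend on hardware, model, and sequence length.

```python
def serving_stats(batch_size, tokens_per_request=128,
                  base_step_s=0.015, per_seq_s=0.0005):
    """Estimate aggregate throughput and per-request latency for one
    batched decode loop. Assumes each decode step emits one token per
    sequence and costs a fixed amount plus a small per-sequence amount
    (hypothetical coefficients for illustration only)."""
    step_s = base_step_s + per_seq_s * batch_size   # time for one decode step
    total_s = tokens_per_request * step_s           # time to finish all requests
    throughput = batch_size * tokens_per_request / total_s  # tokens/sec overall
    latency = total_s                               # seconds until a request completes
    return throughput, latency

for batch in (1, 8, 32):
    tput, lat = serving_stats(batch)
    print(f"batch={batch:2d}  throughput={tput:7.1f} tok/s  latency={lat:.2f} s")
```

Under these assumed costs, growing the batch from 1 to 32 raises aggregate throughput by roughly 16x while roughly doubling the latency each individual request experiences, which is the trade-off described above.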
