
Latency

Definition

The time delay between sending a request to an AI model and receiving the first response, typically measured in milliseconds.

Latency is a critical performance metric for production AI systems. For LLMs, two latency measures matter: time-to-first-token (TTFT, how long before the first word appears) and inter-token latency (the delay between subsequent tokens during streaming). Acceptable latency varies by application — real-time voice assistants need sub-200ms TTFT, while batch document processing can tolerate seconds. Factors affecting latency include model size, hardware (GPU type), batch size, input length, network distance, and optimization techniques.

Reducing latency often involves trade-offs with model quality or cost. Techniques include model distillation, quantization, geographic distribution (placing models closer to users), KV-cache optimization, and speculative decoding. Companies like Groq and Cerebras compete on offering the lowest-latency inference.
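The two measures above can be computed from any streaming response by timestamping each token as it arrives. A minimal sketch, using a hypothetical `fake_stream` generator as a stand-in for a real streaming LLM API:

```python
import time
from typing import Iterator, List, Tuple

def fake_stream(n_tokens: int = 5, delay: float = 0.01) -> Iterator[str]:
    # Hypothetical stand-in for a streaming model response:
    # yields one token at a time with a fixed delay.
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"token{i}"

def measure_latency(stream: Iterator[str]) -> Tuple[float, List[float]]:
    """Return (time-to-first-token, inter-token latencies), both in seconds."""
    start = time.perf_counter()
    ttft = 0.0
    gaps: List[float] = []  # delays between consecutive tokens
    last = start
    for i, _token in enumerate(stream):
        now = time.perf_counter()
        if i == 0:
            ttft = now - start       # time-to-first-token (TTFT)
        else:
            gaps.append(now - last)  # inter-token latency
        last = now
    return ttft, gaps

ttft, gaps = measure_latency(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms, "
      f"mean inter-token: {sum(gaps) / len(gaps) * 1000:.1f} ms")
```

In a real client the same timestamping loop wraps the provider's streaming iterator; TTFT dominates perceived responsiveness for short replies, while inter-token latency determines how fast long streamed output reads.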
