Latency
Definition
The time delay between sending a request to an AI model and receiving the first response, typically measured in milliseconds.
Latency is a critical performance metric for production AI systems. For LLMs, two latency measures matter: time-to-first-token (TTFT, how long before the first token appears) and inter-token latency (the delay between subsequent tokens during streaming). Acceptable latency varies by application — real-time voice assistants need sub-200ms TTFT, while batch document processing can tolerate seconds.

Factors affecting latency include model size, hardware (GPU type), batch size, input length, network distance, and optimization techniques. Reducing latency often involves trade-offs with model quality or cost. Common techniques include model distillation, quantization, geographic distribution (placing models closer to users), KV-cache optimization, and speculative decoding. Companies like Groq and Cerebras compete to offer the lowest-latency inference.
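The two measures above can be captured with a stopwatch around a token stream. A minimal sketch, assuming only that the streaming client yields tokens as they arrive (the `fake_stream` generator here is a stand-in for a real model API):

```python
import time

def measure_streaming_latency(token_stream):
    """Return (TTFT, mean inter-token latency) in seconds for any
    iterable that yields tokens as they arrive."""
    start = time.perf_counter()
    ttft = None
    gaps = []
    prev = None
    for _ in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start        # delay before the first token
        else:
            gaps.append(now - prev)   # delay between subsequent tokens
        prev = now
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl

# Hypothetical stand-in for a streaming model API: a slow first token
# (prefill), then a steady decode pace.
def fake_stream(n=5, first_delay=0.05, gap=0.01):
    time.sleep(first_delay)
    yield "tok0"
    for i in range(1, n):
        time.sleep(gap)
        yield f"tok{i}"

ttft, itl = measure_streaming_latency(fake_stream())
print(f"TTFT: {ttft*1000:.1f} ms, inter-token: {itl*1000:.1f} ms")
```

Measuring the two separately matters because they trade off differently: TTFT is dominated by prompt length and queueing, while inter-token latency reflects per-step decode speed.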
Related Terms
Inference
The process of using a trained AI model to generate predictions or outputs on new data, as opposed t...
Throughput
The number of inference requests or tokens an AI system can process per unit of time, measuring the ...
Model Serving
The infrastructure and systems for deploying trained AI models in production to handle real-time req...
Edge AI
Running AI models directly on local devices (phones, IoT sensors, vehicles) rather than in the cloud...