
Model Serving

Definition

The infrastructure and systems for deploying trained AI models in production to handle real-time requests at scale, including load balancing, scaling, and monitoring.

Model serving is the bridge between training a model and making it available to users. A serving system must handle concurrent requests, manage GPU memory efficiently, scale up and down with demand, and maintain low latency.

Popular serving frameworks include vLLM, TGI (Text Generation Inference, by Hugging Face), Triton Inference Server (NVIDIA), TensorRT-LLM, and cloud-managed services from AWS, Google, and Azure. Key techniques include continuous batching (dynamically grouping requests), KV-cache management, model parallelism across multiple GPUs, and auto-scaling based on traffic patterns.

For LLMs, serving is especially challenging because generation is autoregressive (each token depends on all previous tokens), making it difficult to parallelize a single request: throughput must come from batching many requests together rather than from speeding up any one of them.
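The scheduling idea behind continuous batching can be illustrated with a toy simulator. This is a hedged sketch, not the API of vLLM, TGI, or any real framework: the `Request` class, `continuous_batching` function, and the one-token-per-step decode loop are all invented here for illustration. The key behavior it demonstrates is that a finished request frees its batch slot immediately, so a waiting request can join mid-flight instead of waiting for the whole batch to drain (as static batching would require).

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    """A generation request (hypothetical): tokens still to generate."""
    id: int
    tokens_left: int
    output: list = field(default_factory=list)

def continuous_batching(waiting, max_batch=4):
    """Toy scheduler: each decode step, refill the batch from the
    waiting queue, emit one token per active request, and retire
    finished requests so their slots are reused immediately."""
    active, finished, step = [], [], 0
    while waiting or active:
        # Admit new requests into any free slots (the "continuous" part).
        while waiting and len(active) < max_batch:
            active.append(waiting.pop(0))
        # One autoregressive decode step: every active request gets one token.
        for r in active:
            r.output.append(f"tok{step}")
            r.tokens_left -= 1
        # Retire completed requests; their slots open up next step.
        finished += [r for r in active if r.tokens_left == 0]
        active = [r for r in active if r.tokens_left > 0]
        step += 1
    return finished, step

# Five requests of varying lengths, batch capacity of 2.
reqs = [Request(i, n) for i, n in enumerate([2, 5, 3, 1, 4])]
done, steps = continuous_batching(reqs, max_batch=2)
print(f"served {len(done)} requests in {steps} decode steps")
```

With static batching, the short request (1 token) would occupy its slot until the longest request in its batch finished; here it exits as soon as it completes, which is why continuous batching substantially improves GPU utilization under mixed-length traffic.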
