Model Serving
Definition
The infrastructure and systems for deploying trained AI models in production to handle real-time requests at scale, including load balancing, autoscaling, and monitoring.
Model serving is the bridge between training a model and making it available to users. A serving system must handle concurrent requests, manage GPU memory efficiently, scale up and down with demand, and maintain low latency.

Popular serving frameworks include vLLM, TGI (Text Generation Inference, by Hugging Face), Triton Inference Server (NVIDIA), TensorRT-LLM, and cloud-managed services from AWS, Google, and Azure.

Key techniques include continuous batching (dynamically regrouping requests at every decode step rather than waiting for a fixed batch to finish), KV-cache management, model parallelism across multiple GPUs, and auto-scaling based on traffic patterns. For LLMs, serving is especially challenging because generation is autoregressive (each token depends on all previous tokens), making it difficult to parallelize a single request.
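The scheduling idea behind continuous batching can be shown with a toy simulation. This is a minimal sketch, not any real framework's scheduler: the `Request`, `ContinuousBatcher`, and placeholder `tokN` strings are all hypothetical, and the per-step token emission stands in for one batched forward pass over the running requests.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

class ContinuousBatcher:
    """Toy illustration of continuous (iteration-level) batching:
    finished requests leave the batch and waiting ones join at every
    decode step, instead of the whole batch draining before new work
    is admitted."""

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting: deque = deque()
        self.running: list = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> list:
        # Admit waiting requests into any free batch slots.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # One decode iteration: each running request emits one token
        # (a stand-in for a single batched forward pass that reuses
        # each request's KV cache).
        finished = []
        for req in self.running:
            req.generated.append(f"tok{len(req.generated)}")
            if len(req.generated) >= req.max_new_tokens:
                finished.append(req)
        # Evict finished requests immediately, freeing slots mid-flight.
        self.running = [r for r in self.running if r not in finished]
        return finished
```

A short request that finishes early frees its slot for the next waiting request on the very next step, which is what lifts GPU utilization compared with static batching, where every slot stays occupied until the longest request in the batch completes.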
Related Terms
Inference
The process of using a trained AI model to generate predictions or outputs on new data, as opposed t...
Latency
The time delay between sending a request to an AI model and receiving the first response, typically ...
Throughput
The number of inference requests or tokens an AI system can process per unit of time, measuring the ...
MLOps
The set of practices, tools, and principles for managing the full lifecycle of machine learning syst...