Learn Agentic AI

vLLM for High-Throughput LLM Serving: Running Open-Source Models in Production

Set up vLLM for production-grade LLM inference with PagedAttention, continuous batching, and OpenAI-compatible APIs. Learn performance tuning for serving open-source models at scale.

The Problem with Naive LLM Serving

When you load a model with Hugging Face Transformers and call model.generate(), each request is processed one at a time. The prefill phase (processing the prompt in parallel) keeps the GPU busy, but the decode phase (generating one token per forward pass) leaves most of the GPU's compute idle for a single sequence. With multiple concurrent users, requests queue up and latency becomes unacceptable.

vLLM solves this with two key innovations: PagedAttention for memory-efficient KV-cache management, and continuous batching that dynamically groups requests to maximize GPU utilization. The result is 2-24x higher throughput compared to naive serving, depending on the workload.

How vLLM Processes a Request

The flow below traces a request through vLLM's scheduler: continuous batching groups incoming requests, prefill processes the prompt in parallel, and decode generates tokens one at a time against the paged KV cache.

flowchart LR
    REQ(["Request"])
    BATCH["Continuous batching<br/>vLLM scheduler"]
    PREF{"Prefill or<br/>decode?"}
    PRE["Prefill phase<br/>parallel attention"]
    DEC["Decode phase<br/>token by token"]
    KV[("Paged KV cache")]
    SAMP["Sampling<br/>top-p, temp"]
    STREAM["Stream tokens<br/>to client"]
    REQ --> BATCH --> PREF
    PREF -->|First token| PRE --> KV
    PREF -->|Next token| DEC
    KV --> DEC --> SAMP --> STREAM
    SAMP -->|EOS| DONE(["Response complete"])
    style BATCH fill:#4f46e5,stroke:#4338ca,color:#fff
    style KV fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style STREAM fill:#0ea5e9,stroke:#0369a1,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff

Installing vLLM

vLLM requires a CUDA-capable GPU. Install it with pip:

pip install vllm

For a specific CUDA version:

pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

Verify GPU detection:

from vllm import LLM

# Initializing the engine logs the detected GPUs and loads the model weights
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

Launching the OpenAI-Compatible Server

The fastest path to production is vLLM's built-in API server, which exposes OpenAI-compatible endpoints:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

This gives you /v1/chat/completions, /v1/completions, and /v1/models endpoints that any OpenAI-compatible client can consume immediately.
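For instance, assuming the server above is running locally on port 8000, any HTTP client can hit the chat endpoint directly:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```

The response follows the same JSON schema as the OpenAI API, so existing parsing code works unchanged.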

How PagedAttention Works

Traditional LLM serving pre-allocates contiguous memory blocks for the KV-cache of each request, based on the maximum possible sequence length. This wastes enormous amounts of GPU memory — a request that generates only 50 tokens still reserves memory for 4096 tokens.

PagedAttention borrows the concept of virtual memory paging from operating systems. The KV-cache is divided into fixed-size blocks (pages) that are allocated on demand as tokens are generated. This reduces memory waste from 60-80% to under 4%, enabling far more concurrent requests.
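The memory math is easy to see in a toy sketch (this illustrates the bookkeeping idea only, not vLLM's actual implementation):

```python
def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Blocks allocated on demand: ceil(num_tokens / block_size)."""
    return -(-num_tokens // block_size)

# Contiguous pre-allocation reserves max_model_len tokens for every request.
max_model_len = 4096
generated = 50

paged_tokens = blocks_needed(generated) * 16  # KV memory actually held: 64 tokens
contiguous_tokens = max_model_len             # KV memory reserved up front: 4096 tokens

# Waste under paging is capped at less than one block per sequence
print(paged_tokens, contiguous_tokens)
```

With on-demand blocks, the 50-token request above holds 64 tokens' worth of cache instead of 4096, which is where the concurrency headroom comes from.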

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.90,  # Use 90% of GPU memory
    max_model_len=8192,
    block_size=16,  # KV-cache block size (default: 16)
)

# Process a batch of prompts simultaneously
prompts = [
    "Explain quantum computing to a 10-year-old.",
    "Write a Python function for binary search.",
    "What caused the 2008 financial crisis?",
    "Summarize the theory of relativity.",
]

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(prompts, params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text[:100]}...")
    print("---")

Continuous Batching for Agent Workloads

Agent systems generate bursty, variable-length requests. One agent call might produce 20 tokens (a tool call), while another generates 500 tokens (a detailed explanation). Continuous batching handles this gracefully by adding new requests to the batch as soon as existing requests finish, rather than waiting for the entire batch to complete.
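A toy simulation makes the difference concrete (illustrative arithmetic only, not vLLM's scheduler):

```python
def static_batch_steps(lengths: list[int], batch_size: int) -> int:
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths: list[int], batch_size: int) -> int:
    """Continuous batching: a finished slot is refilled immediately,
    so total decode steps approach ceil(total tokens / batch size)."""
    return -(-sum(lengths) // batch_size)

# Bursty agent workload: short tool calls mixed with long explanations
lengths = [20, 500, 20, 500]
print(static_batch_steps(lengths, 2))      # 1000 decode steps
print(continuous_batch_steps(lengths, 2))  # 520 decode steps
```

In the static case, the 20-token tool calls sit in the batch waiting for the 500-token responses to finish; continuous batching slots new work in as soon as they complete.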

Configure batching parameters for agent workloads:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-num-batched-tokens 32768 \
    --max-num-seqs 256 \
    --enable-chunked-prefill

The --enable-chunked-prefill flag allows long prompts to be split across iterations, preventing a single large prompt from blocking the entire batch.
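The idea can be sketched in a few lines: with the per-iteration token budget (set via --max-num-batched-tokens) as the chunk size, a long prompt is consumed over several scheduler iterations instead of one (a simplified illustration, not vLLM's internal logic):

```python
def prefill_chunks(prompt_tokens: int, token_budget: int) -> list[int]:
    """Split a long prompt into chunks no larger than the per-iteration budget."""
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        take = min(remaining, token_budget)
        chunks.append(take)
        remaining -= take
    return chunks

# A 10k-token prompt with a 4096-token budget spans three iterations,
# leaving budget in between for other requests' decode steps
print(prefill_chunks(10_000, 4096))  # [4096, 4096, 1808]
```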


Connecting Agents to vLLM

Since vLLM exposes an OpenAI-compatible API, your agent code remains identical — just change the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

def agent_step(messages: list) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages,
        temperature=0.1,  # Lower temperature for agent reliability
        max_tokens=1024,
    )
    return response.choices[0].message.content

# Agent loop
messages = [{"role": "system", "content": "You are an analytical agent."}]
messages.append({"role": "user", "content": "Analyze recent trends in AI."})

result = agent_step(messages)
print(result)

Performance Tuning Checklist

Maximize throughput with these settings:

# Tensor parallelism across multiple GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 16384

Key tuning levers: raise gpu-memory-utilization to leave more memory for the KV cache and thus more concurrent requests, use tensor-parallel-size to shard large models across GPUs, and enable quantization (e.g. --quantization awq, pointing --model at an AWQ-quantized checkpoint) to reduce the memory footprint without significant quality loss.
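A back-of-the-envelope calculation shows why quantization matters for large models (weights only; real usage adds KV cache and activation overhead on top):

```python
def weight_memory_gb(num_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: parameters * bits / 8 (overhead ignored)."""
    return num_params_billion * bits_per_weight / 8

print(weight_memory_gb(70, 16))  # 140.0 GB in fp16 -> multi-GPU territory
print(weight_memory_gb(70, 4))   # 35.0 GB at 4-bit -> far less VRAM needed
```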

FAQ

How does vLLM compare to Ollama for production use?

Ollama is designed for single-user local inference with a focus on ease of use. vLLM is built for multi-user production serving with high concurrency. If you need to serve 50+ concurrent agent requests, vLLM is the right choice. For local development with one or two concurrent requests, Ollama is simpler.

Can vLLM serve multiple models simultaneously?

A single vLLM server instance serves one model. To serve multiple models, run multiple vLLM instances on different ports or GPUs, then use a router or load balancer to direct requests to the appropriate instance.

What GPU do I need for vLLM?

vLLM requires an NVIDIA GPU with CUDA support. For 7-8B parameter models, a single GPU with 16+ GB VRAM (RTX 4090, A10G, or L4) works well. For 70B models, you need multiple GPUs totaling 80+ GB VRAM or use quantized variants.


#VLLM #LLMServing #ProductionAI #PagedAttention #OpenSource #AgenticAI #LearnAI #AIEngineering

