Learn Agentic AI

vLLM for High-Throughput LLM Serving: Running Open-Source Models in Production

Set up vLLM for production-grade LLM inference with PagedAttention, continuous batching, and OpenAI-compatible APIs. Learn performance tuning for serving open-source models at scale.

The Problem with Naive LLM Serving

When you load a model with Hugging Face Transformers and call model.generate(), each request is processed one at a time. The prefill phase (processing the prompt in parallel) keeps the GPU busy, but the decode phase (generating one token per forward pass) leaves most of the GPU's compute idle for a single sequence. With multiple concurrent users, requests queue up and latency becomes unacceptable.

vLLM solves this with two key innovations: PagedAttention for memory-efficient KV-cache management, and continuous batching that dynamically groups requests to maximize GPU utilization. The result is 2-24x higher throughput compared to naive serving, depending on the workload.

How vLLM Processes a Request

The flow below traces a request through vLLM's scheduler: continuous batching groups incoming requests, prefill processes the prompt in parallel, and decode generates tokens one at a time against the paged KV cache.

flowchart LR
    REQ(["Request"])
    BATCH["Continuous batching<br/>vLLM scheduler"]
    PREF{"Prefill or<br/>decode?"}
    PRE["Prefill phase<br/>parallel attention"]
    DEC["Decode phase<br/>token by token"]
    KV[("Paged KV cache")]
    SAMP["Sampling<br/>top-p, temp"]
    STREAM["Stream tokens<br/>to client"]
    REQ --> BATCH --> PREF
    PREF -->|First token| PRE --> KV
    PREF -->|Next token| DEC
    KV --> DEC --> SAMP --> STREAM
    SAMP -->|EOS| DONE(["Response complete"])
    style BATCH fill:#4f46e5,stroke:#4338ca,color:#fff
    style KV fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style STREAM fill:#0ea5e9,stroke:#0369a1,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff

Installing vLLM

vLLM requires a CUDA-capable GPU. Install it with pip:

pip install vllm

For a specific CUDA version:

pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

Verify GPU detection:

from vllm import LLM

# Initializing the engine logs the detected GPUs and loads the model weights
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

Launching the OpenAI-Compatible Server

The fastest path to production is vLLM's built-in API server, which exposes OpenAI-compatible endpoints:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

This gives you /v1/chat/completions, /v1/completions, and /v1/models endpoints that any OpenAI-compatible client can consume immediately.
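For instance, assuming the server above is running locally on port 8000, any HTTP client can hit the chat endpoint directly:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```

The response follows the same JSON schema as the OpenAI API, so existing parsing code works unchanged.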

How PagedAttention Works

Traditional LLM serving pre-allocates contiguous memory blocks for the KV-cache of each request, based on the maximum possible sequence length. This wastes enormous amounts of GPU memory — a request that generates only 50 tokens still reserves memory for 4096 tokens.

PagedAttention borrows the concept of virtual memory paging from operating systems. The KV-cache is divided into fixed-size blocks (pages) that are allocated on demand as tokens are generated. This reduces memory waste from 60-80% to under 4%, enabling far more concurrent requests.
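The memory math is easy to see in a toy sketch (this illustrates the bookkeeping idea only, not vLLM's actual implementation):

```python
def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Blocks allocated on demand: ceil(num_tokens / block_size)."""
    return -(-num_tokens // block_size)

# Contiguous pre-allocation reserves max_model_len tokens for every request.
max_model_len = 4096
generated = 50

paged_tokens = blocks_needed(generated) * 16  # KV memory actually held: 64 tokens
contiguous_tokens = max_model_len             # KV memory reserved up front: 4096 tokens

# Waste under paging is capped at less than one block per sequence
print(paged_tokens, contiguous_tokens)
```

With on-demand blocks, the 50-token request above holds 64 tokens' worth of cache instead of 4096, which is where the concurrency headroom comes from.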

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.90,  # Use 90% of GPU memory
    max_model_len=8192,
    block_size=16,  # KV-cache block size (default: 16)
)

# Process a batch of prompts simultaneously
prompts = [
    "Explain quantum computing to a 10-year-old.",
    "Write a Python function for binary search.",
    "What caused the 2008 financial crisis?",
    "Summarize the theory of relativity.",
]

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(prompts, params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text[:100]}...")
    print("---")

Continuous Batching for Agent Workloads

Agent systems generate bursty, variable-length requests. One agent call might produce 20 tokens (a tool call), while another generates 500 tokens (a detailed explanation). Continuous batching handles this gracefully by adding new requests to the batch as soon as existing requests finish, rather than waiting for the entire batch to complete.
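A toy simulation makes the difference concrete (illustrative arithmetic only, not vLLM's scheduler):

```python
def static_batch_steps(lengths: list[int], batch_size: int) -> int:
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths: list[int], batch_size: int) -> int:
    """Continuous batching: a finished slot is refilled immediately,
    so total decode steps approach ceil(total tokens / batch size)."""
    return -(-sum(lengths) // batch_size)

# Bursty agent workload: short tool calls mixed with long explanations
lengths = [20, 500, 20, 500]
print(static_batch_steps(lengths, 2))      # 1000 decode steps
print(continuous_batch_steps(lengths, 2))  # 520 decode steps
```

In the static case, the 20-token tool calls sit in the batch waiting for the 500-token responses to finish; continuous batching slots new work in as soon as they complete.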

Configure batching parameters for agent workloads:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-num-batched-tokens 32768 \
    --max-num-seqs 256 \
    --enable-chunked-prefill

The --enable-chunked-prefill flag allows long prompts to be split across iterations, preventing a single large prompt from blocking the entire batch.
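The idea can be sketched in a few lines: with the per-iteration token budget (set via --max-num-batched-tokens) as the chunk size, a long prompt is consumed over several scheduler iterations instead of one (a simplified illustration, not vLLM's internal logic):

```python
def prefill_chunks(prompt_tokens: int, token_budget: int) -> list[int]:
    """Split a long prompt into chunks no larger than the per-iteration budget."""
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        take = min(remaining, token_budget)
        chunks.append(take)
        remaining -= take
    return chunks

# A 10k-token prompt with a 4096-token budget spans three iterations,
# leaving budget in between for other requests' decode steps
print(prefill_chunks(10_000, 4096))  # [4096, 4096, 1808]
```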


Connecting Agents to vLLM

Since vLLM exposes an OpenAI-compatible API, your agent code remains identical — just change the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

def agent_step(messages: list) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages,
        temperature=0.1,  # Lower temperature for agent reliability
        max_tokens=1024,
    )
    return response.choices[0].message.content

# Agent loop
messages = [{"role": "system", "content": "You are an analytical agent."}]
messages.append({"role": "user", "content": "Analyze recent trends in AI."})

result = agent_step(messages)
print(result)

Performance Tuning Checklist

Maximize throughput with these settings:

# Tensor parallelism across multiple GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 16384

Key tuning levers: raise gpu-memory-utilization to leave more memory for the KV cache and thus more concurrent requests, use tensor-parallel-size to shard large models across GPUs, and enable quantization (e.g. --quantization awq, pointing --model at an AWQ-quantized checkpoint) to reduce the memory footprint without significant quality loss.
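A back-of-the-envelope calculation shows why quantization matters for large models (weights only; real usage adds KV cache and activation overhead on top):

```python
def weight_memory_gb(num_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: parameters * bits / 8 (overhead ignored)."""
    return num_params_billion * bits_per_weight / 8

print(weight_memory_gb(70, 16))  # 140.0 GB in fp16 -> multi-GPU territory
print(weight_memory_gb(70, 4))   # 35.0 GB at 4-bit -> far less VRAM needed
```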

FAQ

How does vLLM compare to Ollama for production use?

Ollama is designed for single-user local inference with a focus on ease of use. vLLM is built for multi-user production serving with high concurrency. If you need to serve 50+ concurrent agent requests, vLLM is the right choice. For local development with one or two concurrent requests, Ollama is simpler.

Can vLLM serve multiple models simultaneously?

A single vLLM server instance serves one model. To serve multiple models, run multiple vLLM instances on different ports or GPUs, then use a router or load balancer to direct requests to the appropriate instance.

What GPU do I need for vLLM?

vLLM requires an NVIDIA GPU with CUDA support. For 7-8B parameter models, a single GPU with 16+ GB VRAM (RTX 4090, A10G, or L4) works well. For 70B models, you need multiple GPUs totaling 80+ GB VRAM or use quantized variants.


#VLLM #LLMServing #ProductionAI #PagedAttention #OpenSource #AgenticAI #LearnAI #AIEngineering

