
Constrained Decoding: Forcing LLM Outputs to Match Specific Grammars and Formats

Explore constrained decoding techniques that guarantee LLM outputs conform to formal grammars, regex patterns, or JSON schemas — eliminating format errors in agentic pipelines.

The Format Reliability Problem

Every agent developer has experienced it: you carefully instruct the LLM to return valid JSON, and 95% of the time it works. But 5% of the time the model adds a trailing comma, wraps the JSON in markdown fences, or injects an explanation before the opening brace. That 5% failure rate crashes your downstream parser and breaks the entire agent pipeline.

Constrained decoding solves this by modifying the token selection process itself so that only tokens consistent with a target grammar can be chosen. The model literally cannot produce invalid output.

How Constrained Decoding Works

During standard autoregressive generation, the model samples from a probability distribution over its entire vocabulary. Constrained decoding introduces a mask at each generation step that zeros out the probability of any token that would violate the target grammar. Only tokens that keep the output on a valid path through the grammar remain eligible for selection.
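The masking step itself is simple. Here is a toy sketch in plain Python, using a tiny string-keyed "vocabulary" and greedy selection; real implementations apply the same mask to a full logit tensor:

```python
import math

def masked_sample(logits: dict[str, float], allowed: set[str]) -> str:
    """Greedy selection after masking: disallowed tokens get zero probability."""
    # Set the logit of every token outside the grammar's valid set to -inf.
    masked = {tok: (logit if tok in allowed else -math.inf)
              for tok, logit in logits.items()}
    # Softmax over the masked logits: only valid tokens receive any mass.
    z = sum(math.exp(l) for l in masked.values() if l > -math.inf)
    probs = {tok: (math.exp(l) / z if l > -math.inf else 0.0)
             for tok, l in masked.items()}
    # Greedy pick: the highest-probability *valid* token.
    return max(probs, key=probs.get)

# The model "wants" to open with prose, but the grammar only allows "{" or whitespace.
logits = {"Sure": 3.1, "{": 1.4, "Here": 2.7, " ": 0.2}
print(masked_sample(logits, allowed={"{", " "}))  # -> "{"
```

Even though "Sure" has the highest raw logit, it is masked out, so the generation is forced onto a grammar-valid path.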


This masking is implemented with a finite-state machine (for regular constraints such as regexes) or a pushdown automaton (for nested grammars such as JSON) that tracks the current position in the grammar and determines which tokens are valid continuations.
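The pushdown case is why JSON needs more than a regex: brace nesting is not a regular language, so the automaton keeps a stack. A minimal sketch, filtering a toy five-token vocabulary by whether each token keeps the brackets balanced:

```python
def valid_next_tokens(prefix: str, vocab: list[str]) -> list[str]:
    """Toy pushdown check: a token is valid if it keeps braces/brackets balanced."""
    def balanced_prefix(s: str) -> bool:
        stack = []
        pairs = {"}": "{", "]": "["}
        for ch in s:
            if ch in "{[":
                stack.append(ch)          # push every opener
            elif ch in "}]":
                if not stack or stack.pop() != pairs[ch]:
                    return False          # closer with no matching opener
        return True                       # valid so far; openers may still be pending
    return [tok for tok in vocab if balanced_prefix(prefix + tok)]

vocab = ["{", "}", "[", "]", '"x"']
# Inside '{"a": [1' the innermost open bracket is "[", so "}" must be masked.
print(valid_next_tokens('{"a": [1', vocab))
```

A real implementation tracks the full grammar state, not just bracket balance, but the principle is the same: the stack remembers which closer the output is committed to.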


GBNF: Grammar-Based Format Specification

GBNF (GGML BNF) is a grammar format used by llama.cpp and compatible inference engines to define output constraints:

# GBNF grammar for a JSON object with specific fields
json_grammar = r"""
root   ::= "{" ws "\"action\"" ws ":" ws action "," ws "\"params\"" ws ":" ws params "}"
action ::= "\"search\"" | "\"calculate\"" | "\"respond\""
params ::= "{" ws (param ("," ws param)*)? ws "}"
param  ::= string ws ":" ws value
string ::= "\"" [a-zA-Z_]+ "\""
value  ::= string | number | "true" | "false" | "null"
number ::= "-"? [0-9]+ ("." [0-9]+)?
ws     ::= [ \t\n]*
"""

When this grammar is applied during generation, the model is physically prevented from producing output that does not match the root rule. Every generated token must be a valid continuation within the grammar.
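One way to see exactly what the grammar admits is to hand-translate it into an equivalent regular expression (possible here because `params` does not nest) and check candidate strings against it. This is only a validation sketch for intuition; llama.cpp enforces the grammar incrementally, token by token, during generation:

```python
import re

# Hand-translated regex equivalents of the GBNF rules above (illustrative only).
WS     = r"[ \t\n]*"
STRING = r'"[a-zA-Z_]+"'
NUMBER = r"-?[0-9]+(?:\.[0-9]+)?"
VALUE  = rf"(?:{STRING}|{NUMBER}|true|false|null)"
ACTION = r'"(?:search|calculate|respond)"'
PARAM  = rf"{STRING}{WS}:{WS}{VALUE}"
PARAMS = rf"\{{{WS}(?:{PARAM}(?:,{WS}{PARAM})*)?{WS}\}}"
ROOT   = rf'\{{{WS}"action"{WS}:{WS}{ACTION},{WS}"params"{WS}:{WS}{PARAMS}\}}'

def matches_grammar(text: str) -> bool:
    return re.fullmatch(ROOT, text) is not None

print(matches_grammar('{"action": "search", "params": {"query": "weather"}}'))  # True
print(matches_grammar('{"action": "delete", "params": {}}'))  # False: unlisted action
```

Note that a trailing comma inside `params` also fails, because the grammar's repetition rule never permits one; that is precisely the class of error constrained decoding rules out.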

The Outlines Library

Outlines is a Python library that brings constrained generation to any HuggingFace-compatible model. It supports regex patterns, JSON schemas, and custom grammars:

import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Regex-constrained generation: force a valid email
email_generator = outlines.generate.regex(
    model,
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
)
result = email_generator("Extract the email from: Contact us at ")
print(result)  # guaranteed to be a valid email format

# JSON schema-constrained generation
from pydantic import BaseModel

class ToolCall(BaseModel):
    action: str
    query: str
    confidence: float

json_generator = outlines.generate.json(model, ToolCall)
tool_call = json_generator("Decide what tool to use for: What is 42 * 17?")
print(tool_call)  # always a valid ToolCall instance

Regex-Guided Generation

For simpler format constraints, regex-guided generation offers a lightweight alternative. The regex is compiled into a finite-state automaton, and at each generation step the automaton determines which tokens are valid continuations of the current match:

import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")

# Force output to be a valid ISO date
date_gen = outlines.generate.regex(model, r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

# Force output to be one of specific choices
choice_gen = outlines.generate.choice(model, ["approve", "reject", "escalate"])
decision = choice_gen("Should this refund request be approved? Customer spent $500 last month.")
print(decision)  # guaranteed to be one of the three options
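The regex-to-automaton compilation is easiest to see for the ISO-date pattern above: the automaton's state is simply the position in the output, and at each position only a handful of characters keep the match alive. A character-level sketch (Outlines operates on the tokenizer's multi-character vocabulary, but the idea is identical):

```python
# Table-driven DFA for [0-9]{4}-[0-9]{2}-[0-9]{2}:
# the state is the number of characters emitted so far (0..10),
# and dashes are forced at positions 4 and 7.
DIGITS = set("0123456789")

def allowed_chars(state: int) -> set[str]:
    if state in (4, 7):
        return {"-"}      # the pattern forces a dash here
    if state < 10:
        return DIGITS     # every other position must be a digit
    return set()          # state 10 is accepting; nothing more is allowed

prefix = "2024-0"
print(allowed_chars(len(prefix)))  # only digits are valid after "2024-0"
```

Because the allowed set is computable from the state alone, the mask for each step can be precomputed once per regex, which is why this approach adds almost no latency.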

Impact on Agent Architecture

Constrained decoding changes how you design agent pipelines. Instead of parsing LLM output and handling format errors with retries, you get guaranteed-valid structured output on every call. This eliminates an entire category of error-handling code and makes agents both more reliable and faster, since retry loops are no longer needed.
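For contrast, this is the kind of defensive parsing loop that disappears. `call_llm` is a hypothetical stand-in for any chat-completion call, simulated here with canned replies:

```python
import json

def parse_with_retries(call_llm, prompt: str, max_retries: int = 3) -> dict:
    """The defensive pattern constrained decoding makes unnecessary:
    strip markdown fences, attempt to parse, beg the model and retry on failure."""
    for _ in range(max_retries):
        raw = call_llm(prompt)
        cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
        try:
            return json.loads(cleaned)
        except json.JSONDecodeError:
            prompt = prompt + "\nReturn ONLY valid JSON."  # beg and retry
    raise ValueError("model never produced valid JSON")

# Simulated flaky model: first reply is fenced and has a trailing comma.
replies = iter(['```json\n{"action": "search",}\n```', '{"action": "search"}'])
print(parse_with_retries(lambda p: next(replies), "Pick a tool."))
```

Every branch of this function exists only to compensate for format drift; with grammar-constrained generation the whole thing collapses to a single call.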


The tradeoff is that constrained decoding requires access to the model's logits during generation. This works with local models and some API providers but is not available through all inference endpoints. OpenAI's structured output mode and Anthropic's tool use provide similar guarantees through different mechanisms.

FAQ

Does constrained decoding reduce output quality?

Constraining the format does not meaningfully reduce content quality: the model still selects the highest-probability valid token at each step. For structured tasks, constrained decoding often improves accuracy, because the model no longer has to spend capacity on format compliance.

Can I use constrained decoding with OpenAI's API?

Not directly — you do not have access to logits during generation. However, OpenAI's response_format: { type: "json_schema" } parameter provides a similar guarantee through their own constrained decoding implementation on the server side.

What happens when the grammar is too restrictive?

If the grammar leaves very few valid tokens at a given step, the model may be forced to choose low-probability tokens, reducing coherence. Design grammars that constrain format without over-constraining content — for example, require JSON structure but allow free-form string values.


#ConstrainedDecoding #StructuredOutput #GBNF #Outlines #AgenticAI #LearnAI #AIEngineering
