Multi-Model Agent Architectures: Using Different LLMs for Different Reasoning Steps
Learn how to build agent systems that route each reasoning task to the right language model — fast, cheap models for classification and routing, and powerful models for generation and complex reasoning.
Why One Model Does Not Fit All Tasks
Running GPT-4o or Claude Opus for every agent step is like using a sports car to deliver groceries. Classification tasks (is this a billing question or a technical question?) need millisecond responses and cost fractions of a cent. Complex reasoning (analyze this contract and identify risky clauses) needs the most capable model available. Multi-model architectures match model capability to task complexity, cutting costs by 60-80% while maintaining output quality where it matters.
The Model Routing Pattern
The core idea is a router that examines each task and dispatches it to the appropriate model. The router itself should be fast and cheap — it is the one component that runs on every request.
flowchart LR
    T["Incoming task"] --> R{"Router<br/>(fast model)"}
    R -->|FAST| F["gpt-4o-mini<br/>classification, extraction"]
    R -->|BALANCED| B["gpt-4o<br/>summarization, generation"]
    R -->|POWERFUL| P["claude-opus-4<br/>complex reasoning"]
    F --> OUT["Result + cost metadata"]
    B --> OUT
    P --> OUT
    style R fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
from enum import Enum
from dataclasses import dataclass
from typing import Any

import litellm


class ModelTier(Enum):
    FAST = "fast"          # Classification, extraction, routing
    BALANCED = "balanced"  # Summarization, simple generation
    POWERFUL = "powerful"  # Complex reasoning, creative writing


@dataclass
class ModelConfig:
    tier: ModelTier
    model_id: str
    max_tokens: int
    cost_per_1k_input: float
    cost_per_1k_output: float


MODEL_REGISTRY = {
    ModelTier.FAST: ModelConfig(
        tier=ModelTier.FAST,
        model_id="gpt-4o-mini",
        max_tokens=1024,
        cost_per_1k_input=0.00015,
        cost_per_1k_output=0.0006,
    ),
    ModelTier.BALANCED: ModelConfig(
        tier=ModelTier.BALANCED,
        model_id="gpt-4o",
        max_tokens=4096,
        cost_per_1k_input=0.0025,
        cost_per_1k_output=0.01,
    ),
    ModelTier.POWERFUL: ModelConfig(
        tier=ModelTier.POWERFUL,
        model_id="claude-opus-4-20250514",
        max_tokens=8192,
        cost_per_1k_input=0.015,
        cost_per_1k_output=0.075,
    ),
}
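To get a feel for the spread between tiers, here is a quick sanity check of per-request cost using the example prices from the registry above (USD per 1K tokens) for a hypothetical 2,000-token-input, 500-token-output request. The request size is an illustrative assumption.

```python
# Per-request cost at each tier, using the article's example prices
# (USD per 1K tokens) for a 2,000-in / 500-out token request.
PRICES = {
    "fast":     (0.00015, 0.0006),   # gpt-4o-mini
    "balanced": (0.0025,  0.01),     # gpt-4o
    "powerful": (0.015,   0.075),    # claude-opus-4
}

def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    cin, cout = PRICES[tier]
    return (input_tokens / 1000) * cin + (output_tokens / 1000) * cout

for tier in PRICES:
    print(f"{tier}: ${estimate_cost(tier, 2000, 500):.4f}")
```

For this request size, the powerful tier costs more than 100x the fast tier — which is exactly why routing the easy tasks downward pays off.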
Building the Task Router
The router classifies incoming tasks and assigns them a model tier. This classification itself uses the fast model.
class TaskRouter:
    def __init__(self):
        self.fast_model = MODEL_REGISTRY[ModelTier.FAST].model_id

    async def classify_task(self, task_description: str) -> ModelTier:
        response = await litellm.acompletion(
            model=self.fast_model,
            messages=[
                {"role": "system", "content": """Classify this task into one tier:
- FAST: simple classification, yes/no questions, entity extraction, formatting
- BALANCED: summarization, translation, simple Q&A, data transformation
- POWERFUL: complex reasoning, multi-step analysis, creative writing, code generation
Respond with ONLY the tier name."""},
                {"role": "user", "content": task_description},
            ],
            max_tokens=10,
            temperature=0,
        )
        tier_name = response.choices[0].message.content.strip().upper()
        try:
            return ModelTier[tier_name]
        except KeyError:
            # Unexpected router output: fall back to the middle tier
            return ModelTier.BALANCED

    async def route_and_execute(
        self, task: str, system_prompt: str
    ) -> dict[str, Any]:
        tier = await self.classify_task(task)
        config = MODEL_REGISTRY[tier]
        response = await litellm.acompletion(
            model=config.model_id,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": task},
            ],
            max_tokens=config.max_tokens,
        )
        return {
            "result": response.choices[0].message.content,
            "model_used": config.model_id,
            "tier": tier.value,
            "estimated_cost": self._estimate_cost(response, config),
        }

    def _estimate_cost(self, response, config: ModelConfig) -> float:
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        return (
            (input_tokens / 1000) * config.cost_per_1k_input
            + (output_tokens / 1000) * config.cost_per_1k_output
        )
Multi-Model Agent Pipeline
In a multi-step agent pipeline, each step can use a different model. Here is a document analysis pipeline where steps are assigned different tiers.
from agents import Agent

# Step 1: Extract key entities (fast model)
extractor = Agent(
    name="Entity Extractor",
    model="gpt-4o-mini",
    instructions="Extract all named entities (people, companies, dates, amounts) from the text. Return as JSON.",
)

# Step 2: Classify document type (fast model)
classifier = Agent(
    name="Document Classifier",
    model="gpt-4o-mini",
    instructions="Classify this document as: contract, invoice, letter, report, or memo. Return only the type.",
)

# Step 3: Deep analysis (powerful model)
analyzer = Agent(
    name="Document Analyzer",
    model="claude-opus-4-20250514",
    instructions="""Perform deep analysis of this document:
- Identify key obligations and deadlines
- Flag potential risks or ambiguities
- Summarize the document's purpose and implications
Use the entity data and document type provided for context.""",
)
Orchestrating the Pipeline
from agents import Runner


async def analyze_document(document_text: str) -> dict:
    # Fast: Extract entities (~$0.001)
    entities_result = await Runner.run(
        extractor, f"Extract entities from: {document_text}"
    )

    # Fast: Classify document (~$0.0005)
    class_result = await Runner.run(
        classifier, f"Classify: {document_text[:500]}"
    )

    # Powerful: Deep analysis (~$0.05)
    analysis_prompt = f"""Document type: {class_result.final_output}
Entities found: {entities_result.final_output}
Full document: {document_text}"""
    analysis_result = await Runner.run(analyzer, analysis_prompt)

    return {
        "entities": entities_result.final_output,
        "document_type": class_result.final_output,
        "analysis": analysis_result.final_output,
        "total_estimated_cost": 0.05,  # vs ~$0.15 if all steps used the powerful model
    }
The fast steps cost almost nothing. The expensive model only runs for the one step that genuinely needs deep reasoning. Over thousands of documents, this architecture saves significant cost.
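The savings claim is easy to check against the per-step estimates above:

```python
# Back-of-envelope check using the pipeline's per-step cost estimates.
tiered = 0.001 + 0.0005 + 0.05   # extract + classify + analyze
all_powerful = 0.15              # estimate if every step ran on the top tier

savings = 1 - tiered / all_powerful
print(f"tiered: ${tiered:.4f} per document, saving {savings:.0%}")
```

That lands at roughly two-thirds savings per document, consistent with the 60-80% range quoted at the top of this article.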
Cost Tracking and Model Selection Feedback
Track actual costs and quality per tier to refine routing decisions over time.
import sqlite3
from datetime import datetime, timezone


class CostTracker:
    def __init__(self, db_path: str = "model_costs.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS model_usage (
                id INTEGER PRIMARY KEY,
                timestamp TEXT,
                task_type TEXT,
                tier TEXT,
                model_id TEXT,
                input_tokens INTEGER,
                output_tokens INTEGER,
                cost REAL,
                quality_score REAL
            )
        """)

    def log_usage(self, task_type: str, tier: str, model_id: str,
                  input_tokens: int, output_tokens: int, cost: float):
        self.db.execute(
            "INSERT INTO model_usage (timestamp, task_type, tier, model_id, "
            "input_tokens, output_tokens, cost) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), task_type, tier, model_id,
             input_tokens, output_tokens, cost),
        )
        self.db.commit()

    def get_cost_summary(self) -> dict:
        rows = self.db.execute(
            "SELECT tier, SUM(cost), COUNT(*) FROM model_usage GROUP BY tier"
        ).fetchall()
        return {row[0]: {"total_cost": row[1], "requests": row[2]} for row in rows}
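To see what the summary looks like in practice, here is the same GROUP BY query run standalone against an in-memory database, with made-up usage rows:

```python
import sqlite3

# Demonstrate the tier-level summary query against an in-memory database.
# The inserted rows are illustrative, not real usage data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE model_usage (tier TEXT, cost REAL)")
db.executemany(
    "INSERT INTO model_usage VALUES (?, ?)",
    [("fast", 0.0006), ("fast", 0.0004), ("powerful", 0.0675)],
)

rows = db.execute(
    "SELECT tier, SUM(cost), COUNT(*) FROM model_usage GROUP BY tier"
).fetchall()
summary = {tier: {"total_cost": total, "requests": n} for tier, total, n in rows}
print(summary)
```

A summary shaped like this makes the routing mix visible at a glance: if the powerful tier accounts for most requests rather than most cost-sensitive exceptions, the router is escalating too eagerly.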
FAQ
How do you handle cases where the router misclassifies a task?
Add a quality feedback loop. If the output from a FAST-tier model is flagged as low quality (by a user or automated check), automatically retry with a higher tier and log the misclassification. Over time, use these logs to fine-tune the router's classification prompt or train a small classifier model specifically for routing.
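The escalation loop can be sketched as follows. Both `run_model` and `quality_check` here are stand-ins for real model calls and evaluation logic — this is a minimal sketch of the retry pattern, not a production implementation:

```python
# Sketch of tier escalation on low-quality output.
TIERS = ["FAST", "BALANCED", "POWERFUL"]

def run_with_escalation(task, run_model, quality_check, start_tier="FAST"):
    """Retry the task at successively higher tiers until quality_check passes."""
    for tier in TIERS[TIERS.index(start_tier):]:
        output = run_model(tier, task)
        if quality_check(output):
            return tier, output
    return tier, output  # best effort: return the top-tier attempt

# Stubbed example: pretend only BALANCED and above produce acceptable output.
tier, out = run_with_escalation(
    "summarize this",
    run_model=lambda t, task: f"{t} answer",
    quality_check=lambda o: o.startswith("BALANCED") or o.startswith("POWERFUL"),
)
print(tier, out)
```

Logging each escalation (which tier the task started at versus where it finally succeeded) produces exactly the misclassification data the router needs for refinement.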
Should the router model itself be swappable?
Yes. The router should be the fastest and cheapest model available. As new small models are released (like GPT-4o-mini successors), swap the router model without changing the rest of the architecture. The router's accuracy requirements are modest — it just needs to distinguish simple from complex tasks.
How do you handle cross-model context passing?
Each model in the pipeline receives only the information it needs, not the full conversation history. The orchestrator extracts relevant outputs from upstream steps and formats them as context for downstream steps. This reduces token usage and prevents context window overflow when using models with smaller limits.
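As a minimal sketch of that selective context passing — the field names and helper are hypothetical, not from the pipeline above:

```python
# Cross-step context passing: the downstream prompt is built from selected
# upstream outputs, never the full conversation history.
def build_downstream_prompt(upstream: dict[str, str], document: str,
                            needed_keys: tuple[str, ...]) -> str:
    """Include only the upstream fields a step actually needs."""
    context_lines = [f"{key}: {upstream[key]}" for key in needed_keys]
    return "\n".join(context_lines) + f"\nFull document: {document}"

upstream = {
    "document_type": "contract",
    "entities": '{"parties": ["Acme", "Globex"]}',
    "debug_trace": "verbose tool logs",  # never forwarded downstream
}
prompt = build_downstream_prompt(
    upstream, "Agreement between Acme and Globex...",
    ("document_type", "entities"),
)
print(prompt)
```

The explicit `needed_keys` allow-list is the design choice that matters: downstream steps cannot accidentally inherit verbose upstream artifacts that inflate token usage.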