Multi-Model Agent Architectures: Using Different LLMs for Different Reasoning Steps
Learn how to build agent systems that route each reasoning task to the right language model — fast, cheap models for classification and routing, and powerful models for generation and complex reasoning.
Why One Model Does Not Fit All Tasks
Running GPT-4o or Claude Opus for every agent step is like using a sports car to deliver groceries. Classification tasks (is this a billing question or a technical question?) need millisecond responses and cost fractions of a cent. Complex reasoning (analyze this contract and identify risky clauses) needs the most capable model available. Multi-model architectures match model capability to task complexity, cutting costs by 60-80% while maintaining output quality where it matters.
The Model Routing Pattern
The core idea is a router that examines each task and dispatches it to the appropriate model. The router itself should be fast and cheap — it is the one component that runs on every request.
flowchart LR
    T["Incoming task"] --> R{"Router<br/>(fast model)"}
    R -->|FAST| F["gpt-4o-mini<br/>classification, extraction"]
    R -->|BALANCED| B["gpt-4o<br/>summarization, generation"]
    R -->|POWERFUL| P["claude-opus-4<br/>complex reasoning"]
    F --> OUT["Result + cost metadata"]
    B --> OUT
    P --> OUT
    style R fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
from enum import Enum
from dataclasses import dataclass
from typing import Any

import litellm


class ModelTier(Enum):
    FAST = "fast"          # Classification, extraction, routing
    BALANCED = "balanced"  # Summarization, simple generation
    POWERFUL = "powerful"  # Complex reasoning, creative writing


@dataclass
class ModelConfig:
    tier: ModelTier
    model_id: str
    max_tokens: int
    cost_per_1k_input: float
    cost_per_1k_output: float


MODEL_REGISTRY = {
    ModelTier.FAST: ModelConfig(
        tier=ModelTier.FAST,
        model_id="gpt-4o-mini",
        max_tokens=1024,
        cost_per_1k_input=0.00015,
        cost_per_1k_output=0.0006,
    ),
    ModelTier.BALANCED: ModelConfig(
        tier=ModelTier.BALANCED,
        model_id="gpt-4o",
        max_tokens=4096,
        cost_per_1k_input=0.0025,
        cost_per_1k_output=0.01,
    ),
    ModelTier.POWERFUL: ModelConfig(
        tier=ModelTier.POWERFUL,
        model_id="claude-opus-4-20250514",
        max_tokens=8192,
        cost_per_1k_input=0.015,
        cost_per_1k_output=0.075,
    ),
}
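To get a feel for the spread between tiers, here is a quick sanity check of per-request cost using the example prices from the registry above (USD per 1K tokens) for a hypothetical 2,000-token-input, 500-token-output request. The request size is an illustrative assumption.

```python
# Per-request cost at each tier, using the article's example prices
# (USD per 1K tokens) for a 2,000-in / 500-out token request.
PRICES = {
    "fast":     (0.00015, 0.0006),   # gpt-4o-mini
    "balanced": (0.0025,  0.01),     # gpt-4o
    "powerful": (0.015,   0.075),    # claude-opus-4
}

def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    cin, cout = PRICES[tier]
    return (input_tokens / 1000) * cin + (output_tokens / 1000) * cout

for tier in PRICES:
    print(f"{tier}: ${estimate_cost(tier, 2000, 500):.4f}")
```

For this request size, the powerful tier costs more than 100x the fast tier — which is exactly why routing the easy tasks downward pays off.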
Building the Task Router
The router classifies incoming tasks and assigns them a model tier. This classification itself uses the fast model.
class TaskRouter:
    def __init__(self):
        self.fast_model = MODEL_REGISTRY[ModelTier.FAST].model_id

    async def classify_task(self, task_description: str) -> ModelTier:
        response = await litellm.acompletion(
            model=self.fast_model,
            messages=[
                {"role": "system", "content": """Classify this task into one tier:
- FAST: simple classification, yes/no questions, entity extraction, formatting
- BALANCED: summarization, translation, simple Q&A, data transformation
- POWERFUL: complex reasoning, multi-step analysis, creative writing, code generation
Respond with ONLY the tier name."""},
                {"role": "user", "content": task_description},
            ],
            max_tokens=10,
            temperature=0,
        )
        tier_name = response.choices[0].message.content.strip().upper()
        try:
            return ModelTier[tier_name]
        except KeyError:
            # Unexpected router output: fall back to the middle tier
            return ModelTier.BALANCED

    async def route_and_execute(
        self, task: str, system_prompt: str
    ) -> dict[str, Any]:
        tier = await self.classify_task(task)
        config = MODEL_REGISTRY[tier]
        response = await litellm.acompletion(
            model=config.model_id,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": task},
            ],
            max_tokens=config.max_tokens,
        )
        return {
            "result": response.choices[0].message.content,
            "model_used": config.model_id,
            "tier": tier.value,
            "estimated_cost": self._estimate_cost(response, config),
        }

    def _estimate_cost(self, response, config: ModelConfig) -> float:
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        return (
            (input_tokens / 1000) * config.cost_per_1k_input
            + (output_tokens / 1000) * config.cost_per_1k_output
        )
Multi-Model Agent Pipeline
In a multi-step agent pipeline, each step can use a different model. Here is a document analysis pipeline where steps are assigned different tiers.
from agents import Agent

# Step 1: Extract key entities (fast model)
extractor = Agent(
    name="Entity Extractor",
    model="gpt-4o-mini",
    instructions="Extract all named entities (people, companies, dates, amounts) from the text. Return as JSON.",
)

# Step 2: Classify document type (fast model)
classifier = Agent(
    name="Document Classifier",
    model="gpt-4o-mini",
    instructions="Classify this document as: contract, invoice, letter, report, or memo. Return only the type.",
)

# Step 3: Deep analysis (powerful model)
analyzer = Agent(
    name="Document Analyzer",
    model="claude-opus-4-20250514",
    instructions="""Perform deep analysis of this document:
- Identify key obligations and deadlines
- Flag potential risks or ambiguities
- Summarize the document's purpose and implications
Use the entity data and document type provided for context.""",
)
Orchestrating the Pipeline
from agents import Runner


async def analyze_document(document_text: str) -> dict:
    # Fast: Extract entities (~$0.001)
    entities_result = await Runner.run(
        extractor, f"Extract entities from: {document_text}"
    )

    # Fast: Classify document (~$0.0005)
    class_result = await Runner.run(
        classifier, f"Classify: {document_text[:500]}"
    )

    # Powerful: Deep analysis (~$0.05)
    analysis_prompt = f"""Document type: {class_result.final_output}
Entities found: {entities_result.final_output}
Full document: {document_text}"""
    analysis_result = await Runner.run(analyzer, analysis_prompt)

    return {
        "entities": entities_result.final_output,
        "document_type": class_result.final_output,
        "analysis": analysis_result.final_output,
        "total_estimated_cost": 0.05,  # vs ~$0.15 if all steps used the powerful model
    }
The fast steps cost almost nothing. The expensive model only runs for the one step that genuinely needs deep reasoning. Over thousands of documents, this architecture saves significant cost.
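The savings claim is easy to check against the per-step estimates above:

```python
# Back-of-envelope check using the pipeline's per-step cost estimates.
tiered = 0.001 + 0.0005 + 0.05   # extract + classify + analyze
all_powerful = 0.15              # estimate if every step ran on the top tier

savings = 1 - tiered / all_powerful
print(f"tiered: ${tiered:.4f} per document, saving {savings:.0%}")
```

That lands at roughly two-thirds savings per document, consistent with the 60-80% range quoted at the top of this article.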
Cost Tracking and Model Selection Feedback
Track actual costs and quality per tier to refine routing decisions over time.
import sqlite3
from datetime import datetime, timezone


class CostTracker:
    def __init__(self, db_path: str = "model_costs.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS model_usage (
                id INTEGER PRIMARY KEY,
                timestamp TEXT,
                task_type TEXT,
                tier TEXT,
                model_id TEXT,
                input_tokens INTEGER,
                output_tokens INTEGER,
                cost REAL,
                quality_score REAL
            )
        """)

    def log_usage(self, task_type: str, tier: str, model_id: str,
                  input_tokens: int, output_tokens: int, cost: float):
        self.db.execute(
            "INSERT INTO model_usage (timestamp, task_type, tier, model_id, "
            "input_tokens, output_tokens, cost) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), task_type, tier, model_id,
             input_tokens, output_tokens, cost),
        )
        self.db.commit()

    def get_cost_summary(self) -> dict:
        rows = self.db.execute(
            "SELECT tier, SUM(cost), COUNT(*) FROM model_usage GROUP BY tier"
        ).fetchall()
        return {row[0]: {"total_cost": row[1], "requests": row[2]} for row in rows}
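To see what the summary looks like in practice, here is the same GROUP BY query run standalone against an in-memory database, with made-up usage rows:

```python
import sqlite3

# Demonstrate the tier-level summary query against an in-memory database.
# The inserted rows are illustrative, not real usage data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE model_usage (tier TEXT, cost REAL)")
db.executemany(
    "INSERT INTO model_usage VALUES (?, ?)",
    [("fast", 0.0006), ("fast", 0.0004), ("powerful", 0.0675)],
)

rows = db.execute(
    "SELECT tier, SUM(cost), COUNT(*) FROM model_usage GROUP BY tier"
).fetchall()
summary = {tier: {"total_cost": total, "requests": n} for tier, total, n in rows}
print(summary)
```

A summary shaped like this makes the routing mix visible at a glance: if the powerful tier accounts for most requests rather than most cost-sensitive exceptions, the router is escalating too eagerly.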
FAQ
How do you handle cases where the router misclassifies a task?
Add a quality feedback loop. If the output from a FAST-tier model is flagged as low quality (by a user or automated check), automatically retry with a higher tier and log the misclassification. Over time, use these logs to fine-tune the router's classification prompt or train a small classifier model specifically for routing.
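The escalation loop can be sketched as follows. Both `run_model` and `quality_check` here are stand-ins for real model calls and evaluation logic — this is a minimal sketch of the retry pattern, not a production implementation:

```python
# Sketch of tier escalation on low-quality output.
TIERS = ["FAST", "BALANCED", "POWERFUL"]

def run_with_escalation(task, run_model, quality_check, start_tier="FAST"):
    """Retry the task at successively higher tiers until quality_check passes."""
    for tier in TIERS[TIERS.index(start_tier):]:
        output = run_model(tier, task)
        if quality_check(output):
            return tier, output
    return tier, output  # best effort: return the top-tier attempt

# Stubbed example: pretend only BALANCED and above produce acceptable output.
tier, out = run_with_escalation(
    "summarize this",
    run_model=lambda t, task: f"{t} answer",
    quality_check=lambda o: o.startswith("BALANCED") or o.startswith("POWERFUL"),
)
print(tier, out)
```

Logging each escalation (which tier the task started at versus where it finally succeeded) produces exactly the misclassification data the router needs for refinement.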
Should the router model itself be swappable?
Yes. The router should be the fastest and cheapest model available. As new small models are released (like GPT-4o-mini successors), swap the router model without changing the rest of the architecture. The router's accuracy requirements are modest — it just needs to distinguish simple from complex tasks.
How do you handle cross-model context passing?
Each model in the pipeline receives only the information it needs, not the full conversation history. The orchestrator extracts relevant outputs from upstream steps and formats them as context for downstream steps. This reduces token usage and prevents context window overflow when using models with smaller limits.
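As a minimal sketch of that selective context passing — the field names and helper are hypothetical, not from the pipeline above:

```python
# Cross-step context passing: the downstream prompt is built from selected
# upstream outputs, never the full conversation history.
def build_downstream_prompt(upstream: dict[str, str], document: str,
                            needed_keys: tuple[str, ...]) -> str:
    """Include only the upstream fields a step actually needs."""
    context_lines = [f"{key}: {upstream[key]}" for key in needed_keys]
    return "\n".join(context_lines) + f"\nFull document: {document}"

upstream = {
    "document_type": "contract",
    "entities": '{"parties": ["Acme", "Globex"]}',
    "debug_trace": "verbose tool logs",  # never forwarded downstream
}
prompt = build_downstream_prompt(
    upstream, "Agreement between Acme and Globex...",
    ("document_type", "entities"),
)
print(prompt)
```

The explicit `needed_keys` allow-list is the design choice that matters: downstream steps cannot accidentally inherit verbose upstream artifacts that inflate token usage.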