
MCPServerManager: Orchestrating Multiple MCP Servers

Use MCPServerManager to orchestrate multiple MCP server connections with automatic failure detection, reconnection strategies, and health monitoring using active_servers, failed_servers, and drop_failed_servers.

The Multi-Server Challenge

Production agents rarely use a single MCP server. A typical enterprise agent might connect to:

  • A filesystem server for document access
  • A database server for customer records
  • A search server for knowledge base queries
  • A custom business logic server for domain operations
  • An email server for sending notifications

When everything is healthy, this works well. But in production, servers crash, network connections drop, and deployments restart services. A single failed server can break the entire agent if connections are not managed properly.

MCPServerManager is the orchestration layer that handles multi-server lifecycle management. It tracks which servers are active, which have failed, and provides strategies for recovery — so your agent degrades gracefully instead of crashing.

Setting Up MCPServerManager

MCPServerManager wraps multiple MCP server instances and provides a unified interface for connection management. The diagram below shows where the managed servers sit between the MCP client and the LLM session:

flowchart LR
    HOST(["MCP host<br/>Claude Desktop or IDE"])
    CLIENT["MCP client"]
    subgraph SERVERS["MCP Servers"]
        S1["Filesystem server"]
        S2["GitHub server"]
        S3["Postgres server"]
        SX["Custom tool server"]
    end
    LLM["LLM session"]
    OUT(["Grounded action"])
    HOST <--> CLIENT
    CLIENT <-->|stdio or HTTP+SSE| S1
    CLIENT <--> S2
    CLIENT <--> S3
    CLIENT <--> SX
    CLIENT --> LLM --> OUT
    style HOST fill:#f1f5f9,stroke:#64748b,color:#0f172a
    style CLIENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
Define each server first, then hand the full list to the manager:

from agents.mcp import (
    MCPServerStdio,
    MCPServerStreamableHttp,
    MCPServerManager,
)

# Define your servers
filesystem = MCPServerStdio(
    name="Filesystem",
    params={
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/data"],
    },
    cache_tools_list=True,
)

database = MCPServerStreamableHttp(
    name="Database",
    params={"url": "http://db-mcp:8001/mcp"},
    cache_tools_list=True,
)

search = MCPServerStreamableHttp(
    name="Search",
    params={"url": "http://search-mcp:8002/mcp"},
    cache_tools_list=True,
)

custom_tools = MCPServerStdio(
    name="BusinessLogic",
    params={
        "command": "python",
        "args": ["business_logic_server.py"],
    },
    cache_tools_list=True,
)

# Create the manager
manager = MCPServerManager(
    servers=[filesystem, database, search, custom_tools]
)

Connecting with the Manager

Use the manager as an async context manager. It handles connecting to all servers and provides status tracking:

from agents import Agent, Runner

agent = Agent(
    name="Enterprise Assistant",
    instructions="You help employees with file access, data queries, and business operations.",
    mcp_servers=[filesystem, database, search, custom_tools],
)

async def run_agent(user_message: str):
    async with manager:
        # Check which servers connected successfully
        active = manager.active_servers
        failed = manager.failed_servers

        print(f"Active servers: {[s.name for s in active]}")
        print(f"Failed servers: {[s.name for s in failed]}")

        if not active:
            return "All MCP servers are unavailable. Please try again later."

        result = await Runner.run(agent, user_message)
        return result.final_output

The key difference from managing servers individually is that MCPServerManager does not raise an exception if one server fails to connect. Instead, it tracks the failure and lets you decide how to respond.
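For contrast, here is roughly what the same startup looks like without the manager. This is a sketch, assuming each server's connect() raises on failure, so every call has to be guarded by hand:

async def connect_manually():
    # Without the manager: guard each connect() individually
    # and track active/failed state yourself.
    active, failed = [], []
    for server in [filesystem, database, search, custom_tools]:
        try:
            await server.connect()
            active.append(server)
        except Exception as exc:
            print(f"{server.name} failed to connect: {exc}")
            failed.append(server)
    return active, failed

MCPServerManager gives you the same active/failed bookkeeping without the boilerplate.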

Monitoring Active and Failed Servers

MCPServerManager provides two properties for monitoring server health:

  • active_servers — A list of server instances that are currently connected and operational.
  • failed_servers — A list of server instances that failed to connect or lost their connection.

Use these to build health checks and adaptive behavior:

from fastapi import FastAPI

app = FastAPI()

@app.get("/health/mcp")
async def mcp_health():
    active = manager.active_servers
    failed = manager.failed_servers
    return {
        "status": "degraded" if failed else "healthy",
        "active": [s.name for s in active],
        "failed": [s.name for s in failed],
        "total": len(active) + len(failed),
        "active_count": len(active),
    }

You can also use server status to adjust agent behavior dynamically:

async def adaptive_instructions(run_context, agent):
    active_names = {s.name for s in manager.active_servers}
    base = "You are an enterprise assistant."

    if "Database" not in active_names:
        base += (
            " The database server is currently unavailable. "
            "Let the user know you cannot look up records right now "
            "and suggest they try again in a few minutes."
        )

    if "Search" not in active_names:
        base += (
            " The search server is offline. You cannot search the "
            "knowledge base. Answer from your training data and note "
            "that results may not reflect the latest documentation."
        )

    return base

agent = Agent(
    name="Enterprise Assistant",
    instructions=adaptive_instructions,
    mcp_servers=[filesystem, database, search, custom_tools],
)

Dropping Failed Servers

When a server fails, it stays in the manager's server list by default. The agent SDK will skip it when listing tools, but it still occupies a connection slot and may cause timeouts if the agent tries to reach it.

drop_failed_servers() removes failed servers from the manager entirely:

async def run_with_cleanup():
    async with manager:
        # Some servers may have failed to connect
        if manager.failed_servers:
            failed_names = [s.name for s in manager.failed_servers]
            print(f"Dropping failed servers: {failed_names}")
            manager.drop_failed_servers()

        # Now only healthy servers remain
        result = await Runner.run(agent, "Check my recent orders")
        return result.final_output

This is useful when you know a server will not recover during the current session. Dropping it prevents the agent from wasting tokens generating tool calls that will fail.
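You can combine dropping with the criticality rule from the best practices below: fail fast when an essential server is down, drop the optional ones and continue. A minimal sketch, assuming for illustration that Database is the only critical server:

CRITICAL = {"Database"}

async def run_with_policy(user_message: str):
    async with manager:
        failed_names = {s.name for s in manager.failed_servers}

        # Essential server down: refuse the request outright.
        if failed_names & CRITICAL:
            return "Core systems are unavailable. Please try again later."

        # Optional servers down: drop them and continue degraded.
        if manager.failed_servers:
            manager.drop_failed_servers()

        result = await Runner.run(agent, user_message)
        return result.final_output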

Reconnection Strategies

For long-running services, you need a strategy to reconnect failed servers. The manager itself does not auto-reconnect, but you can build reconnection logic on top of it:

import asyncio
import logging

logger = logging.getLogger(__name__)

class ResilientMCPManager:
    def __init__(self, servers, reconnect_interval=60, max_retries=5):
        self.all_servers = servers
        self.manager = MCPServerManager(servers=servers)
        self.reconnect_interval = reconnect_interval
        self.max_retries = max_retries
        self.retry_counts = {s.name: 0 for s in servers}
        self._reconnect_task = None

    async def __aenter__(self):
        await self.manager.__aenter__()
        self._reconnect_task = asyncio.create_task(self._reconnect_loop())
        return self

    async def __aexit__(self, *args):
        if self._reconnect_task:
            self._reconnect_task.cancel()
            # Let the loop unwind before tearing down the servers,
            # so no reconnect attempt races the shutdown.
            try:
                await self._reconnect_task
            except asyncio.CancelledError:
                pass
        await self.manager.__aexit__(*args)

    async def _reconnect_loop(self):
        while True:
            await asyncio.sleep(self.reconnect_interval)
            failed = list(self.manager.failed_servers)
            for server in failed:
                if self.retry_counts[server.name] >= self.max_retries:
                    logger.warning(
                        f"Server {server.name} exceeded max retries, skipping"
                    )
                    continue
                try:
                    logger.info(f"Attempting reconnect: {server.name}")
                    await server.connect()
                    self.retry_counts[server.name] = 0
                    logger.info(f"Reconnected: {server.name}")
                except Exception as e:
                    self.retry_counts[server.name] += 1
                    logger.error(
                        f"Reconnect failed for {server.name}: {e} "
                        f"(attempt {self.retry_counts[server.name]}/"
                        f"{self.max_retries})"
                    )

    @property
    def active_servers(self):
        return self.manager.active_servers

    @property
    def failed_servers(self):
        return self.manager.failed_servers
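Standalone, the wrapper drops in wherever the plain manager would go:

async def main():
    resilient = ResilientMCPManager(
        servers=[filesystem, database, search, custom_tools],
        reconnect_interval=30,
    )
    async with resilient:
        print(f"Active: {[s.name for s in resilient.active_servers]}")
        result = await Runner.run(agent, "Summarize today's new documents")
        print(result.final_output)

asyncio.run(main())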

Integrating with Agent Runner

Here is a complete example that ties the manager into an agent service:

from agents import Agent, Runner
from fastapi import FastAPI, HTTPException
import logging

logger = logging.getLogger(__name__)
app = FastAPI()

resilient_manager = ResilientMCPManager(
    servers=[filesystem, database, search, custom_tools],
    reconnect_interval=30,
    max_retries=10,
)

agent = Agent(
    name="Enterprise Assistant",
    instructions=adaptive_instructions,
    mcp_servers=[filesystem, database, search, custom_tools],
)

@app.on_event("startup")
async def startup():
    await resilient_manager.__aenter__()
    active = resilient_manager.active_servers
    failed = resilient_manager.failed_servers
    logger.info(f"MCP servers active: {[s.name for s in active]}")
    if failed:
        logger.warning(f"MCP servers failed: {[s.name for s in failed]}")

@app.on_event("shutdown")
async def shutdown():
    await resilient_manager.__aexit__(None, None, None)

@app.post("/chat")
async def chat(message: str):
    active = resilient_manager.active_servers
    if not active:
        return {"error": "All MCP servers are unavailable", "status": 503}

    result = await Runner.run(agent, message)
    return {
        "response": result.final_output,
        "servers_used": [s.name for s in active],
    }

Best Practices for Multi-Server Agents

  1. Always use MCPServerManager when connecting to two or more MCP servers. Direct management of multiple servers leads to inconsistent error handling.
  2. Categorize servers by criticality. Fail fast if essential servers are down. Degrade gracefully for optional ones.
  3. Set connection timeouts. Do not let a slow server block the entire startup sequence (a sketch follows this list).
  4. Drop permanently failed servers. If a server exceeds your retry limit, remove it to prevent useless tool calls.
  5. Expose health endpoints. Report which servers are active and wire this into your alerting system.
  6. Log every lifecycle event. Connection, disconnection, and reconnection attempts should all produce structured log entries with server names and error details.
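For points 3 and 6, here is a minimal sketch of a guarded connect. The 10-second budget and log event names are illustrative choices, not part of any SDK:

import asyncio
import logging

logger = logging.getLogger(__name__)

async def connect_with_timeout(server, timeout: float = 10.0) -> bool:
    # Bound each connect attempt so one slow server cannot stall
    # startup, and emit structured logs for every outcome.
    try:
        await asyncio.wait_for(server.connect(), timeout=timeout)
        logger.info("mcp_connect_ok", extra={"server": server.name})
        return True
    except asyncio.TimeoutError:
        logger.error("mcp_connect_timeout", extra={"server": server.name})
        return False
    except Exception as exc:
        logger.error(
            "mcp_connect_error",
            extra={"server": server.name, "error": str(exc)},
        )
        return False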

MCPServerManager transforms multi-server MCP from a fragile setup into a resilient system. By tracking server health, supporting graceful degradation, and enabling reconnection, it gives your production agents the reliability they need to serve real users.
