Learn Agentic AI

Consensus Algorithms for Multi-Agent Systems: Voting, Averaging, and Byzantine Fault Tolerance

Explore how multi-agent AI systems reach agreement using consensus algorithms including majority voting, weighted averaging, and Byzantine fault tolerance. Includes Python implementations for each pattern.

Why Agents Need Consensus

When multiple AI agents collaborate on a task, they frequently produce different answers. One agent might classify a support ticket as "billing," another as "account access," and a third as "technical." Without a structured way to reconcile these disagreements, your system either picks arbitrarily or fails entirely.

Consensus algorithms provide the mechanism for agents to reach agreement. Borrowed from distributed systems theory, these patterns let you build multi-agent pipelines that can be more accurate than any single agent and that stay resilient when individual agents fail.

Pattern 1: Majority Voting

The simplest consensus mechanism asks each agent for a discrete answer and picks the one chosen most often. This works best when agents produce categorical outputs like classifications, yes/no decisions, or label assignments.

flowchart TD
    INPUT(["Task input"])
    SUPER["Supervisor agent<br/>plans plus monitors"]
    W1["Worker 1<br/>research"]
    W2["Worker 2<br/>code"]
    W3["Worker 3<br/>writing"]
    CRITIC{"Output meets<br/>rubric?"}
    REWORK["Rework or<br/>retry path"]
    SHARED[("Shared scratchpad<br/>and memory")]
    OUT(["Final result"])
    INPUT --> SUPER
    SUPER --> W1 --> CRITIC
    SUPER --> W2 --> CRITIC
    SUPER --> W3 --> CRITIC
    W1 --> SHARED
    W2 --> SHARED
    W3 --> SHARED
    SHARED --> SUPER
    CRITIC -->|Pass| OUT
    CRITIC -->|Fail| REWORK --> SUPER
    style SUPER fill:#4f46e5,stroke:#4338ca,color:#fff
    style CRITIC fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OUT fill:#059669,stroke:#047857,color:#fff
    style SHARED fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
from collections import Counter
from dataclasses import dataclass
from typing import Any

@dataclass
class AgentVote:
    agent_id: str
    choice: str
    confidence: float

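class MajorityVotingConsensus:
    """Pick the most common discrete choice among agent votes."""

    def __init__(self, quorum: int = 3):
        # Minimum number of votes required before a decision is made
        self.quorum = quorum

    def resolve(self, votes: list[AgentVote]) -> dict[str, Any]:
        if len(votes) < self.quorum:
            raise ValueError(
                f"Need {self.quorum} votes, got {len(votes)}"
            )

        counts = Counter(v.choice for v in votes)
        # most_common() lists equal counts in first-seen order, so a tie
        # would otherwise be settled silently by vote order
        ranked = counts.most_common(2)
        winner, winner_count = ranked[0]
        tie = len(ranked) > 1 and ranked[1][1] == winner_count
        total = len(votes)

        return {
            "decision": winner,
            "agreement_ratio": winner_count / total,
            "vote_distribution": dict(counts),
            "unanimous": winner_count == total,
            "tie": tie,  # True when another choice has the same count
        }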

# Usage
consensus = MajorityVotingConsensus(quorum=3)
votes = [
    AgentVote("classifier-1", "billing", 0.85),
    AgentVote("classifier-2", "billing", 0.72),
    AgentVote("classifier-3", "account_access", 0.65),
]
result = consensus.resolve(votes)
# decision: "billing", agreement_ratio: 0.667

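The agreement_ratio field is critical for downstream logic. A 3-to-0 unanimous vote carries far more weight than a 2-to-1 split. Define explicit thresholds: for example, escalate to a human reviewer when agreement drops below 0.6. With three agents the ratio can only be 1.0, 0.667, or 0.333, so that threshold escalates only a three-way split; raise it or add agents if 2-to-1 splits also need review.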
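
The threshold rule can be sketched as a small routing function. This is a minimal illustration, not part of the class above: the `route_decision` name, the `ESCALATION_THRESHOLD` constant, and the `"escalate_to_human"` action label are assumptions made for the example.

```python
from collections import Counter

# Hypothetical cutoff taken from the 0.6 example; tune it on your own data.
ESCALATION_THRESHOLD = 0.6


def route_decision(
    choices: list[str], threshold: float = ESCALATION_THRESHOLD
) -> dict:
    """Accept the majority label, or escalate when agreement is too low."""
    if not choices:
        raise ValueError("Need at least one vote")

    counts = Counter(choices)
    winner, winner_count = counts.most_common(1)[0]
    ratio = winner_count / len(choices)

    # Below the threshold the label is still reported, but flagged for review
    action = "accept" if ratio >= threshold else "escalate_to_human"
    return {"decision": winner, "agreement_ratio": ratio, "action": action}


# A 2-to-1 split clears the 0.6 threshold; a three-way split does not.
print(route_decision(["billing", "billing", "account_access"])["action"])
print(route_decision(["billing", "account_access", "technical"])["action"])
```

Keeping the threshold as a parameter makes it easy to tighten the rule per task without touching the voting logic.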

Pattern 2: Weighted Averaging

When agents produce numeric outputs (scores, probabilities, estimates), weighted averaging lets you combine them while giving more influence to agents with higher confidence or better historical accuracy.

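from typing import Any

class WeightedAverageConsensus:
    def __init__(self, agent_weights: dict[str, float] | None = None):
        # Per-agent multipliers based on historical accuracy (default 1.0)
        self.agent_weights = agent_weights or {}

    def resolve(
        self, estimates: list[dict[str, Any]]
    ) -> dict[str, float]:
        # Each estimate needs "agent_id", "value", and "confidence" keys
        total_weight = 0.0
        weighted_sum = 0.0

        for est in estimates:
            agent_id = est["agent_id"]
            value = est["value"]
            confidence = est["confidence"]
            historical_weight = self.agent_weights.get(agent_id, 1.0)

            # Self-reported confidence scaled by the agent's track record
            weight = confidence * historical_weight
            weighted_sum += value * weight
            total_weight += weight

        if total_weight == 0:
            # Covers an empty list and all-zero confidences or weights
            raise ValueError("Total weight is zero; cannot average")

        consensus_value = weighted_sum / total_weight
        # Unweighted spread of the raw estimates around the consensus
        variance = sum(
            ((e["value"] - consensus_value) ** 2) for e in estimates
        ) / len(estimates)

        return {
            "consensus_value": round(consensus_value, 4),
            "variance": round(variance, 4),
            "num_agents": len(estimates),
        }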

# Agents with proven track records get higher weight
consensus = WeightedAverageConsensus(
    agent_weights={"estimator-a": 1.5, "estimator-b": 1.0, "estimator-c": 0.7}
)
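
To make the weighting concrete, here is a small standalone sketch of the same arithmetic with the three weights above. The `weighted_average` helper and the sample estimates are invented for illustration; only the weight values come from the article.

```python
def weighted_average(
    estimates: list[dict], agent_weights: dict[str, float]
) -> float:
    """Combine estimates using confidence * historical weight per agent."""
    weighted_sum = 0.0
    total_weight = 0.0
    for est in estimates:
        weight = est["confidence"] * agent_weights.get(est["agent_id"], 1.0)
        weighted_sum += est["value"] * weight
        total_weight += weight
    if total_weight == 0:
        raise ValueError("Total weight is zero; cannot average")
    return weighted_sum / total_weight


agent_weights = {"estimator-a": 1.5, "estimator-b": 1.0, "estimator-c": 0.7}

# Illustrative estimates (made up for this example)
estimates = [
    {"agent_id": "estimator-a", "value": 10.0, "confidence": 0.8},  # weight 1.2
    {"agent_id": "estimator-b", "value": 12.0, "confidence": 0.5},  # weight 0.5
    {"agent_id": "estimator-c", "value": 20.0, "confidence": 1.0},  # weight 0.7
]

result = weighted_average(estimates, agent_weights)
# (12 + 6 + 14) / 2.4 = 13.33, versus an unweighted mean of 14.0
print(round(result, 2))
```

The fully confident but historically weaker estimator-c pulls the result up less than it would in a plain mean, which is the intended effect of the historical weight.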

Pattern 3: Byzantine Fault Tolerance

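In real deployments, agents can fail in unpredictable ways: returning garbage, hallucinating confidently, or being compromised. Byzantine fault tolerance (BFT) addresses these scenarios by requiring a supermajority of agents to agree. The implementation below applies this idea to numeric outputs: it filters outliers with the median and the median absolute deviation (MAD), then requires that enough agents remain before averaging them.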

import statistics

class ByzantineFaultTolerantConsensus:
    """Outlier-filtering consensus sized to tolerate f faulty agents out of 3f+1."""

    def __init__(self, max_faulty: int = 1):
        self.max_faulty = max_faulty
        self.min_agents = 3 * max_faulty + 1

    def resolve(self, responses: list[dict]) -> dict:
        # Each response needs "agent_id" and a numeric "value"
        if len(responses) < self.min_agents:
            raise ValueError(
                f"Need >= {self.min_agents} agents for f={self.max_faulty}"
            )

        values = [r["value"] for r in responses]
        median = statistics.median(values)
        # Median absolute deviation: a spread estimate robust to outliers
        mad = statistics.median(
            [abs(v - median) for v in values]
        )
        # Allow 3 MADs of deviation; if MAD is 0 (most values coincide
        # with the median), fall back to 10% of the median's magnitude
        threshold = 3 * mad if mad > 0 else 0.1 * abs(median)

        trusted = [
            r for r in responses
            if abs(r["value"] - median) <= threshold
        ]
        excluded = [
            r for r in responses
            if abs(r["value"] - median) > threshold
        ]

        # More outliers than the fault budget allows: refuse to decide
        if len(trusted) < len(responses) - self.max_faulty:
            return {"status": "no_consensus", "excluded": excluded}

        consensus_val = statistics.mean(r["value"] for r in trusted)
        return {
            "status": "consensus",
            "value": round(consensus_val, 4),
            "trusted_agents": len(trusted),
            "excluded_agents": [e["agent_id"] for e in excluded],
        }

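The key insight is 3f + 1: to tolerate one faulty agent, you need at least four agents total; to tolerate two, you need seven. The bound comes from distributed systems theory, where it is the minimum number of nodes needed for Byzantine agreement over unauthenticated messages. Here it serves as a conservative sizing rule for a single aggregator rather than as part of a full message-passing protocol.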
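
The sizing arithmetic is easy to get wrong by one, so here is a small sketch of both directions of the rule. The helper names `min_agents_for` and `max_faulty_for` are made up for this example.

```python
def min_agents_for(max_faulty: int) -> int:
    """Smallest pool size n that satisfies n >= 3f + 1."""
    if max_faulty < 0:
        raise ValueError("max_faulty must be non-negative")
    return 3 * max_faulty + 1


def max_faulty_for(num_agents: int) -> int:
    """Largest f such that 3f + 1 <= n."""
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")
    return (num_agents - 1) // 3


# One faulty agent needs 4 agents, two need 7, three need 10
for f in range(1, 4):
    print(f"f={f} -> n>={min_agents_for(f)}")

# Five or six agents still tolerate only one fault under this rule
print(max_faulty_for(6))
```

Under this rule, adding a fifth or sixth agent does not raise the fault budget; the next step up comes at seven.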

Choosing the Right Pattern

Use majority voting for classification tasks with discrete outputs. Use weighted averaging for numeric estimates where agent reliability varies. Use BFT when agent outputs cannot be trusted unconditionally — such as when agents call external APIs that might return errors, or when you run heterogeneous models with different failure modes.

FAQ

When should I use consensus instead of just picking the best single agent?

Use consensus whenever the cost of a wrong answer exceeds the cost of running multiple agents. A 3-agent majority vote with mid-tier models can match or outperform a single top-tier model at lower total cost, provided the agents' errors are not strongly correlated. This holds especially for classification tasks, where the agreement rate gives you a built-in confidence signal.
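
A rough way to reason about the accuracy gain is the binomial calculation below. It is a sketch under strong assumptions that rarely hold exactly: a binary task, every agent correct with the same probability p, and fully independent errors. The `majority_accuracy` function name is illustrative.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes a binary task and identical per-agent accuracy p.
    """
    if n % 2 == 0:
        raise ValueError("Use an odd number of agents to avoid ties")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")

    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )


# Three agents at 80% each: 3 * 0.8^2 * 0.2 + 0.8^3 = 0.896
print(round(majority_accuracy(0.8, 3), 4))
# Five agents at 80% each
print(round(majority_accuracy(0.8, 5), 4))
```

Agents built on similar models tend to make correlated mistakes, so treat these numbers as an upper bound on the gain and measure the real figure on your own data.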

How do I handle ties in majority voting?

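Common strategies include: adding more agents until the tie breaks, falling back to the agent with the highest confidence score, or escalating to a human reviewer. Never resolve ties randomly in production, because you lose reproducibility and auditability. Note that Counter.most_common returns equally common choices in first-seen order, so a resolver that takes the top entry without checking for a tie silently lets vote order decide.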
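
Two of these strategies can be combined in a deterministic resolver, sketched below. The `Vote` dataclass mirrors the article's AgentVote, and `resolve_with_tiebreak` and the `"escalate_to_human"` label are hypothetical names for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Vote:  # hypothetical stand-in for the article's AgentVote
    agent_id: str
    choice: str
    confidence: float


def resolve_with_tiebreak(votes: list[Vote]) -> dict:
    """Majority vote with a confidence fallback, then human escalation."""
    if not votes:
        raise ValueError("Need at least one vote")

    counts = Counter(v.choice for v in votes)
    top = max(counts.values())
    tied = sorted(c for c, n in counts.items() if n == top)
    if len(tied) == 1:
        return {"decision": tied[0], "method": "majority"}

    # Fallback: highest single confidence among the tied choices
    best_conf = {
        c: max(v.confidence for v in votes if v.choice == c) for c in tied
    }
    best = max(best_conf.values())
    leaders = [c for c in tied if best_conf[c] == best]
    if len(leaders) == 1:
        return {"decision": leaders[0], "method": "confidence"}

    # Still tied: make no automatic decision
    return {"decision": None, "method": "escalate_to_human"}


tie = [Vote("c1", "billing", 0.9), Vote("c2", "technical", 0.7)]
print(resolve_with_tiebreak(tie))
```

Every path is deterministic for a given set of votes, regardless of the order in which they arrive, which keeps the decision reproducible for audits.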

Does BFT work for text generation, not just numeric outputs?

Yes, but you need a similarity metric to replace numeric distance. Use embedding cosine similarity or ROUGE scores to identify outliers. If one agent generates text that is semantically distant from all others, treat it as a Byzantine failure and exclude it before selecting the most representative output.
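
A minimal sketch of that idea follows, using plain Python and toy two-dimensional vectors in place of real embeddings. The `most_representative` function, the 0.5 similarity cutoff, and the agent names are assumptions for illustration; in practice the vectors would come from an embedding model of your choice.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_representative(
    embeddings: dict[str, list[float]], min_similarity: float = 0.5
) -> dict:
    """Exclude semantically distant outputs, then pick the most central one."""
    ids = list(embeddings)
    if len(ids) < 3:
        raise ValueError("Need at least 3 outputs to identify an outlier")

    def mean_sim(agent: str, pool: list[str]) -> float:
        others = [o for o in pool if o != agent]
        total = sum(cosine(embeddings[agent], embeddings[o]) for o in others)
        return total / len(others)

    # Hypothetical cutoff: low mean similarity to everyone else = outlier
    excluded = [a for a in ids if mean_sim(a, ids) < min_similarity]
    trusted = [a for a in ids if a not in excluded]
    if len(trusted) < 2:
        return {"status": "no_consensus", "excluded": excluded}

    # The most central trusted output is the one closest to the others
    winner = max(trusted, key=lambda a: mean_sim(a, trusted))
    return {"status": "consensus", "representative": winner, "excluded": excluded}


# Toy vectors: three similar outputs and one that points elsewhere
toy = {
    "agent-a": [1.0, 0.1],
    "agent-b": [0.9, 0.2],
    "agent-c": [1.0, 0.0],
    "agent-d": [0.0, 1.0],
}
print(most_representative(toy))
```

A fixed cutoff is the simplest choice; a median and MAD rule over the mean similarities, as in the numeric pattern, is a reasonable alternative when similarity scales vary between tasks.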

