
LLM Watermarking: Detecting AI-Generated Content in Agent Outputs

Understand how LLM watermarking techniques embed detectable signals in generated text, how detection algorithms work, and the implications for agent transparency, compliance, and content provenance.

Why Watermark AI-Generated Text?

As AI agents produce more content — emails, reports, code, customer communications — the ability to distinguish AI-generated text from human-written text becomes increasingly important. Regulatory frameworks like the EU AI Act require transparency about AI-generated content. Internal compliance teams need to audit which communications were written by agents. And content platforms need tools to enforce their disclosure policies for synthetic content.

LLM watermarking embeds a statistically detectable signal in generated text that is invisible to human readers but can be identified by a detection algorithm.

How Text Watermarking Works

The most influential watermarking technique, introduced by Kirchenbauer et al., works by splitting the vocabulary into a "green list" and a "red list" at each generation step using a hash of the preceding token. During generation, a small bias is added to green-list tokens, making them slightly more likely to be selected. The resulting text looks natural but contains a statistical imbalance that a detector can identify.

flowchart LR
    CTX(["Preceding token"])
    HASH["Hash previous token<br/>seed PRNG"]
    SPLIT{"Split vocabulary"}
    GREEN["Green list<br/>logits + delta"]
    RED["Red list<br/>logits unchanged"]
    SAMPLE["Sample next token"]
    TEXT(["Watermarked text"])
    CTX --> HASH --> SPLIT
    SPLIT -->|gamma fraction| GREEN --> SAMPLE
    SPLIT -->|remainder| RED --> SAMPLE
    SAMPLE --> TEXT
    style SPLIT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GREEN fill:#059669,stroke:#047857,color:#fff
    style TEXT fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
import hashlib
import numpy as np

class LLMWatermarker:
    """Implements token-level watermarking during text generation."""

    def __init__(self, vocab_size: int, gamma: float = 0.5, delta: float = 2.0):
        self.vocab_size = vocab_size
        self.gamma = gamma    # fraction of vocabulary in the green list
        self.delta = delta    # logit bias added to green-list tokens

    def _get_green_list(self, prev_token_id: int, seed: int = 42) -> set[int]:
        """Deterministically split vocabulary into green/red using prev token."""
        hash_input = f"{seed}:{prev_token_id}".encode()
        hash_val = int(hashlib.sha256(hash_input).hexdigest(), 16)
        rng = np.random.RandomState(hash_val % (2**31))

        # Randomly select gamma fraction of vocab as green list
        perm = rng.permutation(self.vocab_size)
        green_size = int(self.gamma * self.vocab_size)
        return set(perm[:green_size].tolist())

    def apply_watermark(
        self, logits: np.ndarray, prev_token_id: int, seed: int = 42
    ) -> np.ndarray:
        """Add watermark bias to logits during generation."""
        green_list = self._get_green_list(prev_token_id, seed)
        watermarked_logits = logits.copy()

        for token_id in green_list:
            watermarked_logits[token_id] += self.delta

        return watermarked_logits
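
A minimal usage sketch, with random logits standing in for a real model's output (the vocabulary size and token IDs here are toy values, not tied to any particular model):

# Toy demonstration: bias random logits and sample the next token.
watermarker = LLMWatermarker(vocab_size=50_000)

rng = np.random.default_rng(0)
logits = rng.standard_normal(50_000)  # stand-in for model logits
biased = watermarker.apply_watermark(logits, prev_token_id=1234)

# After softmax, each green-list token is e^delta times likelier than
# before, all else being equal.
probs = np.exp(biased - biased.max())
probs /= probs.sum()
next_token = int(rng.choice(50_000, p=probs))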

Detecting the Watermark

Detection works by examining the generated text and checking whether green-list tokens appear more frequently than expected by chance. Under the null hypothesis (no watermark), each token lands on the green list with probability gamma. A one-sided z-test determines whether the observed green fraction is significantly higher: over T scored tokens with observed green fraction p, the statistic is z = (p - gamma) / sqrt(gamma * (1 - gamma) / T):

import hashlib

import numpy as np
from scipy import stats

class WatermarkDetector:
    """Detects watermarked text by analyzing green-list token frequency."""

    def __init__(self, vocab_size: int, gamma: float = 0.5, seed: int = 42):
        self.vocab_size = vocab_size
        self.gamma = gamma
        self.seed = seed

    def _get_green_list(self, prev_token_id: int) -> set[int]:
        hash_input = f"{self.seed}:{prev_token_id}".encode()
        hash_val = int(hashlib.sha256(hash_input).hexdigest(), 16)
        rng = np.random.RandomState(hash_val % (2**31))
        perm = rng.permutation(self.vocab_size)
        green_size = int(self.gamma * self.vocab_size)
        return set(perm[:green_size].tolist())

    def detect(
        self, token_ids: list[int], threshold: float = 4.0
    ) -> dict:
        """Test whether a sequence of tokens contains a watermark."""
        green_count = 0
        total = 0

        for i in range(1, len(token_ids)):
            prev_id = token_ids[i - 1]
            curr_id = token_ids[i]
            green_list = self._get_green_list(prev_id)

            if curr_id in green_list:
                green_count += 1
            total += 1

        if total == 0:
            return {"watermarked": False, "z_score": 0.0, "p_value": 1.0}

        # Z-test: is green fraction significantly above gamma?
        expected = self.gamma
        observed = green_count / total
        z_score = (observed - expected) / np.sqrt(expected * (1 - expected) / total)
        p_value = stats.norm.sf(z_score)  # survival function; more precise than 1 - cdf for large z

        return {
            "watermarked": z_score > threshold,
            "z_score": float(z_score),
            "p_value": float(p_value),
            "green_fraction": float(observed),
            "tokens_analyzed": total,
        }
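
To see the detector in action, here is an end-to-end sketch wiring the two classes together, with greedy sampling over random logits standing in for a real model:

# Generate 200 watermarked tokens, then confirm the detector flags them.
VOCAB = 50_000
wm = LLMWatermarker(vocab_size=VOCAB)
det = WatermarkDetector(vocab_size=VOCAB)

rng = np.random.default_rng(7)
tokens = [0]  # arbitrary start token
for _ in range(200):
    logits = rng.standard_normal(VOCAB)  # stand-in for model logits
    biased = wm.apply_watermark(logits, prev_token_id=tokens[-1])
    tokens.append(int(np.argmax(biased)))  # greedy sampling

result = det.detect(tokens)
print(result)  # expect watermarked=True with a z-score far above 4.0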

Robustness Considerations

Watermarks face adversarial attacks. Paraphrasing the text with another model can remove the watermark, because the paraphrasing model samples from its own, unbiased distribution. Simple edits (inserting, deleting, or substituting a few words) also degrade the signal: each change corrupts the green-list test both at the edited position and at the token that follows it, whose green list is seeded by the edited token. Longer texts are more robustly watermarked because the statistical signal grows with sequence length; for a fixed green-list excess, the z-score scales with the square root of the token count.
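
The degradation is easy to simulate. Continuing the end-to-end sketch above (same toy setup and variables), randomly substituting 30% of the tokens noticeably lowers the detection score:

# Robustness sketch: substitute 30% of tokens at random, then re-detect.
attacked = list(tokens)
idx = rng.choice(len(attacked), size=int(0.3 * len(attacked)), replace=False)
for i in idx:
    attacked[i] = int(rng.integers(VOCAB))  # random-token substitution

print(det.detect(tokens)["z_score"])    # clean text: z well above threshold
print(det.detect(attacked)["z_score"])  # edited text: noticeably lower z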

Current research focuses on robust watermarking schemes that survive paraphrasing and editing by embedding the signal at a semantic level rather than a token level. These approaches encode the watermark in the distribution of ideas or sentence structures rather than individual token choices.

Privacy and Ethical Considerations

Watermarking raises important privacy questions. If every output from an agent is watermarked with a unique key tied to a user or session, it becomes possible to trace any piece of text back to the user who generated it. This enables accountability but also surveillance.

Agent developers must consider: Who holds the watermark keys? Under what circumstances can detection be performed? Are users informed that outputs are watermarked? These are design decisions with legal and ethical implications that go beyond the technical implementation.


Implementing Watermarking in Agent Pipelines

For production agents, watermarking can be applied at the inference layer (modifying logits during generation) or as a metadata approach (embedding cryptographic signatures in output metadata without modifying the text itself). The metadata approach preserves output quality completely but can be stripped by copying the text without its metadata.
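
As a contrast with logit-level watermarking, here is a minimal sketch of the metadata approach, assuming a shared secret and an HMAC-SHA256 signature carried alongside the text (the key handling and payload shape are illustrative, not a standard):

import hmac
import hashlib

SECRET_KEY = b"example-secret"  # illustrative only; load real keys from a secret manager

def sign_output(text: str) -> dict:
    """Attach a provenance signature as metadata, leaving the text itself untouched."""
    sig = hmac.new(SECRET_KEY, text.encode(), hashlib.sha256).hexdigest()
    return {"text": text, "provenance": {"alg": "HMAC-SHA256", "signature": sig}}

def verify_output(payload: dict) -> bool:
    """Verify provenance; returns False if the text was altered after signing."""
    expected = hmac.new(SECRET_KEY, payload["text"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, payload["provenance"]["signature"])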

FAQ

Does watermarking reduce the quality of generated text?

With a small delta (bias value around 1.0-2.0), the quality impact is negligible — human evaluators generally cannot distinguish watermarked from non-watermarked text. Higher delta values make the watermark more robust but can introduce subtle statistical artifacts in word choice.

Can watermarks survive translation into another language?

Token-level watermarks typically do not survive translation because the new language uses a completely different vocabulary and token distribution. Semantic-level watermarking approaches show more promise for cross-lingual robustness, but this remains an active research area.

How long does text need to be for reliable detection?

Detection reliability depends on gamma, delta, and the significance threshold. With typical parameters (gamma=0.5, delta=2.0), reliable detection (z-score above 4.0) generally requires at least 50-100 tokens. Shorter texts produce unreliable results with high false-positive and false-negative rates.
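
As a rough worked example, assume the watermark lifts the observed green fraction to about 0.75 (a plausible value for delta=2.0 on moderately high-entropy text; the exact figure depends on the model). Solving the z-test formula for the token count T:

# Solve z = (p - gamma) / sqrt(gamma * (1 - gamma) / T) for T.
gamma, p, z = 0.5, 0.75, 4.0  # p is the assumed green fraction under watermarking
T = (z**2 * gamma * (1 - gamma)) / (p - gamma) ** 2
print(T)  # 64.0 tokens, consistent with the 50-100 token rule of thumb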


#LLMWatermarking #AIDetection #ContentProvenance #Compliance #AgenticAI #LearnAI #AIEngineering
