Error Handling and Retry Patterns for Playwright AI Agents

Why Error Handling Is Critical for Browser Automation Agents

Browser automation is inherently unreliable. Networks fail, pages load slowly, elements appear and disappear unpredictably, and websites deploy updates that change their DOM structure without warning. An AI agent that does not handle these failures gracefully will crash on its first encounter with the real web.

Production-grade Playwright agents need layered error handling: catching specific exceptions, implementing intelligent retry logic, providing fallback strategies, and logging sufficient context for debugging. This post covers patterns that make your agents resilient.

Playwright Exception Types

Playwright raises specific exception types that tell you exactly what went wrong:

from playwright.sync_api import (
    sync_playwright,
    TimeoutError as PlaywrightTimeout,
    Error as PlaywrightError,
)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    try:
        page.goto("https://example.com", timeout=5000)
    except PlaywrightTimeout:
        print("Page took too long to load")
    except PlaywrightError as e:
        if "net::ERR_NAME_NOT_RESOLVED" in str(e):
            print("DNS resolution failed — invalid domain")
        elif "net::ERR_CONNECTION_REFUSED" in str(e):
            print("Server refused the connection")
        elif "net::ERR_CONNECTION_TIMED_OUT" in str(e):
            print("Connection timed out at network level")
        else:
            print(f"Browser error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    finally:
        browser.close()

The key exceptions to handle are:

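TimeoutError: an element did not appear within the timeout, or the page did not finish loading
Error with network messages (net::ERR_...): DNS, connection, and SSL failures
Error with element messages: element detached, not visible, or not clickable

TimeoutError is a subclass of Error, so its except clause must come before the one for Error, as in the example above.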

Handling Element Not Found

The most common failure in browser automation is trying to interact with an element that does not exist or is not ready:

def safe_click(page, selector: str, timeout: int = 5000) -> bool:
    """Click an element if it exists, return success status."""
    try:
        locator = page.locator(selector)
        locator.wait_for(state="visible", timeout=timeout)
        locator.click()
        return True
    except PlaywrightTimeout:
        print(f"Element not found: {selector}")
        return False
    except PlaywrightError as e:
        print(f"Cannot click {selector}: {e}")
        return False

def safe_fill(page, selector: str, value: str, timeout: int = 5000) -> bool:
    """Fill a form field if it exists, return success status."""
    try:
        locator = page.locator(selector)
        locator.wait_for(state="visible", timeout=timeout)
        locator.fill(value)
        return True
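    except PlaywrightTimeout:
        print(f"Field not found: {selector}")
        return False
    except PlaywrightError as e:
        # Mirror safe_click: for example, the element is not an editable field
        print(f"Cannot fill {selector}: {e}")
        return False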

def safe_text(page, selector: str, default: str = "") -> str:
    """Extract text content safely."""
    try:
        locator = page.locator(selector)
        if locator.count() > 0:
            return locator.first.text_content() or default
        return default
    except Exception:
        return default
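Because sites change their markup without warning, a single selector is a single point of failure. The helper below is a minimal sketch of a selector fallback chain. The name first_matching_text is hypothetical, and it relies only on the locator calls already used in safe_text (count(), first, text_content()).

```python
from typing import Iterable, Optional


def first_matching_text(page, selectors: Iterable[str]) -> Optional[str]:
    """Return the stripped text of the first selector that yields non-empty text."""
    for selector in selectors:
        try:
            locator = page.locator(selector)
            if locator.count() == 0:
                continue  # selector matched nothing; try the next one
            text = (locator.first.text_content() or "").strip()
            if text:
                return text
        except Exception:
            # Treat any lookup failure as "no match" and keep going
            continue
    return None
```

A call such as first_matching_text(page, ["h1", ".product-title", "[itemprop=name]"]) would try each selector in order; the selector list here is only an illustration.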

Building a Retry Decorator

A generic retry decorator that handles transient failures:

import time
import functools
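from playwright.sync_api import Error as PlaywrightError

def retry(
    max_attempts: int = 3,
    delay: float = 1.0,
    backoff: float = 2.0,
    # PlaywrightError also covers TimeoutError, which subclasses it.
    # Catching bare Exception would also retry programming errors.
    exceptions: tuple = (PlaywrightError,),
):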
):
    """
    Retry decorator with exponential backoff.

    Args:
        max_attempts: Maximum number of attempts
        delay: Initial delay between retries in seconds
        backoff: Multiplier for delay after each retry
        exceptions: Tuple of exception types to catch
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            current_delay = delay
            last_exception = None

            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    last_exception = e
                    if attempt == max_attempts:
                        print(
                            f"[{func.__name__}] Failed after "
                            f"{max_attempts} attempts: {e}"
                        )
                        raise
                    print(
                        f"[{func.__name__}] Attempt {attempt} failed: {e}. "
                        f"Retrying in {current_delay:.1f}s..."
                    )
                    time.sleep(current_delay)
                    current_delay *= backoff

        return wrapper
    return decorator

# Usage
@retry(max_attempts=3, delay=2.0, backoff=2.0)
def navigate_and_extract(page, url: str) -> dict:
    page.goto(url, wait_until="networkidle", timeout=10000)
    return {
        "title": page.title(),
        "content": page.locator("main").text_content(),
    }
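The decorator above sleeps on a fixed schedule. When many agents hit the same site, it can help to cap the delay and add random jitter so that retries do not arrive in lockstep. The generator below is a sketch of that schedule only; backoff_delays, max_delay, jitter, and rng are hypothetical names, not part of Playwright.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    attempts: int,
    delay: float = 1.0,
    backoff: float = 2.0,
    max_delay: float = 30.0,
    jitter: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield the wait before each retry (attempts - 1 values in total)."""
    rng = rng or random.Random()
    current = delay
    for _ in range(max(attempts - 1, 0)):
        capped = min(current, max_delay)
        # Add up to jitter * capped seconds of random extra wait
        yield capped + rng.uniform(0, jitter * capped)
        current *= backoff
```

Inside the wrapper, the loop could draw its sleep time from this generator instead of multiplying current_delay directly. Passing a seeded random.Random makes the schedule reproducible in tests.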

Page-Level Retry with Fresh Context

Sometimes the page itself gets into a bad state. Retry with a fresh browser context:

from playwright.sync_api import sync_playwright

def robust_scrape(url: str, max_attempts: int = 3) -> dict:
    """Scrape a URL with retry logic that creates fresh contexts."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            for attempt in range(1, max_attempts + 1):
                # A fresh context starts without cookies, storage, or cache
                context = browser.new_context()
                page = context.new_page()

                try:
                    page.goto(url, wait_until="networkidle", timeout=15000)

                    # Wait for content to be present
                    page.wait_for_selector("body", timeout=5000)

                    return {
                        "url": url,
                        "title": page.title(),
                        # text_content() can return None
                        "text": (page.locator("body").text_content() or "")[:5000],
                        "attempt": attempt,
                    }

                except Exception as e:
                    print(f"Attempt {attempt}/{max_attempts} failed: {e}")
                    if attempt == max_attempts:
                        return {"url": url, "error": str(e)}

                finally:
                    context.close()
        finally:
            # Runs on every exit path, including the early returns above
            browser.close()

Graceful Degradation Pattern

When an agent cannot complete its primary task, fall back to progressively simpler strategies:

class ResilientAgent:
    def __init__(self, browser):
        self.browser = browser

    def extract_product_data(self, url: str) -> dict:
        """
        Try multiple strategies to extract product data,
        degrading gracefully if preferred methods fail.
        """
        context = self.browser.new_context()
        page = context.new_page()
        result = {"url": url, "strategy": None}

        try:
            page.goto(url, wait_until="networkidle", timeout=15000)

            # Strategy 1: Structured data (JSON-LD)
            try:
                import json
                scripts = page.locator('script[type="application/ld+json"]')
                # count() avoids waiting out the timeout when no block exists;
                # .first avoids a strict-mode error when several exist.
                if scripts.count() > 0:
                    data = json.loads(scripts.first.text_content() or "")
                    result.update({
                        "name": data.get("name"),
                        "price": data.get("offers", {}).get("price"),
                        "strategy": "json-ld",
                    })
                    return result
            except Exception:
                pass

            # Strategy 2: Open Graph meta tags
            try:
                result.update({
                    "name": page.locator(
                        'meta[property="og:title"]'
                    ).get_attribute("content"),
                    "price": None,
                    "strategy": "open-graph",
                })
                if result["name"]:
                    return result
            except Exception:
                pass

            # Strategy 3: DOM selectors (least reliable)
            try:
                result.update({
                    "name": (
                        safe_text(page, "h1")
                        or safe_text(page, ".product-title")
                    ),
                    "price": (
                        safe_text(page, ".price")
                        or safe_text(page, "[data-price]")
                    ),
                    "strategy": "dom-selectors",
                })
                # Succeed only if a name was found; otherwise fall
                # through to Strategy 4
                if result["name"]:
                    return result
            except Exception:
                pass

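            # Strategy 4: Take a screenshot for manual review
            # (built-in hash() of a string changes between interpreter
            # runs, so use a stable digest for the file name)
            import hashlib
            digest = hashlib.sha256(url.encode()).hexdigest()[:16]
            page.screenshot(path=f"fallback_{digest}.png")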
            result.update({
                "name": page.title(),
                "price": None,
                "strategy": "screenshot-fallback",
            })
            return result

        except Exception as e:
            result["error"] = str(e)
            result["strategy"] = "failed"
            return result

        finally:
            context.close()
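Real-world JSON-LD is often not a single flat object: it may be a list of nodes, or an object with an "@graph" array, and "offers" may itself be a list. The function below is a sketch of a more tolerant parser for Strategy 1. The name product_from_json_ld is hypothetical, and filtering on "@type" being "Product" is an assumption that the original strategy does not make.

```python
import json
from typing import Any, Optional


def product_from_json_ld(raw: str) -> Optional[dict]:
    """Return {"name", "price"} for the first Product node, else None."""
    try:
        data: Any = json.loads(raw)
    except (TypeError, ValueError):
        return None  # missing or malformed JSON

    # Normalise the three common shapes into a flat list of nodes
    if isinstance(data, dict) and isinstance(data.get("@graph"), list):
        nodes = data["@graph"]
    elif isinstance(data, list):
        nodes = data
    else:
        nodes = [data]

    for node in nodes:
        if not isinstance(node, dict):
            continue
        node_type = node.get("@type")
        types = node_type if isinstance(node_type, list) else [node_type]
        if "Product" not in types:
            continue
        offers = node.get("offers")
        if isinstance(offers, list):
            offers = offers[0] if offers else None
        price = offers.get("price") if isinstance(offers, dict) else None
        return {"name": node.get("name"), "price": price}
    return None
```

Returning None rather than raising lets the caller move on to the next strategy with a plain if check instead of a broad except.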

Timeout Configuration

Configure timeouts at different levels for fine-grained control:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()

    # Context-level default timeout (applies to all actions)
    context = browser.new_context()
    context.set_default_timeout(10000)        # 10s for actions
    context.set_default_navigation_timeout(30000)  # 30s for navigation

    page = context.new_page()

    # Page-level timeout override
    page.set_default_timeout(5000)

    # Per-action timeout (highest priority)
    page.goto("https://example.com", timeout=60000)
    page.locator("#slow-widget").wait_for(state="visible", timeout=20000)

    context.close()
    browser.close()

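Timeout priority from highest to lowest: per-action > page-level > context-level > default (30 seconds). Page-level settings, including page.set_default_timeout, also take priority over context.set_default_navigation_timeout, so in the example above a navigation without an explicit timeout is bound by the 5-second page default rather than the 30-second context navigation timeout.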

Comprehensive Error-Handling Agent

Putting it all together in a production-ready agent:

import logging
import time
from dataclasses import dataclass
from playwright.sync_api import (
    sync_playwright,
    TimeoutError as PlaywrightTimeout,
    Error as PlaywrightError,
)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("browser_agent")

@dataclass
class AgentResult:
    url: str
    success: bool
    data: dict | None = None
    error: str | None = None
    attempts: int = 0

class RobustBrowserAgent:
    def __init__(self, max_retries: int = 3, timeout: int = 15000):
        self.max_retries = max_retries
        self.timeout = timeout

    def execute(self, url: str, task_fn) -> AgentResult:
        last_error = None  # kept so the final result explains the failure
        with sync_playwright() as p:
            browser = p.chromium.launch()
            try:
                for attempt in range(1, self.max_retries + 1):
                    context = browser.new_context()
                    context.set_default_timeout(self.timeout)
                    page = context.new_page()

                    try:
                        logger.info(
                            f"Attempt {attempt}/{self.max_retries}: {url}"
                        )
                        page.goto(url, wait_until="networkidle")
                        data = task_fn(page)
                        return AgentResult(
                            url=url, success=True,
                            data=data, attempts=attempt,
                        )

                    except PlaywrightTimeout as e:
                        last_error = f"Timeout: {e}"
                        logger.warning(f"Timeout on attempt {attempt}: {e}")
                        try:
                            page.screenshot(
                                path=f"timeout_attempt_{attempt}.png"
                            )
                        except PlaywrightError:
                            # Diagnostics must not abort the retry loop
                            logger.warning("Could not capture screenshot")

                    except PlaywrightError as e:
                        last_error = str(e)
                        if "net::ERR_" in last_error:
                            logger.error(f"Network error: {last_error}")
                        else:
                            logger.error(f"Browser error: {last_error}")

                    except Exception as e:
                        last_error = str(e)
                        logger.error(f"Unexpected error: {e}")

                    finally:
                        context.close()

                    if attempt < self.max_retries:
                        delay = 2 ** attempt
                        logger.info(f"Waiting {delay}s before retry...")
                        time.sleep(delay)
            finally:
                # Runs on success and failure alike
                browser.close()

        return AgentResult(
            url=url, success=False,
            error=f"Max retries exceeded (last error: {last_error})",
            attempts=self.max_retries,
        )

# Usage
agent = RobustBrowserAgent(max_retries=3, timeout=10000)

def scrape_task(page):
    return {
        "title": page.title(),
        "heading": page.locator("h1").text_content(),
    }

result = agent.execute("https://example.com", scrape_task)
if result.success:
    print(f"Success after {result.attempts} attempt(s): {result.data}")
else:
    print(f"Failed: {result.error}")

FAQ

How should I handle CAPTCHAs in my AI agent?

CAPTCHAs are specifically designed to block automation. Options include: using CAPTCHA-solving services (like 2Captcha or Anti-Captcha), switching to an official API if the site provides one, or escalating to a human operator. Some CAPTCHAs can be avoided by using residential proxies, maintaining realistic browsing patterns, and keeping session cookies. Never attempt to bypass CAPTCHAs on sites where you do not have permission to automate.

What is the right retry count for production agents?

Three retries with exponential backoff (2s, 4s, 8s) works well for most scenarios. For critical tasks, increase to 5 retries. For bulk scraping where individual failures are acceptable, use 2 retries to optimize throughput. Always set a circuit breaker — if more than 50 percent of requests fail in a window, pause the agent and alert an operator rather than continuing to hammer a broken or blocking site.
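The circuit breaker mentioned above can be sketched as a sliding-window failure counter. This is an illustration under assumptions, not a production implementation: the class name, the 60-second window, and the min_requests guard (which prevents tripping on a tiny sample) are all hypothetical choices.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class CircuitBreaker:
    """Open when the failure ratio within a sliding time window is too high."""

    def __init__(
        self,
        window_seconds: float = 60.0,
        failure_threshold: float = 0.5,
        min_requests: int = 10,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.window_seconds = window_seconds
        self.failure_threshold = failure_threshold
        self.min_requests = min_requests
        self.clock = clock
        self._events: Deque[Tuple[float, bool]] = deque()

    def _prune(self, now: float) -> None:
        # Drop outcomes that have aged out of the window
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def record(self, success: bool) -> None:
        now = self.clock()
        self._events.append((now, success))
        self._prune(now)

    def is_open(self) -> bool:
        self._prune(self.clock())
        total = len(self._events)
        if total < self.min_requests:
            return False  # not enough data to judge
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total > self.failure_threshold
```

An agent loop would call record(result.success) after each URL and check is_open() before starting the next one, pausing and alerting an operator when it returns True.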

How do I distinguish between transient and permanent failures?

Network errors (net::ERR_CONNECTION_TIMED_OUT, net::ERR_CONNECTION_RESET) are typically transient and worth retrying. DNS failures (net::ERR_NAME_NOT_RESOLVED) are usually permanent. HTTP 404 and 410 responses are permanent. HTTP 429 (rate limited) and 503 (service unavailable) are transient. Element-not-found errors may be permanent if the page structure changed, or transient if the page had not finished loading. Log the specific error type and use it to decide whether to retry.
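The rules above can be written down as a small decision function. This is a sketch: should_retry and the constant names are hypothetical, the lists contain only the codes named in this answer, and retrying on unknown errors is an assumption that relies on the retry cap to bound the cost. The status value is assumed to come from the response returned by page.goto, when one is available.

```python
from typing import Optional

TRANSIENT_NET_ERRORS = (
    "net::ERR_CONNECTION_TIMED_OUT",
    "net::ERR_CONNECTION_RESET",
)
PERMANENT_NET_ERRORS = ("net::ERR_NAME_NOT_RESOLVED",)
TRANSIENT_STATUS = {429, 503}
PERMANENT_STATUS = {404, 410}


def should_retry(error_message: str = "", status: Optional[int] = None) -> bool:
    """Decide whether a failed attempt is worth repeating."""
    if status in PERMANENT_STATUS:
        return False
    if status in TRANSIENT_STATUS:
        return True
    if any(code in error_message for code in PERMANENT_NET_ERRORS):
        return False
    if any(code in error_message for code in TRANSIENT_NET_ERRORS):
        return True
    # Unknown failures (for example element timeouts) may be either kind;
    # assumption: retry and let the attempt limit stop the loop.
    return True
```

A retry loop can consult this function in its except branch and return immediately when it yields False, instead of sleeping through attempts that cannot succeed.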
