
Error Handling and Retry Patterns for Playwright AI Agents

Build resilient Playwright AI agents with comprehensive error handling for timeouts, missing elements, navigation failures, and network errors, plus retry decorators and graceful degradation strategies.

Why Error Handling Is Critical for Browser Automation Agents

Browser automation is inherently unreliable. Networks fail, pages load slowly, elements appear and disappear unpredictably, and websites deploy updates that change their DOM structure without warning. An AI agent that does not handle these failures gracefully will crash on its first encounter with the real web.

Production-grade Playwright agents need layered error handling: catching specific exceptions, implementing intelligent retry logic, providing fallback strategies, and logging sufficient context for debugging. This post covers patterns that make your agents resilient.

Playwright Exception Types

Playwright raises specific exception types that tell you exactly what went wrong:

from playwright.sync_api import (
    sync_playwright,
    TimeoutError as PlaywrightTimeout,
    Error as PlaywrightError,
)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    try:
        page.goto("https://example.com", timeout=5000)
    except PlaywrightTimeout:
        print("Page took too long to load")
    except PlaywrightError as e:
        if "net::ERR_NAME_NOT_RESOLVED" in str(e):
            print("DNS resolution failed — invalid domain")
        elif "net::ERR_CONNECTION_REFUSED" in str(e):
            print("Server refused the connection")
        elif "net::ERR_CONNECTION_TIMED_OUT" in str(e):
            print("Connection timed out at network level")
        else:
            print(f"Browser error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    finally:
        browser.close()

The key exceptions to handle are:

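  • TimeoutError: an action or navigation exceeded its timeout, for example an element that never appeared or a page that did not finish loading. It is a subclass of Error, so its except clause must come first.
  • Error with a net::ERR_ message: DNS, connection, and SSL failures during navigation
  • Error with an element message: the element was detached, not visible, or not clickable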
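
The network branch in the first example relies on repeated substring checks against the error message. As a rough sketch, the code can be extracted once and mapped to a description. The names `extract_net_error`, `describe_error`, and `NET_ERROR_HINTS` are hypothetical, and the pattern assumes Chromium-style messages that contain `net::ERR_...`; other browser engines may word their errors differently.

```python
import re
from typing import Optional

# Hypothetical lookup table; extend it with the codes your agent actually sees.
NET_ERROR_HINTS = {
    "ERR_NAME_NOT_RESOLVED": "DNS resolution failed",
    "ERR_CONNECTION_REFUSED": "Server refused the connection",
    "ERR_CONNECTION_TIMED_OUT": "Connection timed out at network level",
}

_NET_ERROR_RE = re.compile(r"net::(ERR_[A-Z0-9_]+)")


def extract_net_error(message: str) -> Optional[str]:
    """Return the net::ERR_ code found in an error message, or None."""
    match = _NET_ERROR_RE.search(message)
    return match.group(1) if match else None


def describe_error(message: str) -> str:
    """Map an error message to a short human-readable description."""
    code = extract_net_error(message)
    if code is None:
        return "Browser error (no network code found)"
    return NET_ERROR_HINTS.get(code, f"Network error: {code}")
```

In an `except PlaywrightError as e:` block this would be called as `describe_error(str(e))`, which keeps the handler short as the table grows.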

Handling Element Not Found

The most common failure in browser automation is trying to interact with an element that does not exist or is not ready:

def safe_click(page, selector: str, timeout: int = 5000) -> bool:
    """Click an element if it exists, return success status."""
    try:
        locator = page.locator(selector)
        locator.wait_for(state="visible", timeout=timeout)
        locator.click()
        return True
    except PlaywrightTimeout:
        print(f"Element not found: {selector}")
        return False
    except PlaywrightError as e:
        print(f"Cannot click {selector}: {e}")
        return False

def safe_fill(page, selector: str, value: str, timeout: int = 5000) -> bool:
    """Fill a form field if it exists, return success status."""
    try:
        locator = page.locator(selector)
        locator.wait_for(state="visible", timeout=timeout)
        locator.fill(value)
        return True
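    except PlaywrightTimeout:
        print(f"Field not found: {selector}")
        return False
    except PlaywrightError as e:
        # Covers non-timeout failures, e.g. the element is not an input
        print(f"Cannot fill {selector}: {e}")
        return False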

def safe_text(page, selector: str, default: str = "") -> str:
    """Extract text content safely."""
    try:
        locator = page.locator(selector)
        if locator.count() > 0:
            return locator.first.text_content() or default
        return default
    except Exception:
        return default

Building a Retry Decorator

A generic retry decorator that handles transient failures:

import time
import functools
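from playwright.sync_api import Error as PlaywrightError

def retry(
    max_attempts: int = 3,
    delay: float = 1.0,
    backoff: float = 2.0,
    # PlaywrightError also covers TimeoutError, which subclasses it.
    # Catching bare Exception here would retry programming bugs too.
    exceptions: tuple = (PlaywrightError,),
):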
    """
    Retry decorator with exponential backoff.

    Args:
        max_attempts: Maximum number of attempts
        delay: Initial delay between retries in seconds
        backoff: Multiplier for delay after each retry
        exceptions: Tuple of exception types to catch
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            current_delay = delay
            last_exception = None

            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    last_exception = e
                    if attempt == max_attempts:
                        print(
                            f"[{func.__name__}] Failed after "
                            f"{max_attempts} attempts: {e}"
                        )
                        raise
                    print(
                        f"[{func.__name__}] Attempt {attempt} failed: {e}. "
                        f"Retrying in {current_delay:.1f}s..."
                    )
                    time.sleep(current_delay)
                    current_delay *= backoff

        return wrapper
    return decorator

# Usage
@retry(max_attempts=3, delay=2.0, backoff=2.0)
def navigate_and_extract(page, url: str) -> dict:
    page.goto(url, wait_until="networkidle", timeout=10000)
    return {
        "title": page.title(),
        "content": page.locator("main").text_content(),
    }
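
To make the backoff arithmetic concrete, the sketch below computes the delays the decorator sleeps between attempts. `backoff_schedule` is a hypothetical helper written for illustration, not part of Playwright or the decorator itself.

```python
from typing import List


def backoff_schedule(max_attempts: int, delay: float, backoff: float) -> List[float]:
    """Delays slept between attempts; nothing is slept after the final attempt."""
    return [delay * backoff ** i for i in range(max_attempts - 1)]


schedule = backoff_schedule(max_attempts=3, delay=2.0, backoff=2.0)
print(schedule)       # [2.0, 4.0]
print(sum(schedule))  # 6.0
```

With the usage shown above, three attempts add at most 6 seconds of sleep on top of whatever time the attempts themselves take.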

Page-Level Retry with Fresh Context

Sometimes the page itself gets into a bad state. Retry with a fresh browser context:

from playwright.sync_api import sync_playwright

def robust_scrape(url: str, max_attempts: int = 3) -> dict:
    """Scrape a URL with retry logic that creates fresh contexts."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            for attempt in range(1, max_attempts + 1):
                # A new context starts with clean cookies and storage
                context = browser.new_context()
                page = context.new_page()

                try:
                    page.goto(url, wait_until="networkidle", timeout=15000)

                    # Wait for content to be present
                    page.wait_for_selector("body", timeout=5000)

                    return {
                        "url": url,
                        "title": page.title(),
                        # text_content() can return None, so fall back to ""
                        "text": (page.locator("body").text_content() or "")[:5000],
                        "attempt": attempt,
                    }

                except Exception as e:
                    print(f"Attempt {attempt}/{max_attempts} failed: {e}")
                    if attempt == max_attempts:
                        return {"url": url, "error": str(e)}

                finally:
                    context.close()
        finally:
            # Runs on every exit path, including the early returns above
            browser.close()

Graceful Degradation Pattern

When an agent cannot complete its primary task, fall back to progressively simpler strategies:

import json

class ResilientAgent:
    def __init__(self, browser):
        self.browser = browser

    def extract_product_data(self, url: str) -> dict:
        """
        Try multiple strategies to extract product data,
        degrading gracefully if preferred methods fail.
        """
        context = self.browser.new_context()
        page = context.new_page()
        result = {"url": url, "strategy": None}

        try:
            page.goto(url, wait_until="networkidle", timeout=15000)

            # Strategy 1: Structured data (JSON-LD)
            try:
                # .first avoids a strict-mode error when the page has
                # several JSON-LD blocks; the short timeout keeps a
                # missing block from stalling the fallback chain
                json_ld = page.locator(
                    'script[type="application/ld+json"]'
                ).first.text_content(timeout=2000)
                data = json.loads(json_ld)
                result.update({
                    "name": data.get("name"),
                    "price": data.get("offers", {}).get("price"),
                    "strategy": "json-ld",
                })
                return result
            except Exception:
                pass

            # Strategy 2: Open Graph meta tags
            try:
                result.update({
                    "name": page.locator(
                        'meta[property="og:title"]'
                    ).get_attribute("content"),
                    "price": None,
                    "strategy": "open-graph",
                })
                if result["name"]:
                    return result
            except Exception:
                pass

            # Strategy 3: DOM selectors (least reliable)
            try:
                result.update({
                    "name": (
                        safe_text(page, "h1")
                        or safe_text(page, ".product-title")
                    ),
                    "price": (
                        safe_text(page, ".price")
                        or safe_text(page, "[data-price]")
                    ),
                    "strategy": "dom-selectors",
                })
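                # Only stop here if a name was found; otherwise
                # fall through to the screenshot fallback
                if result["name"]:
                    return result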
            except Exception:
                pass

            # Strategy 4: Take a screenshot for manual review
            page.screenshot(path=f"fallback_{hash(url)}.png")
            result.update({
                "name": page.title(),
                "price": None,
                "strategy": "screenshot-fallback",
            })
            return result

        except Exception as e:
            result["error"] = str(e)
            result["strategy"] = "failed"
            return result

        finally:
            context.close()
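
The same fallback chain can be expressed as data rather than nested try blocks. The sketch below is one possible shape, not the article's implementation: `first_successful` is a hypothetical helper that takes named zero-argument callables and returns the first non-empty result, recording why the earlier ones were skipped.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Strategy = Tuple[str, Callable[[], Optional[dict]]]


def first_successful(strategies: Iterable[Strategy]) -> dict:
    """Run named strategies in order and return the first non-empty result."""
    errors: Dict[str, str] = {}
    for name, fn in strategies:
        try:
            data = fn()
        except Exception as e:  # each strategy is allowed to fail
            errors[name] = str(e)
            continue
        if data:
            return {**data, "strategy": name}
        errors[name] = "no data"
    return {"strategy": "failed", "errors": errors}


def broken_json_ld() -> dict:
    # Stand-in for a strategy that raises, e.g. no JSON-LD on the page
    raise ValueError("no JSON-LD on page")


result = first_successful([
    ("json-ld", broken_json_ld),
    ("open-graph", lambda: None),
    ("dom-selectors", lambda: {"name": "Widget"}),
])
print(result)  # {'name': 'Widget', 'strategy': 'dom-selectors'}
```

In a real agent each callable would close over `page`, for example `lambda: extract_json_ld(page)`, where `extract_json_ld` is whatever function wraps the corresponding strategy.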

Timeout Configuration

Configure timeouts at different levels for fine-grained control:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()

    # Context-level default timeout (applies to all actions)
    context = browser.new_context()
    context.set_default_timeout(10000)        # 10s for actions
    context.set_default_navigation_timeout(30000)  # 30s for navigation

    page = context.new_page()

    # Page-level timeout override
    page.set_default_timeout(5000)

    # Per-action timeout (highest priority)
    page.goto("https://example.com", timeout=60000)
    page.locator("#slow-widget").wait_for(state="visible", timeout=20000)

    context.close()
    browser.close()

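Timeout priority from highest to lowest: per-action > page-level > context-level > default (30 seconds). One subtlety in the example above: page.set_default_timeout(5000) also takes priority over the context-level navigation timeout, so a navigation without an explicit timeout gets 5 seconds, not 30. Call page.set_default_navigation_timeout() if navigations need their own page-level limit.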


Comprehensive Error-Handling Agent

Putting it all together in a production-ready agent:

import logging
import time
from dataclasses import dataclass
from playwright.sync_api import (
    sync_playwright,
    TimeoutError as PlaywrightTimeout,
    Error as PlaywrightError,
)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("browser_agent")

@dataclass
class AgentResult:
    url: str
    success: bool
    data: dict | None = None
    error: str | None = None
    attempts: int = 0

class RobustBrowserAgent:
    def __init__(self, max_retries: int = 3, timeout: int = 15000):
        self.max_retries = max_retries
        self.timeout = timeout

    def execute(self, url: str, task_fn) -> AgentResult:
        last_error = "No attempts made"
        with sync_playwright() as p:
            browser = p.chromium.launch()
            try:
                for attempt in range(1, self.max_retries + 1):
                    context = browser.new_context()
                    context.set_default_timeout(self.timeout)
                    page = context.new_page()

                    try:
                        logger.info(
                            f"Attempt {attempt}/{self.max_retries}: {url}"
                        )
                        page.goto(url, wait_until="networkidle")
                        data = task_fn(page)
                        return AgentResult(
                            url=url, success=True,
                            data=data, attempts=attempt,
                        )

                    except PlaywrightTimeout as e:
                        last_error = f"Timeout: {e}"
                        logger.warning(f"Timeout on attempt {attempt}: {e}")
                        try:
                            page.screenshot(
                                path=f"timeout_attempt_{attempt}.png"
                            )
                        except PlaywrightError:
                            # A failed screenshot must not abort the retry loop
                            logger.warning("Could not capture screenshot")

                    except PlaywrightError as e:
                        error_msg = str(e)
                        kind = "Network" if "net::ERR_" in error_msg else "Browser"
                        last_error = f"{kind} error: {error_msg}"
                        logger.error(last_error)

                    except Exception as e:
                        last_error = f"Unexpected error: {e}"
                        logger.error(last_error)

                    finally:
                        context.close()

                    if attempt < self.max_retries:
                        delay = 2 ** attempt
                        logger.info(f"Waiting {delay}s before retry...")
                        time.sleep(delay)
            finally:
                # Runs on success and failure alike
                browser.close()

        # Keep the last underlying error so callers can see why it failed
        return AgentResult(
            url=url, success=False,
            error=f"Max retries exceeded; last error: {last_error}",
            attempts=self.max_retries,
        )

# Usage
agent = RobustBrowserAgent(max_retries=3, timeout=10000)

def scrape_task(page):
    return {
        "title": page.title(),
        "heading": page.locator("h1").text_content(),
    }

result = agent.execute("https://example.com", scrape_task)
if result.success:
    print(f"Success after {result.attempts} attempt(s): {result.data}")
else:
    print(f"Failed: {result.error}")

FAQ

How should I handle CAPTCHAs in my AI agent?

CAPTCHAs are specifically designed to block automation. Options include: using CAPTCHA-solving services (like 2Captcha or Anti-Captcha), switching to an official API if the site provides one, or escalating to a human operator. Some CAPTCHAs can be avoided by using residential proxies, maintaining realistic browsing patterns, and keeping session cookies. Never attempt to bypass CAPTCHAs on sites where you do not have permission to automate.

What is the right retry count for production agents?

Three retries with exponential backoff (2s, 4s, 8s) works well for most scenarios. For critical tasks, increase to 5 retries. For bulk scraping where individual failures are acceptable, use 2 retries to optimize throughput. Always set a circuit breaker — if more than 50 percent of requests fail in a window, pause the agent and alert an operator rather than continuing to hammer a broken or blocking site.
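
A circuit breaker of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions: the class name `CircuitBreaker` and its parameters are hypothetical, the window counts the most recent requests rather than a span of time, and `min_samples` is an added guard so a single early failure does not trip the breaker.

```python
from collections import deque


class CircuitBreaker:
    """Open the circuit when the failure rate in a sliding window is too high."""

    def __init__(self, window: int = 20, threshold: float = 0.5, min_samples: int = 10):
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, success: bool) -> None:
        """Store the outcome of one request; old outcomes fall out of the window."""
        self.outcomes.append(not success)

    @property
    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    @property
    def is_open(self) -> bool:
        # Wait for enough samples before judging the failure rate
        if len(self.outcomes) < self.min_samples:
            return False
        return self.failure_rate > self.threshold


breaker = CircuitBreaker(window=10, threshold=0.5, min_samples=4)
for ok in (True, True, False, False, False):
    breaker.record(ok)
print(breaker.failure_rate)  # 0.6
print(breaker.is_open)       # True
```

The agent loop would check `breaker.is_open` before each request and, when it is open, pause and alert an operator instead of sending more traffic.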

How do I distinguish between transient and permanent failures?

Network errors (net::ERR_CONNECTION_TIMED_OUT, net::ERR_CONNECTION_RESET) are typically transient and worth retrying. DNS failures (net::ERR_NAME_NOT_RESOLVED) are usually permanent. HTTP 404 and 410 responses are permanent. HTTP 429 (rate limited) and 503 (service unavailable) are transient. Element-not-found errors may be permanent if the page structure changed, or transient if the page had not finished loading. Log the specific error type and use it to decide whether to retry.
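
These rules can be captured in a small decision function. This is a sketch: `should_retry` and the constant names are hypothetical, the lists contain only the codes named above, and retrying unknown failures by default is an assumption that relies on the retry cap to bound the cost. The HTTP status would typically come from the response object returned by `page.goto()`.

```python
from typing import Optional

TRANSIENT_NET_ERRORS = ("net::ERR_CONNECTION_TIMED_OUT", "net::ERR_CONNECTION_RESET")
PERMANENT_NET_ERRORS = ("net::ERR_NAME_NOT_RESOLVED",)
TRANSIENT_STATUSES = {429, 503}
PERMANENT_STATUSES = {404, 410}


def should_retry(error_message: str = "", http_status: Optional[int] = None) -> bool:
    """Decide whether a failed request is worth another attempt."""
    if http_status in PERMANENT_STATUSES:
        return False
    if http_status in TRANSIENT_STATUSES:
        return True
    if any(code in error_message for code in PERMANENT_NET_ERRORS):
        return False
    if any(code in error_message for code in TRANSIENT_NET_ERRORS):
        return True
    # Assumption: unknown failures (including element-not-found) are retried,
    # since they may be transient; the retry cap limits the damage if not.
    return True


print(should_retry("net::ERR_NAME_NOT_RESOLVED at https://example.invalid"))  # False
print(should_retry(http_status=429))                                          # True
```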

