
Building a Claude Browser Agent: Automated Web Navigation with Anthropic SDK

Step-by-step guide to building a browser automation agent with Claude Computer Use — from SDK setup and screenshot capture to executing click, type, and scroll actions for real web navigation tasks.

Setting Up the Environment

Building a Claude browser agent requires three components: the Anthropic Python SDK, a browser that can be controlled programmatically for screenshot capture, and an input simulation layer. We will use Playwright for all of the browser work (launching, screenshotting, and simulating mouse and keyboard input) while letting Claude drive all the navigation decisions.

Start by installing the dependencies:

# requirements.txt
anthropic>=0.39.0
playwright>=1.40.0
Pillow>=10.0.0

Initialize the project:

pip install -r requirements.txt
playwright install chromium

Architecture of the Browser Agent

The agent architecture has three layers:

flowchart LR
    GOAL(["High level goal"])
    PLAN["Planner LLM"]
    SCREEN["Screen capture<br/>every step"]
    VLM["Vision LLM<br/>reads UI state"]
    ACT{"Action type"}
    CLICK["Click coordinate"]
    TYPE["Type text"]
    KEY["Keyboard shortcut"]
    GUARD["Safety filter<br/>allow lists"]
    OS[("OS sandbox<br/>ephemeral VM")]
    DONE(["Goal verified"])
    GOAL --> PLAN --> SCREEN --> VLM --> ACT
    ACT --> CLICK --> GUARD
    ACT --> TYPE --> GUARD
    ACT --> KEY --> GUARD
    GUARD --> OS --> SCREEN
    OS --> DONE
    style PLAN fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style DONE fill:#059669,stroke:#047857,color:#fff
  1. Browser Manager — Launches a headless or headed Chromium instance, navigates to a starting URL, captures screenshots, and executes low-level browser actions
  2. Action Executor — Translates Claude's computer use tool calls into Playwright mouse and keyboard commands
  3. Agent Loop — Orchestrates the screenshot-action cycle and manages the conversation history with Claude

Here is the complete browser manager:

import asyncio
import base64

from playwright.async_api import async_playwright, Browser, Page

class BrowserManager:
    def __init__(self, width: int = 1280, height: int = 800):
        self.width = width
        self.height = height
        self.playwright = None
        self.browser: Browser | None = None
        self.page: Page | None = None

    async def start(self, url: str = "about:blank"):
        # Keep a handle on the Playwright driver so close() can stop it.
        self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch(headless=False)
        context = await self.browser.new_context(
            viewport={"width": self.width, "height": self.height}
        )
        self.page = await context.new_page()
        await self.page.goto(url)

    async def screenshot(self) -> str:
        """Capture the current viewport as a base64-encoded PNG."""
        img_bytes = await self.page.screenshot(full_page=False)
        return base64.standard_b64encode(img_bytes).decode()

    async def click(self, x: int, y: int, button: str = "left"):
        await self.page.mouse.click(x, y, button=button)

    async def type_text(self, text: str):
        await self.page.keyboard.type(text, delay=50)

    async def press_key(self, key: str):
        await self.page.keyboard.press(key)

    async def scroll(self, x: int, y: int, direction: str):
        await self.page.mouse.move(x, y)
        delta = 300 if direction == "down" else -300
        await self.page.mouse.wheel(0, delta)

    async def close(self):
        if self.browser:
            await self.browser.close()
        if self.playwright:
            await self.playwright.stop()

The Agent Loop

The agent loop ties everything together. It sends screenshots to Claude, processes tool calls, executes actions, and repeats until the task is done:

import anthropic

class ClaudeBrowserAgent:
    def __init__(self, browser: BrowserManager):
        self.browser = browser
        self.client = anthropic.Anthropic()
        self.messages = []
        self.model = "claude-sonnet-4-20250514"

    async def run(self, task: str, max_steps: int = 30):
        # Send the task and the initial screenshot together in the first turn.
        self.messages = [{
            "role": "user",
            "content": [
                {"type": "text", "text": task},
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": await self.browser.screenshot(),
                    },
                },
            ],
        }]

        for step in range(max_steps):
            # Claude 4 models use the computer_20250124 tool version,
            # enabled through the computer-use-2025-01-24 beta flag.
            response = self.client.beta.messages.create(
                model=self.model,
                max_tokens=1024,
                tools=[{
                    "type": "computer_20250124",
                    "name": "computer",
                    "display_width_px": self.browser.width,
                    "display_height_px": self.browser.height,
                }],
                betas=["computer-use-2025-01-24"],
                messages=self.messages,
            )

            self.messages.append({"role": "assistant", "content": response.content})

            if response.stop_reason == "end_turn":
                final_text = next(
                    (b.text for b in response.content if hasattr(b, "text")),
                    "Task complete",
                )
                print(f"Done: {final_text}")
                return final_text

            # Execute each requested action, then answer the tool call with a
            # fresh screenshot so Claude sees the result of what it just did.
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    await self._execute(block.input)
                    await asyncio.sleep(1)  # Wait for the page to render
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": [{
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/png",
                                "data": await self.browser.screenshot(),
                            },
                        }],
                    })

            if not tool_results:
                return f"Stopped: {response.stop_reason}"
            self.messages.append({"role": "user", "content": tool_results})

        return "Max steps reached"

    async def _execute(self, action: dict):
        action_type = action.get("action")
        if action_type in ("left_click", "right_click", "middle_click"):
            x, y = action["coordinate"]
            await self.browser.click(x, y, button=action_type.split("_")[0])
        elif action_type == "type":
            await self.browser.type_text(action["text"])
        elif action_type == "key":
            # Claude emits xdotool-style key names ("Return", "ctrl+a");
            # map the common ones onto Playwright's key names.
            key = action["text"].replace("Return", "Enter").replace("ctrl", "Control")
            await self.browser.press_key(key)
        elif action_type == "scroll":
            x, y = action["coordinate"]
            await self.browser.scroll(x, y, action.get("scroll_direction", "down"))
        # A "screenshot" action needs no browser work: every tool_result
        # already carries a fresh screenshot.

Running the Agent

Here is how to use the agent for a real web navigation task:

async def main():
    browser = BrowserManager(width=1280, height=800)
    await browser.start("https://news.ycombinator.com")

    agent = ClaudeBrowserAgent(browser)
    try:
        result = await agent.run(
            "Find the top story on Hacker News and click on the comments link. "
            "Then tell me how many comments the story has."
        )
        print(result)
    finally:
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())

The agent will take a screenshot of the Hacker News homepage, identify the top story, locate the comments link, click it, take another screenshot of the comments page, and report the comment count back to you.

Optimizing Conversation History

A critical performance consideration is managing the message history. Each screenshot consumes a significant number of tokens. If your task requires 20 steps, you are sending 20 high-resolution images in the conversation. This gets expensive and eventually hits context limits.
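To put a rough number on that, a quick sketch using Anthropic's published vision-token estimate of (width × height) / 750 tokens per image:

```python
def screenshot_tokens(width: int, height: int) -> int:
    """Rough vision-token estimate: Anthropic's (width * height) / 750 rule."""
    return (width * height) // 750

# A 1280x800 viewport costs roughly this many input tokens per screenshot:
per_step = screenshot_tokens(1280, 800)   # 1365
total_images = 20 * per_step              # 27300 tokens of images for 20 steps
print(per_step, total_images)
```

And that undercounts the real spend, because every earlier screenshot in the history is re-sent on each subsequent request.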

A practical optimization is to maintain a sliding window of recent screenshots while summarizing older interactions as text:

def _contains_image(blocks) -> bool:
    """True if any block (including nested tool_result content) is an image."""
    if not isinstance(blocks, list):
        return False
    for b in blocks:
        if not isinstance(b, dict):
            continue
        if b.get("type") == "image":
            return True
        if b.get("type") == "tool_result" and _contains_image(b.get("content")):
            return True
    return False

def _is_tool_result(message: dict) -> bool:
    content = message.get("content")
    return isinstance(content, list) and any(
        isinstance(b, dict) and b.get("type") == "tool_result" for b in content
    )

def trim_history(messages: list, keep_last: int = 5) -> list:
    """Keep the original task plus the last N screenshot exchanges."""
    image_indices = [i for i, m in enumerate(messages)
                     if i > 0 and _contains_image(m.get("content"))]
    if len(image_indices) <= keep_last:
        return messages

    trimmed = [
        messages[0],  # Keep the original task
        {"role": "user",
         "content": f"[Previous {len(image_indices) - keep_last} "
                    f"steps completed successfully]"},
    ]
    start_idx = image_indices[-keep_last]
    # Never start the kept window on a tool_result whose matching tool_use
    # was trimmed away; back up until the pair stays intact.
    while start_idx > 1 and _is_tool_result(messages[start_idx]):
        start_idx -= 1
    trimmed.extend(messages[start_idx:])
    return trimmed

FAQ

Can I use a headless browser with Claude Computer Use?

Yes, and it is recommended for server-side deployments. Playwright supports headless mode, and the screenshots are identical to what you would see in a headed browser. Set headless=True when launching the browser.

How do I handle pages that take time to load?

Add a short delay (1-2 seconds) after executing each action before capturing the next screenshot. For pages with dynamic content, you can also use Playwright's wait_for_load_state("networkidle") before taking the screenshot.
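A screenshot helper along these lines combines both ideas (a sketch; stable_screenshot is an illustrative name, and the 5-second cap is an assumption to stop single-page apps that never go network-idle from stalling the loop):

```python
import base64

async def stable_screenshot(page) -> str:
    """Wait (bounded) for network activity to settle, then capture the page."""
    try:
        await page.wait_for_load_state("networkidle", timeout=5_000)
    except Exception:
        pass  # some pages never go network-idle; capture anyway
    img_bytes = await page.screenshot(full_page=False)
    return base64.standard_b64encode(img_bytes).decode()
```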

What is the cost per step of the agent loop?

Each step involves sending a screenshot image plus the conversation history to Claude. A 1280x800 screenshot typically costs around 1,000-1,500 input tokens. With the conversation context, expect roughly 2,000-5,000 tokens per step. At Claude Sonnet pricing, a 20-step task costs approximately $0.15-$0.40 depending on conversation length.
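To sanity-check those figures, here is a back-of-envelope calculator (the prices and the ~200 output tokens per step are assumptions based on typical Sonnet rates of $3 per million input tokens and $15 per million output tokens):

```python
INPUT_USD_PER_MTOK = 3.00    # assumed Claude Sonnet input price
OUTPUT_USD_PER_MTOK = 15.00  # assumed Claude Sonnet output price

def task_cost_usd(steps: int, input_tok_per_step: int,
                  output_tok_per_step: int = 200) -> float:
    """Estimate the total cost of an agent run in USD."""
    input_cost = steps * input_tok_per_step * INPUT_USD_PER_MTOK / 1_000_000
    output_cost = steps * output_tok_per_step * OUTPUT_USD_PER_MTOK / 1_000_000
    return round(input_cost + output_cost, 2)

print(task_cost_usd(20, 2_000))  # 0.18
print(task_cost_usd(20, 5_000))  # 0.36
```

Both ends land inside the $0.15-$0.40 range quoted above; unbounded conversation history is what pushes real runs toward the high end.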


#ClaudeBrowserAgent #WebAutomation #AnthropicSDK #ComputerUse #AIBrowserAgent #PythonAutomation #AgenticAI
