
Build a Voice Agent on AWS Bedrock: Claude + Polly + Transcribe (2026)

Wire Amazon Transcribe streaming, Claude 4.7 Sonnet on Bedrock, and Polly generative voices into a sub-second voice agent. Real Python + boto3 code, IAM policy, and production tips.

TL;DR — Amazon Transcribe streams partial transcripts over the bidirectional StartStreamTranscription API, you forward each finalized segment to Claude 4.7 Sonnet on Bedrock with the InvokeModel API, then synthesize the reply with Polly's generative engine. Two boto3 clients plus the amazon-transcribe streaming client, one event loop, ~700ms voice-to-voice in us-east-1.

What you'll build

A Python service that accepts an HTTP POST with raw 16kHz PCM (or via WebSocket), pipes the audio into Amazon Transcribe streaming, sends each finalized utterance to anthropic.claude-sonnet-4-7-20250620-v1:0 on Bedrock, then streams the response text into Polly with the generative engine. The whole agent runs on a single t3.small EC2 or in a Fargate task — no GPU required.

Prerequisites

  1. AWS account with Bedrock access enabled in us-east-1 (request access for Anthropic models in the Bedrock console).
  2. IAM role with transcribe:StartStreamTranscription, bedrock:InvokeModel, and polly:SynthesizeSpeech.
  3. Python 3.11, boto3>=1.34, amazon-transcribe>=0.6.2 (asyncio ships with the standard library).
  4. An audio source: 16kHz mono PCM (works directly with Transcribe).

Architecture

```mermaid
flowchart LR
  CALLER[Caller / Browser] -->|PCM16 16kHz| APP[Python Agent]
  APP -->|StartStreamTranscription| TRANS[Amazon Transcribe Streaming]
  TRANS -->|partial + final| APP
  APP -->|InvokeModel claude-4-7-sonnet| BR[Amazon Bedrock]
  BR -->|text reply| APP
  APP -->|SynthesizeSpeech engine=generative| POLLY[Amazon Polly]
  POLLY -->|MP3 / PCM| CALLER
```

Step 1 — IAM policy for the agent role

Attach this minimal inline policy to the EC2 instance role or task role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "transcribe:StartStreamTranscription",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-7-20250620-v1:0"
    },
    {
      "Effect": "Allow",
      "Action": "polly:SynthesizeSpeech",
      "Resource": "*"
    }
  ]
}
```

Step 2 — Stream audio into Amazon Transcribe

```python
import asyncio

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler

class Handler(TranscriptResultStreamHandler):
    def __init__(self, stream, on_final):
        super().__init__(stream)
        self.on_final = on_final

    async def handle_transcript_event(self, event):
        # Forward only finalized segments; partials are too noisy for the LLM.
        for r in event.transcript.results:
            if not r.is_partial and r.alternatives:
                await self.on_final(r.alternatives[0].transcript)

async def transcribe(pcm_iter, on_final):
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
    )

    async def feed():
        async for chunk in pcm_iter:
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    # Run the audio feeder and the transcript handler concurrently.
    await asyncio.gather(feed(), Handler(stream.output_stream, on_final).handle_events())
```

Step 3 — Call Claude 4.7 Sonnet on Bedrock

```python
import json

import boto3

br = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask_claude(history, user_text):
    history.append({"role": "user", "content": [{"type": "text", "text": user_text}]})
    resp = br.invoke_model(
        modelId="anthropic.claude-sonnet-4-7-20250620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "system": "You are a concise voice agent. Keep replies under 2 sentences.",
            "messages": history,
        }),
    )
    text = json.loads(resp["body"].read())["content"][0]["text"]
    history.append({"role": "assistant", "content": [{"type": "text", "text": text}]})
    return text
```

For lower latency, switch to invoke_model_with_response_stream and pipe deltas straight into Polly.

Step 4 — Synthesize with Polly generative voices

```python
polly = boto3.client("polly", region_name="us-east-1")

def synth(text, voice="Ruth"):  # Ruth and Stephen support the generative engine
    out = polly.synthesize_speech(
        Text=text,
        VoiceId=voice,
        OutputFormat="pcm",
        SampleRate="16000",
        Engine="generative",
    )
    return out["AudioStream"].read()
```

Generative voices add ~150ms vs neural but sound dramatically more human; use neural for stricter latency budgets.

Step 5 — Glue: VAD, tool-use, and barge-in

Use a simple energy-based VAD (RMS threshold) to chunk inputs to Transcribe; throw away anything below 600ms of speech. For barge-in, kill the current Polly playback the moment Transcribe emits a non-empty partial. For tool-use, switch from invoke_model to Bedrock's converse API which supports native tool calling — Claude returns a toolUse block, you execute, and reply with a toolResult block.
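The energy-based VAD described above can be sketched as follows. This is a minimal illustration, not production VAD: the RMS threshold is a placeholder you would tune against your line's noise floor, and frame size assumes 20ms of 16kHz mono PCM16.

```python
import struct

FRAME_MS = 20          # 20 ms @ 16 kHz mono PCM16 = 640 bytes per frame
RMS_THRESHOLD = 500    # placeholder; tune against your noise floor
MIN_SPEECH_MS = 600    # discard bursts shorter than this

def frame_rms(frame: bytes) -> float:
    # Root-mean-square energy of one little-endian 16-bit PCM frame.
    n = len(frame) // 2
    samples = struct.unpack(f"<{n}h", frame)
    return (sum(s * s for s in samples) / n) ** 0.5

def chunk_utterances(frames):
    """Yield contiguous speech blobs >= MIN_SPEECH_MS; drop shorter blips."""
    buf = []
    for frame in frames:
        if frame_rms(frame) > RMS_THRESHOLD:
            buf.append(frame)
        else:
            if len(buf) * FRAME_MS >= MIN_SPEECH_MS:
                yield b"".join(buf)
            buf = []
    if len(buf) * FRAME_MS >= MIN_SPEECH_MS:
        yield b"".join(buf)
```

Real deployments usually add hangover frames (keep buffering briefly after energy drops) so trailing consonants aren't clipped.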

Step 6 — Containerize and deploy on Fargate

```dockerfile
FROM python:3.11-slim
RUN pip install boto3 amazon-transcribe uvicorn fastapi
COPY app.py /app/app.py
WORKDIR /app
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
```

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

```shell
docker build -t voice-agent .
aws ecr-public get-login-password | docker login ...
docker push ...
```

Then run as a Fargate service behind an NLB; mTLS to Twilio if you're terminating PSTN.

Step 7 — Wire to Twilio Media Streams

Convert Twilio's mu-law 8kHz frames to PCM16 16kHz with audioop.ulaw2lin + audioop.ratecv before forwarding them into the Transcribe stream. Reverse the chain (PCM16 16kHz → mu-law 8kHz) on Polly output before sending media events back to Twilio. Note that audioop was removed from the standard library in Python 3.13; on 3.11/3.12 it works as-is, after that swap in the audioop-lts package.

Pitfalls

  • Bedrock model access isn't on by default — request it once per region in the Bedrock console.
  • Transcribe streaming has a 4-hour cap per session; reset on long calls.
  • Polly generative is regional — only available in us-east-1, eu-west-1, ap-northeast-1 as of May 2026.
  • Cost trap: Polly generative is $30/M chars vs $4/M for neural. Cache common greetings.
  • boto3 retry storms: set Config(retries={"max_attempts": 1, "mode": "standard"}) on Bedrock; the default exponential backoff will blow your latency budget.

How CallSphere does this in production

CallSphere's Healthcare voice stack runs on FastAPI :8084 with OpenAI Realtime as the primary path because we measured 350ms cheaper TTFT vs Bedrock InvokeModel for short utterances. We keep an AWS Bedrock + Polly fallback wired through the same FastAPI surface for HIPAA-locked tenants who need their audio to never leave AWS, and Claude 4.7 Sonnet on Bedrock powers our 90+ tools across 6 verticals. We run 37 voice agents under one orchestration layer with 115+ Postgres tables tracking every turn. Pricing tiers are $149/$499/$1499 with a 14-day trial and a 22% lifetime affiliate cut.

FAQ

Q: Why not just use Bedrock AgentCore? AgentCore is great for chat but doesn't give you raw audio control — you can't bridge Twilio media streams without a wrapper service anyway. Going direct to Transcribe + InvokeModel + Polly keeps you in the audio path.

Q: Can I use Nova Sonic instead of this stack? Nova Sonic (Amazon's speech-to-speech model) is excellent and cuts latency further, but it's currently only routable through Bedrock InvokeModelWithBidirectionalStream which requires SigV4 signing on a streaming socket — more code than this tutorial.

Q: How do I handle PHI? Sign a BAA with AWS, enable VPC endpoints for all three services so audio never traverses the public internet, and turn off Transcribe content redaction logging.

Q: What's the realistic latency? On us-east-1 with warm clients: Transcribe partial ~250ms, Bedrock TTFT ~400ms, Polly first-byte ~200ms. Voice-to-voice ~700ms.

Q: Can I stream Claude's output into Polly? Yes — use invoke_model_with_response_stream, accumulate deltas into sentence boundaries (. ! ?), and call Polly per sentence. Cuts perceived latency by 40%.
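The sentence-boundary buffering in that answer can be sketched as a small generator (the regex treats `.` `!` `?` as terminators, as described; real text needs care around abbreviations and decimals):

```python
import re

_BOUNDARY = re.compile(r"(.+?[.!?])\s*", re.S)

def sentence_chunks(deltas):
    """Consume streamed text deltas, yield complete sentences for Polly."""
    buf = ""
    for delta in deltas:
        buf += delta
        while True:
            m = _BOUNDARY.match(buf)
            if not m:
                break
            yield m.group(1).strip()
            buf = buf[m.end():]
    if buf.strip():  # flush any trailing fragment when the stream ends
        yield buf.strip()
```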


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.