Build a Voice Agent on AWS Bedrock: Claude + Polly + Transcribe (2026)
Wire Amazon Transcribe streaming, Claude 4.7 Sonnet on Bedrock, and Polly generative voices into a sub-second voice agent. Real Python + boto3 code, IAM policy, and production tips.
TL;DR — Amazon Transcribe streams partial transcripts over a WebSocket-style `StartStreamTranscription` API, you forward final segments to Claude 4.7 Sonnet on Bedrock with the `InvokeModel` API, then synthesize the reply with Polly's `generative` engine. Three boto3 clients, one event loop, ~700ms voice-to-voice on us-east-1.
What you'll build
A Python service that accepts an HTTP POST with raw 16kHz PCM (or via WebSocket), pipes the audio into Amazon Transcribe streaming, sends each finalized utterance to anthropic.claude-sonnet-4-7-20250620-v1:0 on Bedrock, then streams the response text into Polly with the generative engine. The whole agent runs on a single t3.small EC2 or in a Fargate task — no GPU required.
Prerequisites
- AWS account with Bedrock access enabled in `us-east-1` (request access for Anthropic models in the Bedrock console).
- IAM role with `transcribe:StartStreamTranscription`, `bedrock:InvokeModel`, and `polly:SynthesizeSpeech`.
- Python 3.11, `boto3>=1.34`, `amazon-transcribe>=0.6.2`, `asyncio`.
- An audio source: 16kHz mono PCM (works directly with Transcribe).
Architecture
```mermaid
flowchart LR
  CALLER[Caller / Browser] -->|PCM16 16kHz| APP[Python Agent]
  APP -->|StartStreamTranscription| TRANS[Amazon Transcribe Streaming]
  TRANS -->|partial + final| APP
  APP -->|InvokeModel claude-4-7-sonnet| BR[Amazon Bedrock]
  BR -->|text reply| APP
  APP -->|SynthesizeSpeech engine=generative| POLLY[Amazon Polly]
  POLLY -->|MP3 / PCM| CALLER
```
Step 1 — IAM policy for the agent role
Attach this minimal inline policy to the EC2 instance role or task role:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "transcribe:StartStreamTranscription",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-7-20250620-v1:0"
    },
    {
      "Effect": "Allow",
      "Action": "polly:SynthesizeSpeech",
      "Resource": "*"
    }
  ]
}
```
Step 2 — Stream audio into Amazon Transcribe
```python
import asyncio

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler


class Handler(TranscriptResultStreamHandler):
    def __init__(self, stream, on_final):
        super().__init__(stream)
        self.on_final = on_final

    async def handle_transcript_event(self, event):
        # Forward only finalized (non-partial) segments to the LLM.
        for r in event.transcript.results:
            if not r.is_partial and r.alternatives:
                await self.on_final(r.alternatives[0].transcript)


async def transcribe(pcm_iter, on_final):
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16000,
        media_encoding="pcm")

    async def feed():
        async for chunk in pcm_iter:
            await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    await asyncio.gather(feed(), Handler(stream.output_stream, on_final).handle_events())
```

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3 — Call Claude 4.7 Sonnet on Bedrock
```python
import json

import boto3

br = boto3.client("bedrock-runtime", region_name="us-east-1")


def ask_claude(history, user_text):
    history.append({"role": "user", "content": [{"type": "text", "text": user_text}]})
    resp = br.invoke_model(
        modelId="anthropic.claude-sonnet-4-7-20250620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "system": "You are a concise voice agent. Keep replies under 2 sentences.",
            "messages": history,
        }))
    text = json.loads(resp["body"].read())["content"][0]["text"]
    history.append({"role": "assistant", "content": [{"type": "text", "text": text}]})
    return text
```
For lower latency, switch to `invoke_model_with_response_stream` and pipe deltas straight into Polly.
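One way to sketch the buffering side of that streaming path — `sentence_chunks` is a hypothetical helper (not from the text) that accumulates deltas and yields complete sentences you could hand to Polly one at a time:

```python
import re


def sentence_chunks(deltas):
    """Accumulate streamed text deltas, yielding a sentence as soon as a
    terminator (. ! ?) followed by whitespace or end-of-buffer appears."""
    buf = ""
    for delta in deltas:
        buf += delta
        while True:
            m = re.search(r"[.!?](\s+|$)", buf)
            if not m:
                break
            sentence, buf = buf[:m.end()].strip(), buf[m.end():]
            if sentence:
                yield sentence
    # Flush any trailing text that never got a terminator.
    if buf.strip():
        yield buf.strip()
```

Note this naive split mis-fires on abbreviations like "Dr." — acceptable for short voice-agent replies, but worth hardening if your prompts produce them.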
Step 4 — Synthesize with Polly generative voices
```python
polly = boto3.client("polly", region_name="us-east-1")


def synth(text, voice="Ruth"):  # Ruth/Stephen are generative voices
    out = polly.synthesize_speech(
        Text=text,
        VoiceId=voice,
        OutputFormat="pcm",
        SampleRate="16000",
        Engine="generative")
    return out["AudioStream"].read()
```
Generative voices add ~150ms vs neural but sound dramatically more human; use neural for stricter latency budgets.
Step 5 — Glue: VAD, tool-use, and barge-in
Use a simple energy-based VAD (RMS threshold) to chunk inputs to Transcribe; throw away anything below 600ms of speech. For barge-in, kill the current Polly playback the moment Transcribe emits a non-empty partial. For tool-use, switch from invoke_model to Bedrock's converse API which supports native tool calling — Claude returns a toolUse block, you execute, and reply with a toolResult block.
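The energy-based VAD above can be sketched with plain `struct` arithmetic — `rms` and `speech_segments` are illustrative names, and the threshold of 500 is a starting point you'd tune per microphone:

```python
import struct

FRAME_MS = 20   # frame size fed to the detector
RATE = 16000    # PCM16 mono, matching the Transcribe stream


def rms(frame: bytes) -> float:
    """Root-mean-square energy of one little-endian PCM16 frame."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


def speech_segments(frames, threshold=500, min_ms=600):
    """Group consecutive voiced frames; drop runs shorter than min_ms."""
    run = []
    for frame in frames:
        if rms(frame) >= threshold:
            run.append(frame)
        else:
            if len(run) * FRAME_MS >= min_ms:
                yield b"".join(run)
            run = []
    if len(run) * FRAME_MS >= min_ms:
        yield b"".join(run)
```

For barge-in, the same `rms` check on the caller's live frames is what triggers cancelling Polly playback before Transcribe even returns a partial.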
Step 6 — Containerize and deploy on Fargate
```dockerfile
FROM python:3.11-slim
WORKDIR /app
RUN pip install boto3 amazon-transcribe uvicorn fastapi
COPY app.py /app/app.py
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
```
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Build and push with `docker build -t voice-agent . && aws ecr-public get-login-password | docker login ... && docker push ...`. Then run as a Fargate service behind an NLB; mTLS to Twilio if you're terminating PSTN.
Step 7 — Wire to Twilio Media Streams
Convert Twilio's mu-law 8kHz frames to PCM16 16kHz with `audioop.ulaw2lin` + `audioop.ratecv` before forwarding into the Transcribe stream. Reverse the chain (PCM16 16kHz → mu-law 8kHz) on Polly output frames before sending media events back to Twilio.
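A minimal sketch of both directions, assuming Python 3.11 where `audioop` is still in the stdlib (it was removed in 3.13); the function names are illustrative, and `state` carries `ratecv`'s resampler state between consecutive frames:

```python
import audioop  # stdlib through Python 3.12; this tutorial targets 3.11


def twilio_to_transcribe(mulaw_8k: bytes, state=None):
    """mu-law 8kHz Twilio media frame -> linear PCM16 16kHz."""
    pcm_8k = audioop.ulaw2lin(mulaw_8k, 2)  # decode mu-law to 16-bit samples
    pcm_16k, state = audioop.ratecv(pcm_8k, 2, 1, 8000, 16000, state)
    return pcm_16k, state


def polly_to_twilio(pcm_16k: bytes, state=None):
    """linear PCM16 16kHz Polly output -> mu-law 8kHz for Twilio."""
    pcm_8k, state = audioop.ratecv(pcm_16k, 2, 1, 16000, 8000, state)
    return audioop.lin2ulaw(pcm_8k, 2), state
```

Pass the returned `state` back in on the next frame of the same call leg, or the resampler will click at frame boundaries.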
Pitfalls
- Bedrock model access isn't on by default — request it once per region in the Bedrock console.
- Transcribe streaming has a 4-hour cap per session; reset on long calls.
- Polly generative is regional — only available in `us-east-1`, `eu-west-1`, `ap-northeast-1` as of May 2026.
- Cost trap: Polly generative is $30/M chars vs $4/M for neural. Cache common greetings.
- boto3 retry storms: set `Config(retries={"max_attempts": 1, "mode": "standard"})` on Bedrock; the default exponential backoff will blow your latency budget.
How CallSphere does this in production
CallSphere's Healthcare voice stack runs on FastAPI :8084 with OpenAI Realtime as the primary path because we measured 350ms cheaper TTFT vs Bedrock InvokeModel for short utterances. We keep an AWS Bedrock + Polly fallback wired through the same FastAPI surface for HIPAA-locked tenants who need their audio to never leave AWS, and Claude 4.7 Sonnet on Bedrock powers our 90+ tools across 6 verticals. We run 37 voice agents under one orchestration layer with 115+ Postgres tables tracking every turn. Pricing tiers are $149/$499/$1499 with a 14-day trial and a 22% lifetime affiliate cut.
FAQ
Q: Why not just use Bedrock AgentCore? AgentCore is great for chat but doesn't give you raw audio control — you can't bridge Twilio media streams without a wrapper service anyway. Going direct to Transcribe + InvokeModel + Polly keeps you in the audio path.
Q: Can I use Nova Sonic instead of this stack? Nova Sonic (Amazon's speech-to-speech model) is excellent and cuts latency further, but it's currently only routable through Bedrock InvokeModelWithBidirectionalStream which requires SigV4 signing on a streaming socket — more code than this tutorial.
Q: How do I handle PHI? Sign a BAA with AWS, enable VPC endpoints for all three services so audio never traverses the public internet, and turn off Transcribe content redaction logging.
Q: What's the realistic latency? On us-east-1 with warm clients: Transcribe partial ~250ms, Bedrock TTFT ~400ms, Polly first-byte ~200ms. Voice-to-voice ~700ms.
Q: Can I stream Claude's output into Polly?
Yes — use invoke_model_with_response_stream, accumulate deltas into sentence boundaries (. ! ?), and call Polly per sentence. Cuts perceived latency by 40%.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.