
Build a Swift iOS Voice Agent with SwiftUI and WebRTC

Native iOS voice agent in Swift using AVFoundation and WebRTC. Real working SwiftUI code for ephemeral key flow, RTCPeerConnection setup, and a live waveform.

TL;DR — Apple ships WebRTC inside WebKit, but for native voice you want the standalone WebRTC.framework. Pair it with an ephemeral OpenAI Realtime token, and a 200-line SwiftUI app gets you sub-700ms voice on iOS.

What you'll build

A SwiftUI iOS app with one tap-to-talk button that opens a WebRTC RTCPeerConnection to OpenAI Realtime. We set the system instructions over the data channel, render a live audio waveform, and handle background audio session interruptions.

Prerequisites

  1. Xcode 16+, iOS 17 deployment target.
  2. WebRTC.framework via SPM: https://github.com/stasel/WebRTC.
  3. OPENAI_API_KEY on your backend (never in the app).
  4. NSMicrophoneUsageDescription in Info.plist.
  5. Familiarity with async/await and AVAudioSession.

Architecture

```mermaid
sequenceDiagram
  participant I as iOS app
  participant K as Your /session endpoint
  participant O as OpenAI Realtime
  I->>K: GET /session (mint ephemeral)
  K-->>I: client_secret
  I->>I: RTCPeerConnection.offer
  I->>O: POST /v1/realtime (SDP, Bearer eph)
  O-->>I: SDP answer
  I<<->>O: Opus + DataChannel events
```

Step 1 — Configure the audio session

```swift
import AVFoundation

func activateAudio() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord,
                            mode: .voiceChat,
                            options: [.defaultToSpeaker, .allowBluetooth, .duckOthers])
    try session.setActive(true)
}
```

Step 2 — Build the peer connection

```swift
import WebRTC

// RealtimeClient must also adopt RTCPeerConnectionDelegate
// (stubs omitted here for brevity).
final class RealtimeClient: NSObject {
    private let factory: RTCPeerConnectionFactory = {
        RTCInitializeSSL()
        return RTCPeerConnectionFactory(
            encoderFactory: RTCDefaultVideoEncoderFactory(),
            decoderFactory: RTCDefaultVideoDecoderFactory())
    }()
    var pc: RTCPeerConnection!
    var dc: RTCDataChannel!

    func makeConnection() {
        let cfg = RTCConfiguration()
        cfg.iceServers = [RTCIceServer(urlStrings: ["stun:stun.l.google.com:19302"])]
        cfg.sdpSemantics = .unifiedPlan
        let constraints = RTCMediaConstraints(
            mandatoryConstraints: nil, optionalConstraints: nil)
        pc = factory.peerConnection(with: cfg, constraints: constraints, delegate: self)!

        let audioSrc = factory.audioSource(with: nil)
        let audioTrack = factory.audioTrack(with: audioSrc, trackId: "mic0")
        pc.add(audioTrack, streamIds: ["s0"])

        let dcCfg = RTCDataChannelConfiguration()
        dc = pc.dataChannel(forLabel: "oai-events", configuration: dcCfg)
        dc.delegate = self
    }
}
```

Step 3 — Mint ephemeral key on your server

```swift
struct Ephemeral: Decodable {
    struct Secret: Decodable { let value: String }
    let client_secret: Secret
}

func fetchKey() async throws -> String {
    let url = URL(string: "https://api.callsphere.ai/voice/session")!
    let (data, _) = try await URLSession.shared.data(from: url)
    return try JSONDecoder().decode(Ephemeral.self, from: data).client_secret.value
}
```
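The server side of this exchange isn't shown above. As a sketch, the mint endpoint calls OpenAI's `POST /v1/realtime/sessions` REST API with your long-lived key and returns the resulting `client_secret` to the app. This example uses plain Foundation `URLSession` so it works in any Swift server framework; the body fields follow OpenAI's session-creation API, but treat the exact shape as an assumption and check the current Realtime docs:

```swift
import Foundation

// Server-side sketch (NOT app code): trade the long-lived OPENAI_API_KEY
// for a short-lived client_secret the iOS app can safely hold.
func mintEphemeralKey(apiKey: String) async throws -> String {
    var req = URLRequest(url: URL(string: "https://api.openai.com/v1/realtime/sessions")!)
    req.httpMethod = "POST"
    req.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    req.setValue("application/json", forHTTPHeaderField: "Content-Type")
    req.httpBody = try JSONSerialization.data(withJSONObject: [
        "model": "gpt-4o-realtime-preview-2025-06-03",
        "voice": "alloy"
    ])
    let (data, _) = try await URLSession.shared.data(for: req)
    // Response contains { "client_secret": { "value": "ek_..." } }.
    let json = try JSONSerialization.jsonObject(with: data) as! [String: Any]
    let secret = json["client_secret"] as! [String: Any]
    return secret["value"] as! String  // forward this value to the iOS app
}
```

Ephemeral secrets expire quickly, so mint one per connection attempt rather than caching.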

Step 4 — Trade SDP

```swift
func connect() async throws {
    let key = try await fetchKey()
    let constraints = RTCMediaConstraints(
        mandatoryConstraints: ["OfferToReceiveAudio": "true"],
        optionalConstraints: nil)
    let offer: RTCSessionDescription = try await withCheckedThrowingContinuation { c in
        pc.offer(for: constraints) { sdp, err in
            if let sdp = sdp { c.resume(returning: sdp) }
            else { c.resume(throwing: err!) }
        }
    }
    // Void continuations need their type spelled out for inference.
    let _: Void = try await withCheckedThrowingContinuation { c in
        pc.setLocalDescription(offer) { e in
            if let e = e { c.resume(throwing: e) } else { c.resume() }
        }
    }

    var req = URLRequest(url: URL(
        string: "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03")!)
    req.httpMethod = "POST"
    req.setValue("Bearer \(key)", forHTTPHeaderField: "Authorization")
    req.setValue("application/sdp", forHTTPHeaderField: "Content-Type")
    req.httpBody = offer.sdp.data(using: .utf8)
    let (ans, _) = try await URLSession.shared.data(for: req)
    let answer = RTCSessionDescription(type: .answer,
                                       sdp: String(data: ans, encoding: .utf8)!)
    let _: Void = try await withCheckedThrowingContinuation { c in
        pc.setRemoteDescription(answer) { e in
            if let e = e { c.resume(throwing: e) } else { c.resume() }
        }
    }
}
```

Step 5 — SwiftUI screen

```swift
struct VoiceView: View {
    @StateObject var vm = VoiceVM()

    var body: some View {
        VStack(spacing: 24) {
            Text(vm.status).font(.headline)
            WaveformView(level: vm.audioLevel)
                .frame(width: 220, height: 220)
            Button(vm.connected ? "End" : "Talk") {
                Task {
                    // A ternary can't mix a sync and an async branch.
                    if vm.connected { vm.end() } else { await vm.start() }
                }
            }
            .buttonStyle(.borderedProminent)
        }
        .padding()
    }
}
```
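`VoiceVM` and `WaveformView` are not defined above; here is one minimal way they could look. The names and properties are only what the view expects, and wiring `audioLevel` to real mic levels is left out:

```swift
import SwiftUI

// Hypothetical view model backing VoiceView; property and method names
// match what the view expects, everything else is a sketch.
@MainActor final class VoiceVM: ObservableObject {
    @Published var status = "Idle"
    @Published var audioLevel: CGFloat = 0
    @Published var connected = false
    private let client = RealtimeClient()

    func start() async {
        status = "Connecting…"
        do {
            try activateAudio()
            client.makeConnection()
            try await client.connect()
            connected = true
            status = "Live"
        } catch {
            status = "Failed: \(error.localizedDescription)"
        }
    }

    func end() {
        client.pc?.close()
        connected = false
        status = "Idle"
    }
}

// Bare-bones level indicator: a circle that scales with the mic level.
struct WaveformView: View {
    var level: CGFloat
    var body: some View {
        Circle()
            .fill(.blue.opacity(0.3))
            .scaleEffect(0.5 + level)
            .animation(.easeOut(duration: 0.1), value: level)
    }
}
```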

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Step 6 — Push session.update on data-channel open

```swift
extension RealtimeClient: RTCDataChannelDelegate {
    func dataChannelDidChangeState(_ ch: RTCDataChannel) {
        guard ch.readyState == .open else { return }
        let payload: [String: Any] = [
            "type": "session.update",
            "session": [
                "instructions": "You are CallSphere's iOS demo agent.",
                "voice": "alloy",
                "turn_detection": ["type": "server_vad"]
            ]
        ]
        let data = try! JSONSerialization.data(withJSONObject: payload)
        ch.sendData(RTCDataBuffer(data: data, isBinary: false))
    }

    // Required by RTCDataChannelDelegate; server events arrive here.
    func dataChannel(_ ch: RTCDataChannel, didReceiveMessageWith buffer: RTCDataBuffer) {
        // Parse Realtime events (transcripts, errors) from buffer.data.
    }
}
```
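Messages arriving on the data channel are JSON events keyed by a `"type"` field. A minimal decoder for the transcript-delta case might look like this; `RealtimeEvent` and `handleEvent` are names introduced here, and the event-type strings follow OpenAI's Realtime event schema:

```swift
import Foundation

// Hypothetical minimal event wrapper: every Realtime event carries a
// "type"; transcript deltas additionally carry a "delta" string.
struct RealtimeEvent: Decodable {
    let type: String
    let delta: String?
}

func handleEvent(_ data: Data) -> String? {
    guard let event = try? JSONDecoder().decode(RealtimeEvent.self, from: data) else {
        return nil
    }
    switch event.type {
    case "response.audio_transcript.delta":
        return event.delta          // append to a running transcript
    default:
        return nil                  // ignore event types we don't render
    }
}
```

In practice you would call this from `dataChannel(_:didReceiveMessageWith:)` and publish the transcript to the UI.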

Common pitfalls

  • Wrong AVAudioSession mode: .voiceChat is what gives you echo cancellation.
  • Not handling AVAudioSession.interruptionNotification — phone call kills your mic until you reactivate.
  • Shipping the API key — always mint ephemeral on your server.
  • Forgetting to call RTCInitializeSSL() — silent crash on first connect.
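The interruption pitfall above can be handled with a small observer. This is a sketch; whether you also need to renegotiate the peer connection depends on how long the interruption lasted:

```swift
import AVFoundation

// Sketch: reactivate the audio session after a phone call or Siri
// interruption ends. Register this once, e.g. in your view model's init.
func observeInterruptions() {
    NotificationCenter.default.addObserver(
        forName: AVAudioSession.interruptionNotification,
        object: AVAudioSession.sharedInstance(),
        queue: .main
    ) { note in
        guard let raw = note.userInfo?[AVAudioSessionInterruptionTypeKey] as? UInt,
              let type = AVAudioSession.InterruptionType(rawValue: raw),
              type == .ended else { return }
        // The system deactivated the session for us; bring it back.
        try? AVAudioSession.sharedInstance().setActive(true)
    }
}
```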

How CallSphere does this in production

CallSphere's iOS partner app uses this exact pattern, talking to the same FastAPI :8084 backend that powers our Healthcare HIPAA voice agent. 37 agents, 115+ DB tables, SOC 2 + HIPAA. Try it for 14 days — see /pricing.

FAQ

Can I skip WebRTC and use WebSocket? Yes, but jitter + AEC are way harder.

Why ephemeral keys? App-store binaries can be unpacked; long-lived keys leak.

Does CallKit play nice? Yes — set the audio session in your CXProvider delegate.
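For the CallKit answer, the relevant hook is `CXProviderDelegate`'s `didActivate` callback; a sketch, with teardown details omitted:

```swift
import CallKit
import AVFoundation

// Sketch: when CallKit hands you the audio session, configure it in the
// delegate instead of activating it yourself.
final class CallManager: NSObject, CXProviderDelegate {
    func providerDidReset(_ provider: CXProvider) {
        // Tear down the RTCPeerConnection here.
    }

    func provider(_ provider: CXProvider, didActivate audioSession: AVAudioSession) {
        try? audioSession.setCategory(.playAndRecord, mode: .voiceChat,
                                      options: [.allowBluetooth])
        // Safe to start WebRTC audio now.
    }

    func provider(_ provider: CXProvider, didDeactivate audioSession: AVAudioSession) {
        // Pause or stop audio capture.
    }
}
```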

Background audio? Add the audio background mode capability in Info.plist.

Catalyst / iPad? Same code path — WebRTC.framework is universal.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.