
Streaming Responses from OpenAI: Real-Time Token-by-Token Output

Learn how to stream OpenAI responses token-by-token using the Python SDK, implement async streaming for web applications, and display incremental results to users.

Why Streaming Matters

When a model generates a long response, the standard (non-streaming) API makes you wait for the entire completion before returning anything. For a 500-token response, that can mean several seconds of silence before any text appears. Streaming changes this by delivering tokens as they are generated, giving users the familiar "typing" experience seen in ChatGPT.

Streaming is essential for chatbots, real-time UIs, and any application where perceived latency matters.
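Perceived latency is usually quantified as time-to-first-token (TTFT): the delay before the first visible output. A minimal sketch of measuring it, using a simulated token generator in place of a real API call (the generator and function names here are illustrative, not part of any SDK):

```python
import time
from typing import Iterator

def fake_stream(tokens: list[str], delay: float = 0.01) -> Iterator[str]:
    """Simulate a token stream; stands in for a real API response."""
    for token in tokens:
        time.sleep(delay)
        yield token

def measure_ttft(stream: Iterator[str]) -> tuple[float, str]:
    """Return (seconds until first token, full collected text)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        parts.append(token)
    return ttft, "".join(parts)

ttft, text = measure_ttft(fake_stream(["Hello", ", ", "world"]))
print(f"TTFT: {ttft:.3f}s, text: {text!r}")
```

With a non-streaming call, TTFT equals total generation time; with streaming, it drops to roughly the time of generating one token plus network overhead.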

Basic Synchronous Streaming

The end-to-end flow of a streamed request through a typical edge or proxy layer looks like this:

sequenceDiagram
    autonumber
    participant Client
    participant Edge as Edge Worker
    participant LLM as LLM Provider
    participant DB as Logs and Trace
    Client->>Edge: POST /chat (stream=true)
    Edge->>LLM: chat.completions.create(stream=True)
    loop Each token
        LLM-->>Edge: SSE chunk delta
        Edge-->>Client: SSE chunk delta
        Edge->>DB: append token to span
    end
    LLM-->>Edge: finish_reason=stop
    Edge-->>Client: event: done
    Edge->>DB: finalize trace

Enable streaming by setting stream=True:
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Write a short guide to Python decorators."},
    ],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)

print()  # newline after streaming completes

Each chunk is a ChatCompletionChunk object. The delta field contains the incremental content — usually one or a few tokens per chunk. The first chunk often has the role field set, and subsequent chunks contain content.
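The accumulation logic can be isolated and unit-tested without calling the API. A sketch using plain dicts shaped like the delta fields (hypothetical stand-ins for ChatCompletionChunk objects):

```python
def assemble_message(deltas: list[dict]) -> dict:
    """Rebuild a complete chat message from incremental deltas.

    The first delta typically carries the role; later ones carry content.
    """
    message = {"role": None, "content": ""}
    for delta in deltas:
        if delta.get("role"):
            message["role"] = delta["role"]
        if delta.get("content"):
            message["content"] += delta["content"]
    return message

# Deltas shaped like what the API streams back:
deltas = [
    {"role": "assistant"},
    {"content": "Hel"},
    {"content": "lo!"},
]
print(assemble_message(deltas))  # {'role': 'assistant', 'content': 'Hello!'}
```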


Async Streaming

For web applications built with FastAPI, Django, or similar frameworks, async streaming is the right approach:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_response(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )

    full_response = ""
    async for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            full_response += delta.content
            print(delta.content, end="", flush=True)

    print()
    return full_response

result = asyncio.run(stream_response("Explain async generators in Python."))

The async client uses async for to iterate over chunks without blocking the event loop, which means your server can handle other requests concurrently during generation.
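The concurrency benefit can be demonstrated without the API at all. A sketch using simulated async token generators (illustrative names, not SDK objects), where two streams are consumed side by side with asyncio.gather:

```python
import asyncio
from typing import AsyncIterator

async def fake_async_stream(text: str) -> AsyncIterator[str]:
    """Simulated token stream; stands in for the OpenAI async client."""
    for token in text.split():
        await asyncio.sleep(0)  # yield control back to the event loop
        yield token + " "

async def collect(stream: AsyncIterator[str]) -> str:
    """Drain a stream into a single string."""
    return "".join([tok async for tok in stream])

async def main() -> list[str]:
    # Both streams are consumed concurrently; neither blocks the other.
    return await asyncio.gather(
        collect(fake_async_stream("first answer")),
        collect(fake_async_stream("second answer")),
    )

results = asyncio.run(main())
print(results)  # ['first answer ', 'second answer ']
```

In a real server the event loop interleaves chunks from many in-flight OpenAI streams the same way.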

Building an SSE Endpoint with FastAPI

Server-Sent Events (SSE) are the standard way to push streaming responses to a browser:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()

async def generate_stream(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    async for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            yield f"data: {delta.content}\n\n"

    yield "data: [DONE]\n\n"

@app.get("/stream")
async def stream_endpoint(prompt: str):
    return StreamingResponse(
        generate_stream(prompt),
        media_type="text/event-stream",
    )

On the frontend, consume this with the EventSource API or a fetch-based SSE reader. One caveat: if delta.content itself contains a newline, the bare data: framing above splits it across frames; JSON-encoding each chunk before sending is the usual fix.
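Because a newline inside a chunk terminates an SSE data frame early, a common workaround (a sketch; the payload shape is a convention, not part of the SSE spec) is to JSON-encode each chunk:

```python
import json

def sse_frame(content: str) -> str:
    """Encode one content chunk as a single SSE data frame.

    JSON-encoding escapes embedded newlines, so they cannot
    terminate the frame early.
    """
    return f"data: {json.dumps({'content': content})}\n\n"

frame = sse_frame("line one\nline two")
print(frame)
# The frontend recovers the newline with JSON.parse(event.data).content
```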

Collecting the Full Response While Streaming

A common pattern is to display tokens in real-time while also building up the complete response for storage or further processing:


from openai import OpenAI

client = OpenAI()

def stream_and_collect(messages: list[dict]) -> str:
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )

    collected_content = []
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            collected_content.append(delta.content)
            print(delta.content, end="", flush=True)

    print()
    return "".join(collected_content)

full_text = stream_and_collect([
    {"role": "user", "content": "Summarize the Python GIL in 3 sentences."},
])
# full_text now contains the entire response

Handling Stream Interruptions

Network issues can interrupt a stream mid-response. Wrap your streaming code in proper error handling:

from openai import OpenAI, APIConnectionError, APITimeoutError

client = OpenAI()

def safe_stream(messages: list[dict]) -> str:
    try:
        stream = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            stream=True,
        )

        parts = []
        for chunk in stream:
            delta = chunk.choices[0].delta
            if delta.content:
                parts.append(delta.content)

        return "".join(parts)

    except APIConnectionError:
        return "Connection lost during streaming. Please retry."
    except APITimeoutError:
        return "Request timed out. Please retry."
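For transient failures, a retry loop around the streaming call is often enough. A generic sketch (the attempt count and backoff constants are arbitrary choices, not SDK defaults):

```python
import time
from typing import Callable

def with_retries(fn: Callable[[], str], attempts: int = 3,
                 backoff: float = 0.5) -> str:
    """Call fn, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff * (2 ** attempt))
    raise RuntimeError("unreachable")

# Usage with a flaky function that fails once, then succeeds:
calls = {"n": 0}
def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("dropped")
    return "ok"

print(with_retries(flaky, backoff=0.0))  # ok
```

Note that retrying a stream regenerates the response from scratch; tokens already shown to the user may differ on the second attempt.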

FAQ

Does streaming cost more tokens than non-streaming?

No. Token usage is identical whether you stream or not. The only difference is how the response is delivered to your client.

Can I use streaming with function calling?

Yes. When the model decides to call a function, the tool call arguments are streamed incrementally in the delta.tool_calls field. You accumulate the argument string across chunks and parse it once complete.
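The accumulation pattern described above can be sketched with plain dicts shaped like delta.tool_calls entries (hypothetical stand-ins for the SDK objects):

```python
import json

def accumulate_tool_call(deltas: list[dict]) -> dict:
    """Merge streamed tool-call fragments into one complete call."""
    name = ""
    arguments = ""
    for delta in deltas:
        fn = delta.get("function", {})
        if fn.get("name"):
            name = fn["name"]  # name arrives once, in an early chunk
        arguments += fn.get("arguments", "")  # argument JSON arrives in pieces
    return {"name": name, "arguments": json.loads(arguments)}

# Fragments shaped like streamed tool-call deltas:
deltas = [
    {"function": {"name": "get_weather", "arguments": '{"cit'}},
    {"function": {"arguments": 'y": "Paris"}'}},
]
print(accumulate_tool_call(deltas))
# {'name': 'get_weather', 'arguments': {'city': 'Paris'}}
```

The key point is that the argument string is not valid JSON until the final fragment arrives, so parsing must wait until the stream ends.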

How do I know when the stream is finished?

The stream ends when iteration completes. The last chunk will have a finish_reason set on choices[0] (e.g., stop or tool_calls). If you are sending SSE, emit a [DONE] event as a signal to the frontend.


#OpenAI #Streaming #ServerSentEvents #AsyncPython #RealTime #AgenticAI #LearnAI #AIEngineering
