Learn Agentic AI

Generating Multimodal Outputs: AI Agents That Create Images, Audio, and Documents

Build AI agents that generate rich multimodal outputs including images with DALL-E, speech with TTS, PDF documents, and formatted reports. Learn how to orchestrate multiple generation APIs into cohesive, multi-format responses.

Beyond Text Responses

Most AI agents return plain text. But many real tasks require richer outputs: a marketing agent should deliver copy alongside generated images, a report agent should produce formatted PDFs, and an accessibility agent should provide audio narrations. This guide builds an agent that generates images, audio, and documents as part of its response.

Image Generation with DALL-E

Start with the most common multimodal output — generating images from text descriptions:

import openai
import httpx
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class GeneratedImage:
    url: str
    local_path: str | None = None
    prompt: str = ""
    revised_prompt: str = ""

async def generate_image(
    prompt: str,
    client: openai.AsyncOpenAI,
    size: str = "1024x1024",
    quality: str = "standard",
    save_dir: str = "./outputs",
) -> GeneratedImage:
    """Generate an image using DALL-E 3."""
    response = await client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        quality=quality,
        n=1,
    )

    image_url = response.data[0].url
    # revised_prompt may be absent; fall back to the empty string
    revised = response.data[0].revised_prompt or ""

    # Download and save locally
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = prompt[:50].replace(" ", "_").replace("/", "_")
    local_path = f"{save_dir}/{safe_name}.png"

    async with httpx.AsyncClient() as http:
        img_response = await http.get(image_url)
        with open(local_path, "wb") as f:
            f.write(img_response.content)

    return GeneratedImage(
        url=image_url,
        local_path=local_path,
        prompt=prompt,
        revised_prompt=revised,
    )

Text-to-Speech Generation

For audio output, use OpenAI's TTS API to convert text to natural speech:

@dataclass
class GeneratedAudio:
    local_path: str
    duration_estimate: float
    voice: str
    text: str

async def generate_speech(
    text: str,
    client: openai.AsyncOpenAI,
    voice: str = "alloy",
    save_dir: str = "./outputs",
) -> GeneratedAudio:
    """Generate speech audio from text using OpenAI TTS."""
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = text[:30].replace(" ", "_").replace("/", "_")
    local_path = f"{save_dir}/{safe_name}.mp3"

    response = await client.audio.speech.create(
        model="tts-1-hd",
        voice=voice,
        input=text,
    )

    with open(local_path, "wb") as f:
        f.write(response.content)

    # Rough duration estimate: ~150 words per minute
    word_count = len(text.split())
    duration = word_count / 150 * 60

    return GeneratedAudio(
        local_path=local_path,
        duration_estimate=duration,
        voice=voice,
        text=text,
    )

PDF Document Generation

For structured document output, generate PDFs with formatted text, tables, and embedded images:

from reportlab.lib.pagesizes import letter
from reportlab.platypus import (
    SimpleDocTemplate, Paragraph, Spacer, Table,
    TableStyle, Image as RLImage,
)
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib import colors

@dataclass
class GeneratedDocument:
    local_path: str
    page_count: int
    title: str

def generate_pdf_report(
    title: str,
    sections: list[dict],
    save_dir: str = "./outputs",
    images: list[str] | None = None,
) -> GeneratedDocument:
    """Generate a formatted PDF report.

    Each section: {"heading": str, "body": str, "table": list | None}
    """
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = title[:40].replace(" ", "_")
    path = f"{save_dir}/{safe_name}.pdf"

    doc = SimpleDocTemplate(path, pagesize=letter)
    styles = getSampleStyleSheet()
    story = []

    # Title
    story.append(Paragraph(title, styles["Title"]))
    story.append(Spacer(1, 20))

    for section in sections:
        # Section heading
        story.append(
            Paragraph(section["heading"], styles["Heading2"])
        )
        story.append(Spacer(1, 10))

        # Body text
        story.append(
            Paragraph(section["body"], styles["BodyText"])
        )
        story.append(Spacer(1, 10))

        # Optional table
        if section.get("table"):
            table_data = section["table"]
            t = Table(table_data)
            t.setStyle(TableStyle([
                ("BACKGROUND", (0, 0), (-1, 0), colors.grey),
                ("TEXTCOLOR", (0, 0), (-1, 0), colors.whitesmoke),
                ("GRID", (0, 0), (-1, -1), 1, colors.black),
                ("FONTSIZE", (0, 0), (-1, -1), 9),
            ]))
            story.append(t)
            story.append(Spacer(1, 15))

    # Embed images if provided
    for img_path in (images or []):
        if Path(img_path).exists():
            story.append(RLImage(img_path, width=400, height=300))
            story.append(Spacer(1, 15))

    doc.build(story)

    # The doc template tracks the page number during build
    page_count = doc.page

    return GeneratedDocument(
        local_path=path,
        page_count=page_count,
        title=title,
    )

The Multimodal Output Agent

Bring all generators together into an agent that decides which output formats are appropriate for each request:

@dataclass
class MultimodalResponse:
    text: str
    images: list[GeneratedImage] = field(default_factory=list)
    audio: list[GeneratedAudio] = field(default_factory=list)
    documents: list[GeneratedDocument] = field(default_factory=list)

class MultimodalOutputAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def _plan_outputs(self, query: str) -> dict:
        """Ask the LLM what output formats are appropriate."""
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Decide what output formats to generate. "
                        "Return a JSON object with boolean fields: "
                        "needs_image, needs_audio, needs_document, "
                        "and string fields: image_prompt, "
                        "audio_text, document_title, text_response."
                    ),
                },
                {"role": "user", "content": query},
            ],
            response_format={"type": "json_object"},
        )
        import json
        return json.loads(response.choices[0].message.content)

    async def respond(self, query: str) -> MultimodalResponse:
        plan = await self._plan_outputs(query)

        result = MultimodalResponse(
            text=plan.get("text_response", "")
        )

        # Generate outputs in parallel where possible
        import asyncio
        tasks = []

        if plan.get("needs_image") and plan.get("image_prompt"):
            tasks.append(self._gen_image(plan["image_prompt"]))

        if plan.get("needs_audio") and plan.get("audio_text"):
            tasks.append(self._gen_audio(plan["audio_text"]))

        outputs = await asyncio.gather(*tasks, return_exceptions=True)

        for output in outputs:
            if isinstance(output, GeneratedImage):
                result.images.append(output)
            elif isinstance(output, GeneratedAudio):
                result.audio.append(output)
            elif isinstance(output, Exception):
                result.text += (
                    f"\n\n[Generation error: {output}]"
                )

        if plan.get("needs_document"):
            doc = generate_pdf_report(
                title=plan.get("document_title", "Report"),
                sections=[{
                    "heading": "Content",
                    "body": result.text,
                }],
                images=[
                    img.local_path
                    for img in result.images
                    if img.local_path
                ],
            )
            result.documents.append(doc)

        return result

    async def _gen_image(self, prompt: str) -> GeneratedImage:
        return await generate_image(prompt, self.client)

    async def _gen_audio(self, text: str) -> GeneratedAudio:
        return await generate_speech(text, self.client)

Usage Example

import asyncio

async def main():
    agent = MultimodalOutputAgent()

    response = await agent.respond(
        "Create a brief market analysis report for the AI "
        "industry in 2026, with a cover image and an audio "
        "executive summary."
    )

    print("Text:", response.text[:200])
    print("Images:", [img.local_path for img in response.images])
    print("Audio:", [a.local_path for a in response.audio])
    print("Docs:", [d.local_path for d in response.documents])

asyncio.run(main())

FAQ

How do I control the cost of generating multiple output types per request?

Implement a budget system that tracks estimated costs per generation type. DALL-E 3 costs approximately $0.04 per standard image, TTS costs about $0.015 per 1000 characters, and GPT-4o planning costs standard token rates. Set per-request spending limits and skip optional outputs (like images) when the budget is tight. Also cache generated outputs — if the same image prompt appears twice, serve the cached version instead of regenerating.
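A minimal sketch of such a budget gate, using the cost estimates quoted above (the class name, default limit, and cache shape are illustrative; real prices vary by model and tier):

```python
from dataclasses import dataclass, field

# Illustrative per-unit cost estimates (USD), matching the figures above.
IMAGE_COST = 0.04                   # DALL-E 3, standard 1024x1024
TTS_COST_PER_CHAR = 0.015 / 1000    # ~$0.015 per 1,000 characters

@dataclass
class GenerationBudget:
    """Tracks estimated spend for one request and gates optional outputs."""
    limit_usd: float = 0.10
    spent_usd: float = 0.0
    # prompt -> previously generated result; populated by the caller
    cache: dict = field(default_factory=dict)

    def can_afford(self, cost: float) -> bool:
        return self.spent_usd + cost <= self.limit_usd

    def charge(self, cost: float) -> None:
        self.spent_usd += cost

    def try_image(self, prompt: str) -> bool:
        """True if an image for this prompt should be generated (or reused)."""
        if prompt in self.cache:        # cached outputs are free to serve
            return True
        if not self.can_afford(IMAGE_COST):
            return False                # skip the optional image when tight
        self.charge(IMAGE_COST)
        return True

    def try_tts(self, text: str) -> bool:
        cost = len(text) * TTS_COST_PER_CHAR
        if not self.can_afford(cost):
            return False
        self.charge(cost)
        return True
```

The agent would call `budget.try_image(prompt)` before invoking the image generator and store the result in `budget.cache` afterward, so a repeated prompt is served for free.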


Can I use open-source alternatives instead of OpenAI APIs for generation?

Yes. For image generation, use Stable Diffusion via a local ComfyUI or A1111 server. For TTS, Coqui TTS and Bark provide open-source speech synthesis. For document generation, reportlab (shown above) is already open-source and runs locally with no API calls. Replace the API calls in each generator function with calls to your local model servers while keeping the same return types.
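As a sketch of that swap, here is a drop-in image generator targeting the AUTOMATIC1111 web UI's txt2img endpoint. It assumes a local server on A1111's default port 7860; the payload fields shown are the common txt2img parameters, and the function returns a plain file path that you would wrap in the same GeneratedImage dataclass to keep call sites unchanged:

```python
import base64
import json
import urllib.request
from pathlib import Path

def save_b64_png(b64_data: str, path: str) -> str:
    """Decode a base64-encoded PNG and write it to disk."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_bytes(base64.b64decode(b64_data))
    return path

def generate_image_sd(
    prompt: str,
    save_dir: str = "./outputs",
    base_url: str = "http://localhost:7860",  # A1111 default port
) -> str:
    """Generate an image via a local Stable Diffusion (A1111) server."""
    payload = json.dumps({
        "prompt": prompt,
        "steps": 25,
        "width": 1024,
        "height": 1024,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/sdapi/v1/txt2img",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # A1111 returns base64-encoded images in the "images" list
        images_b64 = json.load(resp)["images"]

    safe_name = prompt[:50].replace(" ", "_").replace("/", "_")
    return save_b64_png(images_b64[0], f"{save_dir}/{safe_name}.png")
```

There is no `url` field here because the image never leaves your machine; everything else about the agent stays the same.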

How do I serve multimodal outputs through a web API?

Return a JSON response with the text content inline and URLs or file paths for binary outputs. For a FastAPI endpoint, upload generated images and audio to cloud storage (S3, GCS) and return signed URLs. Alternatively, serve files directly from the local output directory using FastAPI's StaticFiles mount. For documents, return a download URL that streams the PDF directly to the client.
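A minimal sketch of the payload shaping, assuming binary files are exposed under a static /outputs mount (the base URL and field names are illustrative, not a fixed schema):

```python
from pathlib import Path

def to_api_payload(
    text: str,
    image_paths: list[str],
    audio_paths: list[str],
    doc_paths: list[str],
    base_url: str = "https://example.com/outputs",  # assumed static mount
) -> dict:
    """Shape a MultimodalResponse-style result into a JSON-safe payload.

    Only text travels inline; binary outputs are referenced by URL.
    """
    def to_url(path: str) -> str:
        return f"{base_url}/{Path(path).name}"

    return {
        "text": text,
        "images": [to_url(p) for p in image_paths],
        "audio": [to_url(p) for p in audio_paths],
        "documents": [
            {"url": to_url(p), "download": True} for p in doc_paths
        ],
    }
```

With signed cloud-storage URLs, `to_url` would call the storage SDK instead of joining strings, but the payload shape stays the same.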


#MultimodalOutput #ImageGeneration #TexttoSpeech #DocumentGeneration #DALLE #AgenticAI #LearnAI #AIEngineering

