Learn Agentic AI

Generating Multimodal Outputs: AI Agents That Create Images, Audio, and Documents

Build AI agents that generate rich multimodal outputs including images with DALL-E, speech with TTS, PDF documents, and formatted reports. Learn how to orchestrate multiple generation APIs into cohesive, multi-format responses.

Beyond Text Responses

Most AI agents return plain text. But many real tasks require richer outputs: a marketing agent should deliver copy alongside generated images, a report agent should produce formatted PDFs, and an accessibility agent should provide audio narrations. This guide builds an agent that generates images, audio, and documents as part of its response.

Image Generation with DALL-E

Start with the most common multimodal output — generating images from text descriptions:

import openai
import httpx
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class GeneratedImage:
    url: str
    local_path: str | None = None
    prompt: str = ""
    revised_prompt: str = ""

async def generate_image(
    prompt: str,
    client: openai.AsyncOpenAI,
    size: str = "1024x1024",
    quality: str = "standard",
    save_dir: str = "./outputs",
) -> GeneratedImage:
    """Generate an image using DALL-E 3."""
    response = await client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        quality=quality,
        n=1,
    )

    image_url = response.data[0].url
    # revised_prompt may be absent; fall back to the empty string
    revised = response.data[0].revised_prompt or ""

    # Download and save locally
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = prompt[:50].replace(" ", "_").replace("/", "_")
    local_path = f"{save_dir}/{safe_name}.png"

    async with httpx.AsyncClient() as http:
        img_response = await http.get(image_url)
        with open(local_path, "wb") as f:
            f.write(img_response.content)

    return GeneratedImage(
        url=image_url,
        local_path=local_path,
        prompt=prompt,
        revised_prompt=revised,
    )

Text-to-Speech Generation

For audio output, use OpenAI's TTS API to convert text to natural speech:

@dataclass
class GeneratedAudio:
    local_path: str
    duration_estimate: float
    voice: str
    text: str

async def generate_speech(
    text: str,
    client: openai.AsyncOpenAI,
    voice: str = "alloy",
    save_dir: str = "./outputs",
) -> GeneratedAudio:
    """Generate speech audio from text using OpenAI TTS."""
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = text[:30].replace(" ", "_").replace("/", "_")
    local_path = f"{save_dir}/{safe_name}.mp3"

    response = await client.audio.speech.create(
        model="tts-1-hd",
        voice=voice,
        input=text,
    )

    with open(local_path, "wb") as f:
        f.write(response.content)

    # Rough duration estimate: ~150 words per minute
    word_count = len(text.split())
    duration = word_count / 150 * 60

    return GeneratedAudio(
        local_path=local_path,
        duration_estimate=duration,
        voice=voice,
        text=text,
    )

PDF Document Generation

For structured document output, generate PDFs with formatted text, tables, and embedded images:

from reportlab.lib.pagesizes import letter
from reportlab.platypus import (
    SimpleDocTemplate, Paragraph, Spacer, Table,
    TableStyle, Image as RLImage,
)
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib import colors

@dataclass
class GeneratedDocument:
    local_path: str
    page_count: int
    title: str

def generate_pdf_report(
    title: str,
    sections: list[dict],
    save_dir: str = "./outputs",
    images: list[str] | None = None,
) -> GeneratedDocument:
    """Generate a formatted PDF report.

    Each section: {"heading": str, "body": str, "table": list | None}
    """
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = title[:40].replace(" ", "_")
    path = f"{save_dir}/{safe_name}.pdf"

    doc = SimpleDocTemplate(path, pagesize=letter)
    styles = getSampleStyleSheet()
    story = []

    # Title
    story.append(Paragraph(title, styles["Title"]))
    story.append(Spacer(1, 20))

    for section in sections:
        # Section heading
        story.append(
            Paragraph(section["heading"], styles["Heading2"])
        )
        story.append(Spacer(1, 10))

        # Body text
        story.append(
            Paragraph(section["body"], styles["BodyText"])
        )
        story.append(Spacer(1, 10))

        # Optional table
        if section.get("table"):
            table_data = section["table"]
            t = Table(table_data)
            t.setStyle(TableStyle([
                ("BACKGROUND", (0, 0), (-1, 0), colors.grey),
                ("TEXTCOLOR", (0, 0), (-1, 0), colors.whitesmoke),
                ("GRID", (0, 0), (-1, -1), 1, colors.black),
                ("FONTSIZE", (0, 0), (-1, -1), 9),
            ]))
            story.append(t)
            story.append(Spacer(1, 15))

    # Embed images if provided
    for img_path in (images or []):
        if Path(img_path).exists():
            story.append(RLImage(img_path, width=400, height=300))
            story.append(Spacer(1, 15))

    doc.build(story)

    # The doc template tracks the page number during build
    page_count = doc.page

    return GeneratedDocument(
        local_path=path,
        page_count=page_count,
        title=title,
    )

The Multimodal Output Agent

Bring all generators together into an agent that decides which output formats are appropriate for each request:

@dataclass
class MultimodalResponse:
    text: str
    images: list[GeneratedImage] = field(default_factory=list)
    audio: list[GeneratedAudio] = field(default_factory=list)
    documents: list[GeneratedDocument] = field(default_factory=list)

class MultimodalOutputAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def _plan_outputs(self, query: str) -> dict:
        """Ask the LLM what output formats are appropriate."""
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Decide what output formats to generate. "
                        "Return a JSON object with boolean fields: "
                        "needs_image, needs_audio, needs_document, "
                        "and string fields: image_prompt, "
                        "audio_text, document_title, text_response."
                    ),
                },
                {"role": "user", "content": query},
            ],
            response_format={"type": "json_object"},
        )
        import json
        return json.loads(response.choices[0].message.content)

    async def respond(self, query: str) -> MultimodalResponse:
        plan = await self._plan_outputs(query)

        result = MultimodalResponse(
            text=plan.get("text_response", "")
        )

        # Generate outputs in parallel where possible
        import asyncio
        tasks = []

        if plan.get("needs_image") and plan.get("image_prompt"):
            tasks.append(self._gen_image(plan["image_prompt"]))

        if plan.get("needs_audio") and plan.get("audio_text"):
            tasks.append(self._gen_audio(plan["audio_text"]))

        outputs = await asyncio.gather(*tasks, return_exceptions=True)

        for output in outputs:
            if isinstance(output, GeneratedImage):
                result.images.append(output)
            elif isinstance(output, GeneratedAudio):
                result.audio.append(output)
            elif isinstance(output, Exception):
                result.text += (
                    f"\n\n[Generation error: {output}]"
                )

        if plan.get("needs_document"):
            doc = generate_pdf_report(
                title=plan.get("document_title", "Report"),
                sections=[{
                    "heading": "Content",
                    "body": result.text,
                }],
                images=[
                    img.local_path
                    for img in result.images
                    if img.local_path
                ],
            )
            result.documents.append(doc)

        return result

    async def _gen_image(self, prompt: str) -> GeneratedImage:
        return await generate_image(prompt, self.client)

    async def _gen_audio(self, text: str) -> GeneratedAudio:
        return await generate_speech(text, self.client)

Usage Example

import asyncio

async def main():
    agent = MultimodalOutputAgent()

    response = await agent.respond(
        "Create a brief market analysis report for the AI "
        "industry in 2026, with a cover image and an audio "
        "executive summary."
    )

    print("Text:", response.text[:200])
    print("Images:", [img.local_path for img in response.images])
    print("Audio:", [a.local_path for a in response.audio])
    print("Docs:", [d.local_path for d in response.documents])

asyncio.run(main())

FAQ

How do I control the cost of generating multiple output types per request?

Implement a budget system that tracks estimated costs per generation type. DALL-E 3 costs approximately $0.04 per standard image, TTS costs about $0.015 per 1000 characters, and GPT-4o planning costs standard token rates. Set per-request spending limits and skip optional outputs (like images) when the budget is tight. Also cache generated outputs — if the same image prompt appears twice, serve the cached version instead of regenerating.
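A minimal sketch of such a budget gate, using the cost estimates quoted above (the class name, default limit, and cache shape are illustrative; real prices vary by model and tier):

```python
from dataclasses import dataclass, field

# Illustrative per-unit cost estimates (USD), matching the figures above.
IMAGE_COST = 0.04                   # DALL-E 3, standard 1024x1024
TTS_COST_PER_CHAR = 0.015 / 1000    # ~$0.015 per 1,000 characters

@dataclass
class GenerationBudget:
    """Tracks estimated spend for one request and gates optional outputs."""
    limit_usd: float = 0.10
    spent_usd: float = 0.0
    # prompt -> previously generated result; populated by the caller
    cache: dict = field(default_factory=dict)

    def can_afford(self, cost: float) -> bool:
        return self.spent_usd + cost <= self.limit_usd

    def charge(self, cost: float) -> None:
        self.spent_usd += cost

    def try_image(self, prompt: str) -> bool:
        """True if an image for this prompt should be generated (or reused)."""
        if prompt in self.cache:        # cached outputs are free to serve
            return True
        if not self.can_afford(IMAGE_COST):
            return False                # skip the optional image when tight
        self.charge(IMAGE_COST)
        return True

    def try_tts(self, text: str) -> bool:
        cost = len(text) * TTS_COST_PER_CHAR
        if not self.can_afford(cost):
            return False
        self.charge(cost)
        return True
```

The agent would call `budget.try_image(prompt)` before invoking the image generator and store the result in `budget.cache` afterward, so a repeated prompt is served for free.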


Can I use open-source alternatives instead of OpenAI APIs for generation?

Yes. For image generation, use Stable Diffusion via a local ComfyUI or A1111 server. For TTS, Coqui TTS and Bark provide open-source speech synthesis. For document generation, reportlab (shown above) is already open-source and runs locally with no API calls. Replace the API calls in each generator function with calls to your local model servers while keeping the same return types.
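As a sketch of that swap, here is a drop-in image generator targeting the AUTOMATIC1111 web UI's txt2img endpoint. It assumes a local server on A1111's default port 7860; the payload fields shown are the common txt2img parameters, and the function returns a plain file path that you would wrap in the same GeneratedImage dataclass to keep call sites unchanged:

```python
import base64
import json
import urllib.request
from pathlib import Path

def save_b64_png(b64_data: str, path: str) -> str:
    """Decode a base64-encoded PNG and write it to disk."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_bytes(base64.b64decode(b64_data))
    return path

def generate_image_sd(
    prompt: str,
    save_dir: str = "./outputs",
    base_url: str = "http://localhost:7860",  # A1111 default port
) -> str:
    """Generate an image via a local Stable Diffusion (A1111) server."""
    payload = json.dumps({
        "prompt": prompt,
        "steps": 25,
        "width": 1024,
        "height": 1024,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/sdapi/v1/txt2img",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # A1111 returns base64-encoded images in the "images" list
        images_b64 = json.load(resp)["images"]

    safe_name = prompt[:50].replace(" ", "_").replace("/", "_")
    return save_b64_png(images_b64[0], f"{save_dir}/{safe_name}.png")
```

There is no `url` field here because the image never leaves your machine; everything else about the agent stays the same.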

How do I serve multimodal outputs through a web API?

Return a JSON response with the text content inline and URLs or file paths for binary outputs. For a FastAPI endpoint, upload generated images and audio to cloud storage (S3, GCS) and return signed URLs. Alternatively, serve files directly from the local output directory using FastAPI's StaticFiles mount. For documents, return a download URL that streams the PDF directly to the client.
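A minimal sketch of the payload shaping, assuming binary files are exposed under a static /outputs mount (the base URL and field names are illustrative, not a fixed schema):

```python
from pathlib import Path

def to_api_payload(
    text: str,
    image_paths: list[str],
    audio_paths: list[str],
    doc_paths: list[str],
    base_url: str = "https://example.com/outputs",  # assumed static mount
) -> dict:
    """Shape a MultimodalResponse-style result into a JSON-safe payload.

    Only text travels inline; binary outputs are referenced by URL.
    """
    def to_url(path: str) -> str:
        return f"{base_url}/{Path(path).name}"

    return {
        "text": text,
        "images": [to_url(p) for p in image_paths],
        "audio": [to_url(p) for p in audio_paths],
        "documents": [
            {"url": to_url(p), "download": True} for p in doc_paths
        ],
    }
```

With signed cloud-storage URLs, `to_url` would call the storage SDK instead of joining strings, but the payload shape stays the same.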


#MultimodalOutput #ImageGeneration #TexttoSpeech #DocumentGeneration #DALLE #AgenticAI #LearnAI #AIEngineering

