Learn Agentic AI

Building a Multi-Input Agent: Combining User Text with Uploaded Files for Rich Interactions

Build a multi-input AI agent that handles user text alongside uploaded files of any format. Learn file upload handling, automatic format detection, unified processing pipelines, and how to generate contextual responses from mixed inputs.

The Multi-Input Problem

Most AI chat interfaces accept text only. But real user needs often involve files: "Here is my resume, can you help me improve it?", "What does this error log mean?", or "Analyze this CSV and tell me the trends." A multi-input agent must accept text and files together, detect what each file contains, process it appropriately, and generate a response that meaningfully integrates all inputs.

File Format Detection

The first step is reliably identifying what the user uploaded. MIME type detection combined with content inspection handles the vast majority of formats:

import magic  # python-magic: identifies MIME type from file bytes
from dataclasses import dataclass
from enum import Enum
from pathlib import Path

class FileCategory(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    DOCUMENT = "document"  # PDF, DOCX
    SPREADSHEET = "spreadsheet"  # CSV, XLSX
    CODE = "code"
    ARCHIVE = "archive"
    UNKNOWN = "unknown"

@dataclass
class DetectedFile:
    filename: str
    mime_type: str
    category: FileCategory
    size_bytes: int
    content: bytes

MIME_CATEGORY_MAP = {
    "application/pdf": FileCategory.DOCUMENT,
    # Adjacent string literals concatenate, so each wrapped pair below
    # forms one full OOXML MIME type (.docx and .xlsx respectively).
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document": FileCategory.DOCUMENT,
    "text/csv": FileCategory.SPREADSHEET,
    "application/vnd.openxmlformats-officedocument"
    ".spreadsheetml.sheet": FileCategory.SPREADSHEET,
}

CODE_EXTENSIONS = {
    ".py", ".js", ".ts", ".java", ".go", ".rs",
    ".rb", ".cpp", ".c", ".h", ".sql", ".sh",
}

def detect_file(filename: str, content: bytes) -> DetectedFile:
    """Detect the type and category of an uploaded file."""
    mime = magic.from_buffer(content, mime=True)
    ext = Path(filename).suffix.lower()

    # Check extension-based overrides
    if ext in CODE_EXTENSIONS:
        category = FileCategory.CODE
    elif mime in MIME_CATEGORY_MAP:
        category = MIME_CATEGORY_MAP[mime]
    elif mime.startswith("image/"):
        category = FileCategory.IMAGE
    elif mime.startswith("audio/"):
        category = FileCategory.AUDIO
    elif mime.startswith("video/"):
        category = FileCategory.VIDEO
    elif mime.startswith("text/"):
        category = FileCategory.TEXT
    else:
        category = FileCategory.UNKNOWN

    return DetectedFile(
        filename=filename,
        mime_type=mime,
        category=category,
        size_bytes=len(content),
        content=content,
    )
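python-magic depends on the native libmagic library. Where that isn't available, a stdlib-only approximation built on `mimetypes.guess_type` plus the same extension set can serve as a fallback. This is a sketch, not part of the pipeline above, and it is weaker: it trusts the filename rather than inspecting the bytes, so mislabeled uploads will fool it.

```python
import mimetypes
from pathlib import Path

# Stdlib-only fallback: guesses from the filename alone, unlike
# magic.from_buffer, which inspects the actual file content.
CODE_EXTS = {".py", ".js", ".ts", ".java", ".go", ".rs", ".rb",
             ".cpp", ".c", ".h", ".sql", ".sh"}

def guess_category(filename: str) -> str:
    ext = Path(filename).suffix.lower()
    if ext in CODE_EXTS:
        return "code"
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        return "unknown"
    if mime == "application/pdf":
        return "document"
    if mime == "text/csv":
        return "spreadsheet"
    for prefix in ("image", "audio", "video", "text"):
        if mime.startswith(prefix + "/"):
            return prefix
    return "unknown"

print(guess_category("report.pdf"))  # document
print(guess_category("main.py"))     # code
print(guess_category("photo.jpg"))   # image
```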

Category-Specific Processors

Each file category has a dedicated processor that extracts content into a text representation the LLM can reason over:


import csv
import io

async def process_text_file(file: DetectedFile) -> str:
    text = file.content.decode("utf-8", errors="replace")
    if len(text) > 50000:
        text = text[:50000] + "\n... [truncated]"
    return f"Contents of {file.filename}:\n{text}"

async def process_code_file(file: DetectedFile) -> str:
    code = file.content.decode("utf-8", errors="replace")
    ext = Path(file.filename).suffix.lstrip(".")
    return (
        f"Code file: {file.filename}\n"
        f"Language: {ext}\n"
        f"Lines: {code.count(chr(10)) + 1}\n"
        f"~~~{ext}\n{code}\n~~~"
    )

async def process_csv_file(file: DetectedFile) -> str:
    text = file.content.decode("utf-8", errors="replace")
    reader = csv.reader(io.StringIO(text))
    rows = list(reader)

    if not rows:
        return f"{file.filename}: empty CSV"

    header = rows[0]
    preview_rows = rows[1:11]  # First 10 data rows

    lines = [
        f"CSV file: {file.filename}",
        f"Columns: {', '.join(header)}",
        f"Total rows: {len(rows) - 1}",
        "",
        "Preview (first 10 rows):",
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in preview_rows:
        lines.append("| " + " | ".join(row) + " |")

    return "\n".join(lines)

PROCESSORS = {
    FileCategory.TEXT: process_text_file,
    FileCategory.CODE: process_code_file,
    FileCategory.SPREADSHEET: process_csv_file,
}
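To make the prompt representation concrete, here is the CSV processor exercised standalone on a tiny invented sample. The function body restates the logic of `process_csv_file` above so the sketch runs on its own:

```python
import asyncio
import csv
import io

# Standalone restatement of process_csv_file's logic for illustration;
# the sample bytes below are invented.
async def csv_to_markdown(filename: str, content: bytes) -> str:
    text = content.decode("utf-8", errors="replace")
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return f"{filename}: empty CSV"
    header, data = rows[0], rows[1:11]
    lines = [
        f"CSV file: {filename}",
        f"Columns: {', '.join(header)}",
        f"Total rows: {len(rows) - 1}",
        "",
        "Preview (first 10 rows):",
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in data]
    return "\n".join(lines)

sample = b"region,revenue\nnorth,1200\nsouth,950\n"
print(asyncio.run(csv_to_markdown("sales.csv", sample)))
```

The markdown-table preview is deliberate: LLMs handle pipe-delimited tables far better than raw comma-separated text.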

The Unified Processing Pipeline

Bring file detection, processing, and LLM reasoning together:

import openai

class MultiInputAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def _process_file(self, file: DetectedFile) -> str:
        processor = PROCESSORS.get(file.category)
        if processor:
            return await processor(file)

        # Fallback: describe what we received
        return (
            f"File: {file.filename} "
            f"({file.category.value}, {file.size_bytes} bytes)"
        )

    async def chat(
        self,
        user_message: str,
        files: list[tuple[str, bytes]] | None = None,
    ) -> str:
        """Process user text and optional file uploads."""
        # Detect and process all files
        file_contexts = []
        image_parts = []

        for filename, content in (files or []):
            detected = detect_file(filename, content)

            if detected.category == FileCategory.IMAGE:
                import base64
                b64 = base64.b64encode(content).decode()
                image_parts.append({
                    "type": "image_url",
                    "image_url": {
                        "url": (
                            f"data:{detected.mime_type};"
                            f"base64,{b64}"
                        )
                    },
                })
            else:
                processed = await self._process_file(detected)
                file_contexts.append(processed)

        # Build the prompt
        parts = []
        if file_contexts:
            parts.append(
                "Uploaded file contents:\n\n"
                + "\n\n---\n\n".join(file_contexts)
            )
        parts.append(f"User message: {user_message}")

        content = [{"type": "text", "text": "\n\n".join(parts)}]
        content.extend(image_parts)

        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a helpful assistant that analyzes "
                        "user messages along with any uploaded files. "
                        "Reference specific file contents in your "
                        "response."
                    ),
                },
                {"role": "user", "content": content},
            ],
        )
        return response.choices[0].message.content
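The prompt-assembly half of `chat()` can be checked without calling the API. This sketch extracts that step into a standalone function (the helper name `build_content` and the sample inputs are invented for illustration):

```python
import base64

def build_content(user_message: str,
                  file_contexts: list[str],
                  images: list[tuple[str, bytes]]) -> list[dict]:
    """Mirror of the message-building step in MultiInputAgent.chat:
    text contexts are joined into one text part, images become
    base64 data-URL parts."""
    parts = []
    if file_contexts:
        parts.append(
            "Uploaded file contents:\n\n" + "\n\n---\n\n".join(file_contexts)
        )
    parts.append(f"User message: {user_message}")
    content = [{"type": "text", "text": "\n\n".join(parts)}]
    for mime, raw in images:
        b64 = base64.b64encode(raw).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"},
        })
    return content

content = build_content(
    "Summarize these",
    ["Contents of notes.txt:\nhello"],
    [("image/png", b"\x89PNG fake bytes")],
)
print(content[0]["type"], content[1]["type"])
```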

FastAPI Endpoint

Expose the agent through a web API that accepts multipart form data:

from fastapi import FastAPI, UploadFile, File, Form
from typing import Annotated

app = FastAPI()
agent = MultiInputAgent()

@app.post("/chat")
async def chat_endpoint(
    message: Annotated[str, Form()],
    files: Annotated[list[UploadFile], File()] = [],
):
    file_data = []
    for f in files:
        content = await f.read()
        file_data.append((f.filename, content))

    response = await agent.chat(message, file_data)
    return {"response": response}

FAQ

How do I handle very large files that exceed the LLM context window?

For large files, implement a summarization or chunking strategy. For text and code files, truncate to the first and last sections with a note about what was omitted. For CSVs, show the schema plus a statistical summary (column types, min, max, mean) instead of raw rows. For PDFs, extract only the pages most relevant to the user's question using keyword matching against the query.
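The keep-the-start-and-end strategy for text and code files can be sketched in a few lines (the size thresholds here are arbitrary defaults, not recommendations):

```python
def truncate_middle(text: str, head: int = 4000, tail: int = 1000) -> str:
    """Keep the start and end of an oversized file, noting what was cut.
    The head usually carries imports/headers; the tail often carries
    conclusions or the most recent log lines."""
    if len(text) <= head + tail:
        return text
    omitted = len(text) - head - tail
    return (
        text[:head]
        + f"\n... [{omitted} characters omitted] ...\n"
        + text[-tail:]
    )

doc = "x" * 10_000
out = truncate_middle(doc)
print(len(out) < len(doc))  # True
```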


What security considerations are important for file upload agents?

Never execute uploaded files or evaluate their contents as code. Validate file sizes (reject uploads over a reasonable limit like 50MB). Scan for malware if the system is exposed to the public. Sanitize filenames to prevent path traversal attacks. Process files in isolated temporary directories and clean them up after processing. Never store raw uploads permanently unless explicitly required.
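The filename-sanitization point deserves code. A minimal sketch (the allow-listed character set is one reasonable choice, not a standard):

```python
import re
from pathlib import PurePosixPath

def sanitize_filename(name: str) -> str:
    """Strip directory components and unsafe characters from an
    uploaded filename to block path traversal (e.g. '../../etc/passwd')."""
    # Drop any path components the client supplied (handle both / and \).
    base = PurePosixPath(name.replace("\\", "/")).name
    # Allow only a conservative character set.
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    # Refuse names that are empty or only dots after cleaning.
    if not base.strip("."):
        return "upload"
    return base

print(sanitize_filename("../../etc/passwd"))    # passwd
print(sanitize_filename("report (final).pdf"))  # report__final_.pdf
```

Apply this before the filename ever touches a filesystem path or log line, not just before storage.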

Can this agent maintain context across multiple messages with different file uploads?

Yes. Add a conversation history that stores both messages and processed file contexts. On each new message, include the relevant prior context in the prompt. For efficiency, store processed file summaries rather than raw file contents in the history, and allow the user to reference previously uploaded files by name without re-uploading them.
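A minimal shape for that history, storing summaries rather than raw bytes, might look like this (a sketch; the naive substring match for relevance is a placeholder for real retrieval, and all names here are invented):

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str
    file_summaries: dict[str, str] = field(default_factory=dict)

@dataclass
class Conversation:
    """Keeps processed file summaries, not raw uploads, so earlier
    files can be referenced by name in later turns."""
    turns: list[Turn] = field(default_factory=list)
    known_files: dict[str, str] = field(default_factory=dict)

    def add_user_turn(self, text: str, file_summaries: dict[str, str]) -> None:
        self.known_files.update(file_summaries)
        self.turns.append(Turn("user", text, file_summaries))

    def context_for(self, message: str) -> str:
        # Naive relevance check: include a prior summary when the new
        # message mentions that file's name.
        relevant = [s for name, s in self.known_files.items() if name in message]
        return "\n\n".join(relevant)

conv = Conversation()
conv.add_user_turn("Here is my resume", {"resume.pdf": "Summary: 5 years Python..."})
print(conv.context_for("Improve resume.pdf please"))
```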


#MultiInputAgent #FileProcessing #FormatDetection #FastAPI #Python #AgenticAI #LearnAI #AIEngineering
