Learn Agentic AI

Building a Multi-Input Agent: Combining User Text with Uploaded Files for Rich Interactions

Build a multi-input AI agent that handles user text alongside uploaded files of any format. Learn file upload handling, automatic format detection, unified processing pipelines, and how to generate contextual responses from mixed inputs.

The Multi-Input Problem

Most AI chat interfaces accept text only. But real user needs often involve files: "Here is my resume, can you help me improve it?", "What does this error log mean?", or "Analyze this CSV and tell me the trends." A multi-input agent must accept text and files together, detect what each file contains, process it appropriately, and generate a response that meaningfully integrates all inputs.

File Format Detection

The first step is reliably identifying what the user uploaded. MIME type detection combined with content inspection handles the vast majority of formats:

import magic  # python-magic: identifies MIME type from file bytes
from dataclasses import dataclass
from enum import Enum
from pathlib import Path

class FileCategory(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    DOCUMENT = "document"  # PDF, DOCX
    SPREADSHEET = "spreadsheet"  # CSV, XLSX
    CODE = "code"
    ARCHIVE = "archive"
    UNKNOWN = "unknown"

@dataclass
class DetectedFile:
    filename: str
    mime_type: str
    category: FileCategory
    size_bytes: int
    content: bytes

MIME_CATEGORY_MAP = {
    "application/pdf": FileCategory.DOCUMENT,
    # Adjacent string literals concatenate, so each wrapped pair below
    # forms one full OOXML MIME type (.docx and .xlsx respectively).
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document": FileCategory.DOCUMENT,
    "text/csv": FileCategory.SPREADSHEET,
    "application/vnd.openxmlformats-officedocument"
    ".spreadsheetml.sheet": FileCategory.SPREADSHEET,
}

CODE_EXTENSIONS = {
    ".py", ".js", ".ts", ".java", ".go", ".rs",
    ".rb", ".cpp", ".c", ".h", ".sql", ".sh",
}

def detect_file(filename: str, content: bytes) -> DetectedFile:
    """Detect the type and category of an uploaded file."""
    mime = magic.from_buffer(content, mime=True)
    ext = Path(filename).suffix.lower()

    # Check extension-based overrides
    if ext in CODE_EXTENSIONS:
        category = FileCategory.CODE
    elif mime in MIME_CATEGORY_MAP:
        category = MIME_CATEGORY_MAP[mime]
    elif mime.startswith("image/"):
        category = FileCategory.IMAGE
    elif mime.startswith("audio/"):
        category = FileCategory.AUDIO
    elif mime.startswith("video/"):
        category = FileCategory.VIDEO
    elif mime.startswith("text/"):
        category = FileCategory.TEXT
    else:
        category = FileCategory.UNKNOWN

    return DetectedFile(
        filename=filename,
        mime_type=mime,
        category=category,
        size_bytes=len(content),
        content=content,
    )
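python-magic depends on the native libmagic library. Where that isn't available, a stdlib-only approximation built on `mimetypes.guess_type` plus the same extension set can serve as a fallback. This is a sketch, not part of the pipeline above, and it is weaker: it trusts the filename rather than inspecting the bytes, so mislabeled uploads will fool it.

```python
import mimetypes
from pathlib import Path

# Stdlib-only fallback: guesses from the filename alone, unlike
# magic.from_buffer, which inspects the actual file content.
CODE_EXTS = {".py", ".js", ".ts", ".java", ".go", ".rs", ".rb",
             ".cpp", ".c", ".h", ".sql", ".sh"}

def guess_category(filename: str) -> str:
    ext = Path(filename).suffix.lower()
    if ext in CODE_EXTS:
        return "code"
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        return "unknown"
    if mime == "application/pdf":
        return "document"
    if mime == "text/csv":
        return "spreadsheet"
    for prefix in ("image", "audio", "video", "text"):
        if mime.startswith(prefix + "/"):
            return prefix
    return "unknown"

print(guess_category("report.pdf"))  # document
print(guess_category("main.py"))     # code
print(guess_category("photo.jpg"))   # image
```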

Category-Specific Processors

Each file category has a dedicated processor that extracts content into a text representation the LLM can reason over:


import csv
import io

async def process_text_file(file: DetectedFile) -> str:
    text = file.content.decode("utf-8", errors="replace")
    if len(text) > 50000:
        text = text[:50000] + "\n... [truncated]"
    return f"Contents of {file.filename}:\n{text}"

async def process_code_file(file: DetectedFile) -> str:
    code = file.content.decode("utf-8", errors="replace")
    ext = Path(file.filename).suffix.lstrip(".")
    return (
        f"Code file: {file.filename}\n"
        f"Language: {ext}\n"
        f"Lines: {code.count(chr(10)) + 1}\n"
        f"~~~{ext}\n{code}\n~~~"
    )

async def process_csv_file(file: DetectedFile) -> str:
    text = file.content.decode("utf-8", errors="replace")
    reader = csv.reader(io.StringIO(text))
    rows = list(reader)

    if not rows:
        return f"{file.filename}: empty CSV"

    header = rows[0]
    preview_rows = rows[1:11]  # First 10 data rows

    lines = [
        f"CSV file: {file.filename}",
        f"Columns: {', '.join(header)}",
        f"Total rows: {len(rows) - 1}",
        "",
        "Preview (first 10 rows):",
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in preview_rows:
        lines.append("| " + " | ".join(row) + " |")

    return "\n".join(lines)

PROCESSORS = {
    FileCategory.TEXT: process_text_file,
    FileCategory.CODE: process_code_file,
    FileCategory.SPREADSHEET: process_csv_file,
}
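To make the prompt representation concrete, here is the CSV processor exercised standalone on a tiny invented sample. The function body restates the logic of `process_csv_file` above so the sketch runs on its own:

```python
import asyncio
import csv
import io

# Standalone restatement of process_csv_file's logic for illustration;
# the sample bytes below are invented.
async def csv_to_markdown(filename: str, content: bytes) -> str:
    text = content.decode("utf-8", errors="replace")
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return f"{filename}: empty CSV"
    header, data = rows[0], rows[1:11]
    lines = [
        f"CSV file: {filename}",
        f"Columns: {', '.join(header)}",
        f"Total rows: {len(rows) - 1}",
        "",
        "Preview (first 10 rows):",
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in data]
    return "\n".join(lines)

sample = b"region,revenue\nnorth,1200\nsouth,950\n"
print(asyncio.run(csv_to_markdown("sales.csv", sample)))
```

The markdown-table preview is deliberate: LLMs handle pipe-delimited tables far better than raw comma-separated text.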

The Unified Processing Pipeline

Bring file detection, processing, and LLM reasoning together:

import openai

class MultiInputAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def _process_file(self, file: DetectedFile) -> str:
        processor = PROCESSORS.get(file.category)
        if processor:
            return await processor(file)

        # Fallback: describe what we received
        return (
            f"File: {file.filename} "
            f"({file.category.value}, {file.size_bytes} bytes)"
        )

    async def chat(
        self,
        user_message: str,
        files: list[tuple[str, bytes]] | None = None,
    ) -> str:
        """Process user text and optional file uploads."""
        # Detect and process all files
        file_contexts = []
        image_parts = []

        for filename, content in (files or []):
            detected = detect_file(filename, content)

            if detected.category == FileCategory.IMAGE:
                import base64
                b64 = base64.b64encode(content).decode()
                image_parts.append({
                    "type": "image_url",
                    "image_url": {
                        "url": (
                            f"data:{detected.mime_type};"
                            f"base64,{b64}"
                        )
                    },
                })
            else:
                processed = await self._process_file(detected)
                file_contexts.append(processed)

        # Build the prompt
        parts = []
        if file_contexts:
            parts.append(
                "Uploaded file contents:\n\n"
                + "\n\n---\n\n".join(file_contexts)
            )
        parts.append(f"User message: {user_message}")

        content = [{"type": "text", "text": "\n\n".join(parts)}]
        content.extend(image_parts)

        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a helpful assistant that analyzes "
                        "user messages along with any uploaded files. "
                        "Reference specific file contents in your "
                        "response."
                    ),
                },
                {"role": "user", "content": content},
            ],
        )
        return response.choices[0].message.content
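The prompt-assembly half of `chat()` can be checked without calling the API. This sketch extracts that step into a standalone function (the helper name `build_content` and the sample inputs are invented for illustration):

```python
import base64

def build_content(user_message: str,
                  file_contexts: list[str],
                  images: list[tuple[str, bytes]]) -> list[dict]:
    """Mirror of the message-building step in MultiInputAgent.chat:
    text contexts are joined into one text part, images become
    base64 data-URL parts."""
    parts = []
    if file_contexts:
        parts.append(
            "Uploaded file contents:\n\n" + "\n\n---\n\n".join(file_contexts)
        )
    parts.append(f"User message: {user_message}")
    content = [{"type": "text", "text": "\n\n".join(parts)}]
    for mime, raw in images:
        b64 = base64.b64encode(raw).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"},
        })
    return content

content = build_content(
    "Summarize these",
    ["Contents of notes.txt:\nhello"],
    [("image/png", b"\x89PNG fake bytes")],
)
print(content[0]["type"], content[1]["type"])
```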

FastAPI Endpoint

Expose the agent through a web API that accepts multipart form data:

from fastapi import FastAPI, UploadFile, File, Form
from typing import Annotated

app = FastAPI()
agent = MultiInputAgent()

@app.post("/chat")
async def chat_endpoint(
    message: Annotated[str, Form()],
    files: Annotated[list[UploadFile], File()] = [],
):
    file_data = []
    for f in files:
        content = await f.read()
        file_data.append((f.filename, content))

    response = await agent.chat(message, file_data)
    return {"response": response}

FAQ

How do I handle very large files that exceed the LLM context window?

For large files, implement a summarization or chunking strategy. For text and code files, truncate to the first and last sections with a note about what was omitted. For CSVs, show the schema plus a statistical summary (column types, min, max, mean) instead of raw rows. For PDFs, extract only the pages most relevant to the user's question using keyword matching against the query.
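The keep-the-start-and-end strategy for text and code files can be sketched in a few lines (the size thresholds here are arbitrary defaults, not recommendations):

```python
def truncate_middle(text: str, head: int = 4000, tail: int = 1000) -> str:
    """Keep the start and end of an oversized file, noting what was cut.
    The head usually carries imports/headers; the tail often carries
    conclusions or the most recent log lines."""
    if len(text) <= head + tail:
        return text
    omitted = len(text) - head - tail
    return (
        text[:head]
        + f"\n... [{omitted} characters omitted] ...\n"
        + text[-tail:]
    )

doc = "x" * 10_000
out = truncate_middle(doc)
print(len(out) < len(doc))  # True
```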


What security considerations are important for file upload agents?

Never execute uploaded files or evaluate their contents as code. Validate file sizes (reject uploads over a reasonable limit like 50MB). Scan for malware if the system is exposed to the public. Sanitize filenames to prevent path traversal attacks. Process files in isolated temporary directories and clean them up after processing. Never store raw uploads permanently unless explicitly required.
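The filename-sanitization point deserves code. A minimal sketch (the allow-listed character set is one reasonable choice, not a standard):

```python
import re
from pathlib import PurePosixPath

def sanitize_filename(name: str) -> str:
    """Strip directory components and unsafe characters from an
    uploaded filename to block path traversal (e.g. '../../etc/passwd')."""
    # Drop any path components the client supplied (handle both / and \).
    base = PurePosixPath(name.replace("\\", "/")).name
    # Allow only a conservative character set.
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    # Refuse names that are empty or only dots after cleaning.
    if not base.strip("."):
        return "upload"
    return base

print(sanitize_filename("../../etc/passwd"))    # passwd
print(sanitize_filename("report (final).pdf"))  # report__final_.pdf
```

Apply this before the filename ever touches a filesystem path or log line, not just before storage.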

Can this agent maintain context across multiple messages with different file uploads?

Yes. Add a conversation history that stores both messages and processed file contexts. On each new message, include the relevant prior context in the prompt. For efficiency, store processed file summaries rather than raw file contents in the history, and allow the user to reference previously uploaded files by name without re-uploading them.
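A minimal shape for that history, storing summaries rather than raw bytes, might look like this (a sketch; the naive substring match for relevance is a placeholder for real retrieval, and all names here are invented):

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str
    file_summaries: dict[str, str] = field(default_factory=dict)

@dataclass
class Conversation:
    """Keeps processed file summaries, not raw uploads, so earlier
    files can be referenced by name in later turns."""
    turns: list[Turn] = field(default_factory=list)
    known_files: dict[str, str] = field(default_factory=dict)

    def add_user_turn(self, text: str, file_summaries: dict[str, str]) -> None:
        self.known_files.update(file_summaries)
        self.turns.append(Turn("user", text, file_summaries))

    def context_for(self, message: str) -> str:
        # Naive relevance check: include a prior summary when the new
        # message mentions that file's name.
        relevant = [s for name, s in self.known_files.items() if name in message]
        return "\n\n".join(relevant)

conv = Conversation()
conv.add_user_turn("Here is my resume", {"resume.pdf": "Summary: 5 years Python..."})
print(conv.context_for("Improve resume.pdf please"))
```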


#MultiInputAgent #FileProcessing #FormatDetection #FastAPI #Python #AgenticAI #LearnAI #AIEngineering
