Learn Agentic AI

Prompt Compression Techniques: Reducing Token Count by 50% Without Quality Loss

Master prompt compression methods including selective context pruning, abstractive compression, and structural compression to halve your token costs while maintaining output quality. Practical Python implementations included.

The Token Cost Problem

Every token in your prompt costs money. For agents that include conversation history, RAG context, tool outputs, and system instructions, prompts routinely hit 10,000–50,000 tokens. At GPT-4o's input pricing of $2.50 per million tokens, a 30,000-token prompt costs about $0.075 per request. Serve 100,000 requests per day and that is $7,500 a day, roughly $225,000 a month, just for input tokens.

Prompt compression reduces token count while preserving the information the model needs. Done well, you can cut token counts by 40–60% with negligible quality impact.

Technique 1: Selective Context Pruning

Not all context is equally important. Prune low-relevance content before sending it to the model.

from typing import List, Tuple

class SelectiveContextPruner:
    """Prune context passages by relevance score."""

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens

    def estimate_tokens(self, text: str) -> int:
        # Rough heuristic: English runs ~4/3 tokens per word.
        # Use tiktoken for exact counts when precision matters.
        return len(text.split()) * 4 // 3

    def prune_by_relevance(
        self,
        passages: List[Tuple[str, float]],  # (text, relevance_score)
    ) -> List[str]:
        sorted_passages = sorted(passages, key=lambda x: x[1], reverse=True)
        selected = []
        total_tokens = 0
        for text, _score in sorted_passages:
            tokens = self.estimate_tokens(text)
            if total_tokens + tokens <= self.max_tokens:
                selected.append(text)
                total_tokens += tokens
            else:
                break
        return selected

    def prune_conversation_history(
        self,
        messages: List[dict],
        keep_last_n: int = 4,
        keep_system: bool = True,
    ) -> List[dict]:
        system_msgs = [m for m in messages if m["role"] == "system"] if keep_system else []
        non_system = [m for m in messages if m["role"] != "system"]
        recent = non_system[-keep_last_n:] if len(non_system) > keep_last_n else non_system
        return system_msgs + recent

pruner = SelectiveContextPruner(max_tokens=3000)
passages = [
    ("The product supports SSO via SAML 2.0 and OIDC.", 0.92),
    ("Our office is located in San Francisco.", 0.15),
    ("Pricing starts at $49/month per seat.", 0.88),
    ("The company was founded in 2019.", 0.20),
    ("API rate limits are 1000 req/min on the Pro plan.", 0.85),
]
selected = pruner.prune_by_relevance(passages)
print(f"Kept {len(selected)} of {len(passages)} passages")

Technique 2: Abstractive Compression

Use a cheap model to summarize verbose context before passing it to the main model. This trades a small cheap-model call for significant token savings on the expensive call.

from typing import Tuple

import openai

class AbstractiveCompressor:
    def __init__(self, client: openai.OpenAI, model: str = "gpt-4o-mini"):
        self.client = client
        self.model = model

    def compress_context(self, context: str, max_summary_tokens: int = 500) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Compress the following context into a dense summary. "
                        "Preserve all facts, numbers, names, and relationships. "
                        "Remove filler words, redundancies, and formatting. "
                        "Output only the compressed version."
                    ),
                },
                {"role": "user", "content": context},
            ],
            max_tokens=max_summary_tokens,
            temperature=0,
        )
        return response.choices[0].message.content

    def compress_if_beneficial(
        self,
        context: str,
        threshold_tokens: int = 2000,
    ) -> Tuple[str, dict]:
        est_tokens = len(context.split()) * 4 // 3
        if est_tokens <= threshold_tokens:
            return context, {"compressed": False, "original_tokens": est_tokens}
        compressed = self.compress_context(context)
        compressed_tokens = len(compressed.split()) * 4 // 3
        return compressed, {
            "compressed": True,
            "original_tokens": est_tokens,
            "compressed_tokens": compressed_tokens,
            "reduction_pct": round((1 - compressed_tokens / est_tokens) * 100, 1),
        }

Technique 3: Structural Compression

Remove formatting that consumes tokens without adding information value.

import re

def compress_structural(text: str) -> str:
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r' {2,}', ' ', text)
    text = re.sub(r'^#{1,6} ', '', text, flags=re.MULTILINE)  # remove markdown headers
    text = re.sub(r'\*{1,2}([^*]+)\*{1,2}', r'\1', text)  # remove bold/italic
    text = re.sub(r'^[-*] ', '', text, flags=re.MULTILINE)  # remove list markers
    return text.strip()

def compress_json_output(json_str: str) -> str:
    """Remove whitespace from JSON tool outputs."""
    import json
    try:
        data = json.loads(json_str)
        return json.dumps(data, separators=(',', ':'))
    except json.JSONDecodeError:
        return json_str
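As a quick illustration of how much whitespace a pretty-printed tool output carries, the payload values here are invented for the example:

```python
import json

# A pretty-printed tool output, as many APIs emit by default
pretty = json.dumps(
    {"plan": "Pro", "rate_limit": 1000, "sso": ["SAML 2.0", "OIDC"]},
    indent=2,
)

# Minified form: identical data, no cosmetic whitespace
minified = json.dumps(json.loads(pretty), separators=(',', ':'))

print(len(pretty), len(minified))  # minified is noticeably shorter
```

Every newline and indentation space in the pretty form is a token the model never needed.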

Measuring Compression Quality

Always validate that compression does not degrade response quality. Run an A/B test comparing full-context and compressed-context responses.

from dataclasses import dataclass

@dataclass
class CompressionResult:
    original_tokens: int
    compressed_tokens: int
    quality_score: float  # 0.0 to 1.0
    cost_saved_per_request: float

    @property
    def compression_ratio(self) -> float:
        return 1 - (self.compressed_tokens / self.original_tokens)

    @property
    def is_acceptable(self) -> bool:
        return self.quality_score >= 0.85 and self.compression_ratio >= 0.25
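A minimal version of that A/B test can be sketched as below. `ABSample`, `token_overlap`, and `ab_quality_score` are illustrative names, and the word-overlap metric is a deliberately crude stand-in for a real evaluator (LLM-as-judge, embedding similarity, or task-specific checks):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ABSample:
    """One paired observation: the same request answered twice."""
    answer_full: str        # response from the full-context prompt
    answer_compressed: str  # response from the compressed-context prompt

def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased words, a crude quality proxy."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def ab_quality_score(samples: List[ABSample]) -> float:
    """Average agreement between full- and compressed-context answers."""
    if not samples:
        return 0.0
    return sum(
        token_overlap(s.answer_full, s.answer_compressed) for s in samples
    ) / len(samples)
```

Feed the resulting score into `CompressionResult.quality_score` and reject any compression setting that falls below your acceptance threshold.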

FAQ

How much quality degradation should I accept from compression?

Target less than 5% quality degradation as measured by automated evaluation or human review. If your quality score drops below 0.85 on a 0–1 scale, the compression is too aggressive. Start conservative and increase compression gradually while monitoring quality metrics.


Is it worth using a paid API call just to compress the context?

Yes, when the context is large enough. If compressing 10,000 tokens of context down to 3,000 tokens costs $0.001 with GPT-4o-mini but saves $0.017 in GPT-4o input costs, the net saving is $0.016 per request. At scale, this compounds significantly.
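That break-even arithmetic can be checked directly. In this sketch, the per-1K-token prices are assumptions based on the rates quoted above and should be updated against current pricing, and the compressor's own output-token cost is ignored for simplicity:

```python
def compression_net_saving(
    original_tokens: int,
    compressed_tokens: int,
    main_input_price_per_1k: float = 0.0025,    # assumed GPT-4o input rate
    cheap_input_price_per_1k: float = 0.00015,  # assumed GPT-4o-mini input rate
) -> float:
    """Net dollars saved per request by compressing before the main call."""
    tokens_saved = original_tokens - compressed_tokens
    main_saving = tokens_saved / 1000 * main_input_price_per_1k
    compression_cost = original_tokens / 1000 * cheap_input_price_per_1k
    return main_saving - compression_cost

# Compressing 10,000 tokens down to 3,000 nets roughly $0.016 per request
print(round(compression_net_saving(10_000, 3_000), 4))
```

The saving scales with both the compression ratio and the price gap between the cheap and main models, so it grows as context gets larger.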

Should I compress system prompts or just user context?

System prompts are usually already concise and carefully tuned, so compressing them risks degrading the model’s behavior. Focus compression on RAG context, conversation history, and tool outputs — these are the sources of token bloat in most agent systems.


