Attention Mechanisms Explained: From Self-Attention to Multi-Query
The evolution of attention from the original transformer to 2026's multi-query and grouped-query variants — what changed and why it matters.
What Self-Attention Does
Self-attention lets each token attend to every other token in the sequence. It is the operation that gave transformers their power: tokens can directly reference each other regardless of distance in the sequence.
By 2026 the original "attention is all you need" formulation has evolved through many variants. This piece walks through the lineage: self-attention → multi-head → multi-query → grouped-query → multi-head latent.
Self-Attention
For a sequence of N tokens with hidden dimension D:
```mermaid
flowchart LR
    Tokens[N tokens] --> Q[Q matrix]
    Tokens --> K[K matrix]
    Tokens --> V[V matrix]
    Q --> Score[Q dot K]
    Score --> Soft[softmax]
    Soft --> Apply[apply to V]
    Apply --> Out[Output]
```
Each token produces a query (Q), key (K), and value (V) vector. Attention scores are computed as Q · Kᵀ / sqrt(D), softmaxed row-wise, and used to weight V.
Cost: O(N²) compute and memory. Manageable for short sequences; expensive at long ones.
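As a concrete reference point, here is a minimal single-head self-attention sketch in NumPy. The weight matrices are random stand-ins for learned parameters, and the function name is ours, not from any library:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a sequence x of shape (N, D)."""
    Q = x @ Wq  # (N, D) queries
    K = x @ Wk  # (N, D) keys
    V = x @ Wv  # (N, D) values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (N, N): the O(N^2) term
    # Row-wise softmax, shifted by the row max for numerical stability.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V  # (N, D) output

rng = np.random.default_rng(0)
N, D = 8, 64
x = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)  # shape (8, 64)
```

The (N, N) score matrix is where both the quadratic compute and the quadratic memory come from.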
Multi-Head Attention
Run self-attention multiple times in parallel with different learned projections. Each "head" learns a different attention pattern. Concatenate the head outputs and apply a final output projection.
Benefit: different heads can specialize (some focus on syntax, some on semantics).
Cost: in the standard formulation each head works in a D/H-dimensional subspace, so parameters and compute stay roughly the same as single-head attention at full dimension; the cost that matters later is the KV cache, which must store K and V for all H heads.
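To make the head split concrete, a NumPy sketch in the same style as above (Wq, Wk, Wv, Wo are random stand-ins for learned D×D weights; note they are no larger than in the single-head case):

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, H):
    """Split D into H heads of size D // H; attend per head; re-merge."""
    N, D = x.shape
    d_h = D // H
    # Same-sized projections as single-head; the reshape carves out heads.
    Q = (x @ Wq).reshape(N, H, d_h).transpose(1, 0, 2)  # (H, N, d_h)
    K = (x @ Wk).reshape(N, H, d_h).transpose(1, 0, 2)
    V = (x @ Wv).reshape(N, H, d_h).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)    # (H, N, N)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    heads = w @ V                                       # (H, N, d_h)
    # Concatenate head outputs and apply the output projection Wo (D x D).
    return heads.transpose(1, 0, 2).reshape(N, D) @ Wo
```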
Multi-Query Attention (MQA)
The KV cache (cached K and V vectors during inference) is the dominant memory cost at long contexts. Multi-Query Attention reduces it: all heads share the same K and V projections, but each has its own Q.
```mermaid
flowchart TB
    Heads[H heads] --> SepQ[Separate Q per head]
    Heads --> ShareKV[Shared K, V across all heads]
```
Memory savings: an H× smaller K and V cache.
Quality cost: small but measurable; fine for many production models.
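A sketch under the same conventions (random weights standing in for learned ones): all H query heads score against one shared K/V head, so Wk and Wv map D to D // H, and the cache holds one small K and V per token instead of H of them.

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, Wo, H):
    """H query heads, one shared K/V head. Wk, Wv map D -> D // H."""
    N, D = x.shape
    d_h = D // H
    Q = (x @ Wq).reshape(N, H, d_h).transpose(1, 0, 2)  # (H, N, d_h)
    K = x @ Wk  # (N, d_h): this, not (N, D), is what the KV cache stores
    V = x @ Wv  # (N, d_h)
    # The single K/V head broadcasts across all H query heads.
    scores = Q @ K.T / np.sqrt(d_h)                     # (H, N, N)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return (w @ V).transpose(1, 0, 2).reshape(N, D) @ Wo
```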
Grouped-Query Attention (GQA)
Compromise between MHA and MQA: heads are organized into groups. Each group shares K and V; queries differ per head.
- MHA: H groups (one per head)
- MQA: 1 group (all heads share)
- GQA: configurable, typically 4-8 groups for 32-64 heads
GQA is the dominant pattern in 2026 production models (Llama 3+, Claude 3+, GPT-4o family). It hits the sweet spot of memory savings and quality preservation.
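A minimal sketch of the grouping, again with illustrative random weights (Wk and Wv map D to G × d_h; each group's K/V is repeated for the query heads that share it):

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, Wo, H, G):
    """H query heads share G K/V groups (H must be divisible by G)."""
    N, D = x.shape
    d_h = D // H
    Q = (x @ Wq).reshape(N, H, d_h).transpose(1, 0, 2)  # (H, N, d_h)
    K = (x @ Wk).reshape(N, G, d_h).transpose(1, 0, 2)  # (G, N, d_h) cached
    V = (x @ Wv).reshape(N, G, d_h).transpose(1, 0, 2)
    # Each group's K/V serves the H // G query heads assigned to it.
    K = np.repeat(K, H // G, axis=0)                    # (H, N, d_h)
    V = np.repeat(V, H // G, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)    # (H, N, N)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return (w @ V).transpose(1, 0, 2).reshape(N, D) @ Wo
```

Setting G = H recovers standard multi-head attention, and G = 1 recovers MQA, which is why GQA is best read as a dial between the two.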
Multi-Head Latent Attention (MLA)
DeepSeek V2-V4's innovation: K and V are down-projected into a low-dimensional latent vector, and only that latent representation is cached; full-size keys and values are reconstructed from it at attention time.
Memory savings: substantial — 4-8x smaller KV cache than GQA at comparable quality.
Quality: matches MHA on benchmarks while being more memory-efficient.
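A deliberately simplified sketch of the latent-KV idea. This omits the decoupled rotary-embedding handling and per-head details from the DeepSeek-V2 paper, and the names Wd, Wuk, Wuv are illustrative, not the paper's notation:

```python
import numpy as np

def latent_kv(x, Wd, Wuk, Wuv):
    """Down-project to a latent c, cache c, up-project K/V when needed."""
    c = x @ Wd    # (N, d_latent): the only thing stored in the KV cache
    K = c @ Wuk   # (N, D) keys, reconstructed at attention time
    V = c @ Wuv   # (N, D) values
    return c, K, V
```

The cache cost per token drops from 2·D (keys plus values) to d_latent, at the price of the extra up-projection compute during decoding.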
Why It Matters Operationally
```mermaid
flowchart LR
    Mem[KV cache memory] --> Cost[Inference cost]
    Mem --> Length[Max context]
    Mem --> Concur[Concurrent users]
```
A smaller KV cache means (see the worked example after this list):
- Longer contexts at the same memory
- More concurrent users on the same hardware
- Lower per-token inference cost
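To put rough numbers on this, a back-of-the-envelope calculation for a hypothetical Llama-style model (32 layers, head dimension 128, 32 query heads, fp16 cache; all figures illustrative):

```python
# Per-token KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes.
layers, d_h, bytes_fp16 = 32, 128, 2

def kv_bytes_per_token(kv_heads):
    return 2 * layers * kv_heads * d_h * bytes_fp16

for name, kv_heads in [("MHA", 32), ("GQA, 8 groups", 8), ("MQA", 1)]:
    per_tok = kv_bytes_per_token(kv_heads)
    seq_gb = per_tok * 32_000 / 1e9  # one 32k-token sequence
    print(f"{name}: {per_tok // 1024} KiB/token, {seq_gb:.1f} GB per 32k sequence")
```

Under these assumptions, one 32k-token sequence costs about 16.8 GB of cache with MHA, 4.2 GB with 8-group GQA, and 0.5 GB with MQA; that gap is the concurrency and context headroom described above.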
The 2024-2026 shift from MHA to GQA / MQA / MLA is part of why LLM inference cost dropped so much.
Implementations
- Llama 3 / 4: GQA
- Claude 3+: GQA (inferred from public analysis; not officially disclosed)
- GPT-4 family: GQA (inferred from public analysis; not officially disclosed)
- DeepSeek V2-V4: MLA
- Mistral: GQA / MQA
For most teams running self-hosted models, GQA is the default; MLA is worth the added complexity when KV-cache memory is the binding constraint on cost.
How This Affects Your Application
For application developers, attention type is mostly transparent. It affects:
- Long-context cost
- Throughput per dollar
- Available concurrency
You do not configure it; you choose models that already use the right one for your workload.
Beyond Attention
Some 2026 architectures (Mamba, hybrid SSM-transformer) reduce or replace attention entirely. They have their own tradeoffs (covered elsewhere). For pure-transformer architectures, attention variants are how the field gets cheaper.
Sources
- "Attention Is All You Need" Vaswani et al. — https://arxiv.org/abs/1706.03762
- "Multi-Query Attention" Shazeer — https://arxiv.org/abs/1911.02150
- "GQA" Ainslie et al. — https://arxiv.org/abs/2305.13245
- DeepSeek-V2 paper — https://arxiv.org/abs/2405.04434
- Llama 3 technical report — https://arxiv.org/abs/2407.21783