The transformer is not just 'attention'. In production LLMs it is an optimization problem over expressivity, memory bandwidth, cache shape, kernel availability, gradient stability, and context-length extrapolation.
The decoder block as a residual dynamical system
A contemporary LLM block is usually pre-normalized and consists of two residual sublayers: `h_l = x_l + Attn(RMSNorm(x_l))`, followed by `x_{l+1} = h_l + MLP(RMSNorm(h_l))`. Pre-norm keeps the residual stream close to an identity path, which makes very deep stacks trainable because gradients can flow through the skip path without being repeatedly multiplied by unstable layer Jacobians.
RMSNorm replaces full LayerNorm by normalizing with the root mean square of activations, typically without subtracting the mean. The practical gain is fewer operations and a stable scale control mechanism. The mathematical point is that the residual stream becomes a shared coordinate system: attention writes relational information into it, while the MLP performs high-dimensional feature transformation at each token position.
- Attention mixes information across sequence positions.
- The MLP expands and contracts each token representation, often using SwiGLU-style gated activations.
- The residual stream carries state across layers; normalization controls the magnitude before each write.
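The three roles listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not any particular model's implementation: the function names, the `params` dictionary, the weight shapes, and the `causal_mean` stand-in for attention are all hypothetical.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Divide each token vector by its root mean square; the mean is not subtracted.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def silu(z):
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # Expand to d_ff, gate elementwise, then contract back to d_model.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def pre_norm_block(x, attn_fn, params):
    # Each sublayer reads a normalized copy of the stream and adds its output back.
    h = x + attn_fn(rms_norm(x, params["g_attn"]))
    mlp_out = swiglu_mlp(rms_norm(h, params["g_mlp"]),
                         params["w_gate"], params["w_up"], params["w_down"])
    return h + mlp_out

def causal_mean(u):
    # Stand-in for attention (assumption): a causal running mean over positions.
    return np.cumsum(u, axis=0) / np.arange(1, len(u) + 1)[:, None]

rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 8, 16
params = {
    "g_attn": np.ones(d_model),
    "g_mlp": np.ones(d_model),
    "w_gate": rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model),
    "w_up": rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model),
    "w_down": rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff),
}
x = rng.normal(size=(n, d_model))
y = pre_norm_block(x, causal_mean, params)
```

If both sublayers output zero, the block reduces to the identity map on `x`, which is the "identity path" property that pre-norm is meant to preserve.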
Self-attention as content-addressed routing
For a hidden state matrix `X in R^{n x d_model}`, a head forms `Q = XW_q`, `K = XW_k`, and `V = XW_v`, each with head dimension `d_h`. Causal attention computes `softmax((QK^T / sqrt(d_h)) + M)V`, where the softmax is taken over each row, `M_ij = -infinity` when token `i` is not allowed to attend to token `j` (that is, when `j > i`), and `M_ij = 0` otherwise. This is a differentiable routing operator: queries score keys, and the resulting weights blend value vectors into the current representation.
Multi-head attention gives multiple routing subspaces. Each head owns separate projections, so the dot product implements a different learned similarity metric. The final output projection recombines those subspaces into the residual stream. Attention heads can specialize into patterns such as local syntax, long-range reference, delimiter tracking, code indentation, or retrieval-like copying, although individual-head interpretations should be treated as empirical observations rather than guaranteed semantics.
- Per-head score matrix: `S = QK^T / sqrt(d_h) + M`.
- Per-head output: `A = softmax(S)V`.
- Layer output: concatenate head outputs, then apply the output projection `W_o`.
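The three formulas above translate almost directly into NumPy. The sketch below assumes that `d_model` is divisible by the number of heads and that all heads are packed into single `d_model x d_model` projection matrices; the function names and the random weights are illustrative only.

```python
import numpy as np

def softmax(s):
    # Row-wise softmax with the usual max subtraction for numerical stability.
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def causal_mha(x, w_q, w_k, w_v, w_o, n_heads):
    n, d_model = x.shape
    d_h = d_model // n_heads

    def split(t):
        # (n, d_model) -> (n_heads, n, d_h)
        return t.reshape(n, n_heads, d_h).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # M_ij = -inf for j > i, 0 otherwise.
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    s = q @ k.transpose(0, 2, 1) / np.sqrt(d_h) + mask   # per-head scores S
    a = softmax(s) @ v                                   # per-head outputs A
    concat = a.transpose(1, 0, 2).reshape(n, d_model)    # concatenate heads
    return concat @ w_o                                  # output projection

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 8, 2
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
out = causal_mha(x, w_q, w_k, w_v, w_o, n_heads)
```

Two quick checks follow from the mask: the first token can only attend to itself, so its output is `x[0] @ w_v @ w_o`, and perturbing the last token leaves every earlier output unchanged.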
Position is injected into the bilinear form
Unmasked self-attention without positional information is permutation-equivariant: permuting the input tokens merely permutes the outputs in the same way, so the model cannot distinguish `dog bites man` from `man bites dog` by order alone. RoPE addresses this by rotating pairs of query and key coordinates through position-dependent angles before the dot product. The resulting attention score depends on position only through the relative offset between the two tokens, and the attention kernel stays structurally simple.
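The relative-offset property of RoPE can be checked numerically. The sketch below assumes the interleaved-pair convention (coordinates `2i` and `2i+1` rotate together) and a frequency base of 10000; some implementations pair the first and second halves of the vector instead, and the function name `rope` is illustrative.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # x: (n, d_h) with even d_h; positions: (n,) array of token indices.
    n, d_h = x.shape
    inv_freq = base ** (-np.arange(0, d_h, 2) / d_h)   # one frequency per pair
    ang = positions[:, None] * inv_freq[None, :]       # (n, d_h / 2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                 # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(m, n):
    # Attention logit between a query at position m and a key at position n.
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T).item()

# Shifting both positions by the same amount leaves the score unchanged.
same_offset = np.isclose(score(3, 1), score(103, 101))
```

Because each pair is rotated rather than scaled, the vector norm is also preserved, so RoPE changes the direction of queries and keys without changing their magnitude.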
The limitation is that standard RoPE is static: the transformation is a function of token index, not token content. Recent work such as PaTH Attention replaces static rotation with accumulated data-dependent Householder-like transformations, making positional interaction depend on the sequence itself. That is important for state-tracking tasks where the relevant coordinate frame changes as commands, entities, or memory states evolve.
Why GQA and KV-cache shape matter
At inference time, autoregressive decoding stores keys and values for every previous token. Standard multi-head attention caches one K/V pair per attention head; grouped-query attention shares each K/V head across a group of query heads. If a model has `L` layers, context length `T`, KV heads `h_kv`, head dimension `d_h`, and `b` bytes per scalar, the KV cache is approximately `2 * L * T * h_kv * d_h * b` bytes per sequence, where the factor 2 counts keys and values. Reducing `h_kv` is therefore a direct intervention on both cache size and the memory traffic per decoded token.
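As a worked example of the formula, the snippet below compares a multi-head and a grouped-query layout. The configuration (32 layers, 8192 tokens, head dimension 128, 2 bytes per scalar) is hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(n_layers, context_len, n_kv_heads, head_dim, bytes_per_scalar=2):
    # Factor 2 covers keys and values; the result is per sequence,
    # so multiply by the batch size for a full serving estimate.
    return 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_scalar

GIB = 2 ** 30

# Hypothetical configuration, 16-bit cache entries.
mha = kv_cache_bytes(n_layers=32, context_len=8192, n_kv_heads=32, head_dim=128)
gqa = kv_cache_bytes(n_layers=32, context_len=8192, n_kv_heads=8, head_dim=128)

print(f"MHA cache: {mha / GIB:.1f} GiB, GQA cache: {gqa / GIB:.1f} GiB")
```

Under these assumptions the multi-head cache is 4 GiB per sequence and the grouped-query cache with 8 KV heads is 1 GiB, a reduction exactly proportional to `h_kv`.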
The bottleneck during decoding is often not raw FLOPs but memory traffic: every generated token requires reading the entire KV cache accumulated so far. That is why GQA, multi-query attention, paged attention, quantized caches, FlashAttention-style kernels, and latent attention variants matter: they change the system-level feasibility of long contexts.
The converged 2023-2026 stack
Across many major open-weight model families, the stack has crystallized around pre-norm residual blocks, RMSNorm, RoPE or RoPE variants, SwiGLU-style MLPs, GQA/MQA for inference efficiency, and in some cases MoE layers to decouple total parameters from activated parameters. The field still varies more in long-context strategy, expert routing, cache compression, attention variants, and whether recurrence or external memory is introduced.
The engineering lesson is that architecture must be read with hardware. A mathematically elegant attention variant is incomplete until one asks how it trains in parallel, how it affects KV-cache layout, whether kernels exist, and whether it improves perplexity, reasoning, retrieval, and length extrapolation under fixed compute.
Implementation notes
- When building a small transformer from scratch, implement causal masking, RoPE, RMSNorm, SwiGLU, and KV caching before experimenting with exotic mechanisms.
- Measure tokens/sec, KV-cache bytes/token, perplexity, and long-context retrieval accuracy separately; one metric can improve while another collapses.
- For deployment, benchmark prefill and decode independently. Prefill processes the whole prompt in parallel and is typically compute-bound; decode produces one token per step and is typically dominated by repeated reads of the weights and the KV cache.
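One way to sanity-check a from-scratch KV cache is to confirm that token-by-token decoding reproduces full causal attention over the same inputs. The sketch below does this for a single head in NumPy; the function names, the dictionary-of-lists cache layout, and the random weights are assumptions made for illustration, and a real implementation would preallocate the cache rather than restacking it each step.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(x, w_q, w_k, w_v):
    # Prefill-style computation: all positions at once with a causal mask.
    n = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    mask = np.triu(np.full((n, n), -np.inf), k=1)
    return softmax(q @ k.T / np.sqrt(q.shape[-1]) + mask) @ v

def decode_step(x_t, cache, w_q, w_k, w_v):
    # Decode-style computation: append this token's key and value,
    # then attend from the single new query over everything cached so far.
    cache["k"].append(x_t @ w_k)
    cache["v"].append(x_t @ w_v)
    k = np.stack(cache["k"])
    v = np.stack(cache["v"])
    q = x_t @ w_q
    return softmax(k @ q / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
n, d_model, d_h = 6, 8, 4
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_h)) for _ in range(3))

cache = {"k": [], "v": []}
stepwise = np.stack([decode_step(x[t], cache, w_q, w_k, w_v) for t in range(n)])
reference = full_causal_attention(x, w_q, w_k, w_v)
```

No mask is needed in `decode_step` because the cache only ever contains past and current tokens. The two code paths also make the benchmarking split concrete: `full_causal_attention` does one large batched computation, while `decode_step` rereads the whole cache for every new token.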