Articles

Technical writing

Research notes on transformer architecture, language-model improvement, context engineering, middleware, long-context systems, and evaluation for regulated AI products.

These are deliberately technical essays. They treat language models as mathematical and production systems: attention geometry, cache economics, retrieval constraints, typed middleware, policy boundaries, and evaluation loops.

18 min read

2026

The transformer is not just 'attention'. In production LLMs it is an optimization problem over expressivity, memory bandwidth, cache shape, kernel availability, gradient stability, and context-length extrapolation.

The decoder block as a residual dynamical system

A contemporary LLM block is usually pre-normalized: `h_l = x_l + Attn(RMSNorm(x_l))`, followed by `x_{l+1} = h_l + MLP(RMSNorm(h_l))`, where `l` indexes blocks and `h_l` is the intermediate residual state after the attention write. Pre-norm keeps the residual stream close to an identity path, which makes very deep stacks trainable because gradients can flow through the skip path without being repeatedly multiplied by unstable layer Jacobians.

RMSNorm replaces full LayerNorm by normalizing with the root mean square of activations, without subtracting the mean. The practical gain is fewer operations with comparable control of activation scale. The mathematical point is that the residual stream becomes a shared coordinate system: attention writes relational information into it, while the MLP performs high-dimensional feature transformation at each token position.

  • Attention mixes information across sequence positions.
  • The MLP expands and contracts each token representation, often using SwiGLU-style gated activations.
  • The residual stream carries state across layers; normalization controls the magnitude before each write.
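A minimal sketch of the pre-norm pattern described above, in plain Python and for a single token vector. The `attn` and `mlp` callables are placeholders assumed for illustration; in a real block both sublayers are learned, and the attention sublayer also reads other positions.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Divide a vector by its root mean square (no mean subtraction)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]

def pre_norm_block(x, attn, mlp):
    """One decoder block on one token vector: two residual writes."""
    # Attention reads the normalized stream and adds its output back.
    h = [a + b for a, b in zip(x, attn(rms_norm(x)))]
    # The MLP does the same on the updated stream.
    return [a + b for a, b in zip(h, mlp(rms_norm(h)))]
```

If both sublayers return zeros, the block reduces to the identity map, which is the skip path that gradients can follow in a deep stack.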

Self-attention as content-addressed routing

For a hidden state matrix `X in R^{n x d_model}`, a head forms `Q = XW_q`, `K = XW_k`, and `V = XW_v`, with head dimension `d_h`. Causal attention computes `softmax((QK^T / sqrt(d_h)) + M)V`, where `M_ij = -infinity` when token `i` is not allowed to attend to token `j` (that is, when `j > i`) and `M_ij = 0` otherwise. This is a differentiable routing operator: queries score keys, then copy and blend value vectors into the current representation.

Multi-head attention gives multiple routing subspaces. Each head owns separate projections, so the dot product implements a different learned similarity metric. The final output projection recombines those subspaces into the residual stream. Attention heads can specialize into patterns such as local syntax, long-range reference, delimiter tracking, code indentation, or retrieval-like copying, although individual-head interpretations should be treated as empirical observations rather than guaranteed semantics.

  • Per-head score matrix: `S = QK^T / sqrt(d_h) + M`.
  • Per-head output: `A = softmax(S)V`.
  • Layer output: concatenate head outputs, then apply the output projection `W_o`.
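The per-head formulas above can be sketched in plain Python as follows. This is an illustrative single-head version over lists of row vectors, not an efficient kernel; the function names are assumptions, and the projections `W_q`, `W_k`, `W_v`, `W_o` are taken as already applied.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Return (outputs, weights) for single-head causal attention."""
    d_h = len(Q[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j <= i:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_h))
            else:
                scores.append(float("-inf"))  # mask: no attention to the future
        w = softmax(scores)
        weights.append(w)
        # Blend value vectors using the attention weights.
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights
```

The first position can only attend to itself, so its output is exactly its own value vector; later positions blend earlier values according to the learned similarity.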

Position is injected into the bilinear form

Without positional information, self-attention is permutation-equivariant: reordering the input tokens only reorders the outputs, so a model cannot distinguish `dog bites man` from `man bites dog` by order alone. RoPE addresses this by rotating pairs of query and key channels with position-dependent rotation matrices before the dot product. The resulting attention score depends on the relative offset between positions while keeping the attention kernel structurally simple.
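A small sketch of the relative-offset property, assuming the common convention of rotating adjacent channel pairs with frequencies `base^(-2k/d)` and `base = 10000`. Implementations differ in how they pair channels, so treat the layout here as one possible choice.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # i = 2k, so this is pos * base^(-2k/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
near = dot(rope(q, 5), rope(k, 2))      # offset 3, early in the sequence
far = dot(rope(q, 103), rope(k, 100))   # offset 3, later in the sequence
```

Up to floating-point error, `near` and `far` agree because both pairs of positions are three tokens apart; changing the offset changes the score.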

The limitation is that standard RoPE is static: the transformation is a function of token index, not token content. Recent work such as PaTH Attention replaces static rotation with accumulated data-dependent Householder-like transformations, making positional interaction depend on the sequence itself. That is important for state-tracking tasks where the relevant coordinate frame changes as commands, entities, or memory states evolve.

Why GQA and KV-cache shape matter

At inference time, autoregressive decoding stores keys and values for every previous token. Standard multi-head attention caches one K/V pair per attention head; grouped-query attention shares K/V heads across groups of query heads. If a model has `L` layers, context length `T`, KV heads `h_kv`, head dimension `d_h`, and `b` bytes per scalar, the KV cache for one sequence is approximately `2 * L * T * h_kv * d_h * b` bytes, where the factor of 2 counts keys and values. Reducing `h_kv` therefore directly reduces both the cache size and the memory traffic of each decode step.
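As a worked instance of the formula, the sketch below compares full multi-head caching with a grouped-query layout. The configuration (32 layers, 8192 tokens, head dimension 128, 2-byte scalars, 32 versus 8 KV heads) is a hypothetical example, not a specific model.

```python
def kv_cache_bytes(layers, context_len, kv_heads, head_dim, bytes_per_scalar):
    """Approximate per-sequence KV cache size; the factor 2 covers keys and values."""
    return 2 * layers * context_len * kv_heads * head_dim * bytes_per_scalar

GIB = 1024 ** 3
mha = kv_cache_bytes(32, 8192, kv_heads=32, head_dim=128, bytes_per_scalar=2)
gqa = kv_cache_bytes(32, 8192, kv_heads=8, head_dim=128, bytes_per_scalar=2)
print(mha / GIB, gqa / GIB)  # 4.0 1.0
```

Cutting KV heads from 32 to 8 shrinks the cache by the same factor of four, and each decode step reads correspondingly fewer bytes.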

The bottleneck during decoding is often not raw FLOPs but memory traffic: each generated token repeatedly reads the KV cache. That is why GQA, multi-query attention, paged attention, quantized caches, FlashAttention-style kernels, and latent attention variants matter. They change the system-level feasibility of long contexts.

The converged 2023-2026 stack

Across many major open-weight model families, the stack has crystallized around pre-norm residual blocks, RMSNorm, RoPE or RoPE variants, SwiGLU-style MLPs, GQA/MQA for inference efficiency, and in some cases MoE layers to decouple total parameters from activated parameters. The field still varies more in long-context strategy, expert routing, cache compression, attention variants, and whether recurrence or external memory is introduced.

The engineering lesson is that architecture must be read with hardware. A mathematically elegant attention variant is incomplete until one asks how it trains in parallel, how it affects KV-cache layout, whether kernels exist, and whether it improves perplexity, reasoning, retrieval, and length extrapolation under fixed compute.

Implementation notes

  • When building a small transformer from scratch, implement causal masking, RoPE, RMSNorm, SwiGLU, and KV caching before experimenting with exotic mechanisms.
  • Measure tokens/sec, KV-cache bytes/token, perplexity, and long-context retrieval accuracy separately; one metric can improve while another collapses.
  • For deployment, benchmark prefill and decode independently. Prefill is dominated by parallel attention over the prompt; decode is dominated by repeated cache reads.

16 min read

2026

Model quality is a product of the model, the data distribution, the inference policy, and the surrounding control system. Scaling helps, but it is not the only lever and often not the cheapest lever.

Data quality beats indiscriminate token volume

A base model learns a compressed statistical representation of its training distribution. If the distribution is polluted with duplicated boilerplate, broken code, low-signal SEO text, or inconsistent mathematical notation, the model spends capacity modelling noise. Deduplication, curriculum design, source weighting, synthetic-data filtering, and contamination audits directly alter the learned function.

For technical and mathematical domains, the important issue is not only whether the corpus contains the right facts. It is whether derivations, code paths, counterexamples, and proof obligations appear in a form the model can learn. Models tend to get better at reasoning when training traces force them to preserve variables, invariants, and intermediate state.

Post-training mostly reshapes the policy surface

Instruction tuning and preference optimization reshape how a model surfaces knowledge and follows tasks. SFT teaches response format and task following; RLHF, DPO, RLAIF, and related methods optimize preference-ranked behavior. These methods can also alter internal representations and factual behavior, so it is too simple to say they only change style. The more careful statement is that they optimize the model's response policy under a preference or demonstration distribution.

The technical constraint is that post-training should be evaluated on task distributions, not vibes. For code, run unit tests. For retrieval QA, measure grounded answer rate and citation precision. For regulated enterprise workflows, measure refusal correctness, PII leakage, policy adherence, latency, and auditability.

Retrieval turns missing parameters into external memory

RAG improves factuality when the needed information is private, recent, sparse, or too expensive to memorize. The hard part is not vector search alone. It is query rewriting, hybrid retrieval, metadata filtering, reranking, context compression, citation binding, and answer verification.

A strong RAG system treats the context window as scarce working memory. The pipeline should ask: what exact claim must be answered, which corpus has authority, which chunks carry the proof, how much context can be injected, and how will unsupported claims be rejected?

Inference-time compute is an algorithmic lever

Language models can be improved at inference through decomposition, self-consistency, verifier-guided search, tool use, scratchpad isolation, and multi-model routing. A small model can handle classification, extraction, or routing while a larger model handles high-entropy synthesis. The effective system becomes a mixture of policies rather than a single monolith.

The danger is uncontrolled recursion. Every extra model call increases latency, cost, and attack surface. The correct design is bounded: declare which steps are allowed, attach confidence thresholds, and evaluate whether additional calls improve task success enough to justify their cost.

  • Router: choose direct answer, retrieval, tool call, or escalation.
  • Verifier: reject unsupported answers, invalid schemas, or failed tests.
  • Budget controller: cap calls, tokens, latency, and tool side effects.
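The bounded design above can be sketched as a loop that tries policies in order, verifies each draft, and stops when the call budget is spent. The `Budget` class, the policy callables, and the result dictionary are illustrative assumptions, not a prescribed interface.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    max_calls: int
    calls_used: int = 0

    def spend(self) -> bool:
        """Consume one call if any remain; report whether it was granted."""
        if self.calls_used >= self.max_calls:
            return False
        self.calls_used += 1
        return True

def answer_with_budget(question, candidates, verify, budget):
    """Try (name, policy) pairs in order until a draft verifies or the budget runs out."""
    for name, policy in candidates:
        if not budget.spend():
            break
        draft = policy(question)
        if verify(draft):
            return {"status": "ok", "route": name, "answer": draft}
    # Nothing verified within budget: hand off instead of recursing further.
    return {"status": "escalate", "route": None, "answer": None}
```

Ordering the candidates from cheapest to most expensive makes the budget act as a cost ceiling: a small model answers when its draft verifies, and larger models are only reached when it does not.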

Serving optimizations change what is practical

Quantization, speculative decoding, continuous batching, prefix caching, semantic caching, and KV-cache compression can make a better model economically usable. These techniques do not improve the underlying distribution learned by the model, and lossy ones such as quantization or cache compression can slightly degrade it, but they improve the feasible operating point: lower latency, lower cost, larger context, or more requests per GPU.

A production model strategy should therefore couple offline quality metrics with online economics. The best model is rarely the absolute largest model; it is the best point on the curve defined by quality, latency, marginal cost, data policy, and reliability.

Implementation notes

  • Start model improvement projects with an evaluation harness and error taxonomy before changing prompts, data, or architecture.
  • Use retrieval only when the answer depends on external knowledge; otherwise retrieval can inject distractors and reduce reasoning quality.
  • Route by task entropy: deterministic extraction, classification, and formatting often belong on smaller models or rules; ambiguous reasoning may need larger models and verifiers.

20 min read

2026

A model does not understand an enterprise request from text alone. It understands through the context contract surrounding the text: ontology, permissions, retrieved evidence, tool schemas, output constraints, and feedback signals.

Middleware is the model's epistemic boundary

In a production LLM system, middleware sits between user intent and model execution. It transforms a raw utterance into an inference payload: system rules, selected tools, retrieved documents, user metadata, conversation state, response schema, and policy constraints. This is not a prompt trick; it is the application-level construction of the model's observable world.

The core abstraction is `request -> context -> model call -> validated response`. Each arrow is programmable. Middleware can redact PII, classify intent, decide whether retrieval is required, select a model, compress conversation history, attach domain definitions, gate tools, and validate output before the user sees anything.
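A minimal sketch of that abstraction, with each arrow as a plain function argument. The email regex, the `handle` function, and the result shape are simplifying assumptions for illustration; real redaction and validation are considerably more involved.

```python
import re

# Deliberately simple email pattern, used only to illustrate a redaction step.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text):
    """Replace email-like strings before anything reaches the model."""
    return EMAIL.sub("[EMAIL]", text)

def handle(request, build_context, call_model, validate):
    """request -> context -> model call -> validated response."""
    context = build_context(redact(request))
    raw = call_model(context)
    errors = validate(raw)  # a list of problems; empty means the output is acceptable
    if errors:
        return {"ok": False, "errors": errors}
    return {"ok": True, "response": raw}
```

Because every stage is injected, each one can be replaced, tested, or logged independently of the model call.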

Intent and ontology before retrieval

A common failure mode is retrieving too early. The query `show me disputed accounts from last quarter` means different things in collections, banking, CRM, or legal case management. Middleware should first map language into a domain ontology: entities, time windows, metrics, permissions, and ambiguity classes.

This can be implemented using a lightweight classifier, embedding router, rules over authenticated user state, or a small structured-output model. The output is not prose; it is a typed intent object such as `{domain, task, entities, time_range, required_permissions, retrieval_plan}`.

  • Entity grounding: map phrases to canonical IDs and database fields.
  • Temporal grounding: normalize relative time into auditable intervals.
  • Policy grounding: attach what the user may access before retrieval occurs.
  • Ambiguity grounding: ask a clarifying question when the intent object is underdetermined.
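One way to represent the typed intent object and the temporal and policy grounding steps is sketched below. The field names follow the example object above; the calendar-quarter rule and the `missing_permissions` helper are assumptions (a fiscal calendar would need a different rule).

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class Intent:
    domain: str
    task: str
    entities: tuple
    time_range: tuple            # (start, end), inclusive dates
    required_permissions: frozenset
    retrieval_plan: tuple = ()

def last_quarter(today: date):
    """Normalize 'last quarter' into an explicit calendar-quarter interval."""
    first_month_this_q = 3 * ((today.month - 1) // 3) + 1
    end = date(today.year, first_month_this_q, 1) - timedelta(days=1)
    start = date(end.year, end.month - 2, 1)
    return start, end

def missing_permissions(intent: Intent, granted):
    """Permissions the intent needs that the authenticated user lacks."""
    return intent.required_permissions - frozenset(granted)
```

Resolving the time window and the permission gap before any retrieval call means the retrieval layer only ever sees an auditable interval and an already-authorized request.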

Context assembly as constrained optimization

The context window is finite, so context assembly is an optimization problem. Given a token budget `B`, choose context items that maximize expected answer utility subject to constraints: authority, recency, diversity, permissions, and non-duplication. A naive top-k vector search does not solve this because high semantic similarity can return redundant or unauthoritative chunks.

A stronger pipeline performs query rewriting, hybrid sparse+dense retrieval, metadata filtering, cross-encoder reranking, clustering for diversity, and compression. The final prompt should include the minimal evidence needed to answer, with stable identifiers so the output can cite which records support which claims.

  • Objective: maximize expected answer utility under a token budget.
  • Hard constraints: permissions, policy, data residency, source authority.
  • Soft constraints: diversity, recency, novelty, compression loss.
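A simple greedy heuristic for this selection problem is sketched below. It filters on a hard constraint first, then picks chunks by utility per token while skipping near-duplicates. The `Chunk` fields and the scoring are hypothetical, and greedy selection is an approximation: it does not guarantee the optimal set under the budget.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    tokens: int      # must be positive
    utility: float   # hypothetical score, e.g. from a reranker
    allowed: bool    # hard constraint: the caller may read this record
    cluster: str     # near-duplicate group, used for non-duplication

def assemble_context(chunks, budget):
    """Greedy selection by utility per token, honoring hard constraints first."""
    eligible = [c for c in chunks if c.allowed]
    eligible.sort(key=lambda c: c.utility / c.tokens, reverse=True)
    chosen, used, seen_clusters = [], 0, set()
    for c in eligible:
        if c.cluster in seen_clusters or used + c.tokens > budget:
            continue
        chosen.append(c)
        used += c.tokens
        seen_clusters.add(c.cluster)
    return chosen
```

Note that the highest-scoring chunk is dropped outright if the user is not permitted to read it; permissions are applied as a filter, never as a penalty term.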

Tool filtering reduces cognitive load

Tool-calling models degrade when the tool catalog is too large or overlapping. Middleware should not expose every function on every request. It should select a small tool subset based on intent, user permissions, and tool preconditions. This improves model reliability because the model sees fewer irrelevant affordances.

A secure tool middleware wraps each call with argument validation, idempotency checks, destructive-action confirmation, rate limits, and audit logs. The model proposes an action; the middleware decides whether that action is syntactically valid, semantically permitted, and operationally safe.
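The propose-then-decide split can be sketched as a gate function that inspects a model-proposed call before anything executes. The registry layout, scope strings, and decision labels are assumptions for illustration; rate limiting and audit logging are omitted for brevity.

```python
def gate_tool_call(call, registry, user_scopes, seen_keys, confirmed=False):
    """Return (decision, reason) for a model-proposed tool call."""
    spec = registry.get(call["name"])
    if spec is None:
        return "reject", "unknown tool"
    args = call.get("args", {})
    # Syntactic validity: every declared argument present with the right type.
    for key, expected_type in spec["args"].items():
        if not isinstance(args.get(key), expected_type):
            return "reject", f"invalid argument: {key}"
    if set(args) - set(spec["args"]):
        return "reject", "unexpected arguments"
    # Semantic permission: the user must hold the tool's scope.
    if spec["scope"] not in user_scopes:
        return "reject", "permission denied"
    # Operational safety: idempotency and destructive-action confirmation.
    if call["idempotency_key"] in seen_keys:
        return "skip", "duplicate call"
    if spec["destructive"] and not confirmed:
        return "confirm", "destructive action needs approval"
    seen_keys.add(call["idempotency_key"])
    return "allow", "ok"
```

The idempotency key is recorded only when a call is allowed, so a call that was held for confirmation can be resubmitted once approved.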

Memory is not chat history

Raw conversation history is a poor long-term memory substrate. It is verbose, repetitive, and full of stale local context. Middleware should maintain layered memory: short-term transcript, episodic summaries, user preferences, durable domain facts, and task state. Each layer needs different retention and trust rules.

For language understanding, memory should preserve commitments and unresolved variables: definitions the user accepted, entities already disambiguated, constraints already stated, and outputs already delivered. Compressing a conversation into a pleasant narrative summary is less useful than preserving a typed state vector.

Observability closes the learning loop

Middleware should log the intent object, selected model, retrieval queries, retrieved document IDs, tool calls, guardrail decisions, validation errors, latency, token use, and user feedback. Without this trace, failures become impossible to diagnose because the model output is only the final symptom.

The evaluation unit is the whole inference transaction. Did the router pick the right model? Did retrieval fetch the authoritative documents? Did the model use the documents? Did validation catch unsupported claims? Did the final answer satisfy the task under policy? These questions belong in regression tests.

Implementation notes

  • Represent middleware state as typed objects, not concatenated strings. The final prompt can be rendered from structured state at the edge of the model call.
  • Separate model-visible context from audit-only metadata. The model does not need every policy trace, but the system does.
  • Treat tool use as a capability lease: expose only the tools, scopes, and records needed for the current task.

15 min read

2026

A million-token window is not equivalent to memory. Long-context models need mechanisms for selection, compression, state tracking, and forgetting.

Length generalization is a positional problem

A model trained mostly on short sequences can fail when evaluated at longer lengths even if the architecture accepts the tokens. Positional encodings determine whether attention scores remain meaningful outside the training range. RoPE extrapolation, NTK-aware scaling, YaRN-style interpolation, ALiBi, NoPE variants, and data-dependent encodings all attack this problem differently.

The mathematical issue is that position changes the geometry of query-key similarity. In RoPE, query/key channels are rotated at different frequencies; at long lengths, frequency choices and training distribution affect whether relative-position signals remain usable. If rotations alias or if the model never learned long-range dependencies during training, simply increasing `max_position_embeddings` does not create robust long-context reasoning.
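To make the frequency point concrete, the sketch below computes the wavelength, in tokens, of each rotated channel pair under the common RoPE parameterization `theta_k = base^(-2k/d)` with `base = 10000` (an assumption; models may use other bases or scaled variants). Pairs whose wavelength exceeds the training length never complete a full rotation during training.

```python
import math

def rope_wavelengths(head_dim, base=10000.0):
    """Wavelength in tokens of each channel pair: 2*pi / theta_k."""
    return [2 * math.pi * base ** (2 * k / head_dim) for k in range(head_dim // 2)]

def unseen_full_cycle_pairs(head_dim, train_len, base=10000.0):
    """Count channel pairs whose wavelength is longer than the training length."""
    return sum(w > train_len for w in rope_wavelengths(head_dim, base))

wavelengths = rope_wavelengths(128)
slow_pairs = unseen_full_cycle_pairs(128, train_len=4096)
```

The fastest pair repeats roughly every six tokens while the slowest spans tens of thousands, which is why scaling schemes tend to treat high- and low-frequency channels differently.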

Attention cost and cache cost diverge

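During prompt prefill, full attention over `n` tokens costs compute quadratic in `n`. Fused kernels such as FlashAttention reduce the memory traffic of that computation but not its quadratic FLOPs, while sparse or block strategies reduce the number of query-key interactions themselves. During decode, the new token attends over cached K/V states, so the main cost becomes reading a cache that grows linearly with context length and layer count.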
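A back-of-the-envelope sketch of the two growth rates, counting query-key pairs scored per head per layer under causal masking. The function names are illustrative, and the counts ignore projections, the MLP, and kernel-level effects.

```python
def prefill_score_entries(n):
    """Causal query-key pairs scored when processing an n-token prompt at once."""
    return n * (n + 1) // 2

def decode_score_entries(cached):
    """Query-key pairs scored for one new token given `cached` earlier positions."""
    return cached + 1  # the cached positions plus the new token itself

short, long = prefill_score_entries(4096), prefill_score_entries(8192)
step_short, step_long = decode_score_entries(4096), decode_score_entries(8192)
```

Doubling the prompt roughly quadruples prefill work but only doubles the work of a single decode step, so the two phases respond differently to the same optimization.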

This distinction matters because a method that improves prefill may not improve decode. Sliding-window attention, block-sparse attention, chunked prefill, paged KV caches, GQA, and latent KV compression should be evaluated against the phase they are meant to optimize.

Forgetting is a feature

Long contexts become noisy. If every old token remains equally eligible for attention, the model can retrieve distractors, stale instructions, or irrelevant examples. Forgetting mechanisms deliberately down-weight old or low-utility information. This is closer to how working memory should behave in an agentic system: retain state, not every observation.

PaTH-FoX-style combinations point to a useful direction: make positional interaction content-aware and allow the network to modulate what should decay. The question becomes not 'how many tokens fit?' but 'which facts remain causally active?'

State streams and recurrence

Stateful transformer proposals add persistent latent state alongside token context. Instead of rebuilding all internal computation from text at each step, the model maintains recurrent vectors across layers or time. A state stream may carry compressed reasoning traces, task state, or latent summaries without appending everything to the visible token sequence.

This is research-stage, not settled practice. Recurrence introduces stability, credit assignment, and parallelization problems, and reported gains can be difficult to separate from training procedure, benchmark choice, or additional inference-time compute. Any stateful mechanism must show that it improves reasoning or long-horizon coherence without becoming an opaque cache that drifts, amplifies errors, or resists inspection.

Retrieval is still necessary

Even excellent long-context models should not ingest entire enterprise corpora on every request. Retrieval supplies authority, access control, recency, and cost discipline. The strongest designs combine long-context capacity with retrieval-aware context assembly: enough window to include rich evidence, enough retrieval discipline to avoid irrelevant memory.

Implementation notes

  • Evaluate long context with needle retrieval, multi-hop dependency tracking, recency conflict tests, and distractor-heavy tasks.
  • Check that answers are actually derived from the cited context: a model can cite the right document ID while deriving the wrong answer.
  • Prefer explicit memory objects for durable facts; do not rely on a growing transcript as the only memory mechanism.

14 min read

2026

In regulated enterprise AI, a demo is not evidence. The system needs mathematical evaluation loops that convert behavior into measurable risk.

Define the unit of evaluation

The correct unit is not only the final answer. It is the transaction: user request, authenticated state, middleware decisions, retrieval results, tool calls, model output, validation results, and audit trace. A bad answer may originate from retrieval, routing, prompt construction, permissions, model reasoning, or output validation.

Each transaction should be replayable. If the same input state is passed through a new model or prompt, the system should report which metric changed and whether the change is acceptable.

Grounding metrics

For RAG and agent systems, measure citation precision, citation recall, unsupported-claim rate, answer completeness, refusal correctness, and contradiction handling. A grounded answer should map each material claim to evidence; the metric should penalize both missing citations and citations that do not actually support the claim. If the evidence is absent or the user lacks permission, the correct answer is refusal or clarification.
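One possible way to compute these scores from labeled data is sketched below, treating each (claim, document) pair as the unit. The input shapes and the exact definitions are assumptions; teams define citation precision and recall in several ways.

```python
def citation_scores(cited, supporting):
    """
    cited:      claim -> set of doc IDs the answer cited for that claim
    supporting: claim -> set of doc IDs that actually support that claim
    Returns (precision, recall, unsupported_claim_rate).
    """
    cited_pairs = {(c, d) for c, docs in cited.items() for d in docs}
    gold_pairs = {(c, d) for c, docs in supporting.items() for d in docs}
    correct = cited_pairs & gold_pairs
    precision = len(correct) / len(cited_pairs) if cited_pairs else 0.0
    recall = len(correct) / len(gold_pairs) if gold_pairs else 0.0
    # A claim is unsupported if none of its citations is a supporting document.
    unsupported = [c for c in cited if not (cited[c] & supporting.get(c, set()))]
    unsupported_rate = len(unsupported) / len(cited) if cited else 0.0
    return precision, recall, unsupported_rate
```

Precision penalizes citations that do not support their claim, recall penalizes supporting evidence that was never cited, and the unsupported-claim rate catches claims with no valid citation at all.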

Use adversarial retrieval tests: near-duplicate documents with conflicting dates, stale policies, records from unauthorized accounts, and documents with semantically similar but legally different terms.

Tool safety metrics

Tool calls need their own evaluation. Measure argument validity, permission correctness, idempotency, destructive-action gating, human-approval triggers, and recovery from tool errors. The model should not be trusted to decide policy alone; middleware should enforce hard constraints.

For collections, lending, banking, or insurance workflows, a tool call may have legal consequences. That makes auditability a system requirement, not a back-office convenience.

Prompt-injection and data-exfiltration tests

Enterprise agents must be tested against instruction hierarchy violations: retrieved documents that tell the model to ignore system policy, user messages that request hidden prompts, tool outputs that contain malicious instructions, and attempts to retrieve data outside access boundaries.

A useful test suite separates detection from behavior. It is not enough to flag an attack; the system must preserve task utility when safe, refuse when necessary, and avoid leaking sensitive context in the refusal.

Regression gates

Every release should run a fixed set of golden tasks and adversarial tasks. Metrics should be sliced by domain, product, user role, language, document type, and tool path. Aggregate accuracy can hide a failure mode that matters legally.
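A sketch of a sliced gate, assuming each test result is a dictionary with a boolean `passed` field plus slice attributes such as `role`. The field names and the single-threshold rule are illustrative; a real gate would likely use per-slice thresholds and severity weights.

```python
from collections import defaultdict

def sliced_pass_rate(results, key):
    """Pass rate per value of `key` (for example per user role or domain)."""
    totals, passed = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        passed[r[key]] += bool(r["passed"])
    return {k: passed[k] / totals[k] for k in totals}

def release_gate(results, key, min_rate):
    """Return (ok, failing_slices): every slice must meet the threshold."""
    rates = sliced_pass_rate(results, key)
    failing = sorted(k for k, rate in rates.items() if rate < min_rate)
    return (not failing), failing
```

With 18 passing tasks for one role and one pass plus one failure for another, the aggregate pass rate is 95 percent while the second role sits at 50 percent; the gate blocks the release on that slice even though the aggregate looks healthy.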

The evaluation harness should store prompts, model versions, retrieval corpus snapshots, embeddings version, middleware version, and output schema version. Without versioning, failures cannot be reproduced.

Implementation notes

  • Maintain separate offline and online evaluation: offline for deterministic regression, online for drift, latency, user feedback, and policy events.
  • Attach severity to failures. A formatting error is not equivalent to unauthorized disclosure or an unsafe collection action.
  • Prefer typed output contracts and validators over prose-only instructions for regulated workflows.
