Chunkwise linear attention and DeltaNet (Part 3)
Linear Attention Part 3 — Chunkwise kernels, the WY representation, and the MQAR benchmark
1. Recap
In Part 1 we derived from softmax attention the recurrence relation using a finite-dim kernel and state with capacity , and further motivated and derived the delta rule and gating. In Part 2 we expanded this concept into four complementary directions (memory architecture, objective, optimizer, retention) and showed how every finite-sized memory model fits into the picture. In this part we cover the engineering part of training them performantly on GPUs, and then benchmark them to test our theoretical picture empirically.
2. Chunkwise
Looking back at where we came from, we motivated linear attention for the regime . For example, reaches ~1M for the newest models (Claude Opus 4.7).
The recurrence
is nice for inference as we pay just per step instead of over the sequence, but training is now the problem. There we have the full sequence but the recurrence still forces us to compute every in order. On modern accelerators built around big matmul operations this wastes a lot of compute and shows up as low MFU. For just one sequence this means thousands of sequential kernel launches, each one a tiny matmul where the launch overhead dominates the actual work. What we want is the opposite extreme: one big matmul over the whole sequence at once.
The full parallel form for linear attention is possible and can fully utilize the GPU when is manageable. Pretraining at 4–8K is fine, finetuning at 32K or 128K still works, but ideally we’d train on 256K–1M sequences to stay close to inference length. At that scale the score matrix doesn’t fit, and we need a way to chunk along .
The bigger problem: the full parallel form only exists for linear attention, not for DeltaNet.
Linear attention
Starting with linear attention. Let’s split our sequence of length into chunks of size , with and the query/key/value matrices for chunk (covering positions to ). The state is the starting state of chunk , before any of chunk ‘s information is in .
Leaning on softmax attention and Part 1, we can compute the output inside a chunk in parallel form:
But this only sees writes inside the current chunk. We add the readout from all previous chunks by reading from :
What’s left is the next state, , which is just chunk ‘s contribution accumulated into :
Final chunk output: . The whole algorithm is sequential over chunks (the hand-off), but the work inside each chunk is mostly matmuls, which makes it GPU-friendly.
Total cost: chunks with three matmuls per chunk of size , giving overall. This is more FLOPs than the recurrent form, but we amortize the wall-clock through big matmuls inside each chunk and stay viable at sequence lengths where the parallel form’s score matrix wouldn’t fit.
DeltaNet
Recall DeltaNet’s recurrence:
Write for the rank-1 transition. Unrolling within chunk , the state at position is
Two things change. The cross-chunk hand-off is no longer additive: the carry has to pass through the product of rank-1 perturbations . And each write at position inside the chunk gets downscaled by the tail product of the ‘s after it.
If we materialize each tail product as a matrix the chunkwise cost goes to , a factor of worse than just running DeltaNet recurrently. Speedup gone. We need a way to represent the product of rank-1 perturbations in a smarter way.
3. WY representation
Each is identity minus rank-1, so a product of such factors is identity minus rank-:
This is the WY representation (Bischof & Van Loan, 1987) The WY Representation for Products of Householder Matrices Bischof, Van Loan · SIAM J. Sci. Stat. Comput. 8 , originally for products of Householder reflectors in QR. The win: we never have to materialize the different tail products explicitly, and applying to anything is two matrix-vector products of cost without ever forming a matrix.
The closed form for . We can derive and inductively, peeling one at a time off a known WY representation of the prefix.
Base case (). Match against . With , the natural choice is and .
Inductive step. Assume , with ‘s rows being the keys. Multiply by :
This matches exactly when we set and
” is with all prior ‘s already applied.” So ‘s rows are just the chunk’s keys, , and is built from this closed form.
Unfolding into a recursion. We don’t want to actually compute as a full matrix. But it’s already in WY form by induction, so we can apply it cheaply:
Each entry of the vector is just (since ‘s rows are the keys), so
Plugging back:
The entries form a matrix, the chunk’s key-gram . Different matrix from the intra-chunk attention’s , but the same shape and the same cost to compute.
Stack the recursion as a triangular system: strictly lower triangular with for . Then
Unit-triangular forward substitution on a system, applied across the output columns. Cost , the same order as the intra-chunk attention block. WY is essentially free.
This gives us the chunk’s prefix product in closed form: . Applying it to anything is two matrix products against and , never the matrix itself.
The twin matrix . That handled the inter-chunk hand-off through . The chunk also contributes its own writes, , and each has a different tail product. We can use the same trick: define with rows
giving the chunk’s write contribution as . Same coefficient matrix as , different right-hand side:
Two triangular solves with shared . Each is forward substitution on a system: row 1 of the output equals row 1 of the right-hand side; row subtracts off . Sequential in , parallel across the output columns ( for , for ). For typical this is a tiny system, handled by dedicated trsm kernels on GPU.
Final chunk update.
Three matmuls of - or -shaped objects against , plus the two triangular solves for and . Same asymptotic cost as linear-attention chunkwise; only a constant-factor overhead for the rank-1-perturbation transition.
The chunk’s outputs assemble in two pieces, same as linear attention. The intra-chunk (within-chunk) readout swaps for in §2’s formula:
The inter-chunk readout uses WY in the same way, since each query needs the partial prefix product applied before reading from . Final: .
MLP states (TTT, Titans).
WY relies on each being a rank-1 perturbation of identity; products of such factors stay low-rank. For MLP-based memory, has no at all, so there’s no rank-1 perturbation algebra to lean on.
TTT (Sun et al., 2024) Learning to (Learn at Test Time): RNNs with Expressive Hidden States Sun, Li, Geng, Hua, Wang, Zhao, Liu, Hardt, Chen, Pan, Lin, Wang, Han, Guestrin · 2024 arXiv:2407.04620 and Titans (Behrouz et al., 2024) Titans: Learning to Memorize at Test Time Behrouz, Zhong, Mirrokni · 2024 arXiv:2501.00663 dodge by giving up exactness inside the chunk: evaluate all gradients at the start-of-chunk state and average them into one update. The chunk becomes a mini-batch SGD step with the inner state frozen during the chunk. What this costs is the fine-grained delta dynamics: two tokens in the same chunk that write to overlapping addresses get their writes side-by-side instead of the second one rewriting the first. Chunk size becomes a mini-batch knob for the inner optimizer: bigger means more parallelism but coarser updates.
KDA (Moonshot AI, 2025) Kimi Linear: An Expressive, Efficient Attention Architecture Moonshot AI · 2025 arXiv:2510.26692 goes back to a structured-but-linear state (DPLR) precisely so it can keep an exact chunkwise kernel. The price is its own custom kernel: the rank-1 piece’s left and right vectors differ ( vs ), which breaks DeltaNet’s exact WY form, but the algorithmic shape stays close.
4. MQAR
The previous parts were about associative memory: a fixed-size state holds key-value pairs, and the model reads them back at query time. Softmax attention does this perfectly by never compressing the KV cache. Fixed-state recurrences trade exact recall for per step. We want a benchmark that measures how good a model can do associative recall.
Multi-Query Associative Recall (MQAR), introduced by (Arora et al., 2023) Zoology: Measuring and Improving Recall in Efficient Language Models Arora, Eyuboglu, Timalsina, Johnson, Poli, Zou, Rudra, Ré · 2023 arXiv:2312.04927 (Zoology), does that. The model is shown a prefix of key-value pairs, then asked to recall the value for each key when it reappears later in the sequence. This requires storing many associations, and retrieving them across distance filled with distracting noise.
The sequence has two halves. First a contiguous block of key-value pairs , then a longer query region where each of the keys reappears once at a random position, padded with noise tokens. At each query position the model has to emit the associated value. A concrete example with , vocab , :
An MQAR example with N=2, vocab=16, T=16.
(k₁,v₁)=(2,8) and (k₂,v₂)=(4,7), then the query region with query keys and noise. Loss is taken only at the two query positions.
Some design details of MQAR:
- Disjoint vocab for keys and values. Keys are drawn from the first half of the dictionary , values from the second , with token 0 reserved as a noise placeholder. Role (key vs value) is readable off token identity, so the task isolates content matching from role inference.
- Queries scattered through noise, not interleaved with their answers. The naive layout lets the model learn the shortcut “copy whatever came right after the last query.” Scattering queries among noise removes that shortcut.
- The target at a query position is the value, but the next input token is noise. At query position ,
targets[P] = vandmask[P] = 1, buttokens[P+1]is random noise. The value is never fed back as input, only scored as a lookup target.
Here are the accuracies of naive baselines:
- — random pick from the full vocabulary (≈ at ).
- — random pick from the prefix’s value set (≈ at , ≈ at ). This is a usual failure mode where the model collapses to one output value.
- — actual content-based retrieval.
Weak models that do not have the capacity to solve MQAR will plateau at , having learned to emit a value-shaped token but not the right one.
5. Implementation details
Some implementation details were necessary to get the architectures to solve MQAR.
Short causal conv before , , . Before each recurrent block we pass the per-position , , vectors through a depthwise causal 1D conv with kernel size 4. This is actually necessary for linear attention and DeltaNet to solve MQAR. Zoology explains this as a context split: the recurrence gives us global context, but we lack the local context to answer “is the previous token a query key that we need to read out?”. Looking at ‘s conv we see it usually learns a lookback.
Gating parameterization. Following Mamba2 (Dao & Gu, 2024) Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality Dao, Gu · ICML 2024 arXiv:2405.21060 for the scalar (or diagonal) gate :
Bounded in by construction (no NaN risk from sign flips inside a recurrent product), and starts at so the model defaults to “remember everything” and only learns to forget where useful.
L2-normalize after the feature map. The rank-1 transition has its only non-unit eigenvalue along , equal to . If varies freely across positions, the effective overwrite strength varies with it. L2-normalizing post-feature-map pins that eigenvalue to , putting it cleanly in and decoupling the learned from key norm.
To produce the results we used the parallel form of LA and the recurrent form for DeltaNet. For production kernels of everything in §2 and §3 — fused chunkwise, WY recurrence, conv, gating — see FLA (github.com/sustcsonglin/flash-linear-attention), which has all the architectures in this post implemented in Triton.
6. Results
We ran two minimal experiments. The principle is the smallest setting where the mechanism is visible enough to expose the architectural difference.
Capacity ceiling. Pure linear attention’s hard cap is (Part 1, §3). We pick and use : one setting below the ceiling, one at above, with vocab , , two layers, four heads.
Below the ceiling with both solve cleanly. At the ceiling, LA collapses to — the “random pick from the value set” baseline. DN stays at 0.77 with the same theoretical capacity, but because its delta rule uses the same slots more cleanly, its accuracy degrades more slowly.
Retention. Gating only buys something when there is something to decay. We hold fixed (well below capacity) and increase the sequence size with , comparing DN and Gated DN at vocab .
At both score the same: there is so little post-prefix noise that the gate never needs to activate. At the gap opens: Gated DN’s on the noise tokens shrinks the accumulating cross-talk, and the model holds onto its prefix writes a few points longer.
7. Open questions
Some questions that came up during my research on this:
-
How do you know what to memorize beforehand? Softmax has the luxury of never throwing away context. So we can always attend to the past if it contains useful information. Any finite-sized model will have to make decisions. Even harder: Every model here writes to memory based on the current pair. How does it know, at time , what’s going to be useful later in the sequence? The outer-loop training has access to the downstream loss and presumably teaches the projections , to pre-emphasize useful patterns, but a principled analysis would be interesting.
-
Exp kernel under float32 should be finite. Float32 has distinguishable values. At what truncation order of does the residual fall below float32 precision for typical ? If small suffices, “softmax” and “polynomial linear attention” are empirically the same model in float32, which would soften the “infinite kernel” framing of Part 1.
Update (2026-05-17). For a typical head dimension , matching the exp kernel to float32 precision requires polynomial features of degree . Thats an effective and un-practical state size of .
-
What about LSTM? The MIRAS axes were derived to organize the modern crop, but the LSTM cell with input/forget/output gates is also a fixed-state recurrence with content-dependent gating.
-
Distillation of softmax attention. Take a softmax-trained model, train a fixed-state inner model per layer to mimic that layer’s attention, then swap in. This would enable us to train a full-parallel softmax (cheap), and then serve with the recurrence (cheap-er, linear at inference). The open question is how well does the distillation work, and whether there is a principled way to decide which layers to leave as softmax.
References
- Katharopoulos, Vyas, Pappas, Fleuret (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020. arXiv:2006.16236
- Yang, Wang, Zhang, Shen, Kim (2024). Parallelizing Linear Transformers with the Delta Rule over Sequence Length. NeurIPS 2024. arXiv:2406.06484
- Dao, Gu (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024. arXiv:2405.21060
- Yang, Kautz, Hatamizadeh (2024). Gated Delta Networks: Improving Mamba2 with Delta Rule. arXiv:2412.06464
- Sun, Li, Geng, Hua, Wang, Zhao, Liu, Hardt, Chen, Pan, Lin, Wang, Han, Guestrin (2024). Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv:2407.04620
- Behrouz, Zhong, Mirrokni (2024). Titans: Learning to Memorize at Test Time. arXiv:2501.00663
- Moonshot AI (2025). Kimi Linear: An Expressive, Efficient Attention Architecture. arXiv:2510.26692
- Arora, Eyuboglu, Timalsina, Johnson, Poli, Zou, Rudra, Ré (2023). Zoology: Measuring and Improving Recall in Efficient Language Models. arXiv:2312.04927
- Bischof, Van Loan (1987). The WY Representation for Products of Householder Matrices. SIAM J. Sci. Stat. Comput. 8.