Chunkwise linear attention and DeltaNet (Part 3) • Erik Steiger

1. Recap

In Part 1 we derived from softmax attention the recurrence relation $S_t = S_{t-1} A_t + b_t k_t^\top$ using a finite-dim kernel and state $S$ with capacity $n \lesssim d_k$ , and further motivated and derived the delta rule and gating. In Part 2 we expanded this concept into four complementary directions (memory architecture, objective, optimizer, retention) and showed how every finite-sized memory model fits into the picture. In this part we cover the engineering part of training them performantly on GPUs, and then benchmark them to test our theoretical picture empirically.

2. Chunkwise

Looking back at where we came from, we motivated linear attention for the regime $T \gg d$ . For example, $T$ reaches ~1M for the newest models (Claude Opus 4.7).

The recurrence

S_t = S_{t-1} + v_t k_t^\top, \qquad o_t = S_t q_t

is nice for inference: we pay a constant $O(d^2)$ per step, instead of a per-step cost that grows with the context ($O(T^2)$ over the whole sequence). Training is now the problem. There we have the full sequence up front, but the recurrence still forces us to compute every $S_t$ in order. On modern accelerators built around big matmul operations this wastes a lot of compute and shows up as low MFU (model FLOPs utilization). For just one sequence this means thousands of sequential kernel launches, each one a tiny matmul where the launch overhead dominates the actual work. What we want is the opposite extreme: one big matmul over the whole sequence at once.

The full parallel form for linear attention is possible and can fully utilize the GPU when $T$ is manageable. Pretraining at 4–8K is fine, finetuning at 32K or 128K still works, but ideally we’d train on 256K–1M sequences to stay close to inference length. At that scale the score matrix doesn’t fit, and we need a way to chunk along $T$ .

The bigger problem: the full parallel form only exists for linear attention, not for DeltaNet.

Linear attention

We start with linear attention. Let’s split our sequence of length $T$ into chunks of size $C$, with $Q_i, K_i \in \mathbb{R}^{C \times d_k}$ and $V_i \in \mathbb{R}^{C \times d_v}$ the query/key/value matrices for chunk $i$ (covering positions $iC$ to $(i+1)C - 1$). The state $S_i \in \mathbb{R}^{d_v \times d_k}$ is the starting state of chunk $i$, before any of chunk $i$’s information has been written into it.

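Leaning on softmax attention and Part 1, we can compute the output inside a chunk in parallel form, with $M \in \{0, 1\}^{C \times C}$ the causal mask (lower triangular, diagonal included, since $o_t$ reads $S_t$ after the write at position $t$):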

O_i^{\text{intra}} = \bigl((Q_i K_i^\top) \odot M\bigr)\, V_i.

But this only sees writes inside the current chunk. We add the readout from all previous chunks by reading from $S_i$ :

O_i^{\text{prev}} = Q_i S_i^\top.

What’s left is the next state, $S_{i+1}$, which is just chunk $i$’s contribution accumulated into $S_i$:

S_{i+1} = S_i + V_i^\top K_i.

Final chunk output: $O_i = O_i^{\text{prev}} + O_i^{\text{intra}}$ . The whole algorithm is sequential over $T/C$ chunks (the $S$ hand-off), but the work inside each chunk is mostly matmuls, which makes it GPU-friendly.

Total cost: $T/C$ chunks, each paying $O(Cd^2)$ for the readout from $S_i$, $O(C^2 d)$ for the intra-chunk attention, and $O(Cd^2)$ for the state update, giving $O(Td^2 + TCd)$ overall. This is more FLOPs than the recurrent form's $O(Td^2)$, but we amortize the wall-clock through big matmuls inside each chunk and stay viable at sequence lengths where the parallel form’s score matrix wouldn’t fit.
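The three formulas above can be checked against the plain recurrence in a few lines of NumPy. This is a minimal sketch for illustration, not a performant kernel: the function names are made up here, the initial state is assumed to be zero, and $T$ is assumed to be a multiple of $C$.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Reference: S_t = S_{t-1} + v_t k_t^T, o_t = S_t q_t."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    O = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(V[t], K[t])
        O[t] = S @ Q[t]
    return O, S

def linear_attention_chunkwise(Q, K, V, C):
    """Chunkwise form; assumes T is a multiple of C (illustrative restriction)."""
    T, d_k = K.shape
    d_v = V.shape[1]
    assert T % C == 0
    M = np.tril(np.ones((C, C)))      # causal mask, diagonal included
    S = np.zeros((d_v, d_k))          # start-of-chunk state S_i
    O = np.zeros((T, d_v))
    for start in range(0, T, C):
        sl = slice(start, start + C)
        Qi, Ki, Vi = Q[sl], K[sl], V[sl]
        O_prev = Qi @ S.T                      # readout from previous chunks
        O_intra = ((Qi @ Ki.T) * M) @ Vi       # writes inside the chunk
        O[sl] = O_prev + O_intra
        S = S + Vi.T @ Ki                      # hand-off to the next chunk
    return O, S

rng = np.random.default_rng(0)
T, d_k, d_v, C = 64, 8, 8, 16
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
O_ref, S_ref = linear_attention_recurrent(Q, K, V)
O_chk, S_chk = linear_attention_chunkwise(Q, K, V, C)
print(np.allclose(O_ref, O_chk), np.allclose(S_ref, S_chk))
```

Setting `C = T` recovers the full parallel form and `C = 1` the recurrence, so the chunk size interpolates between the two extremes.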

Figure: the same time and memory comparison with chunkwise C=64 and C=512 added. Chunkwise stays viable where parallel OOMs at T≥131K, and is close to parallel speed at smaller T.

DeltaNet

Recall DeltaNet’s recurrence:

S_t = S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top.

Write $A_t \equiv I - \beta_t k_t k_t^\top$ for the rank-1 transition. Unrolling within chunk $i$ , the state at position $iC + t$ is

S_{iC+t} = S_{iC} \cdot A_{iC+1} A_{iC+2} \cdots A_{iC+t} + \sum_{s=1}^{t} \beta_{iC+s}\, v_{iC+s} k_{iC+s}^\top \cdot A_{iC+s+1} \cdots A_{iC+t}.

Two things change. The cross-chunk hand-off is no longer additive: the carry $S_{iC} = S_i$ has to pass through the product of $C$ rank-1 perturbations $A_{iC+1} \cdots A_{iC+C}$. And each write at position $iC + s$ inside the chunk gets downscaled by the tail product of the $A$’s after it.

If we materialize each tail product as a $d \times d$ matrix the chunkwise cost goes to $O(T C d^2)$ , a factor of $C$ worse than just running DeltaNet recurrently. Speedup gone. We need a way to represent the product of $C$ rank-1 perturbations in a smarter way.

3. WY representation

Each $A_t$ is identity minus rank-1, so a product of $C$ such factors is identity minus rank- $\le C$ :

A_1 A_2 \cdots A_C = I - W^\top Y, \qquad W, Y \in \mathbb{R}^{C \times d_k}.

This is the WY representation (Bischof & Van Loan, 1987) , originally for products of Householder reflectors in QR. The win: we never have to materialize the $C$ different tail products $A_{s+1} \cdots A_t$ explicitly, and applying $I - W^\top Y$ to anything is two matrix-vector products of cost $Cd_k$ without ever forming a $d_k \times d_k$ matrix.

The closed form for $W$ . We can derive $W$ and $Y$ inductively, peeling one $A_t$ at a time off a known WY representation of the prefix.

Base case ( $C = 1$ ). Match $A_1 = I - \beta_1 k_1 k_1^\top$ against $I - W^\top Y$ . With $W, Y \in \mathbb{R}^{1 \times d_k}$ , the natural choice is $w_1 = \beta_1 k_1$ and $y_1 = k_1$ .

Inductive step. Assume $A_1 \cdots A_{t-1} = I - W_{<t}^\top Y_{<t}$, with $Y$’s rows being the keys. Multiply by $A_t$:

A_1 \cdots A_t = (I - W_{<t}^\top Y_{<t})(I - \beta_t k_t k_t^\top) = I - W_{<t}^\top Y_{<t} - \beta_t \bigl((A_1 \cdots A_{t-1}) k_t\bigr) k_t^\top.

This matches $I - W_{\le t}^\top Y_{\le t}$ exactly when we set $y_t = k_t$ and

w_t = \beta_t \cdot (A_1 \cdots A_{t-1})\, k_t.

In words: “$w_t$ is $\beta_t k_t$ with all prior $A$’s already applied.” So $Y$’s rows are just the chunk’s keys, $Y = K$, and $W$ is built from this closed form.
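The closed form can be checked numerically by forming the prefix products explicitly, which is exactly the expensive thing the WY form later avoids. A small NumPy sketch, with a hypothetical helper name and random unit-norm keys and $\beta_t \in [0, 1)$ as test data:

```python
import numpy as np

def wy_closed_form(K, beta):
    """Rows w_t = beta_t * (A_1 ... A_{t-1}) k_t, with the prefix product
    materialized as a d_k x d_k matrix (for checking only)."""
    C, d_k = K.shape
    P = np.eye(d_k)                 # running product A_1 ... A_{t-1}
    W = np.zeros((C, d_k))
    for t in range(C):
        W[t] = beta[t] * (P @ K[t])
        P = P @ (np.eye(d_k) - beta[t] * np.outer(K[t], K[t]))
    return W, P                     # P is now A_1 ... A_C

rng = np.random.default_rng(0)
C, d_k = 8, 16
K = rng.normal(size=(C, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
beta = rng.uniform(size=C)

W, P = wy_closed_form(K, beta)
# The explicit product matches I - W^T Y with Y = K.
print(np.allclose(P, np.eye(d_k) - W.T @ K))
```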

Unfolding $w_t$ into a recursion. We don’t want to actually compute $A_1 \cdots A_{t-1}$ as a full matrix. But it’s already in WY form by induction, so we can apply it cheaply:

(A_1 \cdots A_{t-1})\, k_t = (I - W_{<t}^\top Y_{<t})\, k_t = k_t - W_{<t}^\top (Y_{<t}\, k_t).

Each entry of the vector $Y_{<t} k_t$ is just $k_s^\top k_t$ (since $Y$’s rows are the keys), so

W_{<t}^\top (Y_{<t}\, k_t) = \sum_{s<t} (k_s^\top k_t)\, w_s.

Plugging back:

w_t = \beta_t\Bigl(k_t - \sum_{s < t}(k_s^\top k_t)\, w_s\Bigr).

The $k_s^\top k_t$ entries form a $C \times C$ matrix, the chunk’s key-gram $K_r K_r^\top$ (from here on we index chunks by $r$ instead of $i$). Different matrix from the intra-chunk attention’s $Q_r K_r^\top$, but the same shape and the same $O(C^2 d_k)$ cost to compute.

Stack the recursion as a triangular system: $T_r \in \mathbb{R}^{C \times C}$ strictly lower triangular with $(T_r)_{ts} = \beta_t (k_s^\top k_t)$ for $s < t$ . Then

(I + T_r)\, W_r = \mathrm{diag}(\beta)\, K_r.

Unit-triangular forward substitution on a $C \times C$ system, applied across the $d_k$ output columns. Cost $O(C^2 d_k)$ , the same order as the intra-chunk attention block. WY is essentially free.

This gives us the chunk’s prefix product in closed form: $A_1 \cdots A_C = I - W_r^\top K_r$ . Applying it to anything is two matrix products against $W_r$ and $K_r$ , never the $d_k \times d_k$ matrix itself.
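As a sanity check, the triangular system can be solved by hand-written forward substitution and compared with the explicit product of the $A_t$. This is an illustrative NumPy sketch with made-up function names; `T_mat` stands for $T_r$ (renamed to avoid clashing with the sequence length).

```python
import numpy as np

def forward_substitution_unit(T_mat, rhs):
    """Solve (I + T_mat) X = rhs, with T_mat strictly lower triangular."""
    X = np.zeros_like(rhs)
    for t in range(rhs.shape[0]):
        # Row t only needs the already-solved rows s < t.
        X[t] = rhs[t] - T_mat[t, :t] @ X[:t]
    return X

def wy_from_triangular_system(K, beta):
    gram = K @ K.T                                 # gram[t, s] = k_t . k_s
    T_mat = np.tril(beta[:, None] * gram, k=-1)    # beta_t (k_s . k_t) for s < t
    return forward_substitution_unit(T_mat, beta[:, None] * K)

def explicit_product(K, beta):
    """A_1 A_2 ... A_C as a dense d_k x d_k matrix (for checking only)."""
    d_k = K.shape[1]
    P = np.eye(d_k)
    for k, b in zip(K, beta):
        P = P @ (np.eye(d_k) - b * np.outer(k, k))
    return P

rng = np.random.default_rng(0)
C, d_k = 16, 32
K = rng.normal(size=(C, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
beta = rng.uniform(size=C)

W = wy_from_triangular_system(K, beta)
print(np.allclose(explicit_product(K, beta), np.eye(d_k) - W.T @ K))
```

The Python loop over `t` is the sequential part; each row update is vectorized across the $d_k$ columns.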

The twin matrix $U_r$ . That handled the inter-chunk hand-off through $W$ . The chunk also contributes its own writes, $\sum_j \beta_j v_j k_j^\top A_{j+1} \cdots A_C$ , and each $j$ has a different tail product. We can use the same trick: define $U_r \in \mathbb{R}^{C \times d_v}$ with rows

u_t = \beta_t\Bigl(v_t - \sum_{s<t}(k_s^\top k_t)\, u_s\Bigr),

giving the chunk’s write contribution as $U_r^\top K_r$ . Same coefficient matrix as $W_r$ , different right-hand side:

(I + T_r)\, U_r = \mathrm{diag}(\beta)\, V_r.

Two triangular solves with shared $T_r$ . Each is forward substitution on a $C \times C$ system: row 1 of the output equals row 1 of the right-hand side; row $t$ subtracts off $\sum_{s<t} T_{ts} \cdot (\text{row } s)$ . Sequential in $t$ , parallel across the output columns ( $d_k$ for $W_r$ , $d_v$ for $U_r$ ). For typical $C \le 256$ this is a tiny system, handled by dedicated trsm kernels on GPU.

Final chunk update.

S^{(r+1)} = S^{(r)} - \bigl(S^{(r)} W_r^\top\bigr) K_r + U_r^\top K_r.

Three matmuls of $S$ - or $V$ -shaped objects against $K_r$ , plus the two triangular solves for $W_r$ and $U_r$ . Same asymptotic cost as linear-attention chunkwise; only a constant-factor overhead for the rank-1-perturbation transition.
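A quick way to convince yourself of the chunk update is to run one chunk both ways, starting from a nonzero carry. The sketch below is illustrative only: the function names are hypothetical, and `np.linalg.solve` is a general dense solver used for brevity where a real kernel would exploit the unit-triangular structure.

```python
import numpy as np

def deltanet_recurrent_state(S, K, V, beta):
    """Apply S <- S (I - b k k^T) + b v k^T token by token."""
    d_k = K.shape[1]
    for k, v, b in zip(K, V, beta):
        S = S @ (np.eye(d_k) - b * np.outer(k, k)) + b * np.outer(v, k)
    return S

def deltanet_chunk_state(S, K, V, beta):
    """One chunk via WY: S - (S W^T) K + U^T K."""
    C = K.shape[0]
    T_mat = np.tril(beta[:, None] * (K @ K.T), k=-1)   # shared coefficient matrix
    lhs = np.eye(C) + T_mat
    W = np.linalg.solve(lhs, beta[:, None] * K)        # (I + T) W = diag(beta) K
    U = np.linalg.solve(lhs, beta[:, None] * V)        # (I + T) U = diag(beta) V
    return S - (S @ W.T) @ K + U.T @ K

rng = np.random.default_rng(0)
C, d_k, d_v = 16, 32, 24
K = rng.normal(size=(C, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.normal(size=(C, d_v))
beta = rng.uniform(size=C)
S0 = rng.normal(size=(d_v, d_k))                       # nonzero carry from earlier chunks

S_ref = deltanet_recurrent_state(S0, K, V, beta)
S_wy = deltanet_chunk_state(S0, K, V, beta)
print(np.allclose(S_ref, S_wy))
```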

The chunk’s outputs $O_r$ assemble in two pieces, same as linear attention. The intra-chunk (within-chunk) readout swaps $V_r$ for $U_r$ in §2’s formula:

O_r^{\text{intra}} = \bigl((Q_r K_r^\top) \odot M\bigr)\, U_r.

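The inter-chunk readout uses WY in the same way, since each query needs the partial prefix product $A_1 \cdots A_t$ applied before reading from $S^{(r)}$. Because row $t$ of $W_r$ depends only on positions $\le t$, that partial product is $I - W_{\le t}^\top K_{\le t}$, and the causal mask selects exactly those rows:

O_r^{\text{prev}} = Q_r S^{(r)\top} - \bigl((Q_r K_r^\top) \odot M\bigr)\, W_r\, S^{(r)\top}.

Final: $O_r = O_r^{\text{prev}} + O_r^{\text{intra}}$.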
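Putting state update and readout together gives a complete chunkwise forward pass that can be compared with the token-by-token recurrence. This is a sketch under stated assumptions, not the FLA kernel: names are hypothetical, the initial state is zero, $T$ is a multiple of $C$, and the inter-chunk term is written as $Q_r S^\top - ((Q_r K_r^\top) \odot M)\, W_r S^\top$, which follows from applying $I - W_{\le t}^\top K_{\le t}$ to each query.

```python
import numpy as np

def deltanet_recurrent(Q, K, V, beta):
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    O = np.zeros((T, d_v))
    for t in range(T):
        A_t = np.eye(d_k) - beta[t] * np.outer(K[t], K[t])
        S = S @ A_t + beta[t] * np.outer(V[t], K[t])
        O[t] = S @ Q[t]
    return O

def deltanet_chunkwise(Q, K, V, beta, C):
    T, d_k = K.shape
    d_v = V.shape[1]
    assert T % C == 0                       # illustrative restriction
    M = np.tril(np.ones((C, C)))            # causal mask, diagonal included
    S = np.zeros((d_v, d_k))
    O = np.zeros((T, d_v))
    for start in range(0, T, C):
        sl = slice(start, start + C)
        Qr, Kr, Vr, br = Q[sl], K[sl], V[sl], beta[sl]
        lhs = np.eye(C) + np.tril(br[:, None] * (Kr @ Kr.T), k=-1)
        W = np.linalg.solve(lhs, br[:, None] * Kr)
        U = np.linalg.solve(lhs, br[:, None] * Vr)
        attn = (Qr @ Kr.T) * M
        O_prev = Qr @ S.T - attn @ (W @ S.T)   # carry, passed through partial prefix products
        O_intra = attn @ U                     # writes inside the chunk
        O[sl] = O_prev + O_intra
        S = S - (S @ W.T) @ Kr + U.T @ Kr      # hand-off to the next chunk
    return O

rng = np.random.default_rng(0)
T, d_k, d_v, C = 64, 16, 16, 16
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.normal(size=(T, d_v))
beta = rng.uniform(size=T)

O_ref = deltanet_recurrent(Q, K, V, beta)
O_chk = deltanet_chunkwise(Q, K, V, beta, C)
print(np.allclose(O_ref, O_chk))
```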

MLP states (TTT, Titans).

WY relies on each $A_t$ being a rank-1 perturbation of identity; a product of $C$ such factors stays identity minus a rank-$\le C$ term. For MLP-based memory, $\theta_t = \theta_{t-1} - \beta_t \nabla L(\theta_{t-1})$ has no $A_t$ at all, so there’s no rank-1 perturbation algebra to lean on.

TTT (Sun et al., 2024) and Titans (Behrouz et al., 2024) dodge by giving up exactness inside the chunk: evaluate all $C$ gradients at the start-of-chunk state $\theta^{(r)}$ and average them into one update. The chunk becomes a mini-batch SGD step with the inner state frozen during the chunk. What this costs is the fine-grained delta dynamics: two tokens in the same chunk that write to overlapping addresses get their writes side-by-side instead of the second one rewriting the first. Chunk size $C$ becomes a mini-batch knob for the inner optimizer: bigger $C$ means more parallelism but coarser updates.
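The difference between the two update styles is easiest to see with two writes to the same address. The toy sketch below uses a linear memory $S$ with loss $\tfrac12 \|S k - v\|^2$ as a stand-in for the MLP (an assumption made here so the example stays tiny; TTT and Titans use richer inner models), and made-up function names.

```python
import numpy as np

def sequential_delta(S, K, V, beta):
    """Exact inner loop: each gradient is taken at the current state."""
    for k, v, b in zip(K, V, beta):
        grad = np.outer(S @ k - v, k)       # gradient of 0.5 * ||S k - v||^2
        S = S - b * grad
    return S

def chunk_frozen(S, K, V, beta):
    """Chunk as one mini-batch step: all gradients at the start-of-chunk state."""
    grads = [b * np.outer(S @ k - v, k) for k, v, b in zip(K, V, beta)]
    return S - np.mean(grads, axis=0)

# Two tokens in the same chunk write different values to the same key.
k = np.array([1.0, 0.0, 0.0, 0.0])
K = np.stack([k, k])
V = np.array([[1.0, 0.0],
              [0.0, 1.0]])
beta = np.ones(2)
S0 = np.zeros((2, 4))

read_seq = sequential_delta(S0, K, V, beta) @ k
read_chunk = chunk_frozen(S0, K, V, beta) @ k
print(read_seq)    # the second write replaces the first
print(read_chunk)  # both writes sit side by side
```

With the sequential rule the readout is the second value; with the frozen-state rule it is the average of the two.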

KDA (Moonshot AI, 2025) goes back to a structured-but-linear state (DPLR) precisely so it can keep an exact chunkwise kernel. The price is its own custom kernel: the rank-1 piece’s left and right vectors differ ( $\alpha_t \odot k_t$ vs $k_t$ ), which breaks DeltaNet’s exact WY form, but the algorithmic shape stays close.

4. MQAR

The previous parts were about associative memory: a fixed-size state $S$ holds key-value pairs, and the model reads them back at query time. Softmax attention does this perfectly by never compressing the KV cache. Fixed-state recurrences trade exact recall for $O(d^2)$ per step. We want a benchmark that measures how well a model can do associative recall.

Multi-Query Associative Recall (MQAR), introduced by Arora et al. (2023) in Zoology, does that. The model is shown a prefix of $N$ key-value pairs, then asked to recall the value for each key when it reappears later in the sequence. This requires storing many associations and retrieving them across a distance filled with distracting noise.

The sequence has two halves. First a contiguous block of key-value pairs $k_1 v_1 k_2 v_2 \ldots k_N v_N$ , then a longer query region where each of the $N$ keys reappears once at a random position, padded with noise tokens. At each query position the model has to emit the associated value. A concrete example with $N=2$ , vocab $=16$ , $T=16$ :

Figure: an MQAR example with N=2, vocab=16, T=16. Rows: role, pos, tokens, target. Legend: key, value / target, noise, query. The KV prefix holds the pairs (k₁,v₁)=(2,8) and (k₂,v₂)=(4,7), followed by the query region with the query keys (q₂, then q₁) and noise. Loss is taken only at the two query positions.

Some design details of MQAR:

Disjoint vocab for keys and values. Keys are drawn from the first half of the dictionary $[1, V/2)$ , values from the second $[V/2, V)$ , with token 0 reserved as a noise placeholder. Role (key vs value) is readable off token identity, so the task isolates content matching from role inference.
Queries scattered through noise, not interleaved with their answers. The naive layout $[k_1 v_1 \ldots k_N v_N\, q_1 a_1 q_2 a_2 \ldots]$ lets the model learn the shortcut “copy whatever came right after the last query.” Scattering queries among noise removes that shortcut.
The target at a query position is the value, but the next input token is noise. At query position $P$ , targets[P] = v and mask[P] = 1, but tokens[P+1] is random noise. The value $v$ is never fed back as input, only scored as a lookup target.
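The layout described above can be sketched as a small generator. This is not the Zoology reference implementation: the function name is hypothetical, noise is taken to be the reserved token 0, keys within one example are assumed distinct, and nothing here prevents two queries from landing next to each other (the token after a query is then another query key, never the value).

```python
import numpy as np

def make_mqar_example(n_kv, vocab, T, rng):
    """One MQAR sequence: KV prefix, then queries scattered among noise."""
    assert vocab % 2 == 0 and n_kv <= vocab // 2 - 1
    prefix_len = 2 * n_kv
    assert T - prefix_len >= n_kv

    keys = rng.choice(np.arange(1, vocab // 2), size=n_kv, replace=False)
    values = rng.integers(vocab // 2, vocab, size=n_kv)

    tokens = np.zeros(T, dtype=np.int64)     # 0 = noise placeholder
    targets = np.zeros(T, dtype=np.int64)
    mask = np.zeros(T, dtype=np.int64)

    tokens[0:prefix_len:2] = keys            # k_1 v_1 k_2 v_2 ...
    tokens[1:prefix_len:2] = values

    # Each key reappears once at a random position in the query region.
    query_pos = rng.choice(np.arange(prefix_len, T), size=n_kv, replace=False)
    tokens[query_pos] = keys
    targets[query_pos] = values              # scored as a lookup target only
    mask[query_pos] = 1                      # loss only at query positions
    return tokens, targets, mask

tokens, targets, mask = make_mqar_example(n_kv=2, vocab=16, T=16,
                                          rng=np.random.default_rng(0))
print(tokens)
print(targets * mask)
```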

Here are the accuracies of naive baselines:

$1/V$: random pick from the full vocabulary (≈ $0.004$ at $V=256$).
$1/N_{KV}$: random pick from the prefix’s value set (≈ $0.25$ at $N_{KV}=4$, ≈ $0.03$ at $N_{KV}=32$). This is a common failure mode where the model collapses to one output value.
$\approx 1$: actual content-based retrieval.

Weak models that do not have the capacity to solve MQAR will plateau at $1/N_{KV}$ , having learned to emit a value-shaped token but not the right one.

5. Implementation details

Some implementation details were necessary to get the architectures to solve MQAR.

Short causal conv before $Q$, $K$, $V$. Before each recurrent block we pass the per-position $q_t$, $k_t$, $v_t$ vectors through a depthwise causal 1D conv with kernel size 4. This is actually necessary for linear attention and DeltaNet to solve MQAR. Zoology explains this as a context split: the recurrence gives us global context, but we lack the local context to answer “is the previous token a query key that we need to read out?”. Looking at $K$’s conv we see it usually learns a lookback.
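For concreteness, here is what a depthwise causal conv does to a sequence, written as a plain NumPy sketch. The function name and the tap convention (tap `j` looks back `j` positions) are choices made for this example; a real layer would typically add a bias and an activation, which are left out.

```python
import numpy as np

def causal_depthwise_conv1d(x, w):
    """x: (T, d) sequence, w: (kernel, d) per-channel filter.
    y[t, c] = sum_j w[j, c] * x[t - j, c], with x[t - j] = 0 when t - j < 0."""
    T, d = x.shape
    kernel = w.shape[0]
    # Left-pad with zeros so no position can see the future.
    x_pad = np.concatenate([np.zeros((kernel - 1, d)), x], axis=0)
    y = np.zeros((T, d))
    for j in range(kernel):
        # x_pad[kernel - 1 - j + t] is x[t - j]
        y += w[j] * x_pad[kernel - 1 - j : kernel - 1 - j + T]
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))

# A pure "lookback" filter: all weight on the previous position.
w_lookback = np.zeros((4, 3))
w_lookback[1] = 1.0
y = causal_depthwise_conv1d(x, w_lookback)
print(np.allclose(y[1:], x[:-1]), np.allclose(y[0], 0.0))
```

With the lookback filter, position $t$ carries the features of token $t-1$, which is the kind of local context the recurrence alone does not provide.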

Gating parameterization. Following Mamba2 (Dao & Gu, 2024) for the scalar (or diagonal) gate $\alpha_t$ :

\alpha_t = \exp\!\big(-\mathrm{softplus}(\Delta) \cdot \sigma(W_\alpha x_t + b_\alpha)\big), \qquad \Delta \approx -10 \text{ at init.}

Bounded in $(0, 1]$ by construction (no NaN risk from sign flips inside a recurrent product), and starts at $\alpha_t \approx 1$ so the model defaults to “remember everything” and only learns to forget where useful.
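A small numerical check of this parameterization, as a NumPy sketch. The shapes are assumptions for illustration: `x` is `(T, d_model)`, `W_alpha` is `(n_gates, d_model)`, and `delta` is a scalar.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)          # log(1 + exp(x)), numerically stable

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(x, W_alpha, b_alpha, delta):
    """alpha_t = exp(-softplus(delta) * sigmoid(W_alpha x_t + b_alpha))."""
    return np.exp(-softplus(delta) * sigmoid(x @ W_alpha.T + b_alpha))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
W_alpha = rng.normal(size=(2, 8))
b_alpha = np.zeros(2)

alpha_init = gate(x, W_alpha, b_alpha, delta=-10.0)   # at init: close to 1
alpha_open = gate(x, W_alpha, b_alpha, delta=5.0)     # after learning to forget
print(alpha_init.min(), alpha_open.min())
```

Since `softplus(-10)` is about `4.5e-5`, every gate value at init is within roughly that distance of 1, whatever the input.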

L2-normalize $k$ after the feature map. The rank-1 transition $A_t = I - \beta_t k_t k_t^\top$ has its only non-unit eigenvalue along $k_t$, equal to $1 - \beta_t \|k_t\|^2$. If $\|k_t\|$ varies freely across positions, the effective overwrite strength varies with it, and once $\beta_t \|k_t\|^2 > 2$ the eigenvalue drops below $-1$ and the recurrent product can blow up. L2-normalizing $k$ post-feature-map pins that eigenvalue to exactly $1 - \beta_t$, which lies in $[0, 1]$ whenever $\beta_t \in [0, 1]$, decoupling the learned $\beta_t$ from key norm.
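The eigenvalue claim is easy to verify numerically. In this sketch (hypothetical helper name, hand-picked key of norm 5) the unnormalized key produces an eigenvalue far outside $[-1, 1]$, while the normalized key gives $1 - \beta_t$.

```python
import numpy as np

def transition_eigenvalues(k, beta):
    """Eigenvalues of A = I - beta k k^T, ascending (A is symmetric)."""
    A = np.eye(len(k)) - beta * np.outer(k, k)
    return np.linalg.eigvalsh(A)

k_raw = np.array([3.0, 4.0, 0.0, 0.0])      # ||k|| = 5
beta = 0.5

eig_raw = transition_eigenvalues(k_raw, beta)
eig_unit = transition_eigenvalues(k_raw / np.linalg.norm(k_raw), beta)
print(eig_raw)     # smallest eigenvalue is 1 - beta * ||k||^2 = -11.5
print(eig_unit)    # smallest eigenvalue is 1 - beta = 0.5
```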

To produce the results we used the parallel form of LA and the recurrent form for DeltaNet. For production kernels of everything in §2 and §3 — fused chunkwise, WY recurrence, conv, gating — see FLA (github.com/sustcsonglin/flash-linear-attention), which has all the architectures in this post implemented in Triton.

6. Results

We ran two minimal experiments. The guiding principle: pick the smallest setting in which the mechanism is visible enough to expose the architectural difference.

Capacity ceiling. Pure linear attention’s hard cap is $n \lesssim d_k$ (Part 1, §3). We pick $d_k = 16$ and use $N_{KV} \in \{4, 32\}$ : one setting below the ceiling, one at $2\times$ above, with vocab $=256$ , $T = 128$ , two layers, four heads.

Figure: MQAR accuracy for Linear Attention vs DeltaNet at N_KV in {4, 32} with d_k=16. Below the ceiling both solve; at 2× the ceiling LA collapses toward the 1/N_KV baseline, while DN degrades gracefully and holds at 0.77.

Below the ceiling with $N_{KV}=4$ both solve cleanly. At $2\times$ the ceiling, LA collapses to $\approx 1/N_{KV}$, the “random pick from the value set” baseline. DN has the same theoretical capacity but stays at 0.77: its delta rule uses the same $d_k$ slots more cleanly, so its accuracy degrades more slowly.

Retention. Gating only buys something when there is something to decay. We hold $N_{KV} = 4$ fixed (well below capacity) and increase the sequence size with $T \in \{64, 512\}$ , comparing DN and Gated DN at vocab $= 1024$ .

Figure: MQAR accuracy for DeltaNet vs Gated DeltaNet in a retention sweep at fixed N_KV=4 and T in {64, 512}. Tied at T=64 (no noise to decay); Gated DN opens a small gap at T=512.

At $T=64$ both score the same: there is so little post-prefix noise that the gate never needs to activate. At $T=512$ the gap opens: Gated DN’s $\alpha_t < 1$ on the noise tokens shrinks the accumulating cross-talk, and the model holds onto its prefix writes a few points longer.

7. Open questions

Some questions that came up during my research on this:

How do you know what to memorize beforehand? Softmax has the luxury of never throwing away context. So we can always attend to the past if it contains useful information. Any finite-sized model will have to make decisions. Even harder: Every model here writes to memory based on the current $(k_t, v_t)$ pair. How does it know, at time $t$ , what’s going to be useful later in the sequence? The outer-loop training has access to the downstream loss and presumably teaches the projections $W_k$ , $W_v$ to pre-emphasize useful patterns, but a principled analysis would be interesting.
Exp kernel under float32 should be finite. Float32 has a 24-bit significand, so its relative precision is about $2^{-23} \approx 1.2 \times 10^{-7}$. At what truncation order $N$ of $\exp(q^\top k) \approx \sum_{n=0}^{N} (q^\top k)^n / n!$ does the residual fall below float32 precision for typical $|q^\top k|$? If small $N$ suffices, “softmax” and “polynomial linear attention” are empirically the same model in float32, which would soften the “infinite kernel” framing of Part 1.

Update (2026-05-17). For a typical head dimension $d_k = 128$, matching the exp kernel to float32 precision requires polynomial features of degree $\approx 5$. That's an effective, and impractical, state size of $\sim 3 \times 10^8$.
What about LSTM? The MIRAS axes were derived to organize the modern crop, but the LSTM cell with input/forget/output gates is also a fixed-state recurrence with content-dependent gating.
Distillation of softmax attention. Take a softmax-trained model, train a fixed-state inner model per layer to mimic that layer’s attention, then swap it in. This would let us train with fully parallel softmax (cheap) and then serve with the recurrence (cheaper, linear at inference). The open questions are how well the distillation works and whether there is a principled way to decide which layers to leave as softmax.
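For the float32 question above, a back-of-the-envelope script helps to see how the numbers scale. Both helpers are assumptions of this sketch rather than the calculation behind the update: the truncation criterion is "first omitted Taylor term below $2^{-23}$", and the state size is taken to be the number of monomials of total degree at most $N$ in $d_k$ variables.

```python
import math

FLOAT32_EPS = 2.0 ** -23    # spacing of float32 numbers just above 1.0

def min_taylor_order(x, eps=FLOAT32_EPS):
    """Smallest N whose first omitted term |x|^(N+1) / (N+1)! is below eps."""
    x = abs(x)
    n = 0
    while x ** (n + 1) / math.factorial(n + 1) >= eps:
        n += 1
    return n

def num_monomials(d, degree):
    """Number of monomials of total degree <= degree in d variables."""
    return math.comb(d + degree, degree)

for x in (0.1, 0.2, 1.0):
    print(x, min_taylor_order(x))

print(num_monomials(128, 5))
```

Under these assumptions the required order grows quickly with $|q^\top k|$, and the feature count at degree 5 with $d_k = 128$ is on the order of $3 \times 10^8$, in line with the figure quoted in the update.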

References

Katharopoulos, Vyas, Pappas, Fleuret (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020. arXiv:2006.16236
Yang, Wang, Zhang, Shen, Kim (2024). Parallelizing Linear Transformers with the Delta Rule over Sequence Length. NeurIPS 2024. arXiv:2406.06484
Dao, Gu (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024. arXiv:2405.21060
Yang, Kautz, Hatamizadeh (2024). Gated Delta Networks: Improving Mamba2 with Delta Rule. arXiv:2412.06464
Sun, Li, Geng, Hua, Wang, Zhao, Liu, Hardt, Chen, Pan, Lin, Wang, Han, Guestrin (2024). Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv:2407.04620
Behrouz, Zhong, Mirrokni (2024). Titans: Learning to Memorize at Test Time. arXiv:2501.00663
Moonshot AI (2025). Kimi Linear: An Expressive, Efficient Attention Architecture. arXiv:2510.26692
Arora, Eyuboglu, Timalsina, Johnson, Poli, Zou, Rudra, Ré (2023). Zoology: Measuring and Improving Recall in Efficient Language Models. arXiv:2312.04927
Bischof, Van Loan (1987). The WY Representation for Products of Householder Matrices. SIAM J. Sci. Stat. Comput. 8.