Softmax attention is a local constant estimator
From weighted averages to local linear corrections and Parallax
TL;DR
Softmax attention can be read as a kernel regression estimator: given a query $q$, it predicts the value at $q$ by taking a similarity-weighted average of all previous values. A statistician would call this a Nadaraya-Watson estimator. It is a local constant estimator: around each query we fit a single constant value, which amounts to assuming that keys near $q$ carry roughly the same value as the one at $q$. Here “local” refers to locality in key/query representation space, not locality in the sequence.
This lets us construct a failure mode. A weighted average works well when the nearby values are roughly flat around the query (assuming our query/keys are in 1D). But if the values have a local slope, and especially if the query is near the boundary of the key cloud, the weighted average lags behind the correct value. Local Linear Attention (Zuo et al., 2025) fixes this by fitting an affine function instead of a constant. Interestingly, instead of solving the Local Linear Attention system exactly, Parallax (Zuo et al., 2026) adds a learned query-like projection on top of ordinary softmax attention to estimate the slope.
1. Starting from softmax
For one query token, we have $n$ previous key-value pairs
$$(k_1, v_1), \dots, (k_n, v_n), \qquad k_i \in \mathbb{R}^{d},\; v_i \in \mathbb{R}^{d_v},$$
and a query vector $q \in \mathbb{R}^{d}$. Softmax attention returns
$$o(q) = \sum_{i=1}^{n} w_i\, v_i$$
with
$$w_i = \frac{\exp(q^\top k_i / h)}{\sum_{j=1}^{n} \exp(q^\top k_j / h)}.$$
Here $h$ is the bandwidth, or in transformer language the attention temperature (the scores are divided by it). In vanilla attention we usually write the scale as $1/\sqrt{d}$, so this corresponds to $h = \sqrt{d}$ (Vaswani et al., 2017).
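Small $h$ makes attention sharp. Large $h$ makes attention diffuse:
$$h \to 0:\; w \to \text{one-hot on } \arg\max_i\, q^\top k_i, \qquad h \to \infty:\; w_i \to \tfrac{1}{n}.$$
This answers the first small but important confusion: softmax is global over the causal context, but the weights are local in representation space. Every previous token is eligible, but keys with larger $q^\top k_i$ get exponentially larger weights. This is the inverse of, but similar in spirit to, how outliers behave in linear regression with an L2 loss: there the points that fit worst dominate, here the keys that match best do.
For two keys $k_a$ and $k_b$,
$$\frac{w_a}{w_b} = \exp\!\left(\frac{q^\top k_a - q^\top k_b}{h}\right).$$
If $q^\top k_a$ is larger by $\Delta$, then $k_a$ gets $e^{\Delta/h}$ times more weight than $k_b$. That is the “nearby keys get larger weights” statement. Nearby means “nearby under the attention score”, not nearby in token position.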
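The effect of the bandwidth and the two-key ratio can be checked numerically. The following is a minimal sketch in plain Python; the scores are made-up numbers standing in for $q^\top k_i$, and `softmax_weights` is a helper name chosen for this illustration.

```python
import math

def softmax_weights(scores, h):
    """Normalized weights exp(s_i / h) / sum_j exp(s_j / h)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / h) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores q.k_i for four previous tokens.
scores = [2.0, 1.0, 0.5, -1.0]

sharp = softmax_weights(scores, h=0.1)      # small bandwidth: almost one-hot
diffuse = softmax_weights(scores, h=100.0)  # large bandwidth: almost uniform

# The ratio between two weights depends only on the score gap and h.
h = 2.0
w = softmax_weights(scores, h)
ratio = w[0] / w[1]
expected_ratio = math.exp((scores[0] - scores[1]) / h)

print(sharp)
print(diffuse)
print(ratio, expected_ratio)
```

Every token keeps a strictly positive weight in both regimes; only the concentration of the weights changes.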
2. Softmax as weighted constant fitting
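Now ignore the transformer vocabulary for a moment and pretend we are doing regression. The keys $k_i$ are input points, the values $v_i$ are labels, and $q$ is the test point where we want a prediction.
Suppose that around this query $q$, the function from keys to values is approximately constant:
$$v_i \approx c \quad \text{for keys } k_i \text{ near } q.$$
For this one query, fit $c$ by minimizing the weighted squared error
$$L(c) = \sum_i \kappa_i\, \lVert v_i - c \rVert^2$$
where
$$\kappa_i = \exp(q^\top k_i / h).$$
The derivative is
$$\nabla_c L(c) = -2 \sum_i \kappa_i\, (v_i - c).$$
Set it to zero:
$$\sum_i \kappa_i\, v_i = c \sum_i \kappa_i,$$
so
$$c = \frac{\sum_i \kappa_i\, v_i}{\sum_i \kappa_i}.$$
Define normalized weights
$$w_i = \frac{\kappa_i}{\sum_j \kappa_j}$$
and we get
$$c = \sum_i w_i\, v_i,$$
which is exactly softmax attention.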
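To make the equivalence concrete, here is a small sketch with 1D keys and scalar values (all numbers are made up for illustration). It computes the softmax output and checks that the gradient of the weighted squared error vanishes there. The kernel weights are rescaled by a constant for numerical stability, which does not move the minimizer.

```python
import math

def softmax_output(q, keys, values, h):
    """Return the softmax-weighted average and the (rescaled) kernel weights."""
    scores = [q * k for k in keys]  # 1D dot product
    m = max(scores)
    kappa = [math.exp((s - m) / h) for s in scores]  # kernel weights up to a constant
    total = sum(kappa)
    out = sum(kap * v for kap, v in zip(kappa, values)) / total
    return out, kappa

def weighted_sq_error(c, kappa, values):
    """Weighted squared error of the constant model v_i ~ c."""
    return sum(kap * (v - c) ** 2 for kap, v in zip(kappa, values))

keys = [0.1, 0.4, 0.7, 1.0]
values = [1.0, 3.0, 2.0, 5.0]
q, h = 0.8, 0.5

c_star, kappa = softmax_output(q, keys, values, h)
loss_at_c = weighted_sq_error(c_star, kappa, values)
gradient = -2.0 * sum(kap * (v - c_star) for kap, v in zip(kappa, values))

print(c_star, gradient)
```

Moving `c_star` in either direction increases the loss, which is the constant-fit reading of the attention output.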
This is what people mean when they say softmax attention is the Nadaraya-Watson estimator (Wang, Shi, Fox, 2025). It is kernel regression where the kernel is
$$K_h(q, k_i) = \exp(q^\top k_i / h).$$
The estimator is local because the weights depend on similarity to $q$. It is constant because the local model we fit is just one value $c$.
3. Where the lag comes from
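The weighted average is not wrong by default. If the values are truly flat around $q$, the local constant estimator is the right estimator. The problem appears when the values have a local trend.
Let’s use a 1D key for intuition. Assume values come from a smooth function:
$$v_i = f(k_i).$$
Softmax/Nadaraya-Watson predicts
$$\hat f(q) = \sum_i w_i\, f(k_i).$$
Now Taylor expand $f$ around $q$:
$$f(k_i) = f(q) + f'(q)\,(k_i - q) + O\bigl((k_i - q)^2\bigr).$$
Plug this into the weighted average:
$$\hat f(q) \approx f(q) \sum_i w_i + f'(q) \sum_i w_i\,(k_i - q).$$
Because $\sum_i w_i = 1$,
$$\hat f(q) \approx f(q) + f'(q) \sum_i w_i\,(k_i - q).$$
Let
$$\bar k_w = \sum_i w_i\, k_i$$
be the weighted key center. Then the leading bias is
$$\hat f(q) - f(q) \approx f'(q)\,(\bar k_w - q).$$
This is the lag.
If the weighted neighborhood is symmetric around $q$, then $\bar k_w = q$ and this first-order bias disappears. But if $q$ sits near the boundary of the key cloud, most of the mass can lie on one side. If $q$ is to the right of the weighted key center, then
$$\bar k_w - q < 0.$$
If the function is increasing, then
$$f'(q) > 0.$$
So
$$\hat f(q) - f(q) \approx f'(q)\,(\bar k_w - q) < 0.$$
The local average underpredicts. It lags behind the trend.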
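The lag is easy to reproduce. The sketch below uses a hypothetical setup: eleven 1D keys on $[0, 1]$, an increasing linear function for the values, and a query at the right edge of the key cloud. For a linear function the first-order bias formula is exact, so the measured bias and the predicted bias should agree up to floating point error.

```python
import math

def softmax_weights(q, keys, h):
    """Softmax weights for 1D keys with score q * k and bandwidth h."""
    scores = [q * k for k in keys]
    m = max(scores)
    exps = [math.exp((s - m) / h) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def f(k):
    return 2.0 * k + 1.0  # increasing, slope f'(k) = 2 everywhere

keys = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
values = [f(k) for k in keys]
q, h = 1.0, 0.2  # query sits on the right boundary of the keys

w = softmax_weights(q, keys, h)
estimate = sum(wi * vi for wi, vi in zip(w, values))
key_center = sum(wi * ki for wi, ki in zip(w, keys))

bias = estimate - f(q)
predicted_bias = 2.0 * (key_center - q)  # f'(q) * (weighted key center - q)

print(estimate, f(q), bias, predicted_bias)
```

All keys lie to the left of the query, so the weighted key center is below $q$ and the estimate falls short of $f(q)$.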
4. Local linear attention
The natural next move is simple: instead of fitting a local constant, fit a local affine function.
Softmax attention fits
$$v_i \approx a.$$
Local Linear Attention fits
$$v_i \approx a + b\,(k_i - q)$$
in 1D, or
$$v_i \approx a + B\,(k_i - q), \qquad B \in \mathbb{R}^{d_v \times d},$$
in higher dimensions (Zuo et al., 2025).
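For the 1D case, minimize
$$\sum_i w_i \bigl(v_i - a - b\,(k_i - q)\bigr)^2$$
over $a$ and $b$. At the query itself, $k = q$, so the fitted value is
$$\hat f_{\mathrm{LL}}(q) = a.$$
So the linear term does not appear directly in the final evaluation point. But it changes the fitted intercept $a$, because $a$ and $b$ are fitted together. The slope absorbs the local trend, so the intercept no longer has to equal the raw weighted average.
The normal equations give the useful form
$$a = \bar v_w + b\,(q - \bar k_w)$$
with
$$b = \frac{\sum_i w_i\,(k_i - \bar k_w)(v_i - \bar v_w)}{\sum_i w_i\,(k_i - \bar k_w)^2}, \qquad \bar k_w = \sum_i w_i\, k_i.$$
So the local linear prediction is
$$\hat f_{\mathrm{LL}}(q) = \bar v_w + b\,(q - \bar k_w)$$
where
$$\bar v_w = \sum_i w_i\, v_i$$
is the usual softmax output.
Look at the correction term $b\,(q - \bar k_w)$. If the query is to the right of the weighted key center, $q - \bar k_w > 0$, and values increase with keys, $b > 0$, then the correction is positive. It raises the softmax estimate and corrects the lag.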
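Here is a minimal 1D sketch of that correction, using the same kind of toy data as before (keys on $[0, 1]$, a linear value function, a query at the right edge; all of it hypothetical). The slope is the weighted covariance of keys and values divided by the weighted key variance, and the prediction is the softmax output plus slope times the offset from the weighted key center.

```python
import math

def softmax_weights(q, keys, h):
    """Softmax weights for 1D keys with score q * k and bandwidth h."""
    scores = [q * k for k in keys]
    m = max(scores)
    exps = [math.exp((s - m) / h) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_linear(q, keys, values, h):
    """Return (softmax output, local linear output, fitted slope) in 1D."""
    w = softmax_weights(q, keys, h)
    k_bar = sum(wi * ki for wi, ki in zip(w, keys))
    v_bar = sum(wi * vi for wi, vi in zip(w, values))
    cov_kv = sum(wi * (ki - k_bar) * (vi - v_bar) for wi, ki, vi in zip(w, keys, values))
    var_k = sum(wi * (ki - k_bar) ** 2 for wi, ki in zip(w, keys))
    b = cov_kv / var_k  # weighted least-squares slope
    return v_bar, v_bar + b * (q - k_bar), b

keys = [i / 10 for i in range(11)]
values = [2.0 * k + 1.0 for k in keys]  # f(k) = 2k + 1
q, h = 1.0, 0.2

soft, ll, b = local_linear(q, keys, values, h)
print(soft, ll, b)  # the true value at the query is f(1.0) = 3.0
```

On exactly linear data the local linear fit recovers the true value, while the plain weighted average stays below it.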
This is unrelated to “linear attention” in the finite-state sense from my earlier posts. Annoying naming collision. Local Linear Attention is not saying “sub-quadratic recurrent attention”. It is saying “local linear regression around each query”.
5. What exact LLA costs
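In higher dimensions, the local linear correction has the shape
$$o(q) = \bar v_w - \Sigma_{kv}^\top \rho.$$
Here
$$\bar v_w = \sum_i w_i\, v_i$$
is the ordinary softmax output,
$$\Sigma_{kv} = \sum_i w_i\,(k_i - \bar k_w)(v_i - \bar v_w)^\top, \qquad \bar k_w = \sum_i w_i\, k_i,$$
is the local key-value covariance under the softmax weights, and $\rho \in \mathbb{R}^{d}$ is the direction that says how to probe that covariance.
Exact Local Linear Attention gets this probe by solving a linear system of the form
$$(\Sigma_{kk} + \lambda I)\,\rho = \bar k_w - q, \qquad \Sigma_{kk} = \sum_i w_i\,(k_i - \bar k_w)(k_i - \bar k_w)^\top,$$
where $\lambda$ is the regularization strength. In 1D with $\lambda = 0$ this reduces to the slope correction $b\,(q - \bar k_w)$ from the previous section.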
That is the beautiful version mathematically and the annoying version computationally. The Parallax paper lists the practical problems:
- A per-query solve is expensive and I/O-heavy.
- The regularization is a real tradeoff. Too large and the correction collapses back toward softmax; too small and the solve can become unstable.
- Iterative solvers are not what modern low-precision attention kernels want to spend their life doing.
So we have the correction we want, but the exact computation is awkward.
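The regularization tradeoff can be seen in a tiny example. This is a sketch under stated assumptions, not the paper's kernel: it uses 2D keys, scalar values, a ridge-regularized system built from centered weighted covariances, and Cramer's rule for the 2x2 solve. The data, the function `lla_2d`, and the parameter name `lam` are all hypothetical.

```python
import math

def softmax_weights(q, keys, h):
    """Softmax weights for 2D keys with dot-product scores and bandwidth h."""
    scores = [q[0] * k[0] + q[1] * k[1] for k in keys]
    m = max(scores)
    exps = [math.exp((s - m) / h) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def lla_2d(q, keys, values, h, lam):
    """Return (softmax output, ridge-regularized local linear output)."""
    w = softmax_weights(q, keys, h)
    kb0 = sum(wi * k[0] for wi, k in zip(w, keys))
    kb1 = sum(wi * k[1] for wi, k in zip(w, keys))
    vb = sum(wi * v for wi, v in zip(w, values))
    # Centered key-key covariance (s00, s01, s11) and key-value covariance (c0, c1).
    s00 = s01 = s11 = c0 = c1 = 0.0
    for wi, k, v in zip(w, keys, values):
        d0, d1 = k[0] - kb0, k[1] - kb1
        s00 += wi * d0 * d0
        s01 += wi * d0 * d1
        s11 += wi * d1 * d1
        c0 += wi * d0 * (v - vb)
        c1 += wi * d1 * (v - vb)
    # Solve (Sigma_kk + lam * I) rho = k_bar - q with Cramer's rule.
    a, b, d = s00 + lam, s01, s11 + lam
    det = a * d - b * b
    r0, r1 = kb0 - q[0], kb1 - q[1]
    rho0 = (d * r0 - b * r1) / det
    rho1 = (a * r1 - b * r0) / det
    return vb, vb - (c0 * rho0 + c1 * rho1)

def f(k):
    return 1.0 + 2.0 * k[0] - k[1]  # affine toy target

grid = [0.0, 0.5, 1.0]
keys = [(x, y) for x in grid for y in grid]
values = [f(k) for k in keys]
q, h = (1.0, 1.0), 0.5  # query at a corner of the key cloud

soft, nearly_exact = lla_2d(q, keys, values, h, lam=1e-9)
_, over_regularized = lla_2d(q, keys, values, h, lam=1e6)
print(soft, nearly_exact, over_regularized, f(q))
```

With a tiny `lam` the output recovers the affine target at the query; with a huge `lam` the probe shrinks to zero and the output falls back to the softmax average. In real attention the system is $d \times d$ per query, which is where the cost and stability concerns above come from.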
6. Parallax: learn the probe
Parallax (Zuo et al., 2026) keeps the softmax part and replaces the exact solve with a learned projection.
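Normal attention has
$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V.$$
Parallax adds
$$R = X W_R.$$
For a query position $t$, write $q_t$ and $r_t$ for the $t$-th rows of $Q$ and $R$, and
$$w_{ti} = \frac{\exp(q_t^\top k_i / h)}{\sum_{j \le t} \exp(q_t^\top k_j / h)}, \qquad \bar k_t = \sum_{i \le t} w_{ti}\, k_i, \qquad \bar v_t = \sum_{i \le t} w_{ti}\, v_i.$$
Then Parallax uses
$$o_t = \bar v_t - \sum_{i \le t} w_{ti}\,(v_i - \bar v_t)(k_i - \bar k_t)^\top r_t.$$
So the model no longer solves for the probe. It learns a query-like probe $r_t$ during training.
Softmax gives the local weighted average. Parallax subtracts a learned covariance correction.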
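A small sketch of this kind of estimator, under the assumption that the correction is the softmax-weighted key-value covariance contracted with a probe vector. The probe `r` below is a hand-picked stand-in for a learned projection, and the keys and values are made up. The sketch also rewrites the output as a weighted sum with effective weights $w_i\,(1 - (k_i - \bar k)^\top r)$, which still sum to one but may be negative.

```python
import math

def softmax_weights(q, keys, h):
    """Softmax weights with dot-product scores and bandwidth h."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp((s - m) / h) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def covariance_corrected(q, r, keys, values, h):
    """Return (softmax output, corrected output, effective weights)."""
    w = softmax_weights(q, keys, h)
    dk, dv = len(keys[0]), len(values[0])
    k_bar = [sum(wi * k[j] for wi, k in zip(w, keys)) for j in range(dk)]
    v_bar = [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(dv)]
    # Probe response of each centered key: (k_i - k_bar) . r
    proj = [sum((k[j] - k_bar[j]) * r[j] for j in range(dk)) for k in keys]
    # Covariance correction: sum_i w_i * proj_i * (v_i - v_bar)
    corr = [sum(wi * p * (v[j] - v_bar[j]) for wi, p, v in zip(w, proj, values))
            for j in range(dv)]
    out = [vb - c for vb, c in zip(v_bar, corr)]
    eff = [wi * (1.0 - p) for wi, p in zip(w, proj)]
    return v_bar, out, eff

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 3.0]]
q = [1.0, 0.5]
r = [3.0, 0.0]  # hypothetical probe; in a model this would be a learned projection

v_bar, out, eff = covariance_corrected(q, r, keys, values, h=1.0)
# Same output written as a signed weighted sum of the values.
out_from_eff = [sum(e * v[j] for e, v in zip(eff, values)) for j in range(2)]
v_bar0, out0, _ = covariance_corrected(q, [0.0, 0.0], keys, values, h=1.0)
print(out, eff)
```

A zero probe gives back plain softmax attention, and a nonzero probe can push some effective weights below zero, which a softmax alone cannot do.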
The paper also drops the boundary amplification factor from exact LLA. That factor is well behaved when the probe comes from the exact constrained solve. Once the probe is just a learned projection of the token, $r_t = x_t W_R$, the same normalization can blow up or flip sign.
Parallax is not replacing softmax the way linear attention or DeltaNet replace softmax. It still uses softmax weights and still streams the full KV cache. Instead, it changes the estimator sitting on top of the same softmax neighborhood: the weighted average minus a learned covariance correction, rather than the weighted average alone.
This matters because softmax can only de-emphasize a token by assigning it small positive probability. Parallax can subtract value directions. In the paper’s analysis, this changes attention patterns: the base softmax becomes more diffuse, the correction branch handles sharper discrimination, and the model relies less on the usual attention sink behavior.
7. Why more FLOPs can be faster
The other initially weird claim in the Parallax paper is that the model does more FLOPs but can match or beat FlashAttention decoding.
This is possible because decoding attention is often memory-bandwidth-bound. The slow part is not always the arithmetic; it is streaming the KV cache from HBM.
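FlashAttention loads K/V tiles and computes, tile by tile, the scores and the running softmax-weighted sum
$$s_{ti} = q_t^\top k_i / h, \qquad o_t = \frac{\sum_i e^{s_{ti}}\, v_i}{\sum_i e^{s_{ti}}}.$$
Parallax adds a second branch of the form
$$p_{ti} = r_t^\top k_i, \qquad \sum_i e^{s_{ti}}\, p_{ti}\, v_i, \qquad \sum_i e^{s_{ti}}\, p_{ti},$$
where $r_t$ is the learned probe for position $t$.
But it reuses the same K/V tiles that were already loaded for the softmax branch. So the kernel performs more FLOPs per byte moved from memory, that is, it has higher arithmetic intensity
$$\text{arithmetic intensity} = \frac{\text{FLOPs}}{\text{bytes moved from HBM}},$$
and if the GPU was waiting on memory, doing more FLOPs on already-loaded data can be close to free, or even faster if it improves utilization.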
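A back-of-the-envelope roofline estimate shows why extra FLOPs need not cost time. Everything numeric here is an assumption for illustration: the peak throughput and bandwidth are round hypothetical values, not measurements of any device, and the second branch is assumed to roughly double the arithmetic while reading the same bytes.

```python
def decode_time(flops, bytes_moved, peak_flops, bandwidth):
    """Roofline estimate: the slower of compute time and memory time."""
    return max(flops / peak_flops, bytes_moved / bandwidth)

# Hypothetical accelerator numbers, for illustration only.
PEAK_FLOPS = 300e12  # FLOP/s
BANDWIDTH = 2e12     # bytes/s from HBM

n, d = 8192, 128          # cached tokens, head dimension
bytes_kv = 2 * n * d * 2  # K and V stored in 16 bits, each read once

flops_softmax = 4 * n * d             # scores and weighted sum, 2 FLOPs per multiply-add
flops_two_branch = 2 * flops_softmax  # assumption: the correction branch doubles the work

ai_softmax = flops_softmax / bytes_kv
ai_two_branch = flops_two_branch / bytes_kv

t_softmax = decode_time(flops_softmax, bytes_kv, PEAK_FLOPS, BANDWIDTH)
t_two_branch = decode_time(flops_two_branch, bytes_kv, PEAK_FLOPS, BANDWIDTH)

print(ai_softmax, ai_two_branch)
print(t_softmax, t_two_branch)
```

Under these made-up numbers both variants sit far below the ridge point of 150 FLOPs per byte, so the estimated time is set by memory traffic in both cases and the extra branch does not change it.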
References
- Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
- Wang, Shi, Fox (2025). Test-time regression: a unifying framework for designing sequence models with associative memory. arXiv:2501.12352
- Zuo, Yin, Zeng, Li, Zhu, Wang (2025). Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression. arXiv:2510.01450
- Zuo, Pai, Zeng, Dewulf, Hu, Wang (2026). Parallax: Parameterized Local Linear Attention for Language Modeling. arXiv:2605.29157