Gated DeltaNet, From First Principles (Fable-5 vs Opus 4.8)

09 Jun, 2026

A tutorial that builds up everything you need — linear attention, state-space models, Mamba, the delta rule — and then assembles them into Gated DeltaNet (Yang, Kautz & Hatamizadeh, 2024; ICLR 2025).

Part 0 — The problem everyone is trying to solve

A standard Transformer's attention has two costs that hurt at long context:

Training cost scales quadratically: attending every token to every other token is $O (L^{2})$ in sequence length $L$ .
Inference cost grows linearly in memory: the KV cache stores keys and values for every past token, so generating token $t$ requires $O (t)$ memory and $O (t)$ compute per step.

The dream is a model that behaves like an RNN at inference time — a fixed-size state that gets updated once per token, giving $O (1)$ memory and compute per generated token — while still being trainable in parallel like a Transformer. Linear attention, state-space models (Mamba), and DeltaNet are all members of this family. Gated DeltaNet is, in a precise sense, the merger of Mamba2 and DeltaNet, and to understand it you need both parents.

The unifying mental model for this whole family:

The model maintains a matrix-valued memory $S_{t} \in ℝ^{d_{v} \times d_{k}}$ — think of it as a little key→value lookup table compressed into a single matrix. Each architecture differs only in how it writes to this matrix at each timestep.

Keep that sentence in mind; the rest of the tutorial is just filling in the write rules.

Part 1 — Linear attention: attention as a fast-weight RNN

1.1 From softmax attention to linear attention

Causal softmax attention computes, for query $q_{t}$ :

$o_{t} = \frac{\sum_{i \leq t} \exp (q_{t}^{⊤} k_{i}) v_{i}}{\sum_{i \leq t} \exp (q_{t}^{⊤} k_{i})}$

The $\exp (\cdot)$ is what forces us to keep every $k_{i}, v_{i}$ around. Linear attention (Katharopoulos et al., 2020) drops the exponential (or replaces it with a feature map $ϕ$ , which we'll fold into $k$ and $q$ for simplicity):

$o_{t} = \sum_{i \leq t} (q_{t}^{⊤} k_{i}) v_{i} = (\underset{S_{t}}{\underset{⏟}{\sum_{i \leq t} v_{i} k_{i}^{⊤}}}) q_{t}$

That regrouping is the whole trick. Define the state $S_{t} = \sum_{i \leq t} v_{i} k_{i}^{⊤}$ and you get a recurrence:

$S_{t} = S_{t - 1} + v_{t} k_{t}^{⊤}, o_{t} = S_{t} q_{t}$

This is an RNN whose hidden state is a $d_{v} \times d_{k}$ matrix, updated by adding a rank-1 outer product each step. At inference you only ever store $S_{t}$ — constant memory regardless of context length.
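The regrouping above can be checked numerically. The following is a minimal sketch in plain Python (no feature map, no normalization, toy dimensions chosen for illustration); the helper names `outer`, `matvec`, `linear_attention_recurrent`, and `linear_attention_sum` are hypothetical, not from any library.

```python
def outer(v, k):
    """Outer product v k^T as a list of rows (d_v x d_k)."""
    return [[vi * kj for kj in k] for vi in v]

def matvec(S, q):
    """Matrix-vector product S q."""
    return [sum(s * x for s, x in zip(row, q)) for row in S]

def linear_attention_recurrent(qs, ks, vs):
    """S_t = S_{t-1} + v_t k_t^T, o_t = S_t q_t; only S is kept between steps."""
    d_v, d_k = len(vs[0]), len(ks[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        write = outer(v, k)
        S = [[s + w for s, w in zip(s_row, w_row)] for s_row, w_row in zip(S, write)]
        outputs.append(matvec(S, q))
    return outputs

def linear_attention_sum(qs, ks, vs):
    """Reference form o_t = sum_{i<=t} (q_t . k_i) v_i, which needs the full history."""
    outputs = []
    for t, q in enumerate(qs):
        o = [0.0] * len(vs[0])
        for i in range(t + 1):
            score = sum(a * b for a, b in zip(q, ks[i]))
            o = [oj + score * vj for oj, vj in zip(o, vs[i])]
        outputs.append(o)
    return outputs

# Toy data: 3 tokens, d_k = 2, d_v = 3.
ks = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
vs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

rec = linear_attention_recurrent(qs, ks, vs)
ref = linear_attention_sum(qs, ks, vs)
max_gap = max(abs(a - b) for ra, rb in zip(rec, ref) for a, b in zip(ra, rb))
print(max_gap)
```

The two functions should agree up to floating-point rounding. The recurrent one never looks at past keys or values again, which is the constant-memory property.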

1.2 The associative-memory reading

$S_{t}$ is exactly a classical linear associative memory (a "fast weight" matrix, in Schmidhuber's 1990s terminology). Writing $v_{t} k_{t}^{⊤}$ stores the association "key $k_{t}$ → value $v_{t}$ ". Reading with $S_{t} q_{t}$ retrieves a weighted blend of stored values: if $q_{t} \approx k_{i}$ and keys are roughly orthonormal, $S_{t} q_{t} \approx v_{i}$ . This reading is what makes the later "delta rule" idea natural.

1.3 What's wrong with vanilla linear attention

The update $S_{t} = S_{t - 1} + v_{t} k_{t}^{⊤}$ is purely additive. Nothing is ever erased. Two failure modes follow:

Memory collision / interference. A $d_{v} \times d_{k}$ matrix can store at most about $d_{k}$ key–value pairs cleanly. Past that, retrievals blur together. With keys that repeat (which language is full of), old and new values for the same key get summed, not replaced.
No recency control. The model can't choose to discount stale information, so performance degrades on long sequences and on tasks requiring up-to-date tracking of changing facts.
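The first failure mode is easy to reproduce. In this small sketch the "values" are hypothetical one-hot vectors standing in for "Alice" and "Bob", and `write` / `read` are illustrative helper names. Writing twice under the same key sums the two values instead of replacing the first.

```python
def write(S, k, v):
    """Additive rank-1 write: S + v k^T."""
    return [[s + vi * kj for s, kj in zip(row, k)] for row, vi in zip(S, v)]

def read(S, q):
    """Readout S q."""
    return [sum(s * x for s, x in zip(row, q)) for row in S]

name_key = [1.0, 0.0]                  # the key for "user's name"
alice, bob = [1.0, 0.0], [0.0, 1.0]    # hypothetical one-hot value codes

S = [[0.0, 0.0], [0.0, 0.0]]
S = write(S, name_key, alice)          # store name -> Alice
S = write(S, name_key, bob)            # later: name -> Bob, same key

readout = read(S, name_key)
print(readout)                         # [1.0, 1.0]: a blend of Alice and Bob
```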

The two repairs to this problem are exactly the two parents of Gated DeltaNet:

Repair	Idea	Architecture family
Gating (forget)	Multiply $S_{t - 1}$ by a decay before adding	Mamba / Mamba2, GLA, RetNet
Delta rule (replace)	Erase the old value at this key before writing the new one	DeltaNet

We take them one at a time. Gating is where the Mamba lineage comes in, so that's our prerequisite detour.

Part 2 — Mamba prerequisites: from state-space models to a scalar gate

You don't need all of Mamba to understand Gated DeltaNet, but you do need to see how the SSM lineage arrives at the update $S_{t} = α_{t} S_{t - 1} + v_{t} k_{t}^{⊤}$ , because Gated DeltaNet's $α_{t}$ gate is precisely Mamba2's. This section builds that bridge.

2.1 Continuous state-space models

A (linear, time-invariant) state-space model from control theory maps an input signal $u (t) \in ℝ$ to an output $y (t)$ through a hidden state $h (t) \in ℝ^{N}$ :

$h^{'} (t) = A h (t) + B u (t), y (t) = C h (t)$

$A \in ℝ^{N \times N}$ governs how the state evolves on its own (its eigenvalues set decay rates / memory timescales), $B$ controls how input enters, $C$ how the state is read out.

2.2 Discretization

To run this on token sequences we discretize with a step size $Δ$ . Using zero-order hold (ZOH):

$\bar{A} = \exp (Δ A), \bar{B} = (Δ A)^{- 1} (\exp (Δ A) - I) Δ B$

giving the discrete recurrence:

$h_{t} = \bar{A} h_{t - 1} + \bar{B} u_{t}, y_{t} = C h_{t}$

Two things to internalize:

$\bar{A} = \exp (Δ A)$ means the state is exponentially decayed each step. With well-chosen $A$ (e.g., negative real parts), $‖ \bar{A} ‖ < 1$ : this is a forgetting mechanism baked into the dynamics.
$Δ$ acts like a knob between "ignore this token" ( $Δ \to 0$ means $\bar{A} \to I$ , $\bar{B} \to 0$ : state coasts unchanged) and "reset on this token" ( $Δ$ large: state dominated by current input).
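To see the $Δ$ knob in numbers, here is a sketch of the ZOH formulas for the simplest possible case: a one-dimensional state with scalar $A = a < 0$ (an assumption made for illustration, where the matrix exponential reduces to `math.exp`). The function name `zoh_scalar` is hypothetical.

```python
import math

def zoh_scalar(a, b, delta):
    """ZOH discretization of h' = a h + b u for scalar a != 0.

    A_bar = exp(delta * a)
    B_bar = (delta * a)^{-1} (exp(delta * a) - 1) * delta * b = (A_bar - 1) / a * b
    """
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

a, b = -1.0, 1.0
small = zoh_scalar(a, b, 1e-4)   # tiny step: A_bar near 1, B_bar near 0 (state coasts)
large = zoh_scalar(a, b, 10.0)   # big step: A_bar near 0 (old state forgotten), B_bar near -b/a
print(small, large)
```

With a tiny $Δ$ the old state passes through almost unchanged and the input is nearly ignored. With a large $Δ$ the old state is almost entirely discarded and the new state is set by the current input.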

2.3 S4 → Mamba: making the SSM selective

S4 (Gu et al., 2021) used fixed $A, B, C, Δ$ per channel. Because everything is time-invariant, the recurrence equals a convolution and can be computed very fast — but the model treats every token identically. It cannot decide, based on content, "this token matters, remember it" or "this is filler, skip it."

Mamba (Mamba1, Gu & Dao 2023) made $B_{t}$ , $C_{t}$ , and crucially $Δ_{t}$ functions of the current input token ("selective SSM"). Now ${\bar{A}}_{t} = \exp (Δ_{t} A)$ varies per token: the model dynamically chooses how much of its state to keep versus overwrite, conditioned on what it's reading. This input-dependence breaks the convolution trick, so Mamba introduced a hardware-aware parallel scan to keep training fast.

If you map SSM language onto attention language, a tight analogy appears: $B_{t}$ plays the role of key $k_{t}$ , $C_{t}$ the role of query $q_{t}$ , $u_{t}$ the value, and ${\bar{A}}_{t}$ is a data-dependent forget gate on the state.

2.4 Mamba2: collapsing the gate to a scalar

Mamba1's $A$ is a diagonal matrix per channel — each state dimension decays at its own rate. Mamba2 (Dao & Gu, 2024, "Transformers are SSMs" / the SSD framework) restricts $A$ further to a scalar times identity: every dimension of the state shares one decay $α_{t} \in (0, 1)$ per timestep.

That sounds like a downgrade, but it buys an enormous structural win: with a scalar gate, the recurrence over the matrix state becomes

$S_{t} = α_{t} S_{t - 1} + v_{t} k_{t}^{⊤}, o_{t} = S_{t} q_{t}$

which is exactly gated linear attention with a scalar data-dependent decay. Because $α_{t}$ is a scalar, the whole sequence computation can be rewritten as attention with a multiplicative causal decay mask (the "state-space duality"), which maps onto matrix multiplications — the thing GPUs and their tensor cores are actually fast at. Mamba2 trains faster than Mamba1 at larger state sizes for this reason.

This update rule — scalar gate $α_{t}$ , additive rank-1 write — is the entire Mamba inheritance Gated DeltaNet needs. Interpretation: at each token the model can globally fade its memory (small $α_{t}$ ≈ "context switch, clear the board"; $α_{t} \approx 1$ ≈ "keep everything"), then add the new association on top.

2.5 What gating still can't do

Gating's weakness is that it is indiscriminate. $α_{t}$ scales the whole matrix. Suppose the model stored "the user's name → Alice" and later reads "actually, call me Bob." The right operation is surgical: erase the value stored at the key "user's name" and write "Bob," leaving everything else intact. A scalar gate can only fade all memories together to make room. Rapid erase-and-update of a single association requires nuking unrelated memories too. That surgical operation is the delta rule.

Part 3 — The delta rule: DeltaNet

3.1 Memory update as online learning

Reframe the memory problem as online regression: at each step we want the memory matrix $S$ to map keys to values, i.e., minimize $ℒ_{t} (S) = \frac{1}{2} ‖ S k_{t} - v_{t} ‖^{2}$ on the current pair. Take one gradient step with learning rate $β_{t}$ :

$S_{t} = S_{t - 1} - β_{t} \nabla_{S} ℒ_{t} (S_{t - 1}) = S_{t - 1} - β_{t} (S_{t - 1} k_{t} - v_{t}) k_{t}^{⊤}$

This is the classical delta rule (Widrow & Hoff, 1960), applied per token with a data-dependent learning rate $β_{t} \in (0, 1)$ predicted from the input. Rearranged into the form that will matter:

$S_{t} = S_{t - 1} (I - β_{t} k_{t} k_{t}^{⊤}) + β_{t} v_{t} k_{t}^{⊤}$

3.2 Reading the update: retrieve, erase, write

Decompose it (assume $‖ k_{t} ‖ = 1$ for intuition):

Retrieve the value currently stored at key $k_{t}$ : $v_{t}^{old} = S_{t - 1} k_{t}$ .
Erase a $β_{t}$ -fraction of it and write a $β_{t}$ -fraction of the new value: the net write is $β_{t} (v_{t} - v_{t}^{old}) k_{t}^{⊤}$ — an error-correcting update. If memory already predicts $v_{t}$ perfectly, nothing is written at all.

So where vanilla linear attention blindly accumulates and Mamba2 fades everything uniformly, DeltaNet performs a targeted replacement along the direction of the current key, leaving (approximately) orthogonal memories untouched. With $β_{t} = 1$ it's a full overwrite of that key's slot; with $β_{t} = 0$ the token is ignored.
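The retrieve / erase / write reading translates directly into code. This is a minimal sketch under the same simplifying assumption as above (unit-norm keys, here orthogonal basis vectors); `delta_step` and the toy values are illustrative names and numbers, not from the paper.

```python
def matvec(S, x):
    return [sum(s * xi for s, xi in zip(row, x)) for row in S]

def delta_step(S, k, v, beta):
    """S_t = S_{t-1} + beta * (v - S_{t-1} k) k^T."""
    v_old = matvec(S, k)                                  # retrieve
    return [[s + beta * (vi - oi) * kj for s, kj in zip(row, k)]
            for row, vi, oi in zip(S, v, v_old)]          # erase + write as one correction

e1, e2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]   # two orthonormal keys (d_k = 3)
S = [[0.0] * 3 for _ in range(2)]           # d_v = 2

S = delta_step(S, e1, [1.0, 0.0], beta=1.0)   # key e1 -> "Alice"
S = delta_step(S, e2, [5.0, 5.0], beta=1.0)   # unrelated association under e2
S = delta_step(S, e1, [0.0, 1.0], beta=1.0)   # key e1 -> "Bob": full overwrite

read_e1 = matvec(S, e1)   # [0.0, 1.0]: only Bob remains
read_e2 = matvec(S, e2)   # [5.0, 5.0]: untouched

# Writing a pair the memory already predicts changes nothing.
S_again = delta_step(S, e1, [0.0, 1.0], beta=1.0)
print(read_e1, read_e2, S_again == S)
```

Compare with the purely additive write, which would have returned a sum of both values under `e1`.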

This is why DeltaNet dominates gated architectures on associative recall benchmarks (MQAR, in-context retrieval): updating "key X now maps to value Y" is its native operation.

3.3 The geometry: a generalized Householder transform

The transition matrix applied to the old state is $I - β_{t} k_{t} k_{t}^{⊤}$ . For $‖ k_{t} ‖ = 1$ and $β_{t} \in [0, 2]$ this is a (generalized) Householder matrix: identity in all directions orthogonal to $k_{t}$ , and a scaling by $1 - β_{t}$ along $k_{t}$ (a contraction for $0 < β_{t} < 2$ , a reflection at $β_{t} = 2$ ). Two consequences:

DeltaNet's state transition is not diagonal — it's identity-plus-rank-1. This is strictly more expressive than the diagonal/scalar transitions of Mamba-family models (it can express things like state swaps that diagonal transitions provably cannot), which is the formal reason DeltaNet-style models climb a rung on expressivity hierarchies.
Eigenvalues are $1$ (multiplicity $d_{k} - 1$ ) and $1 - β_{t}$ . All in $[- 1, 1]$ for $β_{t} \in [0, 2]$ , so the recurrence is stable.
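The eigenvalue claim can be checked on a tiny example. The sketch below uses a hypothetical unit key `k = [0.6, 0.8]` and an orthogonal direction `u`, and confirms that the transition scales `k` by $1 - β$ while leaving `u` alone.

```python
def householder(k, beta):
    """Build I - beta * k k^T for a key k (assumed unit-norm)."""
    d = len(k)
    return [[(1.0 if i == j else 0.0) - beta * k[i] * k[j] for j in range(d)]
            for i in range(d)]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

k = [0.6, 0.8]     # unit norm: 0.36 + 0.64 = 1
u = [-0.8, 0.6]    # orthogonal to k
beta = 0.5

H = householder(k, beta)
Hk, Hu = matvec(H, k), matvec(H, u)

# Along k: eigenvalue 1 - beta. Orthogonal to k: eigenvalue 1.
gap_k = max(abs(a - (1.0 - beta) * b) for a, b in zip(Hk, k))
gap_u = max(abs(a - b) for a, b in zip(Hu, u))

# At beta = 2 the matrix reflects k to -k.
Rk = matvec(householder(k, 2.0), k)
gap_reflect = max(abs(a + b) for a, b in zip(Rk, k))
print(gap_k, gap_u, gap_reflect)
```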

The flip side of "identity in orthogonal directions": DeltaNet never forgets globally. A memory written at key $k$ persists until a sufficiently aligned key arrives to overwrite it. There is no "new document, flush the cache" operation. Empirically DeltaNet is great at recall but mediocre at tasks where gated models shine — language modeling perplexity, contexts with hard topic switches, length extrapolation — precisely the regimes where uniform decay helps.

So we have perfectly complementary failure modes:

	Mamba2 (gate $α_{t}$ )	DeltaNet (delta rule $β_{t}$ )
Global forgetting / context switch	✅	❌
Precise per-key overwrite (recall)	❌	✅
State transition structure	scalar × identity	identity − rank-1

The obvious move is to take both. That's Gated DeltaNet.

Part 4 — Gated DeltaNet: the gated delta rule

4.1 The update rule

Gated DeltaNet equips the delta rule with Mamba2's scalar forget gate. Per token, the network predicts (from the input) a gate $α_{t} \in (0, 1)$ , a learning rate $β_{t} \in (0, 1)$ , and the usual $q_{t}, k_{t}, v_{t}$ :

$S_{t} = S_{t - 1} α_{t} (I - β_{t} k_{t} k_{t}^{⊤}) + β_{t} v_{t} k_{t}^{⊤}, o_{t} = S_{t} q_{t}$

Equivalently: first decay everything by $α_{t}$ , then perform the delta-rule erase-and-write on the decayed state. Expanding the product gives $S_{t} = α_{t} S_{t - 1} - β_{t} (α_{t} S_{t - 1} k_{t} - v_{t}) k_{t}^{⊤}$ , which is one step of online gradient descent on $\frac{1}{2} ‖ S k_{t} - v_{t} ‖^{2}$ taken from the decayed state $α_{t} S_{t - 1}$ : $α_{t}$ acts as an adaptive weight decay on the fast weights, $β_{t}$ as the adaptive learning rate. (This "memory update = online optimizer" framing is the seed of the follow-up line of work: test-time training, Titans, DeltaProduct, etc.)

Special cases make the lineage explicit:

$β_{t} \to 0$ in the transition, keep the write term ⇒ Mamba2: $S_{t} = α_{t} S_{t - 1} + β_{t} v_{t} k_{t}^{⊤}$ .
$α_{t} = 1$ ⇒ DeltaNet.
$α_{t} = 1, β_{t}$ absorbed ⇒ vanilla linear attention.
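As a sanity check on the update rule and its special cases, the sketch below implements one step two ways: as "decay, then delta-rule correction" and as the literal matrix formula $S_{t - 1} α_{t} (I - β_{t} k_{t} k_{t}^{⊤}) + β_{t} v_{t} k_{t}^{⊤}$ . All function names and toy numbers are illustrative assumptions; a real layer would run a fused kernel.

```python
def matvec(S, x):
    return [sum(s * xi for s, xi in zip(row, x)) for row in S]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gated_delta_step(S, k, v, alpha, beta):
    """Decay the whole state by alpha, then apply the delta-rule correction."""
    decayed = [[alpha * s for s in row] for row in S]
    v_old = matvec(decayed, k)
    return [[s + beta * (vi - oi) * kj for s, kj in zip(row, k)]
            for row, vi, oi in zip(decayed, v, v_old)]

def gated_delta_matrix_form(S, k, v, alpha, beta):
    """Literal formula: S * alpha * (I - beta k k^T) + beta v k^T."""
    d = len(k)
    T = [[alpha * ((1.0 if i == j else 0.0) - beta * k[i] * k[j]) for j in range(d)]
         for i in range(d)]
    ST = matmul(S, T)
    return [[st + beta * vi * kj for st, kj in zip(row, k)] for row, vi in zip(ST, v)]

def max_gap(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

S = [[1.0, 2.0], [3.0, 4.0]]
k, v = [0.6, 0.8], [1.0, -1.0]

gap_general = max_gap(gated_delta_step(S, k, v, 0.9, 0.5),
                      gated_delta_matrix_form(S, k, v, 0.9, 0.5))

# alpha = 1 is the plain delta rule: S + beta * (v - S k) k^T.
Sk = matvec(S, k)
plain_delta = [[s + 0.5 * (vi - oi) * kj for s, kj in zip(row, k)]
               for row, vi, oi in zip(S, v, Sk)]
gap_deltanet = max_gap(gated_delta_step(S, k, v, 1.0, 0.5), plain_delta)

# alpha = 0 wipes the old state: only the fresh write beta * v k^T survives.
fresh_write = [[0.5 * vi * kj for kj in k] for vi in v]
gap_reset = max_gap(gated_delta_step(S, k, v, 0.0, 0.5), fresh_write)
print(gap_general, gap_deltanet, gap_reset)
```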

4.2 Why this combination behaves well

The model now has two independent, content-controlled knobs each step:

$α_{t}$ (clear the board): seeing an end-of-document token or a topic shift, the model can drive $α_{t}$ small and rapidly flush stale state — something pure DeltaNet structurally can't do.
$β_{t}$ (edit one slot): seeing "X is now Y," the model can overwrite the association at key $k_{t}$ without disturbing the rest of memory — something pure Mamba2 structurally can't do.

Empirically (1.3B-parameter models in the paper), Gated DeltaNet beats both Mamba2 and DeltaNet on language modeling perplexity, zero-shot commonsense suites, in-context retrieval, length extrapolation, and long-context (LongBench) tasks. The recall-heavy tasks show the delta rule's contribution; the long-context and extrapolation results show the gate's.

4.3 The layer around the recurrence

The recurrence lives inside a fairly standard modern token-mixing block (mirroring Mamba2/GLA conventions):

$q_{t}, k_{t}$ from linear projections, passed through SiLU and L2-normalized (normalizing $k$ keeps the Householder transition well-conditioned and the eigenvalue story of §3.3 valid).
$v_{t}$ from a linear projection; multiple heads, each with its own $S_{t}$ .
$α_{t} = \exp (- Δ_{t} \cdot \mathrm{softplus} (a)) \cdot σ (\cdot)$ -style parameterization inherited from Mamba2's discretization (i.e., the gate is produced the same way Mamba2 produces ${\bar{A}}_{t}$ ), $β_{t} = σ (w_{β}^{⊤} x_{t})$ .
A short depthwise causal convolution (kernel ~4) over $q, k, v$ paths — borrowed from Mamba; cheap and consistently helpful.
Output gating + RMSNorm on $o_{t}$ before the output projection (the "gated output" of GLA/Mamba2 blocks — note this output gate is a different thing from the state gate $α_{t}$ ).

4.4 Hybrids: Gated DeltaNet-H1 / H2

Because a fixed-size state can never match exact lookup over arbitrarily long contexts, the paper also builds hybrids that interleave Gated DeltaNet layers with sliding-window attention (H1) or with Mamba2 layers + a few SWA layers (H2). The local-attention layers handle precise short-range token interactions; the recurrent layers carry compressed long-range state. Hybrids improve quality further while keeping inference cost essentially linear. This recipe is what production models adopted — e.g., Qwen3-Next interleaves Gated DeltaNet layers with full-attention layers at a 3:1 ratio, and Kimi Linear builds on a Gated DeltaNet variant (KDA).

Part 5 — Why it's still fast: chunkwise parallel training

A recurrence that must run strictly token-by-token would waste GPUs during training. The standard solution across this family is chunkwise parallelism: split the sequence into chunks of length $C$ (e.g., 64); within a chunk, compute everything with dense matmuls (parallel, tensor-core friendly); across chunks, pass the state $S$ recurrently. Cost is roughly $O (L C d + L d^{2})$ for head dimension $d$ , which is linear in $L$ .

The complication unique to (Gated) DeltaNet is that its transition matrices $α_{t} (I - β_{t} k_{t} k_{t}^{⊤})$ don't commute and aren't diagonal, so naively composing them inside a chunk looks expensive. The fix (from the earlier "Parallelizing Linear Transformers with the Delta Rule" paper) is the classical WY representation of products of Householder matrices: a product $\prod_{i} (I - β_{i} k_{i} k_{i}^{⊤})$ can be written as $I - \sum_{i} w_{i} k_{i}^{⊤}$ — identity minus a low-rank correction — computable with triangular matmuls (the "UT transform") instead of sequential rank-1 updates. Gated DeltaNet extends this WY machinery to carry the cumulative gate products $\prod α_{i}$ through the same algebra, at negligible extra cost. The upshot:

Training throughput comparable to Mamba2, better than DeltaNet's original kernel.
Inference: constant memory, one small matmul-and-update per token.
Implementations live in the flash-linear-attention (FLA) library and NVIDIA's released kernels.

You don't need the kernel details to use or reason about the model — just know that "Householder structure ⇒ WY trick ⇒ matmul-rich chunkwise form" is what makes the delta rule practical at scale.
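For readers who do want to see the WY identity hold, here is a small numerical check. It is only a sketch of the algebra, not the kernel: the vectors $w_{i}$ are built with a simple sequential recursion, $w_{t} = β_{t} (k_{t} - \sum_{i < t} w_{i} (k_{i}^{⊤} k_{t}))$ , which follows from multiplying $I - \sum_{i < t} w_{i} k_{i}^{⊤}$ on the right by $I - β_{t} k_{t} k_{t}^{⊤}$ . Real implementations obtain the same vectors with triangular matmuls. Function names and toy data are assumptions.

```python
def identity(d):
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def householder_product(ks, betas):
    """Sequential product (I - b_1 k_1 k_1^T)(I - b_2 k_2 k_2^T)..."""
    d = len(ks[0])
    P = identity(d)
    for k, b in zip(ks, betas):
        H = [[(1.0 if i == j else 0.0) - b * k[i] * k[j] for j in range(d)]
             for i in range(d)]
        P = matmul(P, H)
    return P

def wy_vectors(ks, betas):
    """w_t = beta_t * (k_t - sum_{i<t} w_i (k_i . k_t))."""
    ws = []
    for k, b in zip(ks, betas):
        w = list(k)
        for w_i, k_i in zip(ws, ks):          # pairs (w_i, k_i) for i < t
            c = sum(x * y for x, y in zip(k_i, k))
            w = [wj - c * wij for wj, wij in zip(w, w_i)]
        ws.append([b * wj for wj in w])
    return ws

def wy_product(ks, ws):
    """I - sum_i w_i k_i^T."""
    P = identity(len(ks[0]))
    for w, k in zip(ws, ks):
        P = [[p - w[i] * k[j] for j, p in enumerate(row)] for i, row in enumerate(P)]
    return P

# Three non-orthogonal unit keys, so the products genuinely interact.
ks = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.6, 0.8]]
betas = [0.5, 1.0, 0.25]

direct = householder_product(ks, betas)
compact = wy_product(ks, wy_vectors(ks, betas))
wy_gap = max(abs(a - b) for ra, rb in zip(direct, compact) for a, b in zip(ra, rb))
print(wy_gap)
```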

Part 6 — One-page summary

The whole tutorial compresses into a table of write rules for the matrix memory $S_{t} \in ℝ^{d_{v} \times d_{k}}$ , read out as $o_{t} = S_{t} q_{t}$ :

Model	Update rule	Forgetting	Per-key editing
Linear attention	$S_{t} = S_{t - 1} + v_{t} k_{t}^{⊤}$	none	none
RetNet	$S_{t} = γ S_{t - 1} + v_{t} k_{t}^{⊤}$	fixed decay	none
Mamba2 / gated LA	$S_{t} = α_{t} S_{t - 1} + v_{t} k_{t}^{⊤}$	learned, global	none
DeltaNet	$S_{t} = S_{t - 1} (I - β_{t} k_{t} k_{t}^{⊤}) + β_{t} v_{t} k_{t}^{⊤}$	none	yes
Gated DeltaNet	$S_{t} = S_{t - 1} α_{t} (I - β_{t} k_{t} k_{t}^{⊤}) + β_{t} v_{t} k_{t}^{⊤}$	learned, global	yes

Mental model in one sentence: Gated DeltaNet is an online learner whose "weights" are the memory matrix — $β_{t}$ is its per-token learning rate (delta rule), $α_{t}$ is its per-token weight decay (Mamba2's gate) — wrapped in a hardware-efficient chunkwise algorithm.

Self-check questions

Why can't vanilla linear attention ever "correct" a stored association, even in principle?
In Mamba2's update, what happens to retrieval quality when many keys are nearly parallel? Why does the delta rule help exactly there?
Show that with $‖ k_{t} ‖ = 1$ , the Gated DeltaNet transition matrix has eigenvalues $α_{t}$ (with multiplicity $d_{k} - 1$ ) and $α_{t} (1 - β_{t})$ . What does each eigenvalue "do" to memory?
Why does restricting Mamba2's $A$ to a scalar enable a matmul-based (attention-dual) formulation, and why doesn't Mamba1's diagonal $A$ enjoy the same trick as cleanly?
Why do hybrid layouts (Gated DeltaNet + sliding-window attention) outperform pure recurrent stacks on retrieval-heavy long-context tasks?

Opus 4.8 High

Understanding Gated DeltaNet (from the ground up)

This tutorial builds Gated DeltaNet layer by layer. Each section only assumes what came before it. The plan:

Linear attention — the shared foundation everything else modifies.
Mamba / Mamba2 — what "gating" means and the prerequisites you asked for.
DeltaNet — what the "delta rule" means.
Gated DeltaNet — fusing the two, which is literally what the paper does.
Chunkwise training — how it's made fast (the hard part, kept conceptual).
Why it works — the benchmark intuition.

Throughout, the state is a matrix $S \in ℝ^{d_{v} \times d_{k}}$ — think of it as an associative memory that maps keys to values. $d_{k}$ is the key/query head dimension, $d_{v}$ the value head dimension. Keep that picture in mind; it's the thread tying everything together.

1. The foundation: linear attention

1.1 Why standard attention is expensive

Softmax attention computes, for each query, a weighted sum over all keys/values:

$o_{t} = \sum_{i = 1}^{t} \frac{\exp (q_{t}^{⊤} k_{i})}{\sum_{j = 1}^{t} \exp (q_{t}^{⊤} k_{j})} v_{i}$

The problem is the $\exp (q_{t}^{⊤} k_{i})$ term. Because it couples every query to every key through the softmax, you cannot precompute anything reusable — each of the $L$ queries does $O (L)$ work, giving $O (L^{2})$ time and (during training) an $L \times L$ attention matrix in memory.

1.2 The trick: drop the softmax

Suppose we remove the softmax (or replace $\exp$ with a simple feature map, then ignore normalization for clarity). Attention becomes:

$o_{t} = \sum_{i = 1}^{t} (q_{t}^{⊤} k_{i}) v_{i} = \sum_{i = 1}^{t} v_{i} (k_{i}^{⊤} q_{t})$

The right-hand rewrite is the whole game. Because $k_{i}^{⊤} q_{t}$ is a scalar, each term can be regrouped as $(v_{i} k_{i}^{⊤}) q_{t}$ by associativity of matrix multiplication, and $q_{t}$ , which does not depend on $i$ , factors out of the sum:

$o_{t} = (\underset{S_{t}}{\underset{⏟}{\sum_{i = 1}^{t} v_{i} k_{i}^{⊤}}}) q_{t} = S_{t} q_{t}$

Now $S_{t} = \sum_{i \leq t} v_{i} k_{i}^{⊤}$ is a single matrix that summarizes the entire past. It doesn't grow with sequence length — it's always $d_{v} \times d_{k}$ .

1.3 The recurrent view (the mental model that matters)

Because $S_{t}$ is just $S_{t - 1}$ plus one more term, we get a recurrence:

$S_{t} = S_{t - 1} + v_{t} k_{t}^{⊤}, o_{t} = S_{t} q_{t}$

This is the form to internalize. At every timestep you:

Write the new association by adding an outer product $v_{t} k_{t}^{⊤}$ . Reading this back with a query $q$ that points at $k_{t}$ gives you roughly $v_{t}$ — that's why it's an associative memory.
Read by multiplying the accumulated state by the query.

This is now a linear RNN with a fixed-size matrix state, so inference needs $O (1)$ memory and compute per step and $O (L)$ total time for a length- $L$ sequence, with no quadratic blowup.

1.4 The catch that motivates everything else

$S$ stores key→value associations as a superposition of outer products. The number of (roughly orthogonal) associations you can pack into a $d_{k}$ -dimensional key space before they start interfering is bounded by $d_{k}$ . Once the sequence is longer than that, new writes collide with old ones — memory collisions — and exact retrieval degrades. Vanilla linear attention also never forgets: every write accumulates forever, so stale information piles up.

Two independent fixes to this problem are exactly what Mamba2 and DeltaNet represent. Gated DeltaNet combines them.

2. Mamba prerequisites: gating / adaptive forgetting

You asked specifically for the Mamba background, so here is the minimum you need, and why it matters for Gated DeltaNet.

2.1 What Mamba (S6) is, in one paragraph

Mamba comes from the state space model (SSM) lineage, not the attention lineage — but they converge. A classic SSM evolves a hidden state with fixed, input-independent dynamics: $h_{t} = A h_{t - 1} + B x_{t}$ , $y_{t} = C h_{t}$ . Mamba's key move ("S6", selective SSM) made $B$ , $C$ , and the timestep $Δ$ functions of the input. That data-dependence is what lets the model decide, per token, whether to absorb or ignore information — the "selective" in selective SSM. This is what made Mamba competitive with Transformers at language modeling while keeping linear-time, constant-memory inference.

2.2 Mamba2 and "state space duality"

Mamba2's central theoretical contribution (SSD — State Space Duality) is the observation that a selective SSM with a scalar state-transition can be written exactly as a form of linear attention with decay. That is: the SSM recurrence and the attention-style parallel form are two views of the same computation. This is what lets us discuss Mamba2 in the same language as Section 1.

Concretely, Mamba2's recurrence is linear attention with one extra term:

$S_{t} = α_{t} S_{t - 1} + v_{t} k_{t}^{⊤}, o_{t} = S_{t} q_{t}$

The only change from Section 1.3 is the scalar $α_{t} \in (0, 1)$ multiplying the previous state. That's the gate (also called a decay or forget gate).

2.3 What the gate actually does

$α_{t}$ is data-dependent — produced from the current input — and it scales down the entire state before the new write. Unrolling the recurrence makes the effect clear: an association written at time $i$ has been multiplied by $\prod_{j = i + 1}^{t} α_{j}$ by the time you reach $t$ . Old information decays geometrically. Define the cumulative product $γ_{t} = \prod_{j = 1}^{t} α_{j}$ ; then

$o_{t} = \sum_{i = 1}^{t} \frac{γ_{t}}{γ_{i}} v_{i} (k_{i}^{⊤} q_{t}) .$

So gating gives the model adaptive forgetting: when $α_{t}$ is small, it rapidly wipes the memory (useful at a topic/context switch); when $α_{t} \to 1$ , it keeps everything (back to vanilla linear attention). This directly addresses the "stale information piles up forever" half of Section 1.4.
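The unrolled formula can be checked against the recurrence directly. This is a minimal sketch with toy inputs; `gated_recurrent` and `gated_closed_form` are illustrative names, and the gate values are made up.

```python
def gated_recurrent(qs, ks, vs, alphas):
    """S_t = alpha_t * S_{t-1} + v_t k_t^T, o_t = S_t q_t."""
    d_v, d_k = len(vs[0]), len(ks[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v, a in zip(qs, ks, vs, alphas):
        S = [[a * s + vi * kj for s, kj in zip(row, k)] for row, vi in zip(S, v)]
        outputs.append([sum(s * x for s, x in zip(row, q)) for row in S])
    return outputs

def gated_closed_form(qs, ks, vs, alphas):
    """o_t = sum_{i<=t} (gamma_t / gamma_i) v_i (k_i . q_t), gamma_t = prod_{j<=t} alpha_j."""
    gammas, running = [], 1.0
    for a in alphas:
        running *= a
        gammas.append(running)
    outputs = []
    for t, q in enumerate(qs):
        o = [0.0] * len(vs[0])
        for i in range(t + 1):
            weight = (gammas[t] / gammas[i]) * sum(x * y for x, y in zip(ks[i], q))
            o = [oj + weight * vj for oj, vj in zip(o, vs[i])]
        outputs.append(o)
    return outputs

ks = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, -0.6]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
alphas = [0.9, 0.5, 0.99, 0.1]    # small alpha at the last step: a near-reset

rec = gated_recurrent(qs, ks, vs, alphas)
ref = gated_closed_form(qs, ks, vs, alphas)
gate_gap = max(abs(a - b) for ra, rb in zip(rec, ref) for a, b in zip(ra, rb))
print(gate_gap)
```

One practical caveat: the ratio $γ_{t} / γ_{i}$ involves products of many numbers below 1, which shrink toward zero on long sequences, so this closed form is for understanding rather than a numerically robust recipe.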

2.4 The limitation of gating (this is the whole reason Gated DeltaNet exists)

The decay is a single scalar applied uniformly to the whole state. If you want to forget one specific key→value association, you can't — shrinking $α_{t}$ fades everything equally. Gating is a blunt instrument: great at "clear the board," bad at "edit one cell." Hold that thought.

3. DeltaNet: precise, targeted updates

DeltaNet attacks the other half of the problem, the memory-collision / clumsy-write issue, using a classical idea.

3.1 The delta rule (Widrow–Hoff)

Vanilla linear attention writes by blindly adding $v_{t} k_{t}^{⊤}$ . The delta rule says: before writing, look at what the memory already returns for this key, and only write the correction.

Step through it. The value currently stored under key $k_{t}$ is what you'd read back: $v_{t}^{old} = S_{t - 1} k_{t}$ . We want to move it toward the target $v_{t}$ by a fraction $β_{t} \in (0, 1)$ (the writing strength), giving a new value $v_{t}^{new} = β_{t} v_{t} + (1 - β_{t}) v_{t}^{old}$ . We erase the old association and write the new one:

$S_{t} = S_{t - 1} - \underset{v_{t}^{old}}{\underset{⏟}{(S_{t - 1} k_{t})}} k_{t}^{⊤} + \underset{v_{t}^{new}}{\underset{⏟}{(β_{t} v_{t} + (1 - β_{t}) S_{t - 1} k_{t})}} k_{t}^{⊤}$

Collecting terms gives the clean form:

$S_{t} = S_{t - 1} (I - β_{t} k_{t} k_{t}^{⊤}) + β_{t} v_{t} k_{t}^{⊤}$

The matrix $(I - β_{t} k_{t} k_{t}^{⊤})$ is a generalized Householder transform: it selectively removes the component of the old memory that lies along $k_{t}$ , leaving everything orthogonal to $k_{t}$ untouched. Then $β_{t} v_{t} k_{t}^{⊤}$ writes the fresh association. This is the targeted edit that gating couldn't do.

3.2 Why this is "one step of gradient descent"

There's an illuminating second interpretation. Define a per-step loss measuring how well the memory recalls the right value for the current key:

$ℒ_{t} (S) = \frac{1}{2} ‖ S k_{t} - v_{t} ‖^{2}, \nabla_{S} ℒ_{t} = (S k_{t} - v_{t}) k_{t}^{⊤} .$

Take one gradient step from $S_{t - 1}$ with learning rate $β_{t}$ :

$S_{t} = S_{t - 1} - β_{t} (S_{t - 1} k_{t} - v_{t}) k_{t}^{⊤} = S_{t - 1} (I - β_{t} k_{t} k_{t}^{⊤}) + β_{t} v_{t} k_{t}^{⊤} .$

Identical to the update rule derived in Section 3.1. So DeltaNet is doing online gradient descent on an associative-recall objective at every token, with $β_{t}$ as the learning rate. This "fast weights as online learning" lens is the unifying framework the Gated DeltaNet paper leans on, and it's worth carrying forward.
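If the gradient formula feels like it appeared from nowhere, a finite-difference check makes it concrete. The sketch below (illustrative names and toy numbers, unit-norm key assumed) compares the analytic gradient $(S k - v) k^{⊤}$ with a numerical one, then takes one step and measures how much the recall loss shrinks. With $‖ k ‖ = 1$ the residual is scaled by $1 - β$ , so the loss is scaled by $(1 - β)^{2}$ .

```python
def loss(S, k, v):
    """Recall loss 0.5 * ||S k - v||^2."""
    pred = [sum(s * x for s, x in zip(row, k)) for row in S]
    return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, v))

def analytic_grad(S, k, v):
    """(S k - v) k^T."""
    pred = [sum(s * x for s, x in zip(row, k)) for row in S]
    return [[(p - t) * kj for kj in k] for p, t in zip(pred, v)]

def numeric_grad(S, k, v, eps=1e-6):
    """Central finite differences, one entry of S at a time."""
    grad = []
    for i in range(len(S)):
        row = []
        for j in range(len(S[0])):
            plus = [r[:] for r in S]
            minus = [r[:] for r in S]
            plus[i][j] += eps
            minus[i][j] -= eps
            row.append((loss(plus, k, v) - loss(minus, k, v)) / (2 * eps))
        grad.append(row)
    return grad

S = [[1.0, 2.0], [3.0, 4.0]]
k, v, beta = [0.6, 0.8], [1.0, -1.0], 0.5

g = analytic_grad(S, k, v)
g_num = numeric_grad(S, k, v)
grad_gap = max(abs(a - b) for ra, rb in zip(g, g_num) for a, b in zip(ra, rb))

# One gradient step with learning rate beta is exactly the delta-rule update.
S_next = [[s - beta * gij for s, gij in zip(row, g_row)] for row, g_row in zip(S, g)]
shrink = loss(S_next, k, v) / loss(S, k, v)    # expected (1 - beta)^2 = 0.25
print(grad_gap, shrink)
```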

3.3 What DeltaNet buys, and its limitation

The delta rule corrects errors and overwrites specific stale associations instead of letting them collide — it's strong at in-context retrieval and associative recall. But notice what's missing: there is no global decay term. DeltaNet can edit individual entries beautifully, yet it has no fast way to clear the whole context when the document switches topics. It is the mirror image of Mamba2: precise but unable to do bulk erasure.

4. Gated DeltaNet: combining the two

Now the punchline is almost trivial, which is the point — the paper's insight is that these two mechanisms are complementary, so you just put both in.

4.1 The gated delta rule

Take the DeltaNet update and apply the scalar gate $α_{t}$ to the transition on the previous state:

$S_{t} = S_{t - 1} (α_{t} (I - β_{t} k_{t} k_{t}^{⊤})) + β_{t} v_{t} k_{t}^{⊤}$

Read off the two knobs and their extremes:

$α_{t}$ — the forget gate (from Mamba2). Multiplies the whole state. $α_{t} \to 0$ promptly clears memory (bulk erasure); $α_{t} \to 1$ disables decay.
$β_{t}$ — the writing strength (from DeltaNet). Controls the targeted delta edit. $β_{t} \to 0$ means "don't touch this key"; larger means "overwrite confidently."

The corner cases recover the parents exactly, plus a hard reset:

Setting	Reduces to
$α_{t} \to 1$	pure DeltaNet (targeted edits, no forgetting)
$β_{t} \to 1$ , drop the Householder	pure Mamba2 gated rule
$α_{t} \to 0$	hard reset of memory

So the model gets both rapid bulk erasure and precise per-key editing, and it learns per-token when to use each.

4.2 The unifying online-learning view

Using the "recurrence = closed-form online update" framework, every model in this tutorial is one row of the same table — each differs only in its per-step objective. (Here $⟨ A, B ⟩$ denotes the Frobenius inner product.)

Model	Online objective	Resulting update
Linear attn	$‖ S_{t} - S_{t - 1} ‖_{F}^{2} - 2 ⟨ S_{t} k_{t}, v_{t} ⟩$	$S_{t} = S_{t - 1} + v_{t} k_{t}^{⊤}$
Mamba2	$‖ S_{t} - α_{t} S_{t - 1} ‖_{F}^{2} - 2 ⟨ S_{t} k_{t}, v_{t} ⟩$	$S_{t} = α_{t} S_{t - 1} + v_{t} k_{t}^{⊤}$
DeltaNet	$‖ S_{t} - S_{t - 1} ‖_{F}^{2} - 2 ⟨ S_{t} k_{t}, β_{t} (v_{t} - S_{t - 1} k_{t}) ⟩$	$S_{t} = S_{t - 1} (I - β_{t} k_{t} k_{t}^{⊤}) + β_{t} v_{t} k_{t}^{⊤}$
Gated DeltaNet	$‖ S_{t} - α_{t} S_{t - 1} ‖_{F}^{2} - 2 ⟨ S_{t} k_{t}, β_{t} (v_{t} - α_{t} S_{t - 1} k_{t}) ⟩$	$S_{t} = S_{t - 1} (α_{t} (I - β_{t} k_{t} k_{t}^{⊤})) + β_{t} v_{t} k_{t}^{⊤}$

The pattern: the first term is a retention regularizer ("stay close to the previous state"). Mamba2 and Gated DeltaNet relax it with $α_{t}$ , allowing controlled deviation from the past — that's forgetting expressed as loosening regularization. The second term is the recall objective; the delta variants fit the residual $v_{t} - α_{t} S_{t - 1} k_{t}$ rather than $v_{t}$ directly, which is what makes the write targeted.
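To see where the last row of the table comes from, here is a short worked derivation (a sketch that treats $S_{t - 1}$ , and therefore the residual target, as constants when minimizing over $S_{t}$ ). Write $\tilde{S} = α_{t} S_{t - 1}$ and $u_{t} = β_{t} (v_{t} - \tilde{S} k_{t})$ , so the objective is

$J (S_{t}) = ‖ S_{t} - \tilde{S} ‖_{F}^{2} - 2 ⟨ S_{t} k_{t}, u_{t} ⟩$

Its gradient with respect to $S_{t}$ is

$\nabla_{S_{t}} J = 2 (S_{t} - \tilde{S}) - 2 u_{t} k_{t}^{⊤}$

$J$ is a strictly convex quadratic in $S_{t}$ , so setting the gradient to zero gives the unique minimizer:

$S_{t} = \tilde{S} + u_{t} k_{t}^{⊤} = α_{t} S_{t - 1} + β_{t} (v_{t} - α_{t} S_{t - 1} k_{t}) k_{t}^{⊤} = S_{t - 1} (α_{t} (I - β_{t} k_{t} k_{t}^{⊤})) + β_{t} v_{t} k_{t}^{⊤}$

Setting $α_{t} = 1$ recovers the DeltaNet row, and replacing $u_{t}$ by $v_{t}$ recovers the Mamba2 and linear-attention rows.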

4.3 The actual neural block (practical details)

The recurrence above is the core. The full layer wraps it with the usual GLA/Mamba2-style machinery:

Short causal 1-D convolution on $q, k, v$ before the recurrence (local mixing, consistently helps these models).
SiLU activations, and crucially L2-normalization of the keys $k_{t}$ — the delta rule's stability depends on well-scaled keys.
$α_{t}$ is parameterized in the Mamba2 style (from a learned scalar $A$ and a data-dependent $Δ_{t}$ , so $α_{t} = \exp (- Δ_{t} \, \mathrm{softplus} (A))$ -ish); $β_{t}$ is a simple linear projection through a sigmoid.
Multi-head structure (each head its own small $S$ ), then an output gate + normalization before the projection out.

The paper also builds hybrids: interleaving Gated DeltaNet layers with a few sliding-window-attention or Mamba2 layers, which improves both quality and training throughput.

5. Making it fast: chunkwise training (conceptual)

This is the genuinely hard engineering and you can treat it as optional on a first pass — but here's the shape of it.

The dilemma. The pure recurrence (Sections 1–4) is sequential — bad for GPUs, which want big matrix multiplies. The pure parallel/attention form is matmul-heavy but $O (L^{2})$ . Neither is ideal for training.

The fix — chunkwise parallel form. Split the sequence into chunks of size $C$ . Then:

Between chunks, carry the state $S$ forward recurrently (only $L / C$ steps).
Within a chunk, compute everything as dense matmuls in parallel.

This gives linear time and tensor-core-friendly matmuls. For plain linear attention / Mamba2 this is straightforward. The complication for the delta rule is that within a chunk you have a product of Householder matrices $\prod_{i} (I - β_{i} k_{i} k_{i}^{⊤})$ , which is not a simple sum.
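For the straightforward case, the chunkwise idea fits in a few lines. The sketch below handles plain linear attention only (no gate, no delta rule) and uses explicit loops where a real kernel would use a causally masked matmul. Function names, the chunk size, and the toy data are assumptions for illustration.

```python
def recurrent(qs, ks, vs):
    """Token-by-token reference: S += v k^T, o = S q."""
    d_v, d_k = len(vs[0]), len(ks[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        S = [[s + vi * kj for s, kj in zip(row, k)] for row, vi in zip(S, v)]
        outputs.append([sum(s * x for s, x in zip(row, q)) for row in S])
    return outputs

def chunkwise(qs, ks, vs, chunk):
    """Same outputs, but the state S is only updated once per chunk."""
    d_v, d_k = len(vs[0]), len(ks[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for start in range(0, len(qs), chunk):
        Q = qs[start:start + chunk]
        K = ks[start:start + chunk]
        V = vs[start:start + chunk]
        for t, q in enumerate(Q):
            # Inter-chunk part: everything before this chunk, summarized by S.
            o = [sum(s * x for s, x in zip(row, q)) for row in S]
            # Intra-chunk part: causal attention over tokens 0..t of this chunk.
            for i in range(t + 1):
                score = sum(x * y for x, y in zip(q, K[i]))
                o = [oj + score * vj for oj, vj in zip(o, V[i])]
            outputs.append(o)
        # Carry the state across the chunk boundary.
        for k, v in zip(K, V):
            S = [[s + vi * kj for s, kj in zip(row, k)] for row, vi in zip(S, v)]
    return outputs

ks = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, -0.6], [1.0, 0.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0], [0.6, 0.8]]

ref = recurrent(qs, ks, vs)
out = chunkwise(qs, ks, vs, chunk=2)     # chunks of sizes 2, 2, 1
chunk_gap = max(abs(a - b) for ra, rb in zip(ref, out) for a, b in zip(ra, rb))
print(chunk_gap)
```

The intra-chunk loops are independent across chunks given the carried state, which is what lets them be batched into dense matmuls. Only the state hand-off is sequential.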

WY representation + UT transform. A classical result (Bischof & Van Loan) lets you write that product of Householder matrices compactly as $I - W^{⊤} K$ , and the chunk's accumulated write as $U^{⊤} K$ , where $W$ and $U$ are obtained by inverting a small $C \times C$ triangular matrix (the "UT transform"). This turns the awkward cumulative product into a few clean matmuls per chunk. Gated DeltaNet's technical contribution is folding the $α_{t}$ gates into this machinery — by decaying each vector to a chunk boundary (the " $\vec{\cdot}$ / $\overset{\leftarrow}{\cdot}$ " notation in the paper) so the chunkwise algorithm still works with the extra decay terms. Net result: same hardware efficiency as Mamba2/DeltaNet, now with both mechanisms.

You don't need to re-derive this to use or understand the model — but knowing it exists explains why Gated DeltaNet trains at speeds comparable to Mamba2 despite the richer update.

6. Why it works: the benchmark intuition

The cleanest evidence is the S-NIAH ("needle in a haystack") suite, which isolates two skills: long-term retention of a needle, and filtering out distractors. The reported behavior at long context lines up exactly with the theory above:

DeltaNet — strong retention (it can hold a precise association), but weak when distractors accumulate, because it has no bulk-clearing mechanism.
Mamba2 — good at filtering (the gate decays junk), but retention collapses on longer sequences because the same decay also fades the needle.
Gated DeltaNet — balances both: the delta rule preserves the needle while the gate clears distractors, so it stays strong on the hardest, longest cases (e.g. retrieving UUID-valued needles, where Mamba2 falls apart).

That's the entire thesis in one experiment: gating and the delta rule fix different failure modes, so combining them dominates either alone — across language modeling, commonsense reasoning, in-context retrieval, and length extrapolation, not just the synthetic probe.

7. One-page summary

Linear attention: $S_{t} = S_{t - 1} + v_{t} k_{t}^{⊤}$ , $o_{t} = S_{t} q_{t}$ . A fixed-size associative memory; linear time; but never forgets and suffers memory collisions.
Mamba2 (gating): $S_{t} = α_{t} S_{t - 1} + v_{t} k_{t}^{⊤}$ . Scalar data-dependent decay $α_{t}$ → adaptive bulk forgetting. Can't target one memory.
DeltaNet (delta rule): $S_{t} = S_{t - 1} (I - β_{t} k_{t} k_{t}^{⊤}) + β_{t} v_{t} k_{t}^{⊤}$ . One step of online gradient descent on recall; precise targeted edits via writing strength $β_{t}$ . Can't do bulk forgetting.
Gated DeltaNet: $S_{t} = S_{t - 1} (α_{t} (I - β_{t} k_{t} k_{t}^{⊤})) + β_{t} v_{t} k_{t}^{⊤}$ . Both knobs at once — $α_{t}$ for rapid erasure, $β_{t}$ for surgical updates. Trained efficiently via a gated extension of the WY/UT chunkwise algorithm.

If you remember one thing: the state $S$ is a key→value memory, and the literature is a sequence of better answers to "how should I write to and forget from this memory each step?" — add (linear attn) → decay-then-add (Mamba2) → edit-in-place (DeltaNet) → decay-and-edit (Gated DeltaNet).

Where to go next

Paper: Gated Delta Networks: Improving Mamba2 with Delta Rule (Yang, Kautz, Hatamizadeh; ICLR 2025), arXiv:2412.06464.
Prerequisite reading if a section felt thin: the original DeltaNet parallelization paper (Yang et al., 2024, "Parallelizing Linear Transformers with the Delta Rule"), and Mamba2 / SSD (Dao & Gu, 2024).
Code: the NVlabs/GatedDeltaNet repo, and the flash-linear-attention (FLA) library, which has tuned kernels for all of these models side by side — reading its implementations is the fastest way to make the chunkwise math concrete.