sankalp's blog

Gated DeltaNet, From First Principles (Fable-5 vs Opus 4.8)

A tutorial that builds up everything you need — linear attention, state-space models, Mamba, the delta rule — and then assembles them into Gated DeltaNet (Yang, Kautz & Hatamizadeh, 2024; ICLR 2025).


Part 0 — The problem everyone is trying to solve

A standard Transformer's attention has two costs that hurt at long context:

  1. Training cost scales quadratically: attending every token to every other token is O(L2) in sequence length L.
  2. Inference cost grows linearly in memory: the KV cache stores keys and values for every past token, so generating token t requires O(t) memory and O(t) compute per step.

The dream is a model that behaves like an RNN at inference time — a fixed-size state that gets updated once per token, giving O(1) memory and compute per generated token — while still being trainable in parallel like a Transformer. Linear attention, state-space models (Mamba), and DeltaNet are all members of this family. Gated DeltaNet is, in a precise sense, the merger of Mamba2 and DeltaNet, and to understand it you need both parents.

The unifying mental model for this whole family:

The model maintains a matrix-valued memory Stdv×dk — think of it as a little key→value lookup table compressed into a single matrix. Each architecture differs only in how it writes to this matrix at each timestep.

Keep that sentence in mind; the rest of the tutorial is just filling in the write rules.


Part 1 — Linear attention: attention as a fast-weight RNN

1.1 From softmax attention to linear attention

Causal softmax attention computes, for query qt:

ot=itexp(qtki)viitexp(qtki)

The exp(·) is what forces us to keep every ki,vi around. Linear attention (Katharopoulos et al., 2020) drops the exponential (or replaces it with a feature map ϕ, which we'll fold into k and q for simplicity):

ot=it(qtki)vi=(itvikiSt)qt

That regrouping is the whole trick. Define the state St=itviki and you get a recurrence:

St=St1+vtkt,ot=Stqt

This is an RNN whose hidden state is a dv×dk matrix, updated by adding a rank-1 outer product each step. At inference you only ever store St — constant memory regardless of context length.

1.2 The associative-memory reading

St is exactly a classical linear associative memory (a "fast weight" matrix, in Schmidhuber's 1990s terminology). Writing vtkt stores the association "key kt → value vt". Reading with Stqt retrieves a weighted blend of stored values: if qtki and keys are roughly orthonormal, Stqtvi. This reading is what makes the later "delta rule" idea natural.

1.3 What's wrong with vanilla linear attention

The update St=St1+vtkt is purely additive. Nothing is ever erased. Two failure modes follow:

The two repairs to this problem are exactly the two parents of Gated DeltaNet:

Repair Idea Architecture family
Gating (forget) Multiply St1 by a decay before adding Mamba / Mamba2, GLA, RetNet
Delta rule (replace) Erase the old value at this key before writing the new one DeltaNet

We take them one at a time. Gating is where the Mamba lineage comes in, so that's our prerequisite detour.


Part 2 — Mamba prerequisites: from state-space models to a scalar gate

You don't need all of Mamba to understand Gated DeltaNet, but you do need to see how the SSM lineage arrives at the update St=αtSt1+vtkt, because Gated DeltaNet's αt gate is precisely Mamba2's. This section builds that bridge.

2.1 Continuous state-space models

A (linear, time-invariant) state-space model from control theory maps an input signal u(t) to an output y(t) through a hidden state h(t)N:

h(t)=Ah(t)+Bu(t),y(t)=Ch(t)

AN×N governs how the state evolves on its own (its eigenvalues set decay rates / memory timescales), B controls how input enters, C how the state is read out.

2.2 Discretization

To run this on token sequences we discretize with a step size Δ. Using zero-order hold (ZOH):

A¯=exp(ΔA),B¯=(ΔA)1(exp(ΔA)I)ΔB

giving the discrete recurrence:

ht=A¯ht1+B¯ut,yt=Cht

Two things to internalize:

  1. A¯=exp(ΔA) means the state is exponentially decayed each step. With well-chosen A (e.g., negative real parts), A¯<1: this is a forgetting mechanism baked into the dynamics.
  2. Δ acts like a knob between "ignore this token" (Δ0 means A¯I, B¯0: state coasts unchanged) and "reset on this token" (Δ large: state dominated by current input).

2.3 S4 → Mamba: making the SSM selective

S4 (Gu et al., 2021) used fixed A,B,C,Δ per channel. Because everything is time-invariant, the recurrence equals a convolution and can be computed very fast — but the model treats every token identically. It cannot decide, based on content, "this token matters, remember it" or "this is filler, skip it."

Mamba (Mamba1, Gu & Dao 2023) made Bt, Ct, and crucially Δt functions of the current input token ("selective SSM"). Now A¯t=exp(ΔtA) varies per token: the model dynamically chooses how much of its state to keep versus overwrite, conditioned on what it's reading. This input-dependence breaks the convolution trick, so Mamba introduced a hardware-aware parallel scan to keep training fast.

If you map SSM language onto attention language, a tight analogy appears: Bt plays the role of key kt, Ct the role of query qt, ut the value, and A¯t is a data-dependent forget gate on the state.

2.4 Mamba2: collapsing the gate to a scalar

Mamba1's A is a diagonal matrix per channel — each state dimension decays at its own rate. Mamba2 (Dao & Gu, 2024, "Transformers are SSMs" / the SSD framework) restricts A further to a scalar times identity: every dimension of the state shares one decay αt(0,1) per timestep.

That sounds like a downgrade, but it buys an enormous structural win: with a scalar gate, the recurrence over the matrix state becomes

St=αtSt1+vtkt,ot=Stqt

which is exactly gated linear attention with a scalar data-dependent decay. Because αt is a scalar, the whole sequence computation can be rewritten as attention with a multiplicative causal decay mask (the "state-space duality"), which maps onto matrix multiplications — the thing GPUs and their tensor cores are actually fast at. Mamba2 trains faster than Mamba1 at larger state sizes for this reason.

This update rule — scalar gate αt, additive rank-1 write — is the entire Mamba inheritance Gated DeltaNet needs. Interpretation: at each token the model can globally fade its memory (small αt ≈ "context switch, clear the board"; αt1 ≈ "keep everything"), then add the new association on top.

2.5 What gating still can't do

Gating's weakness is that it is indiscriminate. αt scales the whole matrix. Suppose the model stored "the user's name → Alice" and later reads "actually, call me Bob." The right operation is surgical: erase the value stored at the key "user's name" and write "Bob," leaving everything else intact. A scalar gate can only fade all memories together to make room. Rapid erase-and-update of a single association requires nuking unrelated memories too. That surgical operation is the delta rule.


Part 3 — The delta rule: DeltaNet

3.1 Memory update as online learning

Reframe the memory problem as online regression: at each step we want the memory matrix S to map keys to values, i.e., minimize t(S)=12Sktvt2 on the current pair. Take one gradient step with learning rate βt:

St=St1βtSt(St1)=St1βt(St1ktvt)kt

This is the classical delta rule (Widrow & Hoff, 1960), applied per token with a data-dependent learning rate βt(0,1) predicted from the input. Rearranged into the form that will matter:

St=St1(Iβtktkt)+βtvtkt

3.2 Reading the update: retrieve, erase, write

Decompose it (assume kt=1 for intuition):

  1. Retrieve the value currently stored at key kt: vtold=St1kt.
  2. Erase a βt-fraction of it and write a βt-fraction of the new value: the net write is βt(vtvtold)kt — an error-correcting update. If memory already predicts vt perfectly, nothing is written at all.

So where vanilla linear attention blindly accumulates and Mamba2 fades everything uniformly, DeltaNet performs a targeted replacement along the direction of the current key, leaving (approximately) orthogonal memories untouched. With βt=1 it's a full overwrite of that key's slot; with βt=0 the token is ignored.

This is why DeltaNet dominates gated architectures on associative recall benchmarks (MQAR, in-context retrieval): updating "key X now maps to value Y" is its native operation.

3.3 The geometry: a generalized Householder transform

The transition matrix applied to the old state is Iβtktkt. For kt=1 and βt(0,2) this is a (generalized) Householder matrix: identity in all directions orthogonal to kt, and a contraction (or at βt=2, reflection) along kt. Two consequences:

3.4 DeltaNet's blind spot

The flip side of "identity in orthogonal directions": DeltaNet never forgets globally. A memory written at key k persists until a sufficiently aligned key arrives to overwrite it. There is no "new document, flush the cache" operation. Empirically DeltaNet is great at recall but mediocre at tasks where gated models shine — language modeling perplexity, contexts with hard topic switches, length extrapolation — precisely the regimes where uniform decay helps.

So we have perfectly complementary failure modes:

Mamba2 (gate αt) DeltaNet (delta rule βt)
Global forgetting / context switch
Precise per-key overwrite (recall)
State transition structure scalar × identity identity − rank-1

The obvious move is to take both. That's Gated DeltaNet.


Part 4 — Gated DeltaNet: the gated delta rule

4.1 The update rule

Gated DeltaNet equips the delta rule with Mamba2's scalar forget gate. Per token, the network predicts (from the input) a gate αt(0,1), a learning rate βt(0,1), and the usual qt,kt,vt:

St=St1αt(Iβtktkt)+βtvtktot=Stqt

Equivalently: first decay everything by αt, then perform the delta-rule erase-and-write on the decayed state. It's also exactly one step of online gradient descent on Sktvt2 with weight decayαt is an adaptive L2 decay on the fast weights, βt the adaptive learning rate. (This "memory update = online optimizer" framing is the seed of the follow-up line of work: test-time training, Titans, DeltaProduct, etc.)

Special cases make the lineage explicit:

4.2 Why this combination behaves well

The model now has two independent, content-controlled knobs each step:

Empirically (1.3B/3B-scale models in the paper), Gated DeltaNet beats both Mamba2 and DeltaNet on language modeling perplexity, zero-shot commonsense suites, in-context retrieval, length extrapolation, and long-context (LongBench) tasks. The recall-heavy tasks show the delta rule's contribution; the long-context and extrapolation results show the gate's.

4.3 The layer around the recurrence

The recurrence lives inside a fairly standard modern token-mixing block (mirroring Mamba2/GLA conventions):

4.4 Hybrids: Gated DeltaNet-H1 / H2

Because a fixed-size state can never match exact lookup over arbitrarily long contexts, the paper also builds hybrids that interleave Gated DeltaNet layers with sliding-window attention (H1) or with Mamba2 layers + a few SWA layers (H2). The local-attention layers handle precise short-range token interactions; the recurrent layers carry compressed long-range state. Hybrids improve quality further while keeping inference cost essentially linear. This recipe is what production models adopted — e.g., Qwen3-Next interleaves Gated DeltaNet layers with full-attention layers at a 3:1 ratio, and Kimi Linear builds on a Gated DeltaNet variant (KDA).


Part 5 — Why it's still fast: chunkwise parallel training

A recurrence that must run strictly token-by-token would waste GPUs during training. The standard solution across this family is chunkwise parallelism: split the sequence into chunks of length C (e.g., 64); within a chunk, compute everything with dense matmuls (parallel, tensor-core friendly); across chunks, pass the state S recurrently. Cost is O(LCd2)-ish — linear in L.

The complication unique to (Gated) DeltaNet is that its transition matrices αt(Iβtktkt) don't commute and aren't diagonal, so naively composing them inside a chunk looks expensive. The fix (from the earlier "Parallelizing Linear Transformers with the Delta Rule" paper) is the classical WY representation of products of Householder matrices: a product i(Iβikiki) can be written as Iiwiki — identity minus a low-rank correction — computable with triangular matmuls (the "UT transform") instead of sequential rank-1 updates. Gated DeltaNet extends this WY machinery to carry the cumulative gate products αi through the same algebra, at negligible extra cost. The upshot:

You don't need the kernel details to use or reason about the model — just know that "Householder structure ⇒ WY trick ⇒ matmul-rich chunkwise form" is what makes the delta rule practical at scale.


Part 6 — One-page summary

The whole tutorial compresses into a table of write rules for the matrix memory Stdv×dk, read out as ot=Stqt:

Model Update rule Forgetting Per-key editing
Linear attention St=St1+vtkt none none
RetNet St=γSt1+vtkt fixed decay none
Mamba2 / gated LA St=αtSt1+vtkt learned, global none
DeltaNet St=St1(Iβtktkt)+βtvtkt none yes
Gated DeltaNet St=St1αt(Iβtktkt)+βtvtkt learned, global yes

Mental model in one sentence: Gated DeltaNet is an online learner whose "weights" are the memory matrix — βt is its per-token learning rate (delta rule), αt is its per-token weight decay (Mamba2's gate) — wrapped in a hardware-efficient chunkwise algorithm.

Suggested reading order

  1. Katharopoulos et al., 2020 — Transformers are RNNs (linear attention).
  2. Gu & Dao, 2023 — Mamba (selective SSMs); skim §2–3 for discretization and selectivity.
  3. Dao & Gu, 2024 — Transformers are SSMs (Mamba2 / SSD; the scalar gate and duality).
  4. Yang et al., 2024 — Parallelizing Linear Transformers with the Delta Rule over Sequence Length (DeltaNet + WY/UT chunkwise algorithm).
  5. Yang, Kautz & Hatamizadeh, 2024 — Gated Delta Networks: Improving Mamba2 with Delta Rule (the paper this tutorial targets).
  6. Optional onward: Titans / test-time-training memory, DeltaProduct (multiple Householders per token), Kimi Linear's KDA — all generalizations of the "memory update as online optimization" view you now have.

Self-check questions

  1. Why can't vanilla linear attention ever "correct" a stored association, even in principle?
  2. In Mamba2's update, what happens to retrieval quality when many keys are nearly parallel? Why does the delta rule help exactly there?
  3. Show that with kt=1, the Gated DeltaNet transition matrix has eigenvalues {αt (×dk1), αt(1βt)}. What does each eigenvalue "do" to memory?
  4. Why does restricting Mamba2's A to a scalar enable a matmul-based (attention-dual) formulation, and why doesn't Mamba1's diagonal A enjoy the same trick as cleanly?
  5. Why do hybrid layouts (Gated DeltaNet + sliding-window attention) outperform pure recurrent stacks on retrieval-heavy long-context tasks?

Opus 4.8 High

Understanding Gated DeltaNet (from the ground up)

This tutorial builds Gated DeltaNet layer by layer. Each section only assumes what came before it. The plan:

  1. Linear attention — the shared foundation everything else modifies.
  2. Mamba / Mamba2 — what "gating" means and the prerequisites you asked for.
  3. DeltaNet — what the "delta rule" means.
  4. Gated DeltaNet — fusing the two, which is literally what the paper does.
  5. Chunkwise training — how it's made fast (the hard part, kept conceptual).
  6. Why it works — the benchmark intuition.

Throughout, the state is a matrix Sdv×dk — think of it as an associative memory that maps keys to values. dk is the key/query head dimension, dv the value head dimension. Keep that picture in mind; it's the thread tying everything together.


1. The foundation: linear attention

1.1 Why standard attention is expensive

Softmax attention computes, for each query, a weighted sum over all keys/values:

ot=i=1texp(qtki)jexp(qtkj)vi

The problem is the exp(qtki) term. Because it couples every query to every key through the softmax, you cannot precompute anything reusable — each of the L queries does O(L) work, giving O(L2) time and (during training) an L×L attention matrix in memory.

1.2 The trick: drop the softmax

Suppose we remove the softmax (or replace exp with a simple feature map, then ignore normalization for clarity). Attention becomes:

ot=i=1t(qtki)vi=i=1tvi(kiqt)

The right-hand rewrite is the whole game. Associativity of matrix multiplication lets us regroup: instead of computing scalar scores kiqt first, pull the sum over i inward:

ot=(i=1tvikiSt)qt=Stqt

Now St=itviki is a single matrix that summarizes the entire past. It doesn't grow with sequence length — it's always dv×dk.

1.3 The recurrent view (the mental model that matters)

Because St is just St1 plus one more term, we get a recurrence:

St=St1+vtkt,ot=Stqt

This is the form to internalize. At every timestep you:

This is now a linear RNN with a fixed-size matrix state, so inference is O(1) memory per step and O(L) total — no quadratic blowup.

1.4 The catch that motivates everything else

S stores key→value associations as a superposition of outer products. The number of (roughly orthogonal) associations you can pack into a dk-dimensional key space before they start interfering is bounded by dk. Once the sequence is longer than that, new writes collide with old ones — memory collisions — and exact retrieval degrades. Vanilla linear attention also never forgets: every write accumulates forever, so stale information piles up.

Two independent fixes to this problem are exactly what Mamba2 and DeltaNet represent. Gated DeltaNet combines them.


2. Mamba prerequisites: gating / adaptive forgetting

You asked specifically for the Mamba background, so here is the minimum you need, and why it matters for Gated DeltaNet.

2.1 What Mamba (S6) is, in one paragraph

Mamba comes from the state space model (SSM) lineage, not the attention lineage — but they converge. A classic SSM evolves a hidden state with fixed, input-independent dynamics: ht=Aht1+Bxt, yt=Cht. Mamba's key move ("S6", selective SSM) made B, C, and the timestep Δ functions of the input. That data-dependence is what lets the model decide, per token, whether to absorb or ignore information — the "selective" in selective SSM. This is what made Mamba competitive with Transformers at language modeling while keeping linear-time, constant-memory inference.

2.2 Mamba2 and "state space duality"

Mamba2's central theoretical contribution (SSD — State Space Duality) is the observation that a selective SSM with a scalar state-transition can be written exactly as a form of linear attention with decay. That is: the SSM recurrence and the attention-style parallel form are two views of the same computation. This is what lets us discuss Mamba2 in the same language as Section 1.

Concretely, Mamba2's recurrence is linear attention with one extra term:

St=αtSt1+vtkt,ot=Stqt

The only change from Section 1.3 is the scalar αt(0,1) multiplying the previous state. That's the gate (also called a decay or forget gate).

2.3 What the gate actually does

αt is data-dependent — produced from the current input — and it scales down the entire state before the new write. Unrolling the recurrence makes the effect clear: an association written at time i has been multiplied by j=i+1tαj by the time you reach t. Old information decays geometrically. Define the cumulative product γt=j=1tαj; then

ot=i=1tγtγivi(kiqt).

So gating gives the model adaptive forgetting: when αt is small, it rapidly wipes the memory (useful at a topic/context switch); when αt1, it keeps everything (back to vanilla linear attention). This directly addresses the "stale information piles up forever" half of Section 1.4.

2.4 The limitation of gating (this is the whole reason Gated DeltaNet exists)

The decay is a single scalar applied uniformly to the whole state. If you want to forget one specific key→value association, you can't — shrinking αt fades everything equally. Gating is a blunt instrument: great at "clear the board," bad at "edit one cell." Hold that thought.


3. DeltaNet: precise, targeted updates

DeltaNet attacks the other half of the problem — the memory-collision / clumsy- write issue — using a classical idea.

3.1 The delta rule (Widrow–Hoff)

Vanilla linear attention writes by blindly adding vtkt. The delta rule says: before writing, look at what the memory already returns for this key, and only write the correction.

Step through it. The value currently stored under key kt is what you'd read back: vtold=St1kt. We want to move it toward the target vt by a fraction βt(0,1) (the writing strength), giving a new value vtnew=βtvt+(1βt)vtold. We erase the old association and write the new one:

$St=St1(St1kt)vtoldkt+(βtvt+(1βt)St1kt)vtnewkt$

Collecting terms gives the clean form:

St=St1(Iβtktkt)+βtvtkt

The matrix (Iβtktkt) is a generalized Householder transform: it selectively removes the component of the old memory that lies along kt, leaving everything orthogonal to kt untouched. Then βtvtkt writes the fresh association. This is the targeted edit that gating couldn't do.

3.2 Why this is "one step of gradient descent"

There's an illuminating second interpretation. Define a per-step loss measuring how well the memory recalls the right value for the current key:

$t(S)=12Sktvt2,St=(Sktvt)kt.$

Take one gradient step from St1 with learning rate βt:

$St=St1βt(St1ktvt)kt=St1(Iβtktkt)+βtvtkt.$

Identical to the boxed rule. So DeltaNet is doing online gradient descent on an associative-recall objective at every token, with βt as the learning rate. This "fast weights as online learning" lens is the unifying framework the Gated DeltaNet paper leans on, and it's worth carrying forward.

3.3 What DeltaNet buys, and its limitation

The delta rule corrects errors and overwrites specific stale associations instead of letting them collide — it's strong at in-context retrieval and associative recall. But notice what's missing: there is no global decay term. DeltaNet can edit individual entries beautifully, yet it has no fast way to clear the whole context when the document switches topics. It is the mirror image of Mamba2: precise but unable to do bulk erasure.


4. Gated DeltaNet: combining the two

Now the punchline is almost trivial, which is the point — the paper's insight is that these two mechanisms are complementary, so you just put both in.

4.1 The gated delta rule

Take the DeltaNet update and apply the scalar gate αt to the transition on the previous state:

St=St1(αt(Iβtktkt))+βtvtkt

Read off the two knobs and their extremes:

The two corner cases recover the parents exactly:

Setting Reduces to
αt1 pure DeltaNet (targeted edits, no forgetting)
βt1, drop the Householder pure Mamba2 gated rule
αt0 hard reset of memory

So the model gets both rapid bulk erasure and precise per-key editing, and it learns per-token when to use each.

4.2 The unifying online-learning view

Using the "recurrence = closed-form online update" framework, every model in this tutorial is one row of the same table — each differs only in its per-step objective. (Here A,B denotes the Frobenius inner product.)

Model Online objective Resulting update
Linear attn StSt1F22Stkt,vt St=St1+vtkt
Mamba2 StαtSt1F22Stkt,vt St=αtSt1+vtkt
DeltaNet StSt1F22Stkt,βt(vtSt1kt) St=St1(Iβtktkt)+βtvtkt
Gated DeltaNet StαtSt1F22Stkt,βt(vtαtSt1kt) St=St1(αt(Iβtktkt))+βtvtkt

The pattern: the first term is a retention regularizer ("stay close to the previous state"). Mamba2 and Gated DeltaNet relax it with αt, allowing controlled deviation from the past — that's forgetting expressed as loosening regularization. The second term is the recall objective; the delta variants fit the residual vtαtSt1kt rather than vt directly, which is what makes the write targeted.

4.3 The actual neural block (practical details)

The recurrence above is the core. The full layer wraps it with the usual GLA/Mamba2-style machinery:

The paper also builds hybrids: interleaving Gated DeltaNet layers with a few sliding-window-attention or Mamba2 layers, which improves both quality and training throughput.


5. Making it fast: chunkwise training (conceptual)

This is the genuinely hard engineering and you can treat it as optional on a first pass — but here's the shape of it.

The dilemma. The pure recurrence (Sections 1–4) is sequential — bad for GPUs, which want big matrix multiplies. The pure parallel/attention form is matmul-heavy but O(L2). Neither is ideal for training.

The fix — chunkwise parallel form. Split the sequence into chunks of size C. Then:

This gives linear time and tensor-core-friendly matmuls. For plain linear attention / Mamba2 this is straightforward. The complication for the delta rule is that within a chunk you have a product of Householder matrices i(Iβikiki), which is not a simple sum.

WY representation + UT transform. A classical result (Bischof & Van Loan) lets you write that product of Householder matrices compactly as IWK, and the chunk's accumulated write as UK, where W and U are obtained by inverting a small C×C triangular matrix (the "UT transform"). This turns the awkward cumulative product into a few clean matmuls per chunk. Gated DeltaNet's technical contribution is folding the αt gates into this machinery — by decaying each vector to a chunk boundary (the "· / ·" notation in the paper) so the chunkwise algorithm still works with the extra decay terms. Net result: same hardware efficiency as Mamba2/DeltaNet, now with both mechanisms.

You don't need to re-derive this to use or understand the model — but knowing it exists explains why Gated DeltaNet trains at speeds comparable to Mamba2 despite the richer update.


6. Why it works: the benchmark intuition

The cleanest evidence is the S-NIAH ("needle in a haystack") suite, which isolates two skills: long-term retention of a needle, and filtering out distractors. The reported behavior at long context lines up exactly with the theory above:

That's the entire thesis in one experiment: gating and the delta rule fix different failure modes, so combining them dominates either alone — across language modeling, commonsense reasoning, in-context retrieval, and length extrapolation, not just the synthetic probe.


7. One-page summary

If you remember one thing: the state S is a key→value memory, and the literature is a sequence of better answers to "how should I write to and forget from this memory each step?" — add (linear attn) → decay-then-add (Mamba2) → edit-in-place (DeltaNet) → decay-and-edit (Gated DeltaNet).


Where to go next