From SGD to Muon: An Incremental Tutorial (Fable-5 vs Opus 4.8)

09 Jun, 2026

Fable-5 High

You can find Opus 4.8 output after this.

From SGD to Muon: An Incremental Tutorial

This tutorial builds up to the Muon optimizer one idea at a time. Each section adds exactly one concept on top of the previous one, so by the time we reach Muon, every design decision should feel inevitable rather than mysterious.

The path we'll take:

Gradient descent, and weights as matrices
Momentum
Preconditioning (why Adam is a "diagonal" preconditioner)
Full-matrix AdaGrad — the ideal we can't afford
Shampoo — Kronecker-factored preconditioning
A surprising simplification: instantaneous Shampoo = orthogonalization
Why orthogonalizing the update is a good idea on its own
Muon — momentum + cheap orthogonalization via Newton–Schulz
Practical details, code, and what Muon doesn't handle

Notation: a weight matrix is $W \in ℝ^{m \times n}$ (e.g., a linear layer mapping $n$ inputs to $m$ outputs), its gradient is $G = \nabla_{W} ℒ$ , also $m \times n$ . Learning rate is $η$ .

Step 1: Gradient descent, and taking matrices seriously

Plain SGD updates every parameter the same way:

$W \leftarrow W - η G$

Notice something most optimizers ignore: in a neural network, parameters aren't an unstructured bag of numbers. The vast majority live in 2D weight matrices, where rows and columns have meaning — column $j$ of $W$ touches input feature $j$ ; row $i$ produces output feature $i$ .

SGD, Adam, RMSProp, etc. all treat $W$ as a flat vector of $m n$ independent scalars. The central theme of this tutorial is: what do we gain by treating $G$ as a matrix? Shampoo and Muon are two answers to that question.

Step 2: Momentum

Stochastic gradients are noisy. Momentum smooths them with an exponential moving average:

$B_{t} = μ B_{t - 1} + G_{t}, \qquad W \leftarrow W - η B_{t}$

with $μ \approx 0.9$ – $0.95$ . Two ways to think about it:

Signal processing view: $B_{t}$ is a low-pass filter over recent gradients; consistent directions accumulate, noise cancels.
Physics view: a heavy ball rolling downhill, coasting through small bumps.

A common refinement is Nesterov momentum, which uses a "lookahead" version of the buffer — effectively replacing the update direction $B_{t}$ with $G_{t} + μ B_{t}$ . It tends to help slightly and Muon uses it by default.
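A minimal NumPy sketch of the momentum buffer and the Nesterov-style direction described above. The helper name `momentum_direction` and the toy gradient (a steady signal in one coordinate, sign-flipping "noise" in the other) are illustrative assumptions, not part of any reference implementation.

```python
import numpy as np

def momentum_direction(buf, grad, mu=0.95, nesterov=True):
    """Update the momentum buffer in place and return the step direction."""
    buf *= mu
    buf += grad                                   # B_t = mu * B_{t-1} + G_t
    return grad + mu * buf if nesterov else buf.copy()

mu = 0.9
buf = np.zeros(2)
for t in range(200):
    # Toy gradient: steady signal in coordinate 0, sign-flipping noise in coordinate 1.
    grad = np.array([1.0, (-1.0) ** t])
    direction = momentum_direction(buf, grad, mu=mu, nesterov=False)

signal_gain = buf[0]        # approaches 1 / (1 - mu): consistent directions accumulate
noise_gain = abs(buf[1])    # approaches 1 / (1 + mu): alternating noise largely cancels

print(f"signal gain {signal_gain:.3f}, noise gain {noise_gain:.3f}")
```

In this toy setting the buffer amplifies the consistent coordinate far more than the oscillating one, which is the low-pass-filter picture in miniature.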

Keep momentum in your pocket. It's orthogonal (pun intended) to everything that follows, and Muon will be exactly "momentum + one extra step."

Step 3: Preconditioning — why one learning rate isn't enough

In real loss landscapes, curvature differs wildly across directions: the loss might be a steep ravine in one direction and nearly flat in another. A single learning rate $η$ must be small enough for the steepest direction, which makes progress in flat directions painfully slow.

The fix is to multiply the gradient by a matrix $P$ (the preconditioner) that rescales different directions differently:

$w \leftarrow w - η P g$ (thinking of everything as flat vectors for now)

The gold standard is Newton's method, $P = H^{- 1}$ (inverse Hessian), which makes the landscape look spherical. For a network with $d$ parameters, $H$ is $d \times d$ — utterly intractable when $d$ is billions.

Adam is a preconditioner too, just a very restricted one. Its second-moment estimate $v$ gives the update

$w \leftarrow w - η \frac{m}{\sqrt{v} + ϵ}$

which is $P = \mathrm{diag}(v)^{- 1 / 2}$ applied to the momentum $m$ (ignoring the small $ϵ$ ): a diagonal preconditioner. Each parameter gets its own learning rate, but Adam cannot rotate or mix directions; it can only stretch along the coordinate axes. If the ravine in your loss landscape runs diagonally to the axes, Adam can't see it.
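To make the "diagonal preconditioner" claim concrete, here is a small NumPy check that Adam's elementwise division is the same as multiplying by a diagonal matrix. The vectors `m` and `v` are random placeholders rather than real optimizer state, and the matrix `P` here includes the $ϵ$ term.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(size=5)               # stand-in for the first-moment estimate
v = rng.uniform(0.1, 2.0, size=5)    # stand-in for the second-moment estimate (positive)
eps = 1e-8

elementwise = m / (np.sqrt(v) + eps)          # Adam's update direction
P = np.diag(1.0 / (np.sqrt(v) + eps))         # the same operation written as a matrix
as_matrix = P @ m

same = bool(np.allclose(elementwise, as_matrix))
off_diagonal_mass = float(np.abs(P - np.diag(np.diag(P))).sum())

print(same, off_diagonal_mass)
```

Every off-diagonal entry of `P` is zero, which is exactly why this preconditioner can rescale coordinates but never mix them.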

So the design space looks like:

Preconditioner $P$	Cost	Power
$I$ (SGD)	free	none
diagonal (Adam)	$O (d)$	per-coordinate scaling
full matrix	$O (d^{2})$ +	arbitrary rotation + scaling

Shampoo lives in the unexplored middle.

Step 4: Full-matrix AdaGrad — the ideal we can't afford

AdaGrad (2011) proposed a principled data-driven preconditioner. Accumulate the outer products of all gradients seen so far,

$H_{t} = \sum_{s = 1}^{t} g_{s} g_{s}^{⊤} \in ℝ^{d \times d},$

and update with its inverse square root:

$w \leftarrow w - η H_{t}^{- 1 / 2} g_{t} .$

Intuition: $H_{t}$ measures how much gradient "energy" has flowed in each direction. Directions with consistently large gradients get shrunk; directions with little accumulated signal get amplified. (The familiar AdaGrad — and by extension Adam's $\sqrt{v}$ — is just the diagonal of this.)

Full-matrix AdaGrad has strong theory behind it, but storing $H_{t}$ ( $d \times d$ ) and computing $H_{t}^{- 1 / 2}$ is hopeless at scale. For a single 4096×4096 layer, $d = 16.7$ M and $H$ would have $~ 2.8 \times 10^{14}$ entries.

The question becomes: can we approximate $H^{- 1 / 2} g$ without ever forming $H$ ?
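For intuition, full-matrix AdaGrad is easy to run when $d$ is tiny. The sketch below is a toy with assumed values (a 6-dimensional problem with hand-picked gradient scales); it forms $H$ explicitly, takes its inverse square root by eigendecomposition, and also reproduces the entry count quoted above for a 4096×4096 layer.

```python
import numpy as np

def inv_sqrt_psd(H, eps=1e-8):
    """Inverse square root of a symmetric PSD matrix via eigendecomposition."""
    evals, evecs = np.linalg.eigh(H)
    return (evecs * (1.0 / np.sqrt(evals + eps))) @ evecs.T

rng = np.random.default_rng(0)
d = 6
scales = np.array([10.0, 10.0, 1.0, 1.0, 0.1, 0.1])   # very uneven gradient energy

H = np.zeros((d, d))
for _ in range(500):
    g = rng.normal(size=d) * scales
    H += np.outer(g, g)                # H_t = sum of g g^T

P = inv_sqrt_psd(H)                    # the preconditioner H^{-1/2}
whitened = P @ H @ P                   # close to the identity: energy is equalized

# Why this cannot scale: entries of H for one 4096 x 4096 layer.
d_layer = 4096 * 4096
entries = d_layer ** 2

print(np.round(np.diag(whitened), 6), f"{entries:.2e}")
```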

Step 5: Shampoo — exploit the matrix structure

Shampoo (Gupta, Koren & Singer, 2018) answers yes, if we stop flattening. Treat the gradient of each layer as the matrix $G_{t} \in ℝ^{m \times n}$ it actually is, and maintain two small accumulators instead of one giant one:

$L_{t} = L_{t - 1} + G_{t} G_{t}^{⊤}$ ( $m \times m$ , "row" statistics)

$R_{t} = R_{t - 1} + G_{t}^{⊤} G_{t}$ ( $n \times n$ , "column" statistics)

The update preconditions the gradient from both sides:

$W \leftarrow W - η L_{t}^{- 1 / 4} G_{t} R_{t}^{- 1 / 4}$

Where does this come from?

It's full-matrix AdaGrad with one structural assumption. The claim is that the giant $m n \times m n$ matrix $H_{t}$ is well approximated by a Kronecker product of the two small ones:

$H_{t} \approx L_{t}^{1 / 2} \otimes R_{t}^{1 / 2} .$

Kronecker products have two lovely properties:

$(A \otimes B)^{p} = A^{p} \otimes B^{p}$ , so the inverse square root we need is $L^{- 1 / 4} \otimes R^{- 1 / 4}$ — and there are your funny-looking $- 1 / 4$ exponents: each side contributes half of the overall $- 1 / 2$ power.
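Multiplying a Kronecker product into a flattened matrix is the same as multiplying the matrix from both sides: $(A \otimes B) \, \mathrm{vec}(G) = \mathrm{vec}(B G A^{⊤})$ for column-stacking $\mathrm{vec}$ (with row-major flattening the roles of $A$ and $B$ swap, which is the convention under which the factor order $L \otimes R$ above lines up). Either way, this is how the giant matrix–vector product collapses into the cheap two-sided update $L^{- 1 / 4} G R^{- 1 / 4}$ .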

So Shampoo = full-matrix AdaGrad, compressed through the assumption that row-space curvature and column-space curvature factorize. The memory drops from $O (m^{2} n^{2})$ to $O (m^{2} + n^{2})$ — for the 4096×4096 layer, from $2.8 \times 10^{14}$ entries to $3.4 \times 10^{7}$ . (The name is a joke: it's a preconditioner you apply on both sides, like shampoo... and conditioner.)
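Both Kronecker properties, and the memory arithmetic, can be checked numerically on a small case. This sketch assumes column-stacking `vec`, under which the factor order is $R \otimes L$; the helper names `vec` and `sym_power` and the random SPD matrices are illustrative only.

```python
import numpy as np

def vec(M):
    """Column-stacking vectorization."""
    return M.reshape(-1, order="F")

def sym_power(S, p):
    """S**p for a symmetric positive-definite matrix S."""
    evals, evecs = np.linalg.eigh(S)
    return (evecs * evals ** p) @ evecs.T

rng = np.random.default_rng(0)
m, n = 3, 4
G = rng.normal(size=(m, n))
X = rng.normal(size=(m, m)); L = X @ X.T + np.eye(m)   # SPD stand-in for L_t
Y = rng.normal(size=(n, n)); R = Y @ Y.T + np.eye(n)   # SPD stand-in for R_t

A = sym_power(R, -0.25)     # n x n
B = sym_power(L, -0.25)     # m x m

# Property 2: (A kron B) vec(G) = vec(B G A^T), i.e. the two-sided update.
lhs = np.kron(A, B) @ vec(G)
rhs = vec(B @ G @ A.T)

# Property 1: the inverse square root of R^{1/2} kron L^{1/2} is R^{-1/4} kron L^{-1/4}.
big = np.kron(sym_power(R, 0.5), sym_power(L, 0.5))    # (mn) x (mn)
big_inv_sqrt = sym_power(big, -0.5)

# Memory for a 4096 x 4096 layer: one giant matrix vs. two small ones.
full_entries = (4096 * 4096) ** 2
shampoo_entries = 2 * 4096 ** 2

print(np.allclose(lhs, rhs), np.allclose(big_inv_sqrt, np.kron(A, B)))
print(f"{full_entries:.1e} vs {shampoo_entries:.1e}")
```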

What it does geometrically

Write the SVD of the gradient, $G = U Σ V^{⊤}$ . Up to the smoothing from accumulation, $L^{- 1 / 4} (\cdot) R^{- 1 / 4}$ shrinks the directions (singular vectors) where gradients have historically been large and boosts the ones where they've been small. It's a whitening operation on the gradient's row and column spaces.

Shampoo's costs

It works very well — a distributed Shampoo variant won the external tuning track of the 2024 AlgoPerf optimizer benchmark, beating Adam-family baselines by a solid margin. But:

You store $L$ , $R$ , and (in practice) momentum: several extra matrices per layer.
You must compute inverse fourth roots of $L$ and $R$ — an eigendecomposition or iterative root solver. Too expensive every step, so real implementations recompute the roots every ~50–100 steps and reuse them, adding staleness, hyperparameters, and engineering complexity (often the roots are computed in fp64 for numerical stability).
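The bookkeeping described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the distributed implementation: the class name `ShampooSketch`, the `root_every` refresh interval, the `eps` damping, and the toy quadratic objective are all hypothetical choices made for the example, and momentum is omitted.

```python
import numpy as np

def inv_fourth_root(S, eps=1e-6):
    """S^(-1/4) for a symmetric PSD matrix, via eigendecomposition."""
    evals, evecs = np.linalg.eigh(S)
    evals = np.maximum(evals, 0.0) + eps
    return (evecs * evals ** -0.25) @ evecs.T

class ShampooSketch:
    def __init__(self, shape, lr=0.1, root_every=50, eps=1e-6):
        m, n = shape
        self.L = eps * np.eye(m)          # row statistics accumulator
        self.R = eps * np.eye(n)          # column statistics accumulator
        self.L_root = np.eye(m)           # cached L^{-1/4}
        self.R_root = np.eye(n)           # cached R^{-1/4}
        self.lr, self.root_every, self.eps = lr, root_every, eps
        self.t = 0

    def step(self, W, G):
        self.L += G @ G.T
        self.R += G.T @ G
        if self.t % self.root_every == 0:  # the expensive part, done only occasionally
            self.L_root = inv_fourth_root(self.L, self.eps)
            self.R_root = inv_fourth_root(self.R, self.eps)
        self.t += 1
        return W - self.lr * self.L_root @ G @ self.R_root  # may use stale roots

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 6))
W = np.zeros((4, 6))
opt = ShampooSketch(W.shape, lr=0.1, root_every=5)
losses = []
for _ in range(20):
    G = W - target                         # gradient of 0.5 * ||W - target||_F^2
    losses.append(0.5 * np.sum(G ** 2))
    W = opt.step(W, G)

print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Even in this toy, the optimizer carries two accumulators, two cached roots, and a refresh schedule per layer, which is the overhead the next step tries to eliminate.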

This is the state of the world Muon enters: bilateral preconditioning clearly helps; the bookkeeping clearly hurts. What's the smallest thing that keeps the benefit?

Step 6: The simplification — instantaneous Shampoo is just orthogonalization

Here is the pivotal observation. Suppose we strip Shampoo of its history: no accumulation, no epsilon — precondition each step using only the current gradient:

$L = G G^{⊤}, \quad R = G^{⊤} G, \quad \text{update} = (G G^{⊤})^{- 1 / 4} G (G^{⊤} G)^{- 1 / 4} .$

Plug in the SVD $G = U Σ V^{⊤}$ . Then $G G^{⊤} = U Σ^{2} U^{⊤}$ and $G^{⊤} G = V Σ^{2} V^{⊤}$ , so:

$(G G^{⊤})^{- 1 / 4} G (G^{⊤} G)^{- 1 / 4} = (U Σ^{- 1 / 2} U^{⊤}) (U Σ V^{⊤}) (V Σ^{- 1 / 2} V^{⊤}) = U Σ^{- 1 / 2} Σ Σ^{- 1 / 2} V^{⊤} = U V^{⊤} .$

Everything involving $Σ$ cancels. Single-step Shampoo replaces the gradient with $U V^{⊤}$ : the gradient with all its (nonzero) singular values snapped to 1. This is called orthogonalization (or "semi-orthogonalization" for rectangular matrices, or taking the matrix sign): keep the gradient's directions, discard its magnitudes. $U V^{⊤}$ is also exactly the (semi-)orthogonal matrix nearest to $G$ in Frobenius norm.
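The cancellation is easy to confirm numerically. The sketch below uses a random square matrix (an assumption that keeps both Gram matrices invertible, so no epsilon is needed) and compares history-free Shampoo against $U V^{⊤}$ from an explicit SVD; `sym_power` is a helper written for this example.

```python
import numpy as np

def sym_power(S, p):
    """S**p for a symmetric positive-definite matrix S."""
    evals, evecs = np.linalg.eigh(S)
    return (evecs * evals ** p) @ evecs.T

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))            # square, full rank with probability 1

U, S, Vt = np.linalg.svd(G)
polar = U @ Vt                         # the orthogonalized gradient

shampoo_one_step = sym_power(G @ G.T, -0.25) @ G @ sym_power(G.T @ G, -0.25)

max_err = float(np.abs(shampoo_one_step - polar).max())
singular_values = np.linalg.svd(shampoo_one_step, compute_uv=False)

print(f"max difference {max_err:.2e}")
print(np.round(singular_values, 6))
```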

This reframing is powerful because it suggests the accumulators were never the essential ingredient — the essential ingredient was the two-sided whitening, and in its purest form, whitening = orthogonalization. So maybe we can skip the $L$ / $R$ matrices, skip the fourth roots, and just... orthogonalize.

But wait — is throwing away the singular values actually a good idea? Step 7 argues yes, on independent grounds.

Step 7: Why orthogonalize? Two independent justifications

7a. Gradients are spectrally lopsided; rare directions matter

Empirically, gradient matrices (and momentum buffers) of transformer layers are extremely low-rank-ish: a few huge singular values and a long tail of small ones. Vanilla SGD's update is therefore dominated by a handful of directions, while the long tail — which may encode rare but important features (infrequent tokens, unusual patterns) — barely moves the weights.

Setting all singular values to 1 means every direction the gradient identifies gets an equal-sized step. The dominant directions are tempered; the rare ones are amplified. The motivating intuition behind Muon is precisely this boosting of "rare directions" that elementwise optimizers chronically under-serve.

7b. Steepest descent under the right norm

"Gradient descent follows the steepest direction" is only true relative to a norm. The general steepest-descent step solves

$Δ W^{⋆} = \arg\max_{‖ Δ W ‖ \leq 1} ⟨ G, Δ W ⟩,$

and the answer depends on which $‖ \cdot ‖$ you pick:

Euclidean/Frobenius norm → $Δ W^{⋆} \propto G$ (plain SGD).
$ℓ_{\infty}$ -type norms → sign-based updates (Adam-ish, signSGD).
Spectral norm $‖ Δ W ‖_{2}$ (largest singular value) → $Δ W^{⋆} = U V^{⊤}$ . (Sketch: write $⟨ G, Δ W ⟩ = ⟨ Σ, U^{⊤} Δ W V ⟩$ ; with every singular value of $Δ W$ capped at 1, the inner product is maximized by making $U^{⊤} Δ W V = I$ , i.e., $Δ W = U V^{⊤}$ .)
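The spectral-norm case can be sanity-checked by brute force. The sketch below is a numerical illustration rather than a proof: it compares $⟨ G, U V^{⊤} ⟩$ (which equals the sum of singular values of $G$) against random competitors rescaled to spectral norm 1, and against the plain gradient direction rescaled the same way. The matrix size and the number of random trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 6))
U, S, Vt = np.linalg.svd(G, full_matrices=False)

best = U @ Vt                              # claimed maximizer on the spectral-norm ball
best_value = float(np.sum(G * best))       # <G, U V^T>
nuclear_norm = float(S.sum())              # sum of singular values of G

def random_unit_spectral(shape):
    """Random matrix rescaled so that its largest singular value is 1."""
    D = rng.normal(size=shape)
    return D / np.linalg.norm(D, 2)

competitors = [float(np.sum(G * random_unit_spectral(G.shape))) for _ in range(1000)]

# The SGD direction, rescaled onto the same ball, achieves sum(S**2) / S[0].
sgd_value = float(np.sum(G * (G / np.linalg.norm(G, 2))))

print(f"UV^T: {best_value:.4f}  best random: {max(competitors):.4f}  scaled G: {sgd_value:.4f}")
```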

Why is the spectral norm the right norm for a linear layer? Because what we ultimately care about is how much the layer's function changes, not how much its parameter vector moves. A layer computes $y = W x$ , and the spectral norm bounds exactly that: $‖ Δ W x ‖ \leq ‖ Δ W ‖_{2} ‖ x ‖$ . Controlling the spectral norm of the update controls the worst-case change in the layer's behavior. This perspective (developed by Bernstein, Newhouse and collaborators as "modular duality" / the modular norm) makes orthogonalized updates not a hack but steepest descent in the geometry natural to matrix-shaped parameters — and it also explains nice empirical side effects, like learning rates transferring across model widths better than with Adam.

So we have two arrows pointing at the same update: Shampoo's preconditioning logic (Step 6) and norm-aware steepest descent (Step 7). All that remains is computing $U V^{⊤}$ cheaply.

Step 8: Muon

Muon (MomentUm Orthogonalized by Newton–Schulz; Keller Jordan et al., 2024) is exactly the recipe the previous steps assembled, with one efficiency trick. Per 2D weight matrix, per step:

$B_{t} = μ B_{t - 1} + G_{t}$ (momentum, Step 2; Nesterov variant: use $G_{t} + μ B_{t}$ as the direction)

$O_{t} = \mathrm{NewtonSchulz}(B_{t}) \approx U V^{⊤}$ of $B_{t}$ (orthogonalize, Steps 6–7)

$W \leftarrow W - η O_{t}$

That's the whole optimizer. State: one momentum buffer, same as SGD-with-momentum. No $L$ , no $R$ , no eigendecompositions, no stale preconditioners. Note that we orthogonalize the momentum buffer, not the raw gradient — smoothing first, then shape.

The Newton–Schulz trick

Computing $U V^{⊤}$ via an exact SVD every step would be slow and is numerically unfriendly on GPUs in low precision. Newton–Schulz iteration computes it with only matrix multiplications — the one operation GPUs are unreasonably good at.

The idea: find an odd polynomial $p (x) = a x + b x^{3} + c x^{5}$ that, when applied repeatedly, pushes any value in $(0, 1]$ toward $1$ . Applying the matrix version

$X \leftarrow a X + b (X X^{⊤}) X + c (X X^{⊤})^{2} X$

acts independently on each singular value of $X$ (the singular vectors are untouched — that's the magic of odd polynomials in $X$ ). So iterate enough times and all singular values converge toward 1 while $U$ and $V$ stay fixed: the output approaches $U V^{⊤}$ .

Concretely, Muon:

Normalizes: $X_{0} = B / ‖ B ‖_{F}$ (guaranteeing all singular values are in $[0, 1]$ so the iteration is in its basin of convergence);
Runs 5 iterations of the quintic above with coefficients tuned to $(a, b, c) = (3.4445, - 4.7750, 2.0315)$ ;
Runs the whole thing in bfloat16 — it's stable enough, unlike Shampoo's root computations which often want fp64.

The coefficients were chosen to maximize how fast tiny singular values get inflated toward 1 (steep slope at 0), at the price of not converging exactly — singular values land in roughly $[0.7, 1.3]$ rather than at exactly $1$ . Empirically this sloppiness doesn't hurt at all; the update only needs to be approximately orthogonal. The overhead is a handful of matmuls per layer per step — typically well under 1% of total training FLOPs for a transformer, and the iteration parallelizes/distributes easily.
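Because the iteration acts on each singular value through the same scalar polynomial, its behavior can be explored without any matrices at all. The sketch below iterates $p (x) = a x + b x^{3} + c x^{5}$ five times from a few starting values; the starting values are arbitrary picks for illustration.

```python
def ns_scalar(x, a=3.4445, b=-4.7750, c=2.0315):
    """One application of the quintic to a single singular value."""
    return a * x + b * x**3 + c * x**5

starts = [0.01, 0.05, 0.1, 0.3, 0.6, 1.0]
finals = []
for x in starts:
    for _ in range(5):
        x = ns_scalar(x)
    finals.append(x)

for s, f in zip(starts, finals):
    print(f"{s:5.2f} -> {f:.3f}")

# Near zero the map is approximately x -> a * x, so each round multiplies a tiny
# singular value by about 3.44; values far below 0.01 need more than 5 rounds.
tiny = 1e-4
for _ in range(5):
    tiny = ns_scalar(tiny)
print(f"1e-4 -> {tiny:.3f}")
```

The outputs do not settle at exactly 1; they bounce around inside a band near it, which is the deliberate trade of exactness for speed described above.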

Two practical details that matter

Muon is for hidden weight matrices only. Embedding tables, the output/LM head, and all 1D parameters (biases, LayerNorm/RMSNorm gains) are not optimized with Muon; they're handed to AdamW. The spectral-norm story is about matrices that act as linear maps between activation spaces; embeddings and the unembedding are really lookup tables / per-token vectors with different geometry, and orthogonalizing them empirically hurts. Convolution kernels can be handled by flattening their trailing dimensions so that each kernel becomes a 2D matrix. So in practice "training with Muon" means Muon for the hidden matrices + AdamW for the rest.
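One way to express that split is a small routing function over parameter names and shapes. Everything here is a hypothetical sketch: the parameter names, the shapes, and the keyword rule (`"embed"`, `"lm_head"`) are made up for illustration, and real codebases choose their own naming and grouping.

```python
def split_params(named_shapes, adamw_keywords=("embed", "lm_head")):
    """Route parameters: hidden matrices to Muon, everything else to AdamW."""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        is_matrix = len(shape) >= 2                       # 1D params never go to Muon
        is_special = any(k in name for k in adamw_keywords)
        (muon if is_matrix and not is_special else adamw).append(name)
    return muon, adamw

# Hypothetical parameter names and shapes for a small transformer.
shapes = {
    "embed.weight": (50257, 768),
    "blocks.0.attn.qkv.weight": (2304, 768),
    "blocks.0.mlp.fc.weight": (3072, 768),
    "blocks.0.mlp.fc.bias": (3072,),
    "blocks.0.norm.weight": (768,),
    "lm_head.weight": (50257, 768),
}

muon_names, adamw_names = split_params(shapes)
print("Muon: ", muon_names)
print("AdamW:", adamw_names)
```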

Shape-aware scaling. $U V^{⊤}$ has a fixed spectral scale (every singular value ≈ 1), but its per-entry size does not: the update's RMS entry size is about $1 / \sqrt{\max (m, n)}$ , which varies with the layer's shape. To make one learning rate work across differently-shaped layers, implementations rescale the orthogonalized update. The original Muon multiplies by $\sqrt{\max (1, m / n)}$ , while Moonshot AI's large-scale variant multiplies by $0.2 \sqrt{\max (m, n)}$ so the update RMS matches AdamW's typical scale (letting you reuse AdamW learning rates and weight-decay settings). Either way, the point is the same: consistent update magnitude across shapes, so $η$ transfers.
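The RMS claim and the effect of the two scale factors can be checked on exactly semi-orthogonal matrices. This sketch builds them with a QR factorization (an assumption made for convenience; Muon itself produces only approximately orthogonal updates) and uses three arbitrary shapes.

```python
import numpy as np

def semi_orthogonal(m, n, rng):
    """Random m x n matrix whose nonzero singular values are all exactly 1."""
    Q, _ = np.linalg.qr(rng.normal(size=(max(m, n), min(m, n))))
    return Q if m >= n else Q.T

def rms(M):
    return float(np.sqrt(np.mean(M ** 2)))

rng = np.random.default_rng(0)
rows = []
for m, n in [(64, 64), (256, 64), (64, 256)]:
    O = semi_orthogonal(m, n, rng)
    original_scale = max(1.0, m / n) ** 0.5        # original Muon factor
    moonshot_scale = 0.2 * max(m, n) ** 0.5        # Moonshot-style factor
    rows.append((m, n, rms(O), rms(original_scale * O), rms(moonshot_scale * O)))

for m, n, raw, orig, moon in rows:
    print(f"{m:4d} x {n:<4d} raw {raw:.4f}  original {orig:.4f}  moonshot {moon:.4f}")
```

In this check the $0.2 \sqrt{\max (m, n)}$ factor pins the RMS at 0.2 for every shape, while the $\sqrt{\max (1, m / n)}$ factor only changes tall matrices.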

Reference implementation (simplified)

import torch

def newton_schulz5(B, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = B.bfloat16()
    transposed = X.size(0) > X.size(1)
    if transposed:               # work with the wide orientation
        X = X.T
    X = X / (X.norm() + eps)     # singular values into [0, 1]
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return (X.T if transposed else X).to(B.dtype)

@torch.no_grad()
def muon_step(W, G, buf, lr=0.02, momentum=0.95, nesterov=True):
    buf.mul_(momentum).add_(G)                 # B = mu*B + G
    upd = G.add(buf, alpha=momentum) if nesterov else buf
    O = newton_schulz5(upd)
    O *= max(1.0, W.size(0) / W.size(1)) ** 0.5   # shape-aware scaling
    W.add_(O, alpha=-lr)

(Production versions add weight decay, distribute the Newton–Schulz computation across GPUs, and pair this with AdamW for non-matrix parameters.)

Does it work?

Muon's debut was setting speed records on the NanoGPT-speedrun benchmark (~1.35× faster than tuned AdamW to equal validation loss), and it has since been validated at much larger scale — Moonshot AI's Moonlight (a 3B/16B-parameter MoE trained on 5.7T tokens) reported roughly 2× computational efficiency versus AdamW with the scaled, weight-decayed variant, and Muon-trained models like Kimi K2 followed. Beyond raw speed, reported side benefits include lower memory than Adam (one buffer vs. two), better learning-rate transfer across model sizes, and updates with controlled spectral norm.

Step 9: The whole picture in one table

Optimizer	State per layer	Update rule	Expensive op	Key idea
SGD+momentum	$B$	$B$	—	smooth the gradient
Adam	$m, v$	$m / \sqrt{v}$	—	per-coordinate scaling (diagonal preconditioner)
Full AdaGrad	$H$ ( $d \times d$ )	$H^{- 1 / 2} g$	$d \times d$ root	whiten in full parameter space (intractable)
Shampoo	$L, R$ (+ momentum)	$L^{- 1 / 4} G R^{- 1 / 4}$	matrix 4th roots (amortized)	Kronecker-factored whitening
Muon	$B$	$\mathrm{NS}_{5} (B) \approx U V^{⊤}$	5 rounds of matmuls	orthogonalize the momentum

And the conceptual chain, compressed:

AdaGrad wants $H^{- 1 / 2} g$ → Shampoo factorizes it into two-sided whitening $L^{- 1 / 4} G R^{- 1 / 4}$ → with no accumulation that collapses to $U V^{⊤}$ → which is independently justified as steepest descent under the spectral norm → and $U V^{⊤}$ can be approximated with five rounds of bf16 matmuls → apply that to a momentum buffer and you have Muon.

Caveats and open edges

Not a drop-in for everything. Muon governs hidden matrices; you still run AdamW on embeddings, the output head, and 1D params, so you're tuning two optimizers (in practice the AdamW part is forgiving).
Hyperparameters differ from Adam. Typical original-flavor settings: $η \approx 0.02$ , $μ = 0.95$ , Nesterov on. The Moonshot-scaled variant intentionally matches AdamW's scale so you can reuse those settings instead. Weight decay turns out to matter at large scale (without it, weight/activation magnitudes can grow too much over long runs).
Shampoo isn't obsolete. Accumulated $L, R$ carry curvature history that pure orthogonalization discards; Shampoo-family methods (e.g., SOAP, which runs Adam in Shampoo's eigenbasis) remain strong, especially when progress is measured per step rather than per unit of wall-clock time. Muon's bet is that the simpler update wins on wall-clock and memory. That bet has paid off in LLM pretraining so far, but the design space between "full whitening with history" and "memoryless orthogonalization" is still being explored.

Opus 4.8 (High)

From Gradient Descent to Muon (by way of Shampoo)

An incremental tutorial. Each section adds exactly one idea to the previous one. By the end you should be able to look at the three-line Muon update and see why every line is there.

The punchline, stated up front so you know where we're going:

Muon is Shampoo with the bookkeeping thrown away. Shampoo preconditions a matrix gradient using accumulated statistics. If you strip out the accumulation and apply the same idea to a single matrix (the momentum), the math collapses into one operation: orthogonalize the matrix — set all its singular values to 1. Muon does exactly that, using a cheap GPU-friendly iteration instead of an SVD.

We'll build up to that claim, then prove it in two lines of linear algebra.

0. Notation and the one reframing that matters

A neural net is a big pile of parameters $θ$ , a loss $L (θ)$ , and a gradient $g = \nabla L (θ)$ . Almost every optimizer you know treats $θ$ as one long flat vector.

The single most important idea in this whole tutorial is to stop flattening. The weights of a linear layer are naturally a matrix $W \in ℝ^{m \times n}$ (mapping an $n$ -dimensional input to an $m$ -dimensional output). Its gradient is therefore also a matrix:

$G = \frac{\partial L}{\partial W} \in ℝ^{m \times n} .$

SGD and Adam throw away this 2-D structure and process the $m n$ numbers as an unordered bag of scalars. Shampoo and Muon keep the matrix shape and exploit it. That is the entire conceptual fork in the road.

Throughout, $η$ is the learning rate and the SVD of a matrix is written $G = U Σ V^{⊤}$ , where $U$ and $V$ have orthonormal columns and $Σ$ is diagonal with the singular values $σ_{i} \geq 0$ .

1. SGD: the baseline

$θ_{t + 1} = θ_{t} - η g_{t} .$

Step downhill, proportional to the gradient. Simple, but it struggles when different directions have very different curvature: it bounces along steep directions while crawling along shallow ones. Every fix below is, in some sense, an attempt to rescale the step direction-by-direction.

2. Momentum: remember where you were going

Raw gradients are noisy and oscillate. Average them over time with an exponential moving average ("velocity"):

$m_{t} = μ m_{t - 1} + g_{t}, θ_{t + 1} = θ_{t} - η m_{t},$

with $μ \approx 0.9$ – $0.95$ . Momentum smooths the trajectory and accelerates along consistent directions. Hold on to this $m_{t}$ — it survives all the way into Muon, where it becomes the thing we orthogonalize.

3. Adam: a diagonal preconditioner

Adam gives every coordinate its own adaptive step size by tracking the running magnitude of each coordinate's gradient:

$m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) g_{t}, \qquad v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) g_{t}^{2},$

$θ_{t + 1} = θ_{t} - η \frac{{\hat{m}}_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ}$

(with bias correction ${\hat{m}}_{t}, {\hat{v}}_{t}$ ). The key structural fact: dividing by $\sqrt{v_{t}}$ coordinate-wise is the same as multiplying the gradient by a diagonal matrix $P = diag (1 / \sqrt{v_{t}})$ :

$θ_{t + 1} = θ_{t} - η P {\hat{m}}_{t} .$

This $P$ is called a preconditioner. Adam's preconditioner is diagonal: it can stretch or shrink each axis independently, but it knows nothing about how coordinates relate to one another. For a weight matrix, that means it ignores all the row/column structure — it's still treating $W$ as a bag of scalars.

That limitation is the opening for everything that follows.

4. The general preconditioning idea (and why the honest version is impossible)

Adam's diagonal $P$ is a crude approximation to something better. What's the "ideal" preconditioner?

Newton's method says step with the inverse Hessian: $Δ = - H^{- 1} g$ . This corrects for curvature across coordinates, not just per-coordinate. Full-matrix AdaGrad is a first-order cousin that doesn't need the Hessian. It accumulates the outer products of gradients and preconditions by the inverse square root:

${\hat{G}}_{t} = \sum_{s \leq t} g_{s} g_{s}^{⊤} \in ℝ^{d \times d}, \qquad θ_{t + 1} = θ_{t} - η {\hat{G}}_{t}^{- 1 / 2} g_{t} .$

This is a genuinely good (dense) preconditioner — it captures correlations between all coordinates.

It is also completely infeasible. If a single weight matrix has $d = m n =$ a few million entries, then ${\hat{G}}_{t}$ is a few-million-by-few-million matrix, and we'd need its inverse square root every step. No.

So the entire game becomes: approximate the full preconditioner ${\hat{G}}^{- 1 / 2}$ with something cheap that still respects matrix structure. Adam's answer ("just keep the diagonal") is one extreme. Shampoo's answer is much smarter.

5. Shampoo: a structured (Kronecker) preconditioner

Shampoo's idea: instead of one giant $m n \times m n$ preconditioner, keep two small ones — one for the row space, one for the column space of the matrix.

For a weight matrix $W \in ℝ^{m \times n}$ with matrix gradient $G_{t}$ :

$L_{t} = \sum_{s \leq t} G_{s} G_{s}^{⊤} \in ℝ^{m \times m}$ (left / output-side)

$R_{t} = \sum_{s \leq t} G_{s}^{⊤} G_{s} \in ℝ^{n \times n}$ (right / input-side)

and the update applies one to each side of the gradient:

$W_{t + 1} = W_{t} - η L_{t}^{- 1 / 4} G_{t} R_{t}^{- 1 / 4}$

Two things to notice.

(a) It's cheap-ish. We store and invert an $m \times m$ and an $n \times n$ matrix instead of an $m n \times m n$ one. For a $4096 \times 4096$ layer that's two $4096 \times 4096$ matrices instead of one matrix with roughly 16.8 million rows and columns. This works because Shampoo implicitly approximates the full preconditioner $\hat{G}$ by a Kronecker product of matrix square roots, $\hat{G} \approx R^{1 / 2} \otimes L^{1 / 2}$ (the factor order depends on how $G$ is flattened). That is the structured assumption that row-correlations and column-correlations factorize.

(b) Why the strange exponent $- 1 / 4$ ? Each of $L$ and $R$ is built from products like $G G^{⊤}$ , so each already contains the gradient "squared." A power of $- 1 / 2$ on each side would over-correct. The $- 1 / 4$ is exactly the exponent that makes the two half-sided corrections compose into one proper $- 1 / 2$ -power preconditioning of the full gradient. The next section makes this concrete — and in doing so, hands us Muon.

6. The bridge: one step of Shampoo = orthogonalization

Here is the two-line calculation that the whole tutorial is built around. Take Shampoo with no accumulation — just the current gradient, $L = G G^{⊤}$ and $R = G^{⊤} G$ — and plug in the SVD $G = U Σ V^{⊤}$ .

The eigendecompositions fall right out:

$L = G G^{⊤} = U Σ^{2} U^{⊤} \Rightarrow L^{- 1 / 4} = U Σ^{- 1 / 2} U^{⊤},$

$R = G^{⊤} G = V Σ^{2} V^{⊤} \Rightarrow R^{- 1 / 4} = V Σ^{- 1 / 2} V^{⊤} .$

Now sandwich the gradient (watch the orthonormal factors cancel):

$L^{- 1 / 4} G R^{- 1 / 4} = \underbrace{U Σ^{- 1 / 2} U^{⊤}}_{L^{- 1 / 4}} \, \underbrace{U Σ V^{⊤}}_{G} \, \underbrace{V Σ^{- 1 / 2} V^{⊤}}_{R^{- 1 / 4}} = U Σ^{- 1 / 2} Σ Σ^{- 1 / 2} V^{⊤} = U Σ^{0} V^{⊤} = U V^{⊤} .$

The singular values completely cancel. What's left, $U V^{⊤}$ , is the gradient with every singular value reset to 1.

(Aside: this also shows why the exponent had to be $- 1 / 4$ . Use $- 1 / 2$ instead and you'd get $U Σ^{- 1} V^{⊤} = (G^{+})^{⊤}$ , the transpose of the pseudoinverse, which inverts the singular values and is far too aggressive. The $- 1 / 4$ is precisely tuned to equalize the singular values rather than invert them.)
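The exponent comparison can be verified directly. In the sketch below, $G$ is built from chosen singular values (3, 2, 1, 0.5, an arbitrary well-conditioned choice), and the two-sided product is evaluated with exponents $- 1 / 4$ and $- 1 / 2$. Note that the $- 1 / 2$ result has the same shape as $G$, so it matches the pseudoinverse only after a transpose. The helpers `sym_power` and `two_sided` are written for this example.

```python
import numpy as np

def sym_power(S, p):
    """S**p for a symmetric positive-definite matrix S."""
    evals, evecs = np.linalg.eigh(S)
    return (evecs * evals ** p) @ evecs.T

def two_sided(G, p):
    """(G G^T)^p  G  (G^T G)^p"""
    return sym_power(G @ G.T, p) @ G @ sym_power(G.T @ G, p)

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))     # random orthogonal factors
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
sigma = np.array([3.0, 2.0, 1.0, 0.5])
G = U @ np.diag(sigma) @ V.T

quarter = two_sided(G, -0.25)    # singular values all become 1
half = two_sided(G, -0.5)        # singular values become 1 / sigma

sv_quarter = np.linalg.svd(quarter, compute_uv=False)
sv_half = np.linalg.svd(half, compute_uv=False)
print(np.round(sv_quarter, 6), np.round(sv_half, 6))
```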

What is $U V^{⊤}$ ? It's the orthogonal polar factor of $G$ : the closest semi-orthogonal matrix to $G$ in Frobenius distance. Geometrically:

Ordinary gradient descent moves a lot along high-singular-value directions and barely at all along low ones — unbalanced.
Replacing $G$ by $U V^{⊤}$ keeps the directions ( $U$ and $V$ ) but flattens the magnitudes to be equal. Every direction gets an equally sized step.

That rebalancing is the entire benefit of preconditioning, distilled into one clean operation. And it costs no accumulated statistics at all — just the SVD of the current matrix.

So: one-step Shampoo = orthogonalize the gradient. Muon is what you get by orthogonalizing the momentum instead, and computing it without an SVD.

7. Muon = momentum + orthogonalization

The name says it: MomentUm Orthogonalized by Newton–Schulz.

$M_{t} = μ M_{t - 1} + G_{t}$ (momentum, from §2)

$O_{t} = \mathrm{orthogonalize}(M_{t}) \approx U_{M} V_{M}^{⊤}$ (the §6 operation)

$W_{t + 1} = W_{t} - η O_{t}$ (step)

Read against everything above, Muon is: take the smoothed gradient (momentum), apply the single-step Shampoo preconditioner to it (which equals orthogonalization), and step. It drops Shampoo's running $L_{t}, R_{t}$ statistics entirely — it re-derives the preconditioner fresh from the current momentum each step.

Only one piece is missing: computing $U V^{⊤}$ without paying for an SVD. That's the "Newton-Schulz" part.

8. Computing the orthogonalization cheaply: Newton–Schulz

An SVD gives $U V^{⊤}$ exactly, but SVDs parallelize poorly on GPUs and are unstable in low precision (bf16). We don't need it exact — we just need all singular values pushed close to 1. A matrix polynomial iteration does this beautifully and runs entirely in fast matmuls.

The Newton–Schulz quintic iteration:

$X_{k + 1} = a X_{k} + b (X_{k} X_{k}^{⊤}) X_{k} + c (X_{k} X_{k}^{⊤})^{2} X_{k} .$

Why this works: every term is an odd polynomial in $X$ , so it acts on each singular value independently through the same scalar map

$σ \mapsto a σ + b σ^{3} + c σ^{5} .$

Choose the coefficients $(a, b, c)$ so that iterating this scalar map drives every $σ$ in $(0, 1]$ toward $\approx 1$ . Then iterating the matrix version drives $X$ toward $U I V^{⊤} = U V^{⊤}$ , exactly the orthogonalization we want. The directions $U, V$ never change (only the singular values move) because each iteration only ever multiplies $X$ on the left by a polynomial in $X X^{⊤}$ .

Two practical points:

Normalize first. Divide the input by its norm (Frobenius norm is a safe choice, since it bounds the spectral norm) so all singular values start $\leq 1$ , inside the iteration's basin of attraction.
A handful of steps in bf16 is enough. A well-known tuned coefficient set is roughly $(a, b, c) \approx (3.4445, - 4.7750, 2.0315)$ with about 5 iterations. These are deliberately aggressive: they don't converge the singular values to exactly 1 but to a band around it. That's fine — approximate orthogonalization is all Muon needs, and trading exactness for speed in low precision is the whole point.
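The two claims above (singular vectors are untouched; singular values land in a band around 1) can be checked in float64 NumPy, away from any bf16 effects. The function name `newton_schulz_np` and the test matrix, built from known factors with singular values 4, 2, 1, 0.2, are assumptions made for this sketch.

```python
import numpy as np

def newton_schulz_np(M, steps=5, coeffs=(3.4445, -4.7750, 2.0315), eps=1e-7):
    """Quintic Newton-Schulz iteration on a wide matrix, in float64."""
    a, b, c = coeffs
    X = M / (np.linalg.norm(M) + eps)          # Frobenius norm: singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # 4 x 4 orthogonal
V, _ = np.linalg.qr(rng.normal(size=(8, 4)))   # 8 x 4 with orthonormal columns
sigma = np.array([4.0, 2.0, 1.0, 0.2])
M = U @ np.diag(sigma) @ V.T                   # 4 x 8 with known SVD

O = newton_schulz_np(M)

# Express the output in the original singular bases: it should be diagonal.
core = U.T @ O @ V
new_sigma = np.diag(core)
off_diagonal = float(np.abs(core - np.diag(new_sigma)).max())

print(np.round(new_sigma, 3), f"off-diagonal {off_diagonal:.1e}")
```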

9. Why orthogonalize, really? The spectral-norm view

There's a second, independent justification for Muon that doesn't go through Shampoo at all, and it explains why equalizing singular values is the right thing for training stability.

Steepest descent depends on which norm you use to measure "step size." Formally, the steepest-descent direction is

$Δ^{⋆} = \arg\max_{‖ Δ ‖ \leq 1} ⟨ - G, Δ ⟩ .$

Measure step size by the Frobenius norm and the answer is $Δ^{⋆} \propto - G$ — that's ordinary gradient descent.
Measure it by the spectral norm (the largest singular value) and the answer is $Δ^{⋆} = - U V^{⊤}$ — that's Muon.

(The second follows because the dual of the spectral norm is the nuclear norm, and $⟨ G, Δ ⟩$ is maximized over the spectral-norm ball precisely at the polar factor $U V^{⊤}$ .)

So Muon is steepest descent under the spectral norm. Why care? The spectral norm of a weight update controls how much the layer's output can change in the worst-case input direction. Bounding it keeps any single feature direction from getting blown out in one step, which tends to give more stable training and more balanced feature learning across directions — the same circle of ideas as spectral/μP-style scaling. Adam, by contrast, controls a per-coordinate (roughly $ℓ_{\infty}$ -flavored) quantity and lets the spectral norm drift.
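A small numerical illustration of the "worst-case output change" statement, using a random matrix as a stand-in for a real gradient. It compares how much the layer output moves for unit inputs along the strongest and weakest singular directions, for an orthogonalized update versus the raw gradient rescaled to the same spectral norm.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(32, 64))                  # placeholder gradient
U, S, Vt = np.linalg.svd(G, full_matrices=False)

delta_muon = U @ Vt                            # orthogonalized update, spectral norm 1
delta_sgd = G / S[0]                           # raw gradient rescaled to spectral norm 1

def output_change(delta, x):
    """Size of the change in y = W x caused by the update delta."""
    return float(np.linalg.norm(delta @ x))

x_strong, x_weak = Vt[0], Vt[-1]               # unit inputs: top and bottom directions
muon_changes = (output_change(delta_muon, x_strong), output_change(delta_muon, x_weak))
sgd_changes = (output_change(delta_sgd, x_strong), output_change(delta_sgd, x_weak))

# No unit input can move the output by more than the update's spectral norm.
x_rand = rng.normal(size=64)
x_rand /= np.linalg.norm(x_rand)
rand_change = output_change(delta_muon, x_rand)

print(muon_changes, sgd_changes, rand_change)
```

With the same worst-case bound, the orthogonalized update moves the output equally along both directions, while the rescaled gradient barely moves it along the weak one.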

Two routes, one destination: the Shampoo route says "orthogonalization is the cheap preconditioner," and the geometry route says "orthogonalization is the right-shaped step." They agree on $U V^{⊤}$ .

10. The practical recipe

Muon is not a drop-in replacement for every parameter — it's specifically a matrix optimizer. The standard setup is a hybrid:

Use Muon on the 2-D hidden weight matrices (the linear layers inside transformer blocks, attention projections, MLPs). This is where the matrix structure and the spectral-norm argument apply.
Use AdamW on everything else: token embeddings, the final unembedding / classifier head, biases, and LayerNorm/RMSNorm gains. These are 1-D or have semantics (e.g. per-token rows in an embedding table) where "orthogonalize the matrix" isn't the right operation. The input/output embeddings in particular behave better on Adam.
Conv filters: reshape the 4-D weight to 2-D before orthogonalizing.

Update-magnitude scaling matters. An orthogonalized update has all singular values $\approx 1$ , so its size is fixed by the matrix shape, not by the gradient magnitude. A semi-orthogonal $m \times n$ matrix has per-entry RMS of $1 / \sqrt{max (m, n)}$ . To make Muon's effective step size comparable to AdamW's (so you can reuse your learning-rate intuitions), implementations scale the update — a commonly cited choice is multiplying by about $0.2 \sqrt{max (m, n)}$ . Treat the exact constant as implementation-dependent and tune it.

Other defaults that work well in practice: momentum $μ \approx 0.95$ (often the Nesterov variant), decoupled AdamW-style weight decay (important at larger scale), and the cheap ~5-step Newton–Schulz above. The orthogonalization cost is negligible next to the layer's own matmuls, and it shards cleanly across devices.

11. The whole story in one ladder

Step	What it adds	Preconditioner on a matrix gradient $G$
SGD	—	identity
+ Momentum	temporal averaging	identity (applied to $m_{t}$ )
+ Adam	per-coordinate adaptive scale	diagonal $diag (1 / \sqrt{v})$
(ideal) Full AdaGrad	all cross-coordinate correlations	dense ${\hat{G}}^{- 1 / 2}$ — infeasible
+ Shampoo	Kronecker-structured approximation	$L^{- 1 / 4} (\cdot) R^{- 1 / 4}$ , $L, R$ accumulated
+ Muon	drop accumulation, cheap solve	$L^{- 1 / 4} (\cdot) R^{- 1 / 4}$ on one matrix $=$ orthogonalize, via Newton–Schulz

Every rung answers the same two questions slightly better: how much of the gradient's matrix structure do we exploit, and how cheaply can we exploit it? Adam exploits the diagonal. Shampoo exploits the Kronecker structure but pays to accumulate it. Muon notices that for a single matrix the expensive accumulation collapses into a parameter-free operation — orthogonalization — and then makes that operation cheap.

Appendix: minimal pseudocode

import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximate the orthogonal polar factor U V^T of G in bf16."""
    a, b, c = 3.4445, -4.7750, 2.0315          # tuned quintic coefficients
    X = G.bfloat16()
    X = X / (X.norm() + eps)                    # normalize: singular values <= 1
    transpose = X.size(0) > X.size(1)           # iterate on the smaller dimension
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T                             # acts on each sigma via sigma -> a*s + b*s^3 + c*s^5
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transpose:
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()                                # W is updated in place, outside autograd
def muon_step(W, G, M, lr, momentum=0.95):
    """One Muon update for a 2-D weight matrix W with gradient G and momentum buffer M."""
    M.mul_(momentum).add_(G)                    # 1. momentum:        M = mu*M + G
    O = newton_schulz(M)                        # 2. orthogonalize:   O ~ U V^T
    scale = 0.2 * (max(W.shape) ** 0.5)         # 3. match AdamW-ish update RMS (tune this)
    W.add_(O, alpha=-lr * scale)                #    step:            W -= lr * scale * O
    return M

The three numbered lines in muon_step are the three lines from §7. Everything in this tutorial exists to explain why line 2 is newton_schulz and not something more complicated — because for a single matrix, the smartest known preconditioner (Shampoo) reduces to exactly that.

A note on what I'm confident about vs. not

The linear algebra (the §6 cancellation, the §9 spectral-norm derivation, the role of the $1 / 4$ and $1 / 2$ exponents) is exact and is the durable core of the tutorial. The specific hyperparameters — the Newton–Schulz coefficients, step count, and the $0.2 \sqrt{max (m, n)}$ scaling — are tuned engineering choices that vary across implementations and that people are still refining; use them as sensible starting points, not gospel. If you're implementing this for a real run, it's worth checking the current reference implementation for the latest recommended constants.

sankalp's blog