sankalp's blog

From SGD to Muon: An Incremental Tutorial (Fable-5 vs Opus 4.8)

Fable-5 High

You can find Opus 4.8 output after this.

From SGD to Muon: An Incremental Tutorial

This tutorial builds up to the Muon optimizer one idea at a time. Each section adds exactly one concept on top of the previous one, so by the time we reach Muon, every design decision should feel inevitable rather than mysterious.

The path we'll take:

  1. Gradient descent, and weights as matrices
  2. Momentum
  3. Preconditioning (why Adam is a "diagonal" preconditioner)
  4. Full-matrix AdaGrad — the ideal we can't afford
  5. Shampoo — Kronecker-factored preconditioning
  6. A surprising simplification: instantaneous Shampoo = orthogonalization
  7. Why orthogonalizing the update is a good idea on its own
  8. Muon — momentum + cheap orthogonalization via Newton–Schulz
  9. Practical details, code, and what Muon doesn't handle

Notation: a weight matrix is Wm×n (e.g., a linear layer mapping n inputs to m outputs), its gradient is G=W, also m×n. Learning rate is η.


Step 1: Gradient descent, and taking matrices seriously

Plain SGD updates every parameter the same way:

WWηG

Notice something most optimizers ignore: in a neural network, parameters aren't an unstructured bag of numbers. The vast majority live in 2D weight matrices, where rows and columns have meaning — column j of W touches input feature j; row i produces output feature i.

SGD, Adam, RMSProp, etc. all treat W as a flat vector of mn independent scalars. The central theme of this tutorial is: what do we gain by treating G as a matrix? Shampoo and Muon are two answers to that question.


Step 2: Momentum

Stochastic gradients are noisy. Momentum smooths them with an exponential moving average:

Bt=μBt1+Gt,WWηBt

with μ0.90.95. Two ways to think about it:

A common refinement is Nesterov momentum, which uses a "lookahead" version of the buffer — effectively replacing the update direction Bt with Gt+μBt. It tends to help slightly and Muon uses it by default.

Keep momentum in your pocket. It's orthogonal (pun intended) to everything that follows, and Muon will be exactly "momentum + one extra step."


Step 3: Preconditioning — why one learning rate isn't enough

In real loss landscapes, curvature differs wildly across directions: the loss might be a steep ravine in one direction and nearly flat in another. A single learning rate η must be small enough for the steepest direction, which makes progress in flat directions painfully slow.

The fix is to multiply the gradient by a matrix P (the preconditioner) that rescales different directions differently:

wwηPg(thinking of everything as flat vectors for now)

The gold standard is Newton's method, P=H1 (inverse Hessian), which makes the landscape look spherical. For a network with d parameters, H is d×d — utterly intractable when d is billions.

Adam is a preconditioner too, just a very restricted one. Its second-moment estimate v gives the update

wwηmv+ϵ

which is P=diag(v)1/2 applied to the momentum m: a diagonal preconditioner. Each parameter gets its own learning rate, but Adam cannot rotate or mix directions — it can only stretch along the coordinate axes. If the ravine in your loss landscape is diagonal to the axes, Adam can't see it.

So the design space looks like:

Preconditioner P Cost Power
I (SGD) free none
diagonal (Adam) O(d) per-coordinate scaling
full matrix O(d2)+ arbitrary rotation + scaling

Shampoo lives in the unexplored middle.


Step 4: Full-matrix AdaGrad — the ideal we can't afford

AdaGrad (2011) proposed a principled data-driven preconditioner. Accumulate the outer products of all gradients seen so far,

Ht=s=1tgsgsd×d,

and update with its inverse square root:

wwηHt1/2gt.

Intuition: Ht measures how much gradient "energy" has flowed in each direction. Directions with consistently large gradients get shrunk; directions with little accumulated signal get amplified. (The familiar AdaGrad — and by extension Adam's v — is just the diagonal of this.)

Full-matrix AdaGrad has strong theory behind it, but storing Ht (d×d) and computing Ht1/2 is hopeless at scale. For a single 4096×4096 layer, d=16.7M and H would have ~2.8×1014 entries.

The question becomes: can we approximate H1/2g without ever forming H?


Step 5: Shampoo — exploit the matrix structure

Shampoo (Gupta, Koren & Singer, 2018) answers yes, if we stop flattening. Treat the gradient of each layer as the matrix Gtm×n it actually is, and maintain two small accumulators instead of one giant one:

Lt=Lt1+GtGt(m×m, "row" statistics) Rt=Rt1+GtGt(n×n, "column" statistics)

The update preconditions the gradient from both sides:

WWηLt1/4GtRt1/4

Where does this come from?

It's full-matrix AdaGrad with one structural assumption. The claim is that the giant mn×mn matrix Ht is well approximated by a Kronecker product of the two small ones:

HtLt1/2Rt1/2.

Kronecker products have two lovely properties:

  1. (AB)p=ApBp, so the inverse square root we need is L1/4R1/4 — and there are your funny-looking 1/4 exponents: each side contributes half of the overall 1/2 power.
  2. Multiplying a Kronecker product into a flattened matrix is the same as multiplying the matrix from both sides: (AB)vec(G)=vec(BGA) — which is how the giant matrix–vector product collapses into the cheap two-sided update L1/4GR1/4.

So Shampoo = full-matrix AdaGrad, compressed through the assumption that row-space curvature and column-space curvature factorize. The memory drops from O(m2n2) to O(m2+n2) — for the 4096×4096 layer, from 2.8×1014 entries to 3.4×107. (The name is a joke: it's a preconditioner you apply on both sides, like shampoo... and conditioner.)

What it does geometrically

Write the SVD of the gradient, G=UΣV. Up to the smoothing from accumulation, L1/4(·)R1/4 shrinks the directions (singular vectors) where gradients have historically been large and boosts the ones where they've been small. It's a whitening operation on the gradient's row and column spaces.

Shampoo's costs

It works very well — a distributed Shampoo variant won the external tuning track of the 2024 AlgoPerf optimizer benchmark, beating Adam-family baselines by a solid margin. But:

This is the state of the world Muon enters: bilateral preconditioning clearly helps; the bookkeeping clearly hurts. What's the smallest thing that keeps the benefit?


Step 6: The simplification — instantaneous Shampoo is just orthogonalization

Here is the pivotal observation. Suppose we strip Shampoo of its history: no accumulation, no epsilon — precondition each step using only the current gradient:

L=GG,R=GG,update=(GG)1/4G(GG)1/4.

Plug in the SVD G=UΣV. Then GG=UΣ2U and GG=VΣ2V, so:

$(GG)1/4G(GG)1/4=(UΣ1/2U)(UΣV)(VΣ1/2V)=UΣ1/2ΣΣ1/2V=UV.$

Everything involving Σ cancels. Single-step Shampoo replaces the gradient with UV — the gradient with all its singular values snapped to 1. This is called orthogonalization (or "semi-orthogonalization" for rectangular matrices, or taking the matrix sign): keep the gradient's directions, discard its magnitudes. It is also exactly the orthogonal matrix nearest to G.

This reframing is powerful because it suggests the accumulators were never the essential ingredient — the essential ingredient was the two-sided whitening, and in its purest form, whitening = orthogonalization. So maybe we can skip the L/R matrices, skip the fourth roots, and just... orthogonalize.

But wait — is throwing away the singular values actually a good idea? Step 7 argues yes, on independent grounds.


Step 7: Why orthogonalize? Two independent justifications

7a. Gradients are spectrally lopsided; rare directions matter

Empirically, gradient matrices (and momentum buffers) of transformer layers are extremely low-rank-ish: a few huge singular values and a long tail of small ones. Vanilla SGD's update is therefore dominated by a handful of directions, while the long tail — which may encode rare but important features (infrequent tokens, unusual patterns) — barely moves the weights.

Setting all singular values to 1 means every direction the gradient identifies gets an equal-sized step. The dominant directions are tempered; the rare ones are amplified. The motivating intuition behind Muon is precisely this boosting of "rare directions" that elementwise optimizers chronically under-serve.

7b. Steepest descent under the right norm

"Gradient descent follows the steepest direction" is only true relative to a norm. The general steepest-descent step solves

ΔW=\argmaxΔW1G,ΔW,

and the answer depends on which · you pick:

Why is the spectral norm the right norm for a linear layer? Because what we ultimately care about is how much the layer's function changes, not how much its parameter vector moves. A layer computes y=Wx, and the spectral norm bounds exactly that: ΔWxΔW2x. Controlling the spectral norm of the update controls the worst-case change in the layer's behavior. This perspective (developed by Bernstein, Newhouse and collaborators as "modular duality" / the modular norm) makes orthogonalized updates not a hack but steepest descent in the geometry natural to matrix-shaped parameters — and it also explains nice empirical side effects, like learning rates transferring across model widths better than with Adam.

So we have two arrows pointing at the same update: Shampoo's preconditioning logic (Step 6) and norm-aware steepest descent (Step 7). All that remains is computing UV cheaply.


Step 8: Muon

Muon (MomentUm Orthogonalized by Newton–Schulz; Keller Jordan et al., 2024) is exactly the recipe the previous steps assembled, with one efficiency trick. Per 2D weight matrix, per step:

Bt=μBt1+Gt(momentum, Step 2; Nesterov variant: use Gt+μBt) Ot=NewtonSchulz(Bt)UV of Bt(orthogonalize, Steps 6–7) WWηOt

That's the whole optimizer. State: one momentum buffer, same as SGD-with-momentum. No L, no R, no eigendecompositions, no stale preconditioners. Note that we orthogonalize the momentum buffer, not the raw gradient — smoothing first, then shape.

The Newton–Schulz trick

Computing UV via an exact SVD every step would be slow and is numerically unfriendly on GPUs in low precision. Newton–Schulz iteration computes it with only matrix multiplications — the one operation GPUs are unreasonably good at.

The idea: find an odd polynomial p(x)=ax+bx3+cx5 that, when applied repeatedly, pushes any value in (0,1] toward 1. Applying the matrix version

XaX+b(XX)X+c(XX)2X

acts independently on each singular value of X (the singular vectors are untouched — that's the magic of odd polynomials in X). So iterate enough times and all singular values converge toward 1 while U and V stay fixed: the output approaches UV.

Concretely, Muon:

  1. Normalizes: X0=B/BF (guaranteeing all singular values are in [0,1] so the iteration is in its basin of convergence);
  2. Runs 5 iterations of the quintic above with coefficients tuned to (a,b,c)=(3.4445, 4.7750, 2.0315);
  3. Runs the whole thing in bfloat16 — it's stable enough, unlike Shampoo's root computations which often want fp64.

The coefficients were chosen to maximize how fast tiny singular values get inflated toward 1 (steep slope at 0), at the price of not converging exactly — singular values land in roughly [0.7,1.3] rather than at exactly 1. Empirically this sloppiness doesn't hurt at all; the update only needs to be approximately orthogonal. The overhead is a handful of matmuls per layer per step — typically well under 1% of total training FLOPs for a transformer, and the iteration parallelizes/distributes easily.

Two practical details that matter

Muon is for hidden weight matrices only. Embedding tables, the output/LM head, and all 1D parameters (biases, LayerNorm/RMSNorm gains) are not optimized with Muon — they're handed to AdamW. The spectral-norm story is about matrices that act as linear maps between activation spaces; embeddings and the unembedding are really lookup tables / per-token vectors with different geometry, and orthogonalizing them empirically hurts. Convolution kernels can be used by flattening their last dimensions to make them 2D. So in practice "training with Muon" means Muon for the hidden matrices + AdamW for the rest.

Shape-aware scaling. UV has a fixed scale regardless of the layer's dimensions (every singular value ≈ 1, so the update's RMS entry size is about 1/max(m,n)... which varies with shape). To make one learning rate work across differently-shaped layers, implementations rescale the orthogonalized update — the original Muon multiplies by max(1,m/n), while Moonshot AI's large-scale variant multiplies by 0.2max(m,n) so the update RMS matches AdamW's typical scale (letting you reuse AdamW learning rates and weight-decay settings). Either way, the point is the same: consistent update magnitude across shapes, so η transfers.

Reference implementation (simplified)

import torch

def newton_schulz5(B, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315
    X = B.bfloat16()
    transposed = X.size(0) > X.size(1)
    if transposed:               # work with the wide orientation
        X = X.T
    X = X / (X.norm() + eps)     # singular values into [0, 1]
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return (X.T if transposed else X).to(B.dtype)

@torch.no_grad()
def muon_step(W, G, buf, lr=0.02, momentum=0.95, nesterov=True):
    buf.mul_(momentum).add_(G)                 # B = mu*B + G
    upd = G.add(buf, alpha=momentum) if nesterov else buf
    O = newton_schulz5(upd)
    O *= max(1.0, W.size(0) / W.size(1)) ** 0.5   # shape-aware scaling
    W.add_(O, alpha=-lr)

(Production versions add weight decay, distribute the Newton–Schulz computation across GPUs, and pair this with AdamW for non-matrix parameters.)

Does it work?

Muon's debut was setting speed records on the NanoGPT-speedrun benchmark (~1.35× faster than tuned AdamW to equal validation loss), and it has since been validated at much larger scale — Moonshot AI's Moonlight (a 3B/16B-parameter MoE trained on 5.7T tokens) reported roughly 2× computational efficiency versus AdamW with the scaled, weight-decayed variant, and Muon-trained models like Kimi K2 followed. Beyond raw speed, reported side benefits include lower memory than Adam (one buffer vs. two), better learning-rate transfer across model sizes, and updates with controlled spectral norm.


Step 9: The whole picture in one table

State per layer Update rule Expensive op Key idea
SGD+momentum B B smooth the gradient
Adam m,v m/v per-coordinate scaling (diagonal preconditioner)
Full AdaGrad H (d×d) H1/2g d×d root whiten in full parameter space (intractable)
Shampoo L,R (+ momentum) L1/4GR1/4 matrix 4th roots (amortized) Kronecker-factored whitening
Muon B NS5(B)UV 5 rounds of matmuls orthogonalize the momentum

And the conceptual chain, compressed:

AdaGrad wants H1/2g → Shampoo factorizes it into two-sided whitening L1/4GR1/4 → with no accumulation that collapses to UV → which is independently justified as steepest descent under the spectral norm → and UV can be approximated with five rounds of bf16 matmuls → apply that to a momentum buffer and you have Muon.

Caveats and open edges

Suggested reading order

  1. Keller Jordan's blog post, "Muon: An optimizer for hidden layers in neural networks" (2024) — the primary source, very readable.
  2. Bernstein & Newhouse, "Old Optimizer, New Norm" and the modular-norm line of work — the steepest-descent-under-a-norm theory (Step 7b).
  3. Gupta, Koren & Singer, "Shampoo: Preconditioned Stochastic Tensor Optimization" (2018).
  4. Moonshot AI, "Muon is Scalable for LLM Training" (Moonlight, 2025) — weight decay, RMS-matched scaling, distributed implementation.

Opus 4.8 (High)

From Gradient Descent to Muon (by way of Shampoo)

An incremental tutorial. Each section adds exactly one idea to the previous one. By the end you should be able to look at the three-line Muon update and see why every line is there.

The punchline, stated up front so you know where we're going:

Muon is Shampoo with the bookkeeping thrown away. Shampoo preconditions a matrix gradient using accumulated statistics. If you strip out the accumulation and apply the same idea to a single matrix (the momentum), the math collapses into one operation: orthogonalize the matrix — set all its singular values to 1. Muon does exactly that, using a cheap GPU-friendly iteration instead of an SVD.

We'll build up to that claim, then prove it in two lines of linear algebra.


0. Notation and the one reframing that matters

A neural net is a big pile of parameters θ, a loss L(θ), and a gradient g=L(θ). Almost every optimizer you know treats θ as one long flat vector.

The single most important idea in this whole tutorial is to stop flattening. The weights of a linear layer are naturally a matrix Wm×n (mapping an n-dimensional input to an m-dimensional output). Its gradient is therefore also a matrix:

G=LWm×n.

SGD and Adam throw away this 2-D structure and process the mn numbers as an unordered bag of scalars. Shampoo and Muon keep the matrix shape and exploit it. That is the entire conceptual fork in the road.

Throughout, η is the learning rate and the SVD of a matrix is written G=UΣV, where U and V have orthonormal columns and Σ is diagonal with the singular values σi0.


1. SGD: the baseline

θt+1=θtηgt.

Step downhill, proportional to the gradient. Simple, but it struggles when different directions have very different curvature: it bounces along steep directions while crawling along shallow ones. Every fix below is, in some sense, an attempt to rescale the step direction-by-direction.


2. Momentum: remember where you were going

Raw gradients are noisy and oscillate. Average them over time with an exponential moving average ("velocity"):

mt=μmt1+gt,θt+1=θtηmt,

with μ0.90.95. Momentum smooths the trajectory and accelerates along consistent directions. Hold on to this mt — it survives all the way into Muon, where it becomes the thing we orthogonalize.


3. Adam: a diagonal preconditioner

Adam gives every coordinate its own adaptive step size by tracking the running magnitude of each coordinate's gradient:

mt=β1mt1+(1β1)gt,vt=β2vt1+(1β2)gt2,θt+1=θtηm^tv^t+ϵ

(with bias correction m^t,v^t). The key structural fact: dividing by vt coordinate-wise is the same as multiplying the gradient by a diagonal matrix P=diag(1/vt):

θt+1=θtηPm^t.

This P is called a preconditioner. Adam's preconditioner is diagonal: it can stretch or shrink each axis independently, but it knows nothing about how coordinates relate to one another. For a weight matrix, that means it ignores all the row/column structure — it's still treating W as a bag of scalars.

That limitation is the opening for everything that follows.


4. The general preconditioning idea (and why the honest version is impossible)

Adam's diagonal P is a crude approximation to something better. What's the "ideal" preconditioner?

Newton's method says step with the inverse Hessian: Δ=H1g. This corrects for curvature across coordinates, not just per-coordinate. Full-matrix AdaGrad is a first-order cousin that doesn't need the Hessian. It accumulates the outer products of gradients and preconditions by the inverse square root:

G^t=stgsgsd×d,θt+1=θtηG^t1/2gt.

This is a genuinely good (dense) preconditioner — it captures correlations between all coordinates.

It is also completely infeasible. If a single weight matrix has d=mn= a few million entries, then G^t is a few-million-by-few-million matrix, and we'd need its inverse square root every step. No.

So the entire game becomes: approximate the full preconditioner G^1/2 with something cheap that still respects matrix structure. Adam's answer ("just keep the diagonal") is one extreme. Shampoo's answer is much smarter.


5. Shampoo: a structured (Kronecker) preconditioner

Shampoo's idea: instead of one giant mn×mn preconditioner, keep two small ones — one for the row space, one for the column space of the matrix.

For a weight matrix Wm×n with matrix gradient Gt:

Lt=stGsGsm×m(left / output-side, m×m)Rt=stGsGsn×n(right / input-side, n×n)

and the update applies one to each side of the gradient:

Wt+1=WtηLt1/4GtRt1/4

Two things to notice.

(a) It's cheap-ish. We store and invert an m×m and an n×n matrix instead of an mn×mn one. For a 4096×4096 layer that's two 4096-size matrices instead of one 16-million-size one. This works because Shampoo implicitly approximates the full preconditioner G^ by a Kronecker product RL — the structured assumption that row-correlations and column-correlations factorize.

(b) Why the strange exponent 1/4? Each of L and R is built from products like GG, so each already contains the gradient "squared." A power of 1/2 on each side would over-correct. The 1/4 is exactly the exponent that makes the two half-sided corrections compose into one proper 1/2-power preconditioning of the full gradient. The next section makes this concrete — and in doing so, hands us Muon.


6. The bridge: one step of Shampoo = orthogonalization

Here is the two-line calculation that the whole tutorial is built around. Take Shampoo with no accumulation — just the current gradient, L=GG and R=GG — and plug in the SVD G=UΣV.

The eigendecompositions fall right out:

L=GG=UΣ2UL1/4=UΣ1/2U,R=GG=VΣ2VR1/4=VΣ1/2V.

Now sandwich the gradient (watch the orthonormal factors cancel):

L1/4GR1/4=UΣ1/2UL1/4UΣVGVΣ1/2VR1/4=UΣ1/2ΣΣ1/2V=UΣ0V=UV.

The singular values completely cancel. What's left, UV, is the gradient with every singular value reset to 1.

(Aside: this also shows why the exponent had to be 1/4. Use 1/2 instead and you'd get UΣ1V=G+, the pseudoinverse — far too aggressive. The 1/4 is precisely tuned to equalize the singular values rather than invert them.)

What is UV? It's the orthogonal polar factor of G: the closest semi-orthogonal matrix to G in Frobenius distance. Geometrically:

That rebalancing is the entire benefit of preconditioning, distilled into one clean operation. And it costs no accumulated statistics at all — just the SVD of the current matrix.

So: one-step Shampoo = orthogonalize the gradient. Muon is what you get by orthogonalizing the momentum instead, and computing it without an SVD.


7. Muon = momentum + orthogonalization

The name says it: MomentUm Orthogonalized by Newton-schulz.

Mt=μMt1+Gt(momentum, from §2)Ot=orthogonalize(Mt)UMVM(the §6 operation)Wt+1=WtηOt(step)

Read against everything above, Muon is: take the smoothed gradient (momentum), apply the single-step Shampoo preconditioner to it (which equals orthogonalization), and step. It drops Shampoo's running Lt,Rt statistics entirely — it re-derives the preconditioner fresh from the current momentum each step.

Only one piece is missing: computing UV without paying for an SVD. That's the "Newton-Schulz" part.


8. Computing the orthogonalization cheaply: Newton–Schulz

An SVD gives UV exactly, but SVDs parallelize poorly on GPUs and are unstable in low precision (bf16). We don't need it exact — we just need all singular values pushed close to 1. A matrix polynomial iteration does this beautifully and runs entirely in fast matmuls.

The Newton–Schulz quintic iteration:

Xk+1=aXk+b(XkXk)Xk+c(XkXk)2Xk.

Why this works: every term is an odd polynomial in X, so it acts on each singular value independently through the same scalar map

σaσ+bσ3+cσ5.

Choose the coefficients (a,b,c) so that iterating this scalar map drives every σ in (0,1] toward 1. Then iterating the matrix version drives X toward U(all-ones)V=UV, exactly the orthogonalization we want. The directions U,V never change — only the singular values move — because the iteration only ever multiplies X by functions of XX.

Two practical points:


9. Why orthogonalize, really? The spectral-norm view

There's a second, independent justification for Muon that doesn't go through Shampoo at all, and it explains why equalizing singular values is the right thing for training stability.

Steepest descent depends on which norm you use to measure "step size." Formally, the steepest-descent direction is

Δ=\argmaxΔ1G,Δ.

(The second follows because the dual of the spectral norm is the nuclear norm, and G,Δ is maximized over the spectral-norm ball precisely at the polar factor UV.)

So Muon is steepest descent under the spectral norm. Why care? The spectral norm of a weight update controls how much the layer's output can change in the worst-case input direction. Bounding it keeps any single feature direction from getting blown out in one step, which tends to give more stable training and more balanced feature learning across directions — the same circle of ideas as spectral/μP-style scaling. Adam, by contrast, controls a per-coordinate (roughly -flavored) quantity and lets the spectral norm drift.

Two routes, one destination: the Shampoo route says "orthogonalization is the cheap preconditioner," and the geometry route says "orthogonalization is the right-shaped step." They agree on UV.


10. The practical recipe

Muon is not a drop-in replacement for every parameter — it's specifically a matrix optimizer. The standard setup is a hybrid:

Update-magnitude scaling matters. An orthogonalized update has all singular values 1, so its size is fixed by the matrix shape, not by the gradient magnitude. A semi-orthogonal m×n matrix has per-entry RMS of 1/max(m,n). To make Muon's effective step size comparable to AdamW's (so you can reuse your learning-rate intuitions), implementations scale the update — a commonly cited choice is multiplying by about 0.2max(m,n). Treat the exact constant as implementation-dependent and tune it.

Other defaults that work well in practice: momentum μ0.95 (often the Nesterov variant), decoupled AdamW-style weight decay (important at larger scale), and the cheap ~5-step Newton–Schulz above. The orthogonalization cost is negligible next to the layer's own matmuls, and it shards cleanly across devices.


11. The whole story in one ladder

Step What it adds Preconditioner on a matrix gradient G
SGD identity
+ Momentum temporal averaging identity (applied to mt)
+ Adam per-coordinate adaptive scale diagonal diag(1/v)
(ideal) Full AdaGrad all cross-coordinate correlations dense G^1/2infeasible
+ Shampoo Kronecker-structured approximation L1/4(·)R1/4, L,R accumulated
+ Muon drop accumulation, cheap solve L1/4(·)R1/4 on one matrix = orthogonalize, via Newton–Schulz

Every rung answers the same two questions slightly better: how much of the gradient's matrix structure do we exploit, and how cheaply can we exploit it? Adam exploits the diagonal. Shampoo exploits the Kronecker structure but pays to accumulate it. Muon notices that for a single matrix the expensive accumulation collapses into a parameter-free operation — orthogonalization — and then makes that operation cheap.


Appendix: minimal pseudocode

import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximate the orthogonal polar factor U V^T of G in bf16."""
    a, b, c = 3.4445, -4.7750, 2.0315          # tuned quintic coefficients
    X = G.bfloat16()
    X = X / (X.norm() + eps)                    # normalize: singular values <= 1
    transpose = X.size(0) > X.size(1)           # iterate on the smaller dimension
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T                             # acts on each sigma via sigma -> a*s + b*s^3 + c*s^5
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transpose:
        X = X.T
    return X.to(G.dtype)

def muon_step(W, G, M, lr, momentum=0.95):
    """One Muon update for a 2-D weight matrix W with gradient G and momentum buffer M."""
    M.mul_(momentum).add_(G)                    # 1. momentum:        M = mu*M + G
    O = newton_schulz(M)                        # 2. orthogonalize:   O ~ U V^T
    scale = 0.2 * (max(W.shape) ** 0.5)         # 3. match AdamW-ish update RMS (tune this)
    W.add_(O, alpha=-lr * scale)                #    step:            W -= lr * scale * O
    return M

The three numbered lines in muon_step are the three lines from §7. Everything in this tutorial exists to explain why line 2 is newton_schulz and not something more complicated — because for a single matrix, the smartest known preconditioner (Shampoo) reduces to exactly that.


A note on what I'm confident about vs. not

The linear algebra (the §6 cancellation, the §9 spectral-norm derivation, the role of the 1/4 and 1/2 exponents) is exact and is the durable core of the tutorial. The specific hyperparameters — the Newton–Schulz coefficients, step count, and the 0.2max(m,n) scaling — are tuned engineering choices that vary across implementations and that people are still refining; use them as sensible starting points, not gospel. If you're implementing this for a real run, it's worth checking the current reference implementation for the latest recommended constants.