EXP #006 · Op Q-02 · ML Foundations from First Principles · step 2 ✓ Achievement

Optimisers: From SGD to Adam, One Fix at a Time

Every optimiser is a patch on the one before it. Learn the chain — momentum fixes zigzag, RMSprop fixes the dying learning rate, Adam fuses both — and you can derive any of them from memory.

2026-06-07 9 MIN READ COMPLETE

HYPOTHESIS

H₀ The optimiser family is not a list to memorise but a chain of fixes: each method addresses a specific failure of its predecessor, and Adam is exactly Momentum (direction) plus RMSprop (per-parameter scale) with bias correction.

METHOD

Second entry in ML Foundations from First Principles. I studied the optimiser family in Andrew Ng’s notation (α, dW, v_dW, s_dW), then sat a four-question mock: explaining the mechanism, finding two bugs in an Adam implementation, a 24 GB memory-budget scenario, and a curveball on why computer vision still trains with SGD plus momentum rather than Adam.

The framing that made it stick: optimisers are a chain of fixes, not a list. Each one patches a specific failure of the one before it.

BUILDING BLOCK — EXPONENTIALLY WEIGHTED AVERAGES

Everything downstream is built on this one idea. To smooth a noisy signal:

$v_t = \beta\, v_{t-1} + (1-\beta)\,\theta_t$

It averages over roughly 1/(1−β) readings: β=0.9 gives ≈ 10, β=0.98 gives ≈ 50, β=0.5 gives ≈ 2. Because v₀ = 0, the first few averages are biased toward zero; bias correction fixes this cold start:

$v_t^{\text{corrected}} = \frac{v_t}{1 - \beta^{\,t}}$
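A minimal sketch of the two formulas above in plain Python, to make the cold start concrete. The function names `ewma` and `effective_window` are my own labels, not from Ng's course. Feeding in a constant signal of 10 shows the uncorrected average starting near 1 while the corrected one sits at 10 from the first reading.

```python
def ewma(readings, beta=0.9, bias_correction=False):
    """v_t = beta * v_{t-1} + (1 - beta) * theta_t, starting from v_0 = 0."""
    v = 0.0
    out = []
    for t, theta in enumerate(readings, start=1):
        v = beta * v + (1.0 - beta) * theta
        # Bias correction divides by (1 - beta^t), which is small for early t.
        out.append(v / (1.0 - beta ** t) if bias_correction else v)
    return out


def effective_window(beta):
    """Rough number of readings the average spans: 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)


constant = [10.0] * 5
raw = ewma(constant, beta=0.9)                              # starts near 1, creeps up
corrected = ewma(constant, beta=0.9, bias_correction=True)  # stays at about 10
```

The correction factor 1 − βᵗ tends to 1 as t grows, so after the first few dozen steps the corrected and raw averages are practically the same.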

[Interactive figure: EWMA of a noisy signal with a β slider (shown at β = 0.90, averaging over ≈ 10 readings; range from noisy at β = 0.5 to smooth at β = 0.98) and a bias-correction toggle.]

↑ Drag β. Low β tracks every wobble (noisy); high β is smooth but lags the trend. Toggle bias correction and watch the cold-start dip lift.

THE CHAIN

Momentum — apply the EWMA to the gradient. A low-pass filter on direction:

$v_{dW} = \beta\, v_{dW} + (1-\beta)\,dW \qquad W := W - \alpha\, v_{dW}$

Consistent directions build up speed; oscillating ones cancel. dW is acceleration, v_dW velocity, β friction. Fixes the ravine zigzag.
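To see "build up" versus "cancel" in numbers, here is a small sketch that runs the velocity update on two hypothetical gradient streams: one that always points the same way and one that flips sign every step. The streams and the helper name `momentum_velocity` are illustrative assumptions.

```python
def momentum_velocity(grads, beta=0.9):
    """Run v_dW = beta * v_dW + (1 - beta) * dW over a gradient stream."""
    v = 0.0
    for dW in grads:
        v = beta * v + (1.0 - beta) * dW
    return v


consistent = [1.0] * 50                                          # same direction every step
oscillating = [1.0 if i % 2 == 0 else -1.0 for i in range(50)]  # flips every step

v_consistent = momentum_velocity(consistent)    # approaches 1.0
v_oscillating = momentum_velocity(oscillating)  # stays close to 0
```

With β = 0.9 the consistent stream drives the velocity to almost the full gradient magnitude, while the alternating stream leaves only a few percent of it.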

RMSprop — EWMA of the squared gradient, then divide by its root:

$s_{dW} = \beta_2\, s_{dW} + (1-\beta_2)\,dW^2 \qquad W := W - \alpha\,\frac{dW}{\sqrt{s_{dW}} + \varepsilon}$

Squaring drops the sign, so s_dW measures magnitude. Dividing by √s_dW is automatic gain control: loud, volatile dimensions get smaller steps; quiet ones get bigger. It’s AdaGrad with the unbounded sum swapped for an EWMA. AdaGrad’s sum only grows, so its effective learning rate decays toward zero and training stalls; the EWMA forgets old gradients, so RMSprop’s does not.
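A rough sketch of both claims, assuming a single scalar parameter and a constant gradient (a toy setup, not a real training loop). The defaults α = 0.01 and β₂ = 0.9 are my choices for illustration, and `last_step_size` is a hypothetical helper. RMSprop ends up taking the same size step whether the gradient is 100 or 1, and AdaGrad's step keeps shrinking as its sum grows.

```python
import math


def rmsprop_step(w, dW, s, alpha=0.01, beta2=0.9, eps=1e-8):
    """One RMSprop update; s is the running average of dW**2."""
    s = beta2 * s + (1.0 - beta2) * dW * dW
    return w - alpha * dW / (math.sqrt(s) + eps), s


def adagrad_step(w, dW, g_sum, alpha=0.01, eps=1e-8):
    """One AdaGrad update; g_sum is the unbounded sum of dW**2."""
    g_sum = g_sum + dW * dW
    return w - alpha * dW / (math.sqrt(g_sum) + eps), g_sum


def last_step_size(step_fn, dW, n_steps):
    """Feed a constant gradient dW and return the size of the final step."""
    w, state, size = 0.0, 0.0, 0.0
    for _ in range(n_steps):
        w_new, state = step_fn(w, dW, state)
        size = abs(w_new - w)
        w = w_new
    return size


loud = last_step_size(rmsprop_step, 100.0, 200)  # about alpha
quiet = last_step_size(rmsprop_step, 1.0, 200)   # about alpha as well
ada = last_step_size(adagrad_step, 1.0, 200)     # about alpha / sqrt(200)
```

Under plain SGD the loud dimension would move 100 times further per step than the quiet one; here the two are equalised.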

Adam — Momentum and RMSprop together, each bias-corrected with its own β:

$v_{dW} = \beta_1 v_{dW} + (1-\beta_1)dW, \qquad s_{dW} = \beta_2 s_{dW} + (1-\beta_2)dW^2$

$W := W - \alpha\,\frac{v_{dW}/(1-\beta_1^{\,t})}{\sqrt{s_{dW}/(1-\beta_2^{\,t})} + \varepsilon}$

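Defaults: β₁=0.9 (first moment), β₂=0.999 (second moment), ε=10⁻⁸; α still has to be tuned. AdamW is the modern variant: it applies weight decay directly to the weights instead of routing it through the gradient, where it would be rescaled by the adaptive denominator. That decoupling is why AdamW, not vanilla Adam, is the usual choice for training LLMs.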
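The update can be sketched for a single scalar parameter as below. This is a minimal illustration under my own assumptions, not a reference implementation: the `weight_decay` argument is a hypothetical add-on showing AdamW-style decoupled decay, applied to the weight directly and never divided by √s_dW. One consequence worth checking: because both moments are bias-corrected, the very first step has magnitude close to α whatever the gradient scale.

```python
import math


def adam_step(w, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update at step t (1-indexed). weight_decay > 0 adds
    decoupled, AdamW-style decay (an illustrative extension)."""
    v = beta1 * v + (1.0 - beta1) * dW       # first moment: direction
    s = beta2 * s + (1.0 - beta2) * dW * dW  # second moment: scale
    v_hat = v / (1.0 - beta1 ** t)           # bias-corrected first moment
    s_hat = s / (1.0 - beta2 ** t)           # bias-corrected second moment
    adaptive = v_hat / (math.sqrt(s_hat) + eps)
    # Decay acts on w itself, outside the adaptive denominator.
    w = w - alpha * (adaptive + weight_decay * w)
    return w, v, s


# First step from w = 0 with a huge and a tiny gradient: both move by about alpha.
w_big, _, _ = adam_step(0.0, 1000.0, 0.0, 0.0, t=1)
w_small, _, _ = adam_step(0.0, 0.001, 0.0, 0.0, t=1)

# Zero gradient, decay only: w shrinks by alpha * weight_decay * w.
w_decayed, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=0.1)
```

Putting the decay inside `dW` instead would send it through the division by √s_dW, which is the coupling AdamW removes.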

THE WAY I REMEMBER IT — ADAM IS TWO FILTERS

Here’s the mental model that made Adam click for me — straight out of signal processing. Adam is two filters stacked:

① v_dW is a low-pass filter on direction. The raw gradient is a noisy signal; the EWMA passes the slow, consistent trend and rejects the high-frequency jitter. That’s momentum: keep the direction the gradient agrees on, drop the bounce.
② 1/√s_dW is automatic gain control. It measures the recent power of the gradient (its square) and turns the step size down where the signal is loud, up where it’s quiet, exactly like the AGC in a radio that stops a loud station from blowing your ears out.

So one Adam step = smooth the direction (low-pass) and normalise the volume (AGC). Watch both at once below — the top panel is the low-pass smoothing the noisy gradient; the bottom is the gain dropping right where the gradient gets loud.

[Interactive figure, two panels. Top: v_dW, the low-pass filter (keeps the trend, drops the noise). Bottom: 1/√s_dW, the automatic gain control (turns the step down where the gradient is loud). Sliders: β₁ (low-pass, shown at 0.90) and β₂ (AGC, shown at 0.990).]

Adam = a low-pass filter on direction + automatic gain control on step size. Raise β₁ to smooth harder; raise β₂ to make the gain react slower.

↑ Raise β₁ and the coral low-pass line gets smoother (more momentum). In the shaded burst the raw gradient goes loud — the blue gain line dips, because s_dW sees the power spike and the AGC clamps the step. Two knobs, two filters.

PLAYGROUND — WATCH THE PATH

Ng’s contour-bowl, made interactive. The surface is a ravine — steep up-and-down, shallow left-to-right. Switch optimisers and watch how each navigates it.

[Interactive figure: optimiser paths on a ravine-shaped contour plot, with an optimiser selector and a step control.]

↑ SGD zigzags across the steep walls, crawling toward the min. Momentum averages those oscillations away → a straighter, faster path (drag β to feel the friction). RMSprop damps the steep direction directly. This single picture is the whole reason these methods exist.
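For a static version of the same picture, here is a sketch on a hypothetical ravine f(x, y) = ½(x² + 100y²), steep in y and shallow in x. The surface, start point and learning rates are my assumptions, picked so the effect is easy to see. The honest detail is that the speed-up comes from momentum tolerating a much larger α: plain SGD at α = 0.3 would diverge on the steep axis (|1 − 0.3·100| > 1), so it runs near its stability limit at α = 0.019 instead.

```python
import math


def grad(x, y, steep=100.0):
    """Gradient of the ravine f(x, y) = 0.5 * (x**2 + steep * y**2)."""
    return x, steep * y


def run_sgd(alpha, steps, start=(10.0, 1.0)):
    """Plain gradient descent; returns the final distance to the minimum."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - alpha * gx, y - alpha * gy
    return math.hypot(x, y)


def run_momentum(alpha, beta, steps, start=(10.0, 1.0)):
    """Momentum in Ng's form: v = beta * v + (1 - beta) * dW, W -= alpha * v."""
    x, y = start
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + (1.0 - beta) * gx
        vy = beta * vy + (1.0 - beta) * gy
        x, y = x - alpha * vx, y - alpha * vy
    return math.hypot(x, y)


dist_sgd = run_sgd(alpha=0.019, steps=150)                    # still far along the shallow axis
dist_momentum = run_momentum(alpha=0.3, beta=0.9, steps=150)  # much closer to the minimum
```

After the same 150 steps, SGD has killed the steep-axis oscillation but is still crawling along the shallow axis, while momentum has closed most of the distance in both directions.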

THE COST — MEMORY

The part interview answers usually miss: optimiser state is a memory tax, and at scale it dominates.

DATA TABLE n=4

Optimiser	Fixes	State stored	Memory (weights + state, × weight size)
SGD	(baseline)	none	1×
Momentum	ravine zigzag	v_dW	2×
RMSprop	AdaGrad dying LR	s_dW	2×
★ Adam / AdamW	both at once	v_dW + s_dW	3×

A 3B model in fp32: 12 GB of weights becomes 36 GB under Adam (weights + two full-size buffers), before gradients or activations. On a 24 GB card you don’t full-fine-tune — you reach for LoRA/QLoRA (train tiny adapters so the 3× tax applies to millions, not billions, of params), 8-bit Adam, or fall back to momentum-SGD.
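The arithmetic is simple enough to script. This sketch counts weights plus optimiser state only, in decimal GB, and ignores gradients and activations as the paragraph above does. The 10-million-parameter adapter is a made-up figure for illustration, not a measured LoRA size.

```python
def training_state_gb(n_params, bytes_per_value=4, state_buffers=2):
    """Weights plus optimiser state in GB (1e9 bytes).
    state_buffers: 0 for SGD, 1 for momentum or RMSprop, 2 for Adam/AdamW."""
    copies = 1 + state_buffers
    return n_params * bytes_per_value * copies / 1e9


weights_only = training_state_gb(3e9, state_buffers=0)  # fp32 weights of a 3B model
with_adam = training_state_gb(3e9, state_buffers=2)     # weights + v_dW + s_dW
with_momentum = training_state_gb(3e9, state_buffers=1)

# Hypothetical adapter with 10 million trainable parameters under Adam.
adapter_with_adam = training_state_gb(10e6, state_buffers=2)
```

The adapter's own Adam footprint is a fraction of a gigabyte; the frozen base weights still have to be stored, which is the part QLoRA shrinks by quantising them.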

CONCLUSION

✓ ACHIEVEMENT

Hypothesis confirmed.

The chain holds: momentum smooths direction, RMSprop scales per-parameter, Adam fuses them, AdamW fixes the weight-decay coupling. Derivable from memory, not memorised.

The two edges that separate depth from recitation: optimiser state is a dominant memory cost at LLM scale (hence LoRA/QLoRA and 8-bit Adam), and faster convergence isn’t free. Adam often lands in solutions that generalise worse than SGD’s on vision benchmarks, an effect commonly attributed to sharper minima, which is why ResNets are still typically trained with SGD plus momentum. Same implicit-regularisation thread as the double-descent graph: the optimiser quietly chooses which solution you land on, and flatter minima tend to do better on the test set.

WHAT NEXT

Learning-rate schedules — warmup and cosine decay, and why transformers fall over without warmup. Then batch normalisation: what it actually does to the loss surface, kept separate from its accidental regularising side-effect.
