EXP #005 · Q-02 · ML Foundations from First Principles · step 1 · ✓ Achievement

Bias-Variance: The U-Curve Everyone Cites and Few Can Diagnose

Anyone can label a model "overfit". The signal is reading it off two numbers, knowing which lever to pull first — and knowing exactly where the classical theory stops being true.

2026-06-07 8 MIN READ COMPLETE
01

HYPOTHESIS

H₀ Test error decomposes cleanly into bias, variance, and irreducible noise — and a model's failure mode can be diagnosed from the gap between training and validation error, then treated by moving along the complexity axis.
02

METHOD

This is the first entry in the ML Foundations from First Principles quest — prep for senior ML/AI engineer interviews, done the way I build systems: intuition first, then the math, then a test that tries to break my understanding.

I studied bias-variance to the point where I could explain it without notes, then sat a four-question mock interview: one conceptual, one reading a buggy learning-curve script, one production system-design scenario, and one curveball on why billion-parameter models don’t overfit. The interesting part wasn’t what I got right — it was every place I circled the right idea without naming the mechanism. Those gaps are the real output.

03

OBSERVATIONS

  1. Bias is error from wrong assumptions: the model is too simple to represent reality, and it shows up as high training error. It’s consistently wrong in the same way. Variance is sensitivity to the particular training sample: low training error but a large gap to validation. The model memorised the noise, so a different sample would give a wildly different model.
  2. The diagnosis lives in two numbers. The train/val gap is the variance meter; the height of the training error is the bias meter. You never observe bias and variance directly; you infer them from the errors you can measure.
  3. “Learning curve” is ambiguous and the distinction matters. The one that diagnoses bias vs variance is error vs dataset size: if train and val converge at a high value, it’s bias and more data won’t help; if a gap persists and is still closing, it’s variance and more data will. That’s different from the per-epoch loss curve you watch during training, which only tells you when to early-stop.
  4. High bias and high variance can coexist, for example in a complex model trained on noisy or mislabelled data. The trap: adding regularisation to fix the “overfit” pushes bias even higher. When both are high, fix the data before touching the model.
  5. The classical U-curve only describes generalisation within a fixed (i.i.d.) distribution, and it treats parameter count as a proxy for capacity. Modern deep learning breaks both assumptions — drift moves the distribution, and overparameterised networks enter a second descent where test error falls again. Owning bias-variance means owning exactly where it stops being true.
04

THE MATH

For a new point x, the expected squared test error decomposes into exactly three pieces:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\text{Bias}[\hat{f}(x)]\big)^2}_{\text{wrong on average}} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{jumpy across samples}} + \underbrace{\sigma^2}_{\text{irreducible}}$$

Where Bias[f̂(x)] = E[f̂(x)] − f(x) (how far the average model is from the truth) and Var[f̂(x)] = E[(f̂(x) − E[f̂(x)])²] (how much the model fluctuates if you redraw the training set). The third term, σ², is a noise floor no model can beat.
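The step the statement skips is why no cross terms survive. A short derivation, assuming y = f(x) + ε with zero-mean noise of variance σ² that is independent of the training sample (an assumption the decomposition needs but the text does not spell out), and with every expectation taken over training sets and noise:

```latex
\begin{aligned}
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  &= \mathbb{E}\Big[\big(\underbrace{f(x) - \mathbb{E}[\hat{f}(x)]}_{-\text{Bias}}
     + \underbrace{\mathbb{E}[\hat{f}(x)] - \hat{f}(x)}_{\text{mean zero}}
     + \varepsilon\big)^2\Big] \\
  &= \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2
     + \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]
     + \mathbb{E}[\varepsilon^2] \\
  &= \big(\text{Bias}[\hat{f}(x)]\big)^2 + \text{Var}[\hat{f}(x)] + \sigma^2
\end{aligned}
```

The three cross terms vanish because the first bracket is a constant, the second has mean zero by construction, and ε has mean zero and is independent of the fitted model.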

Turning the model-complexity knob trades one for the other: more complexity lowers bias but raises variance. You are not eliminating either; you are minimising their sum. That sum traces the famous U.
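Bias and variance cannot be read off a single fit, but in a simulation where the true function is known they can be estimated by redrawing the training set many times. The sketch below is one way to do that, assuming a sine target on [-1, 1], Gaussian noise, and polynomial models; the function name `bias_variance` and every default value are illustrative choices, not taken from the playground.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def true_f(x):
    # Assumed ground truth for the simulation.
    return np.sin(np.pi * x)

def bias_variance(degree, n_train=16, noise_sd=0.3, n_sets=300, seed=0):
    """Monte Carlo estimate of (bias^2, variance), averaged over a grid of x."""
    rng = np.random.default_rng(seed)
    x_eval = np.linspace(-1.0, 1.0, 50)
    preds = np.empty((n_sets, x_eval.size))
    for i in range(n_sets):
        # Redraw the training set: new inputs and new noise each time.
        x = rng.uniform(-1.0, 1.0, n_train)
        y = true_f(x) + rng.normal(0.0, noise_sd, n_train)
        coefs = P.polyfit(x, y, degree)
        preds[i] = P.polyval(x_eval, coefs)
    mean_pred = preds.mean(axis=0)  # estimate of E[f_hat(x)]
    bias_sq = float(np.mean((mean_pred - true_f(x_eval)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

for degree in (1, 3, 9):
    b2, var = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

The bias estimate for high degrees is itself noisy, because the average of wildly varying fits converges slowly; the variance column is the more trustworthy signal there.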

05

PLAYGROUND


This is the whole tradeoff in one slider. Drag from SIMPLE to COMPLEX and watch a polynomial try to fit 16 noisy samples of a true sine curve.

[Interactive figure: top panel fits a degree-3 polynomial to 16 noisy samples; bottom panel plots error vs model complexity. Slider runs from SIMPLE to COMPLEX. Readout at degree 3: train MSE = 0.050, test MSE = 0.001, gap = -0.048, labelled GOOD FIT · BALANCED. Sweet spot ≈ degree 3 (lowest test error).]

↑ At degree 1 the line can’t bend to the curve — high bias, train and test error both high. Crank it to degree 12 and the curve snakes through every noisy point — high variance, train error near zero but test error blows up. The bottom panel is the payoff: train error always falls, test error makes the U. The bottom of that U is the sweet spot.
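The slider experiment can be approximated offline. This is a minimal sketch, not the playground's code: it assumes a sine target on [-1, 1], noise with standard deviation 0.3, evenly spaced training inputs, and a noisy held-out set, so its numbers will not match the readout above.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
noise_sd = 0.3  # assumed noise level

def true_f(x):
    return np.sin(np.pi * x)

# 16 noisy training samples and a larger held-out test set.
x_train = np.linspace(-1.0, 1.0, 16)
y_train = true_f(x_train) + rng.normal(0.0, noise_sd, x_train.size)
x_test = rng.uniform(-1.0, 1.0, 500)
y_test = true_f(x_test) + rng.normal(0.0, noise_sd, x_test.size)

def mse(coefs, x, y):
    return float(np.mean((P.polyval(x, coefs) - y) ** 2))

# Sweep the complexity knob: one least-squares fit per polynomial degree.
results = {}
for degree in range(1, 13):
    coefs = P.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))

best_degree = min(results, key=lambda d: results[d][1])
for degree, (train_err, test_err) in results.items():
    marker = "  <- lowest test error" if degree == best_degree else ""
    print(f"degree {degree:2d}: train = {train_err:.3f}, test = {test_err:.3f}{marker}")
```

Because each degree nests the previous one, training error can only fall as the degree grows; the test column is where the U shows up.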

06

DIAGNOSIS — READ IT OFF TWO NUMBERS

The entire framework, made actionable:

DATA TABLE n=3

| What you see | Diagnosis | Reach for first |
| --- | --- | --- |
| Train high, val also high (small gap) | High bias / underfit | Bigger model, more features, less regularisation |
| Train low, val much higher (big gap) | High variance / overfit | More data, regularisation, early stopping |
| ★ Both high + huge gap | Both, usually bad/noisy data | Fix the data, not the model |

Notice the fixes for bias and variance pull in opposite directions — “less regularisation” cures one, “more regularisation” cures the other. That opposition is the tradeoff. And over-regularising doesn’t just stop helping; it walks you back across the U into high-bias territory. Validation error is the meter you watch to find the bottom.
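The table can be written down as a small decision function. This is a sketch under loud assumptions: `acceptable_err` (what counts as "high" training error) and `gap_tol` (what counts as a "big" gap) are hypothetical thresholds you would have to set per problem, and the fourth "balanced" outcome is an addition for completeness, not a row of the table.

```python
def diagnose(train_err, val_err, acceptable_err, gap_tol):
    """Map two numbers to (diagnosis, first lever), following the table.

    acceptable_err and gap_tol are problem-specific, hypothetical thresholds.
    """
    high_bias = train_err > acceptable_err           # bias meter: height of train error
    high_variance = (val_err - train_err) > gap_tol  # variance meter: train/val gap

    if high_bias and high_variance:
        return ("both, usually bad/noisy data", "fix the data, not the model")
    if high_bias:
        return ("high bias / underfit",
                "bigger model, more features, less regularisation")
    if high_variance:
        return ("high variance / overfit",
                "more data, regularisation, early stopping")
    # Not a table row: added so the function is total.
    return ("balanced", "no change; keep watching validation error")

print(diagnose(train_err=0.30, val_err=0.32, acceptable_err=0.10, gap_tol=0.05))
print(diagnose(train_err=0.02, val_err=0.25, acceptable_err=0.10, gap_tol=0.05))
print(diagnose(train_err=0.30, val_err=0.60, acceptable_err=0.10, gap_tol=0.05))
```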

There’s a second diagnostic interviewers love: the learning curve, meaning error vs dataset size (not vs epochs). It answers the one question the U-curve can’t: will more data even help?

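[Interactive figure: train and validation error vs dataset size, with a complexity slider from SIMPLE to COMPLEX. Sample readout: final gap = 0.038, labelled BALANCED (small gap at a low floor, near the sweet spot). Reading guide: converge-high = bias (more data won't help); persistent gap = variance (more data will).]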

↑ Drag complexity. Simple → train and validation converge at a high floor with a tiny gap: that’s bias, and more data won’t move it. Complex → a big gap that’s still closing: that’s variance, and more data will help. The gap is your variance meter; the floor is your bias meter.
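A learning curve in this sense is cheap to compute: refit the same model class at increasing training-set sizes and track both errors. The sketch below assumes the same kind of toy setup as the playground (sine target, Gaussian noise, polynomial models); `learning_curve`, the size grid, and the trial count are illustrative choices.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def true_f(x):
    return np.sin(np.pi * x)

def learning_curve(degree, sizes=(20, 50, 100, 400, 1600),
                   noise_sd=0.3, n_trials=20, seed=0):
    """Return [(n, mean train MSE, mean validation MSE)] for each dataset size."""
    rng = np.random.default_rng(seed)
    # One fixed validation set shared by every fit.
    x_val = rng.uniform(-1.0, 1.0, 2000)
    y_val = true_f(x_val) + rng.normal(0.0, noise_sd, x_val.size)
    rows = []
    for n in sizes:
        train_errs, val_errs = [], []
        for _ in range(n_trials):
            x = rng.uniform(-1.0, 1.0, n)
            y = true_f(x) + rng.normal(0.0, noise_sd, n)
            coefs = P.polyfit(x, y, degree)
            train_errs.append(np.mean((P.polyval(x, coefs) - y) ** 2))
            val_errs.append(np.mean((P.polyval(x_val, coefs) - y_val) ** 2))
        rows.append((n, float(np.mean(train_errs)), float(np.mean(val_errs))))
    return rows

for degree in (1, 9):
    print(f"degree {degree}")
    for n, train_err, val_err in learning_curve(degree):
        print(f"  n = {n:4d}: train = {train_err:.3f}, val = {val_err:.3f}, "
              f"gap = {val_err - train_err:.3f}")
```

For the degree-1 model the two curves should meet at a floor well above the noise level; for the degree-9 model the gap starts large and shrinks as n grows, which is the "more data will help" signature.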

07

THE LLM EXCEPTION — DOUBLE DESCENT


Here’s where the U-curve stops being the whole story, and the question that breaks most candidates: billion-parameter LLMs have enough capacity to memorise their training data (the classic overparameterised setting is more parameters than training examples), so by the classical tradeoff they should overfit catastrophically. They don’t. Why?

Because the classical U only covers the left half of this graph. Keep adding parameters and something strange happens: test error rises to a peak right where the model can just barely fit the data (the interpolation threshold, params ≈ data), then descends a second time. Modern LLMs live far to the right, in that second descent.

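[Interactive figure: test error vs parameter count, with a slider from SMALL to HUGE. Sample readout: test error = 0.30, labelled CLASSICAL REGIME (the textbook U-curve lives here). Dragging past the interpolation threshold shows error rise, peak, then fall again into the regime where LLMs operate.]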

This is how to visualise “params vs performance” for LLMs. Drag SMALL → HUGE. The classical regime (left) is the U we just studied. The peak is the danger zone. The modern regime (right) is why scale works, and it’s not magic: the leading explanation is that SGD implicitly prefers simple, flat-minimum solutions among the many that fit the data, so a huge model still tends to land on one that generalises.
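The peak-then-descent shape can be reproduced in a linear toy, far from an LLM but small enough to run in a second. The sketch below is an assumption-heavy stand-in: the target depends on 400 Gaussian features, the model sees only the first p of them, and the fit is the minimum-norm least-squares solution (the one gradient descent started from zero converges to on this problem). The name `double_descent` and all sizes are illustrative.

```python
import numpy as np

def double_descent(n_train=40, n_latent=400, noise_sd=0.5, n_trials=20, seed=0):
    """Mean test MSE of min-norm least squares as the number of used features grows."""
    rng = np.random.default_rng(seed)
    widths = [5, 10, 20, 30, 40, 60, 100, 200, 400]  # p; threshold at p == n_train
    errs = {p: [] for p in widths}
    for _ in range(n_trials):
        beta = rng.normal(size=n_latent) / np.sqrt(n_latent)  # true weights, norm ~ 1
        X = rng.normal(size=(n_train, n_latent))
        y = X @ beta + rng.normal(0.0, noise_sd, n_train)
        X_te = rng.normal(size=(1000, n_latent))
        y_te = X_te @ beta + rng.normal(0.0, noise_sd, 1000)
        for p in widths:
            # pinv gives ordinary least squares when p < n_train and the
            # minimum-norm interpolating solution when p > n_train.
            w = np.linalg.pinv(X[:, :p]) @ y
            errs[p].append(np.mean((X_te[:, :p] @ w - y_te) ** 2))
    return {p: float(np.mean(v)) for p, v in errs.items()}

for p, err in double_descent().items():
    print(f"p = {p:3d}: test MSE = {err:.3f}")
```

Test error should spike at p = 40, where the model can only just interpolate the 40 training points, and fall again as p grows past it. This illustrates the curve's shape only; it says nothing about flat minima or about how real networks are trained.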

08

CONCLUSION

✓ ACHIEVEMENT
Hypothesis confirmed — within its domain.

The decomposition holds and the train/val gap is a genuinely reliable diagnostic. I can read a model’s failure mode off two numbers and name the right first lever. That part is owned.

But the honest result is the boundary. The classical U-curve assumes i.i.d. data and treats parameter count as capacity, and both break in practice. A model that catches fraud at launch rots in production not because its variance grew, but because the distribution moved out from under it (drift is a different failure mode entirely). And billion-parameter LLMs generalise despite having the capacity to memorise their training data; the leading explanation is that SGD implicitly prefers simple, flat-minimum solutions and the test curve enters a second descent. Bias-variance doesn’t explain those, and knowing that is the point.
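The drift point is easy to see in a toy: freeze a model, then change only where the inputs come from. This sketch is a deliberately crude covariate shift, not a fraud model; the input ranges, the cubic fit, and the noise level are all assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
noise_sd = 0.1  # assumed noise level

def true_f(x):
    return np.sin(np.pi * x)

def sample(lo, hi, n):
    x = rng.uniform(lo, hi, n)
    return x, true_f(x) + rng.normal(0.0, noise_sd, n)

# Train once on inputs from [-1, 1]; the model is never touched again.
x_train, y_train = sample(-1.0, 1.0, 200)
coefs = P.polyfit(x_train, y_train, 3)

def mse(x, y):
    return float(np.mean((P.polyval(x, coefs) - y) ** 2))

# "Launch" traffic matches training; "drifted" traffic has moved to [0.5, 1.5].
launch_mse = mse(*sample(-1.0, 1.0, 1000))
drift_mse = mse(*sample(0.5, 1.5, 1000))
print(f"launch MSE = {launch_mse:.3f}, drifted MSE = {drift_mse:.3f}")
```

The coefficients are identical in both evaluations, so nothing about the model's bias or variance has changed; only the input distribution has.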

09

WHAT NEXT

Optimisers from first principles — SGD → Momentum → AdaGrad → RMSProp → Adam → AdamW — and why optimiser state (those two Adam moment buffers) becomes the dominant memory cost when you train large models. After that, a dedicated dive into double descent and the implicit regularisation of SGD: the two ideas that explain why the U-curve isn’t the whole story.