EXP #007 Ml Ml Q-03 · Production ML — Monitoring, Serving, Scale · step 1 ✓ Achievement

Detecting Drift: PSI, KS, and the Stat That Lies When You Bin It Wrong

Three kinds of drift, four ways to measure it, and the one detail — your binning — that decides whether the alarm fires on noise or stays silent through a real shift.

2026-06-12 6 MIN READ COMPLETE

HYPOTHESIS

H₀ Distribution shift between training and production is measurable without labels, and the right statistic depends on the data type — but every binned metric (PSI, chi-square) can be fooled by a poor binning choice in either direction.

METHOD

First entry in Production ML, the MLOps half of interview prep. Shipping a model is only half the job; the other half is noticing when the world it was trained on stops matching the world it now runs in. This post covers detecting that shift. The next one (EXP-008) covers building monitoring around it.

Studied in the fraud / credit-risk setting, then tested in a mock interview: interpret a PSI series, name the binning failure modes, and match each statistic to its data type.

OBSERVATIONS

01 Three drifts. Data drift: P(X) moves (the inputs shift, the rule is stable), e.g. a new user demographic. Concept drift: P(y|X) moves (same input, different correct answer). Buying more than 1 kg of gold was legal yesterday and is illegal today because a new government rule landed: identical transaction, flipped label. Label drift: the base rate P(y) moves. Only data drift is detectable on day one with no labels.
02 PSI thresholds (a rule of thumb, not a significance test): <0.1 stable, 0.1–0.25 investigate, >0.25 act. But PSI needs binning, and the binning cuts both ways.
★ Binning lies in two directions. Bins that are too coarse hide a shift happening within a bin → missed drift. Picture a cosmetics brand whose buyers used to cluster around age 44 and now split into a younger and an older segment: same average age, so a coarse “young vs old” bin sees nothing, while finer age bins reveal that the market polarised. Bins that are too sparse in the tails make the ln(prod/train) term explode on tiny fluctuations → false alarm, and an empty bin makes PSI undefined. The fix: ~10 quantile bins (cut on the training data) with an epsilon floor on the bin fractions.
03 Pick the test by data type. KS for continuous features: the max gap between two CDFs, bin-free, and it gives a real p-value. Chi-square for categorical features: observed vs expected counts.
04 PSI isn’t its own invention. It’s the symmetric KL divergence: PSI = KL(P‖Q) + KL(Q‖P), the Jeffreys divergence. The symmetry is exactly why the reference direction doesn’t matter: swapping training and production gives the same value.

THE MATH

Population Stability Index across bins:

$\text{PSI} = \sum_i (\text{prod}_i - \text{train}_i)\,\ln\!\frac{\text{prod}_i}{\text{train}_i}$
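
The formula can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production monitor: the function names (`bin_fractions`, `psi`, `psi_status`), the `eps=1e-4` floor, and the synthetic Gaussian data are all assumptions made for the example. Bin edges are cut at the training quantiles, and the floor is applied to bin fractions so an empty bin cannot make the logarithm blow up.

```python
import bisect
import math
import random
import statistics


def bin_fractions(values, edges):
    """Share of values per bin; bin i covers (edges[i-1], edges[i]]."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_left(edges, v)] += 1
    return [c / len(values) for c in counts]


def psi(train, prod, n_bins=10, eps=1e-4):
    """PSI over quantile bins cut on the training sample."""
    # n_bins - 1 interior cut points, so each bin holds ~1/n_bins of training.
    edges = statistics.quantiles(train, n=n_bins, method="inclusive")
    # Epsilon floor (assumed value): keeps ln(prod/train) finite for empty bins.
    train_f = [max(f, eps) for f in bin_fractions(train, edges)]
    prod_f = [max(f, eps) for f in bin_fractions(prod, edges)]
    return sum((q - p) * math.log(q / p) for p, q in zip(train_f, prod_f))


def psi_status(value):
    """Rule-of-thumb bands from the thresholds quoted in this post."""
    if value < 0.1:
        return "stable"
    if value <= 0.25:
        return "investigate"
    return "act"


rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # no drift
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # mean moved by 1 sd

for name, sample in [("same", same), ("shifted", shifted)]:
    value = psi(train, sample)
    print(f"{name}: PSI = {value:.3f} -> {psi_status(value)}")
```

Cutting the edges on the training sample alone keeps the reference bins fixed over time, so a PSI series stays comparable from one production window to the next.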

The Kolmogorov–Smirnov statistic — the largest vertical gap between the two cumulative distributions:

$D = \max_x \,\lvert F_P(x) - F_Q(x) \rvert$
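
As a rough sketch of what that maximum means in practice, the statistic can be computed directly from two samples by comparing their empirical CDFs at every observed value. The function name `ks_statistic` is an assumption for this example, and the sketch returns only D. For the p-value, a library routine such as `scipy.stats.ks_2samp` is the usual route; it is left out here to keep the example dependency-free.

```python
import bisect


def ks_statistic(sample_p, sample_q):
    """Largest vertical gap between the two empirical CDFs."""
    p_sorted = sorted(sample_p)
    q_sorted = sorted(sample_q)
    d = 0.0
    # Empirical CDFs are step functions that only jump at observed values,
    # so evaluating the gap at each observed value covers every candidate.
    for x in p_sorted + q_sorted:
        f_p = bisect.bisect_right(p_sorted, x) / len(p_sorted)
        f_q = bisect.bisect_right(q_sorted, x) / len(q_sorted)
        d = max(d, abs(f_p - f_q))
    return d


identical = ks_statistic([1, 2, 3], [1, 2, 3])
overlapping = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
print(identical, overlapping, disjoint)
```

No bin edges appear anywhere in the function, which is the property that makes KS immune to the binning trap.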

And the identity worth remembering — PSI is just KL made symmetric:

$\text{PSI} = D_{KL}(P\,\|\,Q) + D_{KL}(Q\,\|\,P), \qquad D_{KL}(P\,\|\,Q) = \sum_i P_i \ln\frac{P_i}{Q_i}$
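
The identity follows in two lines. Assuming P holds the production bin fractions ($P_i = \text{prod}_i$) and Q the training bin fractions ($Q_i = \text{train}_i$), a labelling the formulas above leave implicit:

$D_{KL}(P\,\|\,Q) + D_{KL}(Q\,\|\,P) = \sum_i P_i \ln\frac{P_i}{Q_i} + \sum_i Q_i \ln\frac{Q_i}{P_i} = \sum_i P_i \ln\frac{P_i}{Q_i} - \sum_i Q_i \ln\frac{P_i}{Q_i}$

$= \sum_i (P_i - Q_i)\,\ln\frac{P_i}{Q_i} = \sum_i (\text{prod}_i - \text{train}_i)\,\ln\frac{\text{prod}_i}{\text{train}_i} = \text{PSI}$

Swapping P and Q flips the sign of both $(P_i - Q_i)$ and $\ln(P_i/Q_i)$ in every term, so each product, and therefore the sum, is unchanged.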

PLAYGROUND

Drag the production distribution away from training and watch PSI climb through the thresholds. This is exactly what a drift monitor fires on.

↑ The bars are the binned distributions. As the production mean shifts, the per-bin difference grows and PSI rises — green (stable) → amber (investigate) → red (act). No labels were needed to compute any of this.

Now the trap. Same drift, different binning — and PSI flips its verdict. A cosmetics brand’s buyers kept the same average age but split into a young and an older segment. Drag the bin count:

Same buyers, same average age (44), but the market split into a younger and an older segment. The slider runs from 2 bins (coarse) to 16 bins (fine). At 2 bins, PSI = 0.000: missed, the drift is hidden in coarse bins.

↑ At 2 coarse bins PSI ≈ 0 — the average didn’t move, so the monitor says “stable” and misses a real market shift. Add bins and the polarisation appears; PSI crosses into the red. The drift never changed — only your binning did.
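
The trap can be reproduced offline with synthetic data. The sketch below is illustrative only: the segment centres (29 and 59), the spreads, the sample sizes, and the `eps` floor are assumptions chosen so the average stays near 44. It also includes the opposite failure from the observations, where many bins on a small sample inflate PSI even though nothing drifted.

```python
import bisect
import math
import random
import statistics


def psi(train, prod, n_bins, eps=1e-4):
    """PSI on quantile bins cut from the training sample, with an epsilon floor."""
    edges = statistics.quantiles(train, n=n_bins, method="inclusive")

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_left(edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    return sum((q - p) * math.log(q / p)
               for p, q in zip(fractions(train), fractions(prod)))


rng = random.Random(7)  # fixed seed so the synthetic data is reproducible

# Training: one cluster of buyers around age 44.
train_ages = [rng.gauss(44, 6) for _ in range(4000)]
# Production: a polarised market, half near 29 and half near 59 (mean still ~44).
prod_ages = ([rng.gauss(29, 4) for _ in range(2000)]
             + [rng.gauss(59, 4) for _ in range(2000)])

coarse = psi(train_ages, prod_ages, n_bins=2)   # one cut at the training median
fine = psi(train_ages, prod_ages, n_bins=10)
print(f"real drift     2 bins: {coarse:.4f}   10 bins: {fine:.4f}")

# Opposite failure: both samples come from the same distribution,
# but 100 bins on 500 points leaves ~5 points per bin, so noise dominates.
small_train = [rng.gauss(44, 6) for _ in range(500)]
small_prod = [rng.gauss(44, 6) for _ in range(500)]
modest = psi(small_train, small_prod, n_bins=5)
sparse = psi(small_train, small_prod, n_bins=100)
print(f"no drift       5 bins: {modest:.4f}   100 bins: {sparse:.4f}")
```

With two bins the single cut sits at the training median, and a symmetric split puts half of production on each side, so the bin fractions barely move. The ten-bin version sees production pile into the outer deciles.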

KS sidesteps binning entirely — it works on the cumulative curves. The statistic is just the biggest vertical gap between them:

↑ Drag the production distribution. The blue bar marks where the two CDFs are furthest apart — that distance D is the whole test. No bins, no binning trap, and it comes with a p-value.

DIAGNOSIS — WHICH TEST FOR WHICH DATA

DATA TABLE n=4

Statistic	Data type	p-value?	Watch out for
PSI	binned	no (rule of thumb)	binning — coarse hides, sparse false-alarms
KS test	continuous	yes	less sensitive in the tails
Chi-square	categorical	yes	low expected counts per cell
★ KL divergence	either	no	asymmetric — PSI symmetrises it
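
For the categorical row, a minimal sketch of the chi-square statistic is below. It treats the training shares as a fixed reference (a goodness-of-fit setup; a two-sample homogeneity test is the alternative) and flags cells with a low expected count, using the common convention of 5 as the cutoff. The function name, the payment-method categories, and all counts are hypothetical. The p-value comes from a chi-square distribution with k - 1 degrees of freedom, for example via `scipy.stats.chisquare`, which is omitted here to stay dependency-free.

```python
def chi_square(train_counts, prod_counts, min_expected=5.0):
    """Pearson chi-square of production counts against training shares.

    Returns (statistic, degrees_of_freedom, low_cells), where low_cells lists
    categories whose expected count falls below min_expected.
    """
    n_train = sum(train_counts.values())
    n_prod = sum(prod_counts.values())
    stat = 0.0
    low_cells = []
    # Categories unseen in training have no expected count here;
    # a real monitor would flag them separately.
    for cat, train_n in train_counts.items():
        expected = n_prod * train_n / n_train
        observed = prod_counts.get(cat, 0)
        if expected < min_expected:
            low_cells.append(cat)
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat, len(train_counts) - 1, low_cells


# Hypothetical payment-method counts.
train_counts = {"card": 700, "wallet": 200, "bank": 100}
stable = {"card": 350, "wallet": 100, "bank": 50}
shifted = {"card": 250, "wallet": 200, "bank": 50}
tiny = {"card": 14, "wallet": 4, "bank": 2}

CRITICAL_DF2_5PCT = 5.991  # chi-square critical value for df = 2 at the 5% level

for name, prod in [("stable", stable), ("shifted", shifted), ("tiny", tiny)]:
    stat, dof, low = chi_square(train_counts, prod)
    print(f"{name}: chi2 = {stat:.2f}, df = {dof}, "
          f"reject = {stat > CRITICAL_DF2_5PCT}, low-count cells = {low}")
```

The `tiny` window shows the watch-out from the table: with only 20 production rows, two of the three cells have expected counts under 5, so the chi-square approximation is unreliable there whatever the statistic says.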

CONCLUSION

✓ ACHIEVEMENT

Hypothesis confirmed.

Drift is measurable label-free, and the statistic follows the data type — KS for continuous, chi-square for categorical, PSI (= symmetric KL) as the industry rule-of-thumb. The real skill is the binning: quantile bins with an epsilon floor, or the metric quietly lies in whichever direction your bins are wrong.

WHAT NEXT

Every statistic here watches the inputs. But a model can have perfectly stable inputs and still rot — and these metrics will never see it. The harder half — monitoring a model whose labels arrive a year late, and the failure mode no distribution monitor can catch — is EXP-008: Monitoring in Production.
