Detecting Drift: PSI, KS, and the Stat That Lies When You Bin It Wrong
Three kinds of drift, four ways to measure it, and the one detail — your binning — that decides whether the alarm fires on noise or stays silent through a real shift.
HYPOTHESIS
First entry in Production ML, the MLOps half of interview prep. A model that ships is only half the job; the other half is noticing when the world it was trained on stops matching the world it now runs in. This post is about detecting that shift. The next one (EXP-008) is about building monitoring around it.
METHOD
Studied in the fraud / credit-risk setting, then a mock: interpret a PSI series, name the binning failure modes, and match each statistic to its data type.
OBSERVATIONS
- 01 Three drifts. Data drift: P(X) moves (the inputs shift, the rule is stable), e.g. a new user demographic. Concept drift: P(y|X) moves (same input, different correct answer). Buying more than 1 kg of gold was legal yesterday and illegal today because a new government rule landed: identical transaction, flipped label. Label drift: the base rate P(y) moves. Only data drift is detectable on day one with no labels.
- 02 PSI thresholds: <0.1 stable, 0.1–0.25 investigate, >0.25 act. But PSI needs binning, and the binning cuts both ways.
- ★ Binning lies in two directions. Bins too coarse hide a shift happening within a bin → missed drift. Picture a cosmetics brand whose buyers used to cluster in their 30s and now split into a younger and an older segment: same average age, so a coarse “young vs old” bin sees nothing, while finer age bins reveal the market polarised. Bins too sparse in the tails make the ln(prod/train) term explode on tiny fluctuations → false alarm, and an empty bin makes PSI undefined. The fix: ~10 quantile bins with an epsilon floor.
- 03 Pick the test by data type. KS for continuous features: the max gap between two CDFs, bin-free, and it gives a real p-value. Chi-square for categorical: observed vs expected counts.
- 04 PSI isn’t its own invention; it’s the symmetric KL divergence: PSI = KL(P‖Q) + KL(Q‖P), the Jeffreys divergence. The symmetry is exactly why the reference direction doesn’t matter.
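The fix named in the starred observation (about 10 quantile bins with an epsilon floor) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `psi`, the default `eps=1e-4`, and the synthetic normal samples are all assumptions made for the example. Bin edges are cut on the training sample only, and production values outside the training range fall into the two end bins.

```python
import numpy as np

def psi(train, prod, n_bins=10, eps=1e-4):
    """PSI with quantile bins cut on the training sample (illustrative sketch)."""
    train = np.asarray(train, dtype=float)
    prod = np.asarray(prod, dtype=float)
    # Interior quantile edges taken from the reference (training) data only
    edges = np.quantile(train, np.linspace(0, 1, n_bins + 1)[1:-1])
    edges = np.unique(edges)  # collapse duplicate edges caused by tied values
    k = len(edges) + 1        # number of bins actually used
    # searchsorted maps each value to a bin index in 0..k-1;
    # values beyond the training range land in the first or last bin
    p = np.bincount(np.searchsorted(edges, train, side="right"), minlength=k) / train.size
    q = np.bincount(np.searchsorted(edges, prod, side="right"), minlength=k) / prod.size
    # Epsilon floor: keeps the log finite when a bin is empty or nearly empty
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum((q - p) * np.log(q / p)))

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 50_000)
same = rng.normal(0.0, 1.0, 50_000)      # no drift
shifted = rng.normal(1.0, 1.0, 50_000)   # mean moved by one standard deviation

psi_same = psi(train, same)
psi_shifted = psi(train, shifted)
```

Under these assumptions `psi_same` should sit well inside the "stable" band and `psi_shifted` well past the "act" threshold. The floored proportions are not renormalised here, which is a simplification; the distortion is on the order of `eps` per bin.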
THE MATH
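Population Stability Index across bins, where p_i and q_i are the training and production shares in bin i:

$$\mathrm{PSI} = \sum_{i=1}^{B} (q_i - p_i)\,\ln\frac{q_i}{p_i}$$

The Kolmogorov–Smirnov statistic, the largest vertical gap between the two cumulative distributions:

$$D = \sup_x \left| F_{\mathrm{train}}(x) - F_{\mathrm{prod}}(x) \right|$$

And the identity worth remembering: PSI is just KL made symmetric. Splitting the PSI sum into its two halves gives

$$\mathrm{PSI} = \sum_i p_i \ln\frac{p_i}{q_i} + \sum_i q_i \ln\frac{q_i}{p_i} = \mathrm{KL}(P\|Q) + \mathrm{KL}(Q\|P)$$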
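The identity PSI = KL(P‖Q) + KL(Q‖P) is easy to confirm numerically. The sketch below uses two made-up five-bin distributions; the helper names `kl` and `psi_from_props` and the proportions themselves are assumptions chosen only to illustrate the point.

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions with strictly positive entries."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(a * np.log(a / b)))

def psi_from_props(p, q):
    """PSI computed directly from per-bin proportions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum((q - p) * np.log(q / p)))

# Illustrative proportions (not real data); each sums to 1
p = np.array([0.10, 0.20, 0.40, 0.20, 0.10])  # training
q = np.array([0.05, 0.15, 0.30, 0.30, 0.20])  # production

jeffreys = kl(p, q) + kl(q, p)   # symmetric sum of the two KL directions
psi_pq = psi_from_props(p, q)    # training as reference
psi_qp = psi_from_props(q, p)    # production as reference
```

Here `psi_pq`, `psi_qp`, and `jeffreys` agree up to floating-point error, while `kl(p, q)` and `kl(q, p)` differ from each other. That is the sense in which the reference direction matters for KL but not for PSI.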
PLAYGROUND
Drag the production distribution away from training and watch PSI climb through the thresholds. This is exactly what a drift monitor fires on.
↑ The bars are the binned distributions. As the production mean shifts, the per-bin difference grows and PSI rises — green (stable) → amber (investigate) → red (act). No labels were needed to compute any of this.
Now the trap. Same drift, different binning — and PSI flips its verdict. A cosmetics brand’s buyers kept the same average age but split into a young and an older segment. Drag the bin count:
↑ At 2 coarse bins PSI ≈ 0: the average didn’t move and each bin still holds about the same share of buyers, so the monitor says “stable” and misses a real market shift. Add bins and the polarisation appears; PSI crosses into the red. The drift never changed; only your binning did.
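The cosmetics example can be reproduced with synthetic ages. Everything specific here is an assumption for illustration: buyers "before" are drawn from a single normal centred at 35, buyers "after" are an even mix of two normals centred at 25 and 45, and the compact `psi` helper uses quantile bins cut on the "before" sample with an epsilon floor.

```python
import numpy as np

def psi(train, prod, n_bins, eps=1e-4):
    """PSI with quantile bins cut on `train` and an epsilon floor (sketch)."""
    edges = np.unique(np.quantile(train, np.linspace(0, 1, n_bins + 1)[1:-1]))
    k = len(edges) + 1
    p = np.bincount(np.searchsorted(edges, train, side="right"), minlength=k) / len(train)
    q = np.bincount(np.searchsorted(edges, prod, side="right"), minlength=k) / len(prod)
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(7)
n = 20_000
# Before: buyers cluster in their 30s
before = rng.normal(35, 4, n)
# After: same average age, but split into a younger and an older segment
after = np.concatenate([rng.normal(25, 3, n // 2), rng.normal(45, 3, n // 2)])

# Same pair of samples, scored with 2, 4, and 10 bins
psi_by_bins = {k: psi(before, after, k) for k in (2, 4, 10)}
mean_gap = abs(before.mean() - after.mean())
```

With two bins the single cut sits near age 35 and half of the "after" buyers still fall on each side, so the score stays near zero. With four or ten bins the middle bins empty out and the score jumps far past 0.25, even though `mean_gap` is close to zero throughout.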
KS sidesteps binning entirely — it works on the cumulative curves. The statistic is just the biggest vertical gap between them:
↑ Drag the production distribution. The blue bar marks where the two CDFs are furthest apart — that distance D is the whole test. No bins, no binning trap, and it comes with a p-value.
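A minimal sketch of the same idea in code, assuming SciPy is available: `scipy.stats.ks_2samp` returns both the statistic D and a p-value, and the hand-rolled `ks_statistic` below (a hypothetical helper written for this example) shows that D really is just the largest gap between the two empirical CDFs. The samples are synthetic.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # the gap can only change at observed points
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, 5_000)
prod_same = rng.normal(0.0, 1.0, 5_000)     # no drift
prod_shifted = rng.normal(0.3, 1.0, 5_000)  # small mean shift

res_same = ks_2samp(train, prod_same)
res_shifted = ks_2samp(train, prod_shifted)
d_manual = ks_statistic(train, prod_shifted)
```

One caveat worth keeping in mind: with very large samples the p-value becomes small even for shifts too minor to matter, so it can help to look at D itself alongside the p-value.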
DIAGNOSIS — WHICH TEST FOR WHICH DATA
| Statistic | Data type | p-value? | Watch out for |
|---|---|---|---|
| PSI | binned | no (rule of thumb) | binning — coarse hides, sparse false-alarms |
| KS test | continuous | yes | less sensitive in the tails |
| Chi-square | categorical | yes | low expected counts per cell |
| ★ KL divergence | either | no | asymmetric — PSI symmetrises it |
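For the categorical row of the table, one possible sketch uses `scipy.stats.chi2_contingency` on a 2 × k table of counts (training row, production row), which tests whether the two rows share the same category mix. The category names and counts below are hypothetical, invented only to show the mechanics, including the check on low expected counts.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical payment-method counts; not real data
categories = ["card", "bank_transfer", "wallet", "cash"]
train_counts = np.array([5200, 2600, 1700, 500])
prod_counts = np.array([4300, 2500, 2700, 500])

# Rows: training vs production; columns: categories
table = np.vstack([train_counts, prod_counts])
chi2, p_value, dof, expected = chi2_contingency(table)

# The "watch out for" column: the chi-square approximation is shaky
# when expected counts per cell are small (a common rule of thumb is < 5)
min_expected = float(expected.min())
drift_flagged = bool(p_value < 0.01 and min_expected >= 5)
```

If a rare category pushes `min_expected` below the rule-of-thumb level, merging rare categories into an "other" bucket before testing is one common workaround.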
CONCLUSION
Drift is measurable label-free, and the statistic follows the data type — KS for continuous, chi-square for categorical, PSI (= symmetric KL) as the industry rule-of-thumb. The real skill is the binning: quantile bins with an epsilon floor, or the metric quietly lies in whichever direction your bins are wrong.
WHAT NEXT
Every statistic here watches the inputs. But a model can have perfectly stable inputs and still rot — and these metrics will never see it. The harder half — monitoring a model whose labels arrive a year late, and the failure mode no distribution monitor can catch — is EXP-008: Monitoring in Production.