EXP #008 · ML Q-03 · Production ML: Monitoring, Serving, Scale · step 2 ✓ Achievement

Monitoring in Production: When the Dashboard Is Green and the Model Is Dying

Your inputs are stable, your predictions are stable, every monitor is green — and the model is quietly getting worse. How that happens, and the four things you watch so you see it before your users do.

2026-06-12 7 MIN READ COMPLETE

HYPOTHESIS

H₀ A production monitoring system must combine label-free signals (available day-one) with delayed-label performance tracking via cohorts — because distribution monitors are structurally blind to concept drift.

METHOD

EXP-007 showed how to detect drift in the inputs. This is the architecture around it: what you actually watch on a model in production, and the one failure those input monitors can never see.

Studied on the credit-default problem — where the label (did they default?) can take up to a year to arrive — then a four-question mock, including the curveball: green dashboard, performance down, explain it.

OBSERVATIONS

01 Four signals you can watch on day one, with no labels: (1) input feature drift [PSI/KS], (2) prediction-distribution drift, (3) data quality / pipeline health [nulls, schema, latency], (4) early proxy signals (30/60/90-day delinquency).
★ Prediction-distribution drift is the most under-used day-one signal. If the model predicted a 5% default rate at launch and now outputs 12%, something moved, and you caught it before a single label exists. Most people monitor inputs and forget to monitor the output distribution at all.
02 Delayed labels force cohort (vintage) analysis: group by origination month, evaluate only matured cohorts, use partial-maturity checkpoints (30/60/90-day) as early proxies, and track calibration per score band. On a credit book those proxies are concrete (EMI closure %, dues mid-month, new-vs-closed EMIs): leading indicators that show up in weeks, long before the 12-month default label. “Default accuracy for loans now 12 months old” is your freshest reliable number.
03 The curveball (green dashboard, performance down) is concept drift driven by an unobserved confounder. My favourite version: a Rapido ride model. On a normal day a ₹59 fare for <5 km books at ~70%. It starts raining: same fare, same distance, same time-of-day, and the booking rate craters to <10%. Every input is unchanged, so the input and prediction monitors stay green. The thing that moved, rain, is a feature you never put in the model, so on the dashboard concept drift looks like nothing at all. (Also possible: label drift, or a feature that kept its distribution but lost its predictive power.)
04 Retrain trigger: matured-cohort performance drop, sustained PSI breach, or calibration break. Guardrail: champion/challenger. Shadow the new model on live traffic and require it to beat the incumbent before promotion. On imbalanced targets, judge with PR-AUC, not ROC-AUC.
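
Signals (1) and (2) both come down to comparing a current sample against a launch baseline. Below is a minimal sketch of PSI in plain Python, assuming quantile bins taken from the baseline and a small floor for empty bins. The function name `psi`, the bin count, and the sample data are illustrative assumptions, and the 0.25 cut-off used in the last line is a commonly quoted rule of thumb, not something this experiment established.

```python
import bisect
import math
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `baseline`."""
    if len(baseline) < n_bins or len(current) == 0:
        raise ValueError("need >= n_bins baseline values and a non-empty current sample")
    ref = sorted(baseline)
    # Interior bin edges from baseline quantiles, so each bin holds
    # roughly 1/n_bins of the baseline sample.
    edges = [ref[len(ref) * i // n_bins] for i in range(1, n_bins)]

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the log term stays finite.
        return [max(c / len(values), eps) for c in counts]

    base, cur = shares(baseline), shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))


launch_scores = [i / 1000 for i in range(1000)]   # made-up baseline sample
same_week = list(launch_scores)                   # identical distribution
shifted_week = [s + 0.5 for s in launch_scores]   # made-up drifted sample

print(psi(launch_scores, same_week))              # 0.0
print(psi(launch_scores, shifted_week) > 0.25)    # True
```

The same function can be pointed at a feature column (input drift) or at the model's scores (prediction drift); only the sample you pass in changes.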

PLAYGROUND — THE GREEN DASHBOARD THAT LIES

The Rapido model, on a live dashboard. Drag the rain up and watch what happens: the two PSI monitors don’t budge, but bookings collapse. This is concept drift — and it’s why a green dashboard is not the same as a healthy model.

[Interactive dashboard: Rapido bookings, with a rain slider running from ☀ no rain to ☔ pouring. At no rain it reads: input drift (fare · distance · time-of-day) PSI 0.02; prediction drift (model output distribution) PSI 0.03; booking success (what actually happens) 70%. Clear skies, everything green, model healthy.]

↑ Rain changed P(y|X) — the relationship between the inputs and the outcome — without touching X or ŷ. Distribution monitors only watch X and ŷ, so they’re structurally blind to it. The only thing that catches this is actual performance, which (on a credit model) arrives months late.
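
The Rapido story can be reproduced with a toy simulation. This is a sketch under stated assumptions, not the playground's actual code: the function `simulate_rides`, the uniform distance distribution, the constant 0.70 model output, and the 0.08 rainy-day booking rate are all made up to match the rough numbers in the text (~70% on a normal day, under 10% in rain).

```python
import random
from typing import List, Tuple


def simulate_rides(raining: bool, n_rides: int = 5000,
                   seed: int = 7) -> Tuple[List[float], List[float], List[bool]]:
    """Return (distances_km, predicted_booking_prob, booked) for one day."""
    rng = random.Random(seed)
    # X: trip distance under 5 km at a fixed ₹59 fare.
    # Rain does not change how X is drawn.
    distances = [rng.uniform(1.0, 5.0) for _ in range(n_rides)]
    # y-hat: the model only sees X, so its output cannot react to rain.
    predictions = [0.70 for _ in distances]
    # P(y|X): the real booking probability depends on the unmeasured rain.
    true_rate = 0.08 if raining else 0.70
    booked = [rng.random() < true_rate for _ in distances]
    return distances, predictions, booked


clear_x, clear_pred, clear_y = simulate_rides(raining=False)
rain_x, rain_pred, rain_y = simulate_rides(raining=True)

print("inputs identical:     ", clear_x == rain_x)
print("predictions identical:", clear_pred == rain_pred)
print("booking rate, clear:  ", round(sum(clear_y) / len(clear_y), 2))
print("booking rate, rain:   ", round(sum(rain_y) / len(rain_y), 2))
```

Any monitor computed from `distances` or `predictions` returns the same value on both days, because those lists are identical. Only the `booked` outcomes differ, and those are exactly what a label-free monitor never sees.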

There’s one label-free signal people forget, though — the model’s own output distribution. If the predicted-positive rate creeps up week over week, something changed, and you didn’t need a single label to see it:

[Interactive chart: the model's score distribution with a time slider from launch to +12 weeks. At the captured position the predicted positive rate reads 13%, within range.]

↑ The score distribution drifts right; the predicted-positive rate climbs from its launch baseline. That’s a day-one alarm — not proof the model is wrong, but proof something moved and is worth investigating now, not in twelve months.
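
A week-over-week check on the predicted-positive rate might look like the sketch below. The helper names, the 0.5 decision threshold, and the 3-point tolerance band are assumptions chosen for illustration; the 5% baseline and 12% end point echo the example in the observations, and the weekly scores are fabricated to hit those rates.

```python
from typing import List, Sequence, Tuple


def predicted_positive_rate(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Share of scores at or above the decision threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def weekly_rate_alarms(weekly_scores: Sequence[Sequence[float]], baseline_rate: float,
                       tolerance: float = 0.03, threshold: float = 0.5
                       ) -> List[Tuple[int, float, bool]]:
    """For each week, return (week_index, rate, breached)."""
    out = []
    for week, scores in enumerate(weekly_scores):
        rate = predicted_positive_rate(scores, threshold)
        # Breach when the rate leaves the band around the launch baseline.
        out.append((week, rate, abs(rate - baseline_rate) > tolerance))
    return out


def fake_week(n_positive: int, n_total: int = 100) -> List[float]:
    """Made-up scores: n_positive above the threshold, the rest below."""
    return [0.9] * n_positive + [0.1] * (n_total - n_positive)


weeks = [fake_week(5), fake_week(6), fake_week(9), fake_week(12)]
for week, rate, breached in weekly_rate_alarms(weeks, baseline_rate=0.05):
    print(f"week {week}: predicted-positive rate {rate:.0%} {'ALERT' if breached else 'ok'}")
```

A fixed tolerance is the simplest possible rule. In practice the band would probably be tuned to the normal week-to-week noise of the score distribution.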

PLAYGROUND — WHY LABELS MAKE YOU WAIT

Each row is a loan cohort by origination month; each bar runs from origination to 12-month maturity. Drag “today” forward and watch which cohorts become evaluable.

[Interactive chart: eight loan cohorts on a timeline from launch to +22 months. Legend: matured (performance measurable), pending (proxy labels only), en route to 12-month maturity. At today = month 9, evaluable cohorts = 0 / 8.]

↑ A cohort’s true performance is only measurable once it fully matures (green). Recent cohorts are still pending — you only have proxy labels. This is why you can’t just compute “accuracy” on live traffic: the freshest reliable signal is always months behind.
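
The maturity rule is simple enough to write down. The sketch below assumes eight cohorts originated in months 0 to 7 (the playground shows eight rows but not their exact start months) and treats a month as 30 days when deciding which proxy checkpoints are visible. All function names are hypothetical.

```python
from typing import List

MATURITY_MONTHS = 12  # from the text: the default label takes up to 12 months


def cohort_status(origination_month: int, today_month: int) -> str:
    """Classify one cohort as 'not originated', 'pending', or 'matured'."""
    age = today_month - origination_month
    if age < 0:
        return "not originated"
    return "matured" if age >= MATURITY_MONTHS else "pending"


def available_proxies(origination_month: int, today_month: int) -> List[int]:
    """Delinquency checkpoints (days) already observable, with 1 month ~ 30 days."""
    age_days = max(today_month - origination_month, 0) * 30
    return [d for d in (30, 60, 90) if age_days >= d]


def evaluable_cohorts(origination_months: List[int], today_month: int) -> List[int]:
    """Cohorts whose true performance can be measured today."""
    return [m for m in origination_months if cohort_status(m, today_month) == "matured"]


cohorts = list(range(8))  # assumption: cohorts originated in months 0..7
for today in (9, 12, 19):
    print(f"today = month {today}: evaluable cohorts = "
          f"{len(evaluable_cohorts(cohorts, today))} / {len(cohorts)}")
print("proxies for the month-7 cohort at month 9:", available_proxies(7, 9))
```

At month 9 nothing is evaluable, which matches the playground's "0 / 8"; the first cohort only becomes measurable at month 12, and all eight at month 19.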

DIAGNOSIS — WHAT EACH SIGNAL SEES (AND MISSES)

DATA TABLE n=4

Signal monitored	Catches	Needs labels?	Blind spot
Input feature drift	data drift P(X)	no	concept drift
★ Prediction drift	shift in model behaviour	no	silent if X stable too
Data quality / pipeline	broken features, schema	no	model logic
Matured-cohort performance	concept + label drift	yes (delayed)	arrives months late
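
The last row is the only one that needs labels. Once a cohort has matured, one of the checks named in the observations is calibration per score band: within each band, does the observed default rate match the mean predicted probability? A minimal sketch follows, assuming scores are probabilities in [0, 1] and using equal-width bands. The function name, the band count, and the toy data are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def calibration_by_band(scores: Sequence[float], labels: Sequence[int],
                        n_bands: int = 5) -> List[Dict[str, float]]:
    """Compare mean predicted probability with observed rate in each score band."""
    bands: List[List[tuple]] = [[] for _ in range(n_bands)]
    for s, y in zip(scores, labels):
        # Equal-width bands on [0, 1]; a score of exactly 1.0 goes in the top band.
        idx = min(int(s * n_bands), n_bands - 1)
        bands[idx].append((s, y))
    report = []
    for i, rows in enumerate(bands):
        if not rows:
            continue  # skip empty bands rather than report a rate on zero loans
        mean_pred = sum(s for s, _ in rows) / len(rows)
        observed = sum(y for _, y in rows) / len(rows)
        report.append({"low": i / n_bands, "high": (i + 1) / n_bands, "n": len(rows),
                       "mean_predicted": mean_pred, "observed_rate": observed,
                       "gap": observed - mean_pred})
    return report


# Made-up matured cohort: ten low-score loans and ten high-score loans.
scores = [0.1] * 10 + [0.9] * 10
labels = [0] * 9 + [1] + [1] * 9 + [0]   # observed defaults match the scores
for row in calibration_by_band(scores, labels):
    print(row)
```

A gap that stays near zero in every band means the scores still read as probabilities. A gap that opens up in one band is the kind of calibration break listed as a retrain trigger.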

CONCLUSION

✓ ACHIEVEMENT

Hypothesis confirmed.

Day-one signals — input drift, prediction drift, data quality, proxies — catch most failures fast, but they are structurally blind to concept drift, which only matured-cohort performance reveals. The honest takeaway: a green dashboard proves the inputs and outputs look normal, not that the model is right. You need both halves — and the patience to wait for cohorts to mature.

WHAT NEXT

Model serving and latency — batch vs online inference, INT8/INT4 quantisation, ONNX and vLLM, and the throughput/accuracy trade-offs that decide what actually ships.
