§ blog · Optimization Loop · 06/17/2026

How production AI models degrade — and how to detect it before it's too late

After deployment, a model keeps receiving inputs and returning predictions, but the world around it changes. Data drift and concept drift are the two mechanisms by which a once-accurate model gradually loses reliability without producing a single log error. This article covers how to detect degradation early, design a three-layer monitoring pipeline, and decide when to retrain versus fix the root cause.

Optimization Loop · AI & ML · Model Monitoring · Data Drift · MLOps · 8 min read
[Figure L-01: Model monitoring loop. Production input (live features) → model (v2.3, INT8) → predictions (distribution of P(y=1)). Feature monitoring (PSI / KL divergence) and output monitoring (distribution shift) feed a retrain trigger; in the example shown, PSI = 0.24 is above the 0.20 threshold. Optimization Loop, layer 06, closed feedback: deploy → monitor → retrain.]

After a model is deployed to production, there's a common pattern: the team shifts focus to new features, and the model keeps running quietly — until someone notices it's performing worse than before. No code error, no exception in the logs. The world changed, but the model didn't.

Two drift mechanisms are the most common in production AI: data drift (when the input distribution shifts from what the model was trained on) and concept drift (when the relationship between inputs and the expected output changes). Neither produces an error in the logs, but both can be detected early if you have the right monitoring loop in place.

Two types of drift — different causes, different fixes

Correctly identifying the type of drift is the first step to responding correctly. Retraining on new data won't fix a concept drift problem without reviewing labels; fixing feature engineering won't resolve an upstream data quality issue.

  • Data drift: the distribution of input features shifts compared to training time. A common example: a surface defect detection model trained on daytime lighting conditions, after a camera moves to a different lighting environment. The model still accepts images and returns predictions — they're just increasingly wrong because the pixel intensity histogram has drifted from the training distribution
  • Concept drift: the relationship between input and the correct output changes, even if the input distribution stays stable. Example: a churn prediction model trained before a major product redesign — the same behavioral patterns no longer correlate with churn in the same way. This is harder to fix: you can't just retrain on new data without reviewing whether old labels are still valid
  • Upstream data quality drift: a third case worth separating from the two above. A source pipeline changes silently: a column gets a different unit, nulls are handled differently, or a third-party source starts returning lower-quality results. Statistically this looks like data drift, but the root cause is data engineering, not the real world changing

All three types produce no exception or system error. The system keeps running, the model keeps returning results — they just become progressively less trustworthy.
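Because upstream quality problems look like data drift in the statistics, it can help to run a few cheap pipeline checks before treating a shift as real-world change. The sketch below is one possible version, not a standard recipe: the function name `quality_flags`, the flag names, and both thresholds are assumptions chosen for illustration.

```python
from statistics import median
from typing import Optional, Sequence


def quality_flags(baseline: Sequence[Optional[float]],
                  current: Sequence[Optional[float]],
                  max_null_jump: float = 0.05,
                  max_scale_ratio: float = 5.0) -> list[str]:
    """Cheap checks that hint at a pipeline change rather than real drift.

    Thresholds are illustrative assumptions. Both inputs must contain at
    least one non-null value. None stands for a missing value.
    """
    flags: list[str] = []

    def null_rate(values: Sequence[Optional[float]]) -> float:
        return sum(v is None for v in values) / len(values)

    # A sudden rise in missing values suggests nulls are handled differently.
    if null_rate(current) - null_rate(baseline) > max_null_jump:
        flags.append("null_rate_jump")

    # A large, uniform change in scale suggests a unit change (m vs. cm).
    base_med = median(v for v in baseline if v is not None)
    cur_med = median(v for v in current if v is not None)
    if base_med and cur_med:
        ratio = abs(cur_med / base_med)
        if ratio > max_scale_ratio or ratio < 1 / max_scale_ratio:
            flags.append("possible_unit_change")

    return flags
```

A column that switches from meters to centimeters would raise `possible_unit_change` here, while the same column with a burst of missing values would raise `null_rate_jump`. Either flag is a reason to check with the data engineering side before scheduling a retrain.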

Measuring drift — from statistical tests to proxy metrics

Without drift monitoring, you typically have to wait for ground truth to accumulate and the gap to become large enough to notice. For systems with slow feedback (forecasting, image classification), that may be weeks. A few tools for earlier detection:

  • PSI (Population Stability Index): measures change in the distribution of a continuous feature. PSI < 0.1 is generally stable, 0.1–0.2 warrants monitoring, > 0.2 is significant. Can be computed in parallel for all input features in a batch monitoring job
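  • Wasserstein distance and KL divergence: measure the "distance" between the current distribution and the training baseline. Wasserstein distance accounts for how far probability mass has moved, not only which bins it left, which makes it better suited than PSI to multidimensional or long-tail distributions. KL divergence is closely related to PSI: PSI is the symmetrized KL divergence computed over the same bins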
  • Output distribution shift: monitor the distribution of the model's own output (predicted probabilities, class ratios). If the predicted positive rate jumps from 5% to 20% with no obvious business reason, that's an early signal — even before any ground truth arrives
  • Delayed ground truth: for many problems (fraud detection, churn), ground truth arrives days to weeks later. When it does, compute accuracy/F1 over a recent rolling window directly from predictions vs. actuals. The most reliable signal — but the latest to arrive
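To make the PSI thresholds above concrete, here is a minimal sketch of the calculation using only the standard library. It assumes equal-width bins taken from the baseline range and a small floor `eps` for empty bins; both are common choices but not the only ones, and the function name `psi` is illustrative.

```python
import math
from typing import Sequence


def psi(baseline: Sequence[float], current: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `baseline`.

    Bin edges are equal-width over the baseline range (an assumption;
    quantile bins are another common choice). Values outside that range
    are counted in the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from causing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    expected = bin_shares(baseline)
    actual = bin_shares(current)
    # PSI = sum over bins of (actual - expected) * ln(actual / expected)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


baseline = [i / 100 for i in range(1000)]  # training-time feature values
shifted = [v + 3.0 for v in baseline]      # same feature after a shift

print(psi(baseline, baseline))        # 0.0 for identical distributions
print(psi(baseline, shifted) > 0.2)   # True: past the 0.2 threshold
```

The result depends on the binning, so the 0.1 and 0.2 thresholds are best read as rules of thumb for a fixed bin scheme rather than absolute limits.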

Three monitoring layers — from features to business metrics

No single metric is reliable enough to decide when to retrain. Effective monitoring needs three coordinated layers — enabling early detection in the upper layers and confirmation in the lower ones:

  • Layer 1 — Feature monitoring: log the distribution of each input feature by batch (hourly or daily), compared against the training baseline. The earliest warning layer — drift in features typically surfaces before it affects outputs. Common tools: Evidently AI, WhyLogs, or a custom pipeline on Great Expectations
  • Layer 2 — Output monitoring: track the distribution of model outputs (predicted probabilities, confidence scores, predicted class ratios). Output drift typically lags feature drift but is more direct — if output distribution shifts sharply without corresponding feature drift, it often signals concept drift or a preprocessing issue
  • Layer 3 — Business metric monitoring: conversion rate, actual precision once ground truth arrives, false positive rates from downstream systems. The latest signal but the most important — if business metrics are holding, drift in layers 1–2 may not yet justify a retrain

The goal of three layers isn't to generate more alerts — it's to detect degradation 2–4 weeks earlier than a business metric would, giving enough time to collect new data, retrain, and validate before end users are affected.
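Layers 2 and 3 can be sketched in a few lines each. The example below is an illustration under stated assumptions, not a reference implementation: `positive_rate` tracks the share of positive predictions for output monitoring, and `RollingF1` recomputes F1 over a fixed-size window as delayed labels arrive. The names, the 0.5 decision threshold, and the window size are all hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Tuple


def positive_rate(probs: Iterable[float], threshold: float = 0.5) -> float:
    """Share of predictions at or above the decision threshold (layer 2)."""
    values = list(probs)
    return sum(p >= threshold for p in values) / len(values)


class RollingF1:
    """F1 over the most recent `window` (predicted, actual) pairs (layer 3)."""

    def __init__(self, window: int = 1000) -> None:
        # deque with maxlen drops the oldest pair once the window is full
        self.pairs: Deque[Tuple[int, int]] = deque(maxlen=window)

    def add(self, predicted: int, actual: int) -> None:
        # Call this when the delayed label for a past prediction arrives.
        self.pairs.append((predicted, actual))

    def value(self) -> float:
        tp = sum(1 for p, a in self.pairs if p == 1 and a == 1)
        fp = sum(1 for p, a in self.pairs if p == 1 and a == 0)
        fn = sum(1 for p, a in self.pairs if p == 0 and a == 1)
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0
```

Comparing `positive_rate` for the current batch against the value recorded at training time gives the early output signal. `RollingF1.value()` gives the later confirmation once enough labels have arrived.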

When to retrain, when to rethink the architecture

Not every drift triggers a retrain. The decision depends on drift type, magnitude, and rate of degradation:

  • PSI > 0.1 on 2–3 of the most important features, business metrics not yet significantly affected → scheduled retrain with fresh data, no urgency
  • Output drift and feature drift both rising → urgent retrain, plus an audit of the data pipeline for upstream quality issues
  • Confirmed concept drift (ground truth arrives, accuracy drops, but feature distribution is relatively stable) → retrain with label review — old training labels may no longer be valid
  • Drift recurs after multiple retrains on new data → the distribution has shifted enough to warrant a feature engineering review or a rethink of the model architecture — no longer a retrain problem, but an architecture one
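The rules above can be written down as a small decision function, which makes the order of precedence explicit. This is a sketch only: the `DriftSignals` fields, the action strings, and the choice of two failed retrains as the meaning of "multiple" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DriftSignals:
    top_feature_psi: list[float]  # PSI of the most important features
    output_drift: bool            # layer 2 alert is firing
    metric_drop: bool             # layer 3: accuracy or business metric is down
    failed_retrains: int          # recent retrains that did not stop the drift


def recommend_action(s: DriftSignals, psi_watch: float = 0.1,
                     min_features: int = 2, max_retrains: int = 2) -> str:
    drifted = sum(psi > psi_watch for psi in s.top_feature_psi)
    feature_drift = drifted >= min_features
    any_signal = feature_drift or s.output_drift or s.metric_drop

    # Drift keeps coming back after retrains: no longer a retrain problem.
    if any_signal and s.failed_retrains >= max_retrains:
        return "review features / architecture"
    # Accuracy drops while inputs look stable: likely concept drift.
    if s.metric_drop and not feature_drift:
        return "retrain with label review"
    # Feature drift confirmed by a second layer. Treating a metric drop
    # like output drift here is an added assumption, not from the rules.
    if feature_drift and (s.output_drift or s.metric_drop):
        return "urgent retrain + pipeline audit"
    # Feature drift alone, business metrics still holding.
    if feature_drift:
        return "scheduled retrain"
    return "keep monitoring"
```

Putting the recurrence check first reflects the last rule in the list: once retrains have stopped helping, another retrain is not the answer regardless of which layer is alerting.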

Real example: surface defect detection in steel manufacturing

In the structural monitoring and defect classification project across 4 steel mills, a computer vision model was deployed to detect surface defects on steel coil from industrial cameras. Three weeks after deployment on lines 3 and 4, output confidence scores began drifting lower — the model still made predictions, but the confidence distribution shifted toward lower values compared to baseline.

Layer 1 (feature monitoring) caught it: the pixel intensity histogram of images from the cameras on lines 3 and 4 had drifted significantly from the training data because of seasonal lighting variation (the warehouse doors were being opened differently). The fix wasn't a full retrain. Instead, the training set was augmented with images from all four lighting conditions and a preprocessing normalization step was added. Without a monitoring loop, this drift could have gone undetected until recall on actual defects had dropped enough to disrupt the QC process.

Conclusion

AI models are not "set and forget" — this is the most fundamental difference between traditional software and AI in production. The Optimization Loop, the sixth layer in the KonexForge stack, is where monitoring, alerting, and retrain decisions are standardized into a repeatable process — not an ad-hoc task done only after it's already too late. Every Pilot Build is designed with a monitoring baseline from day one of deployment, not bolted on as an afterthought.
