
The Real Reason Drift Detection Fails in Production: Why MLOps Needs "Audit" Over "Validation" (with a Reproducible Demo)

“The model had 99% accuracy in the PoC. It performed perfectly on test data. …So why does the operations team dismiss its alerts as ‘unreliable’ once it’s in production?”

If you’ve worked in MLOps, you’ve likely hit this wall. The hard truth is that models fail in production not because of a lack of “accuracy,” but because of a lack of “objective grounds” for their decisions.

In this post, I will introduce ADIC (Audit of Drift in Context), an audit protocol designed to solve this fundamental trust issue.




1. Why Drift Detection is Doubted Post-Hoc

No matter how advanced your detection logic is, the following suspicions inevitably arise during the operations phase:

  • Moving the Goalposts: Are you just loosening the thresholds because there were too many “noisy” alerts?

  • Retrospective Adjustments: Are you twisting evaluation criteria after seeing the results, claiming “the data was just special at that time”?

  • Irreproducible Judgments: Can a third party prove that a “detection” made months ago was actually correct based on the standards of that time?

Case Study: When demand forecasting collapsed during COVID-19, the problem wasn’t whether the “change” could be detected. The problem was whether we could prove post-hoc that the decisions made at that time followed a pre-defined protocol. Conventional tools can show that “something changed,” but they cannot reproduce the “validity of the judgment at that time,” leading to a loss of trust and a forced return to manual operations.

2. The Problem Isn’t “Detection,” It’s “Fixing Responsibility”

In a production environment, what matters isn’t the sophistication of your detection algorithm. It is: “Can you explain the judgment at that time to stakeholders or third parties?”

If this remains ambiguous, any detection alert is just “black box magic” to those in the field. Before chasing “accuracy,” we must fix the “record of responsible judgment.”


3. The Solution: Turning Detection into an “Audit Protocol”

Instead of leaving detection logic as a black box, ADIC upgrades it into a transparent Audit Protocol.

  1. Fixing the Logic: Pre-determine the evaluation conditions, the data, and the splitting methods up front (a minimal sketch follows this list).

  2. Immutable Records: Save the fixed conditions and execution results in a format that cannot be altered later.

  3. Third-Party Verification: Ensure that anyone — not just the original engineer — can reproduce whether the judgment followed the protocol.
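
To make steps 1 and 2 concrete, here is a minimal Python sketch of how evaluation conditions could be frozen into a hash-pinned certificate before any run. Every name here (make_certificate, sha256_file, the tau_cap field) is illustrative and assumed for this post, not ADIC’s actual API.

```python
import hashlib
import json

def sha256_file(path: str) -> str:
    """Hash the evaluation data so the certificate pins an exact snapshot."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def make_certificate(data_path: str) -> dict:
    """Freeze conditions, data, and splitting method BEFORE any evaluation runs."""
    cert = {
        "metric": "mape",        # evaluation condition, fixed up front (illustrative)
        "tau_cap": 0.15,         # drift tolerance, fixed up front (illustrative)
        "split": {"method": "time", "train_end": "2024-01-31"},
        "data_sha256": sha256_file(data_path),  # pins the exact data snapshot
    }
    # Hash the certificate itself so any later edit is detectable.
    payload = json.dumps(cert, sort_keys=True).encode()
    cert["cert_sha256"] = hashlib.sha256(payload).hexdigest()
    return cert
```

Because the certificate is hashed before the first evaluation runs, loosening a threshold afterwards necessarily changes cert_sha256 and is immediately visible.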


4. Structure: Certificate → Ledger → Verifier

ADIC is an audit protocol designed to fix the basis of judgment in a form that can be reproduced post-hoc.

  • Certificate: Fixes the evaluation conditions, data splitting methods, and hashes. This is your “contract.”

  • Ledger: Records execution results in an append-only format. It prevents post-hoc threshold manipulation.

  • Verifier: Allows a third party to perfectly reproduce the OK/NG verdict through hash verification.

This structure provides objective assurance that the alert was issued under a pre-agreed protocol. A minimal sketch of the Ledger and Verifier follows.
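
Here is a hedged sketch of how the Ledger and Verifier could work together: an append-only, hash-chained log plus a replayable check. The record layout and function names are assumptions for illustration, not ADIC’s actual schema.

```python
import hashlib
import json

def append_record(ledger: list, record: dict) -> dict:
    """Append-only: each entry commits to the previous one via a hash chain."""
    prev_hash = ledger[-1]["entry_sha256"] if ledger else "0" * 64
    entry = {"record": record, "prev_sha256": prev_hash}
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["entry_sha256"] = hashlib.sha256(payload).hexdigest()
    ledger.append(entry)
    return entry

def verify_ledger(ledger: list) -> bool:
    """Third-party check: recompute every hash; any post-hoc edit breaks the chain."""
    prev_hash = "0" * 64
    for entry in ledger:
        if entry["prev_sha256"] != prev_hash:
            return False
        payload = json.dumps(
            {"record": entry["record"], "prev_sha256": entry["prev_sha256"]},
            sort_keys=True,
        ).encode()
        if hashlib.sha256(payload).hexdigest() != entry["entry_sha256"]:
            return False
        prev_hash = entry["entry_sha256"]
    return True
```

The design choice that matters is the chain: editing a threshold in an old entry changes its hash, which breaks every entry after it, so a Verifier run by anyone detects the tampering.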


5. Case Study: Electricity Demand × Weather (Jan–Apr 2024)

Let’s look at a time-series audit demo using ADIC, focusing on “Early 2024 Electricity Demand Forecasting.”

In this audit, the result was Verdict: NG (TAU_CAP_HIT). This doesn’t just mean “the prediction was wrong.” It means that we have mathematically and procedurally proven that the prerequisites collapsed beyond the pre-defined tolerance (TAU_CAP).

Instead of a vague response like “accuracy is down, let’s tweak it,” you can take responsible action: “The audit proved the prerequisites collapsed; therefore, the model must be retrained.”
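
To make the verdict concrete, here is a minimal sketch of how an NG (TAU_CAP_HIT) verdict could be derived from the pre-fixed tolerance. The drift statistic and the tau_cap value below are placeholders, not the numbers from the actual demo.

```python
def audit_window(drift_stat: float, tau_cap: float) -> dict:
    """Compare the observed drift statistic against the pre-fixed tolerance TAU_CAP."""
    if drift_stat > tau_cap:
        # The prerequisite collapsed beyond tolerance: a reproducible NG verdict.
        return {"verdict": "NG", "reason": "TAU_CAP_HIT",
                "drift_stat": drift_stat, "tau_cap": tau_cap}
    return {"verdict": "OK", "drift_stat": drift_stat, "tau_cap": tau_cap}

# Example: a weather-driven demand shift pushes the statistic past tolerance.
print(audit_window(drift_stat=0.23, tau_cap=0.15))
# {'verdict': 'NG', 'reason': 'TAU_CAP_HIT', 'drift_stat': 0.23, 'tau_cap': 0.15}
```

Because tau_cap was fixed in the Certificate and the result is written to the Ledger, this verdict can be replayed byte-for-byte months later.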


6. This Is Not a “Magic Bullet”

To be clear, ADIC is not a magic crystal ball that predicts the future. Nor does it guarantee zero error.

However, it is the only means to provide clear evidence for the question: “Was this judgment valid given the situation at that time?”


Summary: Entering the Era of Audit Protocols

As AI/ML becomes part of our social infrastructure, models are pulled from production not because of “low accuracy” but because of a “lack of accountability.”

Drift detection is evolving from a simple monitoring tool into an “Audit Protocol” that can be explained to third parties.

This is not a “detection tool” — it is a protocol for fixing the responsibility of judgment.


