
Why Drift Detection Fails in the Field | The Root Cause is Not Data, but Evaluation Criteria

1. Introduction: The Real Reason Drift Detection Fails | AI Evaluation is Misaligned, Not the Data

1.1 Background: The Ubiquity and Vulnerability of Machine Learning Systems in Production

Within the context of modern industrial structures, Machine Learning (ML) systems have evolved beyond experimental technologies to become core components of social infrastructure. According to recent statistics as of 2024, approximately 83% of organizations worldwide utilize ML in some capacity, with applications spanning from fraud detection in financial transactions to medical diagnostics, autonomous driving, and content generation via Generative AI. These systems possess the remarkable ability to learn from historical data and solve complex problems without explicit programming. However, to maintain their value over the long term, they must operate continuously within dynamic and uncertain “Production Environments.”

Nevertheless, production environments possess a character fundamentally different from controlled laboratory settings: “constant change.” As observed by Ovadia et al. (2019) and Klein (2021), ML systems are exceptionally sensitive to shifts in input data distribution and novel situations not represented in the training set (Out-of-Distribution: OOD). Furthermore, due to the non-deterministic nature of these systems, it is theoretically impossible to comprehensively test all potential scenarios prior to deployment. Consequently, operational systems are perpetually exposed to the risk of “Silent Failure.” This refers to a state where the system continues to function without triggering explicit errors, yet the quality and validity of its output are secretly deteriorating.

1.2 The Current State of Monitoring Technology and the Implicit Assumption of a “Fixed Evaluation Protocol”

To mitigate these challenges, advanced monitoring and drift detection technologies have been developed, exemplified by the comprehensive survey by Hinder et al. (2024) and the adaptive retraining framework by Dong et al. (2024). These research efforts have made significant contributions to preventing model obsolescence by detecting changes in data distribution (Data Drift) or shifts in the relationship between input and output (Concept Drift).

However, even within these cutting-edge studies, one critical assumption remains largely unquestioned: the assumption that “the ‘ruler’ used to evaluate the model — namely, the Metrics, Test Sets, and the definition of Ground Truth — remains invariant over time and is always valid.” While Hinder et al. (2024) measure the statistical distance of distributions and Dong et al. (2024) define accuracy degradation on a test set as “Harmful Drift,” neither study deeply explores the possibility that “the measurement criteria themselves may diverge from reality.”

1.3 Purpose and Structure of This Study: Proposing Evaluation Drift

This report argues that this “fixity of the evaluation protocol” is the primary risk factor in modern AI operations, particularly regarding generative AI and adaptive systems. We define the phenomenon in which the validity of evaluation criteria themselves is lost due to environmental changes or shifting social requirements as “Evaluation Drift” and propose it as a new conceptual framework.

The structure of this paper is as follows. Chapter 2 critically examines major prior studies such as Hinder et al. (2024), Dong et al. (2024), and Polo et al. (2023) to clarify their structural limitations. Chapter 3 systematizes the concept of Evaluation Drift and details its four primary components: “Metric Definition Drift,” “Ghost Drift,” “Update Opacity,” and “Loss of Adaptive Validity.” Chapter 4 demonstrates how conventional monitoring methods fail to detect Evaluation Drift based on these theories. Finally, Chapter 5 proposes an architecture (Dynamic Evaluation Store, Immutable Audit Logs, and Differential Auditing) to realize “Auditable AI Safety” predicated on Evaluation Drift, indicating a new direction for AI governance.

1.4 What is the Difference Between Drift Detection and GhostDrift Detection?

Conventional drift detection focuses on “whether the data (distribution) has changed.” GhostDrift detection, by contrast, asks “whether the meaning (interpretive alignment) has shifted, even though the evaluation protocol remains constant.” The central claim of this article is that drift detection fails in practice primarily because of a misalignment of the “evaluation (the ruler)” rather than of the “data.”

1.4.1 What Conventional Drift Detection Monitors

  • Detection Targets:

  • Changes in input distribution P(X) (Data Drift)

  • Changes in the relationship between input and output P(Y|X) (Concept Drift)

  • Degradation of performance metrics (which often requires delayed labels)

  • Typical Comparisons:

  • Reference distribution (past) vs. Current distribution

  • Reference performance (past) vs. Current performance

  • Implicit Assumption:

  • The evaluation protocol (including metrics, threshold policies, data splits, ground truth definitions, and aggregation granularity) is fixed and unchanging.

1.4.2 What is GhostDrift? (Minimum Definition)

GhostDrift is a phenomenon where, despite no model weight updates (or independent of such updates), the interpretive alignment of responses undergoes non-linear transformation due to the accumulation of context or changes in internal attention structures. This causes the consistency of dialogue or judgment to collapse. Even if fixed Q&A benchmarks are successfully passed, the alignment of the semantic space may be distorted, potentially rendering past evaluation results operationally invalid.
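One crude way to operationalize this definition, with model weights and the evaluation protocol both held fixed, is to replay a set of fixed probes and measure how far today’s responses have moved from a recorded baseline. The minimal sketch below assumes a simple probe-id-to-answer mapping and uses surface-level string similarity from the standard library purely as a placeholder for a proper semantic-alignment measure; the probe format and threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

def probe_consistency(baseline_answers: dict[str, str],
                      current_answers: dict[str, str],
                      threshold: float = 0.8) -> list[str]:
    """Flag fixed probes whose current answer has diverged from the baseline
    recorded under the identical evaluation protocol. String similarity is a
    crude stand-in for a semantic-alignment measure (e.g. embedding distance)."""
    flagged = []
    for probe_id, baseline in baseline_answers.items():
        current = current_answers.get(probe_id, "")
        if SequenceMatcher(None, baseline, current).ratio() < threshold:
            flagged.append(probe_id)
    return flagged
```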

1.4.3 Distinguishing the Three Types of “Drift Detection”

In this section, we decompose and distinguish drift detection into the following three categories:

1: Data Drift Detection

  • What changes: Data distribution (statistical properties of inputs)

  • What to compare: Reference distribution vs. Current distribution

  • Common Failure: Even when evaluation criteria or operational rules have changed, the system appears “normal” simply because the data remains consistent.

2: Evaluation Drift Detection (The primary focus of this article)

  • What changes: The Evaluation Protocol (The Ruler)

  • Metrics

  • Threshold policies

  • Data splits (boundaries of train/calibration/test)

  • Ground truth definitions (labeling conventions)

  • Aggregation granularity (the unit at which a pass/fail judgment is made)

  • Execution code and execution environment (ensuring the “sameness” of evaluation)

  • What to compare: Old protocol vs. New protocol (Differential analysis)

  • Common Failure: If changes are introduced silently, past results become incomparable, allowing for an infinite loop of post-hoc justifications.

3: GhostDrift Detection (Semantic drift occurring even with fixed evaluation)

  • What changes: Meaning and interpretive alignment (internal structure)

  • What to compare: Behavioral differences against fixed probes under identical protocols and conditions.

  • Common Failure: Even if the model performs normally on existing benchmarks, the consistency of dialogue and the logic of judgment reasoning begin to collapse.

1.4.4 Minimum Requirements to Classify GhostDrift Detection as “Detection”

GhostDrift is not merely a subjective “atmosphere”; it can be treated as a formal detection process when the following minimum requirements are satisfied:

  1. Guaranteeing Identity of Conditions: The evaluation protocol (metrics, thresholds, splits, ground truth definitions, aggregation, code, and environment) must be uniquely identifiable and reproducible under identical conditions.

  2. Possession of Fixed Probes: In addition to existing accuracy benchmarks, a set of probes specifically designed to target semantic alignment (addressing contradiction, logic, self-consistency, and dependencies) must be prepared.

  3. Detection via Differentials: Rather than relying solely on aggregate scores, log the specific differences in response structure, reasoning, and reference relationships, and base judgments on these differentials.

  4. Separation of “Protocol Differentials” and “Behavioral Differentials” (a minimal sketch of this decision logic follows the list):

  • If a protocol differential exists: classify as Evaluation Drift.

  • If no protocol differential exists but a behavioral differential does: classify as Suspected GhostDrift.
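As referenced in requirement 4, the following is a minimal sketch of that decision logic. The `protocol_fingerprint` and `classify_drift` helpers are hypothetical names introduced only for illustration; the protocol dictionary is assumed to capture metrics, thresholds, splits, ground-truth rules, and code/environment versions, and the probe logs are assumed to be structured records of responses under identical conditions.

```python
import hashlib
import json

def protocol_fingerprint(protocol: dict) -> str:
    """Hash the full evaluation protocol (metrics, thresholds, splits,
    ground-truth rules, code/environment versions) so its 'sameness' is checkable."""
    canonical = json.dumps(protocol, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def classify_drift(old_protocol: dict, new_protocol: dict,
                   old_probe_log: dict, new_probe_log: dict) -> str:
    """Separate protocol differentials from behavioral differentials (requirement 4)."""
    if protocol_fingerprint(old_protocol) != protocol_fingerprint(new_protocol):
        return "evaluation_drift"          # the ruler itself changed
    if old_probe_log != new_probe_log:     # same ruler, different behavior on fixed probes
        return "suspected_ghostdrift"
    return "no_drift_detected"
```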

1.4.5 Connection to the Claim of This Article

The reason many drift detection mechanisms fail in field operations is not solely due to an exclusive focus on data drift. Because the evaluation protocol (the ruler) often shifts after the fact, conclusions become incomparable, and explanations become infinitely malleable. Furthermore, if the alignment of meaning alters (GhostDrift) even when evaluation is fixed, success on static benchmarks no longer guarantees safety. Therefore, this paper proposes extending the scope of drift detection from “data” to “evaluation protocols” while including the drift of meaning (GhostDrift) as a critical detection target.




2. Technical Achievements and Limitations in Modern ML Monitoring: A Critical Review

Approaches to maintaining the reliability of AI systems have evolved from simple rule-based monitoring to statistical distribution monitoring, and finally to adaptive control based on model performance. Here, we overview the major research representing the current state-of-the-art and highlight their limitations through the lens of the “fixity of evaluation.”

2.1 Distribution Process Monitoring in Hinder et al. (2024) and Its Limitations

Hinder et al. (2024) conducted an extensive survey of ML model monitoring and systematized the interdependencies between data and time. Their significant contribution lies in reconceptualizing the monitoring target not merely as a Time Series but as a “Distribution Process.”

2.1.1 Methodology: Moment Trees and Kernel Methods

Hinder et al. argue that when interest lies in the collective distribution (e.g., shifts in public opinion) rather than individual data points (e.g., individual voting behavior), the system should be modeled as a distribution process. Specifically, they propose a method to construct kernels specific to a dataset using machine learning models, augmenting conventional fixed kernels like the Gaussian Kernel.

Particularly noteworthy is their utilization of “Moment Trees (Hinder et al., 2021c),” which are random forests with modified loss functions for conditional density estimation. This model is trained to predict the observation time T from the observed data X, and the resulting kernel demonstrated dramatic performance improvements in drift detection.

2.1.2 Critical Examination: The Absolutization of the Reference Distribution

While the approach developed by Hinder et al. is statistically sophisticated, it is constrained by a fundamental philosophical limitation: the “fixation of the Reference Distribution.” Typically, the data distribution during training or validation is utilized as the reference distribution. Their method tests whether the current distribution $P_t(X)$ has deviated significantly from the reference distribution $P_{ref}(X)$ ($MMD(P_t, P_{ref}) > \epsilon$).
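As a minimal illustration of the $MMD(P_t, P_{ref}) > \epsilon$ test, the sketch below computes a biased MMD estimate with a fixed Gaussian kernel. The bandwidth and threshold are placeholder values; in Hinder et al.’s approach a learned, dataset-specific kernel (e.g., one derived from Moment Trees) would replace the fixed kernel, and the threshold would typically be calibrated by a permutation test rather than hard-coded.

```python
import numpy as np

def gaussian_kernel(a: np.ndarray, b: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    """Fixed Gaussian kernel; Hinder et al. instead learn a data-specific kernel."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(x_ref: np.ndarray, x_cur: np.ndarray, bandwidth: float = 1.0) -> float:
    """Biased estimate of MMD^2 between the reference and current samples."""
    k_rr = gaussian_kernel(x_ref, x_ref, bandwidth).mean()
    k_cc = gaussian_kernel(x_cur, x_cur, bandwidth).mean()
    k_rc = gaussian_kernel(x_ref, x_cur, bandwidth).mean()
    return k_rr + k_cc - 2.0 * k_rc

def drift_flag(x_ref: np.ndarray, x_cur: np.ndarray, epsilon: float = 0.05) -> bool:
    """Flag drift when the estimated MMD^2(P_t, P_ref) exceeds the threshold epsilon."""
    return mmd_squared(x_ref, x_cur) > epsilon
```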

This approach embeds a powerful normative judgment that “Reference Distribution = Normal.” However, in the real world, the definition of “normal” is itself dynamic. For example, in a fashion trend prediction model, if the historical “normal” color distribution is used as the absolute criterion, every emerging trend will be flagged as “Abnormal (Drift).” Because the framework of Hinder et al. does not incorporate updates to the evaluation criteria, that is, the possibility that “the changed distribution may be the new normal,” it fails to track Evaluation Drift and risks either a deluge of false positives or, conversely, missing phenomena where the distribution remains constant but the meaning has changed (the Ghost Drift described later).

2.2 Merits and Demerits of the “DDLA” Approach in Dong et al. (2024)

The study by Dong et al. (2024), “Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines,” addresses the monitoring problem from the perspective of practical cost-efficiency. They focused on the observation that not all data drifts degrade model prediction accuracy and proposed a method to avoid redundant retraining.

2.2.1 Methodology: DDLA and Identifying Harmful Drift

The core idea proposed by Dong et al. is to identify regions within the data distribution where model prediction accuracy is low (Data Distributions with Low Accuracy: DDLA). They partition the input space using decision trees and define leaf nodes (regions) with significant errors as DDLA. Consequently, they treat drift as “Harmful Drift” only when it occurs within these DDLA regions, thereby triggering retraining.
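The sketch below approximates this DDLA idea in scikit-learn terms: partition the validation data with a shallow decision tree, mark leaves whose accuracy falls below a threshold as low-accuracy regions, and flag incoming drift as harmful only when new data lands disproportionately in those leaves. The tree depth, thresholds, and exact partitioning objective are illustrative assumptions, not the authors’ settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def find_ddla_leaves(X_val, y_val, y_pred, max_depth=4, acc_threshold=0.7):
    """Partition the input space and return leaf ids whose accuracy is low (DDLA-like)."""
    correct = (y_val == y_pred).astype(int)
    # The tree learns to separate correctly- from incorrectly-predicted regions;
    # Dong et al.'s exact construction differs, this is a simplified stand-in.
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_val, correct)
    leaves = tree.apply(X_val)
    ddla = set()
    for leaf in np.unique(leaves):
        mask = leaves == leaf
        if correct[mask].mean() < acc_threshold:
            ddla.add(int(leaf))
    return tree, ddla

def is_harmful_drift(tree, ddla, X_new, share_threshold=0.3):
    """Flag drift as harmful only if new data falls heavily into DDLA regions."""
    new_leaves = tree.apply(X_new)
    share_in_ddla = np.isin(new_leaves, list(ddla)).mean()
    return share_in_ddla > share_threshold
```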

In experiments using real-world data, this method was reported to successfully maintain model accuracy above 89% while significantly reducing retraining costs compared to baseline models.

2.2.2 Critical Examination: The “Sanctification” of Accuracy and Circular Reasoning

While the claim of Dong et al. appears reasonable at first glance, it falls into a dangerous circular reasoning when viewed through the lens of Evaluation Drift. Their method is dependent upon “model accuracy as measured by an existing test set.” In essence, they assert that “if accuracy on the current test set is maintained, the drift is harmless.”

But what if the test set itself no longer reflects reality? For example, in a search ranking algorithm, suppose the quality of information sought by users shifts from “accuracy” to “comprehensiveness.” The model is trained to maximize “accuracy” and continues to produce high scores on the traditional accuracy-based test set. The method of Dong et al. would judge the data changes occurring in this situation as “harmless” (because accuracy hasn’t dropped). However, the business value — user satisfaction — is actually in decline.

The approach of Dong et al. adopts the stance of “trusting the measurement results as long as the ruler (test set) is presumed correct,” leaving the system defenseless against Evaluation Drift where the ruler itself becomes distorted.

2.3 Polo et al. (2023) “DetectShift” and the Wall of Hypothesis Testing

“DetectShift,” proposed by Polo et al. (2023), is a unified framework capable of separately testing for covariate shift ($P(X)$), label shift ($P(Y)$), and concept shift ($P(Y|X)$).

2.3.1 Methodology: Multi-dimensional Shift Testing

DetectShift identifies which type of shift is occurring even in situations with limited labeled data in the target domain, guiding adaptation strategies such as retraining or importance sampling. They attempt to control the False Alarm Rate by utilizing formal hypothesis testing.
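A rough decomposition in the spirit of DetectShift is sketched below, using off-the-shelf two-sample tests as stand-ins for the authors’ unified testing procedure: per-feature Kolmogorov-Smirnov tests for covariate shift, a chi-square test on label frequencies for label shift, and a fixed model’s accuracy gap as a crude proxy for concept shift. The `model` argument is assumed to expose a scikit-learn-style `score` method; none of this reproduces DetectShift’s actual test statistics.

```python
import numpy as np
from scipy import stats

def covariate_shift_pvalues(X_src: np.ndarray, X_tgt: np.ndarray) -> list[float]:
    """Per-feature KS tests for a change in P(X)."""
    return [stats.ks_2samp(X_src[:, j], X_tgt[:, j]).pvalue
            for j in range(X_src.shape[1])]

def label_shift_pvalue(y_src: np.ndarray, y_tgt: np.ndarray) -> float:
    """Chi-square test for a change in P(Y); assumes every target class also
    appears in the source sample."""
    classes = np.union1d(y_src, y_tgt)
    src_counts = np.array([(y_src == c).sum() for c in classes])
    tgt_counts = np.array([(y_tgt == c).sum() for c in classes])
    expected = src_counts / src_counts.sum() * tgt_counts.sum()
    return stats.chisquare(tgt_counts, f_exp=expected).pvalue

def concept_shift_gap(model, X_src, y_src, X_tgt, y_tgt) -> float:
    """Crude proxy for a change in P(Y|X): accuracy gap of a fixed model."""
    return abs(model.score(X_src, y_src) - model.score(X_tgt, y_tgt))
```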

2.3.2 Critical Examination: Divergence Between Feature Space and Semantic Space

While DetectShift is mathematically rigorous, it remains limited to distribution comparisons on a fixed feature space. Particularly in high-dimensional and semantic models like Generative AI and LLMs, the contextual meaning of input $X$ may transform even if the statistical distribution of $X$ remains constant. The method of Polo et al. assumes that $X$ and $Y$ follow a pre-defined fixed schema and cannot handle “Unknown Unknowns” outside that schema or shifts in the evaluation axes themselves.

2.4 Summary of Prior Research: The Trap of Fixed Protocols

What the studies by Hinder, Dong, and Polo have in common is the stance of “judging the current state using an evaluation protocol (distributions, test sets, and variable definitions) defined in the past as an absolute standard.” While this is effective for detecting “Model Drift” or “Data Drift,” it may provide a “False Sense of Security” regarding “Evaluation Drift” where the standard itself is in motion. The “Silent Failure” warned of by Rabanser et al. (2019) occurs precisely within this blind spot.

3. Theoretical Framework of Evaluation Drift

To overcome the limitations of prior research, we introduce the concept of “Evaluation Drift” and elucidate its structure. Evaluation Drift is not merely a decrease in accuracy, but “the transformation of the definition of accuracy itself.”

3.1 Definition of Evaluation Drift

Evaluation Drift refers to the phenomenon where, due to the passage of time, environmental changes, or the transformation of social or business contexts, the validity and reliability of the metrics, test data, and ground truth labels used to evaluate AI systems decline. This results in an irreversible divergence between measured performance and true utility/safety in the real-world environment.

3.2 The Four Components of Evaluation Drift

Evaluation Drift is not a single phenomenon but a composite driven by the following four primary mechanisms:

3.2.1 Metric Definition Drift

This occurs when the definitions of business KPIs or evaluation metrics change due to internal consensus-building or external factors. As noted in (11), even a common term like “Revenue” can have fluid definitions depending on the department or period, such as whether it is “pre-discount” or “post-discount,” or whether it “includes refunds.”

  • Phenomenon: The model continues to maximize “Revenue under the old definition,” while management decisions begin to be made based on “Revenue under the new definition.”

  • Result: Model performance on the dashboard remains “green (normal),” but actual business contribution declines. This is also referred to as “KPI Drift” (12). (A minimal numerical sketch of this divergence follows below.)
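To make the “Revenue” example concrete, the toy sketch below evaluates the same hypothetical order records under the old and new definitions; the field names and numbers are invented purely for illustration.

```python
# Hypothetical order records; field names and values are illustrative only.
orders = [
    {"gross": 120.0, "discount": 20.0, "refunded": False},
    {"gross": 80.0,  "discount": 0.0,  "refunded": True},
    {"gross": 200.0, "discount": 50.0, "refunded": False},
]

def revenue_v1(rows):
    """Old definition: pre-discount, refunds included."""
    return sum(r["gross"] for r in rows)

def revenue_v2(rows):
    """New definition: post-discount, refunds excluded."""
    return sum(r["gross"] - r["discount"] for r in rows if not r["refunded"])

print(revenue_v1(orders), revenue_v2(orders))  # 400.0 vs 250.0: same data, different "Revenue"
```

A model tuned to maximize `revenue_v1` can keep its dashboard green while the organization has quietly moved to `revenue_v2`.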

3.2.2 Ghost Drift: Structural Transformation

This phenomenon is prominent in Generative AI, especially LLMs, where the model’s “Attention structure” undergoes non-linear transformation through the accumulation of context and prompts, without model parameter updates. According to (14), this is not a simple change in output but an irreversible shift in the internal structure (Interpretive Alignment) when the model generates responses to inputs.

  • Phenomenon: A “Bending of Meaning Space” occurs that cannot be captured by conventional fixed Q&A benchmarks (16).

  • Result: Even if the model responds normally to existing test prompts, the consistency of its “personality” and “context” in user interactions changes, effectively invalidating previous evaluation results.

3.2.3 Update Opacity and Epistemic Drift

“Update Opacity,” as proposed by Hatherley (2025), is a phenomenon where the user’s mental model collapses due to model updates (17).

  • Phenomenon: Even if a model is retrained and objective accuracy improves, if it behaves inconsistently with the “expectations” the user has built based on previous model experiences, the user perceives it as a “deterioration.”

  • Result: “Improvement” on evaluation metrics leads to a “loss of trust” in the field. This signifies that the evaluation criteria are drifting from “objective performance” to “user predictability” (19).

3.2.4 Adaptive Validity Loss

As demonstrated by Dwork et al. (2015) in “Preserving Statistical Validity in Adaptive Data Analysis,” repeatedly reusing the same holdout data (test set) for model evaluation and selection leads to “Adaptive Overfitting” to that specific test set (20).

  • Phenomenon: In automated retraining pipelines like those in Dong et al. (2024), if “Harmful Drift” is judged continuously with a fixed test set, the model becomes specialized to that test set and loses its ability to generalize to the broader population.

  • Result: Test scores remain high, but performance in the real environment degrades. This represents a drift in the evaluation axis of “Freshness of the evaluation dataset.”

3.3 Comparison Between Evaluation Drift and Conventional Drift

The following table summarizes the differences between conventional drift concepts and Evaluation Drift.


[Table: Comparison of conventional drift and Evaluation Drift: what changes, what is compared, and the implicit assumption about the evaluation protocol]

4. Why Latest Prior Research Misses Evaluation Drift: A Deep Analysis

Based on the aforementioned theory, this chapter demonstrates through specific mechanisms why the approaches of Hinder and Dong are vulnerable to Evaluation Drift.

4.1 The Blind Spot of “Harmfulness” Determination in Dong et al. (2024) DDLA

Dong et al.’s DDLA (Data Distributions with Low Accuracy) utilizes an efficient strategy of “retraining only in regions where accuracy is low.” However, if the Ground Truth (correct label) required to measure this “Accuracy” is not immediately available in production, or is available with a delay (Delayed Feedback), their method must rely on historical datasets.

Scenario Analysis: The Paradox of “Harmless” Drift in E-commerce

Suppose user purchasing behavior on an e-commerce site shifts rapidly from the “PC site” to the “mobile app” (Data Drift).

  • Dong et al. Judgment: If the legacy model was trained on PC site purchase data and happens to yield high prediction scores for mobile access by chance, DDLA judges it as “no accuracy degradation = harmless.”

  • Evaluation Drift Perspective: Mobile users operate in a different UI/UX context than PC users, where the meaning of clicks (e.g., possibility of accidental taps) and evaluation criteria for dwell time differ. Determining “high accuracy” based on PC-era standards means missing mobile-specific opportunity losses (e.g., ignoring scroll depth). Here, the definition of “correctness” may have drifted from “purchase” to “quality of engagement,” but DDLA using a fixed evaluation set cannot detect this shift.

4.2 Static Limits of the “Distribution” Concept in Hinder et al. (2024)

Hinder et al.’s Moment Trees detect distribution changes by predicting time T. This excels at identifying “when the data occurred” but fails to address “how the meaning of that data changed.”

Scenario Analysis: Guideline Revisions in Medical AI

Suppose that, for a medical imaging diagnostic model, a medical society revises the diagnostic criteria (guidelines) for a specific disease.

  • Hinder et al. Judgment: Since there is no change in the pixel distribution $P(X)$ of the input images (X-rays, etc.), kernel methods like Moment Trees judge it as “no drift.”

  • Evaluation Drift Perspective: Because the diagnostic criteria have changed, certain ground truth labels $Y$ in historical datasets are now incorrect. Since the evaluation criteria (the rules for assigning ground truth labels) are drifting, a divergence opens between the model output $Y_{pred}$ and the new true answer $Y_{true}$. Distribution monitoring remains blind to this “change in rules.”

4.3 Warnings from Dwork et al. (2015) and the “Worn-out Ruler”

While Dong et al. (2024) recommend “retraining” after drift detection, they do not address the risk of reusing the same test set to evaluate the model after retraining. As pointed out by Dwork et al. (2015), reusing test sets in adaptive data analysis causes significant problems with statistical Multiple Hypothesis Testing, effectively invalidating significance levels (22).

In the context of Evaluation Drift, this is equivalent to a state where “the ruler is worn out and the scale cannot be read.” If updates to the test set are neglected for the sake of cost reduction within the framework of Dong et al., the model will fall into a state of overfitting — where it “fits the test set but not reality” — thereby accelerating Evaluation Drift.
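The “worn-out ruler” effect can be reproduced with a small simulation: repeatedly selecting the best of many random “models” against one fixed holdout inflates the reused holdout score well above performance on fresh data. This is only a toy illustration of the Dwork et al. (2015) warning, not their reusable-holdout mechanism; all data here is synthetic noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_holdout, n_fresh, n_rounds = 200, 200, 50

# Labels are pure noise, so no candidate has real skill; any gain is overfitting
# to the reused holdout through repeated selection.
holdout_y = rng.integers(0, 2, n_holdout)
fresh_y = rng.integers(0, 2, n_fresh)

best_holdout_acc, best_fresh_acc = 0.0, 0.0
for _ in range(n_rounds):
    # Each "model" is a random labeling rule evaluated on the same holdout.
    holdout_pred = rng.integers(0, 2, n_holdout)
    fresh_pred = rng.integers(0, 2, n_fresh)
    acc = (holdout_pred == holdout_y).mean()
    if acc > best_holdout_acc:
        best_holdout_acc = acc
        best_fresh_acc = (fresh_pred == fresh_y).mean()

print(f"selected model: holdout accuracy {best_holdout_acc:.2f}, "
      f"fresh-data accuracy {best_fresh_acc:.2f}")
# Typically around 0.56-0.60 on the reused holdout versus roughly 0.50 on fresh data.
```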

5. Architecture for Auditable AI Safety

When Evaluation Drift is assumed, monitoring model accuracy alone is insufficient to guarantee AI safety. A new architecture is required that continuously updates and audits the evaluation process itself to ensure transparency. We propose an “Auditable AI Safety” framework consisting of the following three core components.

5.1 Dynamic Evaluation Store

Abolish fixed test sets and introduce an “Evaluation Store” that is constantly updated. This is an extension of the concept mentioned in (23).

  • Function: The Evaluation Store is a database, not a static file. Three types of data are continuously accumulated:

  • Golden Set: The latest ground truth data verified by humans.

  • Adversarial Examples: High-difficulty inputs designed to deceive the model, generated from red team testing or past failures.

  • Edge Cases: Data with low confidence scores in production or data that received explicit user feedback (e.g., complaints).

  • Response to Evaluation Drift: When making the DDLA judgment as in Dong et al., use data sampled based on “Freshness” from this Evaluation Store instead of a fixed test set. This allows harmfulness to be judged by “current standards.” (A minimal sketch of such a store follows below.)
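The sketch below illustrates one possible shape for such a store, with golden, adversarial, and edge-case records sampled by freshness. The class and field names, as well as the exponential-decay weighting and half-life, are assumptions made for illustration rather than a prescribed schema.

```python
import random
import time
from dataclasses import dataclass, field

@dataclass
class EvalRecord:
    """One evaluation example with its provenance and freshness."""
    example_id: str
    payload: dict
    kind: str            # "golden", "adversarial", or "edge_case"
    added_at: float = field(default_factory=time.time)

class EvaluationStore:
    """Continuously updated pool replacing a fixed test set (illustrative only)."""

    def __init__(self):
        self.records: list[EvalRecord] = []

    def add(self, record: EvalRecord) -> None:
        self.records.append(record)

    def sample_by_freshness(self, k: int, half_life_days: float = 30.0) -> list[EvalRecord]:
        """Sample k records, weighting newer examples exponentially higher."""
        if not self.records:
            return []
        now = time.time()
        weights = [0.5 ** ((now - r.added_at) / (half_life_days * 86400))
                   for r in self.records]
        return random.choices(self.records, weights=weights, k=k)
```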

5.2 Immutable Audit Logs and ML Metadata

To track causes and fulfill accountability when Evaluation Drift occurs, the history of evaluations must be recorded in an unalterable form.

  • Function: Record the following information using blockchain or Write Once Read Many (WORM) storage technology (24):

  • Evaluation Logic Hash: Recording which version of the evaluation code and which metric definitions were used.

  • Data Snapshot: The IDs and states of data used during evaluation.

  • Judgment Results: Not just scores, but DDLA results and the decision on whether to retrain.

  • Utilization of Google ML Metadata: As shown in (26), manage the lineage of artifacts (models, data) and executions (evaluation, training) in a graph structure using ML Metadata (MLMD) schemas. This enables post-hoc verification (Audit) of “when and why evaluation criteria were changed.” (A hash-chained sketch of such an append-only log follows below.)
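A minimal sketch of the append-only idea follows, using hash chaining so that any silent edit to a past evaluation record becomes detectable. A production system would rely on WORM storage, a ledger, or MLMD rather than an in-memory list; the record fields mirror the items listed above.

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only, hash-chained evaluation log (sketch of the WORM idea)."""

    def __init__(self):
        self.entries: list[dict] = []

    def append(self, eval_logic_hash: str, data_snapshot_ids: list[str],
               judgment: dict) -> dict:
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else "GENESIS"
        body = {
            "timestamp": time.time(),
            "eval_logic_hash": eval_logic_hash,    # which evaluation code/metric version
            "data_snapshot_ids": data_snapshot_ids,
            "judgment": judgment,                  # e.g. DDLA result, retrain decision
            "prev_hash": prev_hash,
        }
        body["entry_hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()
        self.entries.append(body)
        return body

    def verify(self) -> bool:
        """Recompute the chain; any silent edit to a past entry breaks it."""
        prev = "GENESIS"
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "entry_hash"}
            if body["prev_hash"] != prev:
                return False
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()
            if recomputed != e["entry_hash"]:
                return False
            prev = e["entry_hash"]
        return True
```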

5.3 Differential Auditing to Elucidate the “Why”

To solve Hatherley’s “Update Opacity” and prove data provenance, we introduce “Differential Auditing” as proposed by Mu et al. (2022) (28).

  • Function: Applying the mathematical framework of Differential Privacy, quantify the degree of impact a specific dataset (or specific evaluation criteria) had on the model based on changes in model output.

  • Response to Evaluation Drift: When model behavior changes abruptly (e.g., Ghost Drift), distinguish whether it is due to the “addition of data,” a “change in evaluation logic,” or a “change in the external environment.” If specific harmful data is causing structural changes in the model, the “Poison” can be identified via differential auditing. (A behavioral-difference sketch follows below.)
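Mu et al.’s differential auditing rests on differential-privacy machinery; the sketch below retains only the behavioral-difference intuition: compare two model variants (for example, trained with and without a suspect data slice, or evaluated under old and new logic) on a fixed probe set and score how far their outputs diverge. A scikit-learn-style `predict_proba` interface is assumed, and this is not the authors’ formal estimator.

```python
import numpy as np

def behavioral_divergence(model_a, model_b, probes: np.ndarray) -> float:
    """Mean total-variation distance between two models' predictive
    distributions on a fixed probe set (higher = larger behavioral impact)."""
    p_a = model_a.predict_proba(probes)
    p_b = model_b.predict_proba(probes)
    return float(0.5 * np.abs(p_a - p_b).sum(axis=1).mean())

def attribute_change(base_model, candidates: dict, probes: np.ndarray) -> str:
    """Rank candidate causes (e.g. 'added_slice_X', 'new_eval_logic') by how much
    each variant moves behavior away from the baseline on the same probes."""
    scores = {name: behavioral_divergence(base_model, m, probes)
              for name, m in candidates.items()}
    return max(scores, key=scores.get)
```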

5.4 Implementation Model of the Proposed Architecture

The overall structure of the proposed “Evaluation Drift-aware AIOps” is summarized below.


[Figure: Overall structure of the proposed Evaluation Drift-aware AIOps, integrating the Dynamic Evaluation Store, Immutable Audit Logs, and Differential Auditing into the monitoring and retraining loop]

6. Conclusion and Future Outlook

In this report, we analyzed the limitations of monitoring frameworks presented by the latest research, such as Hinder et al. (2024) and Dong et al. (2024), from the perspective of the “assumption of fixed evaluation protocols.” As long as ML systems are operated in real society, not only input data but also the “ruler (evaluation criteria)” used to evaluate them will inevitably drift. Optimization while ignoring this “Evaluation Drift” may contribute to local cost reduction, but will ultimately lead to fatal risks such as “Silent Failure” and the “loss of user trust” in the long term.

The “Auditable AI Safety” framework we proposed is a paradigm shift that treats the evaluation process itself as a monitoring target, centered on the Dynamic Evaluation Store and Immutable Audit Logs. This achieves the following three objectives:

  1. Restoration of Authenticity: Enables “living evaluation” aligned with current reality rather than relying on fixed relics of the past.

  2. Ensuring Transparency: Resolves Update Opacity and enables explanation of model behavior based on data provenance and shifts in evaluation criteria.

  3. Sustainable Safety: By incorporating evaluation criteria update cycles into the adaptive retraining process, the system becomes truly robust against environmental changes.

Future research tasks include the development of sophisticated automated update algorithms for the Evaluation Store and the creation of “Semantic Drift Metrics” to detect structural changes like Ghost Drift at an early stage. As AI systems gain greater autonomy, the “technology of evaluation” that governs them must also continue to evolve. Confronting Evaluation Drift is the most critical challenge in next-generation AI governance.

Note on Integrated Handling of References: In this report, research materials forming the basis of analysis are explicitly indicated by citation markers within the text and integrated into the context of the discussion. In addition to major prior research such as Hinder et al. (2024), Dong et al. (2024), Rabanser et al. (2019), Hatherley (2025), Dwork et al. (2015), and Polo et al. (2023), insights obtained from related technical blogs and documents (11) are also reflected.


 
 
 
