Why ML Model Monitoring Fails: "Post-hoc Modification of Evaluation Metrics", a Blind Spot in AI Safety That Drift Detection Cannot Protect Against
- kanna qed
The fundamental reason model monitoring and drift detection systems fail lies not in the data itself, but in the fact that evaluation criteria are altered after the fact. This article frames AI safety around an "irreversible evaluation protocol" and connects that requirement to existing theoretical frameworks.
Why Drift Detection Alone is Insufficient for AI Safety: Establishing "Non-Retroactivity" as a Minimum Condition
Data drift (shifts in input data distribution), concept drift (changes in the definition or boundaries of the target variable), and model decay (performance degradation over time) are well-documented risks in the production deployment of machine learning systems. Indeed, the distribution gaps between training and operational environments, alongside temporal degradation, often lead to diminished accuracy and unpredictable behavior. Consequently, these factors remain the primary focus of monitoring and detection in industrial settings. However, it is becoming clear that merely detecting such distribution shifts is insufficient to guarantee the overall safety of AI systems.
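To make the monitoring side concrete, the following is a minimal sketch of single-feature input-drift detection with a two-sample Kolmogorov-Smirnov test. The distributions, window sizes, and significance level are invented for illustration and do not refer to any particular system.

```python
# Minimal single-feature drift check: compare a reference window (drawn at
# training time) against a recent production window with a two-sample
# Kolmogorov-Smirnov test. All values here are invented for illustration.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # stand-in for training-time feature values
production = rng.normal(loc=0.4, scale=1.0, size=2_000)  # stand-in for shifted live feature values

ALPHA = 0.01  # significance level committed to before looking at the data

result = ks_2samp(reference, production)
drift_detected = result.pvalue < ALPHA
print(f"KS statistic={result.statistic:.3f}, p={result.pvalue:.4f}, drift={drift_detected}")
```

A check like this only has teeth if ALPHA and the reference window were fixed before the production data was inspected, which is exactly the point developed in the rest of this article.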
Accidents and user distrust encountered in the field do not always stem from flaws within the data or the model architecture. There exists a critical blind spot: what is truly compromised is the "evaluation" itself. Even if the model's inputs and outputs remain constant, any form of model monitoring becomes futile if the evaluation criteria or "passing" thresholds are rewritten after the results are known. For instance, if performance indicators or acceptance thresholds are relaxed post-hoc to accommodate observed outcomes, the model is erroneously deemed to have "met the criteria," allowing potential failures to be overlooked. In essence, the ability to modify evaluations post-hoc is the very factor that causes the safety paradigm to collapse.

Mapping Existing AI Safety Theories (The Current Landscape)
Current approaches to AI safety are broadly categorized into three fundamental pillars:
Robustness – A research domain focused on ensuring models maintain consistent performance even in scenarios that deviate from their training data. This includes outlier detection, out-of-distribution (OOD) detection, and enhancing resilience against distribution shifts, all aimed at bolstering prediction reliability by mitigating model vulnerability to unforeseen data.
Interpretability – Efforts dedicated to designing methodologies and metrics that elucidate the rationale behind a model’s outputs, allowing humans to verify and understand its internal behavior. By clarifying feature contributions and decision-making logic, these methods supplement trust in otherwise "black-box" models.
Governance – A field concerned with the management processes across the entire model lifecycle—from development and testing to verification and ongoing operation—while ensuring ethical and legal compliance. It places heavy emphasis on the TEVV (Test / Evaluation / Verification / Validation) process. Risk management frameworks increasingly advocate for continuous testing and evaluation throughout the system's operational phase [1]. For example, NIST’s AI Risk Management Framework (AI RMF 1.0) posits that incorporating testing and evaluation throughout the AI lifecycle, and systematically documenting those results, is fundamental to establishing trustworthiness [2].
While these existing theories have significantly advanced AI safety by addressing issues inherent to models and data (such as uncertain inputs and model opacity), it is crucial to note that none of these approaches directly guarantees the prerequisite that "the evaluation process itself must remain immutable." A robust model loses its significance if the evaluation metrics can be manipulated later. Similarly, even the most sophisticated explanations are rendered counterproductive if they are used as a pretext to substitute or weaken evaluation indicators. Furthermore, while risk management frameworks prioritize evaluation records, they stop short of mandating specific protocols that fix evaluation parameters in a way that precludes post-hoc tampering.
The Blind Spot — Safety is Impossible if "Evaluation Can Be Altered Post-hoc"
The blind spot of AI safety resides in the inherent malleability of the evaluation process. The following patterns of "post-hoc evaluation modification" are frequently observed in real-world applications:
Retrospective Threshold Adjustment – Situations where the acceptance criteria for performance (e.g., classification score thresholds) are lowered or raised only after the results are reviewed. This allows underperforming models that should have triggered alerts to be classified as "passing" simply by moving the goalposts.
Post-hoc Redefinition of Evaluation Periods – Retroactively altering the timeframe or data subset used for model evaluation. This includes selectively excluding periods of poor performance or, conversely, cherry-picking intervals where the model performed well. Such practices render "continuous monitoring" a mere formality.
Metric Substitution – Replacing the originally agreed-upon evaluation metrics with alternative ones based on the observed results. For instance, if prediction accuracy declines, the focus might be shifted to a different score, justifying the model's continued use by claiming, "While accuracy is down, this specific metric shows no issue."
"Well-Intentioned" Evaluation Modification – It is vital to recognize that these modifications do not always stem from malice or fraud; they often emerge from "good-faith" optimization efforts in the field. A deployment lead might relax a threshold as a "temporary measure to meet business requirements," or an analyst might justify post-hoc that they "adjusted the evaluation set to account for data bias." While these are often viewed as "flexible judgments" in practice, this field-level optimization is precisely the blind spot that destabilizes the foundation of AI safety.
In short, even the most advanced model monitoring is defenseless against post-hoc evaluation changes. When evaluation methods and criteria fluctuate based on outcomes, drift detection and alert thresholds lose their functional meaning, as the sketch below makes concrete. Without addressing this blind spot, the discourse on AI safety remains purely theoretical.
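The following toy sketch illustrates the first pattern: the same observed accuracy fails under the threshold fixed in advance and passes once the threshold is relaxed after the result is known. All numbers are invented for illustration.

```python
# Toy illustration: the same observed metric "fails" under the pre-committed
# threshold but "passes" once the threshold is relaxed after the fact.
# All numbers are invented for illustration.

PRECOMMITTED_THRESHOLD = 0.90   # agreed before the evaluation run
observed_accuracy = 0.87        # result of the evaluation run

def verdict(metric: float, threshold: float) -> str:
    return "PASS" if metric >= threshold else "FAIL"

print("against pre-committed threshold:", verdict(observed_accuracy, PRECOMMITTED_THRESHOLD))

# Post-hoc "flexible judgment": the threshold is lowered because the result is known.
relaxed_threshold = 0.85
print("against relaxed threshold:      ", verdict(observed_accuracy, relaxed_threshold))
# The model did not change; only the goalposts moved. Monitoring that permits
# this rewrite can never raise an alert that sticks.
```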
Unresolved Challenges in Existing Research (Where the Debate Currently Stands)
Fortunately, fragments of this "broken evaluation" problem are beginning to be acknowledged in contemporary research. However, these discussions remain isolated within specific sub-fields and have yet to converge into a comprehensive solution—namely, a "fixed evaluation protocol" that spans the entire AI lifecycle. To contextualize our proposal, it is necessary to examine the current state of related prior research.
Metric Gaming and the Obsolescence of Targets (Goodhart’s Law) – The principle that "when a measure becomes a target, it ceases to be a good measure" (Goodhart’s Law) has been empirically validated in machine learning. Research by Hutchinson et al. critiques the ML community’s over-reliance on a narrow set of metrics (predominantly accuracy), which leads to those metrics becoming ends in themselves, overshadowing essential concerns such as safety and fairness [3][4]. This results in models that "game" the metrics. In reinforcement learning, the optimization of proxy rewards often leads to a deviation from the true objective, with performance deteriorating sharply beyond a certain critical threshold [5]. Karwowski et al. (ICLR 2024) provided a geometric explanation for this phenomenon and proposed mitigations like early stopping [6]. However, addressing "metric gaming" is not a fundamental solution; as long as the evaluation itself can be altered post-hoc, increasing the number of metrics merely leads to a "cat-and-mouse" game.
Data Contamination Rendering Metrics Deceptive – In the evaluation of Large Language Models (LLMs), test data frequently leaks into pre-training corpora, inflating the measured generalization performance. Emerging techniques aim to detect and quantify this contamination. For instance, Oren et al. (2024) developed a statistical test to prove test set contamination in black-box LLMs [7]. Their method exploits the fact that, without contamination, different orderings of a benchmark's examples should be equally probable, whereas contaminated models assign significantly higher probability to the specific "canonical" ordering [8]. Similarly, Xu et al. (EMNLP 2025) introduced the DCR (Data Contamination Risk) framework, which scores contamination across four levels (Semantic, Information, Data, and Label) using fuzzy inference to correct a model's true performance [9]. Their application of DCR to nine LLMs achieved contamination diagnosis within a 4% average error margin, highlighting its practical utility for routine evaluation [10]. Furthermore, Golchin & Surdeanu (TACL 2025) proposed the DCQ (Data Contamination Quiz), which uses word-level perturbations to estimate whether a model has "memorized" test data [11]. DCQ operates without access to training data or internal model states and has been reported to offer higher sensitivity than previous methods [12].
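As a rough sketch of the exchangeability intuition behind such black-box tests (not the authors' implementation): if a model never saw the benchmark, the canonical ordering of its examples should be no more probable than a random shuffle, so a permutation test over orderings can flag memorization. The `sequence_logprob` callable below is a hypothetical stand-in for whatever log-likelihood access the evaluator has.

```python
# Rough sketch of the exchangeability intuition behind black-box contamination
# tests: absent contamination, the canonical ordering of benchmark examples
# should not be systematically more likely than a shuffled ordering.
# `sequence_logprob` is a hypothetical stand-in for model log-likelihood access.
import random
from typing import Callable, List

def permutation_p_value(
    examples: List[str],
    sequence_logprob: Callable[[str], float],
    n_shuffles: int = 200,
    seed: int = 0,
) -> float:
    """Fraction of shuffled orderings at least as likely as the canonical one."""
    rng = random.Random(seed)
    canonical_score = sequence_logprob("\n".join(examples))
    at_least_as_likely = 0
    for _ in range(n_shuffles):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        if sequence_logprob("\n".join(shuffled)) >= canonical_score:
            at_least_as_likely += 1
    # A small p-value means the canonical order is suspiciously preferred,
    # i.e. the benchmark was likely memorized during pre-training.
    return (at_least_as_likely + 1) / (n_shuffles + 1)
```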
Pre-registration as a Barrier to Post-hoc Modification – In broader scientific practice, the "pre-registration of hypotheses and analysis plans" is used to enhance reproducibility. This mechanism involves publicly recording the intended evaluation methodology before research begins, thereby preventing "p-hacking" (repeated analysis until a favorable result is found) and "HARKing" (Hypothesizing After Results are Known). A comparative study by van den Akker et al. (2024) found that while pre-registration improves statistical power and research impact, it did not significantly reduce the ratio of positive results or statistical errors as much as hoped [13]. This suggests that while beneficial, pre-registration is not yet a definitive solution for eliminating post-hoc interpretation shifts. Moreover, it remains primarily a cultural convention in academia, lacking a technical protocol suitable for industrial ML operations.
In summary, the "evaluation failure" problem is being identified as a series of disconnected points. While metric gaming is recognized and contamination detection tools are emerging, they have not yet been synthesized into a unified protocol that renders evaluation immutable.
The Proposal: Defining "Non-Retroactivity" as a Mandatory Protocol Condition
The conclusion is clear: a minimum requirement for AI safety is the explicit, technical guarantee that "evaluation cannot be rewritten post-hoc." We define this as the "Non-retroactivity of Evaluation." This must be viewed not as a mere ethical guideline, but as a rigid protocol requirement for system implementation.
Non-retroactivity of Evaluation: A state in which it is impossible to modify evaluation methodologies or criteria once results have been observed. This requires the prior fixing, recording, and locking of the following five elements (a sketch of how they can be frozen in practice follows the list):
Evaluation Boundary (Scope) – Predetermine the exact data distribution, population, and timeframe for evaluation. For example, by fixing the scope to "production data from January to March 2025," the range cannot be conveniently altered to suit the results.
Evaluation Split (Fixed Data and Duration) – Finalize the partitioning of training, testing, and validation data, as well as the specific duration for time-series evaluation. Once set, these datasets and periods must not be replaced after results are obtained.
Threshold Policy – Define the numerical thresholds for acceptance, alert conditions, hysteresis (allowable margins), and any exception rules in advance. This removes the possibility of "fine-tuning" thresholds based on the model’s performance.
Input Identification – Uniquely identify all input data provided during evaluation, including specific preprocessing steps. By recording dataset hashes or script versions (data fingerprinting), we ensure the inputs have not been manipulated post-hoc.
Execution Environment and Code – Secure the exact code (model versions and scripts) and the computational environment used. Utilizing commit IDs and Docker image hashes ensures that the evaluation process is fully reproducible by a third party under identical conditions.
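As mentioned above, here is a minimal sketch of how these five elements could be frozen before any result is observed: collect them into a manifest, serialize it canonically, and publish its digest. The field names, placeholder values, and the choice of SHA-256 over canonical JSON are illustrative assumptions, not a finished standard.

```python
# Minimal sketch: freeze the five evaluation elements into a manifest before any
# result is observed, then publish its SHA-256 digest. Field names, placeholder
# values, and the canonical-JSON/SHA-256 choice are illustrative assumptions.
import hashlib
import json

manifest = {
    "evaluation_boundary": {"population": "production traffic", "period": "2025-01-01/2025-03-31"},
    "evaluation_split": {"test_set": "sha256:<dataset digest>", "window": "weekly"},
    "threshold_policy": {"metric": "accuracy", "accept_if_gte": 0.90, "hysteresis": 0.01},
    "input_identification": {"preprocessing_script": "git:<commit id>", "raw_data": "sha256:<digest>"},
    "execution_environment": {"code": "git:<commit id>", "image": "docker:<image digest>"},
}

canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
manifest_digest = hashlib.sha256(canonical).hexdigest()

# Publishing this digest (for example in a ticket, model registry, or transparency
# log) before the evaluation runs is what makes later edits to the manifest detectable.
print("evaluation manifest digest:", manifest_digest)
```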
The execution of an evaluation process based on these fixed parameters culminates in an "Evaluation Certificate" or "Audit Log." This document uniquely identifies the model, data, procedures, thresholds, and environment, and records the final performance values. Crucially, this certificate allows a third party to reproduce the exact same results from the same inputs, technically eliminating the suspicion that criteria were shifted. By establishing a protocol for issuing and verifying these certificates, we enable fact-based discourse on AI safety, providing a foundation of trust where the evaluation axis is immutably fixed.
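Building on that idea, an Evaluation Certificate can bind the published manifest digest to the observed results with a digital signature, so that a third party can detect any later alteration. The sketch below uses Ed25519 from the `cryptography` package as one illustrative signature scheme; key management and distribution are out of scope here.

```python
# Sketch: bind the frozen manifest digest and the observed results into a signed
# certificate that a third party can verify. Ed25519 via the `cryptography`
# package is an illustrative choice of signature scheme, not a prescribed one.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

manifest_digest = "<sha-256 digest published before the run>"  # from the frozen manifest

certificate = {
    "manifest_digest": manifest_digest,
    "model": "git:<commit id>",
    "results": {"accuracy": 0.87, "verdict": "FAIL"},  # recorded exactly as observed
}
payload = json.dumps(certificate, sort_keys=True, separators=(",", ":")).encode("utf-8")

signing_key = Ed25519PrivateKey.generate()  # in practice held by the evaluating party
signature = signing_key.sign(payload)

# A verifier holding the public key checks that the certificate bytes were not
# altered after signing; verify() raises InvalidSignature if anything changed.
signing_key.public_key().verify(signature, payload)
```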
Connecting to Existing AI Safety (As a Foundational Requirement)
The proposed "Non-retroactivity of Evaluation" does not replace existing AI safety approaches; rather, it serves as the foundational condition that makes them viable.
Enhancing Robustness: Fixing Boundaries – Robustness evaluation and drift detection only gain substance when the evaluation boundaries are fixed. Without a non-retroactive protocol, unfavorable drifts can be ignored by simply redefining the scope of the evaluation.
Protecting Interpretability: Preventing the Abuse of Explanations – Interpretability is intended to build trust, but loose protocols allow "soft" explanations to be used as an excuse to bypass performance criteria (e.g., "I understand why it failed, so it's okay"). Furthermore, since explanation metrics themselves can be "Goodharted" [14][15], non-retroactivity ensures that explanations cannot be used to retrospectively bend the rules.
Strengthening Governance: Immutable Auditability – In AI governance, audit logs are only reliable if they are immune to arbitrary modification. The proposed Evaluation Certificate acts as an immutable record of the TEVV process, fulfilling NIST AI RMF requirements for objectivity and documentation [2]. This shifts the burden of proof from self-declaration to technologically secured factual records (see the hash-chain sketch below).
Ultimately, the superstructures of robustness, explainability, and auditability are only as strong as the underlying foundation that prevents evaluation infrastructure from being tampered with post-hoc.
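As referenced above, one way to make the audit trail itself tamper evident is a simple hash chain in the spirit of append-only transparency logs. The sketch below is a minimal illustration, not a production design.

```python
# Minimal tamper-evident audit log: each entry commits to the previous entry's
# digest, so rewriting any past record breaks every digest after it.
# This is an illustrative sketch, not a production transparency-log design.
import hashlib
import json

def append_entry(log: list, record: dict) -> None:
    prev_digest = log[-1]["digest"] if log else "0" * 64
    body = json.dumps({"prev": prev_digest, "record": record}, sort_keys=True)
    log.append({"prev": prev_digest, "record": record,
                "digest": hashlib.sha256(body.encode()).hexdigest()})

def chain_is_intact(log: list) -> bool:
    prev_digest = "0" * 64
    for entry in log:
        body = json.dumps({"prev": prev_digest, "record": entry["record"]}, sort_keys=True)
        if entry["prev"] != prev_digest or entry["digest"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev_digest = entry["digest"]
    return True

audit_log: list = []
append_entry(audit_log, {"event": "manifest frozen", "digest": "<manifest digest>"})
append_entry(audit_log, {"event": "evaluation run", "verdict": "FAIL"})
print(chain_is_intact(audit_log))            # True
audit_log[1]["record"]["verdict"] = "PASS"   # attempted post-hoc rewrite
print(chain_is_intact(audit_log))            # False
```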
Conclusion: A New Foundation for AI Safety
AI safety is not achieved solely through robustness, performance, or transparency. No matter how advanced a model is, its safety cannot be guaranteed if the evaluation framework remains subject to the operator’s post-hoc discretion. The true blind spot of AI safety is "evaluation modification," and the minimum condition for safety is that "evaluation cannot be rewritten later." By implementing a robust, non-retroactive evaluation protocol, we can finally address model uncertainty head-on and operate "Safe AI" in a manner that is truly verifiable by third parties.
References
[1][2] NIST, Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf
[3][4] Hutchinson et al., Evaluation Gaps in Machine Learning Practice. FAccT 2022. https://facctconference.org/static/pdfs_2022/facct22-3533233.pdf
[5][6] Karwowski et al., Goodhart's Law in Reinforcement Learning. ICLR 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/6ad68a54eaa8f9bf6ac698b02ec05048-Paper-Conference.pdf
[7][8] Oren et al., Proving Test Set Contamination in Black-Box Language Models. ICLR 2024. https://openreview.net/forum?id=KS8mIvetg2
[9][10] Xu et al., DCR: Quantifying Data Contamination in LLMs Evaluation. EMNLP 2025. https://aclanthology.org/2025.emnlp-main.1173.pdf
[11][12] Golchin & Surdeanu, Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models. TACL 2025. https://arxiv.org/abs/2311.06233
[13] van den Akker et al., Preregistration in practice: A comparison of preregistered and non-preregistered studies in psychology. Behavior Research Methods 56(6):5424–5433 (2024). https://www.researchgate.net/publication/375575020_Preregistration_in_practice_A_comparison_of_preregistered_and_non_preregistered_studies_in_psychology
[14][15] Hsia et al., Goodhart's Law Applies to NLP's Explanation Benchmarks. Findings of EACL 2024. https://aclanthology.org/2024.findings-eacl.88.pdf


