The Achievements and Limitations of AI Research in 2025: The Expansion of Test-Time Compute and the Imperative for "Preserve-then-Select" Design

Abstract

In 2025, AI research shifted decisively from raising performance through ever-larger model parameters to exploration during inference (test-time compute), agentic execution, and the evaluation and safety design of these processes. This article surveys the achievements of AI research in 2025 and the limitations that have since emerged, positioning the "preserve-then-select" design principle of "Beacon" as a structural response to these technical challenges.



1. Achievements: The Expansion of Test-Time Compute and the Rise of Agentic Systems

The most significant change in AI architectures in 2025 was the paradigm shift from reliance on "pre-trained weights" to "how to explore and verify during inference." As Ji et al. (2025) point out, the approach of exploring multiple paths before generating an output and allocating additional computational resources to improve inference accuracy has been established as an independent research theme.
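The shape of this approach can be illustrated with a minimal best-of-N sketch. Here `sample_path` and `verifier_score` are hypothetical stubs standing in for a model call and a learned verifier; they are illustrative assumptions, not any specific system's API.

```python
import random

def sample_path(prompt: str, rng: random.Random) -> str:
    # Stub: a real system would decode one reasoning path from a model.
    return f"{prompt} -> answer[{rng.randint(0, 9)}]"

def verifier_score(path: str) -> float:
    # Stub: a real verifier would score the path's correctness or consistency.
    return float(path.rsplit("[", 1)[1].rstrip("]"))

def best_of_n(prompt: str, n: int = 8, seed: int = 0) -> str:
    """Spend extra inference compute: sample n paths, keep the best-scoring one."""
    rng = random.Random(seed)
    candidates = [sample_path(prompt, rng) for _ in range(n)]
    return max(candidates, key=verifier_score)
```

Raising `n` buys more exploration at inference time; what that extra compute can deliver is bounded by the quality of the verifier, not by the generator alone.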

Simultaneously, there was a progression from single-turn response generation to multimodal and long-horizon agentic operations predicated on tool use (the Agentic Era). Kwa et al. (2025) demonstrated that the time horizon for tasks AI can execute has extended significantly. This implies that the evaluation axis for systems has expanded from "what was ultimately output" to "what state transitions and procedures were taken toward the goal."


2. Limitations: The Vulnerability of Internal Monitoring and the Difficulty of Path Recovery

While the advancement of test-time compute and agentic capabilities enabled more sophisticated reasoning, it also exposed new structural vulnerabilities. In complex exploration spaces, the process of "which candidates the system preserves and which it rejects" has become directly linked to safety and reliability.

First, the limits of monitorability in the reasoning process have been pointed out. Chen et al. (2025) and Korbak et al. (2025) show that externalized traces of internal processes, such as Chain of Thought (CoT), do not always reflect the model's true reasoning state, which makes CoT fragile as a safety-monitoring mechanism.

Second, there are risks specific to agents that interact with external environments over long tasks. As Evtimov et al. (2025) demonstrate, an agent's inference path can be hijacked by malicious instructions embedded in external information (e.g., prompt injection). Once safe candidates or crucial signals have been rejected mid-inference and the system has transitioned to a malicious or compromised path, it becomes exceedingly difficult for the system to return to its original safe state on its own (a shift into a hard-to-recover regime).

In other words, the research trends of 2025 exposed a critical limitation: the difficulty of auditing and controlling not the final output, but the intermediate inference path, specifically the stage at which critical candidates are prematurely discarded and dangerous selections are made.


3. Structural Response (Beacon): Preserve-then-Select at the Pre-Selection Stage

Beacon's design principle is positioned as a structural response to the limitations described above. The challenge facing research in 2025 is that while techniques for choosing a "smarter answer" from many candidates improved, countermeasures remained insufficient against the risk of irrecoverably rejecting "candidates that must not be lost," such as safety-critical or otherwise crucial signals.

Beacon does not entirely replace existing architectures; rather, it functions as a design proposal that introduces a new layer in the inference process: "explicitly preserving crucial candidates before making a selection" (preserve-then-select).
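A minimal sketch of such a layer follows. The names here (`Candidate`, `preserve_then_select`, the `critical` flag) are illustrative assumptions, not Beacon's actual interface; the point is only that pruning operates on the non-protected pool, so narrowing the beam can never discard a flagged candidate.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    score: float
    critical: bool = False  # e.g. carries a safety signal that must not be lost

def preserve_then_select(candidates: list[Candidate],
                         beam_width: int = 2) -> list[Candidate]:
    # Preserve first: critical candidates are set aside before any pruning.
    preserved = [c for c in candidates if c.critical]
    # Then select: only the remaining slots are filled by score-based pruning.
    rest = sorted((c for c in candidates if not c.critical),
                  key=lambda c: c.score, reverse=True)
    return preserved + rest[:max(0, beam_width - len(preserved))]
```

An ordinary top-k prune would drop a low-scoring but safety-critical candidate; the preserve step guarantees it survives the narrowing regardless of its score.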

This approach aligns closely with 2025's trends in verification and guardrail design. Venktesh et al. (2025) argued for the importance of verifiers in test-time scaling, and Jiang et al. (2025) demonstrated the effectiveness of correcting an agent's thoughts before it acts; both point toward an intervention layer at the stage immediately before a decision is made. The emergence of process-level evaluation, exemplified by Luo et al. (2025), likewise demands control over the process rather than post-hoc output filtering.

Instead of relying on post-hoc output filtering or imperfect internal monitoring (like CoT), Beacon embeds a mechanism within the architecture itself to "preserve crucial candidates before drifting into dangerous paths." By doing so, it aims to structurally mitigate the loss of information and safety risks associated with the "narrowing down of candidates," which is inevitable in test-time compute and long-horizon agentic operations.


4. Conclusion

AI research in 2025 reached remarkable milestones, notably the expansion of test-time compute and agentic capabilities. It also brought a new challenge to the forefront: how to audit and control the process of exploration and selection. Beacon is not an idiosyncratic concept isolated from the mainstream research of 2025; it is a clear and rational design principle aimed at the difficulty of recovering from premature selection failures and the need for candidate preservation, a deficit that the evolution of mainstream models has inevitably exposed.

For a comprehensive view of next-generation AI research including Beacon, please refer to our Next-Generation AI Research page.


References (Key Research of 2025)

  • Ji, Y. et al. (2025). A Survey of Test-Time Compute: From Intuitive Inference to Deliberate Reasoning. arXiv:2501.02497.

    • A comprehensive survey on additional computation during inference. Foundational literature showing that the main battleground of research has shifted to exploration, preservation, and selection before the final output.

  • Venktesh, V. et al. (2025). Trust but Verify! A Survey on Verification Design for Test-time Scaling. arXiv:2508.16665.

    • Systematizes the importance of verifiers in test-time scaling. Positions the "preserve-then-select" design within the trend of verified selection.

  • Chen, Y. et al. (2025). Reasoning Models Don't Always Say What They Think. arXiv:2505.05410.

    • Points out that monitoring the reasoning process through CoT is insufficient. Suggests the necessity of candidate preservation as a structural element rather than relying on observation.

  • Korbak, T. et al. (2025). Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. arXiv:2507.11473.

    • Outlines the vulnerabilities of opportunities to monitor internal states. Provides the rationale for shifting safety assurance from monitorability to design.

  • Jiang, C. et al. (2025). Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction. arXiv:2505.11063.

    • A safety method using thought correction before action. Empirically supports the effectiveness of intervention at the pre-selection stage (preserve-then-select).

  • Luo, H. et al. (2025). AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents. arXiv:2506.00641.

    • An advanced evaluation framework for agent safety. Demonstrates the current landscape where evaluating intermediate processes is as essential as evaluating the final output.

  • Evtimov, I. et al. (2025). WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks. arXiv:2504.18575.

    • Examines the vulnerability of agents to prompt injection. Shows that a structure to preserve crucial candidates is essential to prevent a difficult-to-recover drift into malicious paths.

  • Kwa, T. et al. (2025). Measuring AI Ability to Complete Long Software Tasks. arXiv:2503.14499.

    • Measures the ability to execute long tasks. Suggests that as tasks become longer, "what to preserve without dropping" between successive reasoning steps becomes extremely important.

 
 
 
