The State of Attention Research in 2026: Achievements, Limits, and Breakthroughs — The Preserve-then-Select Architecture Beyond Weighted Mixing

kanna qed
March 11

1. Why Do We Need a Map of Attention Research?

The attention mechanism, which has driven natural language processing and sequence modeling, is at a critical juncture of architectural reorganization in 2026. This article serves as a hub that categorizes the trajectory of attention research into three main paradigms: "Dense Mixing," "Sparse / Routing," and "Energy-based" descriptions. It then positions "GD-Attention (Ghost Drift Attention)" as a breakthrough that cannot be fully captured by these three existing currents.

GD-Attention is not merely an efficiency-oriented refinement of the "probability distribution-based information mixing" assumed by existing attention mechanisms. It represents a distinct design philosophy centered on "Selection" based on Semantic Energy. By clarifying where it diverges from prior work, this article argues for separating two layers, "Preserve" (protecting candidates) and "Select" (choosing candidates), as an evolutionary direction for next-generation model architectures.

2. The Standard Form Established by Softmax Attention: The Paradigm of Dense Mixing

The fundamental concept of attention was initially introduced as a mechanism to dynamically assign weights to relevant information (Bahdanau et al., 2014; Luong et al., 2015). This concept was standardized as Self-Attention using the Scaled Dot-Product and Softmax function with the advent of the Transformer by Vaswani et al. (2017).

In this standard form, attention is defined by "calculating a probability distribution over all tokens" and "computing a weighted sum (Weighted Blending) of Values based on that distribution." Subsequent developments that achieve context-dependent sparsification, such as Sparsemax (Martins & Astudillo, 2016) and the entmax family used in Adaptively Sparse Transformers (Correia et al., 2019), set some weights exactly to zero. Even so, they remain within the framework of "probability distribution design" and "Mixing": the output is still a weighted sum over the surviving tokens.
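To make the "distribution, then blend" pattern concrete, here is a minimal sketch in plain Python for a single query. The function names and the toy vectors are illustrative assumptions, not code from any of the cited papers. It runs the same blending step with a dense normalizer (softmax) and a sparse one (sparsemax).

```python
import math

def softmax(scores):
    """Dense normalizer: every score receives a strictly positive weight."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(scores):
    """Sparse normalizer (Martins & Astudillo, 2016): Euclidean projection onto the simplex."""
    ordered = sorted(scores, reverse=True)
    running, support, support_sum = 0.0, 0, 0.0
    for k, z in enumerate(ordered, start=1):
        running += z
        if 1.0 + k * z > running:  # position k is still inside the support
            support, support_sum = k, running
    tau = (support_sum - 1.0) / support
    return [max(s - tau, 0.0) for s in scores]

def attend(query, keys, values, normalizer=softmax):
    """Scaled dot-product attention for one query: weights first, then a weighted sum."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = normalizer(scores)
    width = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
    return weights, output

# Toy inputs (illustrative values only).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

dense_w, dense_out = attend(query, keys, values, softmax)
sparse_w, sparse_out = attend(query, keys, values, sparsemax)
print("softmax  :", [round(w, 3) for w in dense_w], [round(x, 3) for x in dense_out])
print("sparsemax:", [round(w, 3) for w in sparse_w], [round(x, 3) for x in sparse_out])
```

With sparsemax the third token's weight is exactly zero, yet the output is still a mixture of the first two values. Only the shape of the distribution changes; the blending step does not.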

The limitations of this paradigm became increasingly evident between 2024 and 2025. Differential Transformer (Ye et al., 2024) highlighted the issue that Transformers are prone to allocating excessive attention to irrelevant context, and proposed a structure that cancels out noise through the difference between two attention maps. Furthermore, Scalable-Softmax Is Superior for Attention (Nakanishi, 2025) argued that as sequence length increases, the Softmax attention distribution flattens, weakening the focus on critical information in long contexts. Taken together, these studies suggest that the weighted mixing paradigm is prone to noise accumulation and diminished concentration, particularly in long contexts.
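Both effects can be reproduced with a few lines of arithmetic. The sketch below is a toy illustration under stated assumptions, not the implementation from either paper. It uses one relevant token against identical distractors, a fixed scaling parameter `s` and a fixed `lam` (both are learned in the respective papers), and hand-picked scores.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def peak_weight(n, length_aware=False, s=1.0):
    """Weight on one relevant token (score 1.0) among n - 1 distractors (score 0.0)."""
    scores = [1.0] + [0.0] * (n - 1)
    if length_aware:
        # Scale the logits by s * log(n), the idea behind Scalable-Softmax.
        # s is fixed here purely for illustration.
        scores = [s * math.log(n) * z for z in scores]
    return softmax(scores)[0]

def differential_weights(scores_a, scores_b, lam):
    """Difference of two softmax maps; lam is a fixed stand-in for a learned scalar."""
    map_a, map_b = softmax(scores_a), softmax(scores_b)
    return [a - lam * b for a, b in zip(map_a, map_b)]

# 1) Flattening: the relevant token's weight shrinks as the context grows.
lengths = [16, 256, 4096]
plain = [peak_weight(n) for n in lengths]
scaled = [peak_weight(n, length_aware=True) for n in lengths]
for n, p, q in zip(lengths, plain, scaled):
    print(f"n={n:5d}  softmax={p:.4f}  length-aware={q:.4f}")

# 2) Noise cancellation: token 0 is the signal, tokens 1-3 are distractors
#    that receive similar attention in both maps.
map_a_scores = [2.0, 1.0, 1.0, 1.0]
map_b_scores = [0.0, 1.0, 1.0, 1.0]
single = softmax(map_a_scores)
diff = differential_weights(map_a_scores, map_b_scores, lam=0.5)
print("single map  :", [round(w, 3) for w in single])
print("differential:", [round(w, 3) for w in diff])
```

In this setup the plain softmax weight on the relevant token falls toward zero as `n` grows, while the length-aware variant stays just above one half. The differential map keeps most of the signal weight and shrinks the distractor weights.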

3. What Did Sparse / Routing Attention Change? Designing Computational Efficiency and Receptive Fields

To overcome the quadratic computational complexity ($O(N^2)$) bottleneck associated with increasing sequence lengths, attention research pivoted toward "Sparsification" and "Routing."

Starting with Sparse Transformers (Child et al., 2019), models such as Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) emerged, combining local windows with global tokens. Furthermore, mechanisms like Routing Transformer (Roy et al., 2020), which computes attention only between similar tokens via clustering, and Reformer (Kitaev et al., 2020), which utilizes Locality-Sensitive Hashing (LSH), were proposed.

The primary objective of these Sparse / Routing Attention methods is to "optimize the trade-off between maintaining expressive power and computational efficiency." They focus entirely on designing the receptive field and sparsity patterns on the attention matrix—dictating "where to look" and "how much to compute." These methods do not treat the "Selection" of semantic candidates as their central issue.
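As a rough illustration of "where to look," the following sketch builds a Longformer/BigBird-style pattern (a local window plus global tokens) as a boolean mask. The function name, the window size, and the choice of token 0 as the global token are assumptions made for the example.

```python
def sparse_mask(n, window, global_tokens):
    """True where query i may attend to key j: a local band plus global tokens."""
    global_set = set(global_tokens)
    return [
        [abs(i - j) <= window or i in global_set or j in global_set for j in range(n)]
        for i in range(n)
    ]

mask = sparse_mask(n=8, window=1, global_tokens=[0])
allowed = sum(cell for row in mask for cell in row)
print(f"{allowed} of {8 * 8} query-key pairs are computed")  # 34 of 64
for row in mask:
    print("".join("x" if cell else "." for cell in row))
```

The mask is fixed by position, not by content. It decides which pairs are computed, and the surviving pairs are still combined by a weighted sum.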

However, in 2025, the Sparse paradigm itself entered a new phase. Native Sparse Attention (Yuan et al., 2025) introduced a dynamic hierarchical sparsification strategy that combines coarse-grained token compression with fine-grained token selection, reporting performance and long-context efficiency comparable to or exceeding full attention. In other words, while the primary focus of the Sparse / Routing lineage remains on efficiency, it is advancing toward a more precise handling of candidate compression and local selection.
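The coarse-to-fine idea can be sketched as follows. This is a simplified stand-in, not the Native Sparse Attention implementation: mean pooling replaces the learned compression, the block size and `top_k` are arbitrary, and all names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def select_tokens(query, keys, block_size, top_k):
    """Coarse step: score mean-pooled blocks. Fine step: keep the tokens of the top_k blocks."""
    blocks = [
        list(range(start, min(start + block_size, len(keys))))
        for start in range(0, len(keys), block_size)
    ]
    summaries = [mean_vector([keys[i] for i in block]) for block in blocks]
    block_scores = [dot(query, summary) for summary in summaries]
    ranked = sorted(range(len(blocks)), key=lambda b: block_scores[b], reverse=True)
    kept = sorted(i for b in ranked[:top_k] for i in blocks[b])
    return block_scores, kept

# Toy inputs (illustrative values only): six keys in three blocks of two.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.8, 0.0], [-1.0, 0.0], [-1.0, 0.2]]

block_scores, kept_one = select_tokens(query, keys, block_size=2, top_k=1)
_, kept_two = select_tokens(query, keys, block_size=2, top_k=2)
print("block scores:", [round(s, 2) for s in block_scores])
print("tokens kept with top_k=1:", kept_one)
print("tokens kept with top_k=2:", kept_two)
```

Full attention would then run only over the kept token indices, which is where the efficiency gain comes from.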

4. What Did Energy-based Attention Redefine? Attention as an Energy Landscape

Distinct from the pursuit of computational efficiency, another approach theorizes the behavior of the attention mechanism through the lens of physics and associative memory.

Hopfield Networks is All You Need (Ramsauer et al., 2020) showed that the update rule of a continuous-valued Modern Hopfield Network, which minimizes that network's energy function, is equivalent to the attention mechanism of the Transformer. Furthermore, the Energy Transformer (Hoover et al., 2023) redefined the attention layer itself as a network designed to minimize a "specifically engineered energy function."
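The correspondence is easy to check numerically. In the sketch below, the stored patterns play the role of both keys and values, the state plays the role of the query, and `beta` plays the role of the softmax scaling factor. The patterns, the starting state, and `beta=8.0` are illustrative choices, and the energy is written up to additive constants.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_update(state, patterns, beta):
    """One retrieval step: a softmax-weighted sum of the stored patterns."""
    weights = softmax([beta * dot(p, state) for p in patterns])
    return [sum(w * p[j] for w, p in zip(weights, patterns)) for j in range(len(state))]

def energy(state, patterns, beta):
    """-(1/beta) * logsumexp(beta * <pattern, state>) + 0.5 * |state|^2, constants dropped."""
    scores = [beta * dot(p, state) for p in patterns]
    peak = max(scores)
    lse = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return -lse / beta + 0.5 * dot(state, state)

patterns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
state = [0.9, 0.2, 0.1]  # a noisy version of the first pattern
trace = [energy(state, patterns, beta=8.0)]
for _ in range(5):
    state = hopfield_update(state, patterns, beta=8.0)
    trace.append(energy(state, patterns, beta=8.0))

print("retrieved state:", [round(x, 4) for x in state])
print("energy trace   :", [round(e, 5) for e in trace])
```

Each step is one application of softmax attention, and the energy does not increase along the trajectory. The state settles near an attractor, but the mechanism that gets it there is still a weighted sum.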

This line of research offered a vital perspective by conceptualizing attention as "convergence to attractors on an Energy Landscape." However, these frameworks remain confined to state descriptions and theoretical foundations; they do not extend to the implementation of an "explicit selection mechanism" based on energy.

Consequently, the primary discussion in 2026 is no longer merely whether attention can be modeled as an energy landscape. Rather, the focus has shifted to the types of competition, exclusion, and selection that should be implemented upon it. The Energy-based paradigm has provided a coordinate system for theorizing selection, but it does not constitute a completed selection layer in itself.

5. The Breakthrough: A New Focus on Semantic Energy Selection

Building upon these existing paradigms (Softmax distributions, Sparsification via Sparse/Routing, and theoretical grounding via Energy), GD-Attention emerges as a novel breakthrough centered on "Semantic Energy Selection."

This breakthrough entails more than simply replacing existing attention mechanisms with alternative implementations. In 2026, A Provable Expressiveness Hierarchy in Hybrid Linear-Full Attention (Ye et al., 2026) demonstrated that a strict hierarchical difference in expressive power can exist between full attention and hybrid/linear attention. This implies that attention research is no longer solely about efficiency, but has entered the realm of capability design, questioning "which competition structures to allow in which layers" and "what selection capabilities to retain." In this context, GD-Attention is positioned not just as an efficient attention variant, but as a divergence that explicitly targets selection capability as a primary design objective.

Earlier approaches hybridized Soft Attention and Hard Attention, with the Hard Attention selecting a subset of tokens for the Soft Attention to process (Shen et al., 2018). GD-Attention, however, is not a mere introduction of discrete Hard Selection. It presupposes the concept of the "Energy Landscape" presented by Energy-based Attention and executes Singular Selection using the energy states among semantic candidates as the evaluation metric.

In other words, GD-Attention provides a focus that cannot be fully captured by the three existing currents in the following ways:

Target of Operation: It operates on the energy states of semantic candidate groups, rather than relying on the weighting of a single probability distribution.
Criterion for Elimination: It executes intentional exclusion based on incompatibility within the energy landscape, rather than relying on sub-threshold weights or non-locality.
Output Format: It produces a definitive "Selection," moving away from Weighted Blending as the central operational principle.
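The difference in output format between blending and selection can be shown with a toy contrast. This is not the GD-Attention algorithm, whose energy function is not specified here. The negative dot product used as the "energy" is a placeholder assumption, and all names and values are hypothetical.

```python
import math

def energies(query, keys):
    # Placeholder energy (an assumption, not the article's definition):
    # lower energy means the candidate is more compatible with the query.
    return [-sum(q * k for q, k in zip(query, key)) for key in keys]

def blend(query, keys, values):
    """Weighted Blending: softmax over negative energies, then a weighted sum."""
    e = energies(query, keys)
    lowest = min(e)
    raw = [math.exp(-(x - lowest)) for x in e]
    total = sum(raw)
    width = len(values[0])
    return [sum(r / total * v[j] for r, v in zip(raw, values)) for j in range(width)]

def select(query, keys, values):
    """Singular Selection: return the single lowest-energy candidate unchanged."""
    e = energies(query, keys)
    winner = min(range(len(e)), key=e.__getitem__)
    return winner, values[winner]

# Two nearly tied candidates that point in different directions.
query = [1.0, 0.9]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

mixed = blend(query, keys, values)
winner, chosen = select(query, keys, values)
print("blend :", [round(x, 3) for x in mixed])
print("select:", winner, chosen)
```

When two candidates are nearly tied, blending returns a vector that matches neither of them, while selection returns one candidate intact.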

6. Layer Separation with Beacon: Establishing the "Preserve-then-Select" Architecture

The structural prerequisite for enabling GD-Attention to function not merely as an isolated attention layer variant, but as the core of the architecture, is the integration of a "Beacon."

Beacon (Preserve Layer): Grounded in the philosophy of candidate protection, this layer retains minute gradients and minority semantic candidates (ghosts) that might otherwise be lost during computation, preventing them from being subsumed by the smoothing of the energy landscape.
GD-Attention (Selection Layer): This is the layer that makes the final "Selection" based on Semantic Energy among the multiple semantic candidates protected and presented by the Beacon.

This layer separation of Beacon (Preserve) -> GD-Attention (Select) establishes a new architectural paradigm known as "Preserve-then-Select." Existing Transformer-based architectures mix and average information, and so cannot adequately handle cases in which a minority candidate must survive until the final choice. Preserve-then-Select addresses these cases by providing a design foundation that makes the structure of selection explicit.
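One way to picture the separation is the toy pipeline below. It is a sketch under heavy assumptions: the `Candidate` fields, the threshold, the scores, and the function names are all hypothetical, and neither Beacon nor GD-Attention is specified at this level of detail here. It compares pruning by attention weight before selection with preserving every candidate before selection.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    weight: float  # share of attention mass the candidate received (toy value)
    energy: float  # hypothetical semantic energy; lower means more compatible

def prune_by_weight(candidates, threshold):
    """Baseline: drop every candidate whose weight falls below the threshold."""
    return [c for c in candidates if c.weight >= threshold]

def beacon_preserve(candidates, threshold):
    """Hypothetical Preserve step: nothing is dropped; low-weight candidates are tagged as ghosts."""
    return [(c, c.weight < threshold) for c in candidates]

def gd_select(candidates):
    """Hypothetical Select step: choose the single lowest-energy candidate."""
    return min(candidates, key=lambda c: c.energy)

candidates = [
    Candidate("dominant reading", weight=0.90, energy=0.7),
    Candidate("secondary reading", weight=0.08, energy=0.9),
    Candidate("minority reading", weight=0.02, energy=0.2),
]

pruned_choice = gd_select(prune_by_weight(candidates, threshold=0.05))
preserved = beacon_preserve(candidates, threshold=0.05)
preserved_choice = gd_select([c for c, _ in preserved])
ghosts = [c.name for c, is_ghost in preserved if is_ghost]

print("prune then select   :", pruned_choice.name)
print("preserve then select:", preserved_choice.name, "| ghosts kept:", ghosts)
```

In this toy setting the low-weight candidate has the lowest energy. It can only win if the preserve step keeps it available to the selector, which is the motivation for separating the two layers.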

References

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. arXiv preprint arXiv:1508.04025.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
Martins, A., & Astudillo, R. (2016). From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. International Conference on Machine Learning.
Correia, G. M., Niculae, V., & Martins, A. F. (2019). Adaptively Sparse Transformers. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
Shen, T., Zhou, T., Long, G., Jiang, J., Wang, S., & Zhang, C. (2018). Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling. International Joint Conference on Artificial Intelligence.
Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509.
Roy, A., Saffar, M., Vaswani, A., & Grangier, D. (2020). Efficient Content-Based Sparse Attention with Routing Transformers. Transactions of the Association for Computational Linguistics.
Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.
Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., & Ahmed, A. (2020). Big Bird: Transformers for Longer Sequences. Advances in Neural Information Processing Systems, 33.
Kitaev, N., Kaiser, Ł., & Levskaya, A. (2020). Reformer: The Efficient Transformer. International Conference on Learning Representations.
Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., Gruber, S., Holzleitner, M., Pavlović, M., Sandner, G. K., & Hochreiter, S. (2020). Hopfield Networks is All You Need. International Conference on Learning Representations.
Hoover, B., Liang, Y., Pham, B., Panda, R., Strobelt, H., Chau, D. H., Zaki, M. J., & Krotov, D. (2023). Energy Transformer. arXiv preprint arXiv:2302.07253.
Ye, T., Dong, L., Xia, Y., Sun, Y., Zhu, Y., Huang, G., & Wei, F. (2024). Differential Transformer. arXiv preprint arXiv:2410.05258.
Nakanishi, K. M. (2025). Scalable-Softmax Is Superior for Attention. arXiv preprint arXiv:2501.19399.
Yuan, J., Gao, H., Dai, D., Luo, J., Zhao, L., Zhang, Z., Xie, Z., Wei, Y. X., Wang, L., Xiao, Z., Wang, Y., Ruan, C., Zhang, M., Liang, W., & Zeng, W. (2025). Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.
Ye, X., He, X., Liao, C., Wu, C., & Lu, P. (2026). A Provable Expressiveness Hierarchy in Hybrid Linear-Full Attention. arXiv preprint arXiv:2602.01763.