
A Verified Survey of Prior Work and Structural Limits in Generative Search and LLM-IR: An Analysis Centered on Algorithmic Legitimacy Shift (ALS) (2026)

1. Executive Synthesis

This report provides a verified, fact-based review of recent research on Generative Search and Large Language Model–based Information Retrieval (LLM-IR). It analyzes the dual trajectories of the paradigm shift from Neural IR to LLM-IR in academia and the transition from Search Engines to Answer Engines in real-world operations.

Academic Perspective: Redefining the IR Pipeline

In contrast to traditional keyword matching (BM25) and early neural search (BERT-based), Large Language Models (LLMs) are redefining every stage of the search ecosystem. According to a recent comprehensive survey (published 14 Nov 2025, DOI: 10.1145/3748304), LLMs now function as distinct, integral modules: Query Rewriter (intent understanding/expansion), Retriever (knowledge indexing), Reranker (relevance judgment), and Reader (answer generation). Notably, RAG (Retrieval-Augmented Generation) has been widely adopted as a grounding approach aimed at reducing unsupported generations by conditioning the model on retrieved evidence (Lewis et al., NeurIPS 2020). In evaluation methodologies, LLM-as-a-judge (using LLMs for relevance labeling) is being investigated in shared-task settings, such as TREC RAG Track contexts exploring LLM-based automatic assessment vs. human judgments (Evidence: SIGIR '25, DOI: 10.1145/3726302.3730165). This accelerates the shift toward automated evaluation to complement human judgment.
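To make this modular decomposition concrete, the following minimal Python sketch wires the four roles together. It assumes a generic llm text-completion callable and an in-memory corpus; the lexical scoring inside retrieve is a toy stand-in for a real BM25 or dense index, not a reference implementation of any surveyed system.

# Minimal sketch of the Rewriter -> Retriever -> Reranker -> Reader pipeline.
# `llm` is a placeholder for any text-completion backend; the lexical scoring
# below is a toy stand-in for a real retriever/reranker.
from typing import Callable, List

def rewrite(query: str, llm: Callable[[str], str]) -> str:
    """Query Rewriter: expand or disambiguate the user's intent."""
    return llm(f"Rewrite this search query to be specific and self-contained: {query}")

def retrieve(query: str, corpus: List[str], k: int = 5) -> List[str]:
    """Retriever: toy lexical-overlap scoring as a stand-in for BM25 or a dense index."""
    terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: len(terms & set(doc.lower().split())), reverse=True)
    return scored[:k]

def rerank(query: str, candidates: List[str], llm: Callable[[str], str]) -> List[str]:
    """Reranker: ask the LLM for a pointwise relevance grade per candidate."""
    def grade(doc: str) -> int:
        reply = llm(f"Rate relevance 0-3 of the passage to the query.\nQuery: {query}\nPassage: {doc}\nAnswer with one digit:")
        return int(reply.strip()[0]) if reply.strip()[:1].isdigit() else 0
    return sorted(candidates, key=grade, reverse=True)

def read(query: str, evidence: List[str], llm: Callable[[str], str]) -> str:
    """Reader: grounded answer generation conditioned on retrieved evidence (RAG-style)."""
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(evidence))
    return llm(f"Answer using only the numbered passages and cite them.\n{context}\nQuestion: {query}")

def answer(query: str, corpus: List[str], llm: Callable[[str], str]) -> str:
    q = rewrite(query, llm)
    candidates = retrieve(q, corpus)
    top = rerank(q, candidates, llm)[:3]
    return read(q, top, llm)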

Operational Perspective: Implementation Challenges in Generative Search

Generative search systems, exemplified by Google's AI Overviews (introduced through Search Generative Experience / SGE experiments and later expanded), now provide direct answers on the Search Engine Results Page (SERP). According to official Google Search Central specifications, this system is not merely a list of links but an aggregator that may use a "query fan-out" technique to issue multiple related queries and synthesize information. However, 2025 research reports by the UK communications regulator (Ofcom) highlight a critical trust gap: users are "more likely to rely on traditional search" for high-stakes domains (e.g., health, finance). The accuracy of generated answers, citation transparency, and the opaque impact on the Web ecosystem's traffic remain critical, unresolved limits.
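Google does not publish the fan-out algorithm itself; the sketch below only illustrates the documented idea of issuing several related sub-queries and merging their evidence before synthesis. The propose_subqueries and search callables are hypothetical placeholders, not part of any public API.

# Illustrative sketch of a "query fan-out" step: the original query is decomposed
# into related sub-queries, each is searched independently, and the evidence is
# merged before answer synthesis. Production behavior is not publicly specified.
from typing import Callable, Dict, List

def fan_out(query: str,
            propose_subqueries: Callable[[str], List[str]],
            search: Callable[[str], List[str]],
            max_subqueries: int = 5) -> Dict[str, List[str]]:
    """Issue several related sub-queries and collect de-duplicated evidence per sub-query."""
    evidence: Dict[str, List[str]] = {}
    seen = set()
    for sub in propose_subqueries(query)[:max_subqueries]:
        hits = [doc for doc in search(sub) if doc not in seen]
        seen.update(hits)
        evidence[sub] = hits
    return evidence

# Example with trivial stand-ins:
if __name__ == "__main__":
    subs = lambda q: [q, f"{q} pricing", f"{q} security"]
    search = lambda q: [f"result for: {q}"]
    print(fan_out("best CMS for small business", subs, search))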



2. Strategic Research Map (Selected 20+ Papers)

Key papers driving the evolution of LLM-IR are categorized into six clusters. All papers listed include persistent identifiers (DOI, ACL Anthology, OpenReview, PMLR, or official proceedings IDs).

A. Survey & Foundations

Year | Title | Venue / Source | Key Contribution
2025 | Large Language Models for Information Retrieval: A Survey | ACM TOIS (DOI: 10.1145/3748304) | Systematizes the role of LLMs in IR, categorizing them into modules such as Rewriter, Retriever, Reranker, and Reader.
2024 | Dense Text Retrieval Based on Pretrained Language Models: A Survey | ACM TOIS | A comprehensive survey focusing on dense retrieval methodologies evolved from BERT-based models.
2025 | From Matching to Generation: A Survey on Generative Information Retrieval | ACM TOIS | Provides a structured overview of the paradigm shift from "matching-based" to "generative" information retrieval.
2021 | LaMDA: our breakthrough conversation technology | Google Official Blog (Accessed: 2026-01-23) | [Fact] Official disclosure of Google's conversational model design philosophy, prioritizing safety, groundedness, and factual accuracy.

B. Retrieval Models (Retriever)

Year | Title | Venue / Source | Key Contribution
2020 | Dense Passage Retrieval for Open-Domain Question Answering (DPR) | EMNLP 2020 | [Fact] Established the standard for dense retrieval using dual encoders, outperforming BM25 on the open-domain QA benchmarks evaluated in the paper (scoring rule sketched below).
2021 | Approximate Nearest Neighbor Negative Contrastive Learning (ANCE) | ICLR 2021 | Improved DPR training efficiency and retrieval accuracy by refining negative sampling techniques.
2022 | Unsupervised Dense Information Retrieval with Contrastive Learning (Contriever) | TMLR 2022 | Proposed a general-purpose dense retriever capable of high performance using unsupervised learning.
2021 | BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models | NeurIPS 2021 (Datasets and Benchmarks Track) | Established a comprehensive benchmark for evaluating retrieval models in zero-shot settings across diverse domains.
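The dual-encoder scoring that DPR established can be summarized in a few lines: queries and passages are embedded independently and ranked by dot product, which is what allows passage vectors to be indexed offline. The "encoder" below is a random projection over bag-of-words counts purely for illustration; a production system would use a trained transformer encoder and an approximate nearest neighbor index.

# Toy sketch of DPR-style dual-encoder retrieval: queries and passages are embedded
# independently and scored by dot product, so passage vectors can be pre-indexed.
import numpy as np

VOCAB = {}  # token -> dimension, grown on the fly (toy featurizer, not a tokenizer)

def bow(text: str, dim: int = 512) -> np.ndarray:
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[VOCAB.setdefault(tok, len(VOCAB) % dim)] += 1.0
    return vec

rng = np.random.default_rng(0)
PROJ = rng.normal(size=(512, 128))  # stand-in for learned encoder weights

def encode(text: str) -> np.ndarray:
    v = bow(text) @ PROJ
    return v / (np.linalg.norm(v) + 1e-9)

def retrieve(query: str, passages: list[str], k: int = 3) -> list[tuple[float, str]]:
    index = np.stack([encode(p) for p in passages])  # built offline in a real system
    scores = index @ encode(query)                   # dot-product relevance
    order = np.argsort(-scores)[:k]
    return [(float(scores[i]), passages[i]) for i in order]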

C. Reranking & Interaction (Reranker)

Year | Title | Venue / Source | Key Contribution
2020 | ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT | SIGIR 2020 (DOI: 10.1145/3397271.3401075) | [Fact] Improves efficiency vs. cross-encoder reranking while maintaining strong effectiveness via a late interaction architecture (MaxSim rule sketched below).
2021 | RocketQA: An Optimized Training Approach to Dense Passage Retrieval | NAACL 2021 | Optimized DPR/reranking training using advanced techniques such as cross-batch negative sampling.
2021 | SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking | SIGIR 2021 | Achieved high-speed, high-precision search with standard inverted indexes via sparse lexical representation learning.
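The late interaction rule behind ColBERT reduces to a MaxSim aggregation: each query token embedding is matched to its best document token embedding and the maxima are summed. The sketch below uses random embeddings since only the scoring rule is the point; it also hints at why storage and compute grow with document length (see Limit-05).

# Sketch of late interaction (MaxSim) scoring in the style of ColBERT.
# Embeddings here are random placeholders; only the scoring rule is illustrated.
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """query_embs: (num_query_tokens, d); doc_embs: (num_doc_tokens, d), L2-normalized."""
    sim = query_embs @ doc_embs.T        # token-level similarity matrix
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(8, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
    d = rng.normal(size=(120, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
    print(maxsim_score(q, d))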

D. RAG & Grounding (Augmented Generation)

Year | Title | Venue / Source | Key Contribution
2020 | Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG) | NeurIPS 2020 | [Fact] Incorporated retrieval mechanisms directly into generative models, achieving state-of-the-art results on knowledge-intensive tasks (combination rule sketched below).
2020 | REALM: Retrieval-Augmented Language Model Pre-Training | ICML 2020 (PMLR) | Introduced a language model architecture that integrates retrieval during the pre-training phase.
2022 | Improving language models by retrieving from trillions of tokens (RETRO) | ICML 2022 (PMLR) | DeepMind's scalable RAG architecture capable of performing retrieval over datasets of trillions of tokens.
2023 | Enabling Large Language Models to Generate Text with Citations (ALCE) | EMNLP 2023 (ACL: 2023.emnlp-main.417) | Established a benchmark for evaluating the correctness of citation attribution in generated text.
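The combination rule at the heart of RAG-Sequence (Lewis et al., 2020) marginalizes the answer likelihood over the retrieved documents, weighting each document's generator likelihood by its retrieval probability. The sketch below shows only that log-sum-exp combination; the probability functions are placeholders, not a model implementation.

# Sketch of the RAG-Sequence marginalization:
# log p(answer | question) ~= logsumexp_d [ log p(d | q) + log p(answer | q, d) ].
import math
from typing import Callable, List

def rag_sequence_log_prob(
    question: str,
    answer: str,
    docs: List[str],
    retrieval_log_prob: Callable[[str, str], float],        # log p(doc | question)
    generation_log_prob: Callable[[str, str, str], float],  # log p(answer | question, doc)
) -> float:
    """Marginalize the answer likelihood over the top-k retrieved documents."""
    terms = [retrieval_log_prob(doc, question) + generation_log_prob(answer, question, doc)
             for doc in docs]
    m = max(terms)  # numerically stable log-sum-exp
    return m + math.log(sum(math.exp(t - m) for t in terms))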

E. Agentic & Reasoning (Autonomous Search)

Year | Title | Venue / Source | Key Contribution
2023 | ReAct: Synergizing Reasoning and Acting in Language Models | ICLR 2023 | [Fact] Established a method in which models alternate between Reasoning (thought) and Acting (search/API calls) (control loop sketched below).
2022 | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | NeurIPS 2022 | Demonstrated improved complex task resolution by explicitly generating the reasoning process (CoT).
2023 | Toolformer: Language Models Can Teach Themselves to Use Tools | NeurIPS 2023 | Demonstrated that LLMs can learn to use external tools (e.g., search engines) via self-supervised learning.
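The ReAct control loop alternates a free-text Thought step with an Action step that calls a tool, feeding the tool output back as an Observation. The sketch below fixes one possible text convention for parsing actions; the llm and search callables and the "Final Answer:" / "Action: search[...]" markers are assumptions for illustration, not the paper's exact prompt format.

# Minimal sketch of a ReAct-style Thought / Action / Observation loop.
from typing import Callable

def react(question: str, llm: Callable[[str], str], search: Callable[[str], str],
          max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")       # model continues the transcript
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action: search[" in step:
            query = step.split("Action: search[", 1)[1].split("]", 1)[0]
            transcript += f"Observation: {search(query)}\n"  # tool output fed back
    return "No answer within step budget."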

F. Evaluation & Benchmarks (LLM-as-a-judge)

Year | Title | Venue / Source | Key Contribution
2024 | Report on the 1st Workshop on Large Language Model for Evaluation in IR (LLM4Eval) | SIGIR Forum 2024 | [Fact] Report on the first major SIGIR workshop dedicated to IR evaluation (automatic labeling) using LLMs.
2023 | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | NeurIPS 2023 (Datasets and Benchmarks Track) | Large-scale verification of agreement between LLM and human evaluation.
2023 | FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance | arXiv 2023 | Demonstrated cost reduction and improved evaluation efficiency through the cascaded use of LLMs (cascade sketched below).
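A FrugalGPT-style cascade can be sketched as an ordered list of judges, escalating from cheap to expensive only when the cheaper verdict is not confident enough. The confidence convention and the threshold below are assumptions, not the paper's exact recipe.

# Sketch of a cascaded judging strategy: try a cheap model first, escalate only on
# low self-reported confidence. Judge callables are placeholders.
from typing import Callable, List, Tuple

def cascaded_judge(prompt: str,
                   judges: List[Tuple[str, Callable[[str], Tuple[str, float]]]],
                   confidence_threshold: float = 0.8) -> Tuple[str, str]:
    """judges: ordered cheap-to-expensive list of (name, fn); fn returns (label, confidence)."""
    label, name = "unknown", "none"
    for name, judge in judges:
        label, confidence = judge(prompt)
        if confidence >= confidence_threshold:
            break  # accept the cheaper verdict; do not escalate further
    return label, name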


3. Critical Limits (8 Points)

Current technical and operational limitations are described strictly within the scope of facts established by peer-reviewed papers and public reports. Observation Logs are appended to illustrate how these limits manifest in actual user experience.

Limit-01: Hallucination and Information Inaccuracy

Limit-02: Citation Inaccuracy and Verification Difficulty

  • [Verified Fact]: Benchmark evaluations have confirmed cases where citations (source links) provided by generative models do not support the generated text or do not exist (Hallucinated Citations). A naive support check is sketched after this list.

  • Evidence: Gao et al., EMNLP 2023 (ALCE, ACL: 2023.emnlp-main.417)

  • Observation Log:

    • Disclaimer: This log is an observation sample and not a verified claim.

    • Env: US/English, 2025-01-20

    • Log: Query "AI hallucination rates 2024" -> Generated summary claims "near zero error rate" -> Link points to an unrelated marketing blog (Mismatch Type: Irrelevant Source).

    • (Source: GhostDrift Institute Substack - Context Data)

  • Scope: Academic and verification tasks requiring high citation precision.
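A naive lexical check of the kind Limit-02 motivates is sketched below: it flags claims whose cited source shares almost no content words with the generated sentence. Real attribution evaluation (e.g., ALCE) relies on NLI-style entailment models; token overlap is only a cheap first filter and will miss paraphrases.

# Heuristic citation-support check: does the cited source lexically support the claim?
def citation_overlap(claim: str, cited_text: str) -> float:
    """Fraction of content words in the claim that also appear in the cited source."""
    stop = {"the", "a", "an", "of", "to", "and", "is", "are", "in", "on", "for"}
    claim_terms = {w for w in claim.lower().split() if w not in stop}
    source_terms = set(cited_text.lower().split())
    return len(claim_terms & source_terms) / max(len(claim_terms), 1)

def flag_unsupported(claim: str, cited_text: str, threshold: float = 0.3) -> bool:
    """Flag likely 'Irrelevant Source' mismatches for human review."""
    return citation_overlap(claim, cited_text) < threshold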

Limit-03: User Distrust in High-Risk Domains

  • [Verified Fact]: Research by Ofcom ("Generative AI search: consumer experiences") indicates that users are "more likely to rely on traditional search" for high-stakes domains such as health and finance.

  • Evidence: Ofcom: User experiences of Generative AI Search (2025-09-26) (Publicly available on Ofcom website; p.4 Key Findings)

  • Non-Claim: Does not assert that this tendency applies universally across all demographics or regions.

  • Scope: YMYL (Your Money Your Life) queries.

Limit-04: LLM Evaluator Bias (Position Bias / Self-Enhancement)

  • [Verified Fact]: When using LLMs as judges, systematic biases such as Position Bias (preferring answers in specific positions) and Self-Enhancement (favoring their own generated content) have been statistically observed in peer-reviewed studies. A swap-based mitigation for position bias is sketched after this list.

  • Evidence: Wang et al., ACL 2024 (ACL: 2024.acl-long.468) (Positional Bias); Zheng et al., NeurIPS 2023 (Self-Enhancement)

  • Non-Claim: Does not claim that automated evaluation is always inferior to human evaluation.

  • Scope: Automated evaluation pipelines, RLHF.
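A common mitigation for position bias in pairwise judging is to evaluate both presentation orders and accept a verdict only when it is stable under the swap; otherwise the comparison is treated as a tie. The sketch below assumes a judge callable that returns "A" or "B" for the first or second answer shown.

# Order-swap mitigation for position bias in pairwise LLM judging.
from typing import Callable, Optional

def debiased_pairwise_judgment(question: str, answer_a: str, answer_b: str,
                               judge: Callable[[str, str, str], str]) -> Optional[str]:
    """Return 'A', 'B', or None (tie/inconsistent) after swapping presentation order."""
    first = judge(question, answer_a, answer_b)                # A shown first
    second = judge(question, answer_b, answer_a)               # B shown first
    second_mapped = {"A": "B", "B": "A"}.get(second, second)   # map back to original labels
    return first if first == second_mapped else None           # inconsistent -> tie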

Limit-05: Computational Cost and Latency

  • [Verified Fact]: Late Interaction models (e.g., ColBERT) require significantly higher computational resources (FLOPs) compared to traditional sparse vector search to maintain effectiveness.

  • Evidence: Khattab & Zaharia, SIGIR 2020 (DOI: 10.1145/3397271.3401075)

  • Non-Claim: Does not deny the possibility of future mitigation through hardware optimization or distillation techniques.

  • Scope: Real-time search, massive index environments.

Limit-06: Opaque Impact on Traffic (Zero-Click Reporting)

  • [Verified Fact]: Google Search Central documentation indicates that clicks on links in AI Overviews are counted as standard search clicks in Search Console performance reports (granular feature-specific metrics are not separated by default). Regulatory bodies (Ofcom) have noted concerns from publishers regarding potential traffic reductions.

  • Evidence: Google Search Central: AI Overviews (Accessed: 2026-01-23) (Performance Reporting); Ofcom: The Era of Answer Engines (2025-11-04)

  • Non-Claim: Does not assert that traffic decreases for every website in every scenario.

  • Scope: Web publishers, SEO strategy.

Limit-07: Complexity via Query Fan-out

  • [Verified Fact]: Google documentation states that AI Overviews may use a "query fan-out" technique to break down queries and synthesize information from multiple sources.

  • Evidence: Google Search Central: AI Overviews (Accessed: 2026-01-23) (How it works section)

  • Observation Log:

    • Disclaimer: This log is an observation sample and not a verified claim.

    • Env: US/English, 2025-01-22

    • Log: Query "Best CMS for small business" -> System generates 5 distinct sub-queries (e.g., "pricing", "security") -> Results merge disparate sources, occasionally conflating features.

    • (Source: GhostDrift Institute Substack - Context Data)

  • Scope: Complex queries (Multi-hop queries).

Limit-08: Evasion of Safety Filters (Jailbreaking)

  • [Verified Fact]: Research exists on adversarial prompt (Jailbreaking) techniques against LLMs, and public authorities cite this as a risk factor in AI services, including generative search.

  • Evidence: Zhu et al., ACM TOIS 2025 (DOI: 10.1145/3748304); Ofcom Report (2025)

  • Non-Claim: Does not assert that general users can routinely bypass filters in commercial search engines.

  • Scope: Safety Guardrails.


4. Impact Analysis (Fact vs Interpretation)

This section clearly distinguishes between Verified Facts, Interpretations derived from them, and Context Cases that support them structurally.

Theme: The Shift from Search to Answers

  • [Verified Fact]: Google Search Central documentation states that AI Overviews may use "Query Fan-out" to synthesize information. Ofcom surveys (2025) show that users "rely on traditional search" for high-stakes topics.

  • Interpretation: Search engines are structurally transforming into "Answer Engines," yet user trust has not fully migrated. SEO strategies must shift from keyword optimization to "citation acquisition by AI."

Theme: Automation of Evaluation and Shifting Responsibility

  • [Verified Fact]: Research from SIGIR Forum (2024) and NeurIPS (2023) indicates that while LLM-based automatic evaluation is viable, systematic biases (such as position bias) exist.

  • Interpretation: While automation lowers evaluation costs, it creates a closed loop where "AI evaluates AI." This harbors a structural risk of obscuring the evaluator's responsibility.

  • Context Case (Systemic Risk): The structural models regarding "absence of responsibility boundaries" and "impossibility of retroactive justification" presented in the ADIC Ledger (GhostDrift Research) are isomorphic to this risk structure in IR evaluation, serving as a cautionary reference for system design (Context).


5. Search Log & Supplementary Sources

Primary search queries and verified sources used in the creation of this report.

Verified Sources (Fact Basis):

  • ACM Digital Library: TOIS, SIGIR Proceedings

  • ACL Anthology: EMNLP, NAACL, ACL Proceedings

  • NeurIPS / ICLR / ICML: Peer-reviewed ML Conferences (OpenReview / PMLR / Proceedings Hash)

  • Google Search Central: Official Documentation (Accessed 2026-01-23)

  • Ofcom: Official Reports (User experiences 2025 / Answer Engines 2025)

Context Sources (Observation & Structural Models):

  • GhostDrift Institute Substack: Google Overview Behavior Observation Logs (Link)

  • ADIC Ledger (GhostDrift Research): Model cases of responsibility structure and institutional design (Link)

 
 
 
