The answer is appealing in its logic. Large language models, trained on general corpora and updated infrequently, are unreliable sources of current biomedical knowledge. They hallucinate citations. They confuse drug names. They generate plausible-sounding clinical summaries that do not accurately reflect the evidence base they purport to describe. When ChatGPT was asked about medications for peripheral artery disease patients without increased bleeding risk, it omitted low-dose rivaroxaban — a clinically significant omission documented in the PMC systematic review of RAG in healthcare contexts published in 2025.
RAG addresses this by inserting a retrieval step between the user's query and the model's response. Instead of relying on parametric knowledge encoded in model weights, a RAG system retrieves relevant documents from an external corpus — PubMed, clinical guidelines, institutional knowledge bases — and provides those documents as context for the model's generation. The model does not need to know the answer; it needs to read the documents that contain the answer and synthesize their content accurately.
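The mechanism is simple enough to sketch. The fragment below wires a query to a small in-memory corpus and hands the retrieved passages to a generator as context; the corpus, the embedding model, and the generate() stub are placeholders for whatever a given deployment actually uses, not a description of any specific system discussed here.

```python
# Minimal retrieval-augmented generation sketch (illustrative only).
# Assumes the sentence-transformers library; the corpus, the embedding
# model name, and the generate() stub are placeholders rather than a
# description of any specific system discussed in the text.
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Low-dose rivaroxaban combined with aspirin is an option for patients "
    "with peripheral artery disease who are not at increased bleeding risk.",
    "Statin therapy is recommended for patients with symptomatic peripheral "
    "artery disease.",
    "Cilostazol can improve walking distance in intermittent claudication.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the query by cosine similarity."""
    query_emb = encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    return [corpus[hit["corpus_id"]] for hit in hits]

def build_prompt(query: str, passages: list[str]) -> str:
    """Anchor the generation step to the retrieved passages."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the numbered passages below, citing them by number.\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )

query = "Which antithrombotic options suit PAD patients without high bleeding risk?"
prompt = build_prompt(query, retrieve(query))
# answer = generate(prompt)  # any LLM call: the model reads the evidence, it does not recall it
```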
This works. The evidence from 143 papers surveyed in the 2025 RAG in biomedicine review confirms measurable improvements in factual accuracy across a range of biomedical question-answering tasks. The MEGA-RAG framework, published in PMC in 2025, demonstrated a reduction in hallucination rates of over 40% compared to standalone LLM use, through a combination of multi-source evidence retrieval that pairs FAISS-based dense retrieval with BM25 keyword search, cross-encoder reranking for semantic relevance, and a discrepancy-aware refinement module.
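MEGA-RAG's retrieval stack is described only at the level of its components, so the sketch that follows is a generic hybrid-retrieval-and-reranking pipeline of the same shape rather than a reconstruction of that framework: BM25 and a FAISS index each nominate candidates, and a cross-encoder reorders the merged pool. The model names and the three-document corpus are placeholders.

```python
# Hybrid retrieval with cross-encoder reranking (generic sketch, not MEGA-RAG).
# Assumes rank_bm25, faiss, and sentence-transformers; the model names and the
# three-document corpus are placeholders.
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

docs = ["...abstract one...", "...abstract two...", "...abstract three..."]

# Sparse index: BM25 over whitespace-tokenized text.
bm25 = BM25Okapi([d.lower().split() for d in docs])

# Dense index: FAISS inner-product search over normalized embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = np.asarray(encoder.encode(docs, normalize_embeddings=True), dtype="float32")
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def hybrid_retrieve(query: str, k_each: int = 10, k_final: int = 5) -> list[str]:
    """Pool BM25 and dense candidates, then reorder the pool with a cross-encoder."""
    k = min(k_each, len(docs))
    sparse_ids = np.argsort(bm25.get_scores(query.lower().split()))[::-1][:k]
    q_emb = np.asarray(encoder.encode([query], normalize_embeddings=True), dtype="float32")
    _, dense_ids = index.search(q_emb, k)
    candidates = sorted({*sparse_ids.tolist(), *dense_ids[0].tolist()})
    scores = reranker.predict([(query, docs[i]) for i in candidates])
    reranked = sorted(zip(scores, candidates), reverse=True)
    return [docs[i] for _, i in reranked[:k_final]]
```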
RAG works. The problem is that the community has begun treating it as a solution to a problem it only partially addresses.
What RAG Actually Does
To understand what RAG fixes, it is necessary to be precise about what the hallucination problem in biomedical AI actually is.
There are, broadly, three distinct failure modes that get collected under the label "hallucination" in biomedical contexts.
The first is factual confabulation: the model generates a claim — a drug interaction, a clinical finding, a study result — that has no basis in its training data but is presented as if it does. This is the failure mode that RAG most directly addresses. By grounding generation in retrieved documents, RAG forces the model to produce outputs that are at least anchored to real text. Confabulated facts are harder to generate when the model has access to documents that would contradict them.
The second is outdated information: the model provides accurate information as of its training cutoff that has since been superseded by new evidence. This is also addressed by RAG, assuming the retrieval corpus is current — which, for PubMed and ClinVar, it generally is. Dynamic environments are RAG's stated advantage over fine-tuning: the knowledge base can be updated without retraining the model.
The third is retrieval failure: the system retrieves the wrong documents, or retrieves the right documents and the model then synthesizes them inaccurately. This failure mode is not eliminated by RAG — it is the failure mode that RAG introduces. A system that retrieves irrelevant documents with high confidence, or that retrieves relevant documents and then misrepresents their content, produces outputs that are more dangerous than simple confabulation precisely because they carry the credibility of citation. The systematic review of RAG in healthcare published in PMC in 2025, which analysed 70 studies from 2020 to 2025, identifies the lack of unified benchmarks to standardise evaluation of RAG systems as a critical gap — meaning the field does not yet have reliable methods for determining, across diverse clinical contexts, how frequently retrieval failure occurs and what its consequences are.
The Infrastructure Problem That RAG Sits On Top Of
The deeper issue — and the one that the biomedical library community is best positioned to address — is that RAG is only as good as the corpus it retrieves from. And the corpus it typically retrieves from has problems that are upstream of any model architecture choice.
The MedRAGent system, posted to medRxiv in 2025, takes a specific approach to this problem: rather than using free-text queries to retrieve from PubMed, it translates user queries into formal Boolean queries using MeSH controlled vocabulary, on the grounds that MeSH-structured queries produce more reproducible and semantically precise retrieval than free-text alternatives. The evaluation on 53,054 article records from six research topics shows meaningful improvements in recall compared to systems that use LLM-generated free-text queries directly.
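The principle is easy to see in miniature. The sketch below maps free-text concepts to MeSH headings through a small hand-made lookup table and assembles a Boolean PubMed query from them; MedRAGent performs that translation with an LLM, so the lookup table here is purely a stand-in for that step, and only the E-utilities endpoint is real.

```python
# Building a MeSH-tagged Boolean query for PubMed (illustrative sketch).
# The concept-to-MeSH mapping is a hand-made stand-in for the LLM translation
# step described for MedRAGent; the E-utilities esearch endpoint is real.
import requests

MESH_MAP = {
    "peripheral artery disease": "Peripheral Arterial Disease",
    "rivaroxaban": "Rivaroxaban",
    "bleeding risk": "Hemorrhage",
}

def to_boolean_query(concepts: list[str]) -> str:
    """AND together MeSH-tagged clauses; fall back to free text for unmapped concepts."""
    clauses = []
    for concept in concepts:
        heading = MESH_MAP.get(concept.lower())
        clauses.append(f'"{heading}"[MeSH Terms]' if heading else f'"{concept}"[All Fields]')
    return " AND ".join(clauses)

query = to_boolean_query(["peripheral artery disease", "rivaroxaban"])
# '"Peripheral Arterial Disease"[MeSH Terms] AND "Rivaroxaban"[MeSH Terms]'

resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "pubmed", "term": query, "retmode": "json", "retmax": 20},
    timeout=30,
)
pmids = resp.json()["esearchresult"]["idlist"]
```

The [MeSH Terms] tag restricts matching to the headings indexers have assigned, which is exactly why the quality of that indexing determines what such a query can find.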
This finding is not new. It restates, in the context of LLM-integrated retrieval, a principle that information scientists have understood for decades: controlled vocabulary improves recall in domain-specific literature search because it collapses the synonym variation, and resolves the ambiguity, that natural language query formulation systematically introduces.
What this means for RAG-based biomedical systems is that their retrieval quality is constrained by the quality of the metadata describing the documents in their corpus. A PubMed record with incomplete or incorrect MeSH indexing is a record that a well-designed RAG system may fail to retrieve in contexts where it would be clinically relevant. A clinical trial record with inconsistent use of controlled outcome terminology may or may not be retrieved depending on the arbitrary vocabulary choices of the query author rather than the actual relevance of the trial.
The hallucination problem, reframed: the outputs of biomedical RAG systems are unreliable not only because language models are imperfect synthesizers, but because the documents they retrieve are imperfectly described. The model synthesizes what it finds. If what it finds is incomplete, the synthesis is incomplete in ways that the model cannot detect and the user cannot identify without independent verification.
What the Evidence Base Actually Supports
The 2025 PMC systematic review of RAG in healthcare, which surveyed 70 studies, identifies the key tensions in the current evidence base with appropriate specificity.
RAG improves accuracy on benchmark question-answering tasks — PubMedQA, MedMCQA, clinical reasoning benchmarks — compared to standalone LLM use. This is the finding that drives the field's enthusiasm for the approach, and it is real. Fine-tuned open-source models combined with RAG, evaluated on PubMedQA and MedMCQA, show measurable improvement over both fine-tuned models without RAG and RAG-enabled models without fine-tuning.
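The comparison those studies report reduces to a simple ablation: the same model answers the same benchmark questions with and without retrieved context, and the accuracy difference is the measured effect. A minimal harness, assuming a PubMedQA-style set of question, context, and label fields and a hypothetical answer() wrapper around the model under test, looks like this:

```python
# Minimal ablation harness: benchmark accuracy with vs. without retrieved context.
# The items list stands in for PubMedQA-style rows (question, retrieved context,
# gold label), and answer() is a hypothetical wrapper around the model under test.
from collections.abc import Callable

items = [
    {"question": "...", "context": "...retrieved passages...", "label": "yes"},
    # ...
]

def accuracy(answer: Callable[[str], str], use_context: bool) -> float:
    """Fraction of benchmark items answered correctly under one condition."""
    correct = 0
    for item in items:
        prompt = item["question"]
        if use_context:
            prompt = f"Context:\n{item['context']}\n\nQuestion: {item['question']}"
        correct += answer(prompt).strip().lower() == item["label"]
    return correct / len(items)

# rag_gain = accuracy(answer, use_context=True) - accuracy(answer, use_context=False)
```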
What the same review identifies as the major gap is that benchmark performance does not reliably predict clinical performance. The benchmarks on which RAG systems are evaluated were designed to assess question-answering accuracy under conditions that differ in important ways from clinical deployment: controlled question sets, curated corpora, defined answer spaces. Real clinical information needs are more ambiguous, more heterogeneous, and more dependent on institutional context than any benchmark can capture.
The reviewers also identify the absence of standardised evaluation frameworks for RAG in healthcare as the field's most significant methodological weakness. Without agreed methods for assessing retrieval quality — not just answer accuracy — the field cannot systematically distinguish between systems that work and systems that appear to work on the benchmarks available.
What Biomedical Libraries Can Contribute
The contribution that biomedical library expertise can most directly make to this problem is not in the model architecture. It is in the corpus.
The quality of retrieval in RAG systems depends on the quality of the metadata that indexes the documents being retrieved. Biomedical libraries maintain — or have historically maintained — the controlled vocabulary infrastructure, the cataloging standards, and the metadata quality assurance processes that make document collections discoverable in semantically meaningful ways. MeSH, NLM's Medical Subject Headings system, is the most significant example: a controlled vocabulary maintained by dedicated specialists, updated annually, and applied to over 36 million citations in PubMed.
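Because that indexing is attached to each record, its presence and completeness can be inspected directly. The sketch below pulls the MeSH headings assigned to a single PubMed citation through the public E-utilities efetch endpoint; the PMID is a placeholder, and the parsing assumes the standard PubMed XML layout.

```python
# Inspecting the MeSH indexing applied to one PubMed record (illustrative sketch).
# Uses NCBI's public E-utilities efetch endpoint; the PMID below is a placeholder.
import requests
import xml.etree.ElementTree as ET

def mesh_headings(pmid: str) -> list[str]:
    """Return the MeSH descriptor names assigned to a single PubMed citation."""
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
        params={"db": "pubmed", "id": pmid, "retmode": "xml"},
        timeout=30,
    )
    root = ET.fromstring(resp.content)
    return [d.text for d in root.findall(".//MeshHeading/DescriptorName") if d.text]

print(mesh_headings("12345678"))  # placeholder PMID
```

A record that returns a sparse or empty list here is a record that MeSH-anchored retrieval will struggle to surface, whatever the downstream model does.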
What the integration of RAG into biomedical information access reveals is that this infrastructure — taken for granted in library circles and largely invisible to AI developers — is doing load-bearing work in every RAG system that retrieves from PubMed. The quality of the MeSH indexing applied to a given article is a direct determinant of whether that article will be retrieved in contexts where it is clinically relevant. The library community has decades of expertise in assessing and improving that indexing quality. That expertise is not being systematically applied to the evaluation of RAG system performance.
The path from RAG as a partial solution to RAG as a reliable one runs through information science rather than around it. The vocabulary infrastructure, the metadata quality frameworks, the controlled terminology systems — these are not legacy concerns from a pre-AI era. They are the foundation on which AI-based biomedical information access either stands or falls.
The field is building on top of that foundation without examining it carefully enough. The JCDL community, which has been examining it for over two decades, is precisely the community that should be saying so.