In mathematics, when one research group publishes a proof of a theorem, the scientific community evaluates it on its merits and waits for independent replication. When five research groups, working from entirely different theoretical frameworks, arrive at the same conclusion independently, the question then becomes "What does this conclusion ask us to do differently?"
Between 2024 and 2025, five independent research groups published formal proofs that hallucination in LLMs is a structural property of the architecture. Each group approached the problem from a different branch of mathematics: learning theory, computational logic, statistical calibration, mechanism design, and generalization theory. Each arrived at the same conclusion: hallucination cannot be eliminated from LLMs through any combination of better training data, larger models, improved architectures, or more sophisticated post-processing.
For health plan and health system leaders tasked with evaluating clinical AI deployments, this convergence carries a specific and practical implication. If hallucination is structural, then safety cannot be achieved by improving the LLM itself. It must be achieved by building something external to the LLM that verifies its outputs independently.
What "hallucination" means in a clinical context
An LLM hallucination is an output that is fluent, coherent, and wrong. The model generates text that reads as authoritative medical guidance and contains fabricated citations, incorrect dosing, omitted contraindications, or invented clinical findings. The danger is proportional to the plausibility; a recommendation that sounds authoritative but is clinically incorrect is more dangerous than one that is obviously nonsensical, because the former will be acted upon.
The current clinical evidence base documents what hallucination looks like in practice. Wu et al. found that 76.6% of clinical AI errors are omissions, where the model names the correct concept but fails to operationalize the guideline that should follow from it. Brodeur et al. documented that cannot-miss diagnoses were missed approximately 8% of the time across frontier models, with no improvement between model generations. These are not edge cases or adversarial prompts. They are standard clinical scenarios evaluated in published, peer-reviewed studies.
What the convergence of these proofs means for clinical AI
The five proofs discussed in the sections below approach the hallucination problem from five branches of mathematics that share no common assumptions, no shared formalism, and no overlapping proof techniques. Because all five arrive at the same conclusion, that conclusion does not depend on any single framework being correct. Each proof would need to be independently invalidated for the conclusion to be overturned.
More practically, the five proofs collectively close off every major category of engineering response to the hallucination problem. Better training data, improved model architectures, calibration, inference-time reasoning chains, and even domain-specific fine-tuning can all reduce the frequency of hallucination, but none of them eliminates it.
If hallucination cannot be eliminated from LLMs, then any system that relies on an LLM producing correct outputs 100% of the time is architecturally unsound. In clinical contexts, where the consequences of a wrong output include misdiagnosis, inappropriate treatment, and patient harm, the implication is that safety requires an independent verification mechanism that operates outside the LLM. That mechanism must evaluate each output against a known standard (clinical guidelines, not another probabilistic model), must produce its evaluation deterministically (the same input always yields the same verification result), and must generate a structured record of what was checked and what was found.
ANCHOR (Auditable Navigation of Clinical Hazards with Oversight and Reasoning) is Waymark's clinical AI safety layer, built to meet these requirements. It is a verification layer that sits between a clinical LLM and the clinician, checking each output before it reaches the point of care. ANCHOR uses symbolic logic over 3,206 named-society clinical guidelines to verify LLM outputs deterministically, producing a per-output audit trail that ties each verification decision to a specific guideline and citation.
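To make the three requirements above concrete (a known standard, a deterministic result, a structured record), here is a minimal sketch of a rule-based checker. It is a toy illustration under stated assumptions, not Waymark's implementation of ANCHOR: the rule table, rule identifiers, citations, drug names, and dose limits are all hypothetical placeholders, and the LLM output is assumed to have already been parsed into a structured recommendation.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    rule_id: str         # hypothetical identifier, not a real guideline ID
    citation: str        # placeholder citation text
    drug: str
    max_daily_mg: float  # placeholder limit, not clinical guidance


@dataclass(frozen=True)
class Finding:
    rule_id: str
    citation: str
    passed: bool
    detail: str


# Hypothetical rule table standing in for a curated guideline base.
RULES = (
    Rule("EX-001", "Example guideline A, section 1 (placeholder)", "drug_a", 100.0),
    Rule("EX-002", "Example guideline B, section 2 (placeholder)", "drug_b", 40.0),
)


def verify(recommendation: dict) -> List[Finding]:
    """Check one structured recommendation against every applicable rule."""
    findings = []
    for rule in RULES:
        if rule.drug != recommendation["drug"]:
            continue
        dose = recommendation["daily_mg"]
        detail = f"daily_mg={dose} vs max_daily_mg={rule.max_daily_mg}"
        findings.append(
            Finding(rule.rule_id, rule.citation, dose <= rule.max_daily_mg, detail)
        )
    return findings


def audit_record(recommendation: dict) -> str:
    """Return a JSON audit record; identical input yields an identical record."""
    findings = verify(recommendation)
    record = {
        "input": recommendation,
        # Fail closed: an output that no rule covers is not approved.
        "approved": bool(findings) and all(f.passed for f in findings),
        "findings": [asdict(f) for f in findings],
    }
    return json.dumps(record, sort_keys=True)


print(audit_record({"drug": "drug_a", "daily_mg": 150.0}))
```

The checker contains no sampling and no learned component, so the same recommendation always produces the same record. Each finding carries the rule and citation it was checked against, which is the property an audit trail depends on.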
What follows is a synthesis of each proof: what it demonstrates, and which engineering approach to hallucination it rules out.
Proof 1: Learning theory says LLMs cannot learn all computable functions
Xu, Jain, and Kankanhalli formalized a specific definition of hallucination: an inconsistency between the output of a computable LLM and a computable ground truth function. Essentially, the LLM says one thing and the correct answer is another, and both the LLM and the correct answer are computable.
The proof demonstrates that no LLM, regardless of its training algorithm, training dataset, or model size, can learn all computable functions. There will always exist inputs for which the model produces outputs that are inconsistent with the ground truth. This is a result from formal learning theory, meaning it does not depend on the specific architecture of transformer models or the quality of any particular training dataset. It applies to any computable model.
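Results of this shape are commonly established with a diagonal argument: given any list of models, construct a ground truth that disagrees with each model on at least one input. The sketch below is a finite toy version of that idea. It assumes this style of argument and does not reproduce the published proof; the three example models and the helper name `diagonal_ground_truth` are hypothetical.

```python
from typing import Callable, List

Model = Callable[[int], int]


def diagonal_ground_truth(models: List[Model]) -> Callable[[int], int]:
    """Build f such that f(i) != models[i](i) for every listed model."""
    def f(i: int) -> int:
        if i < len(models):
            # Differ from model i on input i by construction.
            return models[i](i) + 1
        return 0
    return f


# Three hypothetical "models", each a computable function on integers.
models = [lambda x: 0, lambda x: x, lambda x: x * x]
truth = diagonal_ground_truth(models)

# Every listed model is wrong on at least one input: its own index.
disagreements = [i for i, m in enumerate(models) if m(i) != truth(i)]
print(disagreements)
```

However the list of models is chosen, the constructed ground truth is still computable and still differs from every model somewhere. That is the intuition behind "no model can learn all computable functions."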
If an organization's strategy for reducing hallucination relies on curating better datasets, extending training runs, or developing improved training procedures, Xu et al. establish that these approaches can reduce the frequency of hallucination on specific input distributions without eliminating it. For any computable LLM, there will exist an infinite set of inputs that produce hallucinated outputs. In clinical terms, better training may reduce how often an LLM gives a wrong answer, but it cannot guarantee that any specific answer is right.
Proof 2: Computational logic says every stage of the LLM pipeline will hallucinate
In 2024, Banerjee, Agarwal, and Singla published an argument that proceeds stage by stage through the LLM pipeline: training data compilation, fact retrieval, intent classification, and text generation. At each stage, they demonstrate that there is a non-zero probability of producing a hallucination, and that this probability cannot be reduced to zero through any architectural improvement, dataset enhancement, or fact-checking mechanism.
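A short worked calculation shows why per-stage error matters. If each stage fails with some small probability and the stages are treated as independent (a simplifying assumption made here, not a claim from the paper), the chance that at least one stage fails is 1 minus the product of the per-stage success probabilities. The rates below are hypothetical placeholders, not measured values.

```python
from math import prod
from typing import Iterable


def pipeline_error_probability(stage_error_rates: Iterable[float]) -> float:
    """P(at least one stage errs) = 1 - prod(1 - p_i), assuming independence."""
    return 1.0 - prod(1.0 - p for p in stage_error_rates)


# Hypothetical per-stage error rates, for illustration only.
stages = {
    "training data compilation": 0.001,
    "fact retrieval": 0.002,
    "intent classification": 0.001,
    "text generation": 0.003,
}

overall = pipeline_error_probability(stages.values())
print(f"overall error probability: {overall:.4%}")
```

With these placeholder rates the overall figure is just under 0.7%, higher than any single stage. As long as any stage has a non-zero rate, the product of success probabilities stays below 1 and the overall error probability stays above zero.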
The proof introduces the concept of "structural hallucination," distinguishing it from the incidental errors that any system might produce. Structural hallucination is inherent to the mathematical and logical structure of the system itself. The argument draws on Gödel's insight that any sufficiently powerful formal system will contain true statements it cannot prove within its own rules. LLMs, as complex formal systems operating over natural language, inherit this limitation. The training data will always be incomplete (there are true facts not in any dataset), and the model's internal mechanisms for classifying intent and generating responses will always carry irreducible uncertainty.
If an organization believes that switching to a newer model architecture, adding a retrieval-augmented generation layer, or bolting on a fact-checking module will eliminate hallucination, Banerjee et al. establish that each of these components introduces its own non-zero probability of error. The problem is not that current architectures are poorly designed. The problem is that the mathematical limitations apply to any architecture operating with natural language.
Proof 3: Statistical calibration theory says well-calibrated models must hallucinate
Kalai and Vempala proved that language models that are statistically well-behaved, meaning their confidence levels accurately reflect how likely they are to be correct, must still hallucinate at a measurable rate.
Consider how a model learns facts from training data. Some facts appear thousands of times: penicillin treats bacterial infections, aspirin inhibits platelet aggregation, and so on. The model sees these facts repeatedly, from multiple sources, in multiple contexts. It learns them reliably. Other facts appear only once: a specific rare drug interaction documented in a single case report, an atypical contraindication noted in one specialty guideline. The model has no way to distinguish a true fact it saw once from a false statement it also saw once.
Kalai and Vempala proved that this creates a mathematical floor on the hallucination rate. The more rare facts a training corpus contains (and medical literature is full of them), the higher the minimum rate at which even a perfectly performing model will hallucinate. This floor cannot be lowered by making the model better at estimating its own confidence. A model that knows exactly how confident it should be will still hallucinate, because the problem is in the structure of the training data, not in the model's self-assessment.
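The floor is often summarized in terms of the share of facts that appear exactly once in the training data, a quantity in the spirit of the Good-Turing estimate. The sketch below computes that singleton share for a toy corpus. It rests on assumptions: the text above does not give a formula, the published bound includes further terms (such as a calibration term) that are omitted here, and the function name `monofact_rate` and the sample facts are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def monofact_rate(observations: Sequence[Hashable]) -> float:
    """Fraction of observations whose fact occurs exactly once in the corpus."""
    counts = Counter(observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observations)


# Toy corpus: two well-attested facts and three seen only once.
corpus = (
    ["common_fact_1"] * 3
    + ["common_fact_2"] * 2
    + ["rare_fact_1", "rare_fact_2", "rare_fact_3"]
)

print(monofact_rate(corpus))  # 3 singletons out of 8 observations -> 0.375
```

A corpus with a long tail of once-seen facts has a high singleton share. Under the simplification used here, that means a higher floor, whatever the quality of the model trained on it.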
Their subsequent work took this further: the standard way LLMs are trained (predict the next word) and the standard way they are evaluated (benchmarks that reward answering over saying "I don't know") create active incentives for models to generate confident-sounding outputs even when the honest response would be uncertainty. The training pipeline does not merely tolerate hallucination. It rewards it.
If an organization's AI safety strategy depends on improving the model's ability to flag its own uncertainty, or on refining the reward signals used during training, Kalai et al. establish that the limitation is statistical and persists even under optimal conditions. The hallucination rate is bounded by properties of the training data itself, and clinical training data, with its long tail of rare conditions, rare interactions, and rare presentations, is exactly the kind of corpus where that bound is highest.
Proof 4: Mechanism design theory says four desirable properties cannot coexist
In 2025, Karpowicz applied mechanism design theory (the branch of economics and game theory that studies how to design systems in which strategic agents interact) to the hallucination problem by modeling LLM inference as an "auction of ideas." During inference, different components of the model (attention heads, neural circuits, activation patterns) effectively compete to contribute knowledge to the response. Each component holds partial information, and the inference process aggregates their contributions into a coherent output.
Using the Green-Laffont theorem from mechanism design, Karpowicz proved that no LLM performing non-trivial knowledge aggregation can simultaneously satisfy four properties:
- Truthful knowledge representation (no hallucination)
- Semantic information conservation (preserving the meaning of encoded knowledge)
- Complete revelation of relevant knowledge (using everything the model knows)
- Knowledge-constrained optimality (producing the best possible response given what the model knows)
At least one of these four properties must be violated in any response. Since violating truthfulness produces hallucination, and violating any of the other three degrades the quality and completeness of the response, there is no configuration that avoids the tradeoff.
More importantly, this proof applies to both computationally bounded and unbounded reasoning. Even a hypothetical model with unlimited processing time and perfect access to its own internal knowledge would come up against this limitation. The impossibility is in the structure of information aggregation, not in the computational resources available. More sophisticated reasoning does not escape the tradeoff.
Proof 5: Open-world generalization says clinical medicine is the wrong domain for trust
Bowen Xu reframed hallucination as a manifestation of the generalization problem, introducing a distinction with direct clinical relevance. Under a closed-world assumption, where the model only encounters inputs that resemble its training data, hallucination can be reduced to very low levels. Under an open-world assumption, where the environment is unbounded and novel inputs appear continuously, hallucination becomes inevitable. The model must generalize beyond its training data, and generalization errors are hallucinations by definition.
Clinical medicine is an open-world domain. Patients present with novel combinations of comorbidities, medications, and clinical histories that no training dataset can fully anticipate. Rare diseases, atypical presentations, and evolving treatment guidelines continuously introduce inputs that fall outside any training distribution. The cases where hallucination matters most (unusual presentations, rare interactions, atypical trajectories) are the cases least represented in the training data and therefore least constrained by it.
If an organization's strategy is to fine-tune an LLM on clinical data so thoroughly that it will not hallucinate within the medical domain, Xu establishes that clinical medicine is not the environment in which to apply that strategy. The novel inputs that carry the highest clinical stakes are, by definition, the ones the model is least equipped to handle reliably.