Why Healthcare AI Needs More Than a Large Language Model

Sanjay Basu

May 26, 2026

June 30, 2026

When a Medicaid member who is pregnant texts our care team at 2 a.m. about which medication is safe to take for her symptoms, the response cannot be a probabilistically plausible paragraph. It has to be the right action – meaning a same-day visit referral, not vague reassurance without action behind it – and it has to be defensible to a clinician, an auditor, and a regulator the next morning.

That gap between fluent output and a clinical action a reviewer can stand behind is where most of today's healthcare AI fails. It is also why Waymark has spent the past two years building systems that pair large language models (LLMs) with explicit symbolic reasoning, rather than relying on the LLM alone.

A note on the headlines

Earlier this year, a paper published in Science lit up health AI coverage. Brodeur and colleagues reported that OpenAI's o1-preview met or exceeded physician baselines across six clinical reasoning experiments, including 70 emergency department cases from a Boston hospital. The methodology is unusually rigorous for this literature: large physician comparator groups, blinded scoring, and real ED charts in the final experiment.

Three nuances within the paper’s headline finding determine what it actually means for deployment.

First, the primary metric is the Bond score: whether the correct diagnosis appears in a five-item differential. It is a measure of differential containment, not of clinical action, calibrated uncertainty, or safety. A model can score well on Bond and still be unsafe to deploy.
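Taking the description above at face value (containment in a five-item differential), a toy sketch can show how a case scores well on containment while the recommended action is still unsafe. The `CaseResult` fields and the example case are hypothetical illustrations, not data or scoring code from the paper.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    differential: list[str]   # model's ranked differential (made-up data)
    true_diagnosis: str
    recommended_action: str   # what the model told the patient to do
    safe_action: str          # what a reviewing clinician says should happen


def contains_diagnosis(case: CaseResult, k: int = 5) -> bool:
    """True if the correct diagnosis appears in the top-k differential."""
    top_k = [d.lower() for d in case.differential[:k]]
    return case.true_diagnosis.lower() in top_k


def action_is_safe(case: CaseResult) -> bool:
    """True only if the recommended action matches the safe action."""
    return case.recommended_action == case.safe_action


# Hypothetical case: the right diagnosis is listed, the disposition is wrong.
case = CaseResult(
    differential=["gastroenteritis", "diabetic ketoacidosis", "pancreatitis"],
    true_diagnosis="diabetic ketoacidosis",
    recommended_action="see a clinician within 24-48 hours",
    safe_action="go to the emergency department now",
)
print(contains_diagnosis(case), action_is_safe(case))
```

The two functions measure different things, which is the point: a containment metric never looks at `recommended_action`.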

Second, the safety-relevant endpoint in the paper did not move. On the NEJM Healer cases, o1-preview was not significantly better than GPT-4, attending physicians, or residents at including cannot-miss diagnoses (the life-threatening possibilities that have to be considered even when they are remote) in the initial triage differential. The authors note this in the limitations; secondary coverage has not.

Third, five of the six experiments use curated educational cases such as clinicopathologic conferences, NEJM Healer, Grey Matters, and the Berner landmark cases. These are pre-cleaned, narrative, and selected for pedagogical value. The sixth, the ED arm, is a structured second opinion at predefined touchpoints, not triage, disposition, or management. Performance on cleaned cases systematically overestimates performance on the messy text our models actually see, a point Johri and colleagues made earlier this year with CRAFT-MD and which the Science authors themselves cite.

Two months earlier, another study evaluated a different OpenAI product. In an evaluation of ChatGPT Health, the consumer health tool launched in January 2026, on 60 clinician-authored vignettes under 16 factorial conditions, researchers found the model undertriaged 52% of gold-standard emergencies – directing patients with diabetic ketoacidosis or impending respiratory failure to 24–48 hour evaluation rather than to the emergency department – and its suicidality crisis interstitial activated inversely to clinical risk. Same AI vendor, but different model and different task. Both findings are true.

The conclusion is that frontier reasoning models can outperform physicians on differential diagnosis quality on cleaned cases, while consumer-facing deployments of the same vendor's product family fail unsafely on triage. That is not a contradiction, but rather a precise statement of the problem: benchmark performance and deployment safety are weakly coupled. Closing that gap requires architectural changes to the AI itself, not prompt-engineering exercises.

The objection: "But the models are getting better"

The most common pushback to architectural arguments for clinical AI safety is that scale will solve it – that the next generation of frontier models will close the safety gap empirically, making structural safeguards unnecessary. The 2025–2026 record does not support that hope, and the underlying theory predicts it never will.

The empirical pattern is that accuracy and hallucination are not the same thing and are not moving together. OpenAI's internal evaluations, reported by the New York Times in May 2025, found that the company's o3 reasoning model hallucinated on 33% of its PersonQA benchmark — more than twice the rate of the older o1 — and on 51% of SimpleQA. The smaller o4-mini hallucinated on 48% and 79% respectively. Stanford HAI's 2026 AI Index reported hallucination rates ranging from 22% to 94% across 26 frontier models on a knowledge-versus-belief benchmark, with GPT-4o's accuracy dropping from 98.2% to 64.4% when the same false statement was framed as a user's belief rather than a third party's.

The independent AA-Omniscience benchmark, released in November 2025, found that Grok 4 and GPT-5 achieved the highest raw accuracy (39%) but hallucinated on 64% and 81% of attempted questions respectively; only three of the 36 models evaluated cleared a net-positive reliability score. In short, larger and "smarter" did not mean "more honest."

The reason this pattern keeps recurring is that it is structural, not transient. Five independent mathematical results from the past two years converge on the conclusion that hallucination in LLMs cannot be eliminated by scaling or post-hoc patching:

Xu, Jain and Kankanhalli prove, using computable-function arguments from learning theory, that LLMs cannot learn all the computable functions and therefore must hallucinate when used as general problem solvers.
Kalai and Vempala, in a STOC 2024 paper, derive a statistical lower bound: any calibrated language model must hallucinate at a rate at least equal to the Good–Turing missing-mass estimate — the fraction of facts appearing exactly once in training. This bound is architecture-independent.
Banerjee, Agarwal and Singla extend the argument using Gödel's First Incompleteness Theorem and the undecidability of the halting, emptiness and acceptance problems, formalizing "structural hallucination" as a non-zero-probability event at every stage of the LLM pipeline.
Kalai, Nachum, Vempala and Zhang reduce open-ended generation to binary classification (the "is-it-valid" reduction), proving that hallucinations arise inherently from pretraining statistics and are then reinforced by post-training evaluation schemes that reward guessing over abstention.
Hong and colleagues, at NAACL 2024, show empirically and analytically that LLMs cannot reliably self-verify their own logical errors, undermining the most common deployed mitigation: chain-of-thought self-checking.
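To make the Good–Turing quantity in the Kalai and Vempala item concrete, here is a minimal sketch of the standard estimator: the number of distinct items observed exactly once, divided by the total number of observations. It illustrates the estimator only; the paper's full bound includes further terms (such as calibration error) that are not reproduced here, and the toy `facts` list is made up.

```python
from collections import Counter


def good_turing_missing_mass(observations: list[str]) -> float:
    """Good-Turing estimate of unseen probability mass: N1 / N.

    N1 is the number of distinct items seen exactly once;
    N is the total number of observations.
    """
    if not observations:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observations)


# Toy corpus of "facts": c, d and e each appear exactly once.
facts = ["a", "a", "a", "b", "b", "c", "d", "e"]
print(good_turing_missing_mass(facts))  # 3 singletons / 8 observations = 0.375
```

On this reading, a corpus in which many facts appear only once implies a floor on the error rate that neither more parameters nor better prompting removes.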

These are different formal frameworks that reach the same conclusion: hallucination is not a defect that can be eliminated by more training data and greater scale. It is an inherent property of the LLM itself. Healthcare AI vendors that rely on LLMs and make safety claims are doing so without real-world evidence, and without a theoretical mechanism. We must build systems that remain safe while the underlying language model continues to hallucinate.

What neurosymbolic AI actually is

"Neurosymbolic" is shorthand for systems that pair neural networks (including LLMs) with explicit symbolic structure: knowledge graphs, logical rules, causal diagrams, decision trees, or formal verification layers. The neural component captures the messiness of natural language and high-dimensional clinical data. The symbolic component encodes what we already know – drug-drug contraindications, eligibility logic, clinical practice guidelines, regulatory constraints – into a form that can be inspected, audited, and corrected without retraining a multi-billion-parameter model.

The two components fail in different, complementary ways. Neural models fail probabilistically and unpredictably; symbolic models fail in known, traceable ways. Combining them lets the system retain LLM-level language understanding while gaining the property that matters most in regulated medicine: every consequential decision has an explanation a clinician can read and a reviewer can verify.

How we use it at Waymark

Waymark is currently building, evaluating, and testing two systems on the architectural foundation needed to be safe for medical triage, remain accessible to patients, and keep a human physician in the loop at all times.

Compass is our 24/7 SMS-based care navigation system for Medicaid members. The LLM handles the linguistic surface – understanding what a member means when they write "I can't breathe right and my chest feels heavy" – but it does not make the care decision. That decision passes through a symbolic policy layer encoding our clinical pathways: red-flag rules for stroke and acute coronary syndromes, eligibility logic for our community health worker outreach, and escalation criteria to a Waymark clinician. The LLM is the interpreter; the symbolic layer is the protocol. Every routing decision generates a logged, human-readable trace.
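The interpreter/protocol split described above might look roughly like the following sketch. This is not Compass code: the rule IDs, symptom labels, action names, and the `Interpretation` structure are all hypothetical, the rules are not clinical guidance, and the LLM step is stubbed out as an already-extracted set of symptoms.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interpretation:
    """Structured output of the LLM step (stubbed; no model is called here)."""
    symptoms: frozenset[str]


@dataclass(frozen=True)
class Rule:
    rule_id: str
    required: frozenset[str]   # every listed symptom must be present to fire
    action: str


# Hypothetical rules in priority order. Illustrative only.
RULES = [
    Rule("RF-ACS-01", frozenset({"chest_pain", "shortness_of_breath"}),
         "escalate_to_clinician"),
    Rule("RF-STROKE-01", frozenset({"facial_droop"}), "escalate_to_clinician"),
]
DEFAULT_ACTION = "route_to_care_team_queue"


@dataclass
class Decision:
    action: str
    trace: list[str] = field(default_factory=list)


def route(interp: Interpretation) -> Decision:
    """Symbolic policy layer: first matching rule wins, every step is logged."""
    trace = [f"symptoms={sorted(interp.symptoms)}"]
    for rule in RULES:
        if rule.required <= interp.symptoms:
            trace.append(f"rule {rule.rule_id} matched -> {rule.action}")
            return Decision(rule.action, trace)
        trace.append(f"rule {rule.rule_id} not matched")
    trace.append(f"no rule matched -> {DEFAULT_ACTION}")
    return Decision(DEFAULT_ACTION, trace)


# Assume the LLM mapped "I can't breathe right and my chest feels heavy"
# to these two labels.
decision = route(Interpretation(frozenset({"chest_pain", "shortness_of_breath"})))
print(decision.action)
for line in decision.trace:
    print(line)
```

In this arrangement the LLM can only influence the outcome through the labels it extracts; the routing itself is a pure function whose trace can be read line by line.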

Anchor is the structural verification layer we have built to sit on top of frontier LLMs. Rather than asking the LLM to check its own work, Anchor audits LLM outputs against an external, inspectable rule library and returns a deterministic per-output certificate with traceable provenance. The architectural property we are after is one no purely probabilistic stack can give: every consequential AI suggestion carries an auditable explanation of why it was approved, modified, or blocked. We are evaluating Anchor retrospectively against our care management corpus and prospectively in a pre-registered randomized trial.
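As a rough illustration of what a deterministic per-output certificate could contain, the sketch below audits a structured suggestion against a small rule table and records which rules fired, the rule-library version, and a digest of the audited input. Everything here is an assumption for illustration (the `Certificate` fields, the rule ID, the version label, the dict-shaped suggestion); it borrows the metformin/eGFR constraint mentioned elsewhere in this post and covers only the approved/blocked verdicts, not modification.

```python
import hashlib
import json
from dataclasses import dataclass

RULE_LIBRARY_VERSION = "example-0.1"   # hypothetical version label


@dataclass(frozen=True)
class Certificate:
    verdict: str                   # "approved" or "blocked"
    fired_rules: tuple[str, ...]   # IDs of the rules that objected
    library_version: str
    input_digest: str              # SHA-256 of the canonicalized suggestion


def _metformin_low_egfr(s: dict) -> bool:
    # Assumption: a missing eGFR does not trigger this rule; a real system
    # would need an explicit policy for missing data.
    return s.get("drug") == "metformin" and s.get("egfr", float("inf")) < 30


# Hypothetical external rule library: rule ID -> check function.
RULES = {"RX-METFORMIN-EGFR": _metformin_low_egfr}


def audit(suggestion: dict) -> Certificate:
    """Same suggestion and same rule library always yield the same certificate."""
    canonical = json.dumps(suggestion, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    fired = tuple(sorted(rid for rid, check in RULES.items() if check(suggestion)))
    verdict = "blocked" if fired else "approved"
    return Certificate(verdict, fired, RULE_LIBRARY_VERSION, digest)


cert = audit({"drug": "metformin", "egfr": 25})
print(cert.verdict, cert.fired_rules)
```

Because the digest is computed over a key-sorted serialization, two semantically identical suggestions produce identical certificates, which is what makes the record reproducible for a reviewer.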

Why interpretability has to be architectural, not retrofitted

It is tempting to think that the answer to LLM safety is interpretability: you get a window into the model's reasoning, and a clinician can intervene before harm reaches the patient. The broader literature on LLM interpretability suggests this hope is fragile. Mechanistic methods can expose features and circuits inside a model, but exposure is not the same as actionability. A clinician who can visualize an LLM's internal activations is not thereby able to correct its unsafe outputs.

The contrast with a structured policy is sharp. In our recent paper in BMJ Health & Care Informatics, we expose the reasoning pathways of a reinforcement learning policy used in Waymark's Medicaid care coordination, using attention motifs, feature-action flows, and concept-level probes. The reasoning the policy actually uses is recoverable, and it maps onto clinical-social concepts that adjudicating clinicians and social workers recognize: housing instability driving missed appointments, polypharmacy interacting with cognition, transportation barriers as a determinant of escalation, and so on. The decisions are not only interpretable; they are interpretable in clinically actionable language, because the architecture was designed to support that kind of inspection.

Interpretability is a property of architecture, not a feature you can bolt on after training. An LLM whose internal activations you can visualize is not the same thing as an LLM whose decisions you can correct. A neurosymbolic system, by construction, separates the two: the LLM's interpretation can be wrong without the system's decision being wrong, because the consequential decision is made downstream, in a layer whose rules can be versioned, validated, and changed without retraining.

Why this matters for safety, effectiveness, and regulation

Auditability. The FDA's evolving framework for generative AI medical devices requires manufacturers to demonstrate ongoing performance monitoring of non-deterministic systems. A neurosymbolic architecture localizes the non-determinism: the LLM may produce variable language, but the consequential decision passes through a deterministic, inspectable layer whose rules can be versioned and validated. In our experience, this simplifies QMS documentation under ISO 13485 and the QMSR substantially.
Distribution shift. The performance gap between cleaned cases and real patient messages is not a prompt-engineering problem; it is a consequence of how LLMs generalize. Symbolic constraints – "always escalate any mention of chest pain plus diaphoresis," "never recommend metformin in a patient with eGFR < 30" – do not degrade when the language style of the input changes. They are the part of the system that holds steady across the populations who are most often underrepresented in training data, which in Medicaid means most of the people we serve.
Error attribution. When a purely neural system gives the wrong answer, you can fine-tune, but you cannot meaningfully say why it failed. When a neurosymbolic system gives the wrong answer, you can trace whether the failure was in language interpretation (the neural side) or in the rule (the symbolic side), and fix the relevant component. For a learning health system, this is the difference between debugging and guessing. It is also the precondition for an AI morbidity-and-mortality process worth running.
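The error-attribution idea can be sketched as a small function: re-run the rule on clinician-verified findings and see which side disagrees with the expected action. The finding labels, action names, and the single rule below are hypothetical, loosely based on the chest pain plus diaphoresis example above, and are not clinical guidance.

```python
def escalation_rule(findings: set[str]) -> str:
    """Symbolic side: depends only on extracted findings, never on wording."""
    if {"chest_pain", "diaphoresis"} <= findings:
        return "escalate"
    return "routine"


def attribute_error(extracted: set[str], true_findings: set[str],
                    expected_action: str) -> str:
    """Locate a failure in the neural extraction or in the symbolic rule."""
    if escalation_rule(extracted) == expected_action:
        return "no_error"
    if escalation_rule(true_findings) == expected_action:
        # The rule is right when given correct findings, so extraction failed.
        return "neural_interpretation"
    # Even with correct findings the rule gives the wrong action.
    return "symbolic_rule"


# The LLM missed "sweating buckets" in the member's message.
missed = attribute_error({"chest_pain"}, {"chest_pain", "diaphoresis"}, "escalate")

# Extraction was correct, but the rule has no pathway for jaw pain.
gap = attribute_error({"jaw_pain", "diaphoresis"},
                      {"jaw_pain", "diaphoresis"}, "escalate")
print(missed, gap)
```

The first case calls for fixing the extraction step; the second calls for a reviewed change to the rule library, with no model retraining involved.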

What this is not

Neurosymbolic AI is not a refusal to use LLMs. We use them extensively, including inside Compass. It is also not the same as retrieval-augmented generation, prompt engineering, or guardrail libraries — those are useful probabilistic mitigations that reduce failure rates without changing the underlying property that the LLM is the decision-maker. The neurosymbolic claim is structural: in safety-critical clinical workflows, the LLM should be a component, not the system.

The reason this matters now, against the backdrop of Medicaid care delivery, is that the regulatory and commercial environment for healthcare AI is moving quickly toward asking the question neurosymbolic methods are best positioned to answer: “Show me why this AI made this decision for this patient, in a form a clinician can read and an auditor can check.” Pure LLMs, even at frontier scale, do not answer that question. Systems that pair LLMs with symbolic reasoning do. That is the architecture Medicaid members deserve, and it is the one we are building.
