The best clinical AI systems available today can reason through a differential diagnosis about as well as an experienced physician – but those same systems also leave out steps that determine whether a patient is actually safe. The gap between those two facts is the problem that health plans and health systems are operating within.
For the leaders responsible for those deployments, that gap shows up as utilization risk, liability exposure, and questions from regulators that a deployment can’t yet answer. Even a one-percent omission rate, spread across the hundreds of thousands of AI-assisted interactions a large health plan will see in a year, could translate into thousands of avoidable emergency visits. Each of those represents a cost and, more importantly, a patient who was not served well.
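The arithmetic behind that estimate is simple enough to sketch. The figures below are illustrative assumptions, not data from any health plan: 300,000 interactions stands in for "hundreds of thousands," and the one-percent rate is the hypothetical used above. The function name is also invented for the example.

```python
def expected_omissions(interactions_per_year: int, omission_rate: float) -> int:
    """Expected number of interactions per year that contain an omitted step."""
    if not 0.0 <= omission_rate <= 1.0:
        raise ValueError("omission_rate must be between 0 and 1")
    return round(interactions_per_year * omission_rate)


# Assumed volume: 300,000 AI-assisted interactions per year (illustrative only).
print(expected_omissions(300_000, 0.01))  # 3000
```

The sketch counts omissions, not emergency visits. How many omissions become avoidable visits depends on case mix and on how often a clinician catches the gap.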
In a 2025 study, researchers tested 31 widely used large language models across 100 real primary care cases, scored by 29 specialist physicians at Stanford University School of Medicine and Harvard Medical School. The study introduced NOHARM (Numerous Options Harm Assessment for Risk in Medicine), the foundational benchmark of the Medical AI Superintelligence Test (MAST), an ongoing public effort to measure clinical AI safety. It is the most rigorous evaluation of clinical AI safety published to date, and it remains a preprint pending peer review. An independent, peer-reviewed benchmark found a similar pattern, with model safety performance dropping 13.3% in high-risk scenarios.
Two study findings are essential for anyone running a deployment to understand:
- A model’s safety correlated only moderately with its scores on standard medical-knowledge benchmarks, which means a strong benchmark score is a weak predictor of safe behavior at the bedside.
- Of the severely harmful errors these models produced, 76.6% were errors of omission, where the model failed to include something the situation required.
That second figure reframes clinical AI safety. The central task becomes guarding against errors of omission: recommendations that harm patients because of what they leave out.
What an error of omission looks like
A patient arrives with tearing chest pain that radiates to the back, and blood pressure that reads differently in the left and right arms. A leading model reviews the presentation, identifies acute coronary syndrome, and recommends anticoagulation.
That combination of symptoms is the classic presentation of aortic dissection. American Heart Association and European Society of Cardiology guidelines are explicit that it calls for urgent imaging before any anticoagulation, because mistaking an aortic dissection for acute coronary syndrome can be fatal. The model named a plausible diagnosis and moved to treatment. It never ran the check that separates the survivable path from the fatal one.
Nothing in that output looks wrong. There is no false claim to catch. An absent step is harder to spot than an incorrect statement, and a clinician scanning a reasonable-looking recommendation in a busy clinic may never see the gap. Automation bias compounds the risk, since clinicians tend to accept the recommendations of a generally accurate model without close scrutiny. A 2026 randomized trial in NEJM AI found that physicians trained in AI literacy still showed measurably worse diagnostic reasoning when a model’s advice carried a seeded error. Omissions are the failure that today’s safety tools are least equipped to notice.
Why today’s safeguards miss it
This is also why the common safety measures fall short on omissions. Fine-tuning a model, feeding it the right reference documents, engineering careful prompts, having a second model review the first, and adding human oversight all lower how often a bad output appears. What none of them does is confirm, for one specific recommendation, that the model carried out every step the relevant guideline requires. A tool that reviews what a model said has no way to flag what the model never said, and a second model carries the same blind spots as the first. These are worthwhile layers that reduce the rate of error, but an omission is exactly what slips through when no single output is checked against a clinical standard. The limit is structural, and it’s the same no matter how well any one of these tools is tuned.
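The structural point can be made concrete with a small sketch. A reviewer that inspects only the model’s output has one input. Catching an omission needs a second input: the list of steps the guideline requires. The step labels and the required-step set below are hypothetical, drawn from the chest pain example above, and are not a clinical rule set.

```python
# Hypothetical step labels; a real system would use coded clinical concepts.
REQUIRED_BEFORE_ANTICOAGULATION = frozenset({"urgent_aortic_imaging"})


def missing_steps(recommended: set[str], required: frozenset[str]) -> list[str]:
    """Return the required steps absent from the recommendation, sorted."""
    return sorted(required - recommended)


# What the model actually recommended in the example case.
model_output = {"diagnose_acute_coronary_syndrome", "start_anticoagulation"}

print(missing_steps(model_output, REQUIRED_BEFORE_ANTICOAGULATION))
# ['urgent_aortic_imaging']
```

Nothing in `model_output` is false on its face. The gap appears only when the output is compared against an explicit standard.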
What a safe clinical AI system actually does
The more useful question for a health plan or health system is what safe clinical AI should look like. The answer is a verification layer that sits between the AI and the clinician and checks each recommendation before anyone acts on it. Five qualities make that layer trustworthy, and they work as a checklist for any clinical AI:
- It checks every recommendation the AI produces, including the routine-looking ones.
- It measures each recommendation against clinical guidelines from specialty societies, named and current, so the standard being applied is explicit.
- It returns the same verdict every time it sees the same case, so the check is consistent and can be relied on.
- It produces a plain record of what it checked and which guideline it checked against, so any single decision can be shown to a regulator, a board, or a court.
- It works with whatever AI model the organization runs today, and keeps working when that model is replaced by a better one.
The ability to show the basis for a clinical recommendation is becoming a baseline expectation, and not only from regulators. Boards, malpractice attorneys, and contracting partners all have reason to ask how an AI-assisted decision was reached. The regulatory picture itself is uneven and shifting, with several states now active on AI in healthcare and federal direction moving less predictably, which makes a verifiable record useful regardless of where any single rule settles. A health plan that can already produce that record for any recommendation is prepared for whatever standard arrives. One that cannot is carrying exposure that will be harder to close later than it is to address now.
This is the layer Waymark is building with Anchor (Auditable Navigation of Clinical Hazards with Oversight and Reasoning)
Anchor checks each AI recommendation against a library of more than 3,000 named clinical rules using symbolic logic, a rules-based method that returns the same verdict every time it evaluates the same recommendation. It records exactly which guideline applied to each decision, or returns ‘unverified’ when it cannot find a guideline that corresponds to the LLM’s recommendation. It connects through a single interface and adds to the safety steps an organization already runs, so adopting it does not mean removing anything that works today. The audit trail described above is a standard part of how it operates.
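As a rough illustration of the general pattern, and not a description of how Anchor is actually implemented, the sketch below shows a rules-based check that is deterministic, names the guideline it applied, and falls back to ‘unverified’ when no rule matches. Every name in it (`Rule`, `Verdict`, `verify`, the single sample rule and its labels) is hypothetical and invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    guideline: str             # named source of the rule (hypothetical label)
    trigger: frozenset         # findings that make the rule applicable
    action: str                # recommended action the rule governs
    required_first: frozenset  # steps that must accompany that action


@dataclass(frozen=True)
class Verdict:
    status: str                # "pass", "fail", or "unverified"
    guideline: Optional[str]   # which rule was applied, if any
    missing: tuple             # required steps the recommendation left out


# A one-rule library, for illustration only.
RULES = (
    Rule(
        guideline="Example aortic dissection rule (illustrative only)",
        trigger=frozenset({"tearing_chest_pain", "arm_bp_differential"}),
        action="start_anticoagulation",
        required_first=frozenset({"urgent_aortic_imaging"}),
    ),
)


def verify(findings: set, steps: set, rules=RULES) -> Verdict:
    """Pure function of its inputs, so the same case yields the same verdict."""
    for rule in rules:
        if rule.trigger <= findings and rule.action in steps:
            missing = tuple(sorted(rule.required_first - steps))
            status = "fail" if missing else "pass"
            return Verdict(status, rule.guideline, missing)
    # No applicable rule: say so rather than guess.
    return Verdict("unverified", None, ())


case = {"tearing_chest_pain", "arm_bp_differential"}
print(verify(case, {"start_anticoagulation"}).status)  # fail
print(verify({"cough"}, {"rest"}).status)              # unverified
```

Because each `Verdict` carries the guideline name and the missing steps, storing it alongside the recommendation gives the kind of per-decision record the audit trail depends on.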
Reasoning quality is now widely available in clinical AI. Verification of every recommendation against a named standard, with a record to prove it, is the capability the field has not yet made routine. Building it now costs a fraction of retrofitting it under regulatory pressure later, and the organizations that ask for it today will be the ones ready when regulators, boards, and patients come to expect it.