The clinical AI industry has a measurement problem. Vendors report accuracy, sensitivity, and F1 scores because those are the metrics the field knows how to produce and benchmark committees know how to evaluate. They largely avoid the one measure clinical medicine developed specifically to quantify population-level harm: number needed to harm (NNH).
NNH is the standard because harm is not uniformly distributed. A model can post a strong accuracy score on a clinical triage benchmark while still producing severely harmful outputs in a meaningful share of interactions, a failure rate that never surfaces in the headline number. For a medical director deciding whether to deploy AI in a clinical workflow, the figure that matters is how often the system causes harm, and the accuracy score does not reveal it.
What NNH captures that accuracy and F1 miss
Accuracy and F1 measure whether a model gets the right answer. NNH measures how often, and how severely, it causes harm. For a deployment decision, the second is the figure that describes the risk.
In a drug trial, that distinction is obvious. A drug that reduces symptom severity in 80% of patients may still be unacceptable if it causes a serious adverse event in one of every 50. The 80% efficacy figure and the NNH of 50 are both true, and only one of them describes what happens when the drug is deployed at scale.
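The arithmetic behind that figure can be sketched in a few lines. The snippet below is a minimal illustration, assuming the standard definition of NNH as the reciprocal of the absolute risk increase over a comparator. The function name is hypothetical, and the comparator harm rate of zero is an assumption made so that "one in every 50" maps directly to an NNH of 50.

```python
def number_needed_to_harm(harm_rate_exposed: float,
                          harm_rate_comparator: float = 0.0) -> float:
    """Return NNH = 1 / absolute risk increase.

    harm_rate_exposed:    share of exposed patients who experience the harm.
    harm_rate_comparator: share who experience it without the exposure
                          (assumed 0.0 here purely for illustration).
    """
    absolute_risk_increase = harm_rate_exposed - harm_rate_comparator
    if absolute_risk_increase <= 0:
        # No excess harm relative to the comparator, so NNH is undefined.
        raise ValueError("NNH is undefined without excess harm over the comparator")
    return 1.0 / absolute_risk_increase


# One serious adverse event in every 50 patients, against a zero-harm comparator.
print(round(number_needed_to_harm(1 / 50)))  # 50
```

The 80% efficacy figure never enters the calculation, which is the point: the two numbers answer different questions.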
For clinical AI, the same structure applies, with one added complication. A pharmaceutical causes harm at a relatively stable rate across a defined population. A language model's failure rate varies, concentrating in the presentations that are hardest to get right, the multi-system, complex, and atypical cases, which in a Medicaid population are not edge cases. They are the clinical reality of the population under management.
A model that performs well on single-condition presentations and degrades on multi-system ones produces accuracy figures that look acceptable while generating its highest harm rates precisely when patients have the most to lose. A peer-reviewed benchmark in npj Digital Medicine documented roughly a 13% drop in model performance on high-risk scenarios relative to lower-risk ones, the same direction of failure that matters most for a high-acuity population. NNH, calculated on a representative population rather than a curated benchmark, is the measure that surfaces this divergence, which a headline accuracy score conceals.
What the Do NOHARM data shows at Medicaid scale
The Do NOHARM framework was developed to assess clinical AI safety in a structured, reproducible way. In a 2025 evaluation, Wu et al. applied it to publicly available large language models, and the results are specific in ways accuracy benchmarks rarely are. The Wu evaluation is a preprint and has not yet completed peer review, so the figures below are best read as an early signal rather than a settled result. The pattern it points to, sharper degradation on the hardest cases, is independently documented in the peer-reviewed npj Digital Medicine benchmark cited above, which gives the argument a foundation that does not rest on a single source.
Severe-harm rates were not uniform across the evaluated models. At the upper end of the range, 22.2% of responses were rated as causing severe harm. Among the severe harms specifically, 76.6% were errors of omission, cases where a model failed to surface clinically important information rather than recommending something actively harmful.
The Wu evaluation was not specific to Medicaid. It measured general-purpose language models on clinical tasks, and its findings apply to any population those models are used on. That makes it a useful way to picture the stakes at scale. Applied to a plan managing 200,000 members, where AI handles a meaningful volume of clinical interactions, a severe-harm rate in that range would mean a defined number of harmful responses each month, in a population that already has limited access to follow-up care when something goes wrong. The Medicaid case is likely worse than the benchmark suggests, because Medicaid populations skew toward exactly the multi-system presentations where these models degrade most. NNH is the measure that makes that risk visible and quantifiable.
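To make the scale concrete, the sketch below multiplies a monthly interaction volume by a severe-harm rate and splits the result by the omission share. The 22.2% and 76.6% figures are the upper-end values quoted above from the Wu preprint. The monthly volume of 10,000 AI-handled interactions is a purely hypothetical placeholder, and applying a benchmark rate to live interactions is an illustration of the arithmetic, not a forecast for any plan.

```python
def expected_severe_harms(monthly_interactions: int,
                          severe_harm_rate: float,
                          omission_share: float) -> dict:
    """Estimate monthly severe-harm responses, split by error type."""
    severe = monthly_interactions * severe_harm_rate
    return {
        "severe": severe,
        "omission": severe * omission_share,          # failed to surface key information
        "commission": severe * (1 - omission_share),  # actively harmful recommendation
    }


# HYPOTHETICAL volume; rates are the upper-end figures quoted from the preprint.
estimate = expected_severe_harms(
    monthly_interactions=10_000,
    severe_harm_rate=0.222,
    omission_share=0.766,
)
for label, count in estimate.items():
    print(f"{label}: {count:.0f} responses per month")
```

A plan would substitute its own interaction volume and a harm rate measured on its own population; the structure of the estimate stays the same.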
Why a rigorous NNH benchmark is harder than a headline number
An NNH figure means little on its own. Like number needed to treat, NNH is interpretable only relative to a comparator and a defined population and time horizon. The Do NOHARM evaluation reports it as a range, roughly 4.5 to 11.5 across the language models tested, against 3.5 for a no-intervention baseline, which shows how much the figure moves with the comparison it is measured against. The number depends entirely on how it was produced, and a handful of methodological choices determine whether it reflects the population a health plan is actually responsible for.
- The evaluation population. An NNH figure calculated on a general medical knowledge benchmark, or on simulated patient interactions (synthetic data), describes a different reality than one calculated on real member interactions. A Medicaid population carries comorbidity patterns, language access needs, and social determinants profiles that diverge sharply from the datasets underlying most published AI benchmarks. A figure derived from a general medical QA dataset may say very little about the population a plan manages day to day.
- What counts as harm. The Do NOHARM framework separates errors of commission (harmful recommendations) from errors of omission (failures to surface critical information). Both matter, and in a triage or care navigation context an omission is often a failure in its own right. A methodology that captures only direct harmful recommendations will undercount risk in exactly the setting where missing something carries the most consequence.
- Independence. An internal evaluation using a published framework and an independent evaluation by researchers external to the vendor answer different questions. Neither is invalid, but the distinction affects how much weight a figure should carry and what additional validation is reasonable to request.
- The denominator. NNH calculated on a curated set of high-acuity cases produces a different figure than NNH calculated on a representative sample of all member interactions, routine queries included. The overall rate describes system-level risk. The rate on high-acuity presentations describes what happens when a patient has nowhere else to turn. A benchmark that can be called rigorous accounts for both.
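The effect of the denominator is easiest to see with numbers. The sketch below computes NNH separately for routine and high-acuity interactions and then for the pooled total. Every count is hypothetical and chosen only to show the mechanics, the `Stratum` type and `nnh` helper are illustrative names, and the comparator is assumed to have been evaluated on the same interactions.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    name: str
    interactions: int        # number of interactions evaluated
    harmful: int             # rated harmful under the AI system
    comparator_harmful: int  # rated harmful under the comparator


def nnh(harmful: int, comparator_harmful: int, interactions: int) -> float:
    """NNH = 1 / absolute risk increase; infinite when there is no excess harm."""
    excess_rate = (harmful - comparator_harmful) / interactions
    return float("inf") if excess_rate <= 0 else 1.0 / excess_rate


# HYPOTHETICAL counts, for illustration only.
strata = [
    Stratum("routine", interactions=9_000, harmful=90, comparator_harmful=45),
    Stratum("high_acuity", interactions=1_000, harmful=100, comparator_harmful=20),
]

per_stratum = {s.name: nnh(s.harmful, s.comparator_harmful, s.interactions)
               for s in strata}
overall = nnh(
    sum(s.harmful for s in strata),
    sum(s.comparator_harmful for s in strata),
    sum(s.interactions for s in strata),
)

for name, value in per_stratum.items():
    print(f"{name}: NNH ~ {value:.1f}")
print(f"overall: NNH ~ {overall:.1f}")
```

With these made-up counts, the pooled figure sits between the two strata and looks far more reassuring than the high-acuity figure on its own, which is why a single overall NNH is not enough to describe the risk.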
Clinical AI considerations for vendor evaluations
These four considerations reduce to a short set of questions a medical director should be able to put to any clinical AI vendor. If a vendor can’t give clear answers to these questions, it’s worth investigating why before a deployment conversation goes further.
- What is your NNH, and on what population and case set was it calculated?
- Does your evaluation distinguish errors of omission from errors of commission?
- Was the NNH evaluation conducted internally using a published framework, or by an independent evaluator?
- Are safety-critical functions, emergency detection and clinical triage, handled by the same model that manages routine interactions, or by a separate, non-probabilistic architecture?
A vendor who cannot answer these questions has not necessarily built an unsafe product. It may mean only that patient safety was never set as an explicit design requirement, so the vendor has no measurement either way. For a health plan, the value of asking is to get those answers on record before deployment, rather than discovering the gaps once the system is already working with members.
Medicaid members carry the highest burden of avoidable acute care in the US healthcare system, and a meaningful share of that burden reflects gaps in access rather than gaps in clinical capability. Clinical AI built to a safety standard, evaluated on real populations and with harm measured accurately, changes the math on that access challenge. This is the standard worth demanding, and the standard toward which Waymark is building.
