An OpenAI large language model from the o1 series outperformed physician baselines across multiple measures of diagnostic and management reasoning, including in a blinded proof-of-concept evaluation using randomly selected emergency room cases, according to findings published in Science. Notably, however, the model did not significantly outperform physicians on cannot-miss diagnoses — a safety-critical endpoint — across several experiments.
Peter G. Brodeur, MD, of Beth Israel Deaconess Medical Center, Thomas A. Buckley, of Harvard Medical School, and colleagues conducted six experiments — five using structured clinical case sets and one using real, unstructured ER cases — comparing model performance against physician baselines drawn from hundreds of practitioners across training levels. The structured experiments used o1-preview, a preview version of the model; the ER study used o1, the subsequent release. Notably, o1-preview has since been supplanted by newer models, including OpenAI's o3, and the authors acknowledge that whether performance gains hold across model generations requires further study.
NEJM Clinicopathologic Conference Cases
In 143 New England Journal of Medicine (NEJM) clinicopathologic conference (CPC) cases, o1-preview included the correct diagnosis in its differential in 78% of cases and listed it first in 52%. When very close or potentially helpful diagnoses were counted, accuracy reached 97.9% (95% CI, 94.0–99.6%). On a 70-case subset used in a prior evaluation of GPT-4 — an earlier generation model — o1-preview identified an exact or very close diagnosis in 89% of cases vs. 73% with GPT-4. The model also selected the correct next diagnostic test in 88% of 130 scored CPC cases (from a starting set of 136, with cases judged not applicable excluded); an additional 11% of testing plans were rated helpful by two physician scorers.
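For readers who want to sanity-check the reported interval, the figures are consistent with an exact binomial (Clopper-Pearson) 95% confidence interval computed on roughly 140 of 143 cases; that the authors used this particular method, and the exact counts behind the 97.9%, are assumptions here rather than details stated in the article. A minimal sketch:

```python
# Sketch: reproducing a 95% exact (Clopper-Pearson) binomial confidence interval.
# Assumption: 97.9% corresponds to roughly 140 correct of 143 cases; the study's
# exact counting and CI method are not stated in this summary.
from scipy.stats import beta

successes, n, alpha = 140, 143, 0.05

# Clopper-Pearson bounds come from quantiles of the beta distribution.
lower = beta.ppf(alpha / 2, successes, n - successes + 1) if successes > 0 else 0.0
upper = beta.ppf(1 - alpha / 2, successes + 1, n - successes) if successes < n else 1.0

print(f"{successes / n:.1%} (95% CI, {lower:.1%} to {upper:.1%})")
# Prints a value near: 97.9% (95% CI, 94.0% to 99.6%)
```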
NEJM Healer Virtual Encounters
In 20 NEJM Healer virtual patient encounters, o1-preview achieved a perfect Revised-IDEA (R-IDEA) score — a validated 10-point scale assessing documentation of clinical reasoning — in 78 of 80 evaluations, statistically significantly outperforming GPT-4 (47 of 80), attending physicians (28 of 80), and resident physicians (16 of 72). However, o1-preview did not statistically significantly outperform GPT-4, attending physicians, or residents at identifying cannot-miss diagnoses during the initial triage presentation. The authors describe this as a safety-critical endpoint, and the finding held across multiple experiments: physicians and the model performed comparably on the diagnoses that matter most when missed. Clinicians evaluating AI tools for real-world use should weigh this carefully — a system that broadens differential generation may not reliably flag the dangerous diagnoses that most urgently require exclusion.
Management and Landmark Diagnostic Cases
In five Grey Matters management cases scored against consensus physician expert criteria, o1-preview achieved a median score of 89%, statistically significantly outperforming GPT-4 alone (42%), physicians using GPT-4 (41%), and physicians using conventional resources (34%). In six landmark diagnostic cases that have not been publicly released — compared against historical, not concurrent, physician controls — o1-preview achieved a median score of 97% vs. 92% for GPT-4 and 74% for physicians using conventional resources; these differences were not statistically significant. Physicians should interpret the 97% vs. 74% comparison with caution: because the human comparator data were drawn from a prior study rather than collected alongside the model evaluation, confounding factors cannot be ruled out.
Diagnostic Probabilistic Reasoning Cases
In five primary care probabilistic reasoning cases, o1-preview and GPT-4 performed similarly in estimating pretest and posttest probabilities, with o1-preview modestly outperforming GPT-4 overall. A striking finding was that clinicians displayed substantially wider variability in their estimates than either model. Notably, o1-preview outperformed both GPT-4 and human clinicians in estimating posttest probabilities for the cardiac ischemia case, a clinically meaningful result given the consequences of miscalibrated probability estimates in that setting.
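For context, the pretest-to-posttest updates these cases assess follow Bayes' theorem in odds form: pretest probability is converted to odds, multiplied by the test's likelihood ratio, and converted back to a probability. The sketch below illustrates that arithmetic only; the pretest probability and likelihood ratios are hypothetical placeholders, not values from the study's cases.

```python
# Sketch of the pretest-to-posttest probability update assessed in these cases.
# The pretest probability and likelihood ratios below are illustrative
# placeholders, not figures from the study.

def posttest_probability(pretest_prob: float, likelihood_ratio: float) -> float:
    """Apply Bayes' theorem in odds form: posttest odds = pretest odds * LR."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# Hypothetical example: a 20% pretest probability of cardiac ischemia,
# updated by a positive test with LR+ = 4 and a negative test with LR- = 0.2.
print(posttest_probability(0.20, 4.0))  # ~0.50 after a positive result
print(posttest_probability(0.20, 0.2))  # ~0.048 after a negative result
```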
Emergency Room Evaluation
In the ER study — the only experiment using real, unstructured clinical data from the electronic health record — researchers compared o1, GPT-4o, and two attending internal medicine physicians across 76 randomly selected cases from Beth Israel Deaconess Medical Center in Boston. Two additional attending physicians, blinded to whether each differential diagnosis came from a physician or an AI model, scored responses using the Bond Scale. Blinding was successful: the physician raters correctly identified the source (AI vs. human) in only 15% and 3% of cases, respectively.
Across the three diagnostic touchpoints, o1 identified the exact or very close diagnosis in 67% of cases at initial triage, 72% at the physician encounter, and 82% at admission to the medical floor or ICU, surpassing both attending physicians (55% and 50% at triage; 62% and 53% at the physician encounter; 79% and 70% at admission). The differences favoring o1 over the physicians were statistically significant at the first two touchpoints. GPT-4o also outperformed the attending physicians at some touchpoints but trailed o1 at all three, with statistically significant differences favoring o1 at triage and the physician encounter. The performance gap was largest at initial triage, where the least patient information is available.
Limitations
The researchers emphasized several important limitations. The study evaluated only text-based performance; clinical medicine involves nontext inputs — including medical imaging interpretation, physical examination, and auditory cues — that were not assessed. The ER evaluation was designed as a proof of concept, focused on generating second-opinion differential diagnoses at predefined touchpoints; actual emergency medicine decisions more often center on triage, disposition, and immediate management. The study also focused on internal medicine and emergency medicine, with limited generalizability to other specialties, and relied on curated cases for several benchmarks. Physician comparators in the management and landmark diagnostic case experiments were drawn from historical data rather than concurrent controls — a meaningful methodological constraint when interpreting head-to-head comparisons. The tested model versions — particularly o1-preview — have since been succeeded by newer models, including OpenAI's o3. Additionally, the model was not statistically significantly better than physicians or GPT-4 at identifying cannot-miss diagnoses in the NEJM Healer cases, nor in the landmark diagnostic cases overall.
"Our findings suggest that LLMs have now eclipsed most benchmarks of clinical reasoning," the researchers wrote — though they were careful to distinguish benchmark performance from clinical utility, adding that human-computer interaction studies and prospective clinical trials are needed to assess whether AI systems can improve clinical practice and patient outcomes.
Disclosures
Funding included support from NIH/NIEHS, the Harvard Medical School Dean's Innovation Award for Artificial Intelligence, the Macy Foundation, the Moore Foundation, and others. Adam Rodman is a Visiting Researcher at Google DeepMind. Eric Horvitz is employed by Microsoft. Jonathan Chen is co-founder of Reaction Explorer LLC and received honoraria from industry conferences and health systems. Zahir Kanjee discloses royalties from Oakstone Publishing and Wolters Kluwer. Full disclosures are available in the published study.
Source: Science