Across five standardized primary care scenarios, notes generated by ambient artificial intelligence scribe tools scored lower on documentation quality than notes written by clinicians, according to a cross-sectional evaluation published in Annals of Internal Medicine.
The findings will also be presented during the breaking news scientific plenary session, “New in Annals of Internal Medicine: Hear It First from the Authors,” at the American College of Physicians (ACP) Internal Medicine Meeting 2026 in San Francisco.
The Veterans Health Administration (VHA) study evaluated 11 artificial intelligence (AI) scribe vendors against 18 human notetakers using five audio-recorded standardized patient encounters. Thirty blinded clinician raters assessed all notes using the modified Physician Documentation Quality Instrument (PDQI-9), a 10-domain framework in which each domain is scored on a 5-point Likert scale, for a maximum total score of 50 points.
Findings
Human-generated notes scored higher than AI-generated notes across all five cases, with statistically significant differences in three scenarios.
The largest gap emerged in the acute low back pain scenario—which included substantial background noise—where human notes averaged 43.8 points compared with 20.3 points for AI-generated notes. The chest pain scenario, in which both patient and clinician wore surgical masks, showed a statistically significant difference of 7.5 points (42.2 vs 34.8). A nurse care manager heart failure encounter also showed a statistically significant difference of 5.7 points (38.4 vs 32.8).
Differences in the new patient visit and pharmacy consultation cases followed the same pattern but did not reach statistical significance.
In a pooled domain analysis, AI-generated notes scored lower than human-generated notes in all 10 quality domains. The largest deficits, each about 1 point on the 5-point domain scale, were in thoroughness, organization, and usefulness. Smaller but statistically significant differences were observed in freedom from hallucination (−0.87 points) and freedom from bias (−0.70 points).
Study Design and Methods
The five standardized cases included a new patient visit with a new diagnosis of diabetes, acute low back pain, chest pain, a pharmacy consultation, and a nurse care manager follow-up for heart failure.
Encounters between standardized patient actors and clinicians were audio recorded in a simulated outpatient environment. The researchers incorporated real-world transcription challenges, including background noise, nonnative English accents, and masked participants.
Each of the 11 vendors generated notes from identical audio files under controlled conditions, using VHA-furnished devices and the subjective, objective, assessment, and plan (SOAP) note format. Three clinicians per case independently produced comparator notes after listening to the recordings once.
All notes were evaluated in randomized order, and raters were blinded to whether notes were AI- or human-generated.
Context
The researchers noted that ambient AI scribes have seen rapid adoption in clinical settings. Prior studies have shown improved efficiency, with reduced documentation time, more same-day note completion, and less after-hours work.
However, most prior evaluations have focused on efficiency and user experience rather than documentation quality, and many examined only a single vendor.
Limitations
The researchers acknowledged several limitations. The cases were simulated and may not reflect real-world clinical complexity. Human-generated notes were produced outside typical clinical workflows and were not written by professional scribes, which may affect comparability.
Rater blinding may also have been imperfect, and the PDQI-9 instrument, originally developed for physician documentation, may not fully capture errors unique to AI-generated notes.
Finally, vendors were not permitted to generate multiple iterations of notes, which could influence AI performance.
In an exclusive interview, lead study author Ashok Reddy, MD, addressed whether more iterative use of AI tools could improve performance. Dr. Reddy is a primary care physician at VA Puget Sound, an HSR&D core investigator, associate director of the Primary Care Analytics Team at the VHA Office of Primary Care, and an associate professor in the Division of General Internal Medicine at the University of Washington.
“This is an important question,” Dr. Reddy said. “My coauthors and I are continuing to assess different aspects of quality for this technology across the pilot implementation and are conducting ongoing quality monitoring.”
Editorial: Rethinking How Note Quality Is Measured
In an accompanying editorial, Aaron A. Tierney, PhD, of The Permanente Medical Group and the Division of Research, Kaiser Permanente Northern California, and Kristine Lee, MD, of The Permanente Medical Group, supported the study's findings while calling for broader and more realistic evaluation frameworks.
The editorialists noted that the study averaged results across all 11 AI vendors and all raters, without assessing variability between individual vendors or clinicians—an important limitation given likely performance differences.
They also emphasized that because large language model output is probabilistic, future evaluations could assess repeated AI-generated notes across multiple iterations.
More broadly, the editorial questioned whether the modified PDQI-9 adequately reflects meaningful documentation quality. The tool, they argued, may reward longer and more complex notes shaped by billing, regulatory, and liability requirements rather than patient care needs.
The editorialists suggested that AI scribes could provide an opportunity to redesign clinical documentation to better support patient-centered care, particularly in the era of open medical records.
They also called for incorporating patient perspectives into future evaluations, noting that as many as two-thirds of patients rely on after-visit summaries to guide self-care and understand next steps. They further noted that high rates of clinician burnout may affect documentation quality in real-world settings, which may not be captured in simulated scenarios.
Bottom Line for Practice
The researchers concluded that AI scribes should be used to generate draft documentation that requires clinician review and editing, rather than to replace clinician-authored notes.
They called for rigorous, vendor-neutral testing before adoption and for ongoing quality assurance frameworks to ensure that AI-assisted documentation supports both clinicians and patients.
Disclosures: The study was funded by the VHA Office of Primary Care. The funder had no role in study design, data collection, analysis, or publication decisions. Disclosure forms for the researchers and editorialists are available with the original articles.
Source: Annals of Internal Medicine