A large language model achieved 92.5% exact-match agreement with expert annotation when extracting structured cardiovascular data from free-text echocardiography reports, according to a finalist abstract for the 2026 Arthur E. Weyman Young Investigator’s Award Competition.
Researchers from Johns Hopkins University in Baltimore, Maryland, evaluated whether GPT-5 mini could abstract 55 cardiovascular fields from de-identified reports in the MIMIC-III EchoNotes dataset. The fields spanned valve function, chamber size and function, wall motion, and hemodynamics.
The researchers wrote that echocardiography reports contain extensive clinical information, but much of it remains in unstructured text that may include inconsistent or ambiguous findings. They used a severity coding system with schema constraints to generate structured outputs.
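The abstract does not describe the schema itself, so the following is only a minimal sketch of what a severity coding system with schema constraints might look like. The severity scale, the field names, and the `validate` helper are all hypothetical and are not taken from the study.

```python
from enum import Enum
from typing import Dict, Optional


class Severity(Enum):
    """Hypothetical ordinal severity scale (not the study's actual coding)."""
    NONE = 0
    TRIVIAL = 1
    MILD = 2
    MODERATE = 3
    SEVERE = 4


# Hypothetical schema: field name -> kind of value allowed for that field.
SCHEMA: Dict[str, str] = {
    "mitral_regurgitation": "severity",
    "aortic_stenosis": "severity",
    "lvef_percent": "numeric",
}


def validate(raw: Dict[str, Optional[str]]) -> Dict[str, object]:
    """Coerce raw model output into the schema, rejecting anything outside it."""
    unknown = set(raw) - set(SCHEMA)
    if unknown:
        raise ValueError(f"fields not in schema: {sorted(unknown)}")

    clean: Dict[str, object] = {}
    for field, kind in SCHEMA.items():
        value = raw.get(field)
        if value is None:
            # Field not mentioned in the report: keep it explicitly empty.
            clean[field] = None
        elif kind == "severity":
            key = value.strip().upper()
            if key not in Severity.__members__:
                raise ValueError(f"{field}: {value!r} is not a valid severity")
            clean[field] = Severity[key]
        else:
            # Numeric fields must parse as a number; float() raises ValueError otherwise.
            clean[field] = float(value)
    return clean


example = validate({"mitral_regurgitation": "mild", "lvef_percent": "55"})
```

Under these assumptions, a free-text grade that is not on the scale is rejected rather than stored, which is one way a schema constraint can keep ambiguous wording out of a structured dataset.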
Fifty reports were independently annotated by a board-certified echocardiographer and separately extracted by GPT-5 mini. A blinded independent cardiologist then adjudicated 193 field-level disagreements between the human annotation and the large language model extraction.
Using human annotation as the reference, the large language model extraction achieved 92.5% exact-match agreement. Precision ranged from 96% to 98% by category, including 96% for severity fields, 97% for numeric fields, 98% for categorical fields, and 97% for wall motion. Recall ranged from 85% to 95%, including 95% for severity fields, 92% for numeric fields, 89% for categorical fields, and 85% for wall motion.
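The abstract does not spell out how these metrics were defined, so the sketch below uses one common field-level convention as an assumption: exact match counts every field where the two sources agree (including both being empty), precision is computed over fields the model filled in, and recall over fields the reference filled in. The field names and values are invented for illustration.

```python
from typing import Dict, Optional

Value = Optional[str]


def field_metrics(reference: Dict[str, Value], extracted: Dict[str, Value]) -> Dict[str, float]:
    """Compare one extraction against a reference annotation, field by field."""
    fields = sorted(set(reference) | set(extracted))
    matches = 0        # fields where both sources agree, including both empty
    true_pos = 0       # fields where the reference has a value and the model matches it
    model_filled = 0   # fields where the model produced a value
    ref_filled = 0     # fields where the reference has a value

    for name in fields:
        ref = reference.get(name)
        ext = extracted.get(name)
        if ref == ext:
            matches += 1
        if ext is not None:
            model_filled += 1
        if ref is not None:
            ref_filled += 1
            if ref == ext:
                true_pos += 1

    return {
        "exact_match": matches / len(fields) if fields else 0.0,
        "precision": true_pos / model_filled if model_filled else 0.0,
        "recall": true_pos / ref_filled if ref_filled else 0.0,
    }


# Invented example: the model adds one value the annotator left blank
# and misses one value the annotator recorded.
reference = {
    "mitral_regurgitation": "mild",
    "lvef_percent": "55",
    "aortic_stenosis": None,
    "tricuspid_regurgitation": "trivial",
}
extracted = {
    "mitral_regurgitation": "mild",
    "lvef_percent": "55",
    "aortic_stenosis": "none",
    "tricuspid_regurgitation": None,
}
result = field_metrics(reference, extracted)
```

Note that under this convention a value the model extracts but the annotator omitted lowers precision even if the model is correct, which is why adjudication of disagreements matters when the human reference is itself imperfect.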
In blinded review of discordant fields, the large language model extraction was judged superior in 60% of comparisons, or 101 of 171. The model also identified 120 additional clinical values that were present in the source reports but were not documented by the human annotators.
In an interview with Conexiant, researcher Robert B. Barrett, of Johns Hopkins University, said the additional values missed by human annotators varied in clinical and practical significance. Some reflected the model’s tendency to over-extract normal or trivial findings, he said, while others reflected ordinary human annotation errors.
“LLMs are sycophantic machines and, when given a task to extract, are eager to over-extract,” Barrett said. “For this reason, many human-missed annotations were those where the LLM identified a field as ‘normal’ or ‘trivial.’ This was an advantage for coverage, and a cautious example in the broader sense that over-extraction may lead [to] poorly justified assumptions.”
He added that some missed annotations were caused by simple human workflow issues, such as not scrolling past an unusually long blank section of a report and incorrectly assuming the report was incomplete.
“Large language models do not make these errors — they are always bound by the entirety of context provided, allowing them to reliably extract what we, as humans, may miss,” Barrett said.
Barrett said the team’s model and prompt were not fine-tuned or tailored specifically to the Johns Hopkins or MIMIC datasets, which may make the approach more flexible for large-scale echocardiography-specific data extraction.
He noted, however, that performance varied across exam types when open-weight models were used. Stress echocardiograms showed lower performance with Qwen3.5, likely because those reports have lower information density. Institutions or vendors that produce low-fidelity or sparse reports for certain exam types may see similar challenges, he said. Larger frontier models may reduce some of these issues but cost more, which makes model selection an important consideration for a given use case.
The findings suggest that large language models may help structure echocardiography report data for retrospective research, quality improvement, documentation auditing, or report summarization. However, the abstract described a small report-abstraction study and did not report prospective clinical use, diagnostic accuracy, patient outcomes, or workflow-safety outcomes.
Barrett said additional validation would be needed before using this type of approach in workflows that could influence clinical decisions. He said the team’s validation strategy included external validation between MIMIC and Johns Hopkins REACH, internal concordance checks across extracted fields, and adjudication by a second board-certified echocardiographer to avoid assuming either human- or LLM-derived values should automatically serve as the gold standard.
“These steps offer quick and robust checks for meaningful use of extracted data within QI and research workflows,” Barrett said. “Prior to integration in workflows which may influence clinical decisions, additional work should be done to further optimize for higher precision and recall.”
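The team's internal concordance checks were not described in detail. As a rough illustration of the idea, the sketch below flags extracted records whose fields contradict one another. The field names and both rules are hypothetical and are not drawn from the study.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


def wall_motion_consistent(rec: Record) -> bool:
    """Global wall motion coded 'normal' should not coexist with abnormal segments."""
    segments = rec.get("abnormal_segments") or []
    return not (rec.get("wall_motion_global") == "normal" and len(segments) > 0)


def lvef_in_range(rec: Record) -> bool:
    """An ejection fraction, if present, must be a percentage between 0 and 100."""
    lvef = rec.get("lvef_percent")
    return lvef is None or 0 <= lvef <= 100


# Hypothetical rule set: (rule name, predicate that returns True when consistent).
RULES: List[Tuple[str, Callable[[Record], bool]]] = [
    ("wall_motion_consistent", wall_motion_consistent),
    ("lvef_in_range", lvef_in_range),
]


def concordance_flags(rec: Record) -> List[str]:
    """Return the names of every rule the record violates."""
    return [name for name, rule in RULES if not rule(rec)]


flags = concordance_flags({
    "wall_motion_global": "normal",
    "abnormal_segments": ["basal inferior"],
    "lvef_percent": 55,
})
```

Flagged records could then be routed to human review instead of being accepted automatically, in keeping with the approach of not treating either source as the gold standard by default.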
He added that future optimization could include fine-tuning models and prompts.
The researchers concluded that large language models demonstrated strong capability for automated echocardiography data extraction and may offer a way to improve research efficiency and evaluate documentation.
The abstract, YIA-02, is scheduled to be presented on Saturday, June 27, during the Artificial Intelligence and Machine Learning session of the 2026 Arthur E. Weyman Young Investigator’s Award Competition.
Disclosures were not provided in the abstract.