MedPath

LLMs Show Limited Benefit in Enhancing Physician Diagnostic Reasoning: A Randomized Trial

• A randomized clinical trial assessed whether large language models (LLMs) improve diagnostic reasoning among family medicine, internal medicine, and emergency medicine physicians.
• LLM use did not significantly enhance diagnostic reasoning compared with conventional resources alone, with a non-significant 2-percentage-point score difference.
• The LLM alone outperformed both physician groups, scoring 16 percentage points higher than physicians using conventional resources, highlighting AI's potential in diagnostics.
• Further development is needed to integrate LLMs effectively into clinical practice and to improve physician-AI collaboration and diagnostic accuracy.

A recent randomized clinical trial published in JAMA Network Open investigated the impact of large language models (LLMs) on physician diagnostic reasoning. The study, conducted across multiple academic medical institutions, found that providing physicians with access to an LLM did not significantly improve their diagnostic accuracy compared to using conventional resources. However, the LLM alone outperformed both physician groups, suggesting potential for AI in clinical decision support with further development.
The trial, which involved 50 physicians from family medicine, internal medicine, and emergency medicine, assessed diagnostic reasoning performance using a standardized rubric. Participants were randomized to either access an LLM (ChatGPT Plus) in addition to conventional diagnostic resources or conventional resources only. They were given 60 minutes to review up to six clinical vignettes, with their diagnostic performance evaluated based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps.
The primary outcome, the median diagnostic reasoning score per case, was 76% for the LLM group versus 74% for the conventional-resources group; the adjusted difference of 2 percentage points (95% CI, -4 to 8 percentage points; P = .60) was not statistically significant. Median time spent per case was 519 seconds for the LLM group and 565 seconds for the conventional-resources group, a non-significant adjusted difference of -82 seconds (95% CI, -195 to 31 seconds; P = .20).

LLM Standalone Performance

Notably, when the LLM was used alone to answer the cases, it scored 16 percentage points higher (95% CI, 2-30 percentage points; P = .03) than the conventional-resources group. This suggests the LLM has the potential to enhance diagnostic accuracy, but its integration into clinical practice requires further refinement.

Implications for Clinical Practice

The findings carry significant implications for integrating AI into clinical practice. Although LLMs have performed well on medical reasoning examinations, this trial suggests that simply providing physicians with access to an LLM is not sufficient to improve their diagnostic reasoning.
Ethan Goh, MD, MS, the corresponding author of the study, noted that the results highlight the need for technology and workforce development to realize the potential of physician-artificial intelligence collaboration in clinical practice. He suggested that training clinicians in best prompting practices may improve physician performance with LLMs. Alternatively, organizations could invest in predefined prompting for diagnostic decision support integrated into clinical workflows and documentation, enabling synergy between the tools and clinicians.

Structured Reflection as an Assessment Tool

The study also developed a measure based on structured reflection to evaluate diagnostic reasoning skills. This assessment tool demonstrated substantial inter-grader agreement and internal reliability, moving the field beyond early LLM research that focused on benchmarks with limited clinical utility.

Limitations

The authors acknowledged several limitations of the study, including the focus on a single LLM and the lack of explicit training in prompt engineering techniques for participants. Additionally, the study used a limited number of clinical vignettes, which may not comprehensively cover the variety of cases in the field of medicine.

Conclusion

Despite these limitations, the study provides valuable insights into the potential and challenges of using LLMs in clinical practice. While LLMs alone may outperform physicians in diagnostic reasoning, their integration into clinical workflows requires further development to enhance physician-AI collaboration and improve diagnostic accuracy.


Reference

[1] Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Network Open (jamanetwork.com), October 28, 2024.
