A recent randomized clinical trial published in JAMA Network Open investigated the impact of large language models (LLMs) on physician diagnostic reasoning. The study, conducted across multiple academic medical institutions, found that providing physicians with access to an LLM did not significantly improve their diagnostic accuracy compared to using conventional resources. However, the LLM alone outperformed both physician groups, suggesting potential for AI in clinical decision support with further development.
The trial, which involved 50 physicians from family medicine, internal medicine, and emergency medicine, assessed diagnostic reasoning performance using a standardized rubric. Participants were randomized to access either an LLM (ChatGPT Plus) in addition to conventional diagnostic resources or conventional resources only. They were given 60 minutes to review up to six clinical vignettes, and their diagnostic performance was scored on differential diagnosis accuracy, the appropriateness of supporting and opposing factors, and the selection of next diagnostic evaluation steps.
The primary outcome, the median diagnostic reasoning score per case, was 76% for the LLM group and 74% for the conventional resources-only group. The adjusted difference of 2 percentage points (95% CI, -4 to 8 percentage points; P = .60) was not statistically significant. Similarly, the median time spent per case was 519 seconds for the LLM group and 565 seconds for the conventional resources group, with a non-significant time difference of -82 seconds (95% CI, -195 to 31 seconds; P = .20).
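The trial reports these contrasts as adjusted differences with 95% confidence intervals. As a rough, hypothetical illustration of how a between-group difference and its uncertainty interval can be estimated (this is not the authors' statistical model, and the per-case scores below are made up), a minimal bootstrap sketch in Python might look like this:

```python
# Illustrative only: bootstrap a between-group difference in median scores
# with a 95% interval, using hypothetical per-case scores. The trial itself
# reports an adjusted analysis; this sketch only shows the general idea.
import random

random.seed(0)

# Hypothetical per-case diagnostic reasoning scores (percent)
llm_group = [76, 80, 70, 74, 82, 78, 72, 75, 79, 77]
conventional = [74, 72, 68, 75, 76, 70, 73, 71, 77, 69]

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def bootstrap_diffs(a, b, n_boot=10_000):
    """Resample each group with replacement and collect median differences."""
    diffs = []
    for _ in range(n_boot):
        resampled_a = [random.choice(a) for _ in a]
        resampled_b = [random.choice(b) for _ in b]
        diffs.append(median(resampled_a) - median(resampled_b))
    diffs.sort()
    return diffs

diffs = bootstrap_diffs(llm_group, conventional)
point = median(llm_group) - median(conventional)
lo, hi = diffs[int(0.025 * len(diffs))], diffs[int(0.975 * len(diffs))]
print(f"median difference: {point:.1f} points (95% CI {lo:.1f} to {hi:.1f})")
```

The published estimates come from the authors' adjusted analysis, so a simple resampling like this serves only as a conceptual stand-in for how such an interval conveys uncertainty around a group difference.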
LLM Standalone Performance
Interestingly, when the LLM was used alone to answer the cases, it scored 16 percentage points (95% CI, 2 to 30 percentage points; P = .03) higher than the conventional resources group. This suggests that the LLM has the potential to enhance diagnostic accuracy, but its integration into clinical practice requires further refinement.
Implications for Clinical Practice
The study's findings have significant implications for the integration of AI into clinical practice. While LLMs have shown promise in medical reasoning examinations, their effectiveness in improving physician diagnostic reasoning remains uncertain. The results suggest that simply providing access to LLMs may not be sufficient to improve overall physician diagnostic reasoning.
Ethan Goh, MD, MS, the corresponding author of the study, noted that the results highlight the need for technology and workforce development to realize the potential of physician-artificial intelligence collaboration in clinical practice. He suggested that training clinicians in best prompting practices may improve their performance with LLMs. Alternatively, organizations could invest in predefined prompts for diagnostic decision support integrated into clinical workflows and documentation, so that the tools and clinicians work in concert.
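As a purely hypothetical sketch of what such a predefined prompt might look like (the wording below is illustrative and not drawn from the study), a structured template could mirror the trial's rubric of differential diagnosis, supporting and opposing findings, and next diagnostic steps:

```python
# Hypothetical example of a predefined prompt for diagnostic decision support.
# The structure mirrors the rubric described in the trial; the wording is
# illustrative, not the prompt or workflow used in the study.
DIAGNOSTIC_PROMPT = """You are assisting a physician with a diagnostic case.

Case vignette:
{vignette}

Please respond with:
1. A ranked differential diagnosis (most to least likely).
2. For each leading diagnosis, the findings that support it and the findings
   that argue against it.
3. The most appropriate next diagnostic evaluation steps.
"""

def build_prompt(vignette: str) -> str:
    """Fill the predefined template with a specific case vignette."""
    return DIAGNOSTIC_PROMPT.format(vignette=vignette)

if __name__ == "__main__":
    example = "A 54-year-old presents with pleuritic chest pain and low-grade fever..."
    print(build_prompt(example))
```

Embedding a fixed template of this kind in the workflow, rather than relying on each clinician's ad hoc prompting, is one way an organization could standardize how the model is queried.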
Structured Reflection as an Assessment Tool
The study also developed a measure based on structured reflection to evaluate diagnostic reasoning skills. This assessment tool demonstrated substantial inter-rater agreement and internal reliability, advancing the field beyond early LLM research that focused on benchmarks with limited clinical utility.
Limitations
The authors acknowledged several limitations, including the focus on a single LLM and the lack of explicit training in prompt engineering techniques for participants. In addition, the study used a limited number of clinical vignettes, which may not represent the full breadth of cases encountered in clinical practice.
Conclusion
Despite these limitations, the study provides valuable insights into the potential and challenges of using LLMs in clinical practice. While the LLM alone outperformed both physician groups on these cases, its integration into clinical workflows requires further development to enhance physician-AI collaboration and improve diagnostic accuracy.