Diagnostic Reasoning With Customized GPT-4 Model
- Conditions
- Pathologic Processes
- Disease
- Registration Number
- NCT06911645
- Lead Sponsor
- Stanford University
- Brief Summary
This study will assess the impact of immediate access to a customized version of GPT-4, a large language model, on performance in case-based diagnostic reasoning tasks. Specifically, it will compare this approach with a two-step process in which participants first use traditional diagnostic decision support tools before gaining access to the customized GPT-4 model.
- Detailed Description
Artificial intelligence (AI) technologies, particularly advanced large language models like OpenAI's ChatGPT, have the potential to enhance medical decision-making. While ChatGPT-4 was not specifically designed for medical applications, it has demonstrated promise in various healthcare contexts, including medical note-writing, addressing patient inquiries, and facilitating medical consultations. However, its impact on clinicians' diagnostic reasoning remains largely unknown.
Clinical reasoning is a complex process that involves pattern recognition, knowledge application, and probabilistic reasoning. Integrating AI tools like ChatGPT-4 into physician workflows could help reduce clinician workload and decrease the likelihood of missed diagnoses. However, ChatGPT-4 was neither developed nor validated for diagnostic reasoning, and it may produce misleading information, including plausible but incorrect conclusions that could misguide clinicians. If not used appropriately, it may fail to improve clinical decision-making and could even hinder it. Therefore, it is essential to study how clinicians use large language models to support clinical reasoning before integrating them into routine patient care.
This study will examine how immediate access to a customized version of ChatGPT-4 affects performance on case-based diagnostic reasoning tasks, compared with a stepwise approach. In the stepwise approach, participants will first use traditional diagnostic decision support tools for case reasoning before interacting with the customized ChatGPT-4 model, at which point they will have the opportunity to revise their initial answers.
Participants will be randomized into different study arms and will respond to diagnostic cases by providing three differential diagnoses, along with supporting and opposing findings for each. They will also identify their top diagnosis and propose next diagnostic steps. Independent reviewers, blinded to treatment assignment, will evaluate their responses.
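For illustration, a minimal sketch of assigning the 70 target participants to the two arms follows. The registry does not state the allocation method (e.g., simple vs. block randomization); simple 1:1 shuffling, the arm labels, and the fixed seed are assumptions, not the study's procedure.

```python
# Minimal sketch of 1:1 randomization into the two study arms.
# Assumptions (not from the registry): simple randomization via shuffling,
# the arm labels, and a fixed seed for reproducibility.
import random

def randomize(participant_ids, seed=None):
    """Shuffle participants and split them evenly between the two arms."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    arms = {}
    for pid in ids[:half]:
        arms[pid] = "immediate customized GPT-4 access"
    for pid in ids[half:]:
        arms[pid] = "stepwise (decision support tools first)"
    return arms

# Target recruitment is 70 participants, so each arm receives 35.
assignments = randomize(range(1, 71), seed=42)
```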
Recruitment & Eligibility
- Status
- COMPLETED
- Sex
- All
- Target Recruitment
- 70
- Inclusion Criteria
- Participants must be licensed physicians who have completed at least post-graduate year 1 (PGY1) of medical training.
- Training in internal medicine, family medicine, or emergency medicine.
- Exclusion Criteria
- Not currently practicing clinically.
- Participated in one of the investigators' previous studies that used the same six diagnostic cases.
Study & Design
- Study Type
- INTERVENTIONAL
- Study Design
- PARALLEL
- Primary Outcome Measures
Name: Diagnostic reasoning
Time: Through study completion, an average of 6 months
Method: The primary outcome will be the percentage of correct responses per case (range: 0 to 100). For each case, participants will be asked to provide their top three differential diagnoses, along with supporting and opposing findings for each. They will receive 1 point for each plausible diagnosis. Supporting and opposing findings will be graded for correctness, with 1 point for a partially correct response and 2 points for a completely correct response. Participants will then select their top diagnosis, earning 1 point for a reasonable choice and 2 points for the most accurate diagnosis. Finally, they will list up to three next steps for further patient evaluation, with 1 point awarded for a partially correct response and 2 points for a completely correct response. The primary outcome will be analyzed at the case level, comparing performance between the randomized study groups.
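The rubric above implies a simple point-to-percentage conversion. Below is a minimal Python sketch of that conversion; the registry entry does not specify how many supporting and opposing findings are graded per diagnosis, so the maximum point total used here is a hypothetical illustration, not the study's actual denominator.

```python
# Hypothetical sketch of the per-case scoring rubric described above.
# The registry entry does not state how many supporting/opposing findings
# are graded per diagnosis, so the maximum point total below is illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class CaseResponse:
    plausible_diagnoses: int     # 0-3 diagnoses judged plausible, 1 point each
    finding_scores: List[int]    # per finding: 0, 1 (partial), or 2 (complete)
    top_diagnosis_score: int     # 0, 1 (reasonable), or 2 (most accurate)
    next_step_scores: List[int]  # per next step (up to 3): 0, 1, or 2

def case_score_percent(r: CaseResponse, max_points: int) -> float:
    """Convert awarded rubric points into the 0-100 per-case outcome."""
    points = (r.plausible_diagnoses
              + sum(r.finding_scores)
              + r.top_diagnosis_score
              + sum(r.next_step_scores))
    return 100.0 * points / max_points

# Example: assuming one supporting and one opposing finding per diagnosis,
# a hypothetical maximum is 3 (diagnoses) + 12 (6 findings x 2 points)
# + 2 (top diagnosis) + 6 (3 next steps x 2 points) = 23 points.
response = CaseResponse(
    plausible_diagnoses=3,
    finding_scores=[2, 1, 2, 2, 1, 0],
    top_diagnosis_score=2,
    next_step_scores=[2, 2, 1],
)
print(f"{case_score_percent(response, max_points=23):.1f}%")  # 78.3%
```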
- Secondary Outcome Measures
Name: Prompt frequency
Time: Through study completion, an average of 6 months
Method: The investigators will compare the frequency of participant prompts to the customized GPT-4 model between the two study groups.

Name: Sentiment
Time: Through study completion, an average of 6 months
Method: The investigators will compare the tone and sentiment of participant prompts to the customized GPT-4 model across the two study groups, using a qualitative coding system to categorize the nature of the participants' prompts.

Name: Time Spent Per Case
Time: Through study completion, an average of 6 months
Method: The investigators will compare the average time (in minutes) participants spend on each case across the two study arms.

Name: Participant Perceptions of AI in Clinical Reasoning
Time: Through study completion, an average of 6 months
Method: This outcome will be assessed in both study arms and will capture changes in attitudes, confidence, and willingness to use AI diagnostic tools before and after exposure to the customized tool. The investigators will assess the number of participants who were open to using AI to help with complex clinical reasoning (pre- and post-quiz), whether they enjoyed working with the AI diagnostic tool, whether they felt the tool provided a valuable collaborative experience for clinical reasoning, whether seeing the tool's recommendations increased their confidence in their differential diagnoses, and whether they would use an AI diagnostic tool like the one in this study in their daily job. Each item will be rated on a Likert scale ranging from strongly disagree to strongly agree.

Name: Customized GPT-4's diagnostic reasoning
Time: Through study completion, an average of 6 months
Method: The customized GPT-4's 'independent' diagnoses will be assessed for accuracy. The outcome will be the percentage of correct responses per case (range: 0 to 100). For each case, the meta-prompt directs the customized GPT-4 to provide its top three differential diagnoses, along with supporting and opposing findings for each, a final diagnosis, and next steps. The model will receive 1 point for each plausible diagnosis. Supporting and opposing findings will be graded for correctness, with 1 point for a partially correct response and 2 points for a completely correct response. Its top diagnosis will earn 1 point for a reasonable choice and 2 points for the most accurate diagnosis. Finally, it will list up to three next steps for further patient evaluation, with 1 point awarded for a partially correct response and 2 points for a completely correct response. The outcome will be analyzed at the case level, comparing the model's performance with the randomized study groups' scores.
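The study's actual meta-prompt is not published in this registry entry. As an illustration only, the sketch below shows how a meta-prompt of this kind could direct a GPT-4 chat completion to return the structured output graded above, using the OpenAI Python SDK (v1+); the prompt wording, model name, and helper function are assumptions, not the study's configuration.

```python
# Illustrative only: the study's actual meta-prompt and model configuration
# are not published here. This sketch shows how a system-level meta-prompt
# can direct a GPT-4 chat completion to return three differentials with
# supporting/opposing findings, a final diagnosis, and next steps.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

META_PROMPT = (
    "You are a diagnostic reasoning assistant. For the clinical case "
    "provided, list your top three differential diagnoses with supporting "
    "and opposing findings for each, then state your single most likely "
    "diagnosis and up to three next diagnostic steps."
)

def independent_diagnosis(case_text: str) -> str:
    """Obtain the model's 'independent' structured response for one case."""
    completion = client.chat.completions.create(
        model="gpt-4",  # placeholder; the study's customized model is not public
        messages=[
            {"role": "system", "content": META_PROMPT},
            {"role": "user", "content": case_text},
        ],
    )
    return completion.choices[0].message.content
```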
Trial Locations
- Locations (1)
Stanford University
Palo Alto, California, United States