Evaluating the Potential of Large Language Models for Respiratory Disease Consultations

Not Applicable

Completed

Conditions: Acute Upper Respiratory Infection
Lung Cancer
Pneumonia
Acute Bronchitis
Bronchiectasis
Asthma
Pulmonary Embolism
Tuberculosis
Hay Fever
Pulmonary Fibrosis

Interventions: Diagnostic Test: Diagnosis by three human doctors
Diagnostic Test: Diagnosis by ChatGPT-3.5 (with search capabilities)
Diagnostic Test: Diagnosis by ChatGPT-3.5 (without search capabilities)
Diagnostic Test: Diagnosis by ChatGPT-4.0 (with search capabilities)
Diagnostic Test: Diagnosis by ChatGPT-4.0 (without search capabilities)
Diagnostic Test: Diagnosis by Claude instant (with search capabilities)
Diagnostic Test: Diagnosis by Claude instant (without search capabilities)
Diagnostic Test: Diagnosis by Claude 2 (with search capabilities)
Diagnostic Test: Diagnosis by Claude 2 (without search capabilities)
Diagnostic Test: Diagnosis by Gemini Pro (with search capabilities)
Diagnostic Test: Diagnosis by Gemini Pro (without search capabilities)

Registration Number: NCT06457269

Lead Sponsor: North Sichuan Medical College

Brief Summary: The clinical trial aims to evaluate multiple large language models in respiratory disease consultations by comparing their performance to that of human doctors across three major medical consultation scenarios.

The main question it aims to answer is:

* How do large language models perform in comparison to human doctors in diagnosing and consulting on respiratory diseases across various clinical scenarios?

In three clinical scenarios (the online query section, the disease diagnosis section, and the medical explanation section), research assistants or volunteers will be asked to cross-question all LLMs or real doctors using predefined online questions and their own health concerns. After each questioning session, a short washout period is applied to reduce potential carry-over bias.

Detailed Description: Not available

Recruitment & Eligibility

Status: COMPLETED

Sex: All

Target Recruitment: 703

Inclusion Criteria

Self-reported symptoms of common respiratory diseases, such as cough, chest tightness, fever, and wheezing
Ability to engage in LLM dialog operations independently or with minimal peer training
A health status deemed suitable for study participation by the pulmonology experts

Exclusion Criteria

Excessively poor health status

Study & Design

Study Type: INTERVENTIONAL

Study Design: CROSSOVER

Arm & Interventions

Group	Intervention	Description
Cross-comparison group(the disease diagnosis section)	Diagnosis by Claude 2 (with search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the disease diagnosis section)	Diagnosis by Gemini Pro (with search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the disease diagnosis section)	Diagnosis by Gemini Pro (without search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the medical explanation section)	Diagnosis by ChatGPT-4.0 (with search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the disease diagnosis section)	Diagnosis by three human doctors	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the disease diagnosis section)	Diagnosis by ChatGPT-3.5 (with search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the medical explanation section)	Diagnosis by three human doctors	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the medical explanation section)	Diagnosis by ChatGPT-3.5 (with search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the medical explanation section)	Diagnosis by Claude 2 (with search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the medical explanation section)	Diagnosis by Claude 2 (without search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the medical explanation section)	Diagnosis by Gemini Pro (with search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the disease diagnosis section)	Diagnosis by ChatGPT-3.5 (without search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the disease diagnosis section)	Diagnosis by ChatGPT-4.0 (with search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the disease diagnosis section)	Diagnosis by Claude instant (with search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the disease diagnosis section)	Diagnosis by Claude instant (without search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the disease diagnosis section)	Diagnosis by Claude 2 (without search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the medical explanation section)	Diagnosis by Gemini Pro (without search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the disease diagnosis section)	Diagnosis by ChatGPT-4.0 (without search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the medical explanation section)	Diagnosis by Claude instant (with search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the medical explanation section)	Diagnosis by ChatGPT-3.5 (without search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the medical explanation section)	Diagnosis by ChatGPT-4.0 (without search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)
Cross-comparison group(the medical explanation section)	Diagnosis by Claude instant (without search capabilities)	Cross-comparison group (including human doctor controls and all LLMs)

Primary Outcome Measures

Name	Time	Method
Expert indicators-Accuracy	For each participant, starting from the day of random conversation, a maximum participation time of one week will be given. As for subjective expert indicators, the evaluation will be conducted within two months.	Based on the doctors' responses to patients' issues, a 5-point scale will be used for scoring by an expert panel: 5- The responses are completely accurate, addressing all of the patient's questions or diagnosing by identifying the key points of the patient's complaint. 4- The responses are mostly accurate, generally addressing the patient's questions or diagnosing by identifying the key points of the patient's complaint. 3- The responses are moderately accurate, addressing the patient's questions or diagnosing by identifying the key points of the patient's complaint. 2- The responses are rarely accurate, barely addressing the patient's questions or diagnosing by identifying the key points of the patient's complaint. 1- The responses are very inaccurate, not addressing the patient's questions or diagnosing by identifying the key points of the patient's complaint at all.
Expert indicators-Comprehensiveness	For each participant, starting from the day of random conversation, a maximum participation time of one week will be given. As for subjective expert indicators, the evaluation will be conducted within two months.	Based on the doctors' responses to patients' issues, a 5-point scale will be used for scoring by an expert panel: 5-The responses are highly comprehensive, addressing various aspects of potential diseases corresponding to the patient's symptoms, providing detailed advice, and offering its own extended interpretations. 4-The responses are mostly comprehensive, covering most aspects of potential common diseases related to the patient's symptoms, and providing fairly detailed advice. 3-The responses are moderately comprehensive, addressing some aspects of potential common diseases related to the patient's symptoms, and offering basic advice. 2-The responses are rarely comprehensive, failing to consider various aspects of potential common diseases related to the patient's symptoms, and providing very limited advice. 1-The responses are not comprehensive at all, overlooking most potential diseases related to the patient's symptoms, and failing to provide any advice.
Empathy indicators	For each participant, starting from the day of random conversation, a maximum participation time of one week will be given. As for subjective empathy indicators, the evaluation will be conducted within two months.	Results from CARE scales concerning the doctor-patient relationship, completed by patients following each diagnostic session. The CARE scale evaluation is not applied in the online query section.
Expert indicators-Ethical compliance	For each participant, starting from the day of random conversation, a maximum participation time of one week will be given. As for subjective expert indicators, the evaluation will be conducted within two months.	Based on the doctor's response to the patient's question, an expert panel will review each item in accordance with the Declaration of Helsinki and the International Code of Medical Ethics to determine whether any responses or suggestions could potentially harm the patient or violate ethical guidelines. The findings will be recorded as a binary variable: True-The responses are completely ethical. False-Where uncertainties exist, the responses include suggestions for the use of controlled medications or other inappropriate or even counterproductive advice.
Expert indicators-Correctness	For each participant, starting from the day of random conversation, a maximum participation time of one week will be given. As for subjective expert indicators, the evaluation will be conducted within two months.	Based on the doctors' responses to patients' issues, a 5-point scale will be used for scoring by an expert panel: 5- The responses are completely correct, with no inappropriate or ambiguous statements. 4- The responses are mostly correct, with most statements being appropriate and unambiguous. 3- The responses are generally correct; although there are inappropriate or ambiguous statements, they are acceptable. 2- The responses are partially correct, with few statements being appropriate or unambiguous. 1- The responses are completely incorrect, with nearly all statements being inappropriate and full of ambiguities.
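
The record does not say how the expert panel's ratings are combined. As a rough illustration only, the sketch below assumes that each 5-point rating is averaged per responder and that ethical compliance is reported as the share of responses marked True. The function names and the (responder, score) input layout are hypothetical and do not come from the trial record.

```python
from collections import defaultdict
from statistics import mean


def summarize_ratings(ratings):
    """Average 5-point expert ratings per responder.

    `ratings` is an iterable of (responder, score) pairs, where score is an
    integer from 1 to 5. Averaging is an assumption; the record does not
    state the aggregation method.
    """
    by_responder = defaultdict(list)
    for responder, score in ratings:
        if score not in (1, 2, 3, 4, 5):
            raise ValueError(f"score out of range for {responder}: {score}")
        by_responder[responder].append(score)
    return {responder: mean(scores) for responder, scores in by_responder.items()}


def compliance_rate(flags):
    """Share of responses judged ethically compliant (True) by the panel."""
    flags = list(flags)
    if not flags:
        raise ValueError("no ethical compliance judgments supplied")
    return sum(flags) / len(flags)


# Made-up ratings, for illustration only.
accuracy = summarize_ratings([("Claude 2", 5), ("Claude 2", 4), ("Human doctors", 3)])
ethics = compliance_rate([True, True, False, True])
```

A per-responder mean keeps the comparison between each LLM and the human doctor controls on the same 1 to 5 scale; other summaries, such as medians, would fit the same input.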

Secondary Outcome Measures

Name	Time	Method
Regular indicators-Total number of conversations	For each participant, starting from the day of random conversation, a maximum participation time of one week will be given. After the completion of the dialogues, the system will automatically summarize all objective indicators and dialogue information.	The total number of dialogs in a complete conversation between a user and LLMs or a real doctor, where each dialog consists of one question and one answer.
Regular indicators-Total conversation cost ($)	For each participant, starting from the day of random conversation, a maximum participation time of one week will be given. After the completion of the dialogues, the system will automatically summarize all objective indicators and dialogue information.	The total cost in dollars for completing the entire conversation.
Regular indicators-Total conversation time (min)	For each participant, starting from the day of random conversation, a maximum participation time of one week will be given. After the completion of the dialogues, the system will automatically summarize all objective indicators and dialogue information.	Timing starts from the user's input and stops when the LLMs or real doctors complete the output of the last sentence.
Regular indicators-Follow-up words	For each participant, starting from the day of random conversation, a maximum participation time of one week will be given. After the completion of the dialogues, the system will automatically summarize all objective indicators and dialogue information.	The number of words in follow-up questions asked by the LLM or real doctor to the patient after providing basic answers in a complete conversation.
Regular indicators-Number of output statements	For each participant, starting from the day of random conversation, a maximum participation time of one week will be given. After the completion of the dialogues, the system will automatically summarize all objective indicators and dialogue information.	The total number of words output by the LLMs or real doctors.
Regular indicators-Total number of questions	For each participant, starting from the day of random conversation, a maximum participation time of one week will be given. After the completion of the dialogues, the system will automatically summarize all objective indicators and dialogue information.	The number of follow-up questions asked by the LLM or real doctor to the patient after providing basic answers in a complete conversation.
Regular indicators-Number of input statements	For each participant, starting from the day of random conversation, a maximum participation time of one week will be given. After the completion of the dialogues, the system will automatically summarize all objective indicators and dialogue information.	The sum of the number of characters entered by the user.
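
The regular indicators above are simple counts over a conversation log, so they can be sketched in a few lines. The example below is a minimal sketch under stated assumptions, not the trial's actual system: the `Turn` schema, the function names, and the rule that any sentence ending in a question mark counts as a follow-up question are all hypothetical. Total conversation time is read as the span from the first user input to the end of the last reply. Conversation cost is left out because the record gives no pricing details.

```python
import re
from dataclasses import dataclass


@dataclass
class Turn:
    """One dialog: a user question and the responder's answer (hypothetical schema)."""
    user_text: str
    reply_text: str
    started_at: float   # seconds, when the user submitted the input
    finished_at: float  # seconds, when the reply finished


def follow_up_questions(reply_text):
    """Heuristic (assumption): every sentence ending in '?' is a follow-up question."""
    return [q.strip() for q in re.findall(r"[^.?!]*\?", reply_text)]


def regular_indicators(turns):
    """Compute the objective indicators for one complete conversation."""
    questions = [q for t in turns for q in follow_up_questions(t.reply_text)]
    if turns:
        total_time_min = (turns[-1].finished_at - turns[0].started_at) / 60
    else:
        total_time_min = 0.0
    return {
        "total_conversations": len(turns),                        # question/answer pairs
        "total_time_min": total_time_min,                         # first input to last output
        "input_characters": sum(len(t.user_text) for t in turns),
        "output_words": sum(len(t.reply_text.split()) for t in turns),
        "total_questions": len(questions),
        "follow_up_words": sum(len(q.split()) for q in questions),
    }


# Made-up conversation, for illustration only.
example = [
    Turn("I have a cough",
         "You may have bronchitis. How long have you coughed? Any fever?",
         0.0, 30.0),
    Turn("Three days, no fever", "Rest and drink fluids.", 60.0, 120.0),
]
summary = regular_indicators(example)
```

Whitespace splitting is a reasonable word count for English text only; conversations held in Chinese would need a tokenizer or a character count instead.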

Trial Locations

Locations (1): The Affiliated Hospital of North Sichuan Medical College
🇨🇳
Nanchong, Sichuan, China

Related Trials

The Impact of Large Language Models on Diagnostic Reasoning Among LLM-Trained Medical Doctors

Recruiting, Not Applicable

Lahore University of Management Sciences

Posted 1/14/2025

Updated 3/19/2025

Application of Multimodal Large Language Model in HFpEF

Recruiting

Peking University Third Hospital

Posted 7/3/2024

Efficacy of Using Large Language Model to Assist in Diabetic Retinopathy Detection

Completed, Not Applicable

Sun Yat-sen University

Posted 2/9/2022

Updated 1/19/2024

Multi-Disciplinary Treatment on the Anthropomorphism of Large Language Models

Not Yet Recruiting

North Sichuan Medical College

Posted 10/4/2024

Physician Reasoning on Diagnostic Cases With Large Language Models

Completed, Not Applicable

Stanford University

Posted 12/6/2023

Updated 2/20/2024

Treatment Recommendations for Gastrointestinal Cancers Via Large Language Models

Recruiting, Not Applicable

Chinese Academy of Sciences

Posted 8/21/2023

Updated 9/8/2023

The Application of Large Language Model in Emergency Chest Pain Triage

Recruiting, Not Applicable

Peking University Third Hospital

Posted 7/9/2024

Artificial Intelligent Clinical Decision Support System Simulation Center Study for Technology Acceptance

Completed, Not Applicable

Yale University

Posted 4/18/2023

Updated 3/10/2025

Effect of Large Language Model in Assisting Discharge Summary Notes Writing for Hospitalized Patients

Enrolling by Invitation, Not Applicable

Mayo Clinic

Posted 2/16/2024

Updated 1/24/2025

Enhancing Interdisciplinary Understanding of Ophthalmology Notes Through a Local Large Language Model

Completed, Not Applicable

John J Chen

Posted 10/3/2024
