OpenEvidence Safety and Comparative Efficacy of Four LLMs in Clinical Practice

Not yet recruiting
Conditions
AI (Artificial Intelligence)
Large Language Model
Generative Artificial Intelligence
Registration Number
NCT07199231
Lead Sponsor
Cambridge Health Alliance
Brief Summary

OpenEvidence is an online tool that aggregates and synthesizes data from peer-reviewed medical studies, then produces a response to a user's question using generative AI. While it is in use by a number of clinicians (including residents) today, there is little to no published data on whether the tool's outputs are accurate or whether this information appropriately informs clinical decision making. Similarly, a number of clinicians are turning to other large language models (LLMs) to assist in decision making when providing clinical care. While a number of studies have been published on the accuracy of these LLMs' responses to medical board questions or clinical vignettes, few studies to date have examined their performance in a real-world clinical setting, and even fewer have compared that performance across tools.

In this study, investigators have two goals:

1. To determine whether the use of the AI tool "OpenEvidence" leads to clinically appropriate decisions when utilized by family medicine, internal medicine, and psychiatry residents in the course of clinical practice.

2. To determine how the output of the OpenEvidence tool compares with three other commonly-used, publicly-available large language models (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini) in answering common questions that residents have in the course of clinical practice.

To accomplish study goal #1, investigators have enlisted residents in the above specialties to use the OpenEvidence tool in the course of clinical practice. To mitigate safety risks, the residents will also consult a conventional reference tool, referred to as the "Gold Standard" tool, for each question; these tools include PubMed and UpToDate. The residents will:

1. State their clinical question.

2. Query OpenEvidence, capturing their prompt and the OpenEvidence output for data analysis. All residents will undergo training in prompt engineering at the start of the study.

3. State their clinical conclusion based on the OpenEvidence data.

4. Query the Gold Standard Resource.

5. State their final clinical conclusion.

6. Answer a question on whether their clinical conclusion was modified by the Gold Standard reference.

7. Answer a question on whether they had any clinical safety concerns on the output from OpenEvidence.

Attending physician Subject Matter Experts (SMEs) matched by specialty with at least 5 years of post-training clinical experience will then evaluate the residents' responses. Five years was chosen based on the book "Outliers" by Malcolm Gladwell, in which he asserts that 10,000 hours of focused practice is needed to achieve expertise in a field.

SMEs will be asked to evaluate the residents' initial clinical questions and their conclusions based only on OpenEvidence. They will be asked to rate the clinical appropriateness of those conclusions on a scale of 1-10. For questions where the SMEs rate the clinical appropriateness of the residents' conclusions poorly (< 5/10), they will be asked to review the OpenEvidence output and answer an additional question as to whether the output itself was incorrect or the resident misinterpreted the output from the tool.

To accomplish goal #2, the initial prompt entered by the residents into OpenEvidence will be copied by the research team into ChatGPT, Gemini, and Claude. The outputs from each tool (including OpenEvidence) will be surfaced to SMEs, who will be asked to rate each output based on accuracy, completeness, and bias. Likert scales will be used for these ratings. SMEs will also be asked an open-ended question to identify any patient safety issues from any of the outputs.

Detailed Description

OpenEvidence is an online tool built out of the Mayo Clinic Platform Accelerate [OpenEvidence] that aggregates and synthesizes data from peer-reviewed medical studies, then generates responses to user questions using generative AI. Although increasingly utilized by both seasoned clinicians and trainees, there is a notable absence of published data regarding the accuracy of the tool's outputs and their safety and efficacy in appropriately informing clinical decision-making. Concurrently, a growing number of clinicians are leveraging other publicly-available large language models (LLMs) to support decision-making in clinical care. While a number of studies have examined the accuracy of LLM responses to medical board questions or clinical vignettes, there is limited research on their performance in real-world clinical settings, and even fewer studies offer comparative analyses of this performance.

In a review of the literature, one article suggests LLMs may be better at detecting anxiety than practitioners, but this was based on clinical vignettes. [Levkovich et al.] Another examined the diagnostic sensitivity of LLMs using patient-reported outcome measures in a structured questionnaire. [Pagano et al.] An additional study comparing LLMs for oncology also used fictional vignettes. [Benary et al.] A randomized controlled trial using clinical vignettes did not show any clinical improvement for providers who had access to LLMs. [Goh et al.] One case study explored integration of ChatGPT 3.5 into daily rounds and evaluated its use qualitatively, but did not compare it with other LLMs or gold standard reference tools. [Skryd et al.] Another compared ChatGPT's responses to American College of Radiology appropriateness criteria for breast pain and breast cancer screening, but again did not compare it with other LLMs. [Rao et al.] In our review, only one study evaluated LLMs in a real-world clinical setting: a series of papers examining their use for complex decision making in breast-cancer care, using a small number of actual cases and a standardized prompt template. [Griewing, Knitza et al.; Griewing, Gremke et al.] That study found issues with consistency and deterioration of accuracy (particularly with GPT 3.5), leading the authors to conclude that the clinical use of LLMs for that purpose was not yet feasible at the time of publication. Still, health system leaders see the use of these tools rapidly accelerating in clinical practice. For this reason, investigators believe it is imperative to study their safety and the clinical appropriateness of the decisions clinicians are making as a result of their use.

Cambridge Health Alliance (CHA) is a public, academic safety-net health system in the Boston area, serving a diverse population of patients. CHA has a robust primary care and outpatient psychiatry footprint, and supports a large graduate medical education program through both Harvard Medical School and Tufts University School of Medicine. Investigators chose residents as the primary study participants because many trainees are already using OpenEvidence, and found them more motivated to participate if given access to the tool at CHA (where it is otherwise blocked from network services and prohibited by policy until the results of this study are available).

Study outcomes are as follows:

Determine whether the use of OpenEvidence leads to clinically appropriate decisions by residents in the course of clinical practice in a community health setting.

Determine how the output of OpenEvidence compares with three other commonly-used, publicly-available large language models (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini) in accuracy, completeness, and bias when addressing clinical questions residents have in the course of clinical practice in a community health setting.

Methods:

Data collection is planned to take place over a 6-month period in order to minimize vendor version upgrades during the study period. Residents are grouped by specialty into "medicine" (internal medicine/family medicine) and "psychiatry" (adult/child psychiatry). In order to simplify matching to appropriate specialty subject matter experts, medicine residents are asked to use OpenEvidence only for adult primary care cases (excluding OB/GYN-related issues). Psychiatry residents are asked to use OpenEvidence only for adult psychiatry cases.

Before being accepted as participants, trainees were all asked to agree to the following:

* Cross-check any OpenEvidence query against a Gold Standard tool, defined in the study protocol to include PubMed, UpToDate, DynaMed, a clinical specialty society guideline, or another similar clinical reference source that must be documented in the study form.

* Not enter any protected health information (PHI) into OpenEvidence (as defined below).

* Attempt to use OpenEvidence at least 3 times per week, if appropriate to the clinical care of the patient.

* Document 100% of their OpenEvidence queries into the study research form to avoid selection bias.

All residents will be given brief training in prompt engineering for healthcare before data collection begins. Standardized prompts will not be used, as one subgoal of the study is to understand what types of queries residents submit to OpenEvidence in a real-world setting.

All residents will be educated on the definition of PHI, as follows:

Queries should not include any PHI, as defined by the Safe Harbor identifiers [HHS]; queries can include patient age in years (days/weeks/months for pediatrics) and legal sex; for patients age 89 or older, the user must instead use the term "over age 89" to comply with Safe Harbor standards.

Queries should not include patients suspected of having extremely rare conditions as defined by the National Organization for Rare Disorders, as these are also prone to reidentification. [NORD] If a rare condition is not initially suspected but becomes suspected through the process of using the AI tool, the user will be asked to stop their query at that point.
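To illustrate the age-handling rule above, the following is a minimal sketch of a hypothetical helper (the function name and behavior are illustrative only and are not part of the study protocol); ages of 89 or older are collapsed to the phrase required by the protocol:

```python
# Hypothetical helper illustrating the study's Safe Harbor age rule:
# ages of 89 or older are replaced with the literal phrase "over age 89";
# younger ages may be stated in years (pediatric day/week/month handling
# would be added upstream and is omitted here).
def safe_harbor_age(age_years: int) -> str:
    """Return an age string suitable for inclusion in an AI query."""
    if age_years >= 89:
        return "over age 89"
    return f"{age_years} years old"

print(safe_harbor_age(45))  # -> "45 years old"
print(safe_harbor_age(91))  # -> "over age 89"
```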

Data collection will involve the use of a HIPAA-compliant Google Form within CHA's enterprise Google Workspace for Health cloud infrastructure. The data collection form will ask trainees to do the following:

1. Enter their initial clinical question.

2. Paste their OpenEvidence prompt (numbered sequentially for iterative prompts) and the full OpenEvidence output(s) generated.

3. Enter their clinical conclusion based on the OpenEvidence output.

4. Enter the Gold Standard reference tool used.

5. State their final clinical conclusion based on information from both OpenEvidence and the Gold Standard Reference tool.

6. Answer a question on the extent to which their initial clinical conclusion was modified by the Gold Standard reference.

7. Answer an open-ended question on whether they noticed any clinical safety issues, inaccuracies, or bias in the output from OpenEvidence.

Queries will be sorted by specialty (medicine vs. psychiatry), and each query will receive a sequential study number.

Attending physician Subject Matter Experts (SMEs) Board Certified in Internal Medicine, Family Medicine, or Psychiatry with at least 5 years of post-training clinical experience were recruited. Five years of post-training clinical experience was chosen based on Malcolm Gladwell's assertion, in his book Outliers, that 10,000 hours of focused practice is needed to achieve expertise in a field.

SMEs will be asked to evaluate the residents' initial clinical questions and their conclusions based only on OpenEvidence. They will be asked to rate the clinical appropriateness of those conclusions on a 10-point Likert scale. SMEs will also be provided with the OpenEvidence output for each query. Where the SME rates the clinical appropriateness of a resident's conclusion poorly (< 5/10), the SME will be asked a follow-up question to assess whether the tool's output itself was clinically inappropriate, in order to ascertain whether the trainee may have misinterpreted the tool's output. SME review will include a 2.5-5% overlap between reviewers to calculate a kappa score for interrater reliability.
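As a rough illustration of the interrater reliability calculation, the sketch below computes a kappa statistic on a hypothetical overlap sample using scikit-learn; the use of quadratic weighting for the ordinal 10-point scale is an assumption, since the protocol specifies only a kappa score.

```python
# Sketch: interrater reliability on the overlapping SME reviews.
# sme_a and sme_b are hypothetical 10-point clinical appropriateness
# ratings from two SMEs for the same set of overlap queries.
from sklearn.metrics import cohen_kappa_score

sme_a = [8, 7, 3, 9, 6, 10, 4]
sme_b = [7, 7, 4, 9, 5, 10, 5]

# Quadratic weighting penalizes larger disagreements more heavily,
# which is common for ordinal scales (an assumption, not protocol text).
kappa = cohen_kappa_score(sme_a, sme_b, weights="quadratic")
print(f"Weighted kappa on overlap sample: {kappa:.2f}")
```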

In part two of the study, the research team will sort the OpenEvidence queries into themes and choose a random sample of queries from each specialty and theme for comparison between LLMs. The research team will confirm that the prompts do not include any PHI according to the study protocol. They will then copy the OpenEvidence prompts entered by residents for the selected queries and paste them exactly into ChatGPT, Gemini, and Claude.
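A minimal sketch of how this stratified random sampling might be performed is shown below; the theme labels, field names, and sample size per stratum are illustrative assumptions, since the protocol does not fix them.

```python
# Sketch: randomly sample de-identified OpenEvidence queries,
# stratified by specialty and theme, for the LLM comparison.
import random
from collections import defaultdict

queries = [  # hypothetical de-identified study records
    {"id": 1, "specialty": "medicine", "theme": "cardiology", "prompt": "..."},
    {"id": 2, "specialty": "medicine", "theme": "cardiology", "prompt": "..."},
    {"id": 3, "specialty": "psychiatry", "theme": "mood disorders", "prompt": "..."},
]

strata = defaultdict(list)
for q in queries:
    strata[(q["specialty"], q["theme"])].append(q)

random.seed(42)                 # reproducible selection
SAMPLES_PER_STRATUM = 1         # illustrative; not specified by the protocol
selected = []
for group in strata.values():
    selected.extend(random.sample(group, min(SAMPLES_PER_STRATUM, len(group))))

print([q["id"] for q in selected])
```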

The outputs of each of the four tools (OpenEvidence, ChatGPT, Gemini, and Claude) will be surfaced in a Google webform. SMEs will be asked to rate each output on a Likert scale for accuracy, completeness, and bias, as well as to answer a qualitative question identifying any patient safety issues in the output.

Results:

Primary outcome results will be reported as follows:

Clinical appropriateness of decisions made by residents using OpenEvidence (mean with SD, median), by specialty

* If, in cases of low clinical appropriateness, SMEs determine that this was due not to the tool's output but to the resident's interpretation of the tool, metrics will also be provided with these cases excluded

* Interrater reliability (kappa value)

Secondary outcome results will be reported as follows:

For each specialty and each variable (accuracy, completeness, and bias), investigators will report:

* The "win" rate for each LLM and average margin from the second place score.

* The effect size (using each LLM as a separate reference), calculated with Cohen's d.

* Interrater reliability (kappa value)
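As a minimal sketch of how the win rate and the margin from the second-place model could be computed, the code below uses hypothetical scores for one specialty and one variable; the tie-splitting follows the scoring rule described under the secondary outcome measures below.

```python
# Sketch: per-query "win" scoring with tie splitting, plus win rate and
# average margin from the second-place model, for one specialty and one
# variable (e.g., accuracy). All scores are hypothetical.
from collections import defaultdict

queries = [
    {"OpenEvidence": 9, "ChatGPT": 8, "Gemini": 7, "Claude": 9},
    {"OpenEvidence": 6, "ChatGPT": 8, "Gemini": 8, "Claude": 5},
    {"OpenEvidence": 10, "ChatGPT": 7, "Gemini": 9, "Claude": 8},
]

wins = defaultdict(float)
margins = defaultdict(list)

for scores in queries:
    top = max(scores.values())
    leaders = [m for m, s in scores.items() if s == top]
    for m in leaders:
        wins[m] += 1.0 / len(leaders)               # split the point on ties
    runner_up = sorted(scores.values(), reverse=True)[1]
    for m in leaders:
        margins[m].append(top - runner_up)          # 0 when 1st and 2nd tie

for model in queries[0]:
    win_rate = wins[model] / len(queries)
    avg_margin = (sum(margins[model]) / len(margins[model])) if margins[model] else 0.0
    print(f"{model}: win rate {win_rate:.0%}, average margin {avg_margin:.2f}")
```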

Recruitment & Eligibility

Status
NOT_YET_RECRUITING
Sex
All
Target Recruitment
20
Inclusion Criteria
  • Active trainees PGY-1 through PGY-6 in Internal Medicine, Family Medicine, Adult Psychiatry, or Child Psychiatry at Cambridge Health Alliance
  • Must agree to the study protocol requirements outlined in the study description.
Exclusion Criteria
  • Anyone who does not meet inclusion criteria
  • Residents who plan to leave CHA prior to the end of the study collection period.

Study & Design

Study Type
OBSERVATIONAL
Study Design
Not specified
Primary Outcome Measures
Clinical Appropriateness: Mean with SD (Time Frame: 6 months)

Clinical appropriateness score of resident decisions based on OpenEvidence output.

This is a numeric score on a 10-point Likert scale. (For the Likert scales described in this and all of our outcome metrics, higher scores indicate better outcomes.)

The mean score with standard deviation will be used for the primary outcome.

Clinical Appropriateness: Median (Time Frame: 6 months)

Clinical appropriateness score of resident decisions based on OpenEvidence output.

This is a numeric score on a 10-point Likert scale.

Median clinical appropriateness scores will also be reported.

Clinical Appropriateness: Interrater Reliability (Time Frame: 6 months)

SMEs will evaluate Clinical Appropriateness scores of resident decisions based on OpenEvidence output on a 10-point Likert scale.

Interrater reliability of SME Clinical Appropriateness scores will be calculated using a kappa value.

Secondary Outcome Measures
Comparative Completeness: Margin of Win (Time Frame: 6 months)

Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for completeness on a 10-point Likert scale.

For each specialty (medicine and psychiatry), we will calculate the win rate for each LLM on completeness and then also report:

Average margin from the 2nd-place LLM for COMPLETENESS.

* If 1st and 2nd place tie, the average margin will be reported as 0.

Comparative Accuracy of LLMs: Win Rate (Time Frame: 6 months)

Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for accuracy on a 10-point Likert scale.

The winning model for a given query will be given a score of 1. If there is a tie, the tied models will split the point (e.g., 0.5 each for a two-way tie, 0.33 each for a three-way tie).

For each specialty (medicine and psychiatry) and each LLM, we will report the "win rate" along with the average margin from the second-place LLM.

For example, for all medicine queries, we will report the percentage of times OpenEvidence "won" over the other LLMs on the accuracy Likert scale. (Where there is SME overlap to determine the kappa value, we will average the SMEs' scores.)

As it pertains to ACCURACY, we will report the WIN RATE as a percentage.

Comparative Accuracy: Margin of Win (Time Frame: 6 months)

Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for accuracy on a 10-point Likert scale.

For each specialty (medicine and psychiatry), we will calculate the win rate for each LLM on accuracy and then also report:

Average margin from the 2nd-place LLM for ACCURACY.

* If 1st and 2nd place tie, the average margin will be reported as 0.

Comparative Accuracy: Effect Size (Time Frame: 6 months)

Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for accuracy on a 10-point Likert scale.

For each specialty (medicine and psychiatry), we will use Cohen's d to calculate the effect size of each LLM's average ACCURACY score in comparison to each of the other three LLMs.

An effect size of 0.2 is considered small, 0.5 medium, and 0.8 large.
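A minimal sketch of this calculation for one pairwise comparison is shown below; it uses the standard pooled-standard-deviation form of Cohen's d with hypothetical accuracy ratings, and treats the two score lists as independent samples (a simplification, since in the study the same queries are rated for all four tools).

```python
# Sketch: Cohen's d for one pairwise comparison (e.g., OpenEvidence vs.
# ChatGPT accuracy within one specialty). All ratings are hypothetical.
import statistics

open_evidence = [9, 8, 7, 10, 8, 9]
chatgpt = [8, 7, 7, 9, 6, 8]

n1, n2 = len(open_evidence), len(chatgpt)
mean_diff = statistics.mean(open_evidence) - statistics.mean(chatgpt)

# Pooled standard deviation across the two groups.
pooled_sd = (((n1 - 1) * statistics.stdev(open_evidence) ** 2 +
              (n2 - 1) * statistics.stdev(chatgpt) ** 2) / (n1 + n2 - 2)) ** 0.5

cohens_d = mean_diff / pooled_sd
print(f"Cohen's d (OpenEvidence vs. ChatGPT, accuracy): {cohens_d:.2f}")
```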

Comparative Accuracy of LLMs: Interrater Reliability (Time Frame: 6 months)

Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for accuracy on a 10-point Likert scale.

Interrater reliability of Accuracy scores will be calculated using a kappa value.

Comparative Completeness: Effect Size (Time Frame: 6 months)

Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for completeness on a 10-point Likert scale.

For each specialty (medicine and psychiatry), we will use Cohen's d to calculate the effect size of each LLM's average COMPLETENESS score in comparison to each of the other three LLMs.

An effect size of 0.2 is considered small, 0.5 medium, and 0.8 large.

Comparative Completeness of LLMs: Interrater Reliability (Time Frame: 6 months)

Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for completeness on a 10-point Likert scale.

Interrater reliability of Completeness scores will be calculated using a kappa value.

Comparative Completeness of LLMs: Win Rate (Time Frame: 6 months)

Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for COMPLETENESS on a 10-point Likert scale.

The winning model for a given query will be given a score of 1. If there is a tie, the tied models will split the point (e.g., 0.5 each for a two-way tie, 0.33 each for a three-way tie).

For each specialty (medicine and psychiatry) and each LLM, we will report the "win rate" along with the average margin from the second-place LLM.

For example, for all medicine queries, we will report the percentage of times OpenEvidence "won" over the other LLMs on the completeness Likert scale. (Where there is SME overlap to determine the kappa value, we will average the SMEs' scores.)

As it pertains to COMPLETENESS, we will report the WIN RATE as a percentage.

Comparative Bias of LLMs: Win Rate (Time Frame: 6 months)

Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for signs of bias on a 10-point Likert scale.

The winning model for a given query will be given a score of 1. If there is a tie, the tied models will split the point (e.g., 0.5 each for a two-way tie, 0.33 each for a three-way tie).

For each specialty (medicine and psychiatry) and each LLM, we will report the "win rate" along with the average margin from the second-place LLM.

For example, for all medicine queries, we will report the percentage of times OpenEvidence "won" over the other LLMs on the bias Likert scale. (Where there is SME overlap to determine the kappa value, we will average the SMEs' scores.)

As it pertains to BIAS, we will report the WIN RATE as a percentage.

Comparative Bias: Margin of Win (Time Frame: 6 months)

Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for bias on a 10-point Likert scale.

For each specialty (medicine and psychiatry), we will calculate the win rate for each LLM on bias and then also report:

Average margin from the 2nd-place LLM for BIAS.

* If 1st and 2nd place tie, the average margin will be reported as 0.

Comparative Bias: Effect Size (Time Frame: 6 months)

Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for bias on a 10-point Likert scale.

For each specialty (medicine and psychiatry), we will use Cohen's d to calculate the effect size of each LLM's average BIAS score in comparison to each of the other three LLMs.

An effect size of 0.2 is considered small, 0.5 medium, and 0.8 large.

Comparative Bias of LLMs: Interrater Reliability (Time Frame: 6 months)

Comparing outputs of OpenEvidence, ChatGPT, Gemini, and Claude, SMEs will rate each tool for bias on a 10-point Likert scale.

Interrater reliability of Bias scores will be calculated using a kappa value.
