Researchers at UT Southwestern Medical Center have demonstrated that ChatGPT can dramatically reduce the time required to screen patients for clinical trial eligibility, cutting review times from an average of 40 minutes per patient record to as little as 1.4 minutes in some cases. However, the study, published in Machine Learning: Health, shows that human oversight remains essential because the AI models fall short at accurately identifying all eligible patients.
Performance Comparison Between AI Models
The research team, led by Dr. Mike Dohopolski, evaluated both ChatGPT-3.5 and ChatGPT-4 using data from 74 patients, including 35 already enrolled in a phase 2 cancer trial and 39 randomly selected ineligible patients. GPT-4 generally outperformed GPT-3.5, posting a median accuracy of 84%, although GPT-3.5 reached 91% under its best prompting conditions.
Using the self-discover prompting approach, GPT-4 achieved its highest Youden's Index of 0.73, demonstrating the best balance between sensitivity and specificity. The model maintained median accuracies of 94% and 85% across the two trial contexts examined, ahead of GPT-3.5's median accuracies of 87% and 72% in the same scenarios.
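For context, Youden's Index is defined as sensitivity plus specificity minus one, so it ranges from 0 (no better than chance) to 1 (perfect discrimination). The short Python sketch below shows how the metric is computed from screening counts; the counts are hypothetical and this is not the study's code.

```python
# Illustrative only: computing Youden's Index from screening counts.
# The counts below are hypothetical and are not taken from the study.

def youden_index(tp: int, fn: int, tn: int, fp: int) -> float:
    """J = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)   # share of eligible patients correctly flagged
    specificity = tn / (tn + fp)   # share of ineligible patients correctly excluded
    return sensitivity + specificity - 1

# Hypothetical example: 30 of 35 eligible and 33 of 39 ineligible patients
# classified correctly.
print(f"{youden_index(tp=30, fn=5, tn=33, fp=6):.2f}")  # 0.70
```

A value of 0.73 therefore corresponds to a screener whose sensitivity and specificity sum to 1.73.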
Cost and Time Efficiency Analysis
The time and cost differences between the two models were substantial. GPT-3.5 screening required 1.4 to 3.0 minutes per patient at a cost of $0.02 to $0.03 each, while GPT-4 took 7.9 to 12.4 minutes and cost $0.15 to $0.27 per patient. Despite GPT-4's higher price, both models represent substantial time savings compared with the roughly 40-minute manual review of each record.
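As a rough back-of-the-envelope check (my arithmetic, not figures from the paper), the reported per-patient ranges can be scaled to the 74-patient cohort:

```python
# Back-of-the-envelope scaling of the reported per-patient figures to the
# 74-patient cohort; the totals are illustrative, not taken from the paper.
patients = 74

scenarios = {
    "manual review": {"minutes": (40, 40), "dollars": (None, None)},
    "GPT-3.5":       {"minutes": (1.4, 3.0), "dollars": (0.02, 0.03)},
    "GPT-4":         {"minutes": (7.9, 12.4), "dollars": (0.15, 0.27)},
}

for name, s in scenarios.items():
    lo_min, hi_min = (m * patients for m in s["minutes"])
    time_str = f"{lo_min / 60:.1f}-{hi_min / 60:.1f} hours"
    if s["dollars"][0] is None:
        cost_str = "staff time only"
    else:
        lo_c, hi_c = (c * patients for c in s["dollars"])
        cost_str = f"${lo_c:.2f}-${hi_c:.2f}"
    print(f"{name:>13}: {time_str}, {cost_str}")
```

Under these assumptions, manual review of the full cohort would take roughly 49 hours, versus under 4 hours for GPT-3.5 and under 16 hours for GPT-4, with API costs of a few dollars to about twenty dollars.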
"LLMs like GPT-4 can help screen patients for clinical trials, especially when using flexible criteria," said Dohopolski. "They're not perfect, especially when all rules must be met, but they can save time and support human reviewers."
Critical Limitations in Patient Identification
Both AI models demonstrated a concerning pattern in their performance metrics. While achieving high specificity (median 100% for both models), their sensitivity remained problematically low. GPT-3.5 showed median sensitivity of 0%, while GPT-4 achieved only 16% median sensitivity, indicating both models struggle to correctly identify eligible patients despite being effective at ruling out ineligible ones.
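To see how perfect specificity can coexist with near-zero sensitivity, consider a degenerate toy screener that labels every patient ineligible. The sketch below applies the standard definitions to the 35/39 cohort split described earlier; it illustrates the metrics and is not a description of what the models actually did.

```python
# Toy illustration of how 100% specificity and 0% sensitivity can coexist
# with ~53% accuracy. This is a degenerate "label everyone ineligible"
# screener, not a description of the study's models.
eligible, ineligible = 35, 39   # cohort split reported in the study

tp, fn = 0, eligible            # no eligible patient is flagged
tn, fp = ineligible, 0          # every ineligible patient is correctly excluded

sensitivity = tp / (tp + fn)                       # 0.0
specificity = tn / (tn + fp)                       # 1.0
accuracy = (tp + tn) / (eligible + ineligible)     # 39/74, about 0.53

print(f"sensitivity={sensitivity:.0%}, specificity={specificity:.0%}, "
      f"accuracy={accuracy:.0%}")
```

In a roughly balanced cohort like this one, that pattern still yields accuracy in the low 50% range, which is why accuracy alone can mask a failure to surface eligible patients.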
When assessing patient eligibility for trial enrollment, GPT-3.5 achieved a median accuracy of 54% (95% CI, 50%-61%), with its best performance reaching 61.1% using the structured and expert guidance approaches. GPT-4 performed marginally better, with a median accuracy of 61% (95% CI, 54%-65%) and a highest accuracy of 65% using the chain-of-thought plus expert approach.
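The article does not say how these confidence intervals were derived; one common choice for a small cohort is a nonparametric bootstrap over per-patient outcomes, sketched below with hypothetical data rather than the study's analysis.

```python
# Illustrative bootstrap CI for screening accuracy. The paper's actual CI
# method is not specified here; the outcome data below are hypothetical.
import random
import statistics

def bootstrap_accuracy_ci(correct_flags, n_boot=10_000, seed=0):
    """95% bootstrap CI for accuracy; correct_flags is a 0/1 list per patient."""
    rng = random.Random(seed)
    n = len(correct_flags)
    resampled = [
        sum(rng.choices(correct_flags, k=n)) / n for _ in range(n_boot)
    ]
    cuts = statistics.quantiles(resampled, n=40)  # 2.5%, 5%, ..., 97.5%
    return cuts[0], cuts[-1]                      # 2.5th and 97.5th percentiles

# Hypothetical outcomes: 40 correct eligibility calls out of 74 patients.
flags = [1] * 40 + [0] * 34
low, high = bootstrap_accuracy_ci(flags)
print(f"accuracy ~ {sum(flags) / len(flags):.0%}, 95% CI ~ ({low:.0%}, {high:.0%})")
```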
Error Analysis Reveals Processing Challenges
Analysis of 42 misclassifications revealed two primary error types. The most common issue was improper processing of available information, accounting for 95% of GPT-4's errors and 71% of GPT-3.5's errors. This occurred when models correctly identified relevant text but misinterpreted details such as dates, locations, or clinical requirements.
The second error type, failure to identify relevant information, in which the models did not locate the text needed for an accurate response, was more prevalent in GPT-3.5 (29% of errors) than in GPT-4 (5%).
Clinical Trial Enrollment Crisis
The research addresses a critical problem in clinical research, as up to 20% of National Cancer Institute-affiliated trials fail due to inadequate patient enrollment. This failure not only inflates costs and delays results but also undermines the reliability of new treatment assessments.
Part of the challenge stems from valuable patient information buried in unstructured text within electronic health records, such as doctors' notes, which traditional machine learning software struggles to interpret. The researchers suggest that LLMs could help by flagging candidates for subsequent manual review, potentially addressing the capacity limitations that cause eligible patients to be overlooked.
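As an illustration of that flag-for-review idea, the sketch below wraps a single eligibility question around a clinical note using the OpenAI chat API. The model choice, criteria, prompt wording, and the flag_for_review helper are hypothetical and do not reproduce the study's pipeline.

```python
# Hypothetical sketch of an LLM "flag for manual review" pass over clinical
# notes. Assumes the OpenAI Python client; the prompts, model choice, and
# criteria are illustrative and are not taken from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = """\
1. Histologically confirmed diagnosis of the target cancer.
2. Age 18 or older.
3. No prior enrollment in a conflicting interventional trial.
"""

def flag_for_review(note_text: str, model: str = "gpt-4") -> bool:
    """Return True if the note should be routed to a human reviewer."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You screen clinical notes against trial criteria. "
                        "Answer only LIKELY ELIGIBLE or LIKELY INELIGIBLE."},
            {"role": "user",
             "content": f"Criteria:\n{CRITERIA}\nNote:\n{note_text}"},
        ],
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith("LIKELY ELIGIBLE")  # flagged notes go to a human
```

Patients flagged as likely eligible would then be queued for manual chart review rather than enrolled automatically, consistent with the authors' recommendation that LLMs complement rather than replace human screening.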
Implementation Recommendations
The study authors concluded that "LLM performance varies by prompt, with GPT-4 generally outperforming GPT-3.5, but at higher costs and longer processing times. LLMs should complement, not replace, manual chart reviews for matching patients to clinical trials."
The research team acknowledges several limitations, including concerns about ongoing costs with closed-source GPT models, lack of metadata extraction from clinical notes, and the need for specialized domain expertise to generate effective guidance. The single-institution patient sample with specific documentation styles may also limit generalizability to other healthcare settings.