A new machine learning model that uses the expression of eight genes associated with CD4+ conventional T cells (CD4Tconv) has shown it can predict the prognosis of colorectal cancer (CRC) patients. The study, published in Scientific Reports, highlights the model's potential to identify high-risk individuals who may benefit from more aggressive treatment strategies.
The research team analyzed single-cell sequencing data from CRC samples, identifying distinct immune cell subtypes, including CD4Tconv cells. These cells play a crucial role in T cell-mediated immune responses, expressing signature genes like CD3D, CD3E, and CD4. Further analysis revealed 172 differentially expressed genes (DEGs) associated with CD4Tconv cells in CRC, with IFNG and TNF identified as the most interconnected core genes.
Prognostic Model Development and Validation
Univariate Cox regression analysis identified eight genes (HSPA1A, CXCR5, CTSD, PTGER2, FGF12, APOD, TP63, and LGALS4) whose differential expression correlated significantly with CRC patient outcomes. These genes were then fed into a machine learning ensemble framework that used leave-one-out cross-validation (LOOCV) to build 101 models. The Elastic Net model (Enet with α = 0.8) performed best, achieving an average C-index of 0.604, where 0.5 corresponds to random ranking of patients and 1.0 to perfect ranking.
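To make the C-index figure concrete, here is a minimal sketch of Harrell's concordance index for right-censored survival data. It is not the study's code: the function name and the toy follow-up times, event flags, and scores are hypothetical, and the article does not say which C-index variant the authors used.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times:       observed follow-up time per patient
    events:      1 if the event (e.g. death) was observed, 0 if censored
    risk_scores: model output, where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the shorter
        # follow-up ended in an observed event rather than censoring.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier event had the higher risk score
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical toy cohort, for illustration only.
toy_times = [5, 8, 12, 20, 25]
toy_events = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.4, 0.1, -0.3]
toy_c_index = concordance_index(toy_times, toy_events, toy_scores)
```

In this toy cohort, 7 of the 8 comparable patient pairs are ranked correctly, so the index is 0.875. A value of 0.604 means the model orders roughly 60% of comparable pairs correctly.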
Risk scores (RS) were calculated for each patient based on the expression of the eight genes. Patients were categorized into high- and low-risk groups using a cutoff of zero. Kaplan-Meier survival curves demonstrated that high-risk patients had significantly worse prognoses in both the TCGA training set and the GEO validation set, confirming the model's consistency and stability.
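The article does not give the risk score formula. The sketch below assumes the form commonly used with Cox-type models, a weighted sum of gene expression values, followed by the split at zero described above. The coefficients are hypothetical placeholders and not the study's fitted values. Standardized expression inputs and a strict "greater than zero" rule for the high-risk group are also assumptions.

```python
GENES = ["HSPA1A", "CXCR5", "CTSD", "PTGER2", "FGF12", "APOD", "TP63", "LGALS4"]

# HYPOTHETICAL coefficients for illustration only. They are NOT the values
# fitted in the study and imply nothing about each gene's real effect.
HYPOTHETICAL_COEFS = {
    "HSPA1A": 0.2, "CXCR5": -0.1, "CTSD": 0.1, "PTGER2": -0.2,
    "FGF12": 0.3, "APOD": 0.1, "TP63": 0.2, "LGALS4": -0.3,
}


def risk_score(expression, coefs):
    """Linear risk score: sum of coefficient * expression over the model genes."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


def risk_group(score, cutoff=0.0):
    """Assign a patient to the high- or low-risk group at the given cutoff."""
    return "high" if score > cutoff else "low"


# Two hypothetical patients with standardized (z-scored) expression values.
patient_a = {g: (1.0 if HYPOTHETICAL_COEFS[g] > 0 else -1.0) for g in GENES}
patient_b = {g: -value for g, value in patient_a.items()}

score_a = risk_score(patient_a, HYPOTHETICAL_COEFS)
score_b = risk_score(patient_b, HYPOTHETICAL_COEFS)
group_a = risk_group(score_a)
group_b = risk_group(score_b)
```

With these placeholder numbers, patient A scores about 1.5 and falls in the high-risk group, while patient B scores about -1.5 and falls in the low-risk group.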
Clinical Relevance and Immune Microenvironment
Further analysis revealed that age, cancer stage, and RS were significantly associated with survival (P < 0.05). ROC curve analysis showed that the risk score had the highest predictive power, with an AUC value of 0.705. Calibration analysis affirmed the model's robustness in forecasting survival rates at 1, 3, and 5 years, with a C-index of 0.781 (95% confidence interval: 0.733–0.829).
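For readers unfamiliar with AUC, the sketch below computes it for a plain binary outcome as the probability that a randomly chosen patient with the event has a higher score than a randomly chosen patient without it. This is a simplification: it ignores censoring, and the article does not state whether the study used a time-dependent ROC method. The function name and toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for patients who had the event, 0 for those who did not
    scores: predictor values, where higher means higher predicted risk
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive patient ranked above negative patient
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_risk = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1]
toy_auc = roc_auc(toy_labels, toy_risk)
```

Here 7 of the 9 positive-negative pairs are ordered correctly, giving an AUC of about 0.78. An AUC of 0.705 means roughly 70% of such pairs are ordered correctly.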
Gene set enrichment analysis (GSEA) identified distinct biological pathways in the high- and low-risk groups. The high-risk group showed enrichment of gene sets linked to the immune response and extracellular matrix remodeling, while the low-risk group exhibited enrichment of gene sets related to intracellular metabolism and protein synthesis.
Immune infiltration analysis revealed a positive association between immune cell infiltration and RS, particularly for macrophages (M0, M1, and M2 subtypes), CD4+ T cells, and CD8+ T cells. Conversely, plasma B cells showed a negative correlation with RS. Expression levels of immune checkpoint-related genes, including CD274 (PD-L1), CTLA4, and PDCD1 (PD-1), were markedly elevated in the high-risk group.
Chemotherapy Sensitivity
Drug sensitivity analysis across several chemotherapy agents showed that the low-risk group had lower half-maximal inhibitory concentrations (IC50), indicating higher sensitivity. Notably, the low-risk group was more sensitive to oxaliplatin, a chemotherapy drug commonly used in CRC.
Gene Expression Validation
qRT-PCR and immunohistochemistry (IHC) confirmed the differential expression of the eight genes in CRC tissues. APOD and TP63 levels were higher in cancerous tissues, while HSPA1A, CXCR5, CTSD, PTGER2, FGF12, and LGALS4 levels were higher in normal tissues.
These findings suggest that the machine learning model based on CD4Tconv-related genes can effectively predict CRC prognosis and identify patients who may benefit from tailored treatment strategies, including chemotherapy and immunotherapy.