Machine-Learning for cancer treatment: Guided covariate selection for TTE models developed from real world data with a small number of patients
Eleni Karatza (1), Apostolos Papachristos (2), Gregory Sivolapenko (2), Vangelis Karalis (1)
(1) Department of Pharmacy, School of Health Sciences, National and Kapodistrian University of Athens, Greece (2) Department of Pharmacy, School of Health Sciences, University of Patras, Rion-Patras, Greece
Introduction: Bevacizumab slows angiogenesis by inhibiting vascular endothelial growth factor A (VEGF-A). VEGF-A primarily, and Intercellular Adhesion Molecule 1 (ICAM-1) secondarily, are the main regulators of angiogenesis. Bevacizumab is currently regarded as essential for the treatment of metastatic colorectal cancer (mCRC). Although many factors have been proposed as predictive biomarkers of response to bevacizumab therapy, none has been established yet [1,2].
Current computational approaches for quantifying the effect of predictors on survival data rely mainly on Cox regression and on the inclusion of covariates in parametric time-to-event (TTE) models. However, these approaches require a large number of patients to give accurate results. Cox regression tends to overfit when the events per variable ratio (EPV) is low, while parametric TTE models are not robust to misspecification [3,4]. Nonetheless, in the real-life clinical setting decisions have to be made based on a limited amount of data. Thus, a computational method that gives results of acceptable precision with a small sample size is needed. In this study, TTE models for progression-free survival (PFS) and overall survival (OS) of patients with mCRC under bevacizumab therapy were developed. Various machine learning (ML) approaches were applied, and their results were evaluated in order to confirm the predictive value of the selected covariates.
Objectives:
- Investigate the applicability and performance of ML algorithms for the prediction of PFS and OS
- Identify the ML approach with the best performance and predictive capacity despite the restricted dataset used for training
- Develop a parametric TTE model for PFS and one for OS
- Evaluate the results obtained from the various ML algorithms in order to support covariate selection for the final model
- Identify significant predictive biomarkers of bevacizumab therapy
Methods: PFS and OS data were retrieved from an observational, prospective study of 46 patients after 5 years of follow-up. The survival data were right-censored, with 26 deaths for OS and 37 disease-progression events for PFS. Bevacizumab was administered at a dose of 5 mg/kg once every 2 weeks or at a dose of 7.5 mg/kg once every 3 weeks. Patients also received either irinotecan-based or oxaliplatin-based co-treatment. VEGF-A (V2578, V1154, V6341) and ICAM-1 (ICAM469, ICAM241) gene polymorphisms, age, gender, weight, dosing scheme and co-treatment (irinotecan or oxaliplatin) were investigated as possible covariates affecting PFS and OS. All covariates were tested as dichotomous variables, namely genes (wild type or mutant), weight (over or under 80 kg) and age (over or under 65 years). Independence of the covariates was verified using the chi-squared test and Fisher's exact test.
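To make the low-EPV setting concrete, the following minimal sketch computes the events per variable ratio from the event counts and candidate covariates listed above. It assumes that each dichotomous covariate counts as one variable; the function and variable names are illustrative and do not come from the study. Both values fall well below the rule of thumb of about 10 events per variable that is often quoted for Cox regression.

```python
# Event counts as reported in the Methods paragraph.
events = {"OS": 26, "PFS": 37}

# Candidate covariates as listed in the Methods paragraph.
# Assumption: each dichotomous covariate counts as one variable.
candidate_covariates = [
    "V2578", "V1154", "V6341",   # VEGF-A polymorphisms
    "ICAM469", "ICAM241",        # ICAM-1 polymorphisms
    "age", "gender", "weight", "dosing_scheme", "co_treatment",
]


def events_per_variable(n_events: int, n_covariates: int) -> float:
    """EPV = number of observed events / number of candidate covariates."""
    return n_events / n_covariates


epv = {
    endpoint: events_per_variable(n, len(candidate_covariates))
    for endpoint, n in events.items()
}

for endpoint, value in epv.items():
    print(f"{endpoint}: EPV = {value:.1f}")
# OS: EPV = 2.6
# PFS: EPV = 3.7
```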
Kaplan-Meier (KM) estimation, log-rank tests and Cox regression were performed using the R packages ‘survival’ and ‘survminer’. Penalized Cox regression techniques, including LASSO (least absolute shrinkage and selection operator), ridge regression and elastic net, were applied using the R packages ‘glmnet’ and ‘plotmo’. These approaches were evaluated based on the C-index, AIC, prediction error curves (obtained using the R packages ‘pec’ and ‘riskRegression’) and time-dependent ROC curves (obtained using the R package ‘survivalROC’). A survival tree was built using the R package ‘rpart’, and a random forest for survival data was developed using the R package ‘ranger’ after selection of suitable hyperparameters. The predictive performance of the models from each approach was evaluated in a bootstrapped sample.
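As an illustration of one of the evaluation criteria named above, here is a minimal pure-Python sketch of Harrell's C-index for right-censored data. It is not the R implementation used in the study; the function name, the handling of tied times (skipped here for simplicity) and the toy data are all assumptions made for the example.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    A pair is comparable when the subject with the shorter time had an event.
    It is concordant when that subject also has the higher risk score.
    Ties in risk score count as 0.5; pairs tied in time are skipped.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier subject was censored: ordering is unknown
        comparable += 1
        if risk_scores[first] > risk_scores[second]:
            concordant += 1.0
        elif risk_scores[first] == risk_scores[second]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical, not from the study): time in days,
# event indicator (1 = event, 0 = censored), model risk score.
times = [120, 200, 310, 400, 520]
events = [1, 1, 0, 1, 0]
risk = [2.0, 1.5, 1.6, 0.7, 0.3]

c_index = harrell_c_index(times, events, risk)
print(f"C-index = {c_index:.3f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking of subjects by risk.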
Parametric TTE nonlinear mixed-effects models were developed using Monolix 2019R1. The hazard models investigated included the exponential, Gompertz, log-logistic, uniform, Weibull and gamma distributions. They were evaluated in terms of precision of the estimates, physiological relevance, and goodness-of-fit criteria and plots. The impact of covariates on model parameters was investigated using a combination of stepwise forward addition and backward elimination. Covariates were included based on the results of a Wald test, a one-way ANOVA and a likelihood ratio test, and on their capacity to explain inter-subject variability. Correlations between model parameters were also investigated.
Results: As expected in view of the low EPV, LASSO was the most efficient way to describe both PFS and OS data [5,6] and gave the best predictive performance based on prediction error curves, time-dependent ROC curves (AUC > 0.6 at 6 time points), AIC and C-index. However, all ML approaches tested performed acceptably and identified the same significant covariates.
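The selection behaviour of LASSO comes from its L1 penalty, which can set coefficients exactly to zero. The sketch below shows the soft-thresholding operator that produces this effect. It is a simplified illustration that assumes orthonormal predictors and a single thresholding step; it is not the penalized Cox algorithm run by ‘glmnet’, and the coefficient names and values are hypothetical.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients (log hazard ratios), illustration only.
unpenalized = {"cov_a": -0.45, "cov_b": 0.08, "cov_c": -0.12, "cov_d": 0.30}
lam = 0.15  # hypothetical penalty strength

penalized = {name: soft_threshold(beta, lam) for name, beta in unpenalized.items()}
selected = [name for name, beta in penalized.items() if beta != 0.0]

print(penalized)
print("selected:", selected)
```

Coefficients whose magnitude is below the penalty are dropped, and the remaining ones are shrunk towards zero, which is why LASSO performs covariate selection and shrinkage at the same time.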
The most significant predictor for PFS was the V2578 gene. According to the LASSO model, individuals with mutant-type V2578 are almost 0.9 times as likely to have disease progression as those with wild-type V2578, i.e. having the mutant type reduces the risk of progression by about 10%.
The most significant predictors for OS were the V1154 and ICAM241 genes. According to the LASSO model, individuals with mutant-type ICAM241 are almost 0.8 times as likely to die at any time as those with wild-type ICAM241, i.e. having the mutant type reduces the risk of death by about 20%. Individuals with mutant-type V1154 are almost 0.9 times as likely to die at any time as those with wild-type V1154, i.e. having the mutant type reduces the risk of death by about 10%.
The hazard of PFS was best described by a Weibull distribution, h(t) = (p/Te) * (t/Te)^(p-1). The characteristic time (scale parameter) and the shape parameter were estimated as Te_pop = 321 days and p_pop = 2.33, respectively. V2578 was found to be a statistically significant covariate on Te with beta = 0.561, i.e. Te = Te_pop * exp(0.561 * V2578) = Te_pop * 1.75^V2578, with V2578 being 0 for wild type and 1 for mutant type. These estimates indicate that patients with the mutant-type V2578 gene have a reduced hazard of disease progression.
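The reported Weibull estimates can be turned into hazard and survival curves per genotype. The sketch below assumes that the covariate acts multiplicatively on Te through exp(beta * V2578), which is consistent with the reported factor of 1.75, and it uses the population estimates only, ignoring inter-subject variability and parameter uncertainty. The function names are illustrative.

```python
import math

TE_POP = 321.0       # days, characteristic time for wild type (reported)
P_POP = 2.33         # Weibull shape parameter (reported)
BETA_V2578 = 0.561   # covariate effect on log(Te) (reported)


def characteristic_time(v2578: int) -> float:
    """Te for wild type (v2578 = 0) or mutant type (v2578 = 1)."""
    return TE_POP * math.exp(BETA_V2578 * v2578)


def weibull_hazard(t: float, te: float, p: float) -> float:
    """h(t) = (p / Te) * (t / Te)^(p - 1)."""
    return (p / te) * (t / te) ** (p - 1)


def weibull_survival(t: float, te: float, p: float) -> float:
    """S(t) = exp(-(t / Te)^p), the survival function implied by h(t)."""
    return math.exp(-((t / te) ** p))


def weibull_median(te: float, p: float) -> float:
    """Time at which S(t) = 0.5."""
    return te * math.log(2) ** (1 / p)


te_wild = characteristic_time(0)
te_mutant = characteristic_time(1)

# With a shared shape parameter the hazard ratio does not depend on t.
hazard_ratio = weibull_hazard(365.0, te_mutant, P_POP) / weibull_hazard(365.0, te_wild, P_POP)

print(f"Te wild type:   {te_wild:.0f} days")
print(f"Te mutant type: {te_mutant:.0f} days")
print(f"Model-implied hazard ratio (mutant vs wild): {hazard_ratio:.2f}")
```

Because both genotypes share the same shape parameter, the model-implied hazard ratio equals exp(-beta * p) at every time point. It is a consequence of the point estimates under this parameterization, not a value reported in the abstract.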
The hazard of OS was best described by a uniform distribution, h(t) = 1/(Te - t), with Te_pop = 1320 days. ICAM241 and V1154 were found to be significant covariates on Te with beta_ICAM241 = 1.11 and beta_V1154 = 0.909. Thus Te = Te_pop * 3.03^ICAM241 * 2.48^V1154, with ICAM241 and V1154 being 0 for wild type and 1 for mutant type. These estimates indicate that patients with mutant-type ICAM241 and/or V1154 genes have a reduced hazard of death.
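For the uniform model, the hazard h(t) = 1/(Te - t) integrates to the survival function S(t) = 1 - t/Te for 0 <= t < Te. The sketch below evaluates Te and the survival probability for each genotype combination from the reported estimates. It assumes that both covariates act multiplicatively on Te through exp(beta), consistent with the reported factors 3.03 and 2.48, and it uses population values only. The function names and the 2-year evaluation time are illustrative choices.

```python
import math

TE_POP = 1320.0        # days, wild type for both genes (reported)
BETA_ICAM241 = 1.11    # covariate effect on log(Te) (reported)
BETA_V1154 = 0.909     # covariate effect on log(Te) (reported)


def characteristic_time(icam241: int, v1154: int) -> float:
    """Te for a genotype; each indicator is 0 (wild type) or 1 (mutant type)."""
    return TE_POP * math.exp(BETA_ICAM241 * icam241 + BETA_V1154 * v1154)


def uniform_hazard(t: float, te: float) -> float:
    """h(t) = 1 / (Te - t), defined for 0 <= t < Te."""
    if not 0 <= t < te:
        raise ValueError("t must satisfy 0 <= t < Te")
    return 1.0 / (te - t)


def uniform_survival(t: float, te: float) -> float:
    """S(t) = 1 - t / Te, truncated at 0 for t >= Te."""
    return max(0.0, 1.0 - t / te)


t_eval = 730.0  # days; illustrative evaluation time (about 2 years)
for icam241, v1154 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    te = characteristic_time(icam241, v1154)
    print(
        f"ICAM241={icam241} V1154={v1154}: Te = {te:.0f} days, "
        f"S({t_eval:.0f}) = {uniform_survival(t_eval, te):.2f}"
    )
```

A larger Te stretches the time axis, so at any fixed time the hazard is lower and the survival probability higher for mutant-type carriers under this model.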
Conclusions:
- All ML algorithms tested in this analysis agreed on which covariates were statistically significant.
- LASSO was the best-performing ML approach when the EPV is low: thanks to its ability to shrink the coefficients of non-significant covariates to zero, it gives more accurate estimates.
- The hazard of PFS followed a Weibull distribution and the hazard of OS followed a uniform distribution.
- As shown through a multitude of approaches, the V2578, ICAM241 and V1154 genes are associated with a good prognosis for patients with mCRC under bevacizumab treatment.
References:
[1] Loupakis F, Cremolini C, Yang D, Salvatore L, Zhang W, Wakatsuki T, Bohanes P, Schirripa M, Benhaim L, Lonardi S, Antoniotti C, Aprile G, Graziano F, Ruzzo A, Lucchesi S, Ronzoni M, De Vita F, Tonini G, Falcone A, Lenz HJ. Prospective validation of candidate SNPs of VEGF/VEGFR pathway in metastatic colorectal cancer patients treated with first-line FOLFIRI plus bevacizumab. PLoS One. 2013 Jul 4;8(7):e66774.
[2] Wang L, Ji S, Cheng Z. Association between Polymorphisms in Vascular Endothelial Growth Factor Gene and Response to Chemotherapies in Colorectal Cancer: A Meta-Analysis. PLoS One. 2015 May 8;10(5):e0126619.
[3] Pavlou M, Ambler G, Seaman SR, Guttmann O, Elliott P, King M, Omar RZ. How to develop a more accurate risk prediction model when there are few events. BMJ. 2015 Aug 11;351:h3868.
[4] Hosmer DW, Lemeshow S, May S. Applied Survival Analysis: Regression Modeling of Time-to-Event Data. 2nd ed. Hoboken, NJ: John Wiley & Sons, Inc.; 2008.
[5] Pavlou M, Ambler G, Seaman S, De Iorio M, Omar RZ. Review and evaluation of penalised regression methods for risk prediction in low-dimensional data with few events. Stat Med. 2016 Mar 30;35(7):1159-77.
[6] Hutmacher M, Kowalski K. Covariate selection in pharmacometric analyses: a review of methods. Br J Clin Pharmacol. 2015 Jan;79(1):132–147.
[7] Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med. 1997;16:385–395.