Do tree-based machine learning methods outperform Cox regression for progression-free survival prediction in mobocertinib-treated lung cancer patients?
Antari Khot (1), Santiago Vargas (2), Michael J. Hanley (1), Neeraj Gupta (1)
(1) Quantitative Clinical Pharmacology, Takeda Development Center Americas, Inc., Lexington, MA, USA (2) Department of Chemistry and Biochemistry, University of California, Los Angeles, CA, USA
Objectives: Predicting survival curves for overall survival (OS) and progression-free survival (PFS) from pre-trained models could help refine inclusion/exclusion criteria by identifying important population covariates, and thereby increase the probability of success of future clinical studies. Cox proportional hazards (CPH) models are commonly applied to assess the impact of covariates on survival. Machine learning (ML) methods are increasingly adopted in survival analysis to deconvolute non-linear and higher-order covariate relationships. However, no study has conclusively shown that ML methods are superior to CPH or provided a framework for when to use ML over traditional methods. Therefore, in this case study, we compared different tree-based ML methods to CPH for PFS prediction in non-small cell lung cancer (NSCLC) patients treated with mobocertinib. Mobocertinib is a tyrosine kinase inhibitor approved for treating NSCLC patients with EGFR exon 20 insertion mutations whose disease has progressed on or after platinum-based chemotherapy.
Methods: PFS data from 305 NSCLC patients treated with mobocertinib were used for this study. The master dataset included 102 features drawn from patient baseline lab values, drug-exposure data (area under the curve until the end of treatment, AUC), and tumor size dynamics (TSD) parameters. TSD parameters for the population were estimated by fitting the Stein tumor growth inhibition (TGI) model to the sum of longest diameters data. All models were systematically evaluated on different combinations of features to test whether adding features collected during the clinical trial, such as tumor growth-related and drug-exposure data, to the patient baseline feature list improved prediction. This assessment shows how close the predictions come to the Kaplan-Meier curve when using only data collected at the beginning of the study (lab values and patient/disease characteristics at the time of screening) vs. all data collected by the end of the study (AUC, TGI parameters). Four ML models (random survival forests, extra survival trees, gradient boosted trees, and xgboost-survival-embeddings [XGBSE]) were trained, tested, and validated on over two years of survival data. Test-set predictions from the ML and CPH models were compared against the Kaplan-Meier curve using the right-censored concordance index (RCC index) and the Brier score. Four datasets were created from the master dataset to test the hypothesis of this case study: dataset A (baseline lab values and baseline tumor size); dataset B (baseline lab values, baseline tumor size, and exposure data); dataset C (baseline lab values, baseline tumor size, and TGI parameters kgrowth, kkill, and TS0); and dataset D (baseline lab values, baseline tumor size, exposure data, and TGI parameters). Monolix was used for population modeling and Python was used for the CPH and ML models. Feature selection was performed with Boruta and permutation importance to remove uninformative features.
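To make the TGI parameters concrete, the following is a minimal sketch of the biexponential form commonly attributed to the Stein model, in which tumor size is the baseline size scaled by a decaying term plus a growing term. The abstract does not state the exact parameterization that was fitted in Monolix, so the functional form, the function name stein_tumor_size, and the numeric values below are assumptions for illustration only.

```python
import math


def stein_tumor_size(t, ts0, k_growth, k_kill):
    """Tumor size at time t under an assumed biexponential Stein-type model.

    TS(t) = TS0 * (exp(-k_kill * t) + exp(k_growth * t) - 1)

    ts0      : baseline tumor size (same units as the sum of longest diameters)
    k_growth : first-order growth rate constant (1/time)
    k_kill   : first-order shrinkage rate constant (1/time)
    """
    return ts0 * (math.exp(-k_kill * t) + math.exp(k_growth * t) - 1.0)


# Hypothetical parameter values, not taken from the study.
profile = [
    (day, stein_tumor_size(day, ts0=50.0, k_growth=0.001, k_kill=0.01))
    for day in (0, 30, 90, 180)
]
for day, size in profile:
    print(f"day {day:>3}: predicted tumor size {size:.1f}")
```

At t = 0 both exponentials equal 1, so the predicted size reduces to TS0; the curve then shrinks while the kill term dominates and regrows once the growth term takes over.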
Features with more than 10% missing data were removed from the dataset; missing values in the remaining features were imputed with the mean for continuous features and the mode for categorical features.
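The missing-data rule described above can be sketched in a few lines. This is an illustrative version using only the Python standard library; the function name drop_and_impute, the use of None to mark missing values, and the example feature names are assumptions, and the study's actual preprocessing code is not shown in the abstract.

```python
from statistics import mean, mode


def drop_and_impute(columns, max_missing_frac=0.10):
    """Drop features with too many missing values and impute the rest.

    columns : dict mapping feature name -> list of values (None = missing)
    Features whose missing fraction exceeds max_missing_frac are removed.
    Numeric features are filled with the mean, all others with the mode.
    """
    cleaned = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        n_missing = len(values) - len(observed)
        if not observed or n_missing / len(values) > max_missing_frac:
            continue  # more than the allowed share is missing: drop the feature
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in observed
        )
        fill = mean(observed) if is_numeric else mode(observed)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned


# Hypothetical toy data with 10 patients per feature.
example = {
    "albumin": [4.0, None, 3.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0],  # 10% missing: kept
    "sex": ["F", "F", "M", None, "F", "F", "M", "F", "F", "M"],      # 10% missing: kept
    "ldh": [200.0, None, None, 180.0, 210.0, 190.0, 205.0, 195.0, 200.0, 185.0],  # 20%: dropped
}
cleaned = drop_and_impute(example)
print(sorted(cleaned))
```

In a cross-validated workflow, the fill values would normally be computed on the training folds only and then applied to the held-out fold, to avoid leaking information from validation data.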
Results: Feature selection reduced the feature space from 102 features to 38 for dataset A and 42 for dataset D. Hyperparameters for each ML model were tuned by Bayesian optimization on high-performance computing resources. To build robust ML models, each combination of hyperparameters was evaluated by 5-fold cross-validation rather than on a single training-validation split. For dataset A, CPH performed better than the ML models, with an RCC index of 0.7 and a Brier score of 0.15. For dataset B, extra survival trees and xgboost-survival-embeddings (XGBSE) performed marginally better than CPH, with an RCC index of 0.79 and a Brier score of 0.14. For dataset C, gradient boosted trees and XGBSE performed better than CPH, with an RCC index of 0.79 and a Brier score of 0.09. For dataset D, gradient boosted trees and XGBSE were clearly superior to CPH, with an RCC index of 0.81 and a Brier score of 0.08.
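For readers unfamiliar with the RCC index values reported above, the sketch below shows one common way a concordance index is computed for right-censored data (Harrell's formulation). It assumes that a higher risk score means earlier expected progression and, as a simplification, ignores pairs with tied event times. The abstract does not say which implementation was used in the study, so this is an illustration of the metric, not the authors' code.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-type concordance index for right-censored data.

    times       : observed time for each patient (event or censoring time)
    events      : 1/True if progression was observed, 0/False if censored
    risk_scores : model output, higher = higher predicted risk

    A pair is comparable when the patient with the shorter time had an
    observed event. It is concordant when that patient also has the higher
    risk score; tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: months to progression or censoring.
times = [5, 8, 12, 20]
events = [1, 0, 1, 1]
risk = [0.9, 0.4, 0.6, 0.1]
print(concordance_index(times, events, risk))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the values of 0.7 to 0.81 above should be read. The Brier score, by contrast, measures calibration of predicted survival probabilities, and lower values are better.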
Conclusions: The performance of the ML methods improved as data on tumor dynamics and exposure collected during the trial were added to the baseline patient features, while the performance of CPH remained unchanged. CPH estimates hazard ratios, which are easy to interpret. However, if predicting the survival curve is the main objective, more sophisticated ML models appear to be superior in this case study. Ultimately, the method of choice depends not just on the objective of the study but also on the complexity and size of the data.
Cox, DR. Journal of the Royal Statistical Society, vol. 34, no. 2, pp. 187–220, 1972.
Suresh, K et al. BMC Med Res Methodol, vol. 22, 207, 2022.
Duke, ES et al. Clin Cancer Res, vol. 29, no. 3, pp. 508–512, 2023.
Pölsterl, S. Journal of Machine Learning Research, vol. 21, no. 212, pp. 1–6, 2020.