Luca Marzano (1), Adam Darwich (1), Salomon Tendler (2), Asaf Dan (2), Rolf Lewensohn (2), Luigi De Petris (2), Jayanth Raghothama (1), Sebastiaan Meijer (1)
(1) Division of Health Informatics and Logistics, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), KTH Royal Institute of Technology, Huddinge, Sweden, (2) Dept. of Oncology-Pathology, Karolinska Institutet and the lung oncology center, Karolinska University Hospital, Stockholm, Sweden
Introduction: Retrospective studies of clinical data from cancer patients are usually analysed with the Cox Proportional Hazards (CPH) model [1].
This approach has several limitations when evaluating the effects of relevant covariates, such as the lack of a criterion for the relative importance of different covariates and of information on thresholds for numerical measures. In addition, choosing a proper multiple imputation strategy for missing values in censored, mixed-type data is challenging.
This work aimed to illustrate how tree-based machine learning algorithms can provide additional clinical information on covariates and extract prognostic patterns from a cohort of ED-SCLC patients treated with platinum doublet chemotherapy at Karolinska University Hospital, Sweden (n=377) [2].
In detail, we employed random forest [3] both to impute missing values (missing forest [4]) and for survival analysis (random survival forest, RSF [5]), and two rule-learning decision tree algorithms, PART [6] and C5.0 [7, 8], to provide explainable outcomes for studying high overall survival and relapse.
Objectives:
- Perform missing forest imputation and compare the resulting CPH results with those of other strategies: no imputation and the MICE algorithm [9]
- Compute relative covariate importance in survival analysis with RSF and compare the ranking with the covariate hazard ratios obtained from CPH
- Explore covariate patterns for patients with high overall progression-free survival (>300 days, n=125) and for patients developing platinum relapse (progression-free survival at first-line treatment <180 days, n=202) by extracting IF-THEN rules with PART and C5.0, and compare the detected patterns with those of the RIPPER algorithm [10]
Methods: The patient clinical characteristics explored were: age, sex, ECOG (Eastern Cooperative Oncology Group) performance status, descriptors from the 8th version of TNM staging (T, tumour size; N, lymph node metastases; M, spread of metastases; ST, stage of the disease), prophylactic cranial irradiation (PCI), PET or CT of the thorax and brain CT, and the following haematology and blood chemistry values measured before the start of chemotherapy: haemoglobin (Hb), C-reactive protein (CRP), lactate dehydrogenase (LDH), sodium (Na) and albumin.
The concordance index error and the covariate hazard ratios were computed by bootstrapping to assess the impact of imputation on the CPH results.
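As an illustration of the bootstrap step, a minimal Python sketch follows. It assumes that a risk score per patient is already available (for instance the linear predictor of a fitted CPH model) and resamples patients with replacement to estimate the spread of Harrell's concordance index. The function names, the synthetic toy data and the simplification of resampling fixed risk scores instead of refitting the model on every resample are assumptions made for illustration, not the procedure used in the study.

```python
import random


def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs that the risk score orders correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def bootstrap_concordance(times, events, risks, n_boot=200, seed=0):
    """Mean and standard deviation of C over bootstrap resamples of the patients."""
    rng = random.Random(seed)
    n = len(times)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        c = concordance_index(
            [times[k] for k in idx],
            [events[k] for k in idx],
            [risks[k] for k in idx],
        )
        if c == c:  # skip resamples without any comparable pair (NaN)
            scores.append(c)
    mean_c = sum(scores) / len(scores)
    sd_c = (sum((s - mean_c) ** 2 for s in scores) / (len(scores) - 1)) ** 0.5
    return mean_c, sd_c


# Synthetic toy data (not from the cohort): follow-up in days, event flag, risk score.
toy_times = [5, 8, 12, 20, 30, 45]
toy_events = [1, 1, 0, 1, 1, 0]
toy_risks = [2.0, 1.0, 1.2, 1.5, 0.4, 0.1]

mean_c, sd_c = bootstrap_concordance(toy_times, toy_events, toy_risks)
print(f"bootstrap concordance: {mean_c:.3f} +/- {sd_c:.3f}")
```

Running the same routine on risk scores obtained after each imputation strategy would give comparable concordance estimates with an error term.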
The robustness of RSF was assessed by measuring the out-of-bag concordance over 100 iterations. For each iteration, the relative importance of the covariates was computed. This ranking was compared with the covariates having statistically significant (p<0.05) CPH hazard ratios to see whether these were the most impactful.
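The abstract does not state how covariate importance was defined. One common choice for forest models is permutation importance: the drop in concordance observed when the values of a single covariate are shuffled across patients. The Python sketch below shows that idea under stated assumptions. The `predict_risk` callable stands in for any fitted survival model, and the covariate names `marker` and `noise` and all values are synthetic and hypothetical.

```python
import random


def concordance_index(times, events, risks):
    """Harrell's C over pairs where the earlier time is an observed event."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def permutation_importance(rows, times, events, predict_risk, n_repeats=20, seed=0):
    """Mean drop in concordance after shuffling one covariate at a time."""
    rng = random.Random(seed)
    baseline = concordance_index(times, events, [predict_risk(r) for r in rows])
    importance = {}
    for name in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            # Copy each row with only the chosen covariate replaced.
            permuted = [dict(r, **{name: v}) for r, v in zip(rows, shuffled)]
            score = concordance_index(
                times, events, [predict_risk(r) for r in permuted]
            )
            drops.append(baseline - score)
        importance[name] = sum(drops) / n_repeats
    return importance


# Synthetic example: risk depends on "marker" only, "noise" is ignored by the model.
toy_rows = [
    {"marker": 6.0, "noise": 0.3},
    {"marker": 5.0, "noise": 0.9},
    {"marker": 4.0, "noise": 0.1},
    {"marker": 3.0, "noise": 0.7},
    {"marker": 2.0, "noise": 0.5},
    {"marker": 1.0, "noise": 0.2},
]
toy_times = [10, 20, 30, 40, 50, 60]
toy_events = [1, 1, 1, 1, 1, 1]


def toy_predict_risk(row):
    # Hypothetical fitted model: higher marker value means higher risk.
    return row["marker"]


ranking = permutation_importance(toy_rows, toy_times, toy_events, toy_predict_risk)
for name, value in sorted(ranking.items(), key=lambda kv: -kv[1]):
    print(name, round(value, 3))
```

Repeating such a computation over many fitted forests, as in the 100 iterations described above, gives a distribution of importance values per covariate rather than a single ranking.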
The generalisation of IF-THEN rule extraction was studied with 10-fold cross-validation repeated 100 times. Accuracy and area under the ROC curve were evaluated. IF-THEN rules were then extracted from the entire cohort.
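The repeated cross-validation scheme can be sketched in Python as follows. This is an illustration under assumptions, not the study code: the single-threshold rule is a hypothetical stand-in for a rule learner such as PART or C5.0, the folds are not stratified, and the data are synthetic.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def repeated_kfold(n_samples, k=10, repeats=100, seed=0):
    """Yield (train_idx, test_idx) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition per repetition
        for f in range(k):
            test = order[f::k]
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test


def fit_threshold(xs, ys):
    """Hypothetical one-rule learner: IF x > t THEN class 1, t between class means."""
    ones = [x for x, y in zip(xs, ys) if y == 1]
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2


def evaluate(xs, ys, k=10, repeats=100, seed=0):
    """Mean accuracy and mean AUC over all folds of all repetitions."""
    accs, aucs = [], []
    for train, test in repeated_kfold(len(xs), k, repeats, seed):
        t = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        y_true = [ys[i] for i in test]
        scores = [xs[i] for i in test]
        hits = sum(1 for s, y in zip(scores, y_true) if (s > t) == (y == 1))
        accs.append(hits / len(test))
        # AUC is undefined when a test fold holds a single class, so skip it.
        if 0 < sum(y_true) < len(y_true):
            aucs.append(roc_auc(y_true, scores))
    return sum(accs) / len(accs), sum(aucs) / len(aucs)


# Synthetic, well separated data: class 1 whenever the value is at least 20.
toy_xs = [float(i) for i in range(40)]
toy_ys = [1 if x >= 20 else 0 for x in toy_xs]

accuracy, auc = evaluate(toy_xs, toy_ys)
print(f"accuracy={accuracy:.3f} auc={auc:.3f}")
```

Averaging over many random partitions reduces the dependence of the reported accuracy and AUC on any single split of the cohort.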
Results: Cox regression on data imputed with missing forest yielded a higher concordance (0.743 ± 0.02) than no imputation (0.724 ± 0.02) and MICE imputation (0.729 ± 0.02). Significant hazard ratios were obtained for M = M1A and M1B, stage IVB, ECOG = 1, 2 and 3, PCI, Hb, LDH and albumin. The LDH and albumin hazard ratios were statistically significant only with missing forest pre-processing.
RSF concordance was 0.761 ± 0.002. The most important covariates were PCI, LDH and ECOG. The rest of the importance ranking indicated that covariates statistically significant according to CPH were not more relevant than the others.
In terms of depth and pattern detection, PART performed better than RIPPER and C5.0. Two scenarios were explored for long survival: one including and one excluding PCI. The survival analysis yielded 43 rules excluding PCI and 28 including it. For relapse, 48 rules were extracted.
The high interpretability of IF-THEN rules made it possible to explore the collective effects of covariates. In addition, thresholds for numerical covariates were discovered.
LDH and Hb were relevant conditions in all scenarios. Furthermore, Na and CRP were discriminators for relapse. In the last scenario, all patients receiving PCI (n=56) had high overall survival. Albumin and CRP, together with other covariates, discriminated the rest of the cohort not receiving PCI.
Conclusions: The approach used in this study showed that tree-based machine learning algorithms applied to different tasks can add insights into covariate effects on survival in patients with ED-SCLC. Such an approach could detect and explain outcomes beyond what the conventional Cox Proportional Hazards model provides.
References:
[1] Christensen E. Multivariate survival analysis using Cox’s regression model. Hepatology. 1987;7(6):1346-58.
[2] Tendler S, Zhan Y, Pettersson A, Lewensohn R, Viktorsson K, Fang F, et al. Treatment patterns and survival outcomes for small-cell lung cancer patients–a Swedish single center cohort study. Acta Oncologica. 2020;59(4):388-94.
[3] Breiman L. Random forests. Machine learning. 2001;45(1):5-32.
[4] Stekhoven DJ, Bühlmann P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28(1):112-8.
[5] Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Annals of Applied Statistics. 2008;2(3):841-60.
[6] Frank E, Witten IH. Generating accurate rule sets without global optimization. Machine Learning: Proceedings of the Fifteenth International Conference. 1998:144-51.
[7] Quinlan JR. Learning with continuous classes. Proceedings AI’92. 1992:343-8.
[8] Quinlan JR. C4.5: Programs for Machine Learning. Elsevier; 2014.
[9] Azur MJ, Stuart EA, Frangakis C, Leaf PJ. Multiple imputation by chained equations: what is it and how does it work? International journal of methods in psychiatric research. 2011;20(1):40-9.
[10] Cohen WW. Fast effective rule induction. Machine Learning Proceedings 1995. Elsevier; 1995. p. 115-23.
Reference: PAGE 29 (2021) Abstr 9742 [www.page-meeting.org/?abstract=9742]
Poster: Drug/Disease Modelling - Oncology