Introducing a novel analytical framework for risk stratification of real-world data with survival and unsupervised machine learning. A small cell lung cancer (SCLC) study
Luca Marzano (1), Adam S. Darwich (1), Salomon Tendler (2), Asaf Dan (2), Rolf Lewensohn (2), Luigi De Petris (2), Jayanth Raghothama (1), Sebastiaan Meijer (1)
(1) Division of Health Informatics and Logistics, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), KTH Royal Institute of Technology, Huddinge, Sweden; (2) Dept. of Oncology-Pathology, Karolinska Institutet and the Thoracic Oncology Center, Karolinska University Hospital, Stockholm, Sweden.
Objectives: Stratifying patients to predict prognosis accurately and improve treatment selection is a crucial challenge in oncology [1]. The increased availability and use of real-world data (RWD) provide an opportunity to aid stratification based on prognostic indicators, to fill the knowledge gap between randomized controlled trials (RCTs) and clinical practice, and to shape future studies [2, 3].
However, RWD poses a series of practical challenges, which include data quality, sample size, defining the role of variables in clinical processes, addressing missing values and potential biases, and interpreting results [2, 4, 5].
Small cell lung cancer (SCLC) is an interesting case study for performing risk stratification from clinical RWD. Prediction of clinical outcomes is challenging due to the rapid progression of the disease with multiple distant metastases and the lack of insight into chemo-resistance mechanisms [6, 7].
Furthermore, treatment selection based on clinical tumor stage, assessed either by the Veterans' Administration Lung Study Group (VALSG) 2-class limited-extensive disease system or by the more detailed 8th TNM classification, is still broad and unable to sufficiently differentiate prognostic subgroups [8].
The aim of this study was to explore the potential of RWD, identify and address the practical challenges, and improve the detection of SCLC prognostic subgroups and analyse their patterns using a development pipeline with clinical experts and machine learning methods.
Methods: An analysis was carried out for 636 non-surgical patients with TNM cancer stage IIIA-IVB, re-classified retrospectively from previous VALSG staging and treated at the Karolinska University Hospital, Stockholm, between 2008 and 2016 [9].
The variables included in the study were: TNM staging descriptors, performance status, smoking status, age, sex, PET-CT, brain CT, and haematology and blood chemistry values for C-reactive protein, lactate dehydrogenase, sodium, albumin, and hemoglobin.
The conceptual treatment decision process (chemotherapy cycles, concomitant thoracic irradiation, prophylactic cranial irradiation, no treatment after diagnosis) was reconstructed.
A novel approach was proposed to address these challenges and extract real-world evidence (RWE). The method merged standard survival analysis (Cox regression [10]) with survival (random survival forest [11]), explainable (feature importance [12]), and unsupervised (partition around medoids [13]) machine learning.
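The unsupervised step named above, partition around medoids (PAM), can be illustrated with a minimal sketch. It is not the study's implementation: the abstract does not state the software, distance measure, or preprocessing used, so the Euclidean distance, the synthetic two-dimensional points, and the function name `pam` are all assumptions made for illustration. The swap phase accepts the first improving swap rather than searching for the best one, which is a simplification of the classical algorithm.

```python
import numpy as np


def pam(dist, k, max_iter=100):
    """Partition around medoids on a precomputed (n x n) distance matrix.

    Illustrative sketch: greedy BUILD initialisation followed by a
    first-improvement SWAP phase. Returns (medoid indices, labels, cost).
    """
    n = dist.shape[0]

    # BUILD: start from the point with the smallest total distance to all
    # others, then greedily add the candidate that reduces the cost most.
    medoids = [int(np.argmin(dist.sum(axis=1)))]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        # gains[c] = total cost reduction if candidate c became a medoid.
        gains = np.maximum(nearest[:, None] - dist, 0.0).sum(axis=0)
        gains[medoids] = -np.inf
        medoids.append(int(np.argmax(gains)))

    def cost(meds):
        # Sum of distances from every point to its closest medoid.
        return dist[:, meds].min(axis=1).sum()

    # SWAP: try replacing each medoid with each non-medoid and keep any
    # swap that lowers the total cost, until no swap helps.
    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = h
                c = cost(trial)
                if c < best - 1e-12:
                    medoids, best, improved = trial, c, True
        if not improved:
            break

    labels = np.argmin(dist[:, medoids], axis=1)
    return np.array(medoids), labels, float(best)


# Synthetic demo data (NOT patient data): three well-separated groups.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + 0.5 * rng.standard_normal((20, 2)) for c in centres])

# Euclidean distance is an assumption; mixed clinical variables would
# typically need a dissimilarity suited to categorical and numeric data.
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

medoids, labels, total_cost = pam(dist, k=3)
print("medoid indices:", medoids)
print("cluster sizes:", np.bincount(labels))
```

Because each cluster is represented by an actual observation (its medoid), the resulting groups can be summarised by a real patient profile, which is one reason medoid-based clustering is often preferred for clinical data.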
Results: The synergy of survival analysis and unsupervised machine learning led to a comprehensive separation of SCLC prognostic groups.
The analysis yielded k=7 compact and well-separated clusters of patients with statistically significant differences in overall survival (p < 2.2e-16). Interestingly, competition between treatment decisions within the clusters was observed.
Performance status, lactate dehydrogenase, metastatic spread, cancer stage and C-reactive protein were the baseline variables that characterised the subgroups. Performance status and lactate dehydrogenase were globally the most significant covariates for prognosis.
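A survival difference between clusters of this kind is commonly assessed with a log-rank test; the abstract does not name the test used, so the following is only a hedged sketch of the idea. It compares two groups rather than seven (a k-sample version would be needed for the clusters reported here), uses synthetic exponential survival times rather than study data, and the function name `logrank_two_groups` is hypothetical.

```python
import math

import numpy as np


def logrank_two_groups(time, event, group):
    """Two-sample log-rank test; returns (chi-square statistic, p-value).

    time  : follow-up time per subject
    event : True if death was observed, False if censored
    group : True for group 1, False for group 0
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)

    obs_minus_exp, var = 0.0, 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()                 # subjects at risk just before t
        n1 = (at_risk & group).sum()      # of which in group 1
        deaths = (time == t) & event
        d = deaths.sum()                  # deaths at t (both groups)
        d1 = (deaths & group).sum()       # deaths at t in group 1
        obs_minus_exp += d1 - d * n1 / n  # observed minus expected
        if n > 1:
            # Hypergeometric variance of d1 given the margins.
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)

    stat = obs_minus_exp ** 2 / var
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return float(stat), p_value


# Synthetic demo (NOT study data): mean survival 6 vs 24 months,
# administrative censoring at 36 months.
rng = np.random.default_rng(0)
n = 100
true_time = np.concatenate([rng.exponential(6.0, n), rng.exponential(24.0, n)])
group = np.repeat([False, True], n)
follow_up = 36.0
event = true_time <= follow_up
time = np.minimum(true_time, follow_up)

stat, p_value = logrank_two_groups(time, event, group)
print(f"chi-square = {stat:.1f}, p = {p_value:.3g}")
```

The statistic is symmetric in the group labels, so swapping the two groups gives the same result; this is a quick sanity check when applying such a test to cluster assignments.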
Conclusions: The approach showed the potential of future applications of RWD to understand disease, develop individualised therapies, and improve healthcare decision-making.
Proper choice of methods and involvement of physicians can be crucial for advancing methodologies aimed at extracting real-world evidence from RWD.
This will ultimately benefit the understanding of the course of the disease as well as improve the decision making of physicians treating SCLC and other malignancies.
[1] Azuaje F. Artificial intelligence for precision oncology: beyond patient stratification. npj Precision Oncology. 2019;3(1):1-5. doi:10.1038/s41698-019-0078-1
[2] Schurman B. The Framework for FDA’s Real-World Evidence Program. Applied Clinical Trials. 2019;28(4):15-17.
[3] Tan K, Bryan J, Segal B, et al. Emulating Control Arms for Cancer Clinical Trials Using External Cohorts Created From Electronic Health Record-Derived Real-World Data. Clinical Pharmacology & Therapeutics. 2022;111(1):168-178. doi:10.1002/CPT.2351
[4] Miksad RA, Abernethy AP. Harnessing the Power of Real-World Evidence (RWE): A Checklist to Ensure Regulatory-Grade Data Quality. Clinical Pharmacology & Therapeutics. 2018;103(2):202-205. doi:10.1002/CPT.946
[5] Rivera DR, Henk HJ, Garrett-Mayer E, et al. The Friends of Cancer Research Real-World Data Collaboration Pilot 2.0: Methodological Recommendations from Oncology Case Studies. Clinical Pharmacology & Therapeutics. 2022;111(1):283-292. doi:10.1002/cpt.2453
[6] Schiller JH. Current standards of care in small-cell and non-small-cell lung cancer. Oncology. 2001;61(SUPPL. 1):3-13. doi:10.1159/000055386
[7] Johal S, Hettle R, Carroll J, Maguire P, Wynne T. Real-world treatment patterns and outcomes in small-cell lung cancer: a systematic literature review. J Thorac Dis. 2021;13(6):3692-3707. doi:10.21037/jtd-20-3034
[8] Tendler S, Grozman V, Lewensohn R, Tsakonas G, Viktorsson K, de Petris L. Validation of the 8th TNM classification for small-cell lung cancer in a retrospective material from Sweden. Lung Cancer. 2018;120:75-81. doi:10.1016/j.lungcan.2018.03.026
[9] Tendler S, Zhan Y, Pettersson A, et al. Treatment patterns and survival outcomes for small-cell lung cancer patients–a Swedish single center cohort study. Acta Oncologica. 2020;59(4):388-394. doi:10.1080/0284186X.2019.1711165
[10] Breslow NE. Analysis of Survival Data under the Proportional Hazards Model. International Statistical Review / Revue Internationale de Statistique. 1975;43(1):45. doi:10.2307/1402659
[11] Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Annals of Applied Statistics. 2008;2(3):841-860.
[12] Fisher A, Rudin C, Dominici F. All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously. Journal of Machine Learning Research. 2019;20:1-81.
[13] Kaufman L, Rousseeuw PJ. Partitioning around medoids (program pam). Finding groups in data: an introduction to cluster analysis. 1990;344:68-125.