Machine learning versus mechanistic modeling for prediction of metastatic relapse in breast cancer
C. Nicolò (1,2), C. Périer (1,2), M. Prague (3,4), G. MacGrogan (5), O. Saut (1,2), S. Benzekry (1,2)
(1) MONC team, Inria Bordeaux Sud-Ouest, France; (2) Institut de Mathématiques de Bordeaux, France; (3) SISTM team, Inria Bordeaux Sud-Ouest, France; (4) Inserm U1219, Bordeaux Public Health, Bordeaux, France; (5) Pathology department, Bergonié Cancer Research Center, Bordeaux, France
Introduction: 
Predicting the probability of metastatic relapse for patients diagnosed with early-stage breast cancer is critical for deciding on adjuvant therapy [1]. Current predictive models usually rely on Cox proportional hazards regression [2]. Using the breast cancer database from the Bordeaux Bergonié Institute (n=1057 patients), we investigated the potential use of machine learning (ML) algorithms for predicting 5-year metastatic relapse (MR) or metastasis-free survival. Both Cox regression and ML algorithms are purely statistical methods and do not integrate any biological knowledge. To address this limitation and provide personalized, data-informed simulations of the natural history of the disease, we developed a mechanistic model of the time to relapse based on the biology of metastatic spread.
Objectives:
- Investigate the applicability of machine learning algorithms for prediction of MR
- Develop a mechanistic model of time to MR
- Compare both approaches to classical survival models
Methods:
Classification algorithms for predicting the probability of MR at 5 years included logistic regression, support vector classification, k-nearest neighbors, naïve Bayes, random forest, gradient boosting and multi-layer perceptron. They were trained using the Python package scikit-learn [3]. Because the small probability of MR (<10% at 5 years) could impair the results of classification algorithms, we restricted ourselves to a balanced data set with 50% relapses for this task. To deal with time-to-event data and censoring (not handled by classical ML regression algorithms), survival random forests were also investigated [4]. The mechanistic model of time to MR was built on a description of metastasis within a size-structured population dynamics framework (transport partial differential equation) [5]. This model was previously validated against longitudinal experimental data of spontaneous metastatic development after surgery in a clinically relevant animal model of breast cancer [6]. A nonlinear mixed-effects model was added to the structural model to describe inter-individual variability in the two parameters (growth and dissemination) and to assess the impact of covariates, which is pivotal for developing the model as a personalized predictive tool. Population parameter estimation was performed using the R package saemix [7]. To avoid evaluating predictions on the data used for training, 10-fold cross-validation was used to assess the predictive power of the various models.
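As an illustration of the balancing and cross-validation steps described above, the following minimal Python sketch builds a 50/50 subset by undersampling the non-relapse class and splits it into 10 stratified folds. It is not the authors' code: the function names, the undersampling strategy, the random seed and the synthetic labels are all assumptions made for illustration, and only the standard library is used. In practice, each training fold would be passed to a scikit-learn classifier and the held-out fold used for evaluation.

```python
import random


def balance_by_undersampling(labels, seed=0):
    """Return sorted indices of a 50/50 subset (assumed strategy: keep every
    case of the rarer class and an equal-size random sample of the other)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))
    return sorted(minority + kept)


def stratified_folds(indices, labels, n_folds=10, seed=0):
    """Distribute indices over n_folds test folds, class by class, so that
    each fold keeps roughly the same relapse / non-relapse ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in (0, 1):
        members = [i for i in indices if labels[i] == cls]
        rng.shuffle(members)
        for k, i in enumerate(members):
            folds[k % n_folds].append(i)
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Synthetic labels only (hypothetical cohort): 1000 patients, 8% relapses.
labels = [1] * 80 + [0] * 920
balanced = balance_by_undersampling(labels)
folds = stratified_folds(balanced, labels)
splits = list(train_test_splits(folds))
```

Because no patient appears in both the training and test indices of a split, every prediction is made on data the model has not seen during fitting.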
Results:
For the classification task (prediction of 5-year MR probability), the best performance was achieved by the random forest algorithm, with an accuracy on test sets of 60%, an area under the ROC curve of 0.7, and positive and negative predictive values of 60% each. A calibration plot also indicated good predictive power. The random survival forest algorithm performed similarly, with a concordance index [8] of 0.68, which was also the score obtained by a Cox proportional hazards regression model. The mechanistic model was able to provide accurate fits of the survival data with random effects on two key parameters, dissemination and growth. Critically, these parameters allowed biological covariates to be integrated in a physiologically meaningful way. The primary tumor size at diagnosis, for instance, is a direct variable of the model. In addition, the significance of covariates (assessed by means of Wald tests) suggested that other covariates were biomarkers of either growth (such as the level of the proliferation marker Ki67) or dissemination (such as the vimentin level). At the time of writing of this abstract, we can only report the concordance index on the calibration set (0.66) because of the large computational cost (8 hours to fit the population parameters on the entire data set on a 24-CPU server).
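For readers unfamiliar with the concordance index [8] reported above, the following minimal Python sketch computes a simplified version of it on right-censored data. It is an illustration rather than the implementation used in this work: the function name, the handling of ties (tied risk scores count one half, tied times are ignored) and the toy data are assumptions. A pair of patients is comparable when the one with the shorter follow-up time had an observed relapse; it is concordant when that patient also has the higher predicted risk.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C: fraction of comparable pairs in which the
    patient who relapsed earlier has the higher predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event of a pair
        for j in range(n):
            if times[i] < times[j]:  # patient i relapsed before j's last follow-up
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count one half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): follow-up in years, 1 = relapse observed, 0 = censored.
times = [2, 4, 5, 7, 9]
events = [1, 0, 1, 1, 0]
risk = [0.9, 0.3, 0.6, 0.7, 0.1]
c_index = concordance_index(times, events, risk)  # 6 of 7 comparable pairs are concordant
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported scores of 0.66 to 0.68 in context.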
Conclusion:
These findings are a first step towards the development of a mechanistic model for prediction of metastasis. Such a model could yield a personalized prediction tool to support the routine management of breast cancer patients. Not only would it provide estimates of the metastasis-free survival probability, but it would also generate informative estimates of the invisible metastatic burden at the time of diagnosis and forward simulations of future dissemination and growth. To achieve concrete clinical transfer, the model should be further refined and validated on external data sets.
References: 
[1] Cardoso, F., van't Veer, L. J., Bogaerts, J., Slaets, L., Viale, G., Delaloge, S., et al. (2016). 70-Gene Signature as an Aid to Treatment Decisions in Early-Stage Breast Cancer. N Engl J Med, 375(8), 717–729.
[2] Ravdin, P. M., Siminoff, L. A., Davis, G. J., Mercer, M. B., Hewlett, J., Gerson, N., & Parker, H. L. (2001). Computer program to assist in making decisions about adjuvant therapy for women with early breast cancer. J Clin Oncol, 19(4), 980–991.
[3] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine Learning in Python. JMLR, 12, 2825–2830.
[4] Ishwaran, H., & Kogalur, U. B. (2019). randomForestSRC (R package). https://cran.r-project.org/web/packages/randomForestSRC/index.html
[5] Iwata, K., Kawasaki, K., & Shigesada, N. (2000). A Dynamical Model for the Growth and Size Distribution of Multiple Metastatic Tumors. J Theor Biol, 203(2), 177–186. 
[6] Benzekry, S., Tracz, A., Mastri, M., Corbelli, R., Barbolosi, D., & Ebos, J. M. L. (2016). Modeling Spontaneous Metastasis following Surgery: An In Vivo-In Silico Approach. Cancer Res, 76(3), 535–547. 
[7] Comets, E., Lavenu, A., & Lavielle, M. (2017). Parameter Estimation in Nonlinear Mixed Effect Models Using saemix, an R Implementation of the SAEM Algorithm. J Stat Soft, 80(3).
[8] Harrell, F. E., Lee, K. L., & Mark, D. B. (1996). Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med, 15(4), 361–387.
