I-022

A benchmark of machine learning feature selection methods for stable identification of biomarkers of resistance to immunotherapy in lung cancer

Anastasiia Bakhmach1,4,6, Florence Monville2, Amélie Pouchin3,5,6, Marie Roumieux4, Maryannick Le Ray5, Frédéric Vély5,6, Jean Philippe Dales5,6, Florence Sabatier5,6, Mohamed Boussena1,4,6, Mélanie Karlsen1,4,6, Andrea Vaglio1,4,6, Vanina Leca2, Richard Malkoun5, Jacques Fieschi2, Éric Vivier7, Joseph Ciccolini1,4,6, Laurent Greillier1,4,5,6, Fabrice Barlesi8, Sébastien Benzekry1,4,6

1COMPO (COMPutational pharmacology and clinical Oncology), Centre Inria d'Université Côte d'Azur, 2Veracyte SAS, 3Centre d'Immunologie de Marseille Luminy, 4Centre de Recherche en Cancérologie de Marseille (CRCM), Inserm U1068, CNRS UMR7258, Institut Paoli-Calmettes, 5Assistance Publique-Hôpitaux de Marseille (APHM), 6Aix Marseille Université, 7Innate Pharma, 8Gustave Roussy

Introduction: Feature selection (FS) is essential for biomarker discovery and the analysis of biomedical datasets, as the underlying true model is typically expected to be sparse. However, challenges such as a high-dimensional feature space (large number of variables p), limited sample size (small number of patients n), multicollinearity, and missing values make FS non-trivial in biomedical data. An important concern is the stability of FS methods, i.e., the similarity between the feature sets obtained under data perturbations [1, 2]. Despite being a long-recognized problem, stability is rarely evaluated in current biomedical studies.

Objectives: 1) To conduct a comprehensive benchmark of machine learning (ML) FS methods, evaluating both stability and predictive performance on multimodal data integrating clinical variables and complex biomarkers. 2) To identify an optimal FS method that balances stability and performance, tailored to the prediction of primary resistance to anti-PD-(L)1 therapy in non-small cell lung cancer (NSCLC).

Methods: We used data from 435 NSCLC patients from the PIONeeR biomarkers clinical study (NCT03493581). The dataset consisted of 374 variables, comprising clinical and demographic data (n = 10), routine blood tests (n = 49), tumor multiplex immunohistochemistry markers (n = 159), and circulating immune and vasculophenotyping markers, the latter including both flow cytometry (n = 141) and soluble (n = 15) markers. The outcome to predict was primary resistance to immunotherapy, defined as disease progression within 6 months of treatment. The ML pipelines included imputation of missing values (30.8% of the total data), feature selection, model training, and internal validation through .632+ optimism correction [3], implemented in an open-source Python package called compOC (https://gitlab.inria.fr/compo/compoc). All pipeline steps were applied to 100 subsamples drawn with or without replacement.
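To make the stability notion above concrete, here is a minimal sketch of the subset-similarity stability measure of the kind proposed by Nogueira et al. [2], assuming FS has already been run on several subsamples and each run is encoded as a binary selection vector. The function name and toy data are illustrative only and are not part of the compOC package:

```python
def nogueira_stability(selections):
    """Stability of feature selection in the sense of Nogueira et al. (2018).

    selections: list of equal-length 0/1 lists; selections[i][f] == 1 if
    feature f was selected on subsample i. Returns a value <= 1, where 1
    means the exact same subset was selected on every subsample.
    """
    M = len(selections)      # number of subsamples
    p = len(selections[0])   # total number of candidate features
    # selection frequency of each feature across the M subsamples
    freqs = [sum(run[f] for run in selections) / M for f in range(p)]
    # unbiased sample variance of each feature's selection indicator
    variances = [M / (M - 1) * q * (1 - q) for q in freqs]
    # average selected-subset size across subsamples
    k_bar = sum(sum(run) for run in selections) / M
    # normalize by the variance expected for random subsets of size k_bar
    denom = (k_bar / p) * (1 - k_bar / p)
    return 1 - (sum(variances) / p) / denom

# identical subsets on every subsample -> perfectly stable
print(nogueira_stability([[1, 0, 1, 0]] * 3))   # 1.0
# disjoint subsets -> highly unstable (the measure can go negative)
print(nogueira_stability([[1, 1, 0, 0], [0, 0, 1, 1]]))
```

This normalization is what lets stability values be compared across methods that select subsets of different sizes.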
We benchmarked 31 FS methods: 16 filters, 10 embedded methods, and 4 wrappers, covering the three main classes of FS, as well as one ensemble approach [4-15]. The predictive performance of the selected feature sets was evaluated using 11 downstream classifiers implemented in scikit-learn [16]. Nogueira's measure [2] was used to quantify FS stability. To address multicollinearity in the data, we investigated the relevance of a preliminary filtering of the variables using variance inflation factor (VIF) thresholding [17], leading to a reduced feature set (p = 214). The final signature was selected by evaluating stability, predictive metrics (AUC, positive predictive value (PPV), and accuracy), and feature set size.

Results: LASSO (Least Absolute Shrinkage and Selection Operator), a standard FS approach in biomedical studies, exhibited moderate stability (S = 0.56). Of the 22 features selected on the full data, only 10 were robustly selected (frequency ≥ 70%) across subsamples, and only one in all subsamples. The best results for both stability (S = 0.69) and performance were obtained with a filter method relying on adjusted p-values (cut-off = 0.01) from a multivariable Cox model for progression-free survival. With this approach, we identified a signature of 19 features that, paired with a gradient boosting classifier, achieved performance equivalent to the full model containing all 374 features. Overall, FS methods showed high variability in both stability (ranging from 0.13 to 0.92) and predictive performance (AUC ranging from 0.54 to 0.74). Several highly stable methods, such as MIM (mutual information maximization, S = 0.92) and JMI (joint mutual information, S = 0.87), both filters, exhibited poor AUC: although they consistently selected the same features, these had limited utility for prediction. Nevertheless, statistical filter approaches offered the advantage of being applicable without imputing missing values.
They also demonstrated better stability and accuracy than more complex ML methods. Applying VIF decorrelation before FS improved stability (mean increase of 0.11 across all methods) without compromising performance. The ensemble FS approach did not improve stability compared to the individual methods.

Conclusion: Our results highlight the importance of evaluating both stability and performance when choosing an optimal FS method. In high-dimensional datasets, feature selection can be unstable even when the number of features is approximately equal to the number of patients (p ~ n). Evaluating FS stability and comparing multiple FS methods is important to improve the reliability of biomarker discovery related to drug response and resistance. Future perspectives include the study of FS stability for covariate selection in mixed-effects modeling and in higher-dimensional settings (p >> n) typical of, e.g., omics data.
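For readers who wish to reproduce the VIF-based decorrelation step, a minimal sketch of iterative VIF thresholding [17] is given below. Features are dropped one at a time, worst first, until every remaining VIF falls below a cut-off. The function name and the cut-off of 10 are illustrative assumptions, not necessarily the values used in the study:

```python
import numpy as np

def vif_filter(X, names, threshold=10.0):
    """Iteratively drop the feature with the largest variance inflation
    factor (VIF) until all remaining VIFs fall below `threshold`.

    X: (n_samples, n_features) array; names: feature labels.
    Returns the labels of the retained, decorrelated features.
    """
    X = np.asarray(X, dtype=float)
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        vifs = []
        for j in keep:
            others = [k for k in keep if k != j]
            # regress feature j on the remaining features (with intercept)
            A = np.column_stack([X[:, others], np.ones(len(X))])
            y = X[:, j]
            resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
            ss_tot = np.sum((y - y.mean()) ** 2)
            r2 = 1 - np.sum(resid ** 2) / ss_tot
            vifs.append(1.0 / max(1 - r2, 1e-12))  # VIF_j = 1 / (1 - R_j^2)
        worst = max(range(len(keep)), key=lambda i: vifs[i])
        if vifs[worst] < threshold:
            break
        keep.pop(worst)  # drop the most collinear feature and repeat
    return [names[j] for j in keep]
```

As a quick sanity check, a feature built as the sum of two others is detected and one member of the collinear triple is removed, whereas independent features are all retained.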

1. Kalousis, A., Prados, J., & Hilario, M. (2007). Stability of feature selection algorithms: A study on high-dimensional spaces. Knowledge and Information Systems, 12(1), 95–116. https://doi.org/10.1007/s10115-006-0040-8
2. Nogueira, S., Sechidis, K., & Brown, G. (2018). On the Stability of Feature Selection Algorithms. Journal of Machine Learning Research, 18(174), 1–54.
3. Efron, B., & Tibshirani, R. (1997). Improvements on Cross-Validation: The .632+ Bootstrap Method. Journal of the American Statistical Association, 92(438), 548–560. https://doi.org/10.2307/2965703
4. Hédou, J., Maric, I., Bellan, G., Einhaus, J., Gaudillière, D. K., Ladant, F.-X., Verdonk, F., Stelzer, I. A., Feyaerts, D., Tsai, A. S., Ganio, E. A., Sabayev, M., Gillard, J., Amar, J., Cambriel, A., Oskotsky, T. T., Roldan, A., Golob, J. L., Sirota, M., … Gaudillière, B. (2024). Discovery of sparse, reliable omic biomarkers with Stabl. Nature Biotechnology. https://doi.org/10.1038/s41587-023-02033-x
5. Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735
6. Meinshausen, N. (2007). Relaxed Lasso. Computational Statistics & Data Analysis, 52(1), 374–393. https://doi.org/10.1016/j.csda.2006.12.019
7. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1), 267–288.
8. Bach, F. R. (2008). Bolasso: Model consistent Lasso estimation through the bootstrap. Proceedings of the 25th International Conference on Machine Learning, 33–40. https://doi.org/10.1145/1390156.1390161
9. Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2013). A Sparse-Group Lasso. Journal of Computational and Graphical Statistics, 22(2), 231–245. https://doi.org/10.1080/10618600.2012.681250
10. Ding, C., & Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(2), 185–205. https://doi.org/10.1142/S0219720005001004
11. Guyon, I., Weston, J., Barnhill, S., et al. (2002). Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 46, 389–422. https://doi.org/10.1023/A:1012487302797
12. Jiang, H., Deng, Y., Chen, H. S., et al. (2004). Joint analysis of two microarray gene-expression data sets to select lung adenocarcinoma marker genes. BMC Bioinformatics, 5, 81. https://doi.org/10.1186/1471-2105-5-81
13. Xia, S., & Yang, Y. (2022). An iterative model-free feature screening procedure: Forward recursive selection. Knowledge-Based Systems, 246, 108745. https://doi.org/10.1016/j.knosys.2022.108745
14. Calzolari, M. (2022). Shapicant (0.4.0) [Python]. https://github.com/manuel-calzolari/shapicant
15. Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., & Liu, H. (2017). Feature Selection: A Data Perspective. ACM Computing Surveys, 50(6), 1–45. https://doi.org/10.1145/3136625
16. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
17. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). An Introduction to Statistical Learning (8th ed.). Springer Science+Business Media.

Reference: PAGE 33 (2025) Abstr 11332 [www.page-meeting.org/?abstract=11332]

Poster: Methodology – AI/Machine Learning
