An Interactive Shiny Application for Machine Learning-Driven Covariate Selection in Pharmacometrics - PAGE Meeting (Population Approach Group Europe)

Ali Farnoud ¹, Luis Wirth ²

1 Boehringer Ingelheim Pharma GmbH & Co. KG (Germany), 2 HMS Analytical Software GmbH (Germany)

Introduction/Objectives:
Machine learning (ML)-based approaches are efficient alternatives to stepwise covariate modeling (SCM) for covariate selection in pharmacometrics [1,2]. The mlcov R package [3] implements a framework combining LASSO regularization [4] with the Boruta algorithm [5] using XGBoost [6], demonstrating substantial time savings compared to SCM. However, mlcov requires R programming proficiency, limiting accessibility. Additionally, automatic LASSO-based feature removal may exclude potentially relevant covariates without user oversight, and the package is limited to regression tasks with a single ML algorithm.
We developed an interactive Shiny application that extends the mlcov methodology. The primary objective was to provide a user-friendly interface for ML-based covariate selection (MCS) without coding. Secondary objectives were: (1) implementing user-controlled collinearity detection using multiple correlation methods; (2) expanding the ML algorithm portfolio to include XGBoost, CatBoost [7], and LightGBM [8] with automatic selection of the best-performing model; (3) extending the framework to support classification tasks with imbalanced data handling; and (4) ensuring reproducibility through portable analytical functions compatible with Quarto/RMarkdown reports for validated environments.

Methods:
The application was developed in R using Shiny, with a modular architecture that separates portable analytical functions from user interface logic. Users upload datasets in standard formats (CSV, XPT, SAS7BDAT, etc.) or connect to a database, and continuous and categorical covariates are detected automatically. In contrast to the automatic LASSO removal in mlcov, users examine associations between features using Pearson correlation, Spearman correlation, and Mutual Information (MI), which captures both linear and non-linear dependencies. Interactive heatmaps display the pairwise associations, and users select which features to exclude based on domain knowledge and each feature's correlation with the target. The core ML analysis employs 5-fold cross-validation with XGBoost [6], CatBoost [7], and LightGBM [8]. Within each fold, Boruta [5] identifies relevant covariates by comparing feature importance scores against those of randomly permuted shadow features. The best-performing algorithm is selected automatically per target based on cross-validated RMSE (regression) or AUC (classification). A voting mechanism retains covariates that are selected in ≥2 of the 5 folds. For classification, inverse-frequency class weights address imbalanced datasets. Feature importance is visualized with SHAP values [9], showing global rankings and individual impacts through beeswarm plots. Model performance is assessed through goodness-of-fit plots (regression) and confusion matrices (classification). Analytical functions are framework-agnostic and compatible with Quarto/RMarkdown for validated environments.
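The fold-level voting, class weighting, and model selection steps described above can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the application itself is implemented in R, and it is not the authors' code. All function names are hypothetical, and the weight formula n / (k * n_c) is one common definition of inverse-frequency weights, assumed here because the abstract does not state the exact formula.

```python
from collections import Counter
from statistics import mean


def vote_covariates(selected_per_fold, min_folds=2):
    """Keep covariates selected in at least `min_folds` folds.

    `selected_per_fold` holds one list of covariate names per CV fold
    (e.g. the covariates confirmed by Boruta in that fold).
    """
    counts = Counter(c for fold in selected_per_fold for c in set(fold))
    return sorted(c for c, n in counts.items() if n >= min_folds)


def inverse_frequency_weights(labels):
    """Assumed formula: weight_c = n / (k * n_c), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def select_best_model(cv_scores, task="regression"):
    """Pick the algorithm with the best mean cross-validated score.

    Regression scores are RMSE (lower is better); classification
    scores are AUC (higher is better).
    """
    means = {name: mean(scores) for name, scores in cv_scores.items()}
    pick = min if task == "regression" else max
    return pick(means, key=means.get)


# Illustrative inputs only; these are not data from the studies.
folds = [["WT", "AGE"], ["WT"], ["WT", "CRCL"], ["AGE"], ["WT"]]
print(vote_covariates(folds))
print(inverse_frequency_weights(["F0"] * 8 + ["F4"] * 2))
print(select_best_model({"xgboost": [0.30, 0.32], "lightgbm": [0.28, 0.29]}))
```

With these made-up inputs, a covariate seen in only one fold (CRCL) is dropped, the minority class receives the larger weight, and the algorithm with the lower mean RMSE is chosen.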

Results:
The application was evaluated on five population pharmacokinetic studies (regression) and one classification task. For regression, empirical Bayes estimates for clearance (CL) were extracted from NONMEM base models with corresponding covariates. In all studies, selected covariates were consistent with NONMEM-based SCM. Goodness-of-fit plots and Visual Predictive Checks confirmed adequate performance. Execution times were reduced from hours (SCM) to minutes, demonstrating substantial efficiency gains.
For classification, fibrosis stage prediction was conducted using a virtual patient population from a quantitative systems pharmacology model. The approach identified 15 biomarkers consistent with published literature and achieved an AUC of 0.98, F1-score of 0.85, and accuracy of 0.88, demonstrating robust performance despite class imbalance. The framework accommodated substantially larger candidate biomarker sets than traditional approaches. The user-controlled collinearity module provided transparent visualization that enabled informed exclusion decisions, in contrast to the black-box LASSO removal in mlcov.
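For readers less familiar with the classification metrics reported above, the following sketch shows how accuracy and an F1-score can be derived from a confusion matrix. It is a Python illustration, not the application's code. The function name is hypothetical, and macro-averaging of per-class F1 is an assumption, since the abstract does not state how the F1-score was averaged. The matrix shown is invented and is not the study's result.

```python
def accuracy_and_macro_f1(cm):
    """Compute accuracy and macro-averaged F1 from a confusion matrix.

    cm[i][j] = number of samples with true class i predicted as class j.
    Macro-averaging (an assumption here) gives every class equal weight,
    which makes the score sensitive to minority-class performance.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total

    f1_scores = []
    for i in range(k):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(k)) - tp  # predicted i, truly other
        fn = sum(cm[i]) - tp                       # truly i, predicted other
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return accuracy, sum(f1_scores) / k


# Invented two-class example for illustration only.
acc, f1 = accuracy_and_macro_f1([[8, 2], [1, 9]])
print(round(acc, 3), round(f1, 3))
```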

Conclusions:
This Shiny application democratizes MCS by removing the programming barrier of command-line tools, making ML methodologies accessible to pharmacometricians without coding expertise. User-controlled collinearity detection addresses the limitations of automatic feature removal, allowing informed decisions based on domain knowledge. Multiple gradient boosting algorithms with automatic model selection, classification support, and imbalanced data handling extend applicability beyond the regression-only scope of existing tools. Comprehensive visualization, including goodness-of-fit plots, confusion matrices, and SHAP values, provides the transparent model evaluation that pharmacometric applications require. The modular architecture enables validation in regulated environments through Quarto/RMarkdown compatibility.

References:
[1] Jonsson E, Karlsson M (1998) Automated covariate model building within NONMEM. Pharm Res 15(9):1463–1468.

[2] Sibieude E et al. (2021) Fast screening of covariates in population models empowered by machine learning. J Pharmacokinet Pharmacodyn 48(4):597–609.

[3] Rebai I et al. mlcov: R package for Covariate Selection Using Machine Learning. PAGE 32 (2024) Abstr 10996.

[4] Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22.

[5] Kursa M, Rudnicki W (2010) Feature selection with the Boruta package. J Stat Softw 36(11):1–13.

[6] Chen T, Guestrin C (2016) XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), pp. 785–794.

[7] Prokhorenkova L et al. (2018) CatBoost: unbiased boosting with categorical features. In: Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 6639–6649.

[8] Ke G et al. (2017) LightGBM: A highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 3146–3154.

[9] Lundberg S, Lee S (2017) A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 4765–4774.

Reference: PAGE 34 (2026) Abstr 11974 [www.page-meeting.org/?abstract=11974]

Poster: Methodology – AI/Machine Learning