Machine learning for covariate imputation – application in a real-world scenario - PAGE Meeting (Population Approach Group Europe)

Verena Schöning ¹, Claudia Suenderhauf ², Laura Hermann ¹,³, Stephan Krähenbühl ⁴, Manuel Haschke ¹, Felix Hammann ¹

1 Division of Clinical Pharmacology & Toxicology, Department of Internal Medicine, University Hospital Bern (Bern, Switzerland), 2 AarReha Schinznach AG (Schinznach, Switzerland), 3 Medical Clinic, Zug Cantonal Hospital (Baar, Switzerland), 4 Clinical Pharmacology & Toxicology, University Hospital Basel (Basel, Switzerland)

Objective: Population pharmacokinetic (PopPK) models describe changes in drug concentration across diverse patient populations, using a non-linear mixed-effects (NLME) modeling approach to account for covariate effects. The aim is to explain the variability of drug exposure within populations through patient-specific covariates, e.g., age, sex, weight, disease state, or organ function. However, retrospective studies in particular are susceptible to missing covariate information. While several approaches for data imputation in pharmacometrics exist, machine learning (ML) paradigms are not yet widely used but can offer interesting additions to the analytical toolbox. The major advantage of ML imputation methods is that they use all available covariates to predict missing values. For example, estimation of renal function depends on several parameters such as sex, age, and weight. Trained ML models can capture the relationships between these parameters and use them to impute missing renal function, instead of merely replacing the value with the mean renal function of the whole study population. Therefore, even covariates that are not part of the final PopPK dataset can improve the imputation of relevant covariates. Two popular ML techniques for imputation are Random Forest (RF) and k-nearest neighbor (kNN). We used retrospective data on three aminoglycosides (gentamicin, amikacin, and tobramycin) for which covariate information was missing for a considerable share of patients. We built PopPK models from differently imputed datasets (RF, kNN) and compared them with models from a complete case dataset (CCD) and with findings from published literature. Additionally, a sensitivity analysis was conducted to assess the robustness of the approach.
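The contrast between mean replacement and covariate-aware imputation can be illustrated with a minimal sketch. It is not the implementation used in the study: the abstract does not name any software, and the patient records, the function names `mean_impute` and `knn_impute`, the choice of k = 2, and the standardized Euclidean distance are all assumptions made for illustration. The sketch also assumes that the predictor covariates (age, weight) are complete.

```python
import math

def mean_impute(rows, target):
    """Replace missing target values with the mean of the observed ones."""
    observed = [r[target] for r in rows if r[target] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, **{target: mean if r[target] is None else r[target]})
            for r in rows]

def knn_impute(rows, target, predictors, k=2):
    """Replace missing target values with the mean target of the k nearest
    complete rows, measured on standardized predictor covariates."""
    # Standardize so that no single covariate dominates the distance.
    sd = {}
    for p in predictors:
        vals = [r[p] for r in rows]
        mu = sum(vals) / len(vals)
        sd[p] = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals)) or 1.0

    def dist(a, b):
        return math.sqrt(sum(((a[p] - b[p]) / sd[p]) ** 2 for p in predictors))

    donors = [r for r in rows if r[target] is not None]
    out = []
    for r in rows:
        if r[target] is None:
            nearest = sorted(donors, key=lambda d: dist(r, d))[:k]
            value = sum(d[target] for d in nearest) / len(nearest)
            out.append(dict(r, **{target: value}))
        else:
            out.append(dict(r))
    return out

# Hypothetical records (not study data); None marks a missing eGFR value.
patients = [
    {"age": 25, "weight": 70, "egfr": 110.0},
    {"age": 30, "weight": 75, "egfr": 105.0},
    {"age": 78, "weight": 60, "egfr": 45.0},
    {"age": 82, "weight": 58, "egfr": 40.0},
    {"age": 80, "weight": 59, "egfr": None},
]

by_mean = mean_impute(patients, "egfr")
by_knn = knn_impute(patients, "egfr", ["age", "weight"], k=2)
print(by_mean[-1]["egfr"], by_knn[-1]["egfr"])
```

In this toy dataset the mean-imputed value for the 80-year-old patient is pulled toward the younger patients, whereas the kNN value is taken from the two most similar patients.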

Method: We conducted a retrospective, single-center study based on data collected at the University Hospital Basel from routine therapeutic drug monitoring (TDM) of gentamicin, amikacin, and tobramycin between January 1st 2014 and December 31st 2017. Data were collected from electronic health records (EHRs) and included demographic information, laboratory values (including plasma concentrations of the studied aminoglycosides), dosing and timing of administered aminoglycosides, diagnosis, and outcomes. Missing covariate information was imputed using two different ML algorithms: Random Forest (RF) and k-nearest neighbor (kNN). Additionally, we created a CCD, from which occasions or patients with missing covariate information were excluded. We then used NLME modelling to build PopPK models for each dataset and validated the models with non-parametric bootstrap analyses. Furthermore, we conducted a sensitivity analysis to assess the robustness of the imputation by shifting the imputed values to 80% and 120% of their original values. Lastly, we compared the parameter estimates with previously published PopPK studies.
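The shift used in the sensitivity analysis can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `shift_imputed`, the example values, and the use of `None` to mark originally missing entries are hypothetical. The key point is that only imputed entries are scaled, while observed values are left unchanged.

```python
def shift_imputed(original, imputed, factor):
    """Scale only the entries that were imputed; observed values stay as they are.

    original: covariate values with None where the value was missing
    imputed:  the same covariate after imputation (no None entries)
    factor:   multiplicative shift, e.g. 0.8 or 1.2
    """
    return [imp * factor if orig is None else orig
            for orig, imp in zip(original, imputed)]

# Hypothetical eGFR values (not study data).
egfr_observed = [110.0, None, 45.0, None]   # None marks a missing covariate
egfr_imputed = [110.0, 90.0, 45.0, 50.0]    # assumed output of an imputation step

# One dataset per scenario; each would then be used to re-estimate the PopPK model.
scenarios = {f: shift_imputed(egfr_observed, egfr_imputed, f)
             for f in (0.8, 1.0, 1.2)}
for factor, values in scenarios.items():
    print(factor, values)
```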

Result: We included 189 occasions with 300 plasma concentrations for gentamicin, 72 occasions with 132 plasma concentrations for amikacin, and 141 occasions with 280 plasma concentrations for tobramycin in the analysis. Gentamicin had the highest share of complete cases (82.7%), followed by amikacin (68.2%) and tobramycin (40.0%). After considering different structural models, we found that one-compartment models with linear elimination best described the data of all three aminoglycosides after intravenous administration, with estimated GFR as a covariate on clearance and weight as a covariate on volume of distribution. Overall, the point estimates from both imputation methods for gentamicin and amikacin are comparable and, in most cases, align closely with those from the CCD. For gentamicin, even though the point estimates align, the percentage relative standard errors (RSE%) of the inter-occasion variability on clearance (CLIOV) and the inter-individual variability on volume of distribution (VIIV) are notably lower in the models with imputed data than in the CCD model. In the amikacin models, the RSE% for VIIV and CLIIV are also lowest in the imputed models. For tobramycin, VIOV is lower in the RF model. Furthermore, VIIV and CLIOV are lower in both imputed models. Re-estimation of the PopPK models with the shifted imputation values resulted in only minor changes in parameter estimates, and no clinically meaningful differences were observed.
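For orientation, the structural model described above can be written as a short sketch of a one-compartment model with linear elimination after a single intravenous infusion. Everything numeric here is an assumption: the abstract reports neither parameter values nor the functional form of the covariate relationships, so the linear scaling of clearance with eGFR and of volume with weight, the reference values (90 mL/min, 70 kg), the population values `cl_pop` and `v_pop`, and the dose are purely illustrative.

```python
import math

def conc_one_cpt_infusion(t, dose, t_inf, cl, v):
    """Plasma concentration (mg/L) at time t (h) after the start of a single
    zero-order IV infusion of `dose` (mg) given over `t_inf` (h)."""
    k = cl / v                 # first-order elimination rate constant (1/h)
    rate = dose / t_inf        # infusion rate (mg/h)
    if t <= t_inf:
        return rate / cl * (1 - math.exp(-k * t))
    c_end = rate / cl * (1 - math.exp(-k * t_inf))   # concentration at end of infusion
    return c_end * math.exp(-k * (t - t_inf))

def individual_params(egfr, weight, cl_pop=4.0, v_pop=20.0):
    """ASSUMED covariate model (not taken from the abstract): clearance (L/h)
    scales linearly with eGFR and volume (L) scales linearly with weight."""
    cl = cl_pop * (egfr / 90.0)
    v = v_pop * (weight / 70.0)
    return cl, v

# Two hypothetical 70 kg patients who differ only in renal function.
cl_normal, v_normal = individual_params(egfr=90, weight=70)
cl_impaired, v_impaired = individual_params(egfr=45, weight=70)

c_normal_8h = conc_one_cpt_infusion(8.0, dose=240, t_inf=0.5, cl=cl_normal, v=v_normal)
c_impaired_8h = conc_one_cpt_infusion(8.0, dose=240, t_inf=0.5, cl=cl_impaired, v=v_impaired)
print(c_normal_8h, c_impaired_8h)
```

Under these assumptions, a wrongly imputed eGFR translates directly into a wrong individual clearance, which is why the quality of the imputation matters for the estimated parameters.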

Conclusion: The parameter point estimates from the RF and kNN models showed no bias relative to the CCD while reducing random error, reflecting the propensity of ML imputation techniques to generate supporting data points by picking up on the underlying distributions of covariates.

Reference: PAGE 34 (2026) Abstr 12311 [www.page-meeting.org/?abstract=12311]

Poster: Methodology – AI/Machine Learning
