I-064 Limei Cheng

Using machine learning to predict longitudinal platelet counts: a methodology framework

Enayetur Raheem (1), Limei Cheng (1), Hong Yang (1), and Jennifer Sheng (1)

(1) Incyte Research Institute, USA

Introduction: Conventional exposure-response (ER) models have been used to describe the dynamics of platelet changes during patient treatment, providing insight into the impact of drug exposure and thereby guiding dosing regimens [1-3]. An ER model analyses the empirical, quantitative relationship between drug exposure and platelet count. However, this approach can be limited by the available information and by the assumptions inherent in ER methodologies. Consequently, such modelling may not fully reflect the complex signalling pathways related to platelet dynamics, non-linear relationships and/or confounders present in the ER relationship. The advent of machine learning (ML) provides an opportunity to improve the search for and selection of previously unknown covariates [4,5] and could fill some of the current gaps in platelet count modelling [6].

Objectives: To explore the use of various ML methods and to compare selected ML models with the conventional ER analyses typically used in platelet count prediction.

Methods: Data were pooled from 162 patients across 2 studies investigating the safety and tolerability of INCB057643 (a small molecule bromodomain and extra-terminal inhibitor) in patients with advanced malignancies (NCT02711137) or myeloid neoplasms (NCT04279847). Patients received INCB057643 either as 8-16 mg qd monotherapy or as 4-6 mg qd in combination with ruxolitinib. The dataset included the dependent variable (longitudinal platelet counts), drug exposures and 26 time-invariant features. Two ER approaches were applied and compared: 1) ER statistical mixed models and 2) automated ML (AutoML) with feature engineering. A random intercept statistical mixed model was fitted to individual patient platelet counts. Grade 2 or higher thrombocytopenia was defined as a platelet count <75×10^9/L. For AutoML, 4 tree-based algorithms were applied: 1) gradient boosting machine (GBM), 2) eXtreme Gradient Boosting (XGBoost), 3) CatBoost and 4) LightGBM. Both the statistical and ML models were fitted to the training data and evaluated on the test (holdout) data. The best ML models were selected by comparing area under the receiver operating characteristic curve (AUC) values on the training and holdout data. The selected ML models were refined via grid search to optimise the hyperparameters. SHapley Additive exPlanations (SHAP) values and feature importance plots were used to explain the ML model predictions and to quantify the contribution of each variable to the model. All analyses were conducted using Python version 3.9.5 in the Databricks cloud computing environment.
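
The outcome definition and the training-versus-holdout AUC comparison described above can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline: the function names (`has_grade2_thrombocytopenia`, `roc_auc`) and all platelet counts and model scores below are hypothetical, chosen only to show how the <75×10^9/L threshold yields a binary label and how the AUC is computed as the probability that a randomly chosen positive case is scored above a randomly chosen negative one.

```python
from typing import Sequence

# Threshold for Grade 2 or higher thrombocytopenia, in units of 10^9/L
# (taken from the definition in the Methods).
GRADE2_THRESHOLD = 75.0


def has_grade2_thrombocytopenia(platelets: float) -> int:
    """Return 1 if the platelet count is below the threshold, else 0."""
    return int(platelets < GRADE2_THRESHOLD)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Each (positive, negative) pair counts 1 if the positive case has the
    higher score, 0.5 if the scores tie and 0 otherwise.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical platelet counts (10^9/L) and hypothetical predicted risks.
train_platelets = [210.0, 60.0, 150.0, 48.0, 95.0, 70.0]
train_scores = [0.05, 0.90, 0.20, 0.95, 0.30, 0.80]
holdout_platelets = [180.0, 72.0, 66.0, 120.0]
holdout_scores = [0.10, 0.40, 0.85, 0.55]

train_labels = [has_grade2_thrombocytopenia(p) for p in train_platelets]
holdout_labels = [has_grade2_thrombocytopenia(p) for p in holdout_platelets]

train_auc = roc_auc(train_labels, train_scores)
holdout_auc = roc_auc(holdout_labels, holdout_scores)
print(f"training AUC = {train_auc:.3f}, holdout AUC = {holdout_auc:.3f}")
```

In this toy example the model separates the training cases perfectly but ranks one holdout pair incorrectly, which is the kind of difference the training-versus-holdout comparison is meant to expose.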

Results: AUC values on the training vs holdout data were: statistical mixed model, 0.941 vs 0.828; GBM, 0.958 vs 0.850; CatBoost, 0.994 vs 0.903; XGBoost, 0.998 vs 0.882; LightGBM, 0.999 vs 0.904. AUC values using all data were 0.857, 0.931, 0.976, 0.972 and 0.971 for these models, respectively. Average cross-validation AUCs were 0.783, 0.776, 0.785 and 0.769 for the GBM, CatBoost, XGBoost and LightGBM ML models, respectively. The XGBoost and LightGBM ML models were determined to be the most predictive based on the AUC data and were selected for further analyses. Both the conventional and ML ER models consistently identified baseline platelet count and age as significant covariates or top features, suggesting their importance in relation to patient platelet counts. Additionally, the selected ML models (XGBoost and LightGBM), but not the conventional mixed models, repeatedly identified baseline albumin level, BMI, Cmax, bilirubin level and renal function (derived using the Modification of Diet in Renal Disease [MDRD] equation) as important features affecting patient platelet counts. It is hypothesised that patient health status is generally associated with organ function (eg, renal function as estimated by MDRD and liver function as reflected by alanine transaminase [ALT]) and may therefore have a non-linear relationship with platelet counts during treatment. Additional factors, including baseline ALT, baseline aspartate aminotransferase and sex, were identified by the XGBoost model alone.
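
As a reading aid, the training and holdout AUCs reported above can be tabulated together with their difference. The AUC values are copied from the Results; the "gap" column is a descriptive quantity computed here for illustration and is not a statistic reported by the authors, and the variable names are hypothetical.

```python
# (training AUC, holdout AUC) for each model, as reported in the Results.
reported_auc = {
    "Statistical mixed model": (0.941, 0.828),
    "GBM": (0.958, 0.850),
    "CatBoost": (0.994, 0.903),
    "XGBoost": (0.998, 0.882),
    "LightGBM": (0.999, 0.904),
}

# Difference between training and holdout AUC, rounded to 3 decimals
# (illustrative only; not reported in the abstract).
auc_gap = {
    name: round(train - holdout, 3)
    for name, (train, holdout) in reported_auc.items()
}

for name, (train, holdout) in reported_auc.items():
    print(f"{name:<24} train={train:.3f}  holdout={holdout:.3f}  gap={auc_gap[name]:.3f}")
```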

Conclusions: This study has established an efficient ML development framework for predicting platelet counts, including feature engineering, selection of ML models and interpretation of outputs. Furthermore, our findings highlight that ML methods can complement conventional ER analysis, providing additional insights to assist clinical data interpretation.

References
[1] Blanchet B, et al. Br J Cancer. Published online Jan 25, 2024.
[2] Liu C, et al. Clin Pharmacol Ther. 2017;101:657-666.
[3] van der Bij S, et al. Cancer Causes Control. 2013;24:1-12.
[4] Dou Y, et al. Comput Biol Med. 2024;170:108066.
[5] Feng D, et al. J Biomol Struct Dyn. 2024;1-13. DOI: 10.1080/07391102.2024.2301754.
[6] Pozdnyakova O, et al. J Clin Pathol. 2023;76(9):624-631.

Reference: PAGE 32 (2024) Abstr 10924 [www.page-meeting.org/?abstract=10924]

Poster: Methodology - New Modelling Approaches
