Integrating Reinforcement Learning and PK-PD modelling to enable precision dosing: a multi-objective optimization for the treatment of Polycythemia Vera patients with Givinostat
Alessandro De Carlo (1), Elena Maria Tosca (1), Paolo Magni (1)
(1) Laboratory of Bioinformatics, Mathematical Modelling and Synthetic Biology (BMS lab), Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
Objectives: Precision dosing is an emerging approach that aims to optimize and customize the dose and the treatment schedule at the individual patient level [1]. This individual-oriented paradigm can also be applied to therapeutics under development to reduce attrition due to narrow therapeutic windows and/or high inter-individual variability.
Reinforcement Learning (RL) is a class of Artificial Intelligence decision-making algorithms in which an Agent (e.g., a computer program) learns which actions should be taken in given situations by evaluating their consequences [3]. The application of RL for precision dosing purposes is currently under investigation [2,4].
Givinostat is a drug under clinical investigation for the treatment of polycythemia vera (PV), a chronic myeloproliferative neoplasm causing an excessive increase of platelet (PLT), white blood cell (WBC) and haematocrit (HCT) levels [5]. Givinostat is administered in treatment cycles with a dose-adaptive administration protocol in which the dose is adjusted based on the PLT, WBC and HCT levels, which are considered efficacy and safety endpoints. Thus, this is a multi-objective therapy, as its goal is to simultaneously maintain all three haematological parameters within their respective normal ranges (complete haematological response, CHR).
A model-based simulation platform for Givinostat treatment [6] is here combined with RL to optimize the treatment of individual PV patients. Furthermore, the optimization framework is embedded in a Bayesian paradigm to facilitate its application in clinical practice.
Methods: A popPK-PD model describing the Givinostat effect on PLT, WBC and HCT and a clinical trial simulation platform were already available [6]. The individual model parameter distributions were used to generate a population of 98 PV patients, on which the adaptive-dose protocol proposed for the clinical trial was evaluated and considered as a reference.
Q-Learning (QL) was chosen as the RL algorithm [3], and the precision dosing problem was formalized as follows:
- The Givinostat simulation platform [6] was embedded in the QL framework to simulate all the possible scenarios.
- The QL algorithm was trained considering a time horizon of eight 28-day-long cycles, in accordance with the duration of the planned phase III clinical trial.
- Agent actions were defined accounting for safety constraints. The QL agent selects the optimal starting dose; then, at the end of each treatment cycle, the previously administered dose can be confirmed for the next cycle or increased/decreased by a fixed amount. Treatment is temporarily interrupted in the presence of grade I or II thrombocytopenia and/or neutropenia, after which the QL agent must select a resumption dose ≤ the dose that caused the interruption.
- System states were described by a tuple containing the PLT, WBC and HCT levels, discretized into inefficacy/efficacy/toxicity bands, and the previously administered dose (or the dose that caused the temporary interruption).
- The reward function was defined to 1) normalize and maintain PLT, WBC and HCT within their respective ranges for as long as possible and 2) penalize the occurrence of severe toxicities.
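The formalization above can be sketched as a tabular Q-learning set-up. The state encoding, dose grid and step size below are illustrative assumptions, not the study's actual values:

```python
from collections import defaultdict

# Toy encoding of the states and safety-constrained actions described above.
# A state is (plt_band, wbc_band, hct_band, last_dose), with each band
# discretized as 0 = inefficacy, 1 = efficacy, 2 = toxicity (assumed coding).
DOSE_LEVELS = [50, 100, 150, 200]   # hypothetical dose grid (mg)
DOSE_STEP = 50                      # fixed increase/decrease amount (assumed)

def allowed_actions(last_dose, interrupted):
    """Next-cycle doses: confirm or move by one step; after an interruption,
    the resumption dose must not exceed the dose that caused it."""
    return [d for d in DOSE_LEVELS
            if abs(d - last_dose) <= DOSE_STEP
            and not (interrupted and d > last_dose)]

def q_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning backup:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max((Q[(s_next, an)] for an in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)
s = (1, 1, 1, 100)                  # all endpoints in range, last dose 100 mg
q_update(Q, s, 100, 1.0, (1, 1, 1, 100), allowed_actions(100, False))
```

Note how the safety constraint on resumption doses is enforced by pruning the action set rather than by penalizing forbidden actions, so the agent can never select them.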
A unique QL agent was trained considering the average contributions of all the 98 patients, to identify an optimal protocol (QLpop) applicable to the whole population. QLpop performance was compared with that of the clinical protocol to validate the robustness of the RL approach. Subsequently, an individual QL agent was trained for each subject in the population to personalize the Givinostat treatment. The QL-based individual protocols were then compared with both the QLpop and the clinical protocols to assess the efficacy of an individual-oriented optimization. Finally, to make the QL-based individual optimization strategy, which requires a priori knowledge of the individual parameters, more realistic, it was integrated within a Bayesian paradigm that allows continuous learning of the individual model parameters.
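The contrast between the two training schemes can be illustrated with a toy stand-in: a single Q-table that accumulates the contributions of all patients versus one Q-table per patient (the transition data and simplified learning rule are assumptions, not the study's set-up):

```python
from collections import defaultdict

def train(q, transitions, alpha=0.1):
    """Simplified Q-learning pass over (state, action, reward) transitions."""
    for s, a, r in transitions:
        q[(s, a)] += alpha * (r - q[(s, a)])
    return q

patients = {                      # synthetic per-patient transitions
    "pat_1": [("efficacy", 100, 1.0)],
    "pat_2": [("efficacy", 100, 0.5)],
}

# Population agent (QLpop): one shared table averages all patients' experience.
q_pop = defaultdict(float)
for transitions in patients.values():
    train(q_pop, transitions)

# Individual agents: one table per patient, tailored to that subject only.
q_ind = {pid: train(defaultdict(float), transitions)
         for pid, transitions in patients.items()}
```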
Results: The clinical and QLpop protocols performed similarly on the reference population, achieving comparable rewards (41481 vs 43185) and CHR at the 8th cycle (79% vs 71%), confirming the validity of the overall formalization of the Givinostat precision dosing problem. The population-level optimization was then compared with the individual-oriented approach on the same population. The QL-based individual protocols achieved 94% CHR at the 8th cycle without any severe toxicity events, outperforming the clinical protocol (79%, 10 severe toxicity events). This good performance was confirmed also when the treatment period was extended to 2 years. Furthermore, the individual protocols dramatically reduced the time needed to reach PLT, WBC and HCT normalization, with a median of 47 days against the 105 days required by the clinical protocol (paired Wilcoxon test, p = 5.82e-13).
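A paired Wilcoxon signed-rank comparison of normalization times like the one reported can be run with `scipy.stats.wilcoxon`; the data below are synthetic placeholders, not the study's results:

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic days-to-normalization for 98 patients under the two protocols
# (illustrative values only; the study data are not reproduced here).
rng = np.random.default_rng(0)
t_individual = rng.integers(30, 70, size=98).astype(float)
t_clinical = t_individual + rng.integers(20, 90, size=98)

stat, p = wilcoxon(t_individual, t_clinical)   # paired, two-sided by default
```

`wilcoxon` is the appropriate paired test here because the same 98 simulated patients are evaluated under both protocols.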
As CHR at the 8th cycle is the main outcome of the planned clinical trial, individual QL agents were also trained to maximize this specific metric. To this aim, a bonus term, granted if the patient was normalized at the end of the 8th treatment cycle, was added to the reward function, increasing the CHR percentage to 98%.
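The reward shaping just described can be sketched as a per-cycle in-range term, a severe-toxicity penalty and a terminal CHR bonus; all weights below are assumed for illustration, not taken from the study:

```python
N_CYCLES = 8   # eight 28-day treatment cycles, per the planned trial horizon

def cycle_reward(in_range, severe_toxicity, cycle,
                 base=1.0, penalty=5.0, chr_bonus=10.0):
    """Reward for one treatment cycle (all weights are illustrative)."""
    r = base if in_range else 0.0        # keep PLT, WBC and HCT in range
    if severe_toxicity:
        r -= penalty                     # penalize severe toxicities
    if cycle == N_CYCLES and in_range:   # bonus term targeting CHR at cycle 8
        r += chr_bonus
    return r
```

The terminal bonus only changes the value of trajectories that end in CHR, which is what steers the agent toward the trial's primary outcome without altering the within-treatment incentives.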
The main limitation of the individual-oriented approach lies in assuming the individual model parameters to be known, an unrealistic condition, at least at the beginning of the treatment. Therefore, the RL-popPK-PD approach was embedded into a Bayesian framework as follows:
- At the beginning of the treatment, the initial dose is established according to the QLpop protocol.
- At the end of each cycle, the current measurements are used to perform a Bayesian update of the individual PK-PD parameters. The new parameter values are then used to retrain the individual QL agent, which selects the dose adjustment for the next cycle.
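The update-then-retrain loop above can be sketched with a toy one-parameter conjugate-normal stand-in for the popPK-PD model; every function, name and number here is an illustrative assumption:

```python
def bayes_update(mu, var, obs, obs_var=1.0):
    """Conjugate normal update of one individual parameter from a new
    end-of-cycle measurement (toy stand-in for the full Bayesian step)."""
    post_var = 1.0 / (1.0 / var + 1.0 / obs_var)
    post_mu = post_var * (mu / var + obs / obs_var)
    return post_mu, post_var

def retrain_and_select(mu):
    """Stand-in for retraining the individual QL agent on the updated
    parameters and letting it pick the next dose (toy threshold rule)."""
    return 100 if mu < 1.0 else 50

mu, var = 0.0, 4.0            # population prior on the individual parameter
dose = 100                    # initial dose from the population agent (QLpop)
for obs in [1.8, 2.1, 1.9]:   # synthetic end-of-cycle measurements
    mu, var = bayes_update(mu, var, obs)   # Bayesian update of parameters
    dose = retrain_and_select(mu)          # new individual agent picks dose
```

The key design point is that the posterior sharpens cycle by cycle, so the retrained individual agent gradually replaces the population-level policy as patient-specific information accumulates.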
Conclusion: A hybrid framework coupling PK-PD models and RL was evaluated on a real case study with a high degree of complexity due to the presence of multiple outcomes to optimize and of constraints derived from clinical practice. The approach proved to be an efficient and flexible tool for optimizing individual treatments, potentially for both marketed and under-development drugs. Its integration within a Bayesian paradigm overcomes the main limitation of the methodology; a further evaluation step on a wider range of case studies will therefore be conducted.
References:
[1] Tyson, R. J., Park, C. C., Powell, J. R., Patterson, J. H., Weiner, D., Watkins, P. B., & Gonzalez, D. (2020). Precision dosing priority criteria: drug, disease, and patient population variables. Frontiers in Pharmacology, 11, 420.
[2] Ribba, B. (2022). Reinforcement learning as an innovative model-based approach: Examples from precision dosing, digital health and computational psychiatry. Frontiers in Pharmacology, 13.
[3] Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
[4] Ribba, B., Dudal, S., Lavé, T., & Peck, R. W. (2020). Model-informed artificial intelligence: reinforcement learning for precision dosing. Clinical Pharmacology & Therapeutics, 107(4), 853-857.
[5] Iurlo, A., Cattaneo, D., Bucelli, C., & Baldini, L. (2020). New perspectives on polycythemia vera: from diagnosis to therapy. International Journal of Molecular Sciences, 21(16), 5805.
[6] PAGE 30 (2022) Abstr 10054 [www.page-meeting.org/?abstract=10054]