IV-08 Katherine Briceño

Feedforward neural network as a surrogate for the reinforcement learning action-value function in a model-informed precision dosing oncology application

Katherine Briceño-Guerrero (1), Manfred Opper (2), Niklas Hartung (1), Wilhelm Huisinga (1), Jana de Wiljes (1)

(1) Institute of Mathematics, University of Potsdam, Potsdam, Germany, (2) Technische Universität Berlin, Berlin, Germany

Objective: Model-informed precision dosing (MIPD) leverages prior knowledge and biomarker data for individualized drug dosing. Such dose adjustments are particularly important in fields like oncology, where narrow therapeutic windows and severe dose-limiting effects such as neutropenia are common [1]. MIPD approaches employing pharmacokinetic/pharmacodynamic models of drug therapy and neutrophil counts in the blood have been proposed in this field [2]. Classically, dose adjustments have been based on maximum-a-posteriori estimation of individual parameters in a Bayesian framework. More recently, reinforcement learning (RL) has been successfully applied to the MIPD problem [3]. RL can be described as a set of decision-making rules that learn which actions to take in specific situations/states by evaluating the consequences of these actions [4]. However, standard methods for parameter estimation in such models fail when the dimension of the parameter space becomes large, e.g., of the same or even larger order than the sample size [5]. Our objective is to introduce deep reinforcement learning (DRL) to extend a previously proposed RL approach [3] towards efficient parameter estimation and dosing regimen optimization in MIPD.

Methods: DRL combines RL and deep learning (DL). RL learns how agents ought to take sequences of actions (policies) in an environment in order to maximize cumulative rewards (e.g., quality of life, therapy efficacy). A key concept in RL is the action-value function or Q-value (Q), which describes the expected long-term therapeutic benefit of taking a given action (e.g., a dose) in a given state (e.g., a neutropenia grade) [6]. Complementarily, DL provides function approximators, used here to extend our knowledge of the RL agent's environment. The DL approach parameterizes a function f that maps the state-action space to the Q-value using a set of parametric weights [7]. Function approximation via deep neural networks (NN) is at the heart of machine learning; such networks are characterized by a succession of multiple processing layers. Each layer applies a non-linear transformation, and the sequence of these transformations learns different levels of abstraction [8]. All layers are trained jointly to minimize the empirical error.
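
To illustrate the idea, the following minimal Python sketch trains a feedforward NN that maps a (state, action) input to a scalar Q-value by gradient descent on the empirical mean squared error. All sizes, data, and the single-hidden-layer architecture are illustrative assumptions, not the network used in this work:

import numpy as np

rng = np.random.default_rng(0)

# Toy training data: inputs are standardized (state, action) triples,
# e.g. (neutropenia grade, treatment cycle, dose); targets are Q-values.
X = rng.normal(size=(512, 3))
y = np.tanh(X @ np.array([0.5, -0.3, 0.8]))  # stand-in "true" Q-values

# One hidden layer of 8 tanh units (41 weights in total; the 134-weight
# architecture of the abstract is not reproduced here).
W1 = rng.normal(scale=0.5, size=(3, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

lr = 0.05
for epoch in range(500):
    h = np.tanh(X @ W1 + b1)            # non-linear layer transformation
    q_hat = (h @ W2 + b2).ravel()       # surrogate Q-value
    err = q_hat - y
    gq = 2 * err[:, None] / len(X)      # gradient of the empirical MSE
    gW2 = h.T @ gq; gb2 = gq.sum(0)
    gh = (gq @ W2.T) * (1 - h ** 2)     # backpropagate through tanh
    gW1 = X.T @ gh; gb1 = gh.sum(0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= lr * g                     # gradient-descent weight update

print("training MSE:", np.mean(err ** 2))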

The optimal policy for RL is not analytically accessible; thus, sample-based techniques are applied to find it. A Monte Carlo Tree Search (MCTS), as in [3], is used to approximate the Q-value for different drug concentrations by partially exploring the best candidates in the state-action space. The NN is then trained on the 61% of the 152,000 state-action pairs observed by the MCTS, together with the corresponding action-value estimates. The NN fitting process estimates 134 weights, with the architecture selected to maintain minimal complexity. The mean squared error on the test set is comparable to that on the training set, indicating no relevant overfitting. Following completion of the fitting process, a virtual physician is emulated with access to both the baseline methodology presented in [3] and the NN-approximated Q. Selecting the action that maximizes the action-value function in each state yields the optimal policy/dose for a virtual patient. Effects on the optimal policy are evaluated using the median change in Q between the baseline and DRL optimal recommendations. For unobserved pairs, we consider the DRL recommendation beneficial if the approximated Q is greater than 0.
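
As a concrete illustration of the policy-extraction step (the function name, stand-in Q surrogate, and dose grid below are hypothetical, not the authors' implementation), the greedy policy simply selects, per state, the dose that maximizes the surrogate Q:

import numpy as np

def greedy_dose(q_net, state, dose_grid):
    # Evaluate the surrogate Q for every candidate dose and take the argmax.
    q_values = [q_net(np.concatenate([state, [dose]])) for dose in dose_grid]
    return dose_grid[int(np.argmax(q_values))]

# Usage with a placeholder Q surrogate that happens to prefer doses near 75 mg:
q_net = lambda x: -abs(x[-1] - 75.0)
dose_grid = np.arange(50.0, 101.0, 1.0)  # hypothetical candidate doses in mg
print(greedy_dose(q_net, np.array([2.0, 1.0]), dose_grid))  # -> 75.0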

Results: For unobserved state-action pairs, 96% of the doses advised by the NN show a significant (i.e., larger than the model error) positive approximated benefit, with a median of 0.974 Q. For MCTS-observed states, changes in the optimal doses show a median significant difference in the advised dose of -1 mg and a median significant decrease in benefit of 0.047 Q. These results exemplify the hidden risks/benefits in regions of the RL environment unobserved by MCTS that become approximable through the introduction of DRL.

Conclusion: The integration of DRL enables estimating a Q-value in a high-dimensional space with limited access to samples (curse of dimensionality) [4]. Moreover, introducing an NN as an intermediate step provides plausible expectations, including uncertainty quantification, for state-action pairs that remain unobserved for ethical or economic reasons or because of limited computing resources. DRL allows an efficient parametrization of the agent's environment, particularly in complex decision processes under the sparse and scarce data commonly found in the drug monitoring data used for MIPD.

References:
[1] Maier C, Hartung N, de Wiljes J, Kloft C, Huisinga W. Bayesian data assimilation to support informed decision making in individualized chemotherapy. CPT Pharmacometrics Syst Pharmacol. 2020;9:153-164.
[2] Keizer RJ, ter Heine R, Frymoyer A, Lesko LJ, Mangat R, Goswami S. Model-informed precision dosing at the bedside: scientific challenges and opportunities. CPT Pharmacometrics Syst Pharmacol. 2018;7:785-787. https://doi.org/10.1002/psp4.12353
[3] Maier C, Hartung N, Kloft C, Huisinga W, de Wiljes J. Reinforcement learning and Bayesian data assimilation for model-informed precision dosing in oncology. CPT Pharmacometrics Syst Pharmacol. 2021;10(3):241-254. https://doi.org/10.1002/psp4.12588
[4] Ribba B, Dudal S, Lavé T, Peck RW. Model-informed artificial intelligence: reinforcement learning for precision dosing. Clin Pharmacol Ther. 2020;107(4):853-857. https://doi.org/10.1002/cpt.1777
[5] Wainwright MJ. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge: Cambridge University Press; 2019. https://doi.org/10.1017/9781108627771
[6] Bertsekas DP. Reinforcement Learning and Optimal Control. Belmont, MA: Athena Scientific; 2019.
[7] François-Lavet V, Henderson P, Islam R, Bellemare MG, Pineau J. An Introduction to Deep Reinforcement Learning. Foundations and Trends in Machine Learning. 2018;11(3-4).
[8] Bishop CM. Pattern Recognition and Machine Learning. New York: Springer; 2006. ISBN 9780387310732.

Reference: PAGE 30 (2022) Abstr 10219 [www.page-meeting.org/?abstract=10219]

Poster: Methodology – AI/Machine Learning