Teresa Sotto Mayor(1), Itziar Irurzun Arana(2), Martin Johnson(2)
(1) Biomedical Sciences, Imperial College London, UK; (2) Clinical Pharmacology and Quantitative Pharmacology, BioPharmaceuticals R&D, AstraZeneca R&D, Cambridge, UK.
Objectives:
Most oncology drugs have a narrow therapeutic index. Optimising the dose of such compounds could maximise the benefit whilst minimising the risk for patients. In this context, dose optimisation can be framed in terms of rewards that encode the desired outcome. Reinforcement learning (RL) learns which actions maximise a reward by interacting with a system (1). Yauney and Shah applied RL to optimise chemotherapy doses based on changes in tumour size (TS), with the aim of maximising clinical benefit (2). In this work, we use RL algorithms to find optimal dosing regimens for cancer treatment, balancing efficacy (as TS) and safety (as maintenance of neutrophil (NEUT) levels).
Methods:
A reinforcement learning setup consists of an environment, an agent and a reward function. We built the environment using PK-PD models simulating a typical cancer patient, including tumour growth kinetics (3) and the time course of neutrophil changes (4). From this environment, the agent observes the current state of the patient (tumour size and neutrophil count) and decides the next action (the dose for the next cycle). Based on this action, the environment changes its state (tumour size and neutrophil count) and produces a reward for that action. The agent iterated through the entire treatment duration (20 cycles) 1000 times and learned a dosing policy by choosing different doses for every cycle. We treated the first 900 iterations as the learning phase and used the final 100 iterations to calculate the proportion of times each dose was selected in each cycle.
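The agent-environment loop described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the RL algorithm, so tabular Q-learning with epsilon-greedy exploration is swapped in here; the dose levels, the state discretisation and the one-cycle dynamics in `step` are hypothetical placeholders and not the PK-PD models of references (3) and (4).

```python
import random
from collections import defaultdict

DOSES = [0, 125, 250, 500]   # hypothetical dose levels (mg)
N_CYCLES = 20                # treatment duration, from the abstract
N_EPISODES = 1000            # total iterations, from the abstract
N_LEARNING = 900             # learning phase, from the abstract


def step(ts, neut, dose):
    """Toy one-cycle update (placeholder dynamics, not the published models)."""
    effect = dose / 500.0
    new_ts = ts * (1.05 - 0.15 * effect)                   # growth minus drug effect
    new_neut = neut + 0.3 * (5.0 - neut) - 1.5 * effect    # recovery minus toxicity
    new_neut = max(new_neut, 0.1)
    # Hypothetical reward: relative tumour shrinkage plus relative neutrophil change.
    reward = (ts - new_ts) / ts + (new_neut - neut) / neut
    return new_ts, new_neut, reward


def discretise(cycle, ts, neut):
    """Map the continuous patient state to a coarse table index."""
    return (cycle, min(int(ts / 10), 20), min(int(neut), 10))


def train(seed=0, alpha=0.1, gamma=0.9, epsilon=0.1):
    rng = random.Random(seed)
    q = defaultdict(lambda: [0.0] * len(DOSES))
    counts = [[0] * len(DOSES) for _ in range(N_CYCLES)]
    for episode in range(N_EPISODES):
        ts, neut = 50.0, 5.0   # hypothetical baseline patient
        for cycle in range(N_CYCLES):
            state = discretise(cycle, ts, neut)
            if rng.random() < epsilon:
                action = rng.randrange(len(DOSES))          # explore
            else:
                action = max(range(len(DOSES)), key=lambda a: q[state][a])
            ts, neut, reward = step(ts, neut, DOSES[action])
            if cycle == N_CYCLES - 1:
                target = reward                             # last cycle: no future value
            else:
                next_state = discretise(cycle + 1, ts, neut)
                target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            if episode >= N_LEARNING:                       # final 100 iterations
                counts[cycle][action] += 1
    n_eval = N_EPISODES - N_LEARNING
    # Proportion of times each dose was selected in each cycle.
    return [[c / n_eval for c in row] for row in counts]


proportions = train()
```

Here `proportions[c][d]` is the fraction of the final 100 iterations in which dose `DOSES[d]` was chosen in cycle `c`. Exploration is left switched on during those iterations for simplicity; a real analysis might prefer to evaluate the greedy policy.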
To achieve this, we tested three different reward functions. The first (RF1) evaluated the absolute decrease in TS and the absolute change in neutrophil count; the second (RF2) considered tumour size reduction; and the third (RF3) accounted for the percentage changes in TS and NEUT. In addition, we included a safety threshold in all reward functions such that Grade 2 neutropenia incurred a large penalty. The dosing policies based on RF1, RF2 and RF3 are denoted DP1, DP2 and DP3, respectively.
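As a rough illustration of how such reward functions could be written, the sketch below implements an absolute-change reward in the spirit of RF1 and a percentage-change reward in the spirit of RF3, each with a neutropenia penalty. The weights `w_ts` and `w_neut`, the penalty magnitude and the threshold of 1.5 x 10^9 cells/L for Grade 2 neutropenia are assumptions made for this example and are not values reported in the abstract. RF2 is left out because its exact form is not given.

```python
GRADE2_ANC = 1.5   # assumed Grade 2 neutropenia threshold (10^9 cells/L)
PENALTY = 100.0    # hypothetical penalty magnitude


def safety_penalty(neut_new):
    """Large negative reward when neutrophils fall below the assumed threshold."""
    return -PENALTY if neut_new < GRADE2_ANC else 0.0


def rf1(ts_old, ts_new, neut_old, neut_new, w_ts=1.0, w_neut=1.0):
    """Absolute tumour decrease plus absolute neutrophil change (RF1-like)."""
    return (w_ts * (ts_old - ts_new)
            + w_neut * (neut_new - neut_old)
            + safety_penalty(neut_new))


def rf3(ts_old, ts_new, neut_old, neut_new, w_ts=1.0, w_neut=1.0):
    """Percentage tumour decrease plus percentage neutrophil change (RF3-like)."""
    pct_ts = 100.0 * (ts_old - ts_new) / ts_old
    pct_neut = 100.0 * (neut_new - neut_old) / neut_old
    return w_ts * pct_ts + w_neut * pct_neut + safety_penalty(neut_new)
```

Changing the ratio of `w_neut` to `w_ts` is one simple way to express a policy that weights safety above efficacy, or the reverse.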
We assessed the impact of dosing on treatment effect by comparing the time course of TS obtained under each of the three policies with that obtained under the recommended 500 mg daily dosing regimen. Finally, the dosing policy that administered the lowest total amount of drug over the treatment duration was considered the best dosing policy.
Results:
Dosing policies DP1 and DP2 consistently chose 500 mg during the initial dosing cycles. After nine cycles, however, dose selection appeared entirely random, indicating that DP1 and DP2 could not learn a dosing policy beyond cycle nine. The PKPD model (which formed the environment) included a resistance component that increases TS irrespective of dose. As both DP1 and DP2 weighted TS higher than NEUT, the agents struggled to find a dose that would result in a positive reward. In contrast, DP3, which weighted safety higher than efficacy, fluctuated between the highest and lowest doses throughout the 20 treatment cycles.
The total doses administered by DP1, DP2 and DP3 were 48%, 60% and 37%, respectively, of the total dose given by the 500 mg daily dosing regimen over 20 treatment cycles. Despite the lower total dose, these dosing regimens maintained tumour trajectories similar to that of the 500 mg daily dosing regimen.
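The total-dose comparison is simple arithmetic, sketched below. The example schedule is invented purely for illustration and is not one of the learned policies; the function assumes every cycle contains the same number of dosing days, so that the number of days cancels in the ratio.

```python
def relative_total_dose(schedule_mg, reference_mg=500.0):
    """Total dose of a per-cycle schedule as a percentage of a fixed reference dose."""
    total = sum(schedule_mg)
    reference_total = reference_mg * len(schedule_mg)
    return 100.0 * total / reference_total


# Hypothetical 20-cycle schedule: 5 cycles at 500 mg, 10 at 250 mg, 5 without drug.
example_schedule = [500] * 5 + [250] * 10 + [0] * 5
example_percentage = relative_total_dose(example_schedule)
```

For this invented schedule the total is 5000 mg per dosing day summed over cycles against 10000 mg for the reference, so `example_percentage` is 50.0.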
Nevertheless, this analysis used a simple environment (the PKPD model), which might not reflect the complex biology that governs tumour behaviour. Further work will explore more complex models and between-patient variability as environments.
Conclusions:
Our work shows that reinforcement learning-based algorithms can balance efficacy and safety to identify potential dosing regimens. These regimens suggest that a reduced total dose could achieve similar efficacy and better overall safety than the standard regimen with a single dose level.
References:
[1] Sutton R, Barto A. Reinforcement Learning: An Introduction. 2nd ed. The MIT Press; 2018.
[2] Yauney G, Shah P. Reinforcement learning with action-derived rewards for chemotherapy and clinical trial dosing regimen selection. Proceedings of Machine Learning Research; 2018.
[3] Claret L, Girard P, Hoff PM, Van Cutsem E, Zuideveld KP, Jorga K, et al. Model-based prediction of phase III overall survival in colorectal cancer on the basis of phase II tumor dynamics. Journal of Clinical Oncology. 2009;27(25):4103–8.
[4] Friberg LE, Henningsson A, Maas H, Nguyen L, Karlsson MO. Model of chemotherapy-induced myelosuppression with parameter consistency across drugs. Journal of Clinical Oncology. 2002;20(24):4713–21.
Reference: PAGE 30 (2022) Abstr 10151 [www.page-meeting.org/?abstract=10151]
Poster: Methodology – AI/Machine Learning