Large-Language-Model–Guided Discovery of Population Pharmacokinetic Structural Models: Evaluation Across Synthetic and Real Data

Romain Ferrara 1,2,3, Martin Soucail 1,3,5, Victor Gertner 4, Florence Gattacceca 1,3, Sébastien Benzekry 1,3

1 COMPutational pharmacology and clinical Oncology Team (COMPO), Inria Sophia-Antipolis – Mediterrannee, Cancer Research Center of Marseille, Inserm, CNRS (Marseille, France), 2 Institut Paoli-Calmettes, Comprehensive Cancer Centre, Nuclear Medicine Department (Marseille, France), 3 Aix-Marseille University (, France), 4 Centre for Computational Biology (CBIO), Mines Paris, PSL University (Paris, France), 5 Quantitative Pharmacology, Translational Medicine, Servier (Gif sur Yvette, France)

Introduction/Objectives: Structural model discovery in pharmacometrics is a highly iterative, manual, and time-intensive endeavor. Pharmacometricians must continuously propose structural hypotheses, fit them to concentration-time data using non-linear mixed-effects modeling (e.g., NONMEM, Monolix), and rigorously evaluate the output. Automated model selection methods have been proposed [1,2,3], but they rely on the optimization of a single numeric criterion, in a finite search space. Large Language Models (LLMs) optimized for complex reasoning and diagnostic interpretation can be deployed as autonomous agents within automated discovery loops. This work implements an LLM-based automated discovery approach, relying on agents iteratively reasoning and interacting to propose structural Ordinary Differential Equations (ODE) models written in MLXTRAN and fit locally using Monolix. We evaluate the capacity of LLM agents to discover pharmacokinetic (PK) model structures across a spectrum of conditions — from data-driven discovery to literature-informed refinement — and delineate practical boundaries on synthetic benchmarks (known ground-truth) and real clinical datasets.

Methods: The algorithm is inspired by HDTwinGen, an evolutionary algorithm that employs LLMs to autonomously discover ODE models [4]. Unlike HDTwinGen (implemented only on synthetic datasets with a non-population approach, non-realistic designs, and no comparison to ground truth), our implementation adapts this framework to population-based mixed-effects modeling, realistic designs, and performs rigorous benchmarking. In the first iteration, a model-builder agent (A1) proposes structural models and initial parameter values based on a prompt guiding domain knowledge, along with hard-coded instructions to write proper MLXTRAN ODE code. These models are automatically fit using Monolix. A diagnostic agent (A2) then iteratively analyzes the results (goodness-of-fit metrics, parameter values and RSEs) and proposes new models informed by the diagnostic analysis. Importantly, neither the data nor identifying elements from it are sent to the LLMs. Finally, a reflection agent (A3) evaluates all tested models and suggests the overall best, providing a motivated assessment of whether modeling should be continued further. An automatic report is generated at the end with: 1) the LLM interactions and reasoning, 2) the models tested, including their code and comparative fit performance, and 3) detailed goodness-of-fit plots and parameter identifiability metrics for the final model. To evaluate the modeling capabilities of the workflow, we designed a synthetic benchmark of 10 datasets spanning different routes of administration, numbers of compartments, and linear/nonlinear elimination. To approximate real-world modeling practice, the simulation designs and parameters were taken from published models, usually with rich sampling. We also evaluated the framework on 4 real clinical datasets: Theophylline [5], Warfarin [6] and two drugs from Phase I studies in oncology (Docetaxel [7] and Irinotecan [8]). These experimental designs were evaluated using Claude Sonnet 4.5.

Results: Synthetic benchmark: The workflow correctly identified model structures in 7/10 cases (consistent across 5 runs). Performance degraded with model complexity: 1-CMT structures were identified in 4/4 cases (linear and nonlinear elimination, with and without lag); 2-CMT and 3-CMT structures showed 2/4 and 1/2 correct identifications, respectively; but the structures discovered were close to the ground truth (e.g., 1-CMT Michaelis-Menten versus 2-CMT Michaelis-Menten). With diagnostics-enhanced reasoning using Claude Opus 4.5, 2 of 3 failures were resolved. Synthetic Paclitaxel (3-CMT Michaelis-Menten) remained unresolved; the agent confidently concluded that “nonlinear elimination is not supported by the data” — confusing poor identifiability with absence of mechanism. Real clinical data: Agent-identified models corresponded to published models in 1/4 cases (Docetaxel: 3-CMT linear). For Docetaxel, without knowing the drug, the workflow inferred that “the drug distributes into at least two peripheral tissues with different equilibration rates” and that “elimination pathways are NOT saturated” — both consistent with published pharmacology. For Irinotecan, the agent converged to 1-CMT versus published 3-CMT, but recovered the published model when pre-prompted with the drug name. For Theophylline and Warfarin, agent-suggested models differed from but outperformed literature models: lag-time versus first-order absorption (ΔBIC = +61) and transit compartment versus first-order absorption (ΔBIC = +200), respectively. Average runtime was under 10 minutes per compound at ~$0.11.

Conclusions: This automated discovery loop successfully navigates trade-offs between fit quality, parameter identifiability and parsimony. Limitations include degraded performance with increasing complexity and generation of plausible but unfounded mechanistic reasoning. The approach shows promise for initial model proposal, literature synthesis, and elementary pharmacometric interpretation. Further investigation is needed on reproducibility across LLM versions and experimental design requirements.

References:
[1] Bies RR, Muldoon MF, et al. A Genetic Algorithm-Based, Hybrid Machine Learning Approach to Model Selection. J Pharmacokinet Pharmacodyn 33, 195–221 (2006).
[2] Chen X, Nordgren R, et al. A fully automatic tool for development of population pharmacokinetic models. CPT Pharmacometrics Syst Pharmacol 13, 1784–1797 (2024).
[3] Richardson S, Irurzun Arana I, et al. A machine learning approach to population pharmacokinetic modelling automation. Commun Med 5, 327 (2025).
[4] Holt S, Liu T, Schaar M van der. Automatically Learning Hybrid Digital Twins of Dynamical Systems. Adv Neural Inf Process Syst 37, 72170–72218 (2025).
[5] Boeckmann AJ, Sheiner LB, Beal SL. NONMEM Users Guide — Part V (Introductory Guide). NONMEM Project Group, University of California, San Francisco (1994).
[6] O’Reilly RA. Studies on coumarin anticoagulant drugs. Initiation of warfarin therapy without a loading dose. Circulation 38, 169–177 (1968).
[7] Launay-Iliadis MC, Bruno R, et al. Population pharmacokinetics of docetaxel during phase I studies. Cancer Chemother Pharmacol 37, 47–54 (1995).
[8] Deyme L, Barbolosi D, et al. Population pharmacokinetic model of irinotecan and its four main metabolites. Cancer Chemother Pharmacol 88, 247–258 (2021).

Reference: PAGE 34 (2026) Abstr 11949 [www.page-meeting.org/?abstract=11949]

Poster: Oral: Methodology - New Modelling Approaches