Alina Melnikova 1, Kirill Zhudenkov 1,2, Leonid Stolbov 1, Victor Sokolov 1,3
1 M&S Decisions (Dubai, United Arab Emirates), 2 Research Center of Model-Informed Drug Development, Sechenov First Moscow State Medical University (Moscow, Russia), 3 Marchuk Institute of Numerical Mathematics of the Russian Academy of Sciences (INM RAS) (Moscow, Russia)
Introduction
In recent years, several approaches to data synthesis have been proposed for generating virtual populations from observed input data, including deep learning-based methods (variational autoencoders, generative adversarial networks, diffusion models), Bayesian networks, and tree-based models such as classification and regression trees (CART) and random forests [1]. In this work, we systematically assessed various data synthesis methods using two criteria: authenticity, i.e. statistical similarity between original and synthetic data, and usefulness, i.e. preservation of relationships that are important for subsequent analysis. Based on this assessment, we developed a tool for generating virtual populations and integrated it into the SimuRg R package.
Methods
The analysis uses a joint distribution synthesis algorithm for patient characteristics derived from real biomedical data, based on a random forest generative algorithm implemented in R. The algorithm performs optimally on medical datasets characterized by a moderate number of observations (250–1000) and a limited number of variables (1–10).
The accuracy with which individual distributions were reproduced was evaluated using Kolmogorov-Smirnov tests, the Jensen-Shannon divergence (JSD), and comparative histograms of continuous and categorical variables for synthetic and original data. Preservation of multidimensional distributions was assessed using the difference between the correlation matrices of continuous variables and visualization based on the UMAP (Uniform Manifold Approximation and Projection) machine learning algorithm. The percentage of rows in the synthetic dataset with exact duplicates in the original dataset was used as a privacy metric.
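The abstract does not give explicit formulas for these diagnostics. As a point of reference, they can be written in their standard forms as below. The symbols, the base-2 logarithm, and the averaging over variable pairs are assumptions made for illustration, not definitions taken from the authors.

```latex
% Assumed standard definitions; the abstract does not give explicit formulas.
% F_o, F_s: empirical CDFs of one continuous variable (original, synthetic).
D = \sup_x \left| F_o(x) - F_s(x) \right|

% P, Q: category frequencies of one categorical variable; M is their mixture.
\mathrm{JSD}(P \parallel Q) = \tfrac{1}{2} \sum_k p_k \log_2 \frac{p_k}{m_k}
  + \tfrac{1}{2} \sum_k q_k \log_2 \frac{q_k}{m_k},
\qquad m_k = \tfrac{1}{2}(p_k + q_k)

% \tau_{ij}: Kendall coefficient between continuous variables i and j (d variables).
\bar{\Delta}_\tau = \frac{2}{d(d-1)} \sum_{i<j} \left| \tau_{ij}^{o} - \tau_{ij}^{s} \right|

% S: rows of the synthetic dataset, O: rows of the original dataset.
\text{duplicates (\%)} = 100 \cdot \frac{\left| \{ s \in S : s \in O \} \right|}{|S|}
```

With a base-2 logarithm the JSD lies between 0 (identical category frequencies) and 1 (no shared categories), and a large Kolmogorov-Smirnov p-value indicates that the test does not detect a difference between the original and synthetic distributions.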
Results
An R-based algorithm was developed and integrated into the SimuRg package. The algorithm employs a random forest generative approach that effectively handles both categorical and continuous data, even when distributions are non-normal or imbalanced. It requires minimal hyperparameter tuning compared to deep learning approaches and demonstrates improved stability over single-tree-based methods such as CART [2], [3]. The tool was evaluated on four well-established biomedical datasets: AIDS (ACTG) [4], breast cancer (GBSG2 [5] and METABRIC [6]), and liver cirrhosis (PBC) [7]. The number of subjects across datasets ranged from 280 to 1000, with 4 to 7 categorical and 2 to 6 continuous variables. Variable distributions included skewed and non-normal continuous data, as well as imbalanced categorical variables.
For each dataset, 100 synthetic datasets were generated and summary diagnostic metrics were computed. Tests for the preservation of individual and multidimensional distributions showed adequate results for all datasets studied. The mean p-value of the Kolmogorov-Smirnov test (averaged over all continuous variables in the dataset) was 0.80–0.84, and the mean JSD (averaged over all categorical variables in the dataset) was below 0.002. The mean absolute difference between the Kendall correlation coefficients of corresponding variable pairs in the original and synthetic data was 0.01–0.04. The percentage of rows in the synthetic datasets with exact duplicates in the original data was 0%.
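The summary metrics reported above could be computed along the following lines. This is a minimal standard-library Python sketch, not the authors' R implementation in SimuRg: all function names are hypothetical, the Kolmogorov-Smirnov p-value uses the asymptotic approximation, JSD uses a base-2 logarithm, and the "synthetic" replicates in the usage lines are plain random draws standing in for the output of a generator.

```python
import math
import random
from bisect import bisect_right
from collections import Counter


def ks_two_sample(x, y):
    """Two-sample Kolmogorov-Smirnov statistic and asymptotic p-value."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # The largest gap between the two empirical CDFs occurs at an observed value.
    d = max(abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m) for v in xs + ys)
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:  # the Kolmogorov tail probability is ~1 here; the series converges slowly
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def jsd(a, b):
    """Jensen-Shannon divergence (base 2, range 0..1) between category frequencies."""
    pa, pb = Counter(a), Counter(b)
    total = 0.0
    for cat in set(pa) | set(pb):
        p, q = pa[cat] / len(a), pb[cat] / len(b)
        mix = 0.5 * (p + q)
        if p > 0:
            total += 0.5 * p * math.log2(p / mix)
        if q > 0:
            total += 0.5 * q * math.log2(q / mix)
    return total


def kendall_tau(x, y):
    """Kendall tau-b by direct pair counting (O(n^2), fine for a few hundred rows)."""
    conc = disc = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx * dy > 0:
                conc += 1
            elif dx * dy < 0:
                disc += 1
            elif dx != 0:
                tied_y_only += 1
            elif dy != 0:
                tied_x_only += 1
    denom = math.sqrt((conc + disc + tied_y_only) * (conc + disc + tied_x_only))
    return (conc - disc) / denom if denom else 0.0  # 0.0 for a constant variable (a convention)


def mean_abs_tau_difference(orig_cols, syn_cols):
    """Mean |tau_orig - tau_syn| over all pairs of continuous variables (dicts of columns)."""
    names = sorted(orig_cols)
    diffs = [abs(kendall_tau(orig_cols[a], orig_cols[b]) - kendall_tau(syn_cols[a], syn_cols[b]))
             for i, a in enumerate(names) for b in names[i + 1:]]
    return sum(diffs) / len(diffs)


def duplicate_percentage(orig_rows, syn_rows):
    """Percentage of synthetic rows that exactly match some original row."""
    seen = set(map(tuple, orig_rows))
    return 100.0 * sum(tuple(r) in seen for r in syn_rows) / len(syn_rows)


# Usage: average the KS p-value over 100 replicates of one continuous variable.
# The replicates are fresh draws from the same distribution, used only as a
# stand-in; they are NOT output of the random forest generator.
rng = random.Random(1)
original = [rng.gauss(50, 10) for _ in range(300)]
replicates = [[rng.gauss(50, 10) for _ in range(300)] for _ in range(100)]
mean_p = sum(ks_two_sample(original, rep)[1] for rep in replicates) / len(replicates)
print(f"mean KS p-value over {len(replicates)} replicates: {mean_p:.2f}")
```

In practice, library routines such as `scipy.stats.ks_2samp` would normally replace the hand-written test; the pure-Python versions are used here only to keep the sketch self-contained.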
Conclusions
A biomedical data generation tool based on a random forest algorithm has been developed and deployed in R. The tool analyzes the characteristics of the original dataset and synthesizes data with closely matching properties, a capability essential for evaluating diverse scenarios in quantitative systems pharmacology, as well as for the design and execution of virtual clinical trials.
References:
[1] H. Murtaza, M. Ahmed, N. F. Khan, G. Murtaza, S. Zafar, and A. Bano, “Synthetic data generation: State of the art in health care domain,” Comput. Sci. Rev., vol. 48, p. 100546, May 2023, doi: 10.1016/j.cosrev.2023.100546.
[2] M. Miletic, A. Bollinger, S. S. Allemann, and M. Sariyar, “Synthetic data for pharmacogenetics: enabling scalable and secure research,” JAMIA Open, vol. 8, no. 5, p. ooaf107, Sep. 2025, doi: 10.1093/jamiaopen/ooaf107.
[3] I. Akiya, T. Ishihara, and K. Yamamoto, “Comparison of Synthetic Data Generation Techniques for Control Group Survival Data in Oncology Clinical Trials: Simulation Study,” JMIR Med. Inform., vol. 12, p. e55118, Jun. 2024, doi: 10.2196/55118.
[4] S. M. Hammer et al., “A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less. AIDS Clinical Trials Group 320 Study Team,” N. Engl. J. Med., vol. 337, no. 11, pp. 725–733, Sep. 1997, doi: 10.1056/NEJM199709113371101.
[5] M. Schumacher et al., “Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group,” J. Clin. Oncol., vol. 12, no. 10, pp. 2086–2093, Oct. 1994, doi: 10.1200/JCO.1994.12.10.2086.
[6] B. Pereira et al., “The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes,” Nat. Commun., vol. 7, p. 11479, May 2016, doi: 10.1038/ncomms11479.
[7] R. E. Dickson, P. M. Grambsch, T. R. Fleming, L. D. Fisher, and A. Langworthy, “Prognosis in primary biliary cirrhosis: Model for decision making,” Hepatology, vol. 10, no. 1, pp. 1–7, Jul. 1989, doi: 10.1002/hep.1840100102.
Reference: PAGE 34 (2026) Abstr 12027 [www.page-meeting.org/?abstract=12027]
Poster: Methodology – AI/Machine Learning