Harsha Krishna 2, Luca Marzano 1,2, Adam Darwich 2, Jayanth Raghothama 2, Sebastiaan Meijer 2
1 Certara (Leiden, Netherlands), 2 KTH Royal Institute of Technology (Stockholm, Sweden)
Introduction/Objectives. Population approaches typically rely on relational, tabular datasets as the core structure for storing and transforming clinical information prior to model development [1,2]. While this approach is well established, the data landscape supporting pharmacometric analyses has evolved substantially [3,4]. Contemporary clinical trial and real-world datasets are often highly complex, heterogeneous, and difficult to manipulate within rigid table-based schemas [2,5]. These structural limitations translate into a substantial operational burden: extensive preprocessing, ad hoc reshaping of time-dependent variables, and labor-intensive harmonization are frequently required before a dataset is analysis-ready [6,7].
This challenge is amplified in cross-study analyses (e.g., model-based meta-analysis, federated learning, or real-world evidence integration), where inconsistencies in coding standards, variable structures, and patient-level relationships across multiple sources create bottlenecks in the development of population models. To address these limitations, we explored the use of graph database principles [8] as an alternative data representation strategy for population approaches, enabling more flexible, relationship-centric structuring of clinical information.
Methods. We introduce a workflow for implementing graph databases as a foundation for handling clinical population data, leveraging their ability to represent information as graphs of nodes (entities) and edges (relationships). Graph representations allow direct encoding of hierarchical and nonlinear dependencies, such as patient-to-visit linkage, nested treatment cycles, covariate trajectories, and protocol events, without the constraints of predefined relational schemas [8].
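As a minimal sketch of the node-and-edge representation described above, the plain-Python structure below stores labeled nodes and typed edges and walks a patient-to-visit-to-measurement path. The node labels (Patient, Visit, Measurement), the relationship names (HAS_VISIT, HAS_MEASUREMENT), and all values are hypothetical illustrations and are not taken from the study.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory property graph: nodes carry a label and properties, edges carry a type."""

    def __init__(self):
        self.nodes = {}                     # node_id -> {"label": str, "props": dict}
        self.out_edges = defaultdict(list)  # node_id -> [(edge_type, target_id), ...]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, source, edge_type, target):
        self.out_edges[source].append((edge_type, target))

    def neighbors(self, node_id, edge_type):
        """Return targets reachable from node_id over edges of the given type."""
        return [t for (e, t) in self.out_edges[node_id] if e == edge_type]


# Hypothetical example data: one patient, two visits, one lab measurement.
g = PropertyGraph()
g.add_node("P001", "Patient", sex="F", age=63)
g.add_node("P001-V1", "Visit", cycle=1, day=0)
g.add_node("P001-V2", "Visit", cycle=2, day=21)
g.add_node("P001-V2-ANC", "Measurement", name="ANC", value=1.8)
g.add_edge("P001", "HAS_VISIT", "P001-V1")
g.add_edge("P001", "HAS_VISIT", "P001-V2")
g.add_edge("P001-V2", "HAS_MEASUREMENT", "P001-V2-ANC")


def measurements_for(graph, patient_id):
    """Traverse Patient -> Visit -> Measurement and return (cycle, name, value) tuples."""
    rows = []
    for visit in graph.neighbors(patient_id, "HAS_VISIT"):
        cycle = graph.nodes[visit]["props"]["cycle"]
        for m in graph.neighbors(visit, "HAS_MEASUREMENT"):
            props = graph.nodes[m]["props"]
            rows.append((cycle, props["name"], props["value"]))
    return rows


result = measurements_for(g, "P001")
```

The nesting that a flat table would encode with repeated key columns is expressed here by traversal alone: a visit without measurements simply has no outgoing HAS_MEASUREMENT edge, so no placeholder rows are needed.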
As a representative case study, we used open-access data from the Project Data Sphere SCLC repository [9], focusing on three phase III randomized controlled trials enrolling patients with extensive-disease small cell lung cancer (ED-SCLC) treated with platinum-etoposide chemotherapy (n=872). These datasets contain diverse variable structures, multiple time-varying components, and trial-specific protocol differences, making them an ideal test case for exploring graph-based data integration.
Graphs were built in the Neo4j environment [8], with ingestion, cleaning, and exploratory analysis performed through a Python workflow. No upfront harmonization or tabular restructuring was performed. Instead, datasets were ingested directly into the graph structure to evaluate how naturally the graph model captures relationships and facilitates exploratory analysis.
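One way such direct ingestion could look is sketched below: each raw record is translated into a parameterized Cypher statement without first reshaping the table, and empty fields create no relationship at all. This is an illustrative assumption, not the workflow used in the study; the function `row_to_cypher`, the labels (Study, Patient, Variable), the relationship types (ENROLLED_IN, HAS_VALUE), and the column names are hypothetical.

```python
def row_to_cypher(row, id_column="patient_id", study_column="study"):
    """Translate one raw record (a dict) into a (Cypher query, parameters) pair.

    Hypothetical graph model: every non-empty column becomes a HAS_VALUE
    relationship from the patient to a Variable node named after the column.
    """
    params = {"patient_id": row[id_column], "study": row[study_column]}
    merges = [
        "MERGE (s:Study {name: $study})",
        "MERGE (p:Patient {patient_id: $patient_id, study: $study})",
        "MERGE (p)-[:ENROLLED_IN]->(s)",
    ]
    sets = []
    index = 0
    for column, value in row.items():
        if column in (id_column, study_column):
            continue
        if value is None or value == "":
            continue  # a missing value is represented by the absence of an edge
        merges.append(f"MERGE (v{index}:Variable {{name: $name{index}}})")
        merges.append(f"MERGE (p)-[r{index}:HAS_VALUE]->(v{index})")
        sets.append(f"SET r{index}.value = $value{index}")
        params[f"name{index}"] = column
        params[f"value{index}"] = value
        index += 1
    return "\n".join(merges + sets), params


# Hypothetical raw record; the ECOG field is empty and is therefore skipped.
raw_row = {"patient_id": "P001", "study": "TRIAL_A", "AGE": 63, "ECOG": "", "SEX": "F"}
query, params = row_to_cypher(raw_row)

# With the official neo4j Python driver, the pair could then be sent to a
# running database via session.run(query, params); that step is not executed here.
```

Passing values as parameters rather than interpolating them into the query text keeps the statement reusable across records and avoids quoting problems with free-text fields.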
Results. Three independent graphs were successfully constructed, one for each randomized study. The resulting structure enabled immediate visualization of patient baselines, covariates, treatment outcomes, and data density, without the merging or flattening operations typically required in relational preprocessing.
Using native graph queries, we were able to perform exploratory data analysis (EDA) directly on the resulting graph structures, including identification of missing data patterns, comparison of patients, mapping of sub-cohorts of interest, and extraction of relevant information and summaries. Importantly, the graph model facilitated rapid linkage across the three trials, allowing harmonization of variable definitions and alignment of study structures through relationship-based mapping instead of manual table joins.
Compared with a previous analysis in which the same trials were processed and merged using a traditional relational database workflow [5], the graph-based approach resulted in substantially shorter preprocessing time and a markedly more flexible preprocessing workflow.
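To illustrate how missing data patterns can be read off relationships rather than empty table cells, the sketch below treats each recorded value as a (patient, variable) edge and groups patients by the set of variables they have no edge to. The data, variable names, and the Cypher string are hypothetical; the Cypher text only indicates how a comparable question might be phrased in Neo4j and is not executed.

```python
from collections import Counter

# Hypothetical HAS_VALUE edges: a pair is present only if the value was recorded.
has_value = {
    ("P001", "AGE"), ("P001", "ECOG"), ("P001", "LDH"),
    ("P002", "AGE"), ("P002", "LDH"),
    ("P003", "AGE"),
    ("P004", "AGE"), ("P004", "LDH"),
}

# Patients with no recorded value at all would need to be listed separately;
# here every patient has at least one edge.
patients = sorted({p for p, _ in has_value})
variables = sorted({v for _, v in has_value})


def missing_pattern(patient):
    """Return the tuple of variables for which the patient has no HAS_VALUE edge."""
    return tuple(v for v in variables if (patient, v) not in has_value)


# How many patients share each missingness pattern.
pattern_counts = Counter(missing_pattern(p) for p in patients)

# How many patients lack each variable.
missing_per_variable = Counter(v for p in patients for v in missing_pattern(p))

# Assumed Cypher phrasing of the per-variable count (illustrative, not run here).
CYPHER_MISSING_PER_VARIABLE = """
MATCH (p:Patient), (v:Variable)
WHERE NOT (p)-[:HAS_VALUE]->(v)
RETURN v.name AS variable, count(p) AS n_missing
"""
```

Grouping by pattern rather than by single variable shows which variables tend to be absent together, which is often the more useful view when deciding how to define sub-cohorts.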
Conclusions. Graph databases offer a promising and flexible alternative to traditional relational schemas for handling complex clinical datasets. Representing the relationships in the data as graphs enables more intuitive exploration, easier harmonization, and more efficient integration across heterogeneous data sources. This approach has direct relevance for improving the scalability and reproducibility of population approaches, particularly in settings requiring multi-study integration, real-world data ingestion, federated learning frameworks, or AI/machine learning pipelines.
References:
1. Bauermeister, S., Phatak, M., Sparks, K., Sargent, L., Griswold, M., McHugh, C., … & Gallacher, J. (2023). Evaluating the harmonisation potential of diverse cohort datasets. European Journal of Epidemiology, 38(6), 605-615.
2. Kumar, G., Basri, S., Imam, A. A., Khowaja, S. A., Capretz, L. F., & Balogun, A. O. (2021). Data harmonization for heterogeneous datasets: a systematic literature review. Applied Sciences, 11(17), 8275.
3. Freidlin, B., & Korn, E. L. (2023). Augmenting randomized clinical trial data with historical control data: precision medicine applications. JNCI: Journal of the National Cancer Institute, 115(1), 14-20.
4. Koch, G., Pfister, M., Daunhawer, I., Wilbaux, M., Wellmann, S., & Vogt, J. E. (2020). Pharmacometrics and machine learning partner to advance clinical data analysis. Clinical Pharmacology & Therapeutics, 107(4), 926-933.
5. Marzano, L., Darwich, A. S., Dan, A., Tendler, S., Lewensohn, R., De Petris, L., … & Meijer, S. (2024). Exploring the discrepancies between clinical trials and real‐world data: A small‐cell lung cancer study. Clinical and translational science, 17(8), e13909.
6. Hou, J., Zhao, R., Gronsbell, J., Lin, Y., Bonzel, C. L., Zeng, Q., … & Cai, T. (2023). Generate analysis-ready data for real-world evidence: tutorial for harnessing electronic health records with advanced informatic technologies. Journal of medical Internet research, 25, e45662.
7. Meng, J., Wu, M., Shi, F., Xie, Y., Wang, H., & Guo, Y. (2025). Medical laboratory data-based models: opportunities, obstacles, and solutions. Journal of Translational Medicine, 23(1), 823.
8. Meher, D., Bulakh, P., & Jabde, M. (2023). Learning graph databases: Neo4j an overview. International Journal of Engineering Applied Sciences and Technology, 8(02), 2455-2143.
9. Green, A. K., Reeder-Hayes, K. E., Corty, R. W., Basch, E., Milowsky, … & Wood, W. A. (2015). The project data sphere initiative: accelerating cancer research by sharing data. The oncologist, 20(5), 464-e20.
Reference: PAGE 34 (2026) Abstr 12129 [www.page-meeting.org/?abstract=12129]
Poster: Methodology - New Modelling Approaches