Andrew Dwi Permana 1, Eliya Abedi 2,3,4, Elinor Nemlander 2,3,5, Adam S. Darwich 1, Annika Sjövall 4,7, Axel C. Carlsson 2,3, Andreas Rosenblad 2,4,8,9, Sebastiaan Meijer 1, Marcela Ewing 10,11, Jan Hasselström 2,3, Jayanth Raghothama 1
1 Department of Biomedical Engineering and Health Systems, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH Royal Institute of Technology (Stockholm, Sweden), 2 Department of Neurobiology, Care Sciences and Society, Division of Family Medicine and Primary Care, Karolinska Institutet (Stockholm, Sweden), 3 Academic Primary Health Care Centre (Stockholm, Sweden), 4 Regional Cancer Centre Stockholm-Gotland (Stockholm, Sweden), 5 Jakobsbergs Universitetsvårdcentral, Region Stockholm (Stockholm, Sweden), 6 Liljeholmens Universitetsvårdcentral, Region Stockholm (Stockholm, Sweden), 7 Division of Coloproctology, Department of Pelvic Cancer, Karolinska University Hospital, Department of Molecular Medicine and Surgery, Karolinska Institutet (Stockholm, Sweden), 8 Department of Statistics, Uppsala University (Uppsala, Sweden), 9 Department of Medical Sciences, Division of Clinical Diabetology and Metabolism, Uppsala University (Uppsala, Sweden), 10 Regional Cancer Centre West, Region Västra Götaland (Gothenburg, Sweden), 11 General Practice/Family Medicine, School of Public Health and Community Medicine, Institute of Medicine, Sahlgrenska Academy, University of Gothenburg (Gothenburg, Sweden)
Introduction: Colorectal cancer (CRC) remains a leading cause of cancer incidence and mortality worldwide[1]. In Sweden, it ranks as the third most prevalent cancer and a major contributor to mortality[2]. Despite improvements in survival rates, early diagnosis remains challenging because symptoms are often nonspecific and frequently overlap with benign conditions[3], [4]. In primary health care (PHC), where most patients first present, this diagnostic difficulty contributes to missed or delayed CRC identification[3], [5], [6]. Recent advancements in machine learning (ML) and natural language processing (NLP) have improved CRC risk prediction, particularly through the integration of structured data and unstructured clinical records[7], [8]. While multimodal approaches demonstrate promising gains in predictive performance, their applicability in clinical settings remains to be seen[9], [10]. Furthermore, most existing models rely on diagnostic codes or data recorded close to diagnosis time, thereby limiting their ability to identify CRC before clinical suspicion arises.
Objectives: To enhance the early detection of CRC in PHC by identifying subtle or previously under-recognised early clinical signals, while also reducing uncertainty in predictive modelling through the integration of multimodal electronic health records (EHRs).
Methods: We analysed EHRs from nearly 15,000 patients at PHC centres in Stockholm, Sweden, including 2,992 patients diagnosed with stage I-III CRC between 2015 and 2019. For each CRC case, data were collected from the year preceding the diagnosis date. EHR data comprised structured variables (demographics, laboratory results), categorical lifestyle factors (tobacco and alcohol habits), International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10) codes, and unstructured clinical text from general practitioner (GP) consultations and referral notes; referral notes were those sent to other clinics for further investigation. Unstructured Swedish clinical text was processed using transformer-based language models (KB/bert-base-swedish-cased and the fine-tuned SweDeClin-BERT)[11], [12], with Named Entity Recognition and Topic Modelling employed to extract clinically meaningful features. We embedded the ICD-10 codes using a PyTorch neural network embedding layer, which generated an 8-dimensional vector for each ICD-10 code, and then applied principal component analysis (PCA) to reduce the dimensionality.
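The ICD-10 embedding step can be illustrated with a minimal sketch. The 8-dimensional PyTorch embedding layer and the use of PCA come from the description above; the code vocabulary, the untrained (randomly initialised) weights, the target of 2 components, and the helper names `embed_codes` and `pca_reduce` are assumptions made only for illustration and are not taken from the study.

```python
import torch
from torch import nn

torch.manual_seed(0)  # reproducible random initialisation for the sketch

# Hypothetical vocabulary of ICD-10 codes (illustrative only).
codes = ["K59.0", "D50.9", "R10.4", "K62.5", "E11.9"]
code_to_index = {code: i for i, code in enumerate(codes)}

# One 8-dimensional vector per ICD-10 code, as described in the Methods.
embedding = nn.Embedding(num_embeddings=len(codes), embedding_dim=8)


def embed_codes(code_list):
    """Look up the embedding vector for each code in code_list."""
    idx = torch.tensor([code_to_index[c] for c in code_list], dtype=torch.long)
    with torch.no_grad():  # lookup only; no training in this sketch
        return embedding(idx)


def pca_reduce(x, n_components):
    """Project the rows of x onto their first n_components principal axes."""
    centered = x - x.mean(dim=0, keepdim=True)
    # Rows of vh are the principal axes of the centred data.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return centered @ vh[:n_components].T


vectors = embed_codes(codes)                   # shape: (5, 8)
reduced = pca_reduce(vectors, n_components=2)  # shape: (5, 2); 2 is an assumed target
```

In practice the embedding weights would presumably be learned rather than left at their random initial values; the abstract does not state how they were trained or how many principal components were retained.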
To closely simulate real-world clinical decision-making and enhance model applicability, we proposed a longitudinal modelling approach that excludes data from the three months preceding the CRC diagnosis date. This exclusion window was established through consultations with practising GPs and supported by prior evidence indicating an increased frequency in consultations during the 100 days before a CRC diagnosis[13]. By withholding this period, our approach ensures that the model learns patterns present before any clinical suspicion arises.
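As a rough illustration of this exclusion window, the sketch below keeps only visits that fall within the year before diagnosis but earlier than the final three months. The 90-day approximation of "three months", the 365-day lookback, the record layout, and the function name `visits_in_window` are assumptions for illustration, not the study's implementation.

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 365   # assumption: one year of data before diagnosis
EXCLUSION_DAYS = 90   # assumption: "three months" approximated as 90 days


def visits_in_window(visits, diagnosis_date,
                     lookback_days=LOOKBACK_DAYS,
                     exclusion_days=EXCLUSION_DAYS):
    """Return visits dated within the lookback period but before the exclusion window."""
    start = diagnosis_date - timedelta(days=lookback_days)
    cutoff = diagnosis_date - timedelta(days=exclusion_days)
    return [v for v in visits if start <= v["date"] < cutoff]


# Hypothetical visit records for a single patient.
example_visits = [
    {"date": date(2017, 3, 1), "note": "older than one year before diagnosis"},
    {"date": date(2017, 9, 15), "note": "inside the modelling window"},
    {"date": date(2018, 2, 1), "note": "inside the modelling window"},
    {"date": date(2018, 5, 10), "note": "inside the excluded three months"},
]
kept = visits_in_window(example_visits, diagnosis_date=date(2018, 6, 1))
```

For control patients, who have no diagnosis date, an index date would have to be chosen by some other rule; the abstract does not describe that rule.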
For model development, missing values were initially retained, as the XGBoost classifier natively handles NaN entries without imputation. However, subsequent experiments showed that explicitly replacing missing values with null placeholders improved model outcomes, likely due to more consistent handling of sparse feature representations. Class imbalance was addressed using the scale_pos_weight parameter, which modulates the relative influence of CRC and control samples during training. Model interpretability was evaluated using SHapley Additive exPlanations (SHAP) to quantify feature contributions at both the global and visit levels. To evaluate the model on different data subsets, optimise hyperparameters, and limit overfitting, we employed 10-fold cross-validation with 80% training and 20% testing. The performance metrics included accuracy, precision, recall, F1-score, specificity, AUC-ROC, and log loss, covering both discrimination and calibration.
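To make the class weighting and the threshold-based metrics concrete, the following sketch computes a candidate scale_pos_weight using the common heuristic of negatives divided by positives, and derives accuracy, precision, recall, specificity, and F1-score from a confusion matrix. The heuristic is a widely used default for XGBoost; whether the study used this exact value is not stated, and the function names and toy labels are hypothetical.

```python
def scale_pos_weight(labels):
    """Common heuristic: number of negative samples divided by number of positives."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("at least one positive sample is required")
    return negatives / positives


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def summary_metrics(y_true, y_pred):
    """Compute threshold-based metrics; undefined ratios are reported as 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


# Toy labels (1 = CRC case, 0 = control), purely illustrative.
toy_true = [1, 1, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 1]
toy_weight = scale_pos_weight(toy_true)
toy_metrics = summary_metrics(toy_true, toy_pred)
```

AUC-ROC and log loss are omitted here because they require predicted probabilities rather than hard labels.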
Results: Our approach demonstrated robust performance, achieving 89.6% accuracy, 86.0% precision, 98.1% specificity, and an AUC-ROC of 84.2%. The explainability analysis identified well-known risk factors and alarm symptoms, including older age, increased body weight, abdominal pain, and anaemia, which remained strong predictors and aligned with known clinical pathways. Importantly, the model also highlighted overlooked risk indicators associated with undiagnosed CRC, including smoking, elevated plasma creatinine, coagulation abnormalities (INR), hyperglycaemia or increased HbA1c, cardiac assessment, diabetes, bleeding tendencies, and hypertension. Beyond predictive performance, the model provides clinically meaningful added value over existing screening workflows by facilitating earlier risk assessment for patients with subtle or non-specific symptoms that fall outside existing screening criteria (e.g., age).
Conclusion: Our study demonstrates the feasibility and clinical potential of combining multimodal, longitudinal EHR data with ML and NLP techniques to predict individualised CRC risk in Swedish PHC. The model identified early symptom patterns and laboratory indicators associated with undiagnosed CRC, thereby facilitating enhanced risk stratification and more prompt diagnostic assessment, and potentially complementing existing screening strategies.
References:
[1] E. Morgan et al., “Global burden of colorectal cancer in 2020 and 2040: incidence and mortality estimates from GLOBOCAN,” Gut, 2023.
[2] OECD and European Commission, EU Country Cancer Profile: Sweden 2025, 2025.
[3] N. Calanzani et al., “Recognising Colorectal Cancer in Primary Care,” Adv. Ther., 2021.
[4] W. Hamilton, A. Round, D. Sharp, and T. J. Peters, “Clinical features of colorectal cancer before diagnosis: a population-based case–control study,” Br. J. Cancer, 2005.
[5] E. Nemlander et al., “A machine learning tool for identifying non-metastatic colorectal cancer in primary care,” Eur. J. Cancer, 2023.
[6] R. Fernholm et al., “Diagnostic errors reported in primary healthcare and emergency departments: A retrospective and descriptive cohort study of 4830 reported cases of preventable harm in Sweden,” Eur. J. Gen. Pract., 2019.
[7] V. Moglia et al., “Artificial intelligence methods applied to longitudinal data from electronic health records for prediction of cancer: a scoping review,” BMC Med. Res. Methodol., 2025.
[8] D. F. Redd et al., “Identification of colorectal cancer using structured and free text clinical data,” Health Informatics J., 2022.
[9] N. Hai et al., “A practical approach for colorectal cancer diagnosis based on machine learning,” PLOS One, 2025.
[10] M. Hoogendoorn et al., “Utilizing uncoded consultation notes from electronic medical records for predictive modeling of colorectal cancer,” Artif. Intell. Med., 2016.
[11] S. Almgren, S. Pavlov, and O. Mogren, “Named Entity Recognition in Swedish Health Records with Character-Based Deep Bidirectional LSTMs,” Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining, 2016.
[12] T. Vakili et al., “Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data,” 2022.
[13] M. Ewing et al., “Increased consultation frequency in primary care, a risk marker for cancer: a case–control study,” Scand. J. Prim. Health Care, 2016.
Reference: PAGE 34 (2026) Abstr 12232 [www.page-meeting.org/?abstract=12232]
Poster: Drug/Disease Modelling - Oncology