Victoria C. Smith (1,2), Ferran Gonzalez-Hernandez (3), Palang Chotsiri (4), Thanaporn Wattanakul (4), Jose Antonio Cordero Rigol (5), Maria Rosa Ballester (5,6), Mario Duran Hortola (5), Mireia Poch Mascaró (5), Gill Mundin (7), Watjana Lilaonitkul (2), Frank Kloprogge (8), Joseph F. Standing (1)
(1) Great Ormond Street Institute of Child Health, UCL, London, UK; (2) Institute of Health Informatics, UCL, London, UK; (3) CoMPLEX, UCL, London, UK; (4) Department of Clinical Pharmacology, Mahidol Oxford Tropical Medicine Research Unit; (5) Faculty of Health Sciences, Blanquerna, University Ramon Llull, Barcelona, Spain; (6) Centro de Investigación Biomédica en Red de Salud Mental, CIBERSAM, Spain; (7) DMPK, Oncology R&D, AstraZeneca, Cambridge, UK; (8) Institute for Global Health, UCL, London, UK.
Introduction: A centralised, standardised pharmacokinetic (PK) database is a prerequisite for PK parameter predictions to benefit from the full potential of machine learning (ML). However, in vivo PK data are locked away in the scientific literature, most comprehensively expressed in tables. Despite the rapid growth in PK literature, existing manually curated databases that collate information from these studies are limited to data on a handful of drugs [1] or lack sufficient information on covariates and study design [2]. Automated information extraction (IE), exploiting natural language processing methods, provides a framework to efficiently extract information from relevant PK tables. Automated table IE can be split into subtasks: (a) table classification, (b) named entity recognition (NER) of entities within table cells, (c) entity linking to a target knowledge base, and (d) understanding relations between entities within and between table cells. This work focuses on subtasks (a) and (b).
Objectives: (1) develop labelled corpora of PK tables and of PK parameter and covariate entities within table cells; (2) develop supervised ML pipelines to perform task (a), table classification, and task (b), NER of PK entities in table cells.
Methods: For subtasks (a) and (b), papers were downloaded from the PubMed Open Access dataset in PMC-XML format. Papers containing in vivo PK parameters were selected with a PK document classifier [3]. From these papers, 2500 and 2000 tables were randomly selected for tasks (a) and (b), respectively. In both cases, the data were split into training (60%), validation (15%) and test (25%) sets. Annotation for both tasks was carried out by a team of six PK experts, with at least two annotators per sample. Inter-annotator F1 score (0 = no agreement, 1 = full agreement) was calculated and disagreements were resolved.
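Inter-annotator agreement as an F1 score can be sketched as below: one annotator's entity set is treated as the reference and the other's as the prediction (F1 is symmetric in the two sets, so the choice does not matter). The span boundaries and labels are illustrative assumptions, not the authors' exact annotation scheme.

```python
def agreement_f1(annotator_a, annotator_b):
    """F1 between two sets of (start, end, label) entity annotations.

    A pair counts as a true positive only on an exact span-and-label match.
    """
    a, b = set(annotator_a), set(annotator_b)
    if not a or not b:
        return 0.0
    tp = len(a & b)                      # exact span + label matches
    precision = tp / len(b)
    recall = tp / len(a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations of one table cell: (start, end, label) spans.
ann_a = {(0, 4, "PK_PARAM"), (5, 9, "VALUE"), (10, 12, "UNITS")}
ann_b = {(0, 4, "PK_PARAM"), (5, 9, "VALUE"), (10, 13, "UNITS")}
print(round(agreement_f1(ann_a, ann_b), 2))  # 2 of 3 spans match -> 0.67
```

In practice a corpus-level score would aggregate matches over all samples before computing precision and recall, rather than averaging per-sample F1.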
For task (a), whole tables were annotated according to the parameters and covariates they contained (Non-compartmental Parameters, Number of Subjects and Doses). An N-gram convolutional neural network (CNN) was trained to encode HTML information and perform the classification task. Performance was also compared across encoding methods, including bag-of-words and learnable table embedding matrices. Additionally, methods to counteract the severe class imbalance were explored, including under-sampling and data augmentation to create synthetic tables for minority classes.
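One plausible front end for such a classifier is sketched below: strip the table's HTML markup and emit token n-grams that a CNN or bag-of-words encoder could consume. The tag-stripping regex and n-gram size are assumptions; the abstract does not specify the exact preprocessing.

```python
import re

def html_to_ngrams(html, n=2):
    """Turn a table's HTML into lowercase token n-grams.

    Tags are replaced by spaces, so cell boundaries still separate tokens.
    """
    text = re.sub(r"<[^>]+>", " ", html)   # strip HTML tags
    tokens = text.lower().split()
    # contiguous token n-grams, e.g. bigrams for n=2
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

table = "<table><tr><td>CL/F</td><td>12.3 L/h</td></tr></table>"
print(html_to_ngrams(table))        # ['cl/f 12.3', '12.3 l/h']
print(html_to_ngrams(table, n=1))   # ['cl/f', '12.3', 'l/h']
```

Counting these n-grams gives a bag-of-n-grams vector; mapping each n-gram to a learned vector instead gives the embedding-matrix input the abstract compares against.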
For task (b), specific entities within table cells were annotated into categories (PK parameters, units, numeric values and various covariate categories, e.g. number of subjects, demographics). A spaCy NER model was trained to encode table-cell information and recognise PK parameter and covariate entities. Different tokenization strategies were compared, including whitespace, character-level, and continuous numeric or alphabetic strings. Data augmentation and early stopping were also explored to reduce overfitting.
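The three tokenization strategies could be realised roughly as follows; the exact rules the authors used are not given in the abstract, so the patterns below (in particular how decimals and symbols are handled) are assumptions.

```python
import re

def whitespace_tokens(cell):
    """Split a cell on whitespace only."""
    return cell.split()

def char_tokens(cell):
    """One token per non-space character."""
    return [c for c in cell if not c.isspace()]

def run_tokens(cell):
    """Maximal runs of digits (allowing a decimal point) or letters;
    any other non-space character becomes its own token."""
    return re.findall(r"\d+(?:\.\d+)?|[A-Za-z]+|\S", cell)

cell = "AUC0-24 123.4 h*mg/L"
print(whitespace_tokens(cell))  # ['AUC0-24', '123.4', 'h*mg/L']
print(run_tokens(cell))
```

The run-based strategy separates numeric values from parameter names and unit symbols within a single cell, which is exactly the distinction the NER classes above must draw.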
For tasks (a) and (b), the final best pipelines were applied to the unseen test sets and final F1-score, precision and recall were calculated. Model robustness was assessed by bootstrapping the test set, yielding confidence intervals for each performance metric.
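The bootstrap described above can be sketched as: resample the test set with replacement, recompute the metric on each resample, and take percentile bounds. The metric (accuracy here, for brevity), resample count and 95% level are illustrative assumptions.

```python
import random

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a paired metric."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        # resample test-set indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx],
                             [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

accuracy = lambda t, p: sum(a == b for a, b in zip(t, p)) / len(t)
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # hypothetical labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]   # hypothetical predictions
print(bootstrap_ci(y_true, y_pred, accuracy))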
Results: Inter-annotator agreement ranged from 0.91 to 0.92 across classes in the task (a) corpus and from 0.70 to 0.92 across classes in the task (b) corpus. For task (a), the best-tuned N-gram CNN with table embeddings identified the parameters and covariates contained within a table with F1-scores of 0.94 (Non-compartmental Parameters), 0.88 (Number of Subjects) and 0.89 (Doses). For task (b), the best-tuned NER model achieved a final macro F1-score of 0.91, with good performance across classes (per-class scores from 0.71 to 0.95). In both tasks, data augmentation was key to achieving optimal performance.
Conclusions: This research presents the first annotated corpora for training and evaluating PK table classification and PK table-cell NER. It also delivers two effective ML pipelines: (1) to categorise PK tables and (2) to detect relevant entities in PK table cells, both of which achieve high precision and recall on key classes. This work helps researchers efficiently filter usable PK data and provides the first two steps towards automated table IE.
References:
[1] Grzegorzewski J, et al. Nucleic Acids Res. 2021;49(D1):D1358-D1364
[2] Przybylak KR, et al. Expert Opin Drug Metab Toxicol. 2018;14(2):169-181
[3] Gonzalez-Hernandez F, et al. Wellcome Open Res. 2021;6:88
Reference: PAGE 30 (2022) Abstr 10110 [www.page-meeting.org/?abstract=10110]
Poster: Methodology – AI/Machine Learning