Ferran Gonzalez Hernandez (1,2), Simon J. Carter (3), Watjana Lilaonitkul (4,5), Juha Iso-Sipilä (6), Paul Goldsmith (7), Frank Kloprogge (8), Joseph Standing (3)
(1) Centre for Computation, Mathematics & Physics in the Life Sciences and Experimental Biology, University College London, London, UK, (2) The Alan Turing Institute, London, UK, (3) Great Ormond Street Institute of Child Health, University College London, London, UK, (4) Institute of Health Informatics, University College London, London, UK, (5) Health Data Research, London, UK, (6) BenevolentAI, London, UK, (7) Eli Lilly and Company, London, UK, (8) Institute for Global Health, University College London, London, UK
Objectives: The availability of large, high-quality ADME (Absorption, Distribution, Metabolism and Excretion) datasets is critical for accurate preclinical prediction of in vivo pharmacokinetic (PK) parameters of new compounds. However, the construction of such datasets is severely limited by the extensive and unstructured scientific literature in which PK parameters are reported, making dataset curation an extremely challenging and time-consuming task [1]. Hence, automated tools are required to accelerate in vivo ADME dataset construction and optimise drug development. This study aims to develop Natural Language Processing (NLP) algorithms that automatically identify and characterise scientific publications reporting in vivo PK parameters, and to make them available to the scientific community.
Methods: Corpus: A collection of documents was labelled as “Relevant” or “Not Relevant” depending on whether in vivo PK parameters had been estimated in them. A balanced corpus of 3,942 documents from heterogeneous sources was initially annotated. Additionally, a validation corpus of 800 documents sampled from the PubMed search “pharmacokinetics” was labelled by 3 annotators to evaluate how well the algorithm generalised to a different distribution of papers.
Classification pipeline: Textual information from the title, abstract and PubMed metadata was encoded to represent the documents. Different encoding approaches were analysed, including (1) sparse n-gram representations from Bag-of-Words (BoW) and (2) dense representations obtained by combining word embeddings from pre-trained versions of word2vec [2] and BERT [3]. Different decoders were then trained with 80% of the balanced corpus, including logistic regression, decision trees and extreme gradient boosting (XGBoost).
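To illustrate the encode-then-classify step, a minimal stdlib-only sketch follows; it uses toy labelled titles and a hand-rolled logistic regression as a stand-in for the sparse bigram BoW encoding and the decoders named above (the documents, features and hyperparameters are all hypothetical, not the study's actual setup):

```python
from collections import Counter
import math

def bigram_features(text):
    """Lowercase tokens -> sparse BoW counts of unigrams and bigrams."""
    toks = text.lower().split()
    feats = Counter(toks)
    feats.update(f"{a}_{b}" for a, b in zip(toks, toks[1:]))
    return feats

# Toy labelled corpus: 1 = reports in vivo PK parameters, 0 = not (illustrative only)
docs = [
    ("plasma clearance and half life of drug x in rats", 1),
    ("population pharmacokinetics volume of distribution estimated", 1),
    ("in vitro binding assay of receptor ligands", 0),
    ("review of regulatory guidelines for trials", 0),
]

# Fit a simple logistic-regression decoder by stochastic gradient descent
vocab = sorted({f for text, _ in docs for f in bigram_features(text)})
idx = {f: i for i, f in enumerate(vocab)}
w = [0.0] * len(vocab)
b = 0.0
lr = 0.5
for _ in range(200):
    for text, y in docs:
        feats = bigram_features(text)
        z = b + sum(w[idx[f]] * c for f, c in feats.items())
        p = 1.0 / (1.0 + math.exp(-z))
        g = p - y  # gradient of log-loss w.r.t. z
        b -= lr * g
        for f, c in feats.items():
            w[idx[f]] -= lr * g * c

def predict(text):
    """Probability that a document is 'Relevant'; unseen features are ignored."""
    feats = bigram_features(text)
    z = b + sum(w[idx[f]] * c for f, c in feats.items() if f in idx)
    return 1.0 / (1.0 + math.exp(-z))

print(predict("clearance of drug x in rats"))
```

In the actual pipeline these sparse bigram counts were concatenated with dense BERT embeddings and fed to stronger decoders such as XGBoost; the sketch only shows the shape of the workflow.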
Large-scale application: The optimal encoder-decoder architecture was applied to the validation set and all the documents resulting from the PubMed search “pharmacokinetics” (n>500,000). Finally, the BERN algorithm [4] was used to perform named-entity recognition and normalisation of drugs, species and diseases mentioned in the abstract to facilitate the filtering of relevant PK publications. A web platform is currently being developed to make the selected corpus accessible to the scientific community.
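The entity-based filtering enabled by the NER step can be sketched as follows; the document records and field names are hypothetical stand-ins for BERN's actual output format:

```python
# Hypothetical per-document annotations in the style of a NER tool's output:
# each retrieved PubMed document carries typed, normalised entity mentions.
annotated_docs = [
    {"pmid": "111", "entities": {"drug": ["midazolam"], "species": ["rat"]}},
    {"pmid": "222", "entities": {"drug": ["caffeine"], "species": ["human"]}},
    {"pmid": "333", "entities": {"drug": ["midazolam"], "species": ["human"]}},
]

def filter_docs(docs, entity_type, value):
    """Keep PMIDs of documents whose annotations include the requested entity."""
    return [d["pmid"] for d in docs if value in d["entities"].get(entity_type, [])]

print(filter_docs(annotated_docs, "species", "human"))
print(filter_docs(annotated_docs, "drug", "midazolam"))
```

Such queries are what allow users of the planned web platform to narrow the retrieved PK corpus to, e.g., human studies of a particular drug.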
Results: The optimal encoder-decoder architecture yielded an F1-score of 95.62% on the held-out 20% of the balanced corpus. This pipeline encoded documents using bigram features together with BERT embeddings and used XGBoost as the decoder. When applied to a different distribution (the validation corpus), the F1-score fell to 76.75%, driven by a drop in recall. Finally, 93,543 documents were retrieved from PubMed with an estimated precision of 91.2%.
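The F1-score used above is the harmonic mean of precision and recall, so a recall drop at stable precision pulls it down sharply. A minimal sketch of the computation (the confusion counts below are illustrative, not the study's actual figures):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: precision stays high (~0.92) while recall falls (~0.69),
# mirroring the qualitative pattern seen on the out-of-distribution validation set.
print(round(f1_score(tp=65, fp=6, fn=29), 4))
```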
Conclusions: The pipeline exhibited high precision (>90%), but lower recall was observed on the validation corpus owing to the under-representation of specific types of PK documents (e.g. animal studies) in the training corpus. A new corpus of 2,000 documents from the “pharmacokinetics” PubMed search distribution is currently being labelled to include a broader variety of PK studies. Finally, a large and rich corpus of scientific publications reporting in vivo PK parameters was automatically retrieved and characterised by the drugs, diseases and species mentioned; it will soon be released open access at http://www.pkpdai.com/. Future studies will involve the extraction of specific PK parameters from full-text articles.
References:
[1] Przybylak, K. R., et al. “Characterisation of data resources for in silico modelling: benchmark datasets for ADME properties.” Expert opinion on drug metabolism & toxicology 14.2 (2018): 169-181.
[2] Chiu, Billy, et al. “How to train good word embeddings for biomedical NLP.” Proceedings of the 15th workshop on biomedical natural language processing. 2016.
[3] Lee, Jinhyuk, et al. “BioBERT: a pre-trained biomedical language representation model for biomedical text mining.” Bioinformatics 36.4 (2020): 1234-1240.
[4] Kim, Donghyeon, et al. “A neural named entity recognition and multi-type normalization tool for biomedical text mining.” IEEE Access 7 (2019): 73729-73740.
Reference: PAGE () Abstr 9530 [www.page-meeting.org/?abstract=9530]
Poster: Methodology – AI/Machine Learning