2024 - Rome - Italy

PAGE 2024: Methodology – AI/Machine Learning
BELA PATEL

Development and benchmarking of non-generative and generative natural language processing approaches for AI-assisted pharmacometric literature curation

Wendong Ge, Sean Hayes, Akshita Chawla, Ka Lai Yee, Bela Patel, Gregory Bryman

Merck & Co., Inc., Rahway, NJ, USA

Introduction & Objectives: 

Comprehensive meta-analysis of competitor data is crucial for driving drug development decisions. These data are traditionally curated manually from the literature, an expensive and time-consuming process that yields only a static dataset. Artificial intelligence (AI) has the potential to automate some or all of the curation process. We set out to quantify the performance of state-of-the-art AI models on curation tasks, examining general-purpose large language models (LLMs), such as OpenAI’s GPT-4[1], as well as smaller language models purpose-built for biomedical literature. We characterize the strengths and weaknesses of the available methods in the context of pharmacometric literature curation and provide recommendations on how best to leverage AI/ML advancements to support this task.

Our objectives for this first stage of the research were to compare purpose-built language models [2-7], which have been the gold standard in this space, against newer, more flexible LLMs, such as GPT-4, in terms of: 1) overall performance in identifying literature matching comparator database specifications; 2) ability to extract relevant information; 3) training data requirements; and 4) generalizability across disease areas.

Methods: 

We benchmarked purpose-built language models (e.g., PubMed-BERT[6]) against LLMs (e.g., Llama2[8], GPT-4) on literature selection and knowledge extraction for the purposes of model-based meta-analysis (MBMA). We focused on a representative MBMA database curation task: identifying the therapeutic drugs tested in each study, then using the outputs to select studies for inclusion in or exclusion from an MBMA-focused dataset. A mix of public (DrugProt[9]) and proprietary datasets spanning 9 disease areas was used for model development and benchmarking. These datasets consisted of PubMed abstracts human-annotated for key outputs (e.g., drug treatment, disease, trial treatment arms, clinical endpoints). We quantified each model's ability to curate correct drug names using standard performance metrics relative to ground truth: precision, recall, and F1 score (the harmonic mean of precision and recall, which is the metric reported below). We systematically quantified the sensitivity of performance to the amount of labeled training data and to other methodological details, such as prompt structure.
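To make the evaluation concrete, below is a minimal Python sketch of scoring one abstract's extracted drug names against the human annotation; the set-based comparison and lowercase normalization are illustrative assumptions rather than the study's exact evaluation code.

    def score_abstract(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
        # Normalize drug names so trivial casing/whitespace differences do not
        # count as errors (an assumed preprocessing step, for illustration only).
        predicted = {name.strip().lower() for name in predicted}
        gold = {name.strip().lower() for name in gold}
        tp = len(predicted & gold)  # correctly extracted names
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Example: the model finds two of three annotated drugs plus one spurious name.
    print(score_abstract({"semaglutide", "metformin", "aspirin"},
                         {"semaglutide", "metformin", "placebo"}))
    # -> approximately (0.667, 0.667, 0.667)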

Results: 

We found that purpose-built language models outperformed far larger and more computationally expensive state-of-the-art LLMs when the latter were used in a typical “few-shot learning” approach (no fine-tuning; instead, a few examples or “shots” are provided in the LLM prompt). PubMed-BERT achieved the best performance (0.90 F1 score), while the best-performing LLM under few-shot learning, GPT-4, performed significantly worse (0.62 F1 score). LLM performance was highly sensitive to task framing and to prompt structure and content: the best performance used 10 training examples engineered to improve model inference. Under the worst-case task structure, general-purpose LLMs often produced meaningless output. For somewhat smaller, open-source LLMs, the model weights can be partially refit to task-specific training data (‘fine-tuning’), which significantly improved performance (0.82 F1 score using Llama2). We varied the training data size (n = 1 to 2750) and found prompt engineering with GPT-4 to be the best choice when fewer than 50 training examples were available, while fine-tuning PubMed-BERT or Llama2 became advantageous with 100 or more training examples. When benchmarking end-to-end performance for curating abstracts into an MBMA dataset based on drugs of interest, performance ranged from 0.91 F1 down to 0.52 depending on the disease area, suggesting immediate applicability in some disease areas and a need for further refinement in others.
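As an illustration of the few-shot approach described above, the Python sketch below assembles a prompt containing worked examples (“shots”) and submits it through the OpenAI chat API; the system instruction, example abstracts, and output format are assumptions for illustration, not the engineered prompt used in the study.

    from openai import OpenAI

    # Hypothetical in-prompt examples; the study's best configuration used
    # 10 engineered examples.
    FEW_SHOT_EXAMPLES = [
        ("Abstract: Once-weekly semaglutide reduced HbA1c versus placebo ...",
         "semaglutide; placebo"),
        ("Abstract: Adjuvant pembrolizumab improved event-free survival ...",
         "pembrolizumab"),
    ]

    def build_messages(abstract: str) -> list[dict]:
        messages = [{"role": "system",
                     "content": "Extract the therapeutic drugs tested in the study. "
                                "Answer with a semicolon-separated list of drug names only."}]
        for text, answer in FEW_SHOT_EXAMPLES:  # each example is one "shot"
            messages.append({"role": "user", "content": text})
            messages.append({"role": "assistant", "content": answer})
        messages.append({"role": "user", "content": f"Abstract: {abstract}"})
        return messages

    client = OpenAI()  # requires OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=build_messages("Tirzepatide versus insulin glargine in type 2 diabetes ..."),
    )
    print(response.choices[0].message.content)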

Conclusions: 

While leading LLM approaches produce impressive results when training data is limited, they are still outperformed in many cases by purpose-built language models on information extraction tasks relevant to MBMA. Especially when the larger computational cost of LLMs is considered, lighter-weight purpose-built language models focused on biomedical literature are an important first option for literature curation supporting pharmacometric analysis. Our results also suggest that developing specialized LLMs by fine-tuning on historical comparator database examples may offer the best path forward and is likely to be an important avenue for future research.
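For readers interested in the fine-tuning route, the sketch below outlines one conventional setup: fine-tuning a PubMed-BERT-style encoder for drug-name extraction framed as BIO token classification with the Hugging Face Transformers library. The checkpoint name, toy dataset, and hyperparameters are assumptions; an actual run would use the thousands of annotated abstracts described in the Methods.

    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              DataCollatorForTokenClassification,
                              TrainingArguments, Trainer)

    CHECKPOINT = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed checkpoint
    LABELS = ["O", "B-DRUG", "I-DRUG"]  # BIO tags for drug mentions

    # Toy stand-in for the human-annotated PubMed abstracts.
    raw = Dataset.from_dict({
        "tokens": [["Patients", "received", "semaglutide", "or", "placebo", "."]],
        "ner_tags": [[0, 0, 1, 0, 0, 0]],
    })

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

    def tokenize_and_align(example):
        # Align word-level BIO tags with subword tokens: label only the first
        # subword of each word and mask the rest with -100 (ignored by the loss).
        enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
        labels, prev = [], None
        for word_id in enc.word_ids():
            labels.append(-100 if word_id is None or word_id == prev
                          else example["ner_tags"][word_id])
            prev = word_id
        enc["labels"] = labels
        return enc

    train_ds = raw.map(tokenize_and_align, remove_columns=raw.column_names)

    model = AutoModelForTokenClassification.from_pretrained(CHECKPOINT, num_labels=len(LABELS))
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="drug-ner", num_train_epochs=3,
                               learning_rate=3e-5, per_device_train_batch_size=16),
        train_dataset=train_ds,
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )
    trainer.train()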



References:
[1] Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, Avila R. GPT-4 technical report. arXiv preprint arXiv:2303.08774. 2023 Mar 15.
[2] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018 Oct 11.
[3] Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. 2019 Jul 26.
[4] Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, McDermott M. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323. 2019 Apr 6.
[5] Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020 Feb 15;36(4):1234-40.
[6] Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, Naumann T, Gao J, Poon H. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH). 2021 Oct 15;3(1):1-23.
[7] Zhang S, Cheng H, Vashishth S, Wong C, Xiao J, Liu X, Naumann T, Gao J, Poon H. Knowledge-rich self-supervision for biomedical entity linking. arXiv preprint arXiv:2112.07887. 2021 Dec 15.
[8] Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, Batra S, Bhargava P, Bhosale S, Bikel D. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 2023.
[9] Miranda A, Mehryary F, Luoma J, Pyysalo S, Valencia A, Krallinger M. Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations. In Proceedings of the seventh BioCreative challenge evaluation workshop. 2021 Nov (pp. 11-21).


Reference: PAGE 32 (2024) Abstr 10993 [www.page-meeting.org/?abstract=10993]
Poster: Methodology – AI/Machine Learning