Abstract
The volume of biomedical literature is growing rapidly, and Natural Language Processing (NLP) research on clinical documents is therefore gaining significance. The automated analysis of biomedical literature has stimulated the development of numerous techniques for automatic Named Entity Recognition (NER) and document classification. Despite the many techniques available for classifying biomedical entity sentences, however, only a few types of entities can be recognized reliably.
The aim of this study is to apply a state-of-the-art Named Entity Recognition technique, Bidirectional Encoder Representations from Transformers (BERT), to recognize and extract Disease, Gene, SNP and Chemical entities from biomedical texts. BERT was chosen because it is the most widespread neural network architecture for training language models and has led to considerable improvements across a variety of NLP tasks.
In general, the more parameters a BERT model has, the better the results it obtains. However, because memory consumption increases with model size, the lighter BERT variant, DistilBERT, was applied instead. This technique was evaluated on two NER tasks for each entity.
In outline, hundreds of biomedical papers in XML format were parsed, split into sentences, and classified and labeled accordingly to create the different datasets. These datasets were then passed through the BERT model to recognize sentences that do or do not contain the aforementioned entities.
The results showed that, with appropriate pre-training of the BERT model, strong recognition performance can be achieved without extensive fine-tuning or optimization, outperforming previous models on the biomedical NER text-mining task. Nevertheless, there remains ample room for further tuning, future work and new challenges.