Abstract
The volume of biomedical literature is growing rapidly, and Natural Language Processing (NLP) research on clinical documents is therefore gaining significance. The automated analysis of biomedical literature has stimulated the development of numerous techniques for automatic Named Entity Recognition (NER) and document classification. Despite the many techniques available for classifying biomedical entity sentences, however, only a few types of entities can be recognized reliably.
The aim of this study is to apply a state-of-the-art Named Entity Recognition technique, Bidirectional Encoder Representations from Transformers (BERT), to recognize and extract Disease, Gene, SNP and Chemical entities from biomedical texts. BERT was chosen because it is the most widespread neural network architecture for training language models and has led to considerable improvements across a variety of NLP tasks.
In general, the more parameters a BERT model has, the better the results it obtains. However, because memory consumption increases with model size, the lighter BERT variant, DistilBERT, was applied instead. This technique was evaluated on two NER tasks for each entity.
In outline, hundreds of biomedical papers in XML format were parsed, split into sentences, and classified and labeled accordingly to create the different datasets. These datasets were then passed through the BERT model to recognize sentences that do or do not contain the aforementioned entities.
The results showed that, with appropriate pre-training of the BERT model, strong recognition performance can be achieved without extensive fine-tuning or optimization, outperforming previous models on the biomedical NER text-mining task. Nevertheless, there remains ample room for further tuning, future work and new challenges.