Abstract
Natural language processing (NLP) is the branch of computer science focused
on developing systems that allow computers to communicate with people using
everyday language. Many NLP techniques, including stemming, part-of-speech
tagging, named entity recognition, compound recognition, de-compounding,
chunking, word sense disambiguation and others, have been used for information
extraction. In many cases, semantic information is used to expand
knowledge about documents and to improve performance.
There is growing interest in NLP strategies applied to clinical texts
owing to the increasing number of electronic documents in hospital information
systems. Biomedical text mining is a research field on the edge of natural language
processing and refers to text mining applied to clinical text or to the
literature of the biomedical domain.
In this work, we present a methodology which combines unsupervised word
classes with supervised machine learning methods in order to contribute to
named entity recognition in clinical reports. Named entity recognition is
generally performed using knowledge-based semantic resources. We present an
approach where data-driven word classes are evaluated and compared with
knowledge-based semantic classes when inserted as features in a Conditional
Random Field (CRF) classifier. We examine different methods of combining
data-driven word classes with knowledge-based semantic classes to improve
named entity recognition. Data-driven semantic classes achieve results that
differ only slightly from those of knowledge-based semantic classes. Our case
study concludes that data-driven word classes can add important information
and are complementary to knowledge-based semantic classes.
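
As a rough illustration of the idea of inserting both kinds of classes as
CRF features, the following minimal sketch builds per-token feature
dictionaries in the style commonly accepted by CRF toolkits. The lookup
tables, cluster identifiers, class labels and function names are all
hypothetical and chosen for illustration only; they are not the resources or
the feature set used in this work.

```python
from typing import Dict, List

# Hypothetical data-driven word classes (e.g. cluster IDs learned from
# unlabeled text) and hypothetical knowledge-based semantic classes
# (e.g. looked up in a terminology). Both tables are invented examples.
WORD_CLUSTERS: Dict[str, str] = {"fever": "c17", "cough": "c17", "aspirin": "c42"}
SEMANTIC_CLASSES: Dict[str, str] = {"fever": "SYMPTOM", "aspirin": "DRUG"}


def token_features(tokens: List[str], i: int) -> Dict[str, str]:
    """Build the feature dictionary for the token at position i."""
    word = tokens[i].lower()
    cluster = WORD_CLUSTERS.get(word, "NONE")
    semclass = SEMANTIC_CLASSES.get(word, "NONE")
    feats = {
        "word": word,
        "cluster": cluster,      # data-driven word class
        "semclass": semclass,    # knowledge-based semantic class
        # One possible way to combine the two sources: a joint feature.
        "cluster+semclass": cluster + "|" + semclass,
    }
    if i > 0:
        # Word class of the previous token as simple context.
        feats["prev_cluster"] = WORD_CLUSTERS.get(tokens[i - 1].lower(), "NONE")
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, str]]:
    """Feature dictionaries for every token in a sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

In this sketch a word such as "cough", which is missing from the hypothetical
knowledge-based table, still receives a data-driven class, which is one way
the two sources can complement each other.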