Abstract
Named Entity Extraction (NEE) is the process of identifying entities in texts and, very commonly, linking them to related (Web) resources. This task is useful in several applications, e.g. question answering, document annotation, and processing of search results.
However, it is quite common for an entity name to correspond to more than one semantic category; e.g., Argentina may refer either to the fish species Argentina or to the country Argentina. This is the well-known Named Entity Disambiguation (NED) problem.
In addition, existing NEE and NED tools lack an open or easy configuration, although this is very important for building domain-specific applications. For example, supporting a new category of entities, or specifying how to link the detected entities with online resources, is either impossible or very laborious.
In this thesis we show how we can exploit semantic information (Linked Data) in real time for configuring a NEE system and disambiguating the mined entities. We introduce an RDF/S vocabulary, called Open NEE Configuration Model, which allows a NEE service to describe (and publish as Linked Data) its entity mining capabilities, but also to be dynamically configured.
We present X-Link, a NEE framework that realizes this model and, contrary to existing tools, allows the user to easily define the categories of entities that are of interest for the application at hand (by exploiting Linked Data).
Then we focus on the problem of NED in this context, i.e. on the problem of selecting the right category for each extracted entity. To this end, we introduce three methods, each approaching the problem from a different perspective.
The first method is based exclusively on the NEE results and selects as the most probable category the one with the highest occurrence frequency. The second method goes a step further and exploits the semantic relations between the mined entities, using their semantic resources, and returns the semantic resource that is closest to the others in the semantic graph. The last method uses machine learning algorithms to classify the entire document into a specific category based on a training set.
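As an illustrative sketch (not the thesis implementation), the first method's selection rule can be expressed as a simple frequency count over the candidate categories that a NEE step might return for an ambiguous name; the input format and names here are hypothetical:

```python
from collections import Counter

def most_frequent_category(candidate_categories):
    """Select the category that occurs most often among the
    candidate categories proposed for an ambiguous entity name.
    (Hypothetical input: a flat list of category labels.)"""
    counts = Counter(candidate_categories)
    category, _ = counts.most_common(1)[0]
    return category

# Hypothetical NEE output for the ambiguous name "Argentina":
candidates = ["Country", "Country", "Fish Species", "Country"]
print(most_frequent_category(candidates))  # → Country
```

The second and third methods would replace this counting step with, respectively, a distance computation over the semantic graph and a trained document classifier.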
Then we report the results of a thorough comparative experimental evaluation using search results from the Bing search engine. We evaluated the introduced methods over collections of documents of different sizes, measuring the achieved precision and the time required for disambiguation.
The results allowed us to identify the strong and weak aspects of each method. Overall, the third method works well in most cases, apart from small snippets, e.g. tweets, where it achieves almost the same precision as the second method.