Abstract
Linked Data is a method for publishing structured data that allows the data to be interlinked (by using URIs instead of simple values), thereby assisting their integration.
A large number of such datasets, hereafter sources, has already been published according to the principles of Linked Data, and their number and size keep increasing.
However, it is currently not evident how connected these datasets are. In particular, it is difficult (a) to obtain complete information about one particular URI (or a set of URIs), (b) to discover datasets that are relevant to a given one, and (c) to compute and visualize the degree of connectivity between two or more datasets.
All the aforementioned tasks are important for the integration process in an open and evolving environment. To alleviate this problem, in this thesis we introduce metrics, indexes and algorithms that allow the computation and quantification of connectivity among several datasets.
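To illustrate the kind of quantification meant here (the precise metrics are defined in the thesis; the dataset names and URIs below are invented for the example), the connectivity of a pair of datasets can be expressed as the number and percentage of URIs they share:

```python
# Hypothetical sketch: quantify pairwise connectivity as the number
# of common URIs and their Jaccard similarity. All names are made up.

def common_uris(d1: set, d2: set) -> set:
    """URIs appearing in both datasets."""
    return d1 & d2

def jaccard(d1: set, d2: set) -> float:
    """Share of common URIs over all distinct URIs of the two datasets."""
    union = d1 | d2
    return len(d1 & d2) / len(union) if union else 0.0

fishbase = {"http://ex.org/thunnus", "http://ex.org/sardina"}
worms    = {"http://ex.org/thunnus", "http://ex.org/gadus"}

print(len(common_uris(fishbase, worms)))   # 1 common URI
print(round(jaccard(fishbase, worms), 2))  # 0.33
```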
To achieve scalability, we propose (i) a namespace-based prefix index, (ii) a sameAs catalog for computing the symmetric and transitive closure of the sameAs relationships encountered in the datasets, (iii) a semantics-aware element index (that exploits the aforementioned indexes), (iv) a lattice of the common elements of any set of datasets, and (v) two lattice-based incremental algorithms for speeding up the computation of the lattice.
We apply and evaluate the proposed approach in the context of a real and operational semantic warehouse containing information about the marine domain (where the metrics are used for assessing the quality of the semantic warehouse and its underlying sources, and for monitoring the quality of the semantic warehouse after a reconstruction), as well as for three hundred LOD cloud datasets.
We report measurements that have not been carried out in the past (such as the number of common URIs among three or more datasets and the frequency of prefixes), we offer novel services (such as finding equivalent URIs and finding the most relevant datasets for a given dataset), and we discuss the speedup obtained by the proposed indexes and algorithms. Finally, we propose an extension of the VoID ontology for publishing, sharing and exploiting such measurements.