Abstract |
The Semantic Web (SW) is an evolving extension of the World Wide Web in which
the content can be expressed not only in natural language, but also in formal lan-
guages (e.g. RDF/S) that can be read and used by software agents, permitting
them to find, share and integrate information more easily. The statement of Hera-
clitus ”Everything flows, nothing stands still” holds also in the context of the SW
since everything changes (the resources themselves, the ontologies, the resource
descriptions, etc). Consequently, the ability to compute the differences that exist
between two RDF/S Knowledge Bases (KBs), hereafter Delta, is very important.
In particular, Deltas can be employed to (a) aid humans understand the evolution
of knowledge, and (b) reduce the amount of data that need to be exchanged and
managed over the network in order to build SW synchronization, versioning and
replication services.
The comparison problem is not simple because RDF allows anonymous nodes.
A anonymous node, else called blank node, is a node in an RDF graph which is
not identified by a URI and is not a literal. Several RDF KBs rely heavily on
blank nodes (e.g. 7.5% of Linked Data are estimated to be blank nodes) as they
are convenient for representing complex attributes or resources whose identity is
unknown but their attributes (either literals or associations with other resources)
are known. Considering blank nodes as ”constants” unique to both graphs does not
help either in detecting equivalence between graphs nor in reducing the Delta. On
the contrary, matching the blank nodes of the two graphs can significantly reduce
the produced delta. This work is the first that focuses on methods of matching
the blank nodes of the one graph with the blank nodes of the other graph, by
approaching it as an optimization problem aiming at finding the mapping that
yields the minimum in size Delta (with the least number of triples to delete or add
to make the graphs equivalent). We prove that finding the optimal blank node
mapping is NP-Hard in the general case by reducing to the sub-graph isomorphism
problem. When graphs do not contain directly connected blank nodes (no triples
with more than one blank nodes exists), we show that the polynomial Hungarian
algorithm can be used to find the optimal blank node mapping. For the general
case we present various polynomial algorithms returning approximate solutions.
For making the application of our method feasible also to very large KBs we
present a signature-based mapping algorithm with NlogN time complexity.
Finally, for the proposed algorithms we report extensive comparative experi-
mental results, over real and synthetic KBs, regarding delta reduction (and its de-
viation from the optimal), equivalence detection, and computational requirements.
The results are very interesting. Indicatively the signature-based algorithm can
match KBs with up to 150,000 bnodes in a few seconds with Y % deviation from
the optimal.
|