E-Locus - Institutional Repository of the University of Crete

School of Sciences and Engineering, Department of Computer Science

Doctoral theses

Identifier

000414273

Title

Entity Resolution in the Web of Data

Alternative Title

Ανάλυση οντοτήτων στον παγκόσμιο ιστό των δεδομένων (Entity resolution in the World Wide Web of data)

Author

Ευθυμίου, Βασίλειος Παναγιώτης

Thesis advisor

Χριστοφίδης, Βασίλειος

Reviewer

Πλεξουσάκης, Δημήτρης
Τζίτζικας, Γιάννης
Βελεγράκης, Γιάννης

Abstract

Entity resolution (ER) is the problem of identifying descriptions of the same real-world entities among or within knowledge bases (KBs). In this PhD thesis, we study the problem of ER in the Web of data, in which entities are described using graph-structured RDF data, following the principles of the Linked Data paradigm. The two core ER problems are: (a) how to effectively compute the similarity of Web entities, and (b) how to efficiently resolve sets of entities within or across KBs. Compared to the deduplication of entities described by tabular data, the new challenges for these problems stem from the Variety (i.e., multiple entity types and cross-domain descriptions), the Volume (i.e., thousands of Web KBs with billions of facts, hosting millions of entity descriptions) and the Veracity (i.e., various forms of inconsistencies and errors) of entity descriptions published in the Web of data.
At the core of an ER task lies the process of deciding whether a given pair of descriptions refers to the same real-world entity, i.e., whether they match (problem a). The matching decision typically depends on an assessment of the similarity of the two descriptions, based on their content or on their neighborhood descriptions (i.e., descriptions of related entity types). This process is usually iterative: matches found in one iteration help the decisions of the next iteration via similarity propagation, until no more matches are found. The number of iterations needed to converge depends on the size and the complexity of the resolved entity collections.
Moreover, pairwise entity matching is by nature quadratic in the number of entity descriptions, and thus prohibitive at Web scale (problem b). In this respect, blocking aims to discard as many comparisons as possible without missing matches. It places entity descriptions into overlapping or disjoint blocks, leaving to the matching phase only the comparisons between descriptions belonging to the same block.
Since overlapping blocks may suggest the same comparison several times, overlapping blocking methods are accompanied by Meta-blocking filtering techniques, which aim to discard comparisons suggested by blocking that are either repeated (i.e., suggested by different blocks) or unnecessary (i.e., unlikely to result in matches) due to the noise in entity descriptions.
To address ER at Web scale, we need to relax a number of assumptions underlying several methods and techniques proposed by the database, machine learning and Semantic Web communities. Overall, the Big Data characteristics of entity descriptions in the Web of data call for novel ER frameworks supporting: (i) near similarity (identify matches with low similarity in their content), (ii) schema-free operation (do not rely on a given set of attributes used by all descriptions), (iii) no human in the loop (do not rely on domain experts for training data, aligned relations or matching rules), (iv) non-iterative matching (avoid data-convergence methods that require several iteration steps), and (v) scalability to very large volumes of entity collections (a massively parallel architecture is needed).
To satisfy the requirements of Web-scale ER, we introduce the MinoanER framework. Our framework exploits new similarity metrics for assessing matching evidence based on both the content and the neighbors of entities, without requiring knowledge or alignment of the entity types. These metrics allow for a compact representation of similarity evidence that can be obtained from different blocking schemes on the names and values of the descriptions, but also on the values of their entity neighbors. This enables the identification of nearly similar matches as early as the blocking step. This composite blocking, accompanied by a novel composite Meta-blocking that captures the similarity evidence from the different types of blocks, sets the ground for non-iterative matching.
The matching algorithm, built on a massively parallel architecture, is equipped with computationally cheap heuristics to detect matches in a fixed number of steps. The main contribution of MinoanER is that, compared to state-of-the-art ER tools, it achieves at least equivalent results over homogeneous KBs (stemming from common data sources, thus exhibiting strongly similar matches) and significantly better results over heterogeneous KBs (stemming from different sources, thus exhibiting many nearly similar matches), without requiring any domain-specific knowledge, in a non-iterative and highly efficient way.
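The general idea of blocking followed by Meta-blocking, as described in the abstract, can be illustrated with a minimal Python sketch. It is not the MinoanER implementation: the token-based blocking key, the `min_common_blocks` threshold, the function names and the toy descriptions are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(descriptions):
    """Schema-free blocking: one block per token found in any value."""
    blocks = defaultdict(set)
    for entity_id, values in descriptions.items():
        for value in values:
            for token in value.lower().split():
                blocks[token].add(entity_id)
    # A block holding a single description suggests no comparison.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def meta_blocking(blocks, min_common_blocks=2):
    """Keep each suggested pair once and prune pairs sharing few blocks.

    The threshold is a hypothetical pruning rule, not the thesis's method.
    """
    common = defaultdict(int)
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            common[pair] += 1  # repeated suggestions collapse into a count
    return {pair for pair, count in common.items() if count >= min_common_blocks}


# Toy entity descriptions (invented data): attribute names are ignored,
# only the values are used, so no common schema is assumed.
descriptions = {
    "kb1:e1": ["Heraklion Crete", "city Greece"],
    "kb2:e7": ["Heraklion", "city in Crete"],
    "kb1:e2": ["Athens", "capital of Greece"],
    "kb2:e9": ["Athens capital", "Greece"],
}

blocks = token_blocking(descriptions)
candidates = meta_blocking(blocks)
all_pairs = len(descriptions) * (len(descriptions) - 1) // 2

print(f"{len(candidates)} candidate pairs instead of {all_pairs}")
for pair in sorted(candidates):
    print(pair)
```

In this toy setting only the pairs that share at least two blocks reach the matching phase; a real system would still have to decide whether each remaining pair actually matches.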

Language

English

Subject

Blocking, Meta-blocking

Entity Resolution

Linked Data

MinoanER

Διασυνδεδεμένα δεδομένα

Μετα-συσταδοποίηση

Συσταδοποίηση

Issue date

2018-03-23