E-Locus - Institutional Repository of the University of Crete - Exploring importance measures for summarization on graph databases

Identifier

000406891

Title

Exploring importance measures for summarization on graph databases

Alternative Title

Εξερεύνηση μέτρων σημαντικότητας για δημιουργία συνόψεων σε βάσεις δεδομένων γράφων (Exploring importance measures for creating summaries in graph databases)

Author

Παππάς, Αλέξανδρος Σ.

Thesis advisor

Πλεξουσάκης, Δημήτρης

Reviewer

Γεωργακόπουλος, Γεώργιος
Φλουρής, Γεώργιος

Abstract

The real world is richly interconnected. The natural properties of graphs therefore make them extremely useful for modeling the real world, understanding a wide diversity of datasets, and offering applied solutions in different fields of industry. A graph database is an online, operational database management system with Create, Read, Update, and Delete (CRUD) methods that expose a graph data model. As an alternative to traditional relational databases, graph databases are designed and optimized predominantly for graph workloads, traversal performance, and the execution of graph algorithms on complex hierarchical structures. Given the explosive growth in the size and complexity of the Data Web, it is estimated that by the end of 2018, 70% of leading organizations will have one or more efforts utilizing graph databases. Triple stores are a subcategory of graph databases, modeled around the Resource Description Framework (RDF) specifications and designed as labeled, directed multigraphs. Consequently, there is now, more than ever, an increasing need to develop methods and tools that facilitate the understanding and exploration of RDF/S Knowledge Bases (KBs). Given that the human brain can interpret at most a few hundred nodes in one chart, it becomes obvious that current data sizes and schema complexity are far beyond the exploration capability that any automated layout can provide. Summarization approaches try to produce an abridged version of the original data source, highlighting the most representative concepts. The central questions in summarization are how to identify the most important nodes and how to link them in order to produce a valid sub-schema graph. In this thesis, we try to answer the first question by revisiting several measures, covering a wide range of alternatives for selecting the most important nodes, and adapting them for RDF/S KBs.
We then model the problem of linking those nodes as a graph Steiner-Tree problem (GSTP). Since the GSTP is NP-complete, we explore three approximations (SDIST, CHINS and HEUM), employing heuristics to speed up the execution of the respective algorithms. Our detailed experiments show the added value of our approach, since a) our adaptations outperform current state-of-the-art measures for selecting the most important nodes and b) the constructed summary is of better quality in terms of the additional nodes introduced into the generated summary, as the GSTP approximations outperform past approaches.
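
The two steps described in the abstract (select the most important nodes, then link them into a connected sub-graph) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: node degree stands in for an importance measure, the schema is treated as an undirected, unweighted graph, and the linking step is a generic greedy shortest-path heuristic for the Steiner-Tree problem, not necessarily any of SDIST, CHINS or HEUM. The function names and the toy graph are hypothetical.

```python
from collections import deque


def top_k_by_degree(graph, k):
    """Rank nodes by degree (a stand-in importance measure, assumed here)
    and keep the k highest; ties are broken by node name."""
    return sorted(graph, key=lambda n: (-len(graph[n]), n))[:k]


def bfs_path_to_any(graph, sources, targets):
    """Return a shortest path (in hops) from any node in `sources`
    to any node in `targets`, or None if no target is reachable."""
    parent = {s: None for s in sources}
    queue = deque(sorted(sources))
    while queue:
        node = queue.popleft()
        if node in targets:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


def link_terminals(graph, terminals):
    """Greedy shortest-path heuristic: start from one terminal and
    repeatedly attach the nearest remaining terminal to the tree."""
    terminals = list(terminals)
    tree_nodes = {terminals[0]}
    tree_edges = set()
    remaining = set(terminals[1:])
    while remaining:
        path = bfs_path_to_any(graph, tree_nodes, remaining)
        if path is None:
            raise ValueError("terminals are not connected")
        for a, b in zip(path, path[1:]):
            tree_edges.add(frozenset((a, b)))
        tree_nodes.update(path)
        remaining -= set(path)
    return tree_nodes, tree_edges


# Hypothetical toy schema graph as an undirected adjacency map.
graph = {
    "A": {"B"},
    "B": {"A", "C", "F"},
    "C": {"B", "D"},
    "D": {"C", "E", "G"},
    "E": {"D"},
    "F": {"B"},
    "G": {"D"},
}

important = top_k_by_degree(graph, 2)
summary_nodes, summary_edges = link_terminals(graph, important)
extra_nodes = summary_nodes - set(important)
```

In this toy graph the two selected nodes are not adjacent, so the linking step has to introduce one intermediate node to connect them. The number of such additional nodes is the kind of quantity the abstract refers to when it compares summary quality.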

Language

English

Issue date

2017-03-17

Collection

School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses

Type of Work--Post-graduate theses
