E-Locus - Institutional Repository of the University of Crete - Entity-based Summarization of Web Search Results using MapReduce

Identifier

000381378

Title

Entity-based Summarization of Web Search Results using MapReduce

Alternative Title

Οντοκεντρική σύνοψη αποτελεσμάτων μηχανών αναζήτησης με τη χρήση MapReduce (English: Entity-centric Summarization of Search Engine Results using MapReduce)

Author

Κίτσος, Ιωάννης Γ.

Thesis advisor

Τζίτζικας, Ιωάννης

Abstract

Although Web search engines index and provide access to huge numbers of documents, user queries typically return only a linear list of hits. While this is often satisfactory for focused search, it does not support exploration or deeper analysis of the results. One way to provide advanced exploration facilities, while also exploiting the structured (and semantic) data that are now available, is to enrich the search process with entity mining over the full contents of the search results, where the entities of interest can be specified by semantic sources. Such services give users an initial overview of the information space, allowing them to restrict it gradually until they locate the desired hits, even if those hits are low ranked. In this thesis we consider a general scenario of providing such services as meta-services (that is, layered over systems that support keyword search) without a priori indexing of the underlying document collection(s). To make such services feasible for large amounts of data, we use the MapReduce distributed computation model on a Cloud infrastructure (Amazon EC2). Specifically, we show how the required computational tasks can be factorized and expressed as MapReduce functions, and we introduce two different evaluation procedures: Single-Job (SJ) and Chain-Job (CJ). Moreover, we specify criteria that determine the selection and ranking of the (often numerous) discovered entities. We then report experimental results on the achieved speedup in various settings. We show that with the SJ procedure the achieved speedup is close to the theoretically optimal speedup (2.5% to 19.7% lower than the optimal for a 300MB dataset, for 2 up to 8 Amazon EC2 VMs respectively) and justify this difference.
Indicatively, we achieve a speedup of up to x6.4 on 8 EC2 VMs when analyzing 4,365 hits (corresponding to 300MB) with a total runtime of less than 7 minutes (an infeasible task on a single machine due to high computational and memory requirements). CJ exhibits somewhat lower scalability than SJ (x5.66 on 8 EC2 VMs) and a longer total runtime (about 30 seconds more for a 300MB dataset) due to the overhead of using two MapReduce jobs rather than one. On the other hand, CJ offers the qualitative benefit of providing a quick preview of the results of the analysis. Another important contribution of this thesis is a thorough evaluation of platform configuration and tuning, an aspect that is often disregarded or inadequately addressed in prior work but is crucial for the efficient utilization of resources. We show that the proposed evaluation methods utilize resources well (fully utilized CPU, efficient memory allocation) and that the tasks do not incur unreasonably high overhead (e.g. garbage collection, unnecessary startup/teardown of JVMs during task initialization and termination, imbalance in last-task execution times).
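
The speedup figures quoted in the abstract can be cross-checked with a short calculation. The sketch below assumes that "theoretically optimal speedup" means linear speedup, i.e. a speedup equal to the number of VMs; the abstract does not state this explicitly. The function names are illustrative and do not come from the thesis.

```python
def speedup_from_gap(n_vms: int, pct_below_optimal: float) -> float:
    """Speedup implied by falling pct_below_optimal percent short of the
    ideal speedup, ASSUMING the ideal is linear (equal to n_vms)."""
    return n_vms * (1.0 - pct_below_optimal / 100.0)


def efficiency(speedup: float, n_vms: int) -> float:
    """Parallel efficiency: fraction of the assumed linear ideal achieved."""
    return speedup / n_vms


# Single-Job (SJ): 2.5% below optimal on 2 VMs, 19.7% below optimal on 8 VMs.
sj_2 = speedup_from_gap(2, 2.5)
sj_8 = speedup_from_gap(8, 19.7)

# Chain-Job (CJ): reported speedup of x5.66 on 8 VMs.
cj_8_efficiency = efficiency(5.66, 8)

print(f"SJ, 2 VMs: x{sj_2:.2f}")
print(f"SJ, 8 VMs: x{sj_8:.2f}")
print(f"CJ, 8 VMs: efficiency {cj_8_efficiency:.1%}")
```

Under that assumption, the 19.7% gap on 8 VMs corresponds to a speedup of about x6.4, which agrees with the figure reported for SJ, and the CJ figure of x5.66 corresponds to roughly 71% of the linear ideal.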

Language

English

Issue date

2013-11-15

Collection

School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses

Type of Work--Post-graduate theses
