
Identifier 000381378
Title Entity-based Summarization of Web Search Results using MapReduce
Alternative Title Οντοκεντρική σύνοψη αποτελεσμάτων μηχανών αναζήτησης με τη χρήση MapReduce
Author Κίτσος, Ιωάννης Γ.
Thesis advisor Τζίτζικας, Ιωάννης
Abstract Although Web search engines index and provide access to huge numbers of documents, user queries typically return only a linear list of hits. While this is often satisfactory for focused search, it does not support exploration or deeper analysis of the results. One way to provide advanced exploration facilities, while also exploiting the structured (and semantic) data that are now available, is to enrich the search process with entity mining over the full contents of the search results, where the entities of interest can be specified by semantic sources. Such services provide users with an initial overview of the information space, allowing them to gradually restrict it until they locate the desired hits, even if those hits are low ranked. In this thesis we consider a general scenario of providing such services as meta-services (that is, layered over systems that support keyword search) without a priori indexing of the underlying document collection(s). To make such services feasible for large amounts of data, we use the MapReduce distributed computation model on a Cloud infrastructure (Amazon EC2). Specifically, we show how the required computational tasks can be factorized and expressed as MapReduce functions, and we introduce two different evaluation procedures: the Single-Job (SJ) and the Chain-Job (CJ). Moreover, we specify criteria that determine the selection and ranking of the (often numerous) discovered entities. We then report experimental results on the speedup achieved in various settings. We show that with the SJ procedure the achieved speedup is close to the theoretically optimal speedup (2.5% to 19.7% lower than the optimal for a 300MB dataset, for 2 up to 8 Amazon EC2 VMs respectively), and we justify this difference.
Indicatively, we achieve a speedup of up to 6.4x on 8 EC2 VMs when analyzing 4,365 hits (corresponding to 300MB) with a total runtime of less than 7 minutes (an infeasible task on a single machine due to high computational and memory requirements). CJ exhibits somewhat lower scalability than SJ (5.66x on 8 EC2 VMs) and a longer total runtime (about 30 seconds more for a 300MB dataset) due to the overhead of using two MapReduce jobs rather than one. On the other hand, CJ offers the qualitative benefit of providing a quick preview of the results of the analysis. Another important contribution of this thesis is a thorough evaluation of platform configuration and tuning, an aspect that is often disregarded or inadequately addressed in prior work but is crucial for the efficient utilization of resources. We show that the proposed evaluation methods utilize resources well (fully utilized CPU, efficient memory allocation) and that the tasks do not incur unreasonably high overhead (e.g. garbage collection, unnecessary startup/teardown of JVMs during task initialization and termination, imbalance in last-task execution times).
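The speedup figures quoted in the abstract can be related to the ideal case with a short calculation. The sketch below is illustrative only and not taken from the thesis: it assumes that the "theoretically optimal speedup" is linear, that is, equal to the number of VMs, and the function names are hypothetical. Under that assumption, a speedup of 6.4 on 8 VMs is about 20% below ideal, in line with the reported 19.7% (the small gap is plausibly due to rounding of the 6.4 figure).

```python
def efficiency(speedup: float, workers: int) -> float:
    """Fraction of the ideal speedup actually achieved.

    Assumption (not stated in the record): the ideal speedup is linear,
    i.e. equal to the number of workers (VMs).
    """
    return speedup / workers


def shortfall_percent(speedup: float, workers: int) -> float:
    """How far below the assumed ideal linear speedup a run falls, in percent."""
    return (1.0 - efficiency(speedup, workers)) * 100.0


# Figures quoted in the abstract for 8 EC2 VMs on the 300MB dataset.
sj_shortfall = shortfall_percent(6.4, 8)   # Single-Job procedure
cj_shortfall = shortfall_percent(5.66, 8)  # Chain-Job procedure

print(f"SJ falls {sj_shortfall:.2f}% short of linear speedup")
print(f"CJ falls {cj_shortfall:.2f}% short of linear speedup")
```

The same two functions can be applied to any (speedup, VM count) pair to compare runs on an equal footing.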
Language English
Issue date 2013-11-15
Collection   School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses
  Type of Work--Post-graduate theses