Abstract
Although Web search engines index and provide access to huge amounts of
documents, user queries typically return only a linear list of hits. While this is
often satisfactory for focused search, it does not support exploration or deeper
analysis of the results. One way to provide advanced exploration facilities, while
also exploiting the structured (and semantic) data that are now available, is to
enrich the search process with entity mining over the full contents of the search
results, where the entities of interest can be specified by semantic sources. Such
services give users an initial overview of the information space, allowing
them to gradually restrict it until they locate the desired hits, even if those hits
are ranked low.
In this thesis we consider a general scenario of providing such services as meta-
services (that is, layered over systems that support keyword search) without a priori
indexing of the underlying document collection(s). To make such services
feasible for large amounts of data, we use the MapReduce distributed computation
model on a Cloud infrastructure (Amazon EC2). Specifically, we show how the
required computational tasks can be factorized and expressed as MapReduce functions,
and we introduce two different evaluation procedures: the Single-Job (SJ) and
the Chain-Job (CJ). Moreover, we specify criteria that determine the selection and
ranking of the (often numerous) discovered entities.
In the sequel, we report experimental results about the achieved speedup in
various settings. We show that with the SJ procedure the achieved speedup is
close to the theoretical optimum (2.5%–19.7% below the optimum for a 300 MB
dataset on 2 to 8 Amazon EC2 VMs, respectively), and we justify this difference.
Indicatively, we achieve a speedup of up to x6.4 on 8 EC2 VMs when analyzing
4,365 hits (corresponding to 300 MB), with a total runtime of less than 7 minutes
(a task that is infeasible on a single machine due to its high computational and
memory requirements). CJ exhibits somewhat lower scalability than SJ (x5.66 on
8 EC2 VMs) and a longer total runtime (about 30 seconds more for a 300 MB
dataset) due to the overhead of running two MapReduce jobs instead of one.
On the other hand, CJ offers the qualitative benefit of providing a quick preview
of the analysis results.
Another important contribution of this thesis is a thorough evaluation of platform
configuration and tuning, an aspect that is often disregarded or inadequately
addressed in prior work, yet is crucial for the efficient utilization of resources. We
show that the proposed evaluation methods utilize the resources well (fully utilized
CPU, efficient memory allocation) and that the tasks do not incur unreasonably high
overhead (e.g., garbage collection, unnecessary startup/teardown of JVMs during
task initialization and termination, or imbalance in last-task execution times).