Abstract |
This thesis elaborates on the problem of providing efficient and effective methods for results clustering in Web searching.
In brief, results clustering provides users with overviews of the search results, allowing them to restrict their focus to the desired parts of the returned answer. In addition, results clustering alleviates the problem of ambiguity in natural language words.
However, deriving (single-word or multiple-word) names for the clusters (usually referred to as cluster labeling) is difficult, because the names have to be syntactically correct and predictive (that is, they should allow users to predict the contents of each cluster).
Furthermore, results clustering is an online task; therefore, efficiency is an important requirement.
This thesis surveys the methods that have been proposed and used for results clustering and focuses on the Suffix Tree Clustering (STC) approach. STC is a clustering technique in which search results (mainly snippets) can be clustered fast (in linear time) and
incrementally, and in which each cluster is labeled with a phrase. This thesis proposes two novel results clustering methods:
(a) a variation of STC, called STC+, which uses a scoring formula that favors phrases occurring in document titles and which merges base clusters differently, and (b) a novel algorithm, called HSTC, that produces hierarchically organized clusters.
The comparative user evaluation showed that both STC+ and HSTC are significantly preferred over STC, and that HSTC is about two times faster than STC and STC+. These methods were applied over the Mitos Web search engine and over Google. Moreover, HSTC was integrated with the Dynamic Faceted Taxonomies interaction scheme of Mitos.
The dynamic coupling of results clustering with dynamic faceted taxonomies results in an effective, flexible, and efficient exploration experience. Finally, the thesis reports experimental and empirical results from applying these methods over Mitos and over Google.
|