E-Locus - Institutional Repository of the University of Crete - Subscription Indexes for Web Syndication Systems

Home Subscription Indexes for Web Syndication Systems

Results - Details

[Add to Basket]

Identifier

000364583

Title

Subscription Indexes for Web Syndication Systems

Alternative Title

Ευρετήρια συνδρομών για συστήματα δημοσιεύσεων περιεχομένου στ' ιστό

Author

Κουρδουνάκης, Χαράλαμπος Τσαμπίκος

Thesis advisor

Χριστοφίδης, Βασίλης

Reviewer

Τζίτζικας, Ιωάννης
Φατούρου, Παναγιώτα
Μπίλας, Αγγελος

Abstract

With the continuous growth of information available online, content syndication has become a popular means for timely information delivery on the Web. It essentially enhances traditional pull-oriented searching and browsing of web page with push-oriented publish and subscribe web feed protocols and interfaces. Today, web syndication technologies such as RSS or Atom are used in a wide variety of applications spreading from large-scale news broadcasting to small-scale personal information sharing between members of social networks. However, existing RSS or Atom syndication technologies exhibit serious limitations for coping with information overload in Web 2.0 while they imply a tight coupling between feed producers with consumers. In this work, we are interested in extending publish/subscribe systems proposed for the web with content-based filtering and aggregation facilities. In particular, we will like to enable users to express their interests as keyword-based queries (rather than subscribing to an entire channel) which will be matched at real time against the streams of incoming web feeds originating from different streams. Several data structures have been proposed in the literature for indexing subscriptions over streams of structured information under the form of sets of attribute-value pairs. Unfortunately, due the high dimensionality of unstructured information, these structures cannot be efficiently applied for indexing subscriptions formed by sets of keywords. In this work we propose a novel Trie-based data structure which exploits the subsumption relationships between subscriptions to address scalability concerns when indexing and searching sets of keywords. To accommodate high publishing rates of web feeds (i.e. bursty publication of news items) and large communities of subscribers (i.e. milllion of subscriptions), the search space build by the proposed Trie-based is significantly finer grained than by the Count-based index, especially when it involves highly frequent terms, while no additional structures (such as a counter) are required to discover the set of subscriptions covered (i.e. subset of terms) by an incoming news item. More precisely, the Trie depth is bounded by the size of the indexed subscriptions (in realistic settings is small from 2 up to 6 terms), while the width is bounded by the size of the vocabulary of indexed terms (in realistic settings is very large up to 1,500,000 terms). Furthermore, the total number of terms in the Trie and it's morphology is affected by the vocabulary terms' distribution and ranking (terms' frequency follows in reality power law distributions): when the terms' ranking in subscriptions donot violate the initial ranking considered for Trie construction, a great part of highly ranked terms would appear as left deep Trie structure, while in the opposite a large number of lowly ranked terms would appear as a right width Trie structure. Besides Trie morphology, matching time is mainly affected by the size of the incoming news items (in realistic setting lies between 5 and 50 terms). In this context, we have implemented space-wide optimized versionsof Trie and Count-based indices and experimentally evaluated their memory and time requirements. We show that the Trie-based index outperforms in matching time from 1 up to 3 orders of magnitude the Count-based index at the expanse of double memory requirements. Despite the large size ofreal subscription/items vocabulary, Trie matching time is sub-linear in the number of subscriptions , while matching time of the Count-based index increases linearly along with the number of indexed subscriptions. The throughput rate achieved by our Trie-based index when loaded with 10,000,000 subscriptions is ~500 items/sec.

Language

English

Issue date

2011-03-18

Collection

School/Department--School of Sciences and Engineering--Department of Computer Science--Post-graduate theses

Type of Work--Post-graduate theses

Views

522

Digital Documents
	Download document View document Views : 5