Abstract |
With the continuous growth of information available online, content syndication has
become a popular means for timely information delivery on the Web. It essentially
enhances traditional pull-oriented searching and browsing of web page with push-oriented
publish and subscribe web feed protocols and interfaces. Today, web syndication
technologies such as RSS or Atom are used in a wide variety of applications spreading
from large-scale news broadcasting to small-scale personal information sharing between
members of social networks. However, existing RSS or Atom syndication technologies
exhibit serious limitations for coping with information overload in Web 2.0 while they
imply a tight coupling between feed producers with consumers. In this work, we are
interested in extending publish/subscribe systems proposed for the web with content-based
filtering and aggregation facilities. In particular, we will like to enable users to
express their interests as keyword-based queries (rather than subscribing to an entire
channel) which will be matched at real time against the streams of incoming web feeds
originating from different streams.
Several data structures have been proposed in the literature for indexing subscriptions
over streams of structured information under the form of sets of attribute-value pairs.
Unfortunately, due the high dimensionality of unstructured information, these structures
cannot be efficiently applied for indexing subscriptions formed by sets of keywords. In
this work we propose a novel Trie-based data structure which exploits the subsumption
relationships between subscriptions to address scalability concerns when indexing and
searching sets of keywords.
To accommodate high publishing rates of web feeds (i.e. bursty publication of news items)
and large communities of subscribers (i.e. milllion of subscriptions), the search space
build by the proposed Trie-based is significantly finer grained than by the Count-based
index, especially when it involves highly frequent terms, while no additional structures
(such as a counter) are required to discover the set of subscriptions covered (i.e.
subset of terms) by an incoming news item. More precisely, the Trie depth is bounded by
the size of the indexed subscriptions (in realistic settings is small from 2 up to 6
terms), while the width is bounded by the size of the vocabulary of indexed terms (in
realistic settings is very large up to 1,500,000 terms). Furthermore, the total number of
terms in the Trie and it's morphology is affected by the vocabulary terms' distribution
and ranking (terms' frequency follows in reality power law distributions): when the
terms' ranking in subscriptions donot violate the initial ranking considered for Trie
construction, a great part of highly ranked terms would appear as left deep Trie
structure, while in the opposite a large number of lowly ranked terms would appear as a
right width Trie structure. Besides Trie morphology, matching time is mainly affected by
the size of the incoming news items (in realistic setting lies between 5 and 50 terms).
In this context, we have implemented space-wide optimized versionsof Trie and Count-based
indices and experimentally evaluated their memory and time requirements.
We show that the Trie-based index outperforms in matching time from 1 up to 3 orders of
magnitude the Count-based index at the expanse of double memory requirements. Despite the
large size ofreal subscription/items vocabulary, Trie matching time is sub-linear in the
number of subscriptions , while matching time of the Count-based index increases linearly
along with the number of indexed subscriptions. The throughput rate achieved by our
Trie-based index when loaded with 10,000,000 subscriptions is ~500 items/sec.
|