1 / 23

Early Profile Pruning on XML-aware Publish-Subscribe Systems

Early Profile Pruning on XML-aware Publish-Subscribe Systems. Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras University of California, Riverside. Overview. Motivation Bottom-up Filtering FSM (BUFF) Bounding-based XML Filtering (BoxFilter) Core Modules Filtering algorithms

fia
Download Presentation

Early Profile Pruning on XML-aware Publish-Subscribe Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Early Profile Pruning on XML-aware Publish-Subscribe Systems Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras University of California, Riverside

  2. Overview • Motivation • Bottom-up Filtering FSM (BUFF) • Bounding-based XML Filtering (BoxFilter) • Core Modules • Filtering algorithms • Experimental results

  3. Motivation • Publish-subscribe systems: The message transmission is defined by the message content • Examples: notification websites hotwire.com or ticketmaster.com Publisher Publisher Publisher Publisher Docu ments Docu ments Docu ments Docu ments Matching algorithm Re su l t Re su l t Re su l t Re su l t Prof ile Prof ile Prof ile Prof ile Submit, Update, Delete Submit, Update, Delete Submit, Update, Delete Submit, Update, Delete Subscriber Subscriber Subscriber Subscriber

  4. Bib article article vol year title title no 7 1996 proceedings t1 11 t2 year SIGMOD journal author 2006 TPDS last first author Florescu Daniela last mi first DeWitt J David Publish-subscribe systems • The data is exchanged in XML format. • Nodes - correspond to elements, attributes or text values • Edges represent immediate element-subelement or element-value relationships <Bib> <article vol=“7” no=“11”> <title>t1</title> <author> <last>DeWitt</last> <mi>J</mi> <first>David</first> </author> <journal>TPDS</journal> <year>1996</year> </article> <article> <title>t2</title> <author> <last>Florescu</last> <first>Daniela</first> </author> <proceedings>SIGMOD </proceedings> <year>2006</year> </article> </Bib> (a) Document (b) Tree representation

  5. Publish-subscribe systems (cont.) • The user profiles are expressed in XML query language (XPath, XQuery) • XML query contains • structural constraints • value-based constraints Structural constraints: ////article[/author[@last=``Smith'']]//procs[@conf=``VLDB''] Tree pattern: article author proceedings last conf

  6. Related Work/Our Contribution • Current work • Construction of overlay network • Dissemination/indexing of profiles (queries) • Processing of stream of messages • We focus on the matching process that takes place within a broker • Improves the performance of regular FSM by using a bottom-up evaluation of the document • Develop index-based filtering technique that performs early pruning of the query profile

  7. Overview • Motivation • Bottom-up Filtering FSM (BUFF) • Bounding-based XML Filtering (BoxFilter) • Core Modules • Filtering algorithms • Experimental results

  8. Bottom-up vs. Top-down filtering • State machines are among the most common methods for the XML matching process • Top-down approach: (i.e. in-order traversal or depth first order): advancing the state machine for each XML element (or attribute) read. • Do not consider any form of early pruning • Bottom-up approach: This approach takes into consideration the (usual) fact that an XML document has its more selective elements located at its leaves

  9. Q1 a Q4 a Q6 e b e g c f h d h Example • Top-down approach groups the queries according to their common prefixes • Bottom up: groups them according to their common suffixes. root Q2 a Q5 e Q3 a a a a a a a a a a a a c e f b b b b b b b b b b b d f h c c c c c c c c c c c d (a) Document (b) Queries c d a 3 2 4 3 4 b b c Q1 1 2 Q1 d a c d 6 1 5 5 e Q2 Q2 a f h f e a 7 8 9 6 7 0 8 0 Q4 Q3 Q3 h e h e a f f 12 11 11 12 10 Q5 10 Q5 Q4 9 g g h e 13 14 13 14 Q6 Q6 (c) Top-down (d) Bottom up

  10. BUFF • FSM-based Bottom-up approach for XML filtering. • BUFF avoids translating documents and queries to Prüfer sequences (as the other algorithms do), and employs a more direct evaluation algorithm. • The document is parsed through a SAX parser, which triggers events for specific marks (tags) in the XML document • The machine keeps a runtime stack that stores the current document path being processed.

  11. a1 1 </e> b2 b5 c3 c6 d c d4 d7 e9 b e8 a f10 5 1 </f> </e> 2 </d> 3,6 </c> 5 1 e 1,2 1,2 1,2,5 c c c b b b b a a a a BUFF Example e <e> d <d> c d b 1 2 3 4 e c <c> Q1 0 f b <b> c b a 8 5 6 7 a <a> Q2 (a) Document and BUFF (b)‏ (c)‏ (d)‏ (e)‏ (f)‏ (g)‏

  12. Overview • Motivation • Bottom-up Filtering FSM (BUFF) • Bounding-based XML Filtering (BoxFilter) • Core Modules • Filtering algorithms • Experimental results

  13. Profile Index Profiles P1 P2 P3 Bounding-based XML Filtering • Two major processes working asynchronously • Profile Management • Profile Matching Prüfer Sequence Profile Manager Matching Algorithm Matching Module Profiles (queries)‏ Input Documents Matched Documents

  14. Prüfer Sequence • A unique sequential encoding of a labeled tree • Algorithm: • Iteratively removes nodes from the tree until all nodes but the last two have been removed. • At each iteration, the algorithm finds and removes the leaf with the smallest label and adds to the Prüfer sequence the label of that leaf's parent. • Theorem: If a query tree Q is a subgraph of a document tree D then the Prüfer sequence of Q is a subsequence of the Prüfer sequence of D

  15. Sequence Envelope • Assume a set of k Prüfer sequences representing user profiles S1,..,Sk • We can derive two new sequences • Upper bound U: for each position take largest element • Lower bound L: for each position take smallest element • L and U form the smallest possible bounding envelope that encompasses all members of the set of sequences from above and below.

  16. Example • Assume 3 sequences with 11 symbols each abcabababcd cdcdecdcdec dedededebab

  17. Sequence Envelope (Cont.) • The sequence envelope structure is that it can be used as an aggregation of the sustaining set of sequences

  18. BoXFilter Tree • Sequence envelopes can be nested forming BoXFilter tree

  19. Filtering algorithms • The profiles in the system are organized in BoXFilter tree. Documents are traversed thought the tree • There are two variations of the filtering algorithm • Sequential – documents are processed one by one • Batch processing – documents are organized in a tree like the queries and both trees are joined • After the traversal of the BoXFilter tree, there is a verification step

  20. Overview • Motivation • Bottom-up Filtering FSM (BUFF) • Bounding-based XML Filtering (BoxFilter) • Core Modules • Filtering algorithms • Experimental results

  21. Experimental Results • We have generated datasets with 1000, 10000 and 100000 small documents (with up to 8KB) • We generated up to 100000 queries with selectivity fixed to 50% (a)‏ (b)‏ (c)‏

  22. Experimental Results (cont.) In this set of experiments, we vary the number of documents that match any of the profile queries. (selectivity 1\% means that one percent of the documents satisfy \textit{any} of the queries.)

  23. Thank You!

More Related