Fist scalable xml document filtering by sequencing twig patterns
Download
1 / 31

slides - PRIX: Indexing and Querying XML - PowerPoint PPT Presentation


  • 312 Views
  • Uploaded on

FiST. FiST: Scalable XML Document Filtering by Sequencing Twig Patterns. Joonho Kwon, Praveen Rao, Bongki Moon, Sukho Lee. School of Electrical Engineering and Computer Science, Seoul National University. Department of Computer Science, University of Arizona. Roadmap. Introduction

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'slides - PRIX: Indexing and Querying XML' - Michelle


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Fist scalable xml document filtering by sequencing twig patterns l.jpg

FiST

FiST: Scalable XML Document Filtering by Sequencing Twig Patterns

Joonho Kwon, Praveen Rao, Bongki Moon, Sukho Lee

School of Electrical Engineering and Computer Science, Seoul National University

Department of Computer Science, University of Arizona


Roadmap l.jpg
Roadmap

  • Introduction

    • Background and Motivations

  • Index Structure

    • Profile Sequences

    • Sequence Index

  • Filtering Algorithm

    • Progressive Subsequence Matching

    • Refinement for Branch Node Verification

  • Experimental Results

  • Conclusions


Introduction l.jpg
Introduction

  • Publish-subscribe systems

    • Selective dissemination of information (SDI)

    • User profiles (or standing queries)

    • New content is matched against the user profiles and is delivered to interested users

  • XML document filtering

    • User profiles (or twig patterns) are specified in the XPath language

    • Incoming XML document is delivered to users whose profiles have a match in the document

    • Reversal in the roles of twig patterns and XML documents

  • Challenges:

    • To minimize the filtering cost by effectively organizing a large number of user profiles

    • To achieve good scalability


Introduction4 l.jpg
Introduction

  • XFilter (VLDB’00) and YFilter (TODS’03)

    • XFilter – each path expression is mapped to a FSM

    • YFilter – a single NFA for XPath expressions with shared processing

  • Motivations

    • To develop a scalable XML filtering system that supports processing of twig patterns

    • To support holisticmatching of twig patterns without first matching the linear paths in the patterns and then merging these matches during post-processing

    • To inherently support ordered matching where the nodes in the twig pattern follow the document order in XML


Tree to sequence transformation l.jpg

(A,9)

(C,8)

(B,5)

(B,2)

(D,4)

(C,7)

Tree to Sequence Transformation

  • Extended Prüfer Sequences (PRIX,ICDE’04)

    • Extend leaf nodes of the tree with dummy child nodes

(A,9)

(C,8)

(B,5)

(B,2)

(D,4)

(C,7)

(d,1)

(d,3)

(d,6)

Tree T

B

B

D

B

A

C

C

A

LPS(T):


Sequence representation l.jpg
Sequence Representation

XML Document

User Profile

A

A

E

B

C

B

D

G

F

B

D

C

Twig Pattern

Tree T

Q1: /A[B//D]//E[G]/F

LPS(T): B B D B A C C A

LPS(Q1): D B A G E F E A


Slide7 l.jpg
FiST

XML documents

Filtering

Algorithm

User profiles

Profile

sequences

Sequence

Index


Index structure profile sequence l.jpg

Label

Qid

Pos

Sym

Ancestor-Descendant

Parent-Child

Branch + Ancestor-Descendant

Branch + Root node

Branch

Index Structure: Profile Sequence

  • Each twig pattern is mapped to a profile sequence

    • Profile sequence is an ordered list of nodes

    • Each node has four attributes

Q1: /A[B//D]//E[G]/F

LPS(Q1) = D B A G E F E A

D

B

A

G

E

F

E

A

1

1

1

1

1

1

1

1

1

2

3

4

5

6

7

8

//

/

$

/

$

/

$//

$#


Index structure sequence index l.jpg
Index Structure: Sequence Index

Dynamic hash based index

Q1: /A[B//D]//E[G]/F

A

D

1

B

1

//

C

Q2: //B[E]/C

Q1,1

D

E

Pointers to nodes in the profile sequences

2

Q2,1

E

1

/


Our filtering algorithm l.jpg
Our Filtering Algorithm

  • Progressive Subsequence Matching

    • Property: If tree Q is a subtree of tree T, then LPS(Q) is a subsequence of LPS(T)

      • Praveen Rao and Bongki Moon.PRIX: Indexing and Querying XML using Prüfer sequences(ICDE’04)

    • Identify those profile sequences that have a matching subsequence in the document sequence

    • Necessary but not a sufficient condition

  • Refinement for Branch Node Verification

    • Progressive subsequence matching phase is followed by a refinement phase to discard false matches

    • The connectivity of the branch nodes in the candidates (twig patterns) is verified


Progressive subsequence matching l.jpg
Progressive Subsequence Matching

  • The sequence representation of the document can be constructed as the document is parsed (e.g., SAX parser)

  • The subsequence matching phase is progressive

    • The sequence representation of the document is generated incrementally and the profile sequences (of twig patterns) that are subsequences are identified in steps

  • Runtime global stack

    • The stack stores node labels from the current node of the document being processed to the root

    • Elements are pushed and popped from the stack in document traversal order

    • Stack size is upper bound by the depth of the document


Incremental generation of lps l.jpg

<A>

push

<B>

push

<D>

push

</D>

leaf, o/p, pop, o/p

<E>

push

</E>

leaf, o/p, pop, o/p

</B>

non-leaf, pop, o/p

non-leaf, pop

</A>

Incremental Generation of LPS

A

E

D

B

B

B

E

A

D

E

G

F

F

C

Stack

LPS(T): DB EBA CBA GE FE FEA

D

B

E

B

A

CBA GE FE FEA


Progressive subsequence matching13 l.jpg
Progressive Subsequence Matching

  • Sequence Index is used to simultaneously find the matching profiles by parsing the document only once

  • The Prüfer sequence label of the document is used as the hash key in the Sequence Index

  • Additional tasks are performed based on the Sym attribute value (e.g., ‘/’, ‘//’,‘$’) in profile sequence nodes to eliminate most false matches by using the runtime stack

    • The remaining false matches are eventually removed during the refinement phase


Conceptual view l.jpg

Last node - match

Conceptual View

  • The matching process progresses by copying nodes in the profile sequences into the Sequence Index (denotes transitions in a state machine)

Sequence Index

Profile Sequence of Q1

A

Q1,3

D

B

A

G

E

F

E

A

B

Q1,2

1

1

1

1

1

1

1

1

C

1

2

3

4

5

6

7

8

//

/

$

/

$

/

$//

$#

D

Q1,1

G


Progressive subsequence matching15 l.jpg
Progressive Subsequence Matching

  • Runtime stack contains a section of document LPS up to the nearest branch node

XML document T

A

E

D

B

B

C

B

C

C

A

stack

D

LPS(T): …. E D C B A … C B A …

E


Progressive subsequence matching16 l.jpg
Progressive Subsequence Matching

  • Benefits of the runtime stack

    • Testing relationships during subsequence matching based on the Sym attribute ‘/’ and ‘//’. Let q and q’ denote two consecutive nodes in the profile sequence

      • TestPC(q,q’) - tests parent-child relationship (/) between q’ and q in the document

      • TestAD(q,q’) - tests ancestor-descendant relationship (//) between q’ and q in the document

    • Avoiding frequent node copys to the Sequence Index

    • Limiting the range of subsequence matching


Testing relationships between nodes l.jpg

TestAD

TestPC

Testing Relationships between Nodes

XML document T

Twig pattern Q2

E

C

B

F

B

B

A

2

2

2

2

2

F

C

1

2

3

4

5

B

B

/

/

$#

//

$

Sym

E

E

D

C

C

C

Sequence Index

B

Recursively test

till the nearest branch node without a ‘/’ or ‘//’

D

A

A

stack

E

Q2,1

E

F


Avoiding frequent node copys l.jpg

A

E

Q2,1

F

Q2,4

Avoiding Frequent Node Copys

XML document T

Twig pattern Q2

E

C

B

F

B

B

A

2

2

2

2

2

F

C

1

2

3

4

5

B

B

/

/

$#

E

//

$

Sym

E

D

Not copied

C

C

C

Sequence Index

B

A

D

A

stack

E

Q2,1

E

F


Limiting the range of subsequence matching l.jpg

C and E do not share an ancestor descendant relation

Limiting the Range of Subsequence Matching

Twig pattern Q2

XML document T

B

A

F

C

B

B

E

E

D

C

C

C

E

C

B

F

B

B

2

2

2

2

2

D

A

1

2

3

4

5

stack

/

/

$#

//

$

E

LPS(T): … E D C B A … C B …

LPS(Q2): E C B F B


Refinement phase l.jpg
Refinement Phase

  • The connectedness property of branch nodes in the candidates (twig patterns) should be tested to identify true matches

  • To enable the refinement process

    • Branch node processing – branch nodes in the profile sequences during subsequence matching

  • The refinement phase

    • Root node processing - last node in the profile sequence

    • Uses the information collected during branch node processing


Branch and root node processing l.jpg

C

E

BranchID

sets store

node ids

2

5

Branch and Root Node Processing

XML document T

Twig pattern Q3

//B[E]/C

B

(A,1)

C

E

B

(B,2)

(B,5)

(E,7)

E

B

C

B

A

3

3

3

3

stack

(G,8)

(F,9)

(F,10)

(D,3)

(E,4)

(C,6)

1

2

3

4

/

$

/

$#

LPS(T): D B E B A C B A G E F E F E A

LPS(Q3): E B C B

Root node processing:The intersection of BranchID sets for each branch node in the candidate twig pattern is tested


Fist architecture overview l.jpg
FiST: Architecture Overview

XPath Twig Patterns

(User Profiles)

XML Document

XPath Parser

SAX Parser

Sequence

Transformation

Sequence Index + Profile Sequences

Filtering Algorithm

Send filtered

document

Users

Filtering Engine


Experimental results l.jpg
Experimental Results

  • We measured the performance of FiST and YFilter for a variety of XML document sizes and twig patterns.

  • Experimental setup

    • 2.4 GHz Pentium IV with 512 MB RAM running Linux

  • Datasets

    • Synthetic Treebank data using IBM’s XML data generator

    • 1000 documents were generated using Treebank DTD

    • Recursion of elements, maximum document depth was 36

    • Dataset sizes

      • [1KB, 10KB) – 1k

      • [10KB, 20KB) – 10k

      • [20KB, 30KB) – 20k

      • [30KB, 123KB) – 30k


Experimental results24 l.jpg
Experimental Results

  • User profiles (twig patterns) were generated using the XPath Generator in YFilter

    • Uniform – (z = 0)

    • Skewed – (z = 0.9)

    • Maximum depth – 10

    • # of branches – 3 to 7

    • # of twig patterns – 50000 to 150000

  • For each twig set and document set, we measured the average filtering cost per document

    • filtering time + document parsing time


Experimental results25 l.jpg
Experimental Results

  • We compared YFilter and FiST by observing the trends in filtering cost for three different aspects of scalability

size of input documents

# of twig patterns

# of branches


Experimental results26 l.jpg

(observed – base)

base

Experimental Results

  • FiST was implemented in C++ and YFilter was developed in Java

  • For fairness of comparison, we chose the following evaluation metric

    • scaleup =

  • Wall clock time (document parsing + filtering)

  • We observed that FiST scaled better than YFilter under various situations

    • FiST’s filtering cost decreased with decrease in the number of matching user profiles

    • YFilter’s filtering cost increased as the size of the twig patterns increased


Varying xml document sizes l.jpg

skewed

uniform

Varying XML Document Sizes

  • We measured the scaleup for FiST and YFilter

  • FiST’s filtering cost grew slower than YFilter


Varying number of branches l.jpg

YFilter (uniform, 20k)

FiST (uniform, 20k)

Varying Number of Branches

  • We measured the scaleup for FiST and YFilter

  • Increase in the branch size reduced the number of matched profiles


Varying number of twig patterns l.jpg

uniform

skewed

Varying Number of Twig Patterns

  • We measured the wall clock time for FiST and YFilter

  • FiST was significantly faster than YFilter for 20k and 30k


Conclusions l.jpg
Conclusions

  • We have developed an XML filtering system called FiST that supports holistic matching of twig patterns

    • Avoids first matching the linear paths in the twigs and then merging the matches during post-processing

    • Transform twig patterns into profile sequences

  • Inherent support for ordered matching of twig patterns

  • Runtime stack

    • Stack size is upper bound by the depth of the document

  • Holistic matching yielded good scalability for our filtering system under various situations


Questions l.jpg
Questions?

For more information,

www.cs.arizona.edu/~bkmoon


ad