Stack-based Algorithms for Pattern Matching on DAGs

Li Chen, Amarnath Gupta, M. Erdem Kurul
San Diego Supercomputer Center (SDSC), University of California, San Diego
VLDB’05

Motivation
• The graph model is important in databases and knowledge representation
  • Bibliographic citations, hypertext, ontologies
• A lot of scientific data goes beyond the XML tree model
  • Much of it forms directed acyclic graphs (DAGs)
  • Taxonomies of proteins, chemical compounds, organisms
  • Data provenance graphs
  • Sequence data and multiple sequence alignments
• Searching for highly similar substructures
  • Gives rise to numerous pattern matching problems
  • e.g., matching a novel (metabolic) pathway against a pathway database

Example & Its Abstraction
[Figure: graph-structured patent citation network, with node labels such as Stanford, UPenn, UWisconsin, USA, Japan, England, France, Germany, computer science, biomedical, economy]
Labeled DAG
• node: patent/article
• label: patent/article properties
  • year, contact_author, affiliation, country, etc.
• directed edge: uniformly “cited-by”
Query Model
• Query patterns: path / twig / dag
• node: matches a certain property of data nodes
• edge: / for direct (resp. // for indirect) “cited-by”

Problem Definition
• Is this a (sub)graph isomorphism problem?
  Definition: Two graphs are isomorphic if there is a one-to-one correspondence between their vertices, and there is an edge between two vertices of one graph if and only if there is an edge between the two corresponding vertices in the other.
  NP-hard!
• Is this a subgraph homeomorphism problem?
  Definition: The homeomorphic image of a pattern graph H in a data graph G is such that the images of nodes in H are nodes of G, and the images of edges in H are paths in G.
• Neither! Ours is easier (polynomial)
  • acyclic graph model
  • corresponding vertices (nodes) have the same label

Problem Definition (cont.)
• Pattern matching on DAGs
  • The input is a (virtual) single-rooted DAG G
  • A total mapping from Q to G, preserving parent-child / ancestor-descendant relationships
  • Branches represent “and” semantics → structural join
  • Return node bindings in witness structures
[Figure: DAG-structured data with nodes m1, c1, c2, c4, p1, a1, b1, e1, e2, m2; a twig pattern query over computer science (c), biomedical (b), and economy (e); and the query solutions as bindings of c, b, and e nodes]

Related Work
• Exact vs. inexact graph matching [Shasha PODS02]
  • Exact: a total mapping from query nodes to data nodes (usually requiring label matching)
  • Inexact: either a partial mapping or an approximate total mapping
• Trade-off between space and time
  • Trade space for time
    • Path index: materialize paths of fixed or parameterized length
    • Transitive closure computation: for queries involving ‘//’
  • Store the adjacency list and compute on the fly
Q: Is there a method that is economical in both time and space?

Outline of The Talk
• Motivation
• Problem Definition
• Related Work
  • The inspiration of our idea
• Our Approach
  • Linear-space representation for DAG
  • Stack-based algorithms for path, twig, and dag queries
  • Complexity analysis
  • Optimization by prefiltering
• Experimental Evaluations
• Conclusions and Future Work

Inspiration from XML Pattern Matching: Interval Encodings
[Figure: interval encoding of an XML tree (a book with title, authors, year, and chapter subtrees); each node carries a (left, right) pair assigned along a pre-order traversal]
• y is a descendant of x, i.e., y ≺ x, iff y.left > x.left and y.right < x.right
• Implication of overlapping intervals: if the intervals of x and y overlap, then y ≺ x or x ≺ y
• Difficulties in directly applying interval encoding to a DAG
  • Each tree node has at most one parent, while a graph node may have more
  • Multiple encodings may be a solution, but they introduce more comparisons, so the result is likely economical in neither space nor time
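The interval test above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `encode_tree` and `is_descendant` and the toy document are assumptions made for the example, and the numbering simply counts each entry and exit of a depth-first traversal.

```python
def encode_tree(children, root):
    """Assign a (left, right) interval to every node in one depth-first pass."""
    interval = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1              # number handed out when the node is entered
        left = counter
        for child in children.get(node, []):
            visit(child)
        counter += 1              # number handed out when the node is exited
        interval[node] = (left, counter)

    visit(root)
    return interval


def is_descendant(interval, y, x):
    """True if y is a proper descendant of x: y's interval lies inside x's."""
    return interval[y][0] > interval[x][0] and interval[y][1] < interval[x][1]


# Hypothetical toy document, not the one drawn on the slide.
toy = {"book": ["title", "chapter"], "chapter": ["section"]}
enc = encode_tree(toy, "book")
print(enc)  # {'title': (2, 3), 'section': (5, 6), 'chapter': (4, 7), 'book': (1, 8)}
print(is_descendant(enc, "section", "book"))   # True
print(is_descendant(enc, "section", "title"))  # False
```

The check costs two integer comparisons per pair, which is what makes the encoding attractive for trees; it relies on every node having a single parent.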

Inspiration from XML Pattern Matching: Stack-based Algorithms for Holistic Joins
• Stack-based Algorithm [Bruno et al. SIGMOD02]
  • Build a stream and a stack corresponding to each query node
  • Nodes in streams are pushed into stacks in their document order
  • Pop a node from its stack if its interval no longer overlaps the newly pushed node
  • For a node pushed into a leaf stack, output all root-to-leaf paths
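The push and pop discipline can be illustrated with a much simpler relative of the holistic join: a two-node ancestor//descendant join over interval-encoded streams. The sketch below is an assumption-laden simplification, not the algorithm of Bruno et al.; the function name `stack_join` and the tuple layout are invented for the example, and the intervals are the tree-cover intervals of the talk's running example.

```python
def stack_join(ancestors, descendants):
    """Join two streams of (name, left, right) tuples, each sorted by left.

    Returns every (ancestor, descendant) pair whose intervals are nested.
    """
    out = []
    stack = []   # nested chain of ancestor candidates still "open"
    i = 0
    for d in descendants:
        # Push every ancestor candidate that starts before d.
        while i < len(ancestors) and ancestors[i][1] < d[1]:
            a = ancestors[i]
            # Pop candidates whose interval ended before a starts.
            while stack and stack[-1][2] < a[1]:
                stack.pop()
            stack.append(a)
            i += 1
        # Pop candidates whose interval ended before d starts.
        while stack and stack[-1][2] < d[1]:
            stack.pop()
        # Everything left on the stack contains d.
        out.extend((a[0], d[0]) for a in stack)
    return out


# Tree-cover intervals of the c nodes and of some b/e nodes from the running example.
c_stream = [("c1", 2, 9), ("c2", 5, 8), ("c4", 18, 19)]
leaf_stream = [("b1", 3, 4), ("e1", 6, 7), ("e2", 12, 13)]
pairs = stack_join(c_stream, leaf_stream)
print(pairs)  # [('c1', 'b1'), ('c1', 'e1'), ('c2', 'e1')]
```

Each stream is scanned once and each candidate is pushed and popped at most once. On a DAG this tree-only join is incomplete: relationships that depend on non-tree edges are not visible to the interval test alone.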

Challenges
• Can stack-based algorithms be extended to pattern matching on DAGs?
• If so, how? Is the extension economical in space and time?

Outline of The Talk
• Motivation
• Problem Definition
• Related Work
  • The inspiration of our idea
• Our Approach
  • Linear-space representation for DAG
  • Stack-based algorithms for path, twig, and dag queries
  • Complexity analysis
  • Optimization by prefiltering
• Experimental Evaluations
• Conclusions and Future Work

DAG Representation
• Partial order vs. transitive closure
  • G = (V, E, ≺_d)
  • node partial order ≺_d, i.e., e = <a,b> ∈ E ⟹ b ≺_d a; O(|E|) space
  • transitive closure ≺, i.e., p = <x,y> ∈ P ⟹ y ≺ x; O(|V|^2) space
• What do we do?
  • Do not pre-compute and store the transitive closure
  • Do not store the adjacency list for the partial order either
  • Instead, store the interval encoding of a tree-cover, which covers part of the transitive closure
  • And index the remaining linkages minimally but losslessly

Our DAG Representation
• Decompose a DAG G into T and GR
  • T = (V, ET) is a tree-cover (spanning tree)
  • GR = (VR, ER) is the remaining graph, ER = E - ET
[Figure: the example DAG G over nodes m1, c1, c2, c4, p1, a1, b1, e1, e2, m2, drawn as tree-cover edges plus surplus and surrogate links]
• Interval encoding of the tree-cover

  nid   encoding
  m1    [1,20]
  c1    [2,9]
  b1    [3,4]
  c2    [5,8]
  e1    [6,7]
  p1    [10,17]
  a1    [11,16]
  e2    [12,13]
  m2    [14,15]
  c4    [18,19]

• Surrogate & Surplus Predecessor Index (SSPI): given a node w, its
  • surplus preds = direct non-tree preds
  • surrogate preds = nearest preds that have surplus preds

  nid   PL
  b1    [c2,a1]
  a1    [c4]
  e2    [a1]
  m2    [a1]

• Two ways of inducing ≺
  • through checking of node intervals
    • e.g., b1[3,4] is contained by c1[2,9], hence b1 ≺ c1
  • through SSPI, without further node interval checking
    • e.g., a1 ∈ PL(b1), hence b1 ≺ p1; similarly, b1 ≺ c4 and e2 ≺ c4

Properties of Our DAG Representation
• Lossless in inducing ≺
• Building costs (a tree-cover traversal of G)
  • Procedure
    • encode each node w along the traversal of T
    • if w has surplus preds ui in addition to its tree parent v, add ui to PL(w)
    • if v is also in SSPI,
      • add v to PL(w), if v has surplus preds itself
      • inherit PL(v) in PL(w), otherwise
  • Linear time & space
    • in terms of |V| for interval encoding
    • in terms of |E| for populating SSPI
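The building procedure can be sketched as follows. This is a hedged illustration rather than the authors' implementation: `build_sspi` is a hypothetical name, the edge list is reconstructed from the intervals and PL entries of the running example, and the surrogate rule follows the definition "nearest preds that have surplus preds" together with the example entry PL(e2) = [a1]. The sketch uses two passes (a depth-first pass for intervals and surplus preds, then a top-down pass for PL) instead of a single traversal.

```python
def build_sspi(children, root):
    """Return (interval, pl) for a DAG given as {node: [successors]}."""
    interval, parent, surplus, order = {}, {}, {}, []
    seen = {root}
    counter = 0

    def visit(v):
        nonlocal counter
        counter += 1
        left = counter
        order.append(v)                      # pre-order of the tree-cover
        for w in children.get(v, []):
            if w in seen:                    # non-tree edge: v is a surplus pred of w
                surplus.setdefault(w, []).append(v)
            else:                            # tree edge: v becomes w's tree parent
                seen.add(w)
                parent[w] = v
                visit(w)
        counter += 1
        interval[v] = (left, counter)

    visit(root)

    pl = {}
    for w in order:                          # parents are handled before children
        preds = list(surplus.get(w, []))
        v = parent.get(w)
        if v in pl:                          # tree parent already has an SSPI entry
            if v in surplus:
                preds.append(v)              # v itself is the nearest surrogate
            else:
                preds += [u for u in pl[v] if u not in preds]  # inherit PL(v)
        if preds:
            pl[w] = preds
    return interval, pl


# Edge list reconstructed (as an assumption) from the running example.
EXAMPLE = {
    "m1": ["c1", "p1", "c4"],
    "c1": ["b1", "c2"],
    "c2": ["e1", "b1"],
    "p1": ["a1"],
    "a1": ["e2", "m2", "b1"],
    "c4": ["a1"],
}
interval, pl = build_sspi(EXAMPLE, "m1")
print(interval["a1"])  # (11, 16)
print(pl)  # {'b1': ['c2', 'a1'], 'a1': ['c4'], 'e2': ['a1'], 'm2': ['a1']}
```

Every edge is inspected once in the first pass and every node once in the second, which matches the linear building cost claimed on the slide for this small sketch.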

Extending Stack-based Holistic Join Algorithms
• Key ideas
  • Keep the data structures of streams and stacks
  • Add a new structure – partial solution pools
    • Put a popped node in its pool, rather than discard it
    • Grow partial solutions for the newly found ≺ relationships
  • Exploit temporal properties to avoid vain attempts

Algorithm Extension
• SweepPartialSolutions: checking & building solutions in pools
  • When?
    • A node v is popped out of its stack
  • Where?
    • Between v and the nodes in each of its children pools
  • What (condition)?
    • Check if each child pool has a node w, s.t. w ≺ v
  • How?
    • Check if ∃ u ∈ PL(PL…(w)) s.t. u.L ≥ v.L and u.R ≤ v.R
  • What (action)?
    • Expand: grow partial solutions headed by w to be headed by v

PathStackD by Example
[Figure: (a) Data G; (b) Query, a path over m, c, b; (c) Stacks Sm, Sc, Sb; (d) Pools Pm, Pc, Pb; (e) Results]

Algorithm Analysis
• The total containment (≺) checks in pools are
  • Not |Sm1| x |Sm2| x … x |Smn| times of SSPI look-ups
    • |Smi|: size of the ith stream, n: size of the path Q
  • But much more tightly restricted due to temporal properties
    • Not all stream nodes, but child pool nodes (to the left of v)
    • Not the entire SSPI is searched for checking if w ≺ v
[Figure note: for subsequent v’s, this u can be ignored]

Function checkContainment(v, w)
  while (u := next PL(w) and !found)
    if (u.L ≥ v.L and u.R ≤ v.R) return true
    else if (u.L ≥ v.R) return false
    else if (u has no preds) remove u from PL(w)
    else { found = checkContainment(v, u);
           if (!found) PL(w) = PL(w) + PL(u) - {u} }
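A plain reachability test over the interval encoding and the SSPI can be sketched as below. It is deliberately simpler than checkContainment: it omits the early exit and the rewriting of PL(w) and only shows the underlying condition, namely that some node on the PL chain of w has an interval inside that of v. The name `reaches` is hypothetical; the tables are copied from the running example.

```python
INTERVAL = {
    "m1": (1, 20), "c1": (2, 9), "b1": (3, 4), "c2": (5, 8), "e1": (6, 7),
    "p1": (10, 17), "a1": (11, 16), "e2": (12, 13), "m2": (14, 15), "c4": (18, 19),
}
PL = {"b1": ["c2", "a1"], "a1": ["c4"], "e2": ["a1"], "m2": ["a1"]}


def reaches(v, w):
    """True if w is v itself or can be reached from v in the DAG."""
    v_left, v_right = INTERVAL[v]
    seen = set()
    frontier = [w]
    while frontier:
        u = frontier.pop()
        if u in seen:
            continue
        seen.add(u)
        u_left, u_right = INTERVAL[u]
        # u lies inside v's interval: v is u or a tree ancestor of u.
        if u_left >= v_left and u_right <= v_right:
            return True
        # Otherwise continue through u's surplus and surrogate preds.
        frontier.extend(PL.get(u, []))
    return False


print(reaches("c1", "b1"))  # True, by interval containment alone
print(reaches("p1", "b1"))  # True, a1 is in PL(b1) and lies inside p1
print(reaches("c4", "e2"))  # True, via PL(e2) = [a1] and PL(a1) = [c4]
print(reaches("c1", "e2"))  # False
```

In the worst case this sketch walks the whole PL chain of w for every pair; the pruning rules in checkContainment exist to avoid exactly that repeated work.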

PathStackD
Theorem 1: Given a path query q and a DAG G, PathStackD correctly returns all the query answers for q (it is sound and complete).
Theorem 2: Given a path query q and a DAG G, PathStackD has worst-case I/O and CPU time complexity O(|q||Smi|^2 + |q||Smi|d + |E|), i.e., max(|E|, |q||Smi|(max(|Smi|, d))).
Optimal compared to O(|V|^2 |q|)
|Smi|: average stream size; |q|: query size; d: diameter of G

Additional Changes in TwigStackD
[Figure: (a) Data G; (b) Query, a twig over m, c, b, e; (c) Stacks Sm, Sc, Sb, Se; (d) Pools Pm, Pc, Pb, Pe; (e) Results]
• Key changes:
  • getMinSources (original) → getMissings (ours)
  • sweepPartialSolutions

  getMinSources (original)                      getMissings (ours)
  1. node with minimal left value               1. the same
  2. has all the required descendant types      2. record which required types are missing
                                                3. check if missing types are complemented by pool nodes

A Prefiltering Step
• Purpose
  • Improve efficiency by reducing the I/O factor |Smi|
• Basic Idea
  • Impose structural constraints of the query pattern for filtering nodes to be put in streams
  • e.g., for a query pattern over nodes a, b, d, c, one QBitVec per node captures the required downwards structural constraints and another captures the required upwards structural constraints; each query node has its own bit (QBit)

  query node   downwards QBitVec   upwards QBitVec
  a            1111                1000
  b            0011                1010
  d            0100                1100
  c            0001                1011

Two Passes for Prefiltering
• Downwards Filtering By Example
  • Traverse the data DAG and aggregate the satisfied descendant types
  • Match the satisfied with the required
[Figure: encoded query pattern (required constraints: a 1111, b 0011, d 0100, c 0001) next to the encoded data DAG (satisfied constraints)]

Data nodes are processed in post-order
when exiting each edge directing from n to prev, do
  // myBitVec is the bitVector value for n
  myBitVec = bitOR(myBitVec, prevBitVec, QBit)
  // prev is query relevant if it matches a query label
  if (prev is query relevant && prev does not satisfy structural constraint) then
    myBitVec = bitAND(myBitVec, ~prevQBit)
if (n is query relevant && bitAND(myBitVec, QBitVec) == QBitVec) then
  n satisfies structural constraint
  put n into the corresponding stream

• post-order: guarantees that a node is encoded before all its ancestors
• topological-order: guarantees that a node is encoded before all its descendants
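The downwards pass can be sketched in Python as follows. Several things here are assumptions rather than the authors' code: the per-node QBit values are derived from the downwards and upwards vectors of the example query (a = 1000, d = 0100, b = 0010, c = 0001), every query edge is treated as an ancestor-descendant edge, the data DAG is a made-up toy, and the sketch clears an unsatisfied child's own bit before merging the child's vector instead of after.

```python
QBIT = {"a": 0b1000, "d": 0b0100, "b": 0b0010, "c": 0b0001}      # derived, see lead-in
QBITVEC = {"a": 0b1111, "b": 0b0011, "d": 0b0100, "c": 0b0001}   # downwards constraints


def downward_filter(children, label, roots):
    """Return the data nodes that pass the downwards structural filter."""
    bitvec, ok = {}, {}

    def visit(n):
        if n in bitvec:                      # shared DAG node, already encoded
            return
        vec = QBIT.get(label[n], 0)          # the node's own query bit, if any
        for child in children.get(n, []):
            visit(child)                     # post-order: children are encoded first
            child_vec = bitvec[child]
            if label[child] in QBIT and not ok[child]:
                child_vec &= ~QBIT[label[child]]   # drop an unsatisfied child's own bit
            vec |= child_vec
        bitvec[n] = vec
        need = QBITVEC.get(label[n])
        ok[n] = need is not None and (vec & need) == need

    for r in roots:
        visit(r)
    return {n for n, passed in ok.items() if passed}


# Hypothetical data DAG: a2's only b descendant has no c below it.
data = {"a1": ["b1", "m1"], "m1": ["d1"], "b1": ["c1"], "d1": ["c1"], "a2": ["b2"]}
labels = {"a1": "a", "a2": "a", "b1": "b", "b2": "b",
          "c1": "c", "d1": "d", "m1": "m"}
kept = downward_filter(data, labels, ["a1", "a2"])
print(sorted(kept))  # ['a1', 'b1', 'c1', 'd1']
```

Only the kept nodes would be placed into streams, which is how the filter reduces |Smi|. A node that passes this pass is still only a candidate; the upwards pass and the join itself decide whether it appears in a result.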

Summary of Our Approach
• The key ideas
  • Our DAG representation losslessly covers the entire transitive closure
    • Interval encoding on the tree-cover T covers part of it
    • SSPI and the tree-cover encoding together cover it completely
    • Worst-case space is O(|V|+|E|), compared to O(|V|^2) if the entire transitive closure is pre-computed and stored
  • *_stackD algorithms leverage trade-offs between space and time
    • Adopt a new structure, i.e., partial solution pools, in addition to streams and stacks
    • Modify/add procedures to handle stack-popped nodes in pools, where remaining solutions can be found
    • Worst-case time is O(max(|E|, |Smi|^2)), compared to O(|V|^2) if no path index is utilized
  • Prefiltering further optimizes performance by reducing |Smi|

Outline of The Talk
• Motivation
• Problem Definition
• Related Work
  • The inspiration of our idea
• Our Approach
  • Linear-space representation for DAG
  • Stack-based algorithms for path, twig, and dag queries
  • Complexity analysis
  • Optimization by prefiltering
• Experimental Evaluations
• Conclusions and Future Work

Experimental Evaluations
• System implementation
  • Java 1.4
  • Light-weight storage engine: PSEPro from ObjectStore
    • Utilize its VMMA for mapping data structures between memory and disk
• Experimental setups
  • Tunable synthetic DAG data generator
    • Parameters: diameter, fan-out, fan-in, number of distinct labels
  • Real-life data
    • Gene ontology data, tree data from the XMark benchmark augmented by random cross links
  • 2.6 GHz Pentium IV PC, 1 GB MM, 2 GB VM

Experiment 1
Compare processing time (including prefiltering and query execution) of *StackD against Nav [Kanza PODS03]
[Figure: five bar charts of time (ms) for the queries PQ, TQ, and DQ, with bars for Nav, Exec, and Filter; panels: n=25K, m=45K; n=50K, m=90K; n=100K, m=180K; n=200K, m=360K; n=400K, m=720K (n: |V|, m: |E|); the query patterns PQ, TQ, and DQ are drawn over nodes a to f]

Experiment 2
Evaluate the performance of both algorithms as the density of the DAG changes
[Figure: time (ms) against m (K) for the query TQ; n=5K, m=5K, 10K, 20K, 30K, 40K, 50K, 60K, 70K]

Experiment 3
Evaluate the performance of both algorithms as the size of the query changes
[Figure: two charts of time (ms) for query patterns of increasing size, drawn over nodes a to i]

Experiment 4
Evaluate the performance of PathStack-D with and without the aid of the prefiltering step
[Figure: time (ms)]

Experiment 5
100MB XML document (~1.4M nodes and ~1.6M edges)
PQ = //site//person//age
TQ = //site(//item//description, //category//name, //person//age)
[Figure: BuildSSPI, TSDFilter, and TSDExec times for 25K, 50K, 100K, 200K, and 400K; #Scan and #Result of *StackD against NavAlgo for PQ and TQ]

Conclusions and Future Work
• Conclusions
  • Gracefully generalized the stack-based algorithms for pattern matching on DAGs
  • The extended algorithms are sound and complete
  • The proposed approach is optimal among those that do not rely on precomputed transitive closure
• Future Work
  • Further improvement by incorporating statistics on the graph structure and/or advanced indexing schemes
  • Allow for more general graph operations, which give rise to more challenging query optimizations

Questions?
