Computer Science and Engineering

TreeSpan: Efficiently Computing Similarity All-Matching
Gaoping Zhu#, Xuemin Lin#, Ke Zhu#, Wenjie Zhang#, Jeffrey Xu Yu†
# The University of New South Wales
† The Chinese University of Hong Kong

Outline
  - Introduction
  - State-of-the-Art
  - Our Approach
  - Experiments
  - Conclusions

Introduction: Graph Data
  - Chem-informatics: chemical compounds (small size)
  - Bio-informatics: PPI networks (medium size)
  - Internet: World Wide Web (large size)

Introduction: Exact All-Matching (I)
Exact All-Matching
  - Enumerate all exact (i.e., isomorphic) matches of a query graph q in a data graph G.
Applications
  - Query biological patterns in PPI networks.
  - Detect suspicious bugs in software programs.
[Figure: a data graph G, a query graph q, and the exact matches of q in G.]
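A minimal sketch of exact all-matching as plain backtracking, not the method of any system cited in this talk. It assumes undirected, vertex-labeled graphs stored as adjacency sets, and it treats a match as an injective, label-preserving mapping that sends every edge of q to an edge of G. The function name, the type alias, and the toy graphs are hypothetical.

```python
from typing import Dict, Iterator, Set

Adjacency = Dict[str, Set[str]]  # undirected graph as adjacency sets (assumed layout)


def exact_matches(q_adj: Adjacency, q_label: Dict[str, str],
                  g_adj: Adjacency, g_label: Dict[str, str]) -> Iterator[Dict[str, str]]:
    """Yield each injective, label-preserving mapping of q's vertices into G
    under which every edge of q lands on an edge of G."""
    order = list(q_adj)  # fixed matching order over the query vertices

    def extend(i: int, mapping: Dict[str, str], used: Set[str]) -> Iterator[Dict[str, str]]:
        if i == len(order):
            yield dict(mapping)
            return
        u = order[i]
        for v in g_adj:
            if v in used or g_label[v] != q_label[u]:
                continue
            # Every already-mapped neighbor of u must be adjacent to v in G.
            if all(mapping[w] in g_adj[v] for w in q_adj[u] if w in mapping):
                mapping[u] = v
                used.add(v)
                yield from extend(i + 1, mapping, used)
                del mapping[u]
                used.remove(v)

    yield from extend(0, {}, set())


# Hypothetical toy input: q is a single A-B edge; G has two A vertices attached to one B.
q_adj = {"u1": {"u2"}, "u2": {"u1"}}
q_label = {"u1": "A", "u2": "B"}
g_adj = {"v1": {"v2"}, "v2": {"v1", "v3"}, "v3": {"v2"}}
g_label = {"v1": "A", "v2": "B", "v3": "A"}
matches = list(exact_matches(q_adj, q_label, g_adj, g_label))
```

On this toy input the query edge can land on either A-B edge of G, so two mappings are enumerated.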

Introduction: Exact All-Matching (II)
Dilemma of Exact All-Matching
  - If q is issued by a user for exploratory purposes, or
  - if G is noisy due to imprecise data collection,
  - then no exact matches may be found.
Potential Solutions
  - Modify q or G and run exact all-matching again and again.
  - Ask the system to return approximate results (i.e., similarity all-matching).
[Figure: a data graph G and a query graph q' for which no exact matches can be found.]

SAPPER [VLDB’10 Zhang et al.] (I)
Similarity All-Matching
  - Given a query graph q, a data graph G, and a similarity threshold θ, enumerate all similarity matches of q in G (i.e., all connected subgraphs of G missing at most θ edges of q).
Framework
  1. Enumerate a set of seeds Q_SAPPER (i.e., all connected subgraphs q’ of q missing θ edges of q).
  2. Run exact all-matching on each seed q’ to obtain its exact matches.
  3. Induce similarity matches from the exact matches of the seeds.
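The seed enumeration step can be sketched by brute force. This is an illustration under assumptions, not SAPPER's actual procedure: it deletes every combination of exactly θ edges from q and keeps the results that stay connected on all vertices of q. The function names and the 4-cycle query are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]


def is_connected(vertices: Sequence[str], edges: Sequence[Edge]) -> bool:
    """Return True if the undirected graph (vertices, edges) is connected."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = vertices[0]
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(vertices)


def seeds_by_edge_deletion(vertices: Sequence[str], edges: Sequence[Edge],
                           theta: int) -> List[List[Edge]]:
    """Brute force: every connected subgraph of q obtained by deleting exactly theta edges."""
    seeds = []
    for removed in combinations(edges, theta):
        kept = [e for e in edges if e not in removed]
        if is_connected(vertices, kept):
            seeds.append(kept)
    return seeds


# Hypothetical query: a 4-cycle on u1..u4.
q_vertices = ["u1", "u2", "u3", "u4"]
q_edges = [("u1", "u2"), ("u2", "u4"), ("u4", "u3"), ("u3", "u1")]
seeds_theta1 = seeds_by_edge_deletion(q_vertices, q_edges, 1)
```

For the 4-cycle with θ = 1, each single-edge deletion leaves a connected path, so there are four seeds and therefore four exact all-matching tests under this reading of the framework.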

SAPPER [VLDB’10 Zhang et al.] (II)
Cost Model
  - |Q_SAPPER| = # of exact all-matching tests.
[Figure: a query graph q (θ = 1) on vertices u1 to u4 with its seeds q'1 to q'4, a data graph G on vertices v1 to v5, and the resulting matches, with F1 = {u1→v1, u2→v2, u3→v3, u4→v4} and F2 = {u1→v2, u2→v1, u3→v3, u4→v4}.]

Our Approach: Overview (I)
Tree-based Spanning Search Paradigm (TSpan)
  - Enumerate a set of seeds Q_T (i.e., spanning trees of q that cover all connected subgraphs q’ of q missing θ edges of q).
Primary Contribution
  - Reduce the # of exact all-matching tests (i.e., the # of seeds).
  - Reduce the complexity of each exact all-matching test from graph-to-graph matching to tree-to-graph matching.
[Figure: for a query q (θ = 2), 3 all-matching tests on connected subgraphs of q (labeled "more SAPPER seeds") versus 1 all-matching test on a spanning tree of q.]

Our Approach: Overview (II)
Generating Similarity Maximal Matches
  - Generating only similarity maximal matches can reduce the # of exact all-matching tests.
[Figure: a query graph q (θ = 1) on vertices u1 to u4, a data graph G on vertices v1 to v5, the mappings F1 = {u1→v1, u2→v2, u3→v3, u4→v4} and F2 = {u1→v2, u2→v1, u3→v3, u4→v4}, and the resulting similarity maximal matches.]

Our Approach: Problem Statement
Similarity Maximal All-Matching
  - Given a query graph q, a data graph G, and a similarity threshold θ, enumerate all distinct similarity maximal matches of q in G that conform to θ.

Our Approach: Seeding (I)
PRIM Order on Spanning Trees
  - Similar to the basic idea of a minimum spanning tree.
  - Given a total order on E(q), a spanning tree T = {T[0], T[1], ..., T[|V(q)| - 1]} of q (where T[0] is the head vertex) conforms to PRIM order if and only if each spanning edge T[i] has the smallest order among the edges in E(q) - {T[1], ..., T[i - 1]} that connect to {T[0], T[1], ..., T[i - 1]}.
[Figure: a query graph q with edges e1 to e6 on vertices A, B, C, D, and a spanning tree T consisting of e1, e2, e3.]
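One reading of the PRIM order definition can be sketched as a greedy construction: starting from the head vertex, repeatedly take the smallest-order edge that joins the partial tree to a new vertex. This is an illustrative interpretation, not the authors' code. The function name is hypothetical, and the endpoints assigned to e1 to e6 below are assumed, since the figure's layout did not survive extraction.

```python
from typing import List, Tuple

Edge = Tuple[str, str]


def prim_order_tree(head: str, ordered_edges: List[Edge]) -> List[Edge]:
    """Build the spanning edge sequence T[1..] from edges listed smallest order first."""
    in_tree = {head}          # T[0] is the head vertex
    tree: List[Edge] = []
    remaining = list(ordered_edges)
    while True:
        for e in remaining:
            a, b = e
            # Eligible only if exactly one endpoint is already spanned.
            if (a in in_tree) != (b in in_tree):
                tree.append(e)
                in_tree.update(e)
                remaining.remove(e)
                break
        else:
            return tree       # no edge reaches a new vertex: the tree is complete


# Assumed endpoints for e1..e6 (listed in order); hypothetical, for illustration only.
e1, e2, e3 = ("A", "B"), ("A", "C"), ("C", "D")
e4, e5, e6 = ("A", "D"), ("B", "D"), ("B", "C")
T = prim_order_tree("A", [e1, e2, e3, e4, e5, e6])
```

With these assumed endpoints the construction selects e1, e2, e3 in that order. Checking whether a given tree conforms to PRIM order then amounts to comparing it with this output.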

Our Approach: Seeding (II)
Avoid Duplicate Results
  - Two spanning trees of q may induce duplicate similarity maximal matches.
  - Associate an edge exclusion set T.R with each T in Q_T.
  - T.R is a set of edges in E(q) - E(T) that are forced to be mismatched in the similarity maximal matches induced by T.
[Figure: a query graph q (θ = 2), a data graph G, and spanning trees T1 and T2 with T1.R = ∅ and T2.R = {(A, D)}.]

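Our Approach: Seeding (III)
Q_T Enumeration Algorithm
[Figure: enumeration of Q_T for a query graph q (θ = 2) with edges e1 to e6 on vertices A, B, C, D. The figure lists the spanning trees e1e2e3, e4e3e2, e1e2e4, e1e4e3, e1e2e5, e1e4e6, e1e5e3, e4e3e5, e4e5e2, and e6e2e3, linked by "go down" and "alternate-reorder" steps. Edge annotations in the figure: XT1[1] e1, XT1[2] e2, XT1[3] e3, XT2[3] e4, XT4[2] e4, XT4[3] e3, XT7[1] e4, XT7[2] e3, XT7[3] e2.]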

Our Approach: Seeding (IV)
Q_T Enumeration Algorithm
  - Correctness: Using Q_T to induce similarity maximal matches neither generates duplicate results nor misses valid results.
  - Minimality of Q_T: If any spanning tree is missing from Q_T, completeness of the results is no longer guaranteed under edge exclusion semantics.
  - When |E(q)| = m and |V(q)| = n: (1) |Q_SAPPER| ≥ |Q_T|; (2) |Q_T| = |Q_SAPPER| only when θ = 0 or θ = m - n + 1.

Our Approach: Searching (I)
Effectively Storing Q_T
  - Use a DFS Traversal Tree to share computation cost.
[Figure: DFS Traversal Tree over the spanning trees e1e2e3, e4e3e2, e1e2e4, e1e4e3, e1e2e5, e1e4e6, e1e5e3, e4e3e5, e4e5e2, and e6e2e3. Node labels by level: R; e1, e4, e6; e2, e4, e5, e3, e5, e2; e3, e4, e5, e3, e6, e3, e2, e5, e2, e3.]
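The sharing idea can be illustrated with a prefix tree over the ten edge sequences listed on this slide. This is a sketch assuming the DFS Traversal Tree behaves like a trie in which spanning trees with a common edge prefix share nodes. The nested-dict layout and the function names are hypothetical.

```python
from typing import Dict, List


def build_prefix_tree(sequences: List[List[str]]) -> Dict[str, dict]:
    """Insert each spanning tree's edge sequence so that common prefixes share nodes."""
    root: Dict[str, dict] = {}
    for seq in sequences:
        node = root
        for edge in seq:
            node = node.setdefault(edge, {})
    return root


def count_nodes(node: Dict[str, dict]) -> int:
    """Number of non-root nodes, i.e. edge-extension steps after sharing."""
    return sum(1 + count_nodes(child) for child in node.values())


# The ten spanning trees listed on the slide, as edge sequences.
Q_T = [
    ["e1", "e2", "e3"], ["e4", "e3", "e2"], ["e1", "e2", "e4"], ["e1", "e4", "e3"],
    ["e1", "e2", "e5"], ["e1", "e4", "e6"], ["e1", "e5", "e3"], ["e4", "e3", "e5"],
    ["e4", "e5", "e2"], ["e6", "e2", "e3"],
]
tree = build_prefix_tree(Q_T)
shared_steps = count_nodes(tree)                 # steps with prefix sharing
unshared_steps = sum(len(seq) for seq in Q_T)    # steps if every tree is searched separately
```

Under this reading, the ten trees need 19 edge-extension steps with sharing instead of 30 without, and the root branches on e1, e4, and e6 as in the figure.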

Our Approach: Searching (II)
Similarity Maximal All-Matching Algorithm Sketch
  - Traverse the DFS Traversal Tree in a depth-first backtracking fashion.
  - go-down: Beginning from the initial spanning tree, recursively extend the current partial match to the next spanning edge T[i] of the current spanning tree T.
  - alternate: If T[i] cannot be extended from the current partial match and mismatching T[i] still conforms to θ, switch from T to the alternative spanning tree T’ obtained by replacing T[i] with T’[i].

Our Approach: Optimizations
Optimization (I): EnumerateOnDemand Strategy
  - Motivation: further reduce the number of seeds.
  - Enumerate an alternative tree T’ from the current tree T only when it is feasible to extend the current partial similarity maximal match, conforming to θ, (1) on the next spanning edge T[i] or (2) on the next spanning edge T’[i].
Optimization (II): Effective Search Order
  - Motivation: terminate each all-matching test as early as possible.
  - Decide the search order of the spanning edges in T based on the post-filtering candidate set of each vertex in q.

Our Approach: Filtering & Ordering (I)
Neighborhood Aggregate N(v, g)
  - Given a set of labels Σ_V = {L1, ..., Lm}, N(v, g) = (x1, ..., xm), where xi is the number of neighbors of v in g with label Li ∈ Σ_V.
Neighborhood-based Filtering
  - Compute the candidate set C(u) for each u in q.
[Figure: a vertex u ∈ q with N(u, q) = {2, 1, 0, 2} and a vertex v ∈ G with N(v, G) = {1, 2, 2, 0}.]
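The neighborhood aggregate is straightforward to compute. The pruning rule below is an assumption, because the slide does not state it: a data vertex v stays in C(u) when it has the same label as u and the total per-label shortfall of N(v, G) against N(u, q) is at most θ, since every missing neighbor forces at least one mismatched query edge at u. All function names and the toy graphs are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Set, Tuple

Adjacency = Dict[str, Set[str]]


def neighborhood_aggregate(v: str, adj: Adjacency, label: Dict[str, str],
                           alphabet: Sequence[str]) -> Tuple[int, ...]:
    """N(v, g): for each label in alphabet, the number of neighbors of v with that label."""
    counts = Counter(label[w] for w in adj[v])
    return tuple(counts[l] for l in alphabet)


def label_deficit(n_u: Sequence[int], n_v: Sequence[int]) -> int:
    """Total number of neighbors that u requires but v cannot supply."""
    return sum(max(0, x - y) for x, y in zip(n_u, n_v))


def candidate_set(u: str, q_adj: Adjacency, q_label: Dict[str, str],
                  g_adj: Adjacency, g_label: Dict[str, str],
                  alphabet: Sequence[str], theta: int) -> Set[str]:
    """Assumed rule: keep v if labels agree and the deficit fits within theta."""
    n_u = neighborhood_aggregate(u, q_adj, q_label, alphabet)
    return {
        v for v in g_adj
        if g_label[v] == q_label[u]
        and label_deficit(n_u, neighborhood_aggregate(v, g_adj, g_label, alphabet)) <= theta
    }


# The two aggregates shown on the slide.
slide_deficit = label_deficit((2, 1, 0, 2), (1, 2, 2, 0))

# Hypothetical toy graphs: u1 (A) needs one B and one C neighbor.
alphabet = ("A", "B", "C")
q_adj = {"u1": {"u2", "u3"}, "u2": {"u1"}, "u3": {"u1"}}
q_label = {"u1": "A", "u2": "B", "u3": "C"}
g_adj = {"v1": {"v2"}, "v2": {"v1"}, "v4": {"v5", "v6"}, "v5": {"v4"}, "v6": {"v4"}}
g_label = {"v1": "A", "v2": "B", "v4": "A", "v5": "B", "v6": "C"}
c_strict = candidate_set("u1", q_adj, q_label, g_adj, g_label, alphabet, 0)
c_relaxed = candidate_set("u1", q_adj, q_label, g_adj, g_label, alphabet, 1)
```

For the slide's pair of aggregates the shortfall is 1 + 0 + 0 + 2 = 3, so under this assumed rule v would be pruned for any θ below 3.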

Our Approach: Filtering & Ordering (II)
QI Search Ordering [VLDB’08 Shang et al.]
  - Pick Head Vertex: the vertex u in q with minimum φ(u) (i.e., the number of occurrences of vertices in G with label l(u)).
  - Pick Next Spanning Edge: the edge (u1, u2) with minimum φ(u1, u2) (i.e., the number of occurrences of edges in G with labels (l(u1), l(u2))), where u1 is a vertex incident to a previously picked spanning edge.
Filtering-based Search Ordering
  - Pick Head Vertex: the vertex u in q with the minimum number of candidates (i.e., |C(u)|).
  - Pick Next Spanning Edge: the edge (u1, u2) minimizing |C(u2)| × φ(u1, u2) / φ(u2), where u1 is a vertex incident to a previously picked spanning edge.
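The filtering-based ordering rule can be written down directly from the two bullets above. This is a sketch with assumed data structures: candidate sets as a dict of sets, φ for vertex labels as a dict, and φ for label pairs as a dict keyed by frozenset. The function names and all frequencies are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Edge = Tuple[str, str]


def pick_head(cand: Dict[str, Set[str]]) -> str:
    """Head vertex: the query vertex with the fewest candidates |C(u)|."""
    return min(cand, key=lambda u: len(cand[u]))


def pick_next_edge(visited: Set[str], q_edges: List[Edge], q_label: Dict[str, str],
                   cand: Dict[str, Set[str]], vertex_freq: Dict[str, int],
                   edge_freq: Dict[FrozenSet[str], int]) -> Optional[Edge]:
    """Next spanning edge (u1, u2): u1 already visited, u2 new,
    minimizing |C(u2)| * phi(u1, u2) / phi(u2)."""
    best, best_score = None, float("inf")
    for a, b in q_edges:
        for u1, u2 in ((a, b), (b, a)):  # try both orientations of the undirected edge
            if u1 in visited and u2 not in visited:
                pair = frozenset((q_label[u1], q_label[u2]))
                score = len(cand[u2]) * edge_freq[pair] / vertex_freq[q_label[u2]]
                if score < best_score:
                    best, best_score = (u1, u2), score
    return best


# Hypothetical statistics for a three-vertex query.
q_edges = [("u1", "u2"), ("u1", "u3")]
q_label = {"u1": "A", "u2": "B", "u3": "C"}
cand = {"u1": {"v1"}, "u2": {"v2", "v3"}, "u3": {"v4", "v5"}}
vertex_freq = {"A": 2, "B": 4, "C": 5}
edge_freq = {frozenset(("A", "B")): 6, frozenset(("A", "C")): 5}

head = pick_head(cand)
next_edge = pick_next_edge({head}, q_edges, q_label, cand, vertex_freq, edge_freq)
```

Here (u1, u2) scores 2 × 6 / 4 = 3.0 and (u1, u3) scores 2 × 5 / 5 = 2.0, so (u1, u3) is extended first.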

Experiments: Experimental Settings
Data Graphs
  - G_H: HPRD network (|V(G_H)| = 9,460, |E(G_H)| = 37,081).
  - G_S: default synthetic data graph.
  - Other synthetic data graphs generated by varying the data graph settings.
Query Graphs
  - Randomly selected subgraphs of the corresponding data graphs.
Parameter Settings (default settings in bold)
[Table: parameter settings; values not recoverable from the extraction.]

Experiments: # of exact all-matching tests
  - |Q_SAPPER|: # of exact all-matching tests by SAPPER [VLDB’10].
  - |Q_T|: # of exact all-matching tests by the EnumerateAll paradigm.
  - TSpan: # of exact all-matching tests by the EnumerateOnDemand paradigm.

Experiments: Total Processing Time
Similarity All-Matching
  - SAPPER: Generate all similarity matches.
  - TSpan+: Run TSpan first and then generate all similarity matches based on the similarity maximal matches.
Similarity Maximal All-Matching
  - NaïveTSpan: Similarity maximal all-matching with no computation sharing.
  - TSpan: Similarity maximal all-matching with computation sharing.

Experiments: Total Processing Time
Enumeration Paradigms
  - PrecTSpan: Similarity maximal all-matching by EnumerateAll.
  - TSpan: Similarity maximal all-matching by EnumerateOnDemand.
Filtering & Ordering
  - TSpanQI: TSpan algorithm with QI search ordering.
  - TSpanNF: TSpan algorithm with no filtering technique.

Experiments: Large-scale Data Graphs
TSpan on Large-scale Datasets

Conclusions
  - Tree-based Spanning Search Paradigm
  - EnumerateOnDemand Strategy
  - Filtering-based Search Ordering

Thank You! Any Questions?
