Download Presentation
## Graph Indexing: Tree + Δ ≥ Graph

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Peixiang Zhao Jeffrey Xu Yu Philip S. Yu**CS@UIUC SEEM@CUHK IBM T. J. Watson Research Center September 12th, 2007 Graph Indexing: Tree + Δ ≥ Graph VLDB’07 Vienna, Austria**Synopsis**• Introduction • Graph Containment Query • Algorithmic Framework • Related Work • Tree + Δ • Indexability of frequent Trees • Discriminative graph feature selection: Δ • Experimental Study • Conclusion VLDB’07 Vienna, Austria**Introduction**• Graph is a mathematical construct and a general data structure representing relations among entities • The emergence and the dominance of graphs asks for effective graph data management and mining tools so that users can organize, access, and analyze graph data efficiently • Structural Pattern Mining: Given a graph database, what are the potentially interesting structural patterns and how can we find them? • Graph Indexing and Search:How can we index graphs and perform searching, either exactly or approximately, in large graph databases? VLDB’07 Vienna, Austria**Introduction**• Graph Containment Query • Given a graph database G = {g1, g2, …, gN} and a query graph q, find the set • NP, since subgraph-isomorphism checking is NP-Complete • Infeasible to check subgraph isomorphism sequentially for every gi in G, especially challenging when graphs in G are large, or G is large and diverse • Graph indexing! VLDB’07 Vienna, Austria**Graph Indexing: Algorithmic Framework**• Index construction generates the index feature set F from the graph database G. For each feature f, sup(f) is maintained • Query processingis performed in a filtering-verification fashion: • The filtering phase uses indexing features contained in q to compute the candidate answer set Every graph in Cqcontains all q's indexed features. Therefore, the query answer set, sup(q), is a subset of Cq • The verification phase checks subgraph isomorphism for every graph in Cq. False positives are pruned and the true answer set sup(q) is returned VLDB’07 Vienna, Austria**Query Cost Model**• The cost of processing a graph containment query q upon G, denoted C, can be modeled as below • Cf: the filtering cost • Cv: the verification cost (NP-Complete) • Analysis • The key issue to improve query performance is to minimize |Cq| • The indexing feature set F is quite relevant to Cf and |Cq| • Index construction performance: the feature selection cost Cfs to construct F from among G VLDB’07 Vienna, Austria**Related Work**• Path-based Indexing approach • All existing paths up to a certain length lp are enumerated as indexing features • Index can be constructed efficiently • Index size is quite large when lp is not small • Limited pruning power, mainly because the structural information exhibited in graphs is lost when breaking graphs into paths • GraphGrep (PODS’02) • Graph-based Indexing approach • Subgraphs of G with different characteristics are selected as indexing features • A costly index construction process • Compact index structure • Great pruning power, since structural information of graph is well-preserved • gIndex (SIGMOD’04, PODS’05), C-Tree (ICDE’06), GString (ICDE’07), GDIndex (ICDE’07), FG-Index (SIGMOD’07) VLDB’07 Vienna, Austria**An alternative approach: (Tree + Δ)**• Tree-based Graph Indexing • Tree: Better indexability in comparison with path and graph • The majority of frequent graph-features of G are usually tree-features indeed • Frequent tree-features and graph-features share similar distributions and frequent tree-features have similar pruning power like graph-features • tree mining can be done much more efficiently than graph mining on G • Δ: On-demand select a small number of discriminative graph-features without conducting costly graph mining beforehand • Orders of magnitude smaller in index size, but performs much better than existing approaches in indexing construction and query processing VLDB’07 Vienna, Austria**Indexability of Path, Tree and Graph**• Frequent features (paths, trees, graphs) expose intrinsic characteristics of a graph database, G. They are representatives to discriminate between different groups of graphs in a graph database • Which one should we index? Path, Tree or Graph? • The frequent feature set size: | F | • The feature selection cost: Cfs • the candidate answer set size: |Cq| VLDB’07 Vienna, Austria**The Frequent Feature Set Size: | F |**• Evidences: • Among all frequent graph-features of G, a majority of them are trees indeed • All subtrees of a frequent graph are frequent • There is little chance that subtrees of frequent graph g coincide with those of frequent graph g’, due to the structural diversity and label variety • Frequent pathsshare a very small portion, because a path-feature has a simple linear structure, which has little variety in structural complexity • In terms of feature distributions, tree-features and graph-features share a very similar distribution w.r.t. feature size VLDB’07 Vienna, Austria**Experiments on Two Datasets w.r.t.| F |**The Real Dataset The Synthetic Dataset VLDB’07 Vienna, Austria**The feature selection cost: Cfs**• Given a graph database, G, and a minimum support threshold, σ, to discover the frequent feature set F (FP/ FT/ FG) from G • Tree • A good compromise between • the more expressive, but computationally harder general graph • the faster but less expressive path • Specialization of general graph avoiding undesirable theoretical properties and algorithmic complexity incurred by graph VLDB’07 Vienna, Austria**The Candidate Answer Set Size: |Cq|**• We define the pruning power power(f) of a frequent feature f as • The pruning power of a frequent feature set S = {f1, f2 , …, fn} • Theorem 1: Given a frequent graph-feature g, and let its frequent sub-tree set be T (g) = {t1, t2 , …, tn}. Then, power(g) ≥ power(T (g)) • Theorem 2: Given a frequent tree-feature t, and let its frequent sub-path set be P (t) = {p1, p2 , …, pm}. Then, power(t) ≥ power(P (t)) VLDB’07 Vienna, Austria**Pruning Power**• The pruning power of all frequent subtree features, T (g), of a frequent graph-feature g can be similar to the pruning power of g • There is a big gap between the pruning power of a graph-feature g and that of all its frequent sub-path features, P(g) The Real Dataset VLDB’07 Vienna, Austria**Indexability of Path, Tree and Graph**• It is feasible and effective to select FT, instead of FG, as indexing features for the graph containment query problem • The frequent tree-feature set, FT, dominates FG • Discovering frequent tree-features from G can be done much more efficiently than mining frequent general graph-features • FTcan contribute similar pruning power like that provided by FG VLDB’07 Vienna, Austria**Δ**Discriminative Graph Features • Consider a query graph q which contains a subgraph g • If power(T (g)) ≈power(g), there is no need to index the graph-feature g, because its subtrees jointly have the similar pruning power • if power(g) >>power(T (g)), it will be necessary to select g as an index feature because g is more discriminative than T (g), in terms of pruning • Discriminative graph-features (w.r.t. its subtree-features, controlled by ε0) are selected from queries on-demand, without mining the whole set of frequent graph-features from G beforehand • Discriminative graph-features are used as additional indexing features, denoted Δ, which can also be reused further to answer subsequent queries VLDB’07 Vienna, Austria**Discriminative Graph Selection**• Given two graphs g, g’ q , where g g’ • If the gap between power(g’) and power(g) is large enough, g’ will be reclaimed from G; • Otherwise, g is discriminative enough for pruning purpose, and there is no need to reclaim g’ in the presence of g • Approximate the discriminative computation between g’ and g, in the presence of our knowledge on frequent tree-features discovered VLDB’07 Vienna, Austria**Discriminative Graph Selection**• The occurrence probability of g in the graph database, G • the conditional occurrence probability of g’, w.r.t. g, models the probability to select g’ from G in the presence of g • The upper and lower bound of Pr(g’|g) • The conditional occurrence probability of Pr(g’|g), is solely upper-bounded by T (g’) VLDB’07 Vienna, Austria**Experimental Studies**• The Real Dataset • The AIDS antiviral screen dataset from Developmental Theroapeutics Program in NCI/NIH • 42390 compounds retrieved from DTP's Drug Information System • 63 kinds of atoms in this dataset, most of which are C, H, O, S, etc. • Three kinds of bonds are popular in these compounds: single-bond, double-bond and aromatic-bond • On average, compounds in the dataset has 43 vertices and 45 edges. • The graph of maximum size has 221 vertices and 234 edges VLDB’07 Vienna, Austria**Experimental Studies**• The real dataset: index construction VLDB’07 Vienna, Austria**Experimental Studies**• The real dataset: false positive ratio (|Cq|/|sup(q)|) w.r.t. the database size (= 1,000; 2,000; 4,000; 8,000; 10,000) VLDB’07 Vienna, Austria**Experimental Studies**• The Synthetic Dataset • Generated by a widely-used graph generator, which is controlled by the following parameters: VLDB’07 Vienna, Austria**Experimental Studies**• The synthetic dataset: false positive ratio VLDB’07 Vienna, Austria**Conclusion**• Graph indexing plays a critical role in graph containment query processing on large graph databases • Path-based and graph-based indexing approaches suffer from overly large index size, substantial index construction overhead and expensive query processing cost • (Tree+Δ) is an effective and efficient graph indexing feature to answer graph containment queries • (Tree+Δ) holds a compact index structure, achieves good performance in index construction and most importantly, provides satisfactory query performance for answering graph containment queries over large graph databases VLDB’07 Vienna, Austria**Thank you**VLDB’07 Vienna, Austria