This presentation covers mining tree-query associations in graph data: the formal problem definition, the mining algorithm, containment mappings, association rules, experimental results, and future work.
Mining for Tree-Query Associations in a Graph
Jan Van den Bussche, Hasselt University, Belgium
joint work with Bart Goethals (U Antwerp, Belgium) and Eveline Hoekx (U Hasselt, Belgium)
Graph Data
A (directed) graph over a set of nodes N is a set G of edges: ordered pairs (i, j) with i, j ∈ N.
Snapshot of a graph representing the metabolic pathway of a human.
Applications: life sciences, biology, social sciences, WWW, ...
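As a minimal sketch (not part of the original slides), such a graph can be stored as a single edge relation, matching the G(from, to) table that the later SQL examples query. The quoted identifiers are an assumption, since from and to are reserved words in most SQL dialects.

-- Sketch (assumption): the graph as one edge table, as used in the talk's SQL examples.
-- "from" and "to" are quoted because they are reserved words in most SQL dialects.
create table G (
  "from" integer not null,
  "to"   integer not null,
  primary key ("from", "to")   -- G is a set of edges, so no duplicate pairs
);

-- a few example edges: 5 -> 7, 7 -> 8, 7 -> 9
insert into G ("from", "to") values (5, 7), (7, 8), (7, 9);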
Graph Mining
Transactional category
• dataset: a set of many small graphs (transactions)
• frequency: the number of transactions in which the pattern occurs (at least once)
• ILP: Warmr [AGM, FSG, TreeMiner, gSpan, FFSM, Horvath-Ramon-Wrobel]
Single-graph category
• dataset: a single large graph
• frequency: the number of copies of the pattern in the large graph [Subdue, Vanetik-Gudes-Shimony, SEuS, SiGraM, Jeh-Widom]
The focus has been on pattern mining; little work on association rule mining!
Tree-Query Pattern
• powerful tree-shaped pattern
• inspired by conjunctive database queries
• special features:
  • existential nodes
  • parameterized nodes
• an occurrence of the pattern in G is any homomorphism from the pattern into G
• frequency: the number of distinct answers the pattern has in G (formula and example shown on slide)
Association rules
• fully fledged associations over tree-query patterns
• example (shown on slide)
Experimental results: Real-life datasets
• Food web (node and edge counts shown on slide): confidence = 89%, frequency = 176
Experimental results: Food web (node and edge counts shown on slide)
• example rules with confidences 45% and 55%
Experimental results: Real-life datasets
• Protein interaction graph (node and edge counts shown on slide): confidence = 10%
Experimental results: Protein interaction graph (node and edge counts shown on slide)
• example rule with confidence 90%
Outline rest of the talk • Formal problem definition • Algorithm • overall approach • levelwise generation of tree patterns • generation of containment mappings • generation of parameter assignments • Equivalent association rules • Certhia • Performance and Experimental results • Future work
Tree pattern (shown on slide), expressed in SQL:
select distinct G3.to as x
from G G1, G G2, G G3
where G1.from = 5
  and G1.to = G2.from
  and G1.to = G3.from
  and G2.to = 8
Frequency
frequency = 3 (the example query has 3 distinct answers in the example graph shown on the slide)
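A rough sketch (not from the slides) of how this frequency can be counted directly over the edge table: the frequency is just the number of distinct tuples in the query's answer. Table and column names follow the earlier example; the quoting of from/to is an assumption for standard SQL.

-- Sketch: count the distinct answers of the example tree query.
select count(*) as frequency
from (
  select distinct G3."to" as x
  from G G1, G G2, G G3
  where G1."from" = 5
    and G1."to" = G2."from"
    and G1."to" = G3."from"
    and G2."to" = 8
) as answers;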
Tree Query
• H: head
• P: body
• Q = (H, P)
Association Rule
• AR: Q1 ⇒ Q2
• confidence(AR) = freq(Q2) / freq(Q1)
• Q2 ⊆ Q1, i.e. { (x1,x2,x3) | Q1(x1,x2,x3) holds in G } ⊇ { (x,x,6) | Q2(x,x,6) holds in G }
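A minimal sketch (assumed table names, not from the slides) of how the confidence of a rule Q1 ⇒ Q2 could be computed once both query answers are materialized; Q1_answers and Q2_answers are hypothetical tables or views holding the answers of Q1 and Q2 over G.

-- Sketch (assumption): Q1_answers and Q2_answers hold the answer sets of Q1 and Q2.
-- confidence(Q1 => Q2) = freq(Q2) / freq(Q1), with frequency = number of distinct answers.
select
  (select count(*) from (select distinct * from Q2_answers) d2) * 1.0
  / (select count(*) from (select distinct * from Q1_answers) d1)
  as confidence;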
Examples of Association Rules (two example rules shown on slide)
Containment Mapping
• queries Q′ and Q (shown on slide)
• a containment mapping from Q′ to Q
Problem statement: Mining tree queries
Given a graph G and a threshold k, find all tree queries that have frequency at least k in G; such queries are called frequent.
Problem statement: Association rules
• Input:
  • a graph G
  • minsup
  • Qleft, frequent in G
  • minconf
• Output: all association rules Qleft ⇒ Q that are
  • frequent in G
  • confident in G
Algorithm: mining tree queries
Outer loop: generate, incrementally, all possible trees of increasing sizes; avoid generation of isomorphic trees.
Inner loop: for each newly generated tree, generate all queries based on that tree, and test their frequency.
(example trees shown on slide)
Outer loop • It is well known how to efficiently generate all trees uniquely up to isomorphism • Based on canonical form of trees. • [Scions, Li-Ruskey, Zaki, Chi-Young-Muntz]
Inner loop: Levelwise approach
• A query Q based on a tree is characterized by:
  • its set of existential nodes
  • its set of selected nodes
  • a labeling of the selected nodes by constants
• Q′ specializes Q if each of these components of Q′ extends the corresponding component of Q, and the labelings agree on the selected nodes they share
• If Q′ specializes Q, then freq(Q′) ≤ freq(Q)
• Most general query: T = (∅, ∅, ∅)
Inner loop: Candidate generation
• CanTab: the candidacy table of a candidate query; FreqTab: the frequency table of a frequent query
• Q′ is a parent of Q if the two queries agree on all but one of their components (existential nodes, selected nodes, labeled nodes), and in that remaining component Q has precisely one more node than Q′
• Join Lemma: each candidacy table can be computed by taking the natural join of its parent frequency tables.
Inner loop: Frequency counting
• Each candidacy table can be computed by a single SQL query (cf. the Join Lemma).
• Suppose G(from, to) is a table in the database; then each frequency table can also be computed with a single SQL query:
  • formulate the query's body in SQL as an expression E
  • take the natural join of E with the candidacy table CanTab
  • group by the selected nodes
  • count each group
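A rough sketch (assumed table names and schemas, not from the slides) of the Join Lemma in SQL: a candidacy table is obtained as the natural join of its parent frequency tables. FreqTab_parent1 and FreqTab_parent2 are hypothetical parents sharing the column x1; the real parents and their columns depend on the pattern at hand.

-- Sketch (assumption): two hypothetical parent frequency tables,
-- FreqTab_parent1(x1, x3, freq) and FreqTab_parent2(x1, freq).
-- Their natural join on the shared parameter column x1 yields the
-- candidacy table of the child query.
create table CanTab_child as
select p1.x1, p1.x3
from FreqTab_parent1 p1
join FreqTab_parent2 p2 on p1.x1 = p2.x1;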
Inner loop: Example (example tree pattern shown on slide)
Inner loop: Example
• Join expression: CanTab{x2}{x1,x3} = FreqTab ⋈ FreqTab ⋈ FreqTab, the natural join of its three parent frequency tables (the parents' subscripts are shown on the slide)
Inner loop: Example
• SQL expression E for the example pattern:
select distinct G1.from as x1, G2.to as x3, G3.to as x4
from G G1, G G2, G G3
where G1.to = G2.from and G3.from = G2.from
Inner loop: Example
• SQL expression for filling the frequency table:
select distinct E.x1, E.x3, count(E.x4)
from E, CanTab{x2}{x1,x3} as CT
where E.x1 = CT.x1 and E.x3 = CT.x3
group by E.x1, E.x3
having count(E.x4) >= k
Algorithm: Mining association rules
Loop 1: generate, incrementally, all possible trees T of increasing sizes.
Loop 2: for each T, generate all frequent tree patterns P based on T.
Loop 3: for each P, generate all containment mappings f from Pleft to P.
Loop 4: for each f, generate Q = (f(Hleft), P) and all parameter instantiations for the rule Qleft ⇒ Q.
Pattern database
• For each frequent pattern P, a table FreqTabP that contains all its frequent parameter instantiations (pattern database illustrated on slide).
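As a closing sketch (assumed schema, not from the slides): a pattern-database table for the running example pattern, keyed by its parameters x1 and x3 and storing the frequency of each frequent instantiation, as computed by the SQL query two slides back.

-- Sketch (assumption): frequency table for the running example pattern.
-- One row per frequent parameter instantiation (x1, x3), with its frequency.
create table FreqTab_example (
  x1   integer not null,
  x3   integer not null,
  freq integer not null,   -- number of answers for this instantiation, >= the threshold k
  primary key (x1, x3)
);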