This presentation covers mining tree-query associations in graph data: the formal problem definition, the mining algorithm, containment mappings, association rules, experimental results, and future work.
Mining for Tree-Query Associations in a Graph
Jan Van den Bussche, Hasselt University, Belgium
joint work with Bart Goethals (U Antwerp, Belgium) and Eveline Hoekx (U Hasselt, Belgium)
Graph Data
A (directed) graph over a set of nodes N is a set G of edges: ordered pairs (i, j) with i, j ∈ N.
Snapshot of a graph representing the metabolic pathway of a human.
Applications: life sciences, biology, social sciences, WWW, ...
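As a minimal sketch (not part of the original slides), such a graph can be stored as a single edge relation, matching the G(from, to) table that the later SQL examples query. The quoted identifiers are an assumption, since from and to are reserved words in most SQL dialects.

-- Sketch (assumption): the graph as one edge table, as used in the talk's SQL examples.
-- "from" and "to" are quoted because they are reserved words in most SQL dialects.
create table G (
  "from" integer not null,
  "to"   integer not null,
  primary key ("from", "to")   -- G is a set of edges, so no duplicate pairs
);

-- a few example edges: 5 -> 7, 7 -> 8, 7 -> 9
insert into G ("from", "to") values (5, 7), (7, 8), (7, 9);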
Graph Mining
Transactional category
• dataset: a set of many small graphs (transactions)
• frequency: the number of transactions in which the pattern occurs (at least once)
• ILP: Warmr [AGM, FSG, TreeMiner, gSpan, FFSM, Horvath-Ramon-Wrobel]
Single-graph category
• dataset: a single large graph
• frequency: the number of copies of the pattern in the large graph [Subdue, Vanetik-Gudes-Shimony, SEuS, SiGraM, Jeh-Widom]
The focus has been on pattern mining; little work on association rule mining!
Tree-Query Pattern
• powerful tree-shaped pattern
• inspired by conjunctive database queries
• special features:
  • existential nodes
  • parameterized nodes
• an occurrence of the pattern in G is any homomorphism from the pattern into G
• frequency: the number of distinct answers the pattern has in G (formula and example shown on slide)
Association rules
• fully fledged associations over tree-query patterns
• example (shown on slide)
Experimental results: Real-life datasets
• Food web (node and edge counts shown on slide): confidence = 89%, frequency = 176
Experimental results: Food web (node and edge counts shown on slide)
• example rules with confidences 45% and 55%
Experimental results: Real-life datasets
• Protein interaction graph (node and edge counts shown on slide): confidence = 10%
Experimental results: Protein interaction graph (node and edge counts shown on slide)
• example rule with confidence 90%
Outline rest of the talk • Formal problem definition • Algorithm • overall approach • levelwise generation of tree patterns • generation of containment mappings • generation of parameter assignments • Equivalent association rules • Certhia • Performance and Experimental results • Future work
Tree pattern (shown on slide), expressed in SQL:
select distinct G3.to as x
from G G1, G G2, G G3
where G1.from = 5
  and G1.to = G2.from
  and G1.to = G3.from
  and G2.to = 8
Frequency
frequency = 3 (the example query has 3 distinct answers in the example graph shown on the slide)
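A rough sketch (not from the slides) of how this frequency can be counted directly over the edge table: the frequency is just the number of distinct tuples in the query's answer. Table and column names follow the earlier example; the quoting of from/to is an assumption for standard SQL.

-- Sketch: count the distinct answers of the example tree query.
select count(*) as frequency
from (
  select distinct G3."to" as x
  from G G1, G G2, G G3
  where G1."from" = 5
    and G1."to" = G2."from"
    and G1."to" = G3."from"
    and G2."to" = 8
) as answers;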
Tree Query
• H: head
• P: body
• Q = (H, P)
Association Rule
• AR: Q1 ⇒ Q2
• confidence(AR) = freq(Q2) / freq(Q1)
• Q2 ⊆ Q1, i.e. { (x1,x2,x3) | Q1(x1,x2,x3) holds in G } ⊇ { (x,x,6) | Q2(x,x,6) holds in G }
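A minimal sketch (assumed table names, not from the slides) of how the confidence of a rule Q1 ⇒ Q2 could be computed once both query answers are materialized; Q1_answers and Q2_answers are hypothetical tables or views holding the answers of Q1 and Q2 over G.

-- Sketch (assumption): Q1_answers and Q2_answers hold the answer sets of Q1 and Q2.
-- confidence(Q1 => Q2) = freq(Q2) / freq(Q1), with frequency = number of distinct answers.
select
  (select count(*) from (select distinct * from Q2_answers) d2) * 1.0
  / (select count(*) from (select distinct * from Q1_answers) d1)
  as confidence;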
Examples of Association Rules (two example rules shown on slide)
Containment Mapping
• queries Q′ and Q (shown on slide)
• a containment mapping from Q′ to Q
Problem statement: Mining tree queries
Given a graph G and a threshold k, find all tree queries that have frequency at least k in G; such queries are called frequent.
Problem statement: Association rules
• Input:
  • a graph G
  • minsup
  • Qleft, frequent in G
  • minconf
• Output: all association rules Qleft ⇒ Q that are
  • frequent in G
  • confident in G
Algorithm: mining tree queries
Outer loop: generate, incrementally, all possible trees of increasing sizes; avoid generation of isomorphic trees.
Inner loop: for each newly generated tree, generate all queries based on that tree, and test their frequency.
(example trees shown on slide)
Outer loop • It is well known how to efficiently generate all trees uniquely up to isomorphism • Based on canonical form of trees. • [Scions, Li-Ruskey, Zaki, Chi-Young-Muntz]
Inner loop: Levelwise approach
• A query Q based on a tree is characterized by:
  • its set of existential nodes
  • its set of selected nodes
  • a labeling of the selected nodes by constants
• Q′ specializes Q if each of these components of Q′ extends the corresponding component of Q, and the labelings agree on the selected nodes they share
• If Q′ specializes Q, then freq(Q′) ≤ freq(Q)
• Most general query: T = (∅, ∅, ∅)
Inner loop: Candidate generation
• CanTab: the candidacy table of a candidate query; FreqTab: the frequency table of a frequent query
• Q′ is a parent of Q if the two queries agree on all but one of their components (existential nodes, selected nodes, labeled nodes), and in that remaining component Q has precisely one more node than Q′
• Join Lemma: each candidacy table can be computed by taking the natural join of its parent frequency tables.
Inner loop: Frequency counting
• Each candidacy table can be computed by a single SQL query (cf. the Join Lemma).
• Suppose G(from, to) is a table in the database; then each frequency table can also be computed with a single SQL query:
  • formulate the query's body in SQL as an expression E
  • take the natural join of E with the candidacy table CanTab
  • group by the selected nodes
  • count each group
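A rough sketch (assumed table names and schemas, not from the slides) of the Join Lemma in SQL: a candidacy table is obtained as the natural join of its parent frequency tables. FreqTab_parent1 and FreqTab_parent2 are hypothetical parents sharing the column x1; the real parents and their columns depend on the pattern at hand.

-- Sketch (assumption): two hypothetical parent frequency tables,
-- FreqTab_parent1(x1, x3, freq) and FreqTab_parent2(x1, freq).
-- Their natural join on the shared parameter column x1 yields the
-- candidacy table of the child query.
create table CanTab_child as
select p1.x1, p1.x3
from FreqTab_parent1 p1
join FreqTab_parent2 p2 on p1.x1 = p2.x1;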
Inner loop: Example (example tree pattern shown on slide)
Inner loop: Example
• Join expression: CanTab{x2}{x1,x3} = FreqTab ⋈ FreqTab ⋈ FreqTab, the natural join of its three parent frequency tables (the parents' subscripts are shown on the slide)
Inner loop: Example
• SQL expression E for the example pattern:
select distinct G1.from as x1, G2.to as x3, G3.to as x4
from G G1, G G2, G G3
where G1.to = G2.from and G3.from = G2.from
Inner loop: Example
• SQL expression for filling the frequency table:
select distinct E.x1, E.x3, count(E.x4)
from E, CanTab{x2}{x1,x3} as CT
where E.x1 = CT.x1 and E.x3 = CT.x3
group by E.x1, E.x3
having count(E.x4) >= k
Algorithm: Mining association rules
Loop 1: generate, incrementally, all possible trees T of increasing sizes.
Loop 2: for each T, generate all frequent tree patterns P based on T.
Loop 3: for each P, generate all containment mappings f from Pleft to P.
Loop 4: for each f, generate Q = (f(Hleft), P) and all parameter instantiations for the rule Qleft ⇒ Q.
Pattern database
• For each frequent pattern P, a table FreqTabP that contains all its frequent parameter instantiations (pattern database illustrated on slide).
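As a closing sketch (assumed schema, not from the slides): a pattern-database table for the running example pattern, keyed by its parameters x1 and x3 and storing the frequency of each frequent instantiation, as computed by the SQL query two slides back.

-- Sketch (assumption): frequency table for the running example pattern.
-- One row per frequent parameter instantiation (x1, x3), with its frequency.
create table FreqTab_example (
  x1   integer not null,
  x3   integer not null,
  freq integer not null,   -- number of answers for this instantiation, >= the threshold k
  primary key (x1, x3)
);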