Frequent Structure Mining

Frequent Structure Mining Robert Howe University of Vermont Spring 2014

Original Authors • This presentation is based on the paper Zaki MJ (2002). Efficiently mining frequent trees in a forest. Proceedings of the 8th ACM SIGKDD International Conference. • The author’s original presentation was used to make this one. • I further adapted this from Ahmed R. Nabhan’s modifications.

Outline • Graph Mining Overview • Mining Complex Structures - Introduction • Motivation and Contributions of author • Problem Definition and Case Examples • Main Ingredients for Efficient Pattern Extraction • Experimental Results • Conclusions

Why Graph Mining? • Graphs are convenient structures that can represent many complex relationships. • We are drowning in graph data: • Social Networks • Biological Networks • World Wide Web

Facebook Data • [Figure: network graph with groups labeled UVM, High School, and BU] • (Source: Wolfram|Alpha Facebook Report)

Facebook Data • (Source: Wolfram|Alpha Facebook Report)

Biological Data • (Source: KEGG Pathways Database)

Some Graph Mining Problems • Pattern Discovery • Graph Clustering • Graph Classification and Label Propagation • Structure and Dynamics of Evolving Graphs

Graph Mining Framework • Mining graph patterns is a fundamental problem in data mining. • [Diagram labels: Graph Data, Mine, Exponential Pattern Space, Select, Relevant Patterns, Structure Indices, Exploratory Task, Clustering, Classification]

Basic Concepts • Graph – A graph G is a 3-tuple G = (V, E, L) where • V is the finite set of nodes. • E ⊆ V × V is the set of edges. • L is a labeling function for edges and nodes. • Subgraph – A graph G1 = (V1, E1, L1) is a subgraph of G2 = (V2, E2, L2) iff: • V1 ⊆ V2 • E1 ⊆ E2 • L1(v) = L2(v) for all v ∈ V1. • L1(e) = L2(e) for all e ∈ E1. • [Figure: a graph on nodes A, B, C, D and a second graph on nodes A, B, C]
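The subgraph conditions above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: the dictionary-based graph representation and the name `is_subgraph` are assumptions chosen for illustration.

```python
def is_subgraph(g1, g2):
    """Return True if g1 is a subgraph of g2 under the definition above.

    Assumed representation (hypothetical, for illustration only):
      g["nodes"] = {node: label}
      g["edges"] = {(u, v): label}
    """
    n1, n2 = g1["nodes"], g2["nodes"]
    e1, e2 = g1["edges"], g2["edges"]
    # V1 is a subset of V2, and node labels agree on V1.
    if any(v not in n2 or n2[v] != lab for v, lab in n1.items()):
        return False
    # E1 is a subset of E2, and edge labels agree on E1.
    return all(e in e2 and e2[e] == lab for e, lab in e1.items())


big = {"nodes": {1: "A", 2: "B", 3: "C", 4: "D"},
       "edges": {(1, 2): "x", (2, 3): "y", (3, 4): "x"}}
small = {"nodes": {1: "A", 2: "B", 3: "C"},
         "edges": {(1, 2): "x", (2, 3): "y"}}
relabeled = {"nodes": {1: "A", 2: "Z"}, "edges": {(1, 2): "x"}}

print(is_subgraph(small, big))      # True
print(is_subgraph(relabeled, big))  # False: node 2 carries a different label
```

Note that this tests the subset-based definition only; it does not search for a matching embedding, which is the (much harder) subgraph isomorphism problem.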

Basic Concepts • Graph Isomorphism – “A bijection between the vertex sets of G1 and G2 such that any two vertices u and v which are adjacent in G1 are also adjacent in G2.” (Wikipedia) • Subgraph Isomorphism is even harder (NP-Complete!) • [Figure: two graphs, one with nodes A to E and one with nodes 1 to 5]

Basic Concepts • Graph Isomorphism – Let G1 = (V1, E1, L1) and G2 = (V2, E2, L2). A graph isomorphism is a bijective function f: V1 → V2 satisfying • L1(u) = L2(f(u)) for all u ∈ V1. • For each edge e1 = (u, v) ∈ E1, there exists e2 = (f(u), f(v)) ∈ E2 such that L1(e1) = L2(e2). • For each edge e2 = (u, v) ∈ E2, there exists e1 = (f⁻¹(u), f⁻¹(v)) ∈ E1 such that L1(e1) = L2(e2).
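Verifying that a given mapping is an isomorphism is easy; finding one is the hard part. As a rough sketch under an assumed dictionary representation (the function name `is_isomorphism` and the graph layout are hypothetical), the three conditions can be tested like this:

```python
def is_isomorphism(f, g1, g2):
    """Check whether the mapping f (a dict from V1 to V2) is a labeled
    graph isomorphism. Assumed representation (illustrative only):
      g["nodes"] = {node: label}, g["edges"] = {(u, v): label}
    """
    n1, n2 = g1["nodes"], g2["nodes"]
    e1, e2 = g1["edges"], g2["edges"]
    # f must be a bijection from V1 onto V2.
    images = set(f.values())
    if set(f) != set(n1) or images != set(n2) or len(images) != len(f):
        return False
    # Node labels are preserved: L1(u) = L2(f(u)).
    if any(n1[u] != n2[f[u]] for u in n1):
        return False
    # Map every edge of G1 through f. Dict equality with E2 covers both
    # directions: each G1 edge lands on an equally labeled G2 edge, and
    # no G2 edge is left without a preimage.
    mapped = {(f[u], f[v]): lab for (u, v), lab in e1.items()}
    return mapped == e2


g1 = {"nodes": {"a": "A", "b": "B", "c": "C"},
      "edges": {("a", "b"): "x", ("b", "c"): "y"}}
g2 = {"nodes": {1: "B", 2: "C", 3: "A"},
      "edges": {(3, 1): "x", (1, 2): "y"}}

good = {"a": 3, "b": 1, "c": 2}
bad = {"a": 3, "b": 2, "c": 1}
print(is_isomorphism(good, g1, g2))  # True
print(is_isomorphism(bad, g1, g2))   # False: labels of b and c do not match
```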

Discovering Subgraphs • TreeMiner and gSpan both employ subgraph or substructure pattern mining. • Graph or subgraph isomorphism can be used as an equivalence relation between two structures. • There is an exponential number of subgraph patterns inside a larger graph: a graph with n nodes has 2^n node subsets, before the choice of edges is even considered. • Finding frequent subgraphs (or subtrees) tends to be useful in data mining.

Outline • Graph Mining Overview • Mining Complex Structures - Introduction • Motivation and Contributions of author • Problem Definition and Case Examples • Main Ingredients for Efficient Pattern Extraction • Experimental Results • Conclusions

Mining Complex Structures • Frequent structure mining tasks • Item sets – Transactional, unordered data. • Sequences – Temporal/positional, text, biological sequences. • Tree Patterns – Semi-structured data, web mining, bioinformatics, etc. • Graph Patterns – Bioinformatics, Web Data • “Frequent” is a broad term • Maximal or closed patterns in dense data • Correlation and other statistical metrics • Interesting, rare, non-redundant patterns.

Anti-Monotonicity • The black line is always decreasing • A monotonic function is a consistently increasing or decreasing function*. • The author refers to a monotonically decreasing function as anti-monotonic. • The frequency of a super-graph cannot be greater than the frequency of a subgraph (similar to Apriori). • * Very Informal Definition • (Source: SIGMOD ’08)

Outline • Graph Mining Overview • Mining Complex Structures - Introduction • Motivation and Contributions of author • Problem Definition and Case Examples • Main Ingredients for Efficient Pattern Extraction • Experimental Results • Conclusions

Tree Mining – Motivation • Capture intricate (subspace) patterns • Can be used (as features) to build global models (classification, clustering, etc.) • Ideally suited for categorical, high-dimensional, complex, and massive data. • Interesting Applications • Semi-structured Data – Mine structure and content • Web usage mining – Log mining (user sessions as trees) • Bioinformatics – RNA secondary structures, Phylogenetic trees • (Source: University of Washington)

Classification Example • Subgraph patterns can be used as features for classification. • “Hexagons are a commonly occurring subgraph in organic compounds.” • Off-the-shelf classifiers (like neural networks or genetic algorithms) can be trained on the resulting feature vectors, with one entry per subgraph pattern. • Feature selection is very useful too.

Contributions • Systematic subtree enumeration. • Extensions for mining unlabeled or unordered subtrees or sub-forests. • Optimizations • Representing trees as strings. • Scope-lists for subtree occurrences.

Outline • Graph Mining Overview • Mining Complex Structures - Introduction • Motivation and Contributions of author • Problem Definition and Case Examples • Main Ingredients for Efficient Pattern Extraction • Experimental Results • Conclusions

How does searching for patterns work? • Start with graphs with small sizes. • Extend size-k graphs by one node to generate size-(k + 1) candidate patterns. • Use a scoring function to evaluate each candidate. • A popular scoring function is one that defines the minimum support. Only graphs with frequency of at least minisup are kept.

How does searching for patterns work? • “The generation of size k + 1 subgraph candidates from size k frequent subgraphs is more complicated and more costly than that of itemsets” – Yan and Han (2002), on gSpan • Where do we add a new edge? • It is possible to add a new edge to a pattern and then find that the resulting pattern doesn’t exist in the database. • The main story of this presentation is on good candidate generation strategies.

TreeMiner • TreeMiner uses a technique for numbering tree nodes based on DFS. • This numbering is used to encode trees as vectors. • Subtrees sharing a common prefix (i.e. the same leading entries in their vectors) form an equivalence class. • Generate new candidate (k + 1)-subtrees from equivalence classes of k-subtrees (as in Apriori).

TreeMiner • This is important because candidate subtrees are generated only once! • (Remember the subgraph isomorphism problem that makes it likely to generate the same pattern over and over)

Definitions • Tree – An undirected graph where there is exactly one path between any two vertices. • Rooted Tree – Tree with a special node called root. • This tree has no root node. • It is an unrooted tree. • This tree has a root node. • It is a rooted tree.

Definitions • Ordered Tree – The ordering of a node’s children matters. • Example: XML Documents • Exercise – Prove that ordered trees must be rooted. • [Figure: two trees on nodes v1, v2, v3 drawn with different orderings and marked as not equal (≠)]

Definitions • Labeled Tree – Nodes have labels. • Rooted trees also have some special terminology. • Parent – The node one edge closer to the root. • Ancestor – A node n edges closer to the root, for any n ≥ 1. • Siblings – Two distinct nodes with the same parent. • [Figure: tree with nodes marked ancestor, parent, sibling, and embedded sibling] • In Prolog: • ancestor(X,Y) :- parent(X,Y). • ancestor(X,Y) :- parent(Z,Y), ancestor(X,Z). • sibling(X,Y) :- parent(Z,X), parent(Z,Y), X \= Y.

Definitions • Embedded Siblings – Two nodes sharing a common ancestor. • Numbering – The node’s position in a traversal (normally DFS) of the tree. • A node has a number ni and a label L(ni). • Scope – The scope of a node nl is [l, r], where nr is the rightmost leaf under nl (again, DFS numbering).
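DFS numbering and scopes can both be computed in a single preorder traversal. The sketch below is an illustration only: the function name `number_and_scope`, the child-list representation, and the shape of the nine-node example tree (in particular, v7 being a child of v6) are assumptions rather than details taken from the slides.

```python
def number_and_scope(children, root):
    """Assign DFS (preorder) numbers and scopes to every node.

    children: {node: [child, ...]}, children listed left to right.
    Returns (number, scope), where scope[node] = (l, r): l is the node's
    own DFS number and r is the number of the rightmost leaf under it.
    """
    number, scope = {}, {}
    counter = 0

    def visit(node):
        nonlocal counter
        number[node] = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)
        # Once the whole subtree has been visited, the last number handed
        # out belongs to the rightmost leaf under this node.
        scope[node] = (number[node], counter - 1)

    visit(root)
    return number, scope


# Hypothetical example tree; node names match their DFS numbers.
children = {"v0": ["v1", "v6"], "v1": ["v2", "v3"],
            "v3": ["v4", "v5"], "v6": ["v7"], "v7": ["v8"]}
number, scope = number_and_scope(children, "v0")
print(scope["v0"])  # (0, 8)
print(scope["v1"])  # (1, 5)
print(scope["v4"])  # (4, 4): a leaf's scope is just its own number
```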

Definitions • Embedded Subtrees – S = (Ns, Bs) is an embedded subtree of T = (N, B) if and only if the following conditions are met: • Ns ⊆ N (the nodes have to be a subset). • b = (nx, ny) ∊ Bs iff nx is an ancestor of ny. • For each subset of nodes Ns there is one embedded subtree or subforest. • [Figure: tree T with nodes v0 to v8, and the embedded subtree on nodes v1, v4, v5] • (Colors are only on this graph to show corresponding nodes)
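One way to build the pattern for a node subset Ns is to link every selected node to its nearest selected ancestor. That rule is an interpretation of the definition above, and the function name `embedded_pattern`, the parent-map representation, and the nine-node example tree are all assumptions made for this sketch.

```python
def embedded_pattern(parent, selected):
    """Build the embedded pattern induced by a node subset.

    parent: {node: parent_node} for every non-root node of T.
    selected: the node subset Ns.
    Returns (edges, roots). A single root means the pattern is an
    embedded subtree; several roots mean it is a subforest.
    """
    selected = set(selected)
    edges, roots = [], []
    for node in sorted(selected):
        # Walk upward until a selected ancestor (or the top) is reached.
        anc = parent.get(node)
        while anc is not None and anc not in selected:
            anc = parent.get(anc)
        if anc is None:
            roots.append(node)
        else:
            edges.append((anc, node))
    return edges, roots


# Hypothetical tree; nodes are named by their DFS numbers.
parent = {1: 0, 2: 1, 3: 1, 4: 3, 5: 3, 6: 0, 7: 6, 8: 7}

print(embedded_pattern(parent, {1, 4, 5}))
# ([(1, 4), (1, 5)], [1])        one root: embedded subtree
print(embedded_pattern(parent, {1, 4, 7, 8}))
# ([(1, 4), (7, 8)], [1, 7])     two roots: subforest
```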

Definitions • Match Label – The node numbers (DFS numbers) in T of the nodes in S with matching labels. • A match label uniquely identifies a subtree. • This is useful because a labeling function must be surjective but will not necessarily be bijective, so several nodes may share the same label. • Example: {v1, v4, v5} or {1, 4, 5} • [Figure: tree T with nodes v0 to v8, and the embedded subtree on nodes v1, v4, v5] • (Colors are only on this graph to show corresponding nodes)

Definitions • Subforest – A disconnected pattern generated in the same way as an embedded subtree. • [Figure: tree T with nodes v0 to v8, and the subforest on nodes v1, v4, v7, v8] • (Colors are only on this graph to show corresponding nodes)

Problem Definition • Given a database (forest) D of trees, find all frequent embedded subtrees. • Frequent – Occurring a minimum number of times (use user-defined minisup). • Support(S) – The number of trees in D that contain at least one occurrence of S. • Weighted-Support(S) – The number of occurrences of S across all trees in D.
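The difference between the two support measures is easiest to see on a small count table. In this sketch the per-tree occurrence counts are made-up numbers, and the function names are hypothetical; how the occurrences are actually found is a separate problem.

```python
def support(occurrences):
    """Number of trees that contain at least one occurrence of S."""
    return sum(1 for count in occurrences.values() if count > 0)


def weighted_support(occurrences):
    """Total number of occurrences of S across all trees."""
    return sum(occurrences.values())


def is_frequent(occurrences, minisup):
    """S is frequent if its support reaches the user-defined threshold."""
    return support(occurrences) >= minisup


# Hypothetical counts of pattern S in a forest of three trees.
occ = {"T0": 2, "T1": 0, "T2": 1}
print(support(occ))           # 2
print(weighted_support(occ))  # 3
print(is_frequent(occ, 2))    # True
```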

Exercise • Generate an embedded subtree or subforest for the set of nodes Ns = {v1, v2, v5}. Is this an embedded subtree or subforest, and why? Assume a labeling function L(x) = x. • [Figure: tree T with nodes v0 to v8, and the resulting pattern on nodes v1, v2, v5] • This is an embedded subtree because all of the nodes are connected. • (*Cough* Exam Question *Cough*)

Outline • Graph Mining Overview • Mining Complex Structures - Introduction • Motivation and Contributions of author • Problem Definition and Case Examples • Main Ingredients for Efficient Pattern Extraction • Experimental Results • Conclusions

Main Ingredients • Pattern Representation • Trees as strings • Candidate Generation • No duplicates. • Pattern Counting • Scope-based List (TreeMiner) • Pattern-based Matching (PatternMatcher)

String Representation • With N nodes, M branches, and a max fanout of F: • An adjacency matrix takes (N)(F + 1) space. • An adjacency list requires 4N – 2 space. • A tree of (node, child, sibling) requires 3N space. • String representation requires 2N – 1 space.

String Representation • String representation is labels with a backtrack operator, –1. • [Figure: example tree with node labels 0, 1, 1, 2, 2, 2, 3]
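As an illustration of the encoding, the sketch below emits a node's label on the way down and a –1 each time the traversal returns from a child. The function name `encode` and the two small example trees are assumptions, not taken from the slides. With N nodes this produces N labels and N – 1 backtracks, which matches the 2N – 1 space figure.

```python
BACKTRACK = -1  # labels are assumed to be non-negative integers


def encode(labels, children, root):
    """Encode a rooted, ordered, labeled tree as a list of tokens.

    labels: {node: label}; children: {node: [child, ...]} left to right.
    """
    out = []

    def visit(node):
        out.append(labels[node])
        for child in children.get(node, []):
            visit(child)
            out.append(BACKTRACK)  # return from the child to this node

    visit(root)
    return out


# Hypothetical trees: a root with two children, and a three-node chain.
fan = encode({"r": 1, "a": 2, "b": 4}, {"r": ["a", "b"]}, "r")
chain = encode({"r": 1, "a": 2, "b": 3}, {"r": ["a"], "a": ["b"]}, "r")
print(fan)    # [1, 2, -1, 4, -1]
print(chain)  # [1, 2, 3, -1, -1]
```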

Candidate Generation • Equivalence Classes – Two subtrees are in the same equivalence class iff they share a common prefix string P up to the (k – 1)-th node. • This gives us simple equivalence testing of a fixed-size array. • Fast and parallel – Can be run on a GPU. • Caveat – The order of the tree matters.

Candidate Generation • Generate new candidate (k + 1)-subtrees from equivalence classes of k-subtrees. • Consider each pair of elements in a class, including self-extensions. • Up to two new candidates for each pair of joined elements. • All possible candidate subtrees are enumerated. • Each subtree is generated only once!

Candidate Generation • Each class is represented in memory by a prefix string and a set of ordered pairs indicating nodes that exist in that class. • A class is extended by applying a join operator ⊗ on all ordered pairs in the class.

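Candidate Generation • Equivalence Class • Prefix string: 1 2 • [Figure: two 3-node trees sharing the prefix (1, 2), one extended with a node labeled 3 and the other with a node labeled 4] • This generates two elements: (3, v1) and (4, v0) • The element notation can be confusing because the first item is a label and the second item is the DFS number of the node to which the new node is attached.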
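The (label, position) elements can be recovered from the string encoding by splitting off the last node. The following sketch assumes the token-list encoding with –1 as the backtrack operator and non-negative integer labels; the names `split_last_node` and `group_by_prefix` are hypothetical.

```python
from collections import defaultdict


def split_last_node(tokens):
    """Split an encoded k-subtree into (prefix, element).

    prefix: encoding of the first k - 1 nodes, trailing backtracks dropped.
    element: (label of the k-th node, DFS number of the node it hangs from).
    """
    path = []        # DFS numbers on the path from the root to the cursor
    n = -1           # DFS number of the most recently seen node
    last_idx = None  # token index of the last label
    attach = None    # DFS number of the last node's parent
    for idx, tok in enumerate(tokens):
        if tok == -1:
            path.pop()
        else:
            n += 1
            last_idx = idx
            attach = path[-1] if path else None
            path.append(n)
    prefix = list(tokens[:last_idx])
    while prefix and prefix[-1] == -1:
        prefix.pop()
    return tuple(prefix), (tokens[last_idx], attach)


def group_by_prefix(trees):
    """Group encoded subtrees into equivalence classes keyed by prefix."""
    classes = defaultdict(list)
    for tokens in trees:
        prefix, element = split_last_node(tokens)
        classes[prefix].append(element)
    return dict(classes)


classes = group_by_prefix([[1, 2, 3, -1, -1], [1, 2, -1, 4, -1]])
print(classes)  # {(1, 2): [(3, 1), (4, 0)]}
```

Both trees fall into the class with prefix (1, 2), giving the elements (3, v1) and (4, v0) from the slide.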

Candidate Generation • Theorem 1. Define a join operator ⊗ on two elements as (x, i) ⊗ (y, j). Then apply one of the following cases: • If i = j and P is not empty, add (y, j) and (y, j + 1) to class [Px]. If P is empty, only add (y, j + 1) to [Px]. • If i > j, add (y, j) to [Px]. • If i < j, no new candidate is possible.

Candidate Generation • Consider the prefix class from the previous example: P = (1, 2) which contains two elements, (3, v1) and (4, v0). • Join (3, v1) ⊗ (3, v1) – Case (1) applies, producing (3, v1) and (3, v2) for the new class P3 = (1, 2, 3). • Join (3, v1) ⊗ (4, v0) – Case (2) applies. (Don’t worry, there’s an illustration on the next slide.)

Candidate Generation • A class with prefix (1, 2) contains a node with label 3. This is written as (3, v1), meaning a node labeled ‘3’ is attached to the node at position 1 in the DFS order of nodes. • [Figure: the self-join (3, v1) ⊗ (3, v1)] • Prefix = (1, 2, 3) • New nodes = (3, v2), (3, v1)

Candidate Generation • [Figure: the join (3, v1) ⊗ (4, v0), added to the result of the self-join] • Prefix = (1, 2, 3) • New nodes = (3, v2), (3, v1), (4, v0)
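The joins in this example can be reproduced with a few lines of Python. This is only a sketch: it transcribes the three cases of Theorem 1 exactly as worded on the slide, it has been checked against this one example only, and the names `join` and `extend_class` are hypothetical. Elements are written as (label, position) tuples with integer positions.

```python
def join(a, b, prefix_empty=False):
    """Apply (x, i) ⊗ (y, j) and return the elements added to class [Px]."""
    (x, i), (y, j) = a, b
    if i == j:
        # Case 1: same attachment point.
        if prefix_empty:
            return [(y, j + 1)]
        return [(y, j), (y, j + 1)]
    if i > j:
        # Case 2: y hangs from a node closer to the root than x does.
        return [(y, j)]
    # Case 3: i < j, no new candidate is possible.
    return []


def extend_class(elements, a, prefix_empty=False):
    """Join element a with every element of its class, including itself."""
    out = []
    for b in elements:
        out.extend(join(a, b, prefix_empty))
    return out


# Class with prefix (1, 2) and elements (3, v1), (4, v0).
elements = [(3, 1), (4, 0)]
print(join((3, 1), (3, 1)))            # [(3, 1), (3, 2)]
print(join((3, 1), (4, 0)))            # [(4, 0)]
print(extend_class(elements, (3, 1)))  # [(3, 1), (3, 2), (4, 0)]
```

The last line gives the contents of the new class with prefix (1, 2, 3), matching the elements listed on the slide.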

The Algorithm
TreeMiner( D, minisup ):
    F1 = { frequent 1-subtrees }
    F2 = { classes [P]1 of frequent 2-subtrees }
    for all [P]1 ∈ F2 do
        Enumerate-Frequent-Subtrees( [P]1 )
Enumerate-Frequent-Subtrees( [P] ):
    for each element (x, i) ∈ [P] do
        [Px] = ∅
        for each element (y, j) ∈ [P] do
            R = { (x, i) ⊗ (y, j) }
            L(R) = { L(x) ∩⊗ L(y) }
            if for any R ∈ R, R is frequent, then [Px] = [Px] ∪ {R}
        Enumerate-Frequent-Subtrees( [Px] )

ScopeList Join • Recall that the scope of a node is the interval from its own DFS number to that of its highest-numbered descendant (the rightmost leaf under it). • This can be used to calculate support. • [Figure: tree T annotated with scopes: v0 [0, 8], v1 [1, 5], v2 [2, 2], v3 [3, 5], v4 [4, 4], v5 [5, 5], v7 [7, 8], v8 [8, 8]]

ScopeList Join • ScopeLists are used to calculate support. • Let x and y be nodes with scopes sx = [lx, ux], sy = [ly, uy]. • sx contains sy iff lx ≤ ly and ux ≥ uy. • A scope list represents the entire forest.
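The containment test is a pair of integer comparisons, which is what makes scope-based joins cheap. A minimal sketch, using a hypothetical function name `contains`, scopes written as (lower, upper) tuples, and example values in the style of the scope figure:

```python
def contains(sx, sy):
    """True iff scope sx = (lx, ux) contains scope sy = (ly, uy)."""
    lx, ux = sx
    ly, uy = sy
    return lx <= ly and ux >= uy


# Example scopes, written as (lower, upper) DFS numbers.
v1, v4, v7 = (1, 5), (4, 4), (7, 8)
print(contains(v1, v4))  # True: v4 lies under v1
print(contains(v7, v4))  # False: v4 is outside the subtree of v7
print(contains(v4, v1))  # False: containment is not symmetric
```

Under this definition a scope also contains itself, so code that needs a strict ancestor test would have to exclude equal scopes.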
