- By
**jude** - Follow User

- 115 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Frequent Structure Mining' - jude

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Outline

Original Authors

- This presentation is based on the paper

Zaki MJ (2002). Efficiently mining frequent trees in a forest. Proceedings of the 8th ACM SIGKDD International Conference.

- The author’s original presentation was used to make this one.
- I further adapted this from Ahmed R. Nabhan’s modifications.

Outline

- Graph Mining Overview
- Mining Complex Structures - Introduction
- Motivation and Contributions of author
- Problem Definition and Case Examples
- Main Ingredients for Efficient Pattern Extraction
- Experimental Results
- Conclusions

Outline

- Graph Mining Overview
- Mining Complex Structures - Introduction
- Motivation and Contributions of author
- Problem Definition and Case Examples
- Main Ingredients for Efficient Pattern Extraction
- Experimental Results
- Conclusions

Why Graph Mining?

- Graphs are convenient structures that can represent many complex relationships.
- We are drowning in graph data:
- Social Networks
- Biological Networks
- World Wide Web

- (Source: Wolfram|Alpha Facebook Report)

- (Source: KEGG Pathways Database)

Some Graph Mining Problems

- Pattern Discovery
- Graph Clustering
- Graph Classification and Label Propagation
- Structure and Dynamics of Evolving Graphs

Graph Mining Framework

- Mining graph patterns is a fundamental problem in data mining.

- Exponential Pattern Space

- Relevant Patterns

- Mine

- Select

- Graph Data

- Structure Indices

- Exploratory Task

- Clustering

- Classification

- A

- B

- B

- C

- C

- D

- Graph – A graph G is a 3-tuple G = (V, E, L) where
- V is the finite set of nodes.
- E ⊆ V × V is the set of edges
- L is a labeling function for edges and nodes.
- Subgraph – A graph G1 = (V1, E1, L1) is a subgraph of G2 = (V2, E2, L2) iff:
- V1 ⊆ V2
- E1 ⊆ E2
- L1(v) = L2(v) for all v ∈ V1.
- L1(e) = L2(e) for all e ∈ E1.

- A

- 5

- B

- 4

- C

- D

- E

- 1

- 2

- Graph Isomorphism – “A bijection between the vertex sets of G1 and G2 such that any two vertices u and v which are adjacent in G1 are also adjacent in G2.” (Wikipedia)

- Subgraph Isomorphism is even harder (NP-Complete!)

Basic Concepts

- Graph Isomorphism – Let G1 = (V1, E1, L1) and G2 = (V2, E2, L2). A graph isomorphism is a bijective function f: V1 → V2 satisfying
- L1(u) = L1( f (u)) for all u ∈ V1.
- For each edge e1 = (u,v) ∈ E1, there exists e2 = ( f(u), f(v)) ∈ E2 such that L1(e1) = L2(e2).
- For each edge e2 = (u,v) ∈ E2, there exists e1 = ( f –1(u), f –1(v)) ∈ E1 such that L1(e1) = L2(e2).

Discovering Subgraphs

- TreeMiner and gSpan both employ subgraph or substructure pattern mining.
- Graph or subgraph isomorphism can be used as an equivalence relation between two structures.
- There is an exponential number of subgraph patterns inside a larger graph (as there are 2n node subsets in each graph and then there are edges.)
- Finding frequent subgraphs (or subtrees) tends to be useful in data mining.

Outline

- Graph Mining Overview
- MiningComplex Structures - Introduction
- Motivation and Contributions of author
- Problem Definition and Case Examples
- Main Ingredients for Efficient Pattern Extraction
- Experimental Results
- Conclusions

Mining Complex Structures

- Frequent structure mining tasks
- Item sets – Transactional, unordered data.
- Sequences – Temporal/positional, text, biological sequences.
- Tree Patterns – Semi-structured data, web mining, bioinformatics, etc.
- Graph Patterns – Bioinformatics, Web Data
- “Frequent” is a broad term
- Maximal or closed patterns in dense data
- Correlation and other statistical metrics
- Interesting, rare, non-redundant patterns.

Anti-Monotonicity

- The black line is always decreasing

- A monotonic function is a consistently increasing or decreasing function*.
- The author refers to a monotonically decreasing function as anti-monotonic.
- The frequency of a super-graph cannot be greater than the frequency of a subgraph (similar to Apriori).

- * Very Informal Definition

- (Source: SIGMOD ’08)

Outline

- Graph Mining Overview
- Mining Complex Structures - Introduction
- Motivation andContributions of author
- Problem Definition and Case Examples
- Main Ingredients for Efficient Pattern Extraction
- Experimental Results
- Conclusions

Tree Mining – Motivation

- Capture intricate (subspace) patterns
- Can be used (as features) to build global models (classification, clustering, etc.)
- Ideally suited for categorical, high-dimensional, complex, and massive data.
- Interesting Applications
- Semi-structured Data – Mine structure and content
- Web usage mining – Log mining (user sessions as trees)
- Bioinformatics – RNA secondary structures, Phylogenetic trees

- (Source: University of Washington)

Classification Example

- Subgraph patterns can be used as features for classification.

- “Hexagons are a commonly occurring subgraph in organic compounds.”

- Off-the-shelf classifiers (like neural networks or genetic algorithms) can be trained using these vectors.
- Feature selection is very useful too.

Contributions

- Systematic subtree enumeration.
- Extensions for mining unlabeled or unordered subtrees or sub-forests.
- Optimizations
- Representing trees as strings.
- Scope-lists for subtree occurrences.

Outline

- Graph Mining Overview
- Mining Complex Structures - Introduction
- Motivation and Contributions of author
- ProblemDefinition and Case Examples
- Main Ingredients for Efficient Pattern Extraction
- Experimental Results
- Conclusions

How does searching for patterns work?

- Start with graphs with small sizes.
- Extend k-size graphs by one node to generate k + 1 candidate patterns.
- Use a scoring function to evaluate each candidate.
- A popular scoring function is one that defines the minimum support. Only graphs with frequency greater than minisup are kept.

How does searching for patterns work?

- “The generation of size k + 1 subgraph candidates from size k frequent subgraphs is more complicated and more costly than that of itemsets” – Yan and Han (2002), on gSpan
- Where do we add a new edge?
- It is possible to add a new edge to a pattern and then find that doesn’t exist in the database.
- The main story of this presentation is on good candidate generation strategies.

TreeMiner

- TreeMiner uses a technique for numbering tree nodes based on DFS.
- This numbering is used to encode trees as vectors.
- Subtrees sharing a common prefix (e.g. the first k numbers in vectors) form an equivalence class.
- Generate new candidate (k + 1)-subtrees from equivalence classes of k-subtrees (e.g. Apriori)

TreeMiner

- This is important because candidate subtrees are generated only once!
- (Remember the subgraph isomorphism problem that makes it likely to generate the same pattern over and over)

Definitions

- Tree – An undirected graph where there is exactly one path between any two vertices.
- Rooted Tree – Tree with a special node called root.

- This tree has no root node.
- It is an unrooted tree.

- This tree has a root node.
- It is a rooted tree.

Definitions

- Ordered Tree – The ordering of a node’s children matters.
- Example: XML Documents
- Exercise – Prove that ordered trees must be rooted.

- ≠

- v2

- v1

- v3

- v1

- v2

- v3

Definitions

- Labeled Tree – Nodes have labels.
- Rooted trees also have some special terminology.
- Parent – The node one closer to the root.
- Ancestor – The node n edges closer to the root, for any n.
- Siblings – Two nodes with the same parent.

- ancestor

- embedded sibling

- parent

- embedded sibling

- sibling

- ancestor(X,Y) :-
- parent(X,Y).
- ancestor(X,Y) :-
- parent(Z,Y),
- ancestor(X,Z).
- sibling(X,Y) :-
- parent(Z,X),
- parent(Z,Y).

Definitions

- Embedded Siblings – Two nodes sharing a common ancestor.
- Numbering – The node’s position in a traversal (normally DFS) of the tree.
- A node has a number ni and a label L(ni).
- Scope – The scope of a node nl is [l, r], where nris the rightmost leaf under nl (again, DFS numbering).

Definitions

- v0

- Embedded Subtrees – S = (Ns, Bs) is an embedded subtree of T = (N, B)if and only if the following conditions are met:
- Ns ⊆ N (the nodes have to be a subset).
- b = (nx, ny) ∊ Bs iff nx is an ancestor of ny.
- For each subset of nodes Ns there is one embedded subtree or subforest.

- v1

- v6

- v7

- v8

- v2

- v3

- v4

- v5

- subtree

- v1

- v4

- v5

- (Colors are only on this graph to show corresponding nodes)

Definitions

- v0

- Match Label – The node numbers (DFS numbers) in T of the nodes in S with matching labels.
- A match label uniquely identifies a subtree.
- This is useful because a labeling function must be surjective but will not necessarily be bijective.

{v1, v4, v5} or {1, 4, 5}

- v1

- v6

- v7

- v8

- v2

- v3

- v4

- v5

- subtree

- v1

- v4

- v5

- (Colors are only on this graph to show corresponding nodes)

Definitions

- v0

- Subforest – A disconnected pattern generated in the same way as an embedded subtree.

- v1

- v6

- v7

- v8

- v2

- v3

- v4

- v5

- subforest

- v1

- v7

- v4

- v8

- (Colors are only on this graph to show corresponding nodes)

Problem Definition

- Given a database (forest) D of trees, find all frequent embedded subtrees.
- Frequent – Occurring a minimum number of times (use user-defined minisup).
- Support(S) – The number of trees in D that contain at least one occurrence of S.
- Weighted-Support(S) – The number of occurrences of S across all trees in D.

- v0

- v1

- v6

- v7

- v2

- v5

- v8

- v2

- v3

Generate an embedded subtree or subforest for the set of nodes Ns = {v1, v2, v5}. Is this an embedded subtree or subforest, and why? Assume a labeling function L(x) = x.

- v4

- v5

- This is an embedded subtree because all of the nodes are connected.

- (*Cough* Exam Question *Cough*)

Outline

- Graph Mining Overview
- Mining Complex Structures - Introduction
- Motivation and Contributions of author
- Problem Definition and Case Examples
- MainIngredients for Efficient Pattern Extraction
- Experimental Results
- Conclusions

Main Ingredients

- Pattern Representation
- Trees as strings
- Candidate Generation
- No duplicates.
- Pattern Counting
- Scope-based List (TreeMiner)
- Pattern-based Matching (PatternMatcher)

String Representation

- With N nodes, M branches, and a max fanout of F:
- An adjacency matrix takes (N)(F + 1) space.
- An adjacency list requires 4N – 2 space.
- A tree of (node, child, sibling) requires 3N space.
- String representation requires 2N – 1 space.

Candidate Generation

- Equivalence Classes – Two subtrees are in the same equivalence class iff they share a common prefix string P up to the (k – 1)-th node.
- This gives us simple equivalence testing of a fixed-size array.
- Fast and parallel – Can be run on a GPU.
- Caveat – The order of the tree matters.

Candidate Generation

- Generate new candidate (k + 1)-subtrees from equivalence classes of k-subtrees.
- Consider each pair of elements in a class, including self-extensions.
- Up to two new candidates for each pair of joined elements.
- All possible candidate subtrees are enumerated.
- Each subtree is generated only once!

Candidate Generation

- Each class is represented in memory by a prefix string and a set of ordered pairs indicating nodes that exist in that class.
- A class is extended by applying a join operator ⊗ on all ordered pairs in the class.

Candidate Generation

- Equivalence Class
- Prefix String 12

- 1

- 1

- 2

- 4

- 2

- 3

- This generates two elements: (3, v1) and (4, v0)
- The element notation can be confusing because the first item is a label and the second item is a DFS node number.

Candidate Generation

- Theorem 1. Define a join operator ⊗ on two elements as (x, i) ⊗ (y, j). Then apply one of the following cases:
- If i = j and P is not empty, add (y, j) and (y, j + 1) to class [Px]. If P is empty, only add (y, j + 1) to [Px].
- If i > j, add (y, j) to [Px].
- If i < j, no new candidate is possible.

Candidate Generation

- Consider the prefix class from the previous example: P = (1, 2) which contains two elements, (3, v1) and (4, v0).
- Join (3, v1) ⊗ (3, v1) – Case (1) applies, producing (3, v1) and (3, v2) for the new class P3 = (1, 2, 3).
- Join (3, v1) ⊗ (4, v0) – Case (2) applies.

(Don’t worry, there’s an illustration on the next slide.)

- 1

- 2

- 2

- 1

- 1

- 3

- 3

- =

- ⊗

- 2

- 2

- 3

- 3

- 3

- A class with prefix {1,2} contains a node with label 3. This is written as (3, v1), meaning a node labeled ‘3’ is added at position 1 in DFS order of nodes.

- 3

- Prefix = (1, 2, 3)
- New nodes = (3, v2), (3, v1)

The Algorithm

TreeMiner( D, minisup ):

F1 = { frequent 1-subtrees}

F2 = { classes [P]1 of frequent 2-subtrees }

for all [P]1 ∈ E do

Enumerate-Frequent-Subtrees( [P]1 )

Enumerate-Frequent-Subtrees( [P] ):

for each element (x, i) ∈ [P] do

[Px] = ∅

for each element (y, j) ∈ [P] do

R = { (x, i) ⊗ (y, j) }

L(R) = { L(x) ∩⊗ L(y) }

if for any R ∈ R, R is frequent, then

[Px] = [Px] ∪ {R}

Enumerate-Frequent-Subtrees( [Px] )

ScopeList Join

- Recall that the scope is the interval between the lowest numbered child (or self) node and the highest numbered child node, using DFS numbering.
- This can be used to calculate support.

- [0, 8]

- v0

- [1, 5]

- v1

- v6

- v7

- [7, 8]

- [2, 2]

- v8

- [8, 8]

- v2

- v3

- [3, 5]

- v4

- v5

- [4, 4]

- [5, 5]

ScopeList Join

- ScopeLists are used to calculate support.
- Let x and y be nodes with scopes sx = [lx, ux], sy = [ly, uy].
- sx contains syifflx ≤ ly and ux ≥ uy.
- A scope list represents the entire forest.

ScopeList Join

- A ScopeList is a list of (t, m, s) 3-tuples.
- t is the tree ID.
- m is the match label of the (k – 1)-length prefix of xk.
- s is the scope of the last item, xk.
- The use of scope lists allows constant time computations of whether y is a descendent or embedded sibling of x.

Outline

- Graph Mining Overview
- Mining Complex Structures - Introduction
- Motivation and Contributions of author
- Problem Definition and Case Examples
- Main Ingredients for Efficient Pattern Extraction
- Experimental Results
- Conclusions

Experimental Results

- Machine: 500Mhz PentiumII, 512MB memory, 9GB disk, RHEL 6.0
- Synthetic Data: Web browsing
- Parameters: N = #Labels, M = #Nodes, F = Max Fanout, D = Max Depth, T = #Trees
- Create master website tree W
- For each node in W, generate #children (0 to F)
- Assign probabilities of following each child or to backtrack; adding up to 1
- Recursively continue until D is reached
- Generate a database of T subtrees of W
- Start at root. Recursively at each node generate a random number
- (0 – 1) to decide which child to follow or to backtrack.
- Default parameters: N=100, M=10,000, D=10, F=10, T=100,000
- Three Datasets: D10 (all default values), F5 (F=5), T1M (T=106)
- Real Data: CSLOGS – 1 month web log files at RPI CS

- Over 13361 pages accessed (#labels)
- Obtained 59,691 user browsing trees (#number of trees)
- Average string length of 23.3 per tree

Distribution of Frequent Trees

- F5: Max-Fanout = 5
- T1M: 106 Trees

- Sparse

- Dense

- Take-Home Point: Many large, frequent trees can be discovered.

Experiments (Sparse)

- Sparse

- Dense

- Take-Home Point: Both algorithms are able to cope with relatively short patterns in sparse data.

Experiments (Dense)

- Sparse
- (Artificial Dataset)

- Dense
- (Real-World Dataset)

- Take-Home Point: Long patterns at low-support (length=20); the level-wise approach suffers.
- The authors use the artificial dataset to justify TreeMiner as 20 times faster than pattern matcher.

- Graph Mining Overview
- Mining Complex Structures - Introduction
- Motivation and Contributions of author
- Problem Definition and Case Examples
- Main Ingredients for Efficient Pattern Extraction
- Experimental Results
- Conclusions

Conclusions

- TreeMiner: A novel tree mining approach.
- Non-duplicate candidate generation.
- Scope-List joins for frequency comparison.
- Framework for tree-mining tasks
- Frequent subtrees in a forest of rooted, labeled, ordered trees.
- Frequent subtrees in a single tree.
- There are extensions for unlabeled and unordered trees.

Caveats

- Frequent does not always mean significant!
- Exhaustive enumeration is a problem even though the candidate generation in TreeMiner is efficient.
- Low min_sup values increases true positives at the cost of increasing false positives.
- State-of-the-art graph miners make use of the structure of the search space (e.g. the LEAP search algorithm) to extract only significant structures.
- Candidate structures can be generated by tree miners and evaluated by some other mean.

- v0

- v1

- v6

- v7

- v2

- v5

- v8

- v2

- v3

Generate an embedded subtree or subforest for the set of nodes Ns = {v1, v2, v5}. Is this an embedded subtree or subforest, and why? Assume a labeling function L(x) = x.

- v4

- v5

- This is an embedded subtree because all of the nodes are connected.

Question Two

- Why is the frequency of subgraphs a good function to evaluate candidate patterns? How could it be better?

- Answer. The frequency of subgraphs is a monotonically decreasing function, meaning supergraphs are not more frequent than subgraphs. This is a desirable property combined with a minimum support threshold to reduce the search space as subgraph patterns get bigger.
- However, frequency does not always imply significance – another metric must be used to evaluate the candidates generated by a graph miner for significance.

Question Three

- How is a string representation of a tree useful in graph mining? What requirements does it place on the graph?

- Answer. A string representation of a tree is useful because string comparisons are worst-case O(n) and can be easily optimized. However, it requires that a tree be rooted and ordered, because otherwise the string comparison operator would not be valid.

Download Presentation

Connecting to Server..