
Efficiently Mining Frequent Trees in a Forest


Presentation Transcript


  1. Efficiently Mining Frequent Trees in a Forest. Mohammed J. Zaki

  2. Frequent Structure Mining (FSM) • Deals with extracting patterns (associations, sequences, frequent trees, graphs, etc.) from massive databases • Typical applications: • Bioinformatics • Web mining • Mining semi-structured documents

  3. Tree Mining Problems • Goal: to efficiently enumerate all frequent subtrees in a forest (a database D of trees) for a given minimum support (minsup). • The support of a subtree S is the number of trees in D that contain at least one occurrence of S. • A subtree S is frequent if its support is greater than or equal to a user-specified minsup value.
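
As a small illustration (my own, not from the slides), the support test follows directly from this definition; here occurrences are assumed to be tuples whose first component is the tree id.

```python
# Minimal sketch of the support / frequency test from the definition above.
# `occurrences` is any list of matches whose first component is the tree id.

def support(occurrences):
    return len({tree_id for tree_id, *_ in occurrences})

def is_frequent(occurrences, minsup):
    return support(occurrences) >= minsup

# Two occurrences in tree 2 still count only once toward the support.
print(support([(0, "match"), (2, "match"), (2, "another match")]))  # prints 2
```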

  4. Rooted, Ordered & Labeled Trees • A tree is an acyclic connected graph. • Rooted: there exists one vertex, the root, which is distinguished from all the others. • Ordered: the children of each node in a rooted tree are ordered. • Labeled: each node is associated with a label. • Every tree in the paper is a rooted, ordered and labeled tree.

  5. Definition of Subtrees • We denote a tree as T = (N, B), where N is a set of labeled nodes and B is a set of branches. • We say that a tree S = (Ns, Bs) is an embedded subtree of T = (N, B) if: • Ns is a subset of N, and • a branch (x, y) appears in Bs iff x and y lie on the same path from the root to a leaf in T (i.e., x is an ancestor of y in T). • A disconnected pattern is a sub-forest of T. • Hence, embedded subtrees allow not only direct parent-child branches, but also ancestor-descendant branches.

  6. Examples of Subtrees • [Figure: an example tree T; an embedded subtree S of T; and a disconnected pattern, which is not a subtree but a sub-forest of T.]

  7. Node Numbers and Labels • Each node has a well-defined number, i, according to its position in a depth-first traversal of the tree. • The label of each node is taken from a set of labels L = {0, 1, …, m-1}; it represents the value of the node. • [Figure: the example tree annotated with its DFS node numbers 0–7 and node labels.]

  8. Scope of a Node • The scope of a node ni is given as [i, r], i.e., the lower bound is the position (i) of the node itself and the upper bound is the position (r) of its right-most leaf node. • Assume two nodes x and y have the scopes Sx = [ix, rx] and Sy = [iy, ry]. • Sx is strictly less than (<) Sy iff rx < iy, i.e., Sx occurs before Sy; this means y is an embedded sibling of x. • Sx contains Sy iff ix <= iy and rx >= ry; this means y is a descendant of x. • [Figure: the example tree annotated with node scopes [0,7], [1,4], [5,7], [2,3], [4,4], [6,7], [7,7], [3,3].]
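
A minimal sketch (mine, not from the slides) of the two scope comparisons, assuming a scope is stored as a (lower, upper) pair of integers:

```python
# Scope comparisons for nodes x and y, with scopes given as (lower, upper) pairs.

def strictly_before(sx, sy):
    """Sx < Sy: x's subtree ends before y's begins, so y is an embedded sibling of x."""
    return sx[1] < sy[0]

def contains(sx, sy):
    """Sx contains Sy: y lies inside x's subtree, so y is a descendant of x."""
    return sx[0] <= sy[0] and sx[1] >= sy[1]

print(strictly_before((1, 4), (5, 7)))  # True: the node with scope [5,7] comes after [1,4]
print(contains((0, 7), (6, 7)))         # True: the node with scope [6,7] is a descendant of [0,7]
```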

  9. Representing Trees as Strings • To create the string encoding of a tree, denoted t, we perform a depth-first search starting (and also ending) at the root, adding the current node's label to t. Whenever we backtrack from a child to its parent, we add a special symbol –1 to the string. • The string encoding of the example tree: 0 2 1 1 –1 –1 1 –1 –1 4 3 –1 2 –1 –1 • [Figure: the example tree with node labels 0, 2, 4, 1, 1, 3, 2, 1.]
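
A short Python sketch (my own, not from the slides) of this encoding procedure; the nested (label, children) representation of the example tree below is my reading of the figure:

```python
# Encode a rooted, ordered, labeled tree, given as nested (label, [children]) pairs,
# by a depth-first walk that appends -1 whenever we backtrack from a child.

def encode(tree):
    label, children = tree
    out = [label]
    for child in children:
        out += encode(child) + [-1]   # -1 marks the backtrack from this child
    return out

# Assumed structure of the example tree: root 0 with subtrees rooted at 2 and 4.
T = (0, [(2, [(1, [(1, [])]), (1, [])]),
         (4, [(3, []), (2, [])])])
print(encode(T))   # [0, 2, 1, 1, -1, -1, 1, -1, -1, 4, 3, -1, 2, -1, -1]
```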

  10. Equivalence Classes • Two k-subtrees X, Y are in the same prefix equivalence class iff they share a common prefix up to the (k-1)th node. • Prefix string: 2 1 0 –1 3 • The following three subtrees are in the same prefix equivalence class: • 2 1 0 –1 3 –1 –1 x –1 // (x, 0) • 2 1 0 –1 3 –1 x –1 –1 // (x, 1) • 2 1 0 –1 3 x –1 –1 –1 // (x, 3) • Element list: (label, position of the prefix node to which the last node x is attached): (x, 0); (x, 1); (x, 3) • A valid element x may be attached only to nodes that lie on the path from the root to the right-most leaf of the prefix. • [Figure: the prefix tree 2 1 0 –1 3 with the valid attachment points for x; attaching x to any other node is not a valid element.]
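
To make the element notation concrete, here is a small sketch (my own helper, not from the paper) that rebuilds a subtree's encoding from its class prefix and one element; it reproduces the three encodings listed above:

```python
# Rebuild a subtree's string encoding from its class prefix and an element
# (label, pos), where pos is the DFS number of the prefix node the new node
# attaches to; pos must lie on the right-most path of the prefix.

def encode_with_element(prefix, label, pos):
    out, stack, node_id = [], [], 0
    for sym in prefix:                 # replay the prefix, tracking the open right-most path
        out.append(sym)
        if sym == -1:
            stack.pop()
        else:
            stack.append(node_id)
            node_id += 1
    while stack[-1] != pos:            # close nodes below the attachment point
        stack.pop()
        out.append(-1)
    out += [label, -1]                 # emit the new node and backtrack from it
    out += [-1] * (len(stack) - 1)     # close every remaining node except the root
    return out

prefix = [2, 1, 0, -1, 3]
for pos in (0, 1, 3):                  # the three elements (x, 0), (x, 1), (x, 3)
    print(encode_with_element(prefix, 'x', pos))
# [2, 1, 0, -1, 3, -1, -1, 'x', -1]
# [2, 1, 0, -1, 3, -1, 'x', -1, -1]
# [2, 1, 0, -1, 3, 'x', -1, -1, -1]
```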

  11. Candidate Generation • Goal: given an equivalence class of k-subtrees, obtain candidate (k+1)-subtrees. • Main idea: consider each pair of elements in the class for extension, including self-extension.
• Theorem: Assume elements are kept sorted with node label as the primary key and position as the secondary key. Let P be a prefix class, and let (x, i) and (y, j) denote any two elements in the class. Px denotes the class representing the extension with element (x, i). Define (y, j) join (x, i) as follows:
• Case I (i = j): 1) If P is not empty, add (y, j) and (y, j+1) to Px. 2) If P is empty, add (y, j) to Px.
• Case II (i > j): add (y, j) to Px.
• Case III (i < j): no new element is possible in this case.
• The Theorem has a mistake (see the example on the next slide).
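
For illustration, a direct transcription (mine) of this extension rule into Python, exactly as stated on the slide, including the (y, j+1) part that the presenter flags as mistaken:

```python
# Class-extension rule as stated on the slide: (y, j) join (x, i)
# returns the elements that are added to the class Px.

def join(x_elem, y_elem, prefix_is_empty):
    x, i = x_elem
    y, j = y_elem
    if i == j:                        # Case I: both elements attach at the same prefix node
        if not prefix_is_empty:
            return [(y, j), (y, j + 1)]
        return [(y, j)]
    if i > j:                         # Case II
        return [(y, j)]
    return []                         # Case III (i < j): no new element

# The elements from the example on the next slide (prefix 1 2, elements (3,1) and (4,0)):
print(join((3, 1), (3, 1), prefix_is_empty=False))  # [(3, 1), (3, 2)]
print(join((3, 1), (4, 0), prefix_is_empty=False))  # [(4, 0)]   -> (4,0) join (3,1)
print(join((4, 0), (3, 1), prefix_is_empty=False))  # []         -> (3,1) join (4,0)
```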

  12. Candidate Generation Example • Prefix: 1 2 • Element list: (3, 1); (4, 0) • Joining the elements yields two new classes: • Prefix = 1 2 3: (3, 1) join (3, 1) gives elements (3, 1) and (3, 2); (4, 0) join (3, 1) gives element (4, 0). • Prefix = 1 2 –1 4: (4, 0) join (4, 0) gives elements (4, 0) and (4, 2). • If we add (y, j+1), i.e., (4, 1), we get the following tree: 1 2 4 –1 4, wrong! Position 1 of the prefix 1 2 –1 4 (the node labeled 2) does not lie on the path from the root to its right-most leaf, so (4, 1) is not a valid element; the correct second element is (4, 2).

  13. TreeMiner Algorithm
• TreeMiner(D (a database of trees, i.e., a forest), minsup):
• F1 = { frequent 1-subtrees };
• F2 = { classes [P]1 of frequent 2-subtrees };
• For all classes [P]1, do Enumerate-Frequent-Subtrees([P]1);
• Enumerate-Frequent-Subtrees([P]), which computes Fk:
• For each element (x, i) ∈ [P] do
• For each element (y, j) ∈ [P] do
• (y, j) join (x, i) => at most two new candidate subtrees;
• For each candidate subtree, do scope-list joins;
• If it is frequent, add the subtree to the set of frequent subtrees.
• This is repeated until all frequent subtrees have been enumerated.
• Notation: P is a prefix class; [P]1 means the prefix has size 1, i.e., only one node in the prefix. Px refers to the new prefix class formed by adding (x, i) to P. Fk is the set of all frequent subtrees of size k.

  14. An Example Database for the TreeMiner Algorithm
• Database D of 3 trees: T0, T1, T2. [Figure: the three trees drawn with their node labels.]
• D in horizontal format, as (tree-id, string encoding) pairs:
• (T0, 1 2 –1 3 4 –1 –1)
• (T1, 2 1 2 –1 4 –1 –1 2 –1 3 –1)
• (T2, 1 3 2 –1 –1 5 1 2 –1 3 4 –1 –1 –1 –1)
• D in vertical format, as (tree-id, scope) pairs for each label:
• Label 1: (0, [0,3]); (1, [1,3]); (2, [0,7]); (2, [4,7])
• Label 2: (0, [1,1]); (1, [0,5]); (1, [2,2]); (1, [4,4]); (2, [2,2]); (2, [5,5])
• Label 3: (0, [2,3]); (1, [5,5]); (2, [1,2]); (2, [6,7])
• Label 4: (0, [3,3]); (1, [3,3]); (2, [7,7])
• Label 5: (2, [3,7])
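
A sketch (my own helper, not the paper's code) that derives the vertical format from the horizontal string encodings; running it reproduces the scope-lists above:

```python
# Build the vertical format (label -> list of (tree_id, scope)) from the
# horizontal format (tree_id, string encoding). A node's scope is [its own
# DFS number, the DFS number of its right-most (last) descendant].

from collections import defaultdict

def vertical_format(db):
    vert = defaultdict(list)
    for tree_id, encoding in db:
        nodes, stack = [], []          # stack holds the currently open nodes
        for sym in encoding:
            if sym == -1:
                node = stack.pop()     # backtrack: this node's scope is now final
                node[3] = len(nodes) - 1
            else:
                node = [len(nodes), sym, len(nodes), len(nodes)]  # [number, label, lower, upper]
                nodes.append(node)
                stack.append(node)
        while stack:                   # close nodes whose trailing -1s were omitted
            node = stack.pop()
            node[3] = len(nodes) - 1
        for _, label, lo, hi in nodes:
            vert[label].append((tree_id, (lo, hi)))
    return dict(vert)

D = [(0, [1, 2, -1, 3, 4, -1, -1]),
     (1, [2, 1, 2, -1, 4, -1, -1, 2, -1, 3, -1]),
     (2, [1, 3, 2, -1, -1, 5, 1, 2, -1, 3, 4, -1, -1, -1, -1])]
print(vertical_format(D)[2])   # the scope-list of label 2
# [(0, (1, 1)), (1, (0, 5)), (1, (2, 2)), (1, (4, 4)), (2, (2, 2)), (2, (5, 5))]
```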

  15. Scope-List Joins • Example with minsup = 100% (i.e., a pattern must occur in all 3 trees).
• Step 1: Calculate F1. Prefix = {}; element list: (1, –1), (2, –1), (3, –1), (4, –1); infrequent element: (5, –1). The scope-list of each frequent label is its column of the vertical format; an entry such as 0, [0,3] gives the tree id (0) and the node scope ([0,3]).
• Step 2: Calculate F2. Suppose Prefix = {1}; element list: (2, 0), (4, 0); infrequent elements: (1, 0), (3, 0). Scope-list entries now have the form tree id, prefix position, element scope; e.g., 0, 0, [1,1] means tree id 0, the prefix {1} matched at node 0, and the element node has scope [1,1].
• Scope-list for 1 2: (0, 0, [1,1]); (1, 1, [2,2]); (2, 0, [2,2]); (2, 0, [5,5]); (2, 4, [5,5])
• Scope-list for 1 4: (0, 0, [3,3]); (1, 1, [3,3]); (2, 0, [7,7]); (2, 4, [7,7])
• Step 3: Calculate F3. Suppose Prefix = {1 2}; element list: (4, 0); infrequent elements: (2, 0), (2, 1), (4, 1). Entries have the form tree id, prefix match positions, element scope; e.g., 0, 01, [3,3] means tree id 0, the prefix {1 2} matched at nodes 0 and 1, and the element node has scope [3,3].
• Scope-list for 1 2 –1 4: (0, 01, [3,3]); (1, 12, [3,3]); (2, 02, [7,7]); (2, 05, [7,7]); (2, 45, [7,7])
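
As an illustration (my own code, a simplification of the paper's scope-list join), the in-scope join that produces the F2 scope-lists above can be written with the contains test from the scope slide; the L1 and L2 inputs are the label-1 and label-2 scope-lists of the example database:

```python
# In-scope (descendant) join of two scope-lists. An F1 scope-list entry is
# (tree_id, (lower, upper)); the result entries are (tree_id, prefix_position,
# element_scope), i.e., occurrences of "x with embedded descendant y".

def contains(sx, sy):
    return sx[0] <= sy[0] and sx[1] >= sy[1]

def descendant_join(lx, ly):
    out = []
    for tx, sx in lx:
        for ty, sy in ly:
            if tx == ty and contains(sx, sy):
                out.append((tx, sx[0], sy))
    return out

L1 = [(0, (0, 3)), (1, (1, 3)), (2, (0, 7)), (2, (4, 7))]               # label 1
L2 = [(0, (1, 1)), (1, (0, 5)), (1, (2, 2)), (1, (4, 4)),
      (2, (2, 2)), (2, (5, 5))]                                          # label 2
matches = descendant_join(L1, L2)       # the scope-list of the pattern 1 2
print(matches)
print(len({t for t, _, _ in matches}))  # support = 3 distinct trees, so 1 2 is frequent
```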

  16. Conclusion • Introduces the notion of mining embedded subtrees in a (forest) database of trees. • Systematic candidate subtree generation: no subtree is generated more than once (but the stated theorem has a mistake). • Uses a string encoding of trees to store the dataset efficiently. • Uses a node's scope to develop scope-lists. • Introduces a new algorithm: TreeMiner.
