Efficiently Mining Long Patterns from Databases

Roberto J. Bayardo Jr., IBM Almaden Research Center. Abstract: Max-Miner scales roughly linearly in the number of maximal patterns, irrespective of the length of the longest pattern.


Presentation Transcript


  1. Efficiently Mining Long Patterns from Databases
     Roberto J. Bayardo Jr., IBM Almaden Research Center

  2. Abstract
     • Max-Miner: scales roughly linearly in the number of maximal patterns, irrespective of the length of the longest pattern
     • Previous algorithms: scale exponentially with the length of the longest pattern

  3. Introduction (Cont’d)
     • Existing pattern-mining algorithms were developed to operate on databases where the longest patterns are relatively short
     • Interesting data-sets with long patterns:
       • sales transactions detailing the purchases made by regular customers over a large time window
       • biological data from the fields of DNA and protein analysis

  4. Introduction (Cont’d)
     • Apriori-like algorithms are inadequate on data-sets with long patterns
       • they perform a bottom-up search
       • they enumerate every single frequent itemset
     • Exponential complexity
       • to produce a frequent itemset of length l, the algorithm must produce all 2^l of its subsets, since they too must be frequent
       • this restricts Apriori-like algorithms to discovering only short patterns
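To make the 2^l growth concrete, here is a minimal Python sketch. The helper names `nonempty_subsets` and `subset_count` are illustrative assumptions and do not come from the presentation.

```python
from itertools import combinations


def nonempty_subsets(itemset):
    """Yield every non-empty subset of `itemset` as a frozenset."""
    items = sorted(itemset)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def subset_count(length):
    """Number of subsets (empty set included) of an itemset with `length` items."""
    return 2 ** length


if __name__ == "__main__":
    # A frequent 4-itemset already implies 15 non-empty frequent subsets.
    print(len(list(nonempty_subsets({1, 2, 3, 4}))))
    # The count doubles with every additional item in the pattern.
    for length in (10, 20, 40):
        print(length, subset_count(length))
```

A bottom-up miner that reaches a pattern of length 40 would therefore have to pass through about 10^12 frequent subsets, which is why such algorithms are limited to short patterns in practice.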

  5. Introduction
     • Max-Miner algorithm
       • efficiently extracts only the maximal frequent itemsets
       • scales roughly linearly in the number of maximal frequent itemsets
       • “looks ahead” rather than searching strictly bottom-up
       • by identifying a long frequent itemset early on, it can prune all of that itemset's subsets from consideration

  6. Max-Miner (Cont’d)
     • Built on Rymon’s generic set-enumeration tree search framework (e.g., Figure 1)
     • Breadth-first search, in order to limit the number of passes
     • Pruning strategies
       • subset-infrequency pruning (as in Apriori)
       • superset-frequency pruning

  7. Figure 1. Complete set-enumeration tree over four items (nodes listed level by level)
     { }
     1   2   3   4
     1,2   1,3   1,4   2,3   2,4   3,4
     1,2,3   1,2,4   1,3,4   2,3,4
     1,2,3,4
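The tree in Figure 1 can be generated mechanically: each node is extended only with items that come after its last item in the ordering. The following is a minimal sketch of that idea; the function name `set_enumeration_tree` and the tuple representation of itemsets are assumptions made for illustration.

```python
def set_enumeration_tree(items):
    """Return the nodes of the set-enumeration tree, one list per level.

    Each node is a tuple of items. A node is extended only with items that
    follow its last item in `items`, so every itemset appears exactly once.
    """
    items = list(items)
    levels = [[()]]          # level 0 holds the root, the empty itemset
    frontier = [((), 0)]     # (itemset, index of the first item it may still add)
    while frontier:
        next_frontier = []
        for head, start in frontier:
            for pos in range(start, len(items)):
                next_frontier.append((head + (items[pos],), pos + 1))
        if next_frontier:
            levels.append([node for node, _ in next_frontier])
        frontier = next_frontier
    return levels


if __name__ == "__main__":
    for level in set_enumeration_tree([1, 2, 3, 4]):
        print(level)
```

Walking the levels in order corresponds to the breadth-first traversal that Max-Miner uses over this tree.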

  8. Max-Miner (Cont’d)
     • Each node is represented by a candidate group g, consisting of:
       • head, h(g): the itemset enumerated by the node
       • tail, t(g): an ordered set containing all items not in h(g) that can potentially appear in any sub-node
     • Example: for the node enumerating itemset {1}, h(g) = {1} and t(g) = {2, 3, 4}

  9. Max-Miner
     • Counting the support of a candidate group g
       • compute the support of the itemsets h(g), h(g) ∪ t(g), and h(g) ∪ {i} for all i ∈ t(g)
     • Superset-frequency pruning
       • halt sub-node expansion at any candidate group g for which h(g) ∪ t(g) is frequent
     • Subset-infrequency pruning
       • remove any tail item i for which h(g) ∪ {i} is infrequent from the candidate group before expanding its sub-nodes
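The counting and pruning steps on a single candidate group can be sketched as follows. This is an illustrative sketch, not the presentation's implementation: the `CandidateGroup` class, the helper names, and the use of an absolute count `minsup` as the frequency threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class CandidateGroup:
    head: frozenset   # h(g): the itemset enumerated by the node
    tail: tuple       # t(g): ordered items that may still be added


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def count_group(g, transactions):
    """Support of h(g), of h(g) union t(g), and of h(g) union {i} per tail item."""
    return {
        "head": support(g.head, transactions),
        "full": support(g.head | frozenset(g.tail), transactions),
        "extensions": {i: support(g.head | {i}, transactions) for i in g.tail},
    }


def prune(g, transactions, minsup):
    """Apply superset-frequency pruning, then subset-infrequency pruning."""
    counts = count_group(g, transactions)
    if counts["full"] >= minsup:
        # h(g) union t(g) is frequent, so no sub-node needs to be expanded.
        return "superset-frequent", g.head | frozenset(g.tail)
    # Drop each tail item i whose extension h(g) union {i} is infrequent.
    kept = tuple(i for i in g.tail if counts["extensions"][i] >= minsup)
    return "expand", CandidateGroup(g.head, kept)
```

For example, with transactions {1,2,3}, {1,2,3}, {1,2,4}, {1,3} and `minsup = 2`, the group with head {1} and tail (2, 3, 4) loses item 4, and the resulting group with tail (2, 3) is then pruned as a whole because {1, 2, 3} is frequent.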

  10. MAX-MINER (Data-set T)
      ;; Returns the set of maximal frequent itemsets present in T
      Set of Candidate Groups C ← { }
      Set of Itemsets F ← {GEN-INITIAL-GROUPS(T, C)}
      while C is non-empty do
          scan T to count the support of all candidate groups in C
          for each g ∈ C such that h(g) ∪ t(g) is frequent do
              F ← F ∪ {h(g) ∪ t(g)}
          Set of Candidate Groups Cnew ← { }
          for each g ∈ C such that h(g) ∪ t(g) is infrequent do
              F ← F ∪ {GEN-SUB-NODES(g, Cnew)}
          C ← Cnew
          remove from F any itemset with a proper superset in F
          remove from C any group g such that h(g) ∪ t(g) has a superset in F
      return F
      Figure 2. Max-Miner at its top level

  11. GEN-INITIAL-GROUPS (Data-set T, Set of Candidate Groups C)
      ;; C is passed by reference and returns the candidate groups
      ;; The return value of the function is a frequent 1-itemset
      scan T to obtain F1, the set of frequent 1-itemsets
      impose an ordering on the items in F1
      for each item i in F1 other than the greatest item do
          let g be a new candidate with h(g) = {i}
              and t(g) = {j | j follows i in the ordering}
          C ← C ∪ {g}
      return the itemset in F1 containing the greatest item
      Figure 3. Generating the initial candidate groups

  12. GEN-SUB-NODES (Candidate Group g, Set of Cand. Groups C)
      ;; C is passed by reference and returns the sub-nodes of g
      ;; The return value of the function is a frequent itemset
      remove any item i from t(g) if h(g) ∪ {i} is infrequent
      reorder the items in t(g)
      for each i ∈ t(g) other than the greatest do
          let g’ be a new candidate with h(g’) = h(g) ∪ {i}
              and t(g’) = {j | j ∈ t(g) and j follows i in t(g)}
          C ← C ∪ {g’}
      return h(g) ∪ {m} where m is the greatest item in t(g), or h(g) if t(g) is empty
      Figure 4. Generating sub-nodes
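Figures 2 through 4 can be combined into a small runnable Python sketch. It follows the structure of the pseudocode under several assumptions: `minsup` is an absolute transaction count, candidate groups are plain `(head, tail)` pairs, ties in the item ordering are broken by item value, and each itemset is counted with its own pass over the data instead of one shared scan per level. It illustrates the control flow and makes no attempt to reproduce the algorithm's performance.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def gen_initial_groups(transactions, minsup, groups):
    """Append the initial (head, tail) groups to `groups` (Figure 3)."""
    items = {i for t in transactions for i in t}
    sup = {i: support(frozenset([i]), transactions) for i in items}
    # Order frequent items by increasing support, most frequent last.
    f1 = sorted((i for i in items if sup[i] >= minsup), key=lambda i: (sup[i], i))
    for pos, item in enumerate(f1[:-1]):
        groups.append((frozenset([item]), tuple(f1[pos + 1:])))
    return frozenset([f1[-1]]) if f1 else None


def gen_sub_nodes(group, transactions, minsup, new_groups):
    """Append the sub-nodes of `group` to `new_groups` (Figure 4)."""
    head, tail = group
    sup = {i: support(head | {i}, transactions) for i in tail}
    # Subset-infrequency pruning, then reordering by increasing support.
    tail = sorted((i for i in tail if sup[i] >= minsup), key=lambda i: (sup[i], i))
    for pos, item in enumerate(tail[:-1]):
        new_groups.append((head | {item}, tuple(tail[pos + 1:])))
    return head | {tail[-1]} if tail else head


def max_miner(transactions, minsup):
    """Return the set of maximal frequent itemsets (Figure 2)."""
    transactions = [frozenset(t) for t in transactions]
    groups = []
    first = gen_initial_groups(transactions, minsup, groups)
    frequent = set() if first is None else {first}
    while groups:
        new_groups = []
        for head, tail in groups:
            full = head | frozenset(tail)
            if support(full, transactions) >= minsup:
                frequent.add(full)   # superset-frequency pruning
            else:
                frequent.add(gen_sub_nodes((head, tail), transactions, minsup, new_groups))
        # Keep only itemsets that have no proper superset in F.
        frequent = {s for s in frequent if not any(s < other for other in frequent)}
        # Drop groups whose h(g) union t(g) is already covered by an itemset in F.
        groups = [(h, t) for h, t in new_groups
                  if not any(h | frozenset(t) <= f for f in frequent)]
    return frequent


if __name__ == "__main__":
    data = [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}, {1, 3}, {2, 4}, {5}]
    print(max_miner(data, 2))
```

On the sample data with `minsup = 2`, the maximal frequent itemsets are {1, 2, 3} and {2, 4}, and the search ends after the first level because both are found by lookahead.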

  13. Item Ordering Policies
      • Goal: increase the effectiveness of superset-frequency pruning by positioning the most frequent items last
      • Ordering in Gen-Initial-Groups: increasing order of sup({i})
      • Reordering in Gen-Sub-Nodes: increasing order of sup(h(g) ∪ {i}), which considers only the subset of transactions relevant to the given node
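Both ordering policies reduce to one operation: sort the tail items by their support among the transactions that contain the head. A minimal sketch follows, with the hypothetical function name `order_tail` and a tie-break on item value chosen only to make the result deterministic.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def order_tail(head, tail, transactions):
    """Sort `tail` by increasing sup(h(g) union {i}), most frequent items last.

    Only transactions that contain `head` are relevant to the node, so the
    support of h(g) union {i} equals the support of {i} among those.
    An empty `head` gives the initial ordering by sup({i}).
    """
    head = frozenset(head)
    relevant = [t for t in transactions if head <= t]
    return sorted(tail, key=lambda i: (support(frozenset([i]), relevant), i))


if __name__ == "__main__":
    data = [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}, {1, 3}, {2, 4}, {5}]
    print(order_tail(set(), [1, 2, 3, 4, 5], data))   # initial ordering
    print(order_tail({1}, [2, 3, 4], data))           # reordering below the node {1}
```

Placing the most frequent items last means they appear in the tails of the most candidate groups, which raises the chance that h(g) ∪ t(g) is frequent and the group can be pruned.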
