
LCM ver.3: Collaboration of Array, Bitmap and Prefix Tree for Frequent Itemset Mining

Takeaki Uno (National Institute of Informatics, JAPAN), Masashi Kiyomi (National Institute of Informatics, JAPAN), Hiroki Arimura (Hokkaido University, JAPAN). 20/Aug/2005, Open Source Data Mining ’05.





Presentation Transcript


  1. LCM ver.3: Collaboration of Array, Bitmap and Prefix Tree for Frequent Itemset Mining. Takeaki Uno (National Institute of Informatics, JAPAN), Masashi Kiyomi (National Institute of Informatics, JAPAN), Hiroki Arimura (Hokkaido University, JAPAN). 20/Aug/2005, Open Source Data Mining ’05

  2. Computation of Pattern Mining. For frequent itemsets and closed itemsets, enumeration methods are almost optimal: TIME = (time of one iteration) × (#iterations), and #iterations is not much larger than #solutions (it is linearly bounded), so we get • linear time in #solutions • polynomial delay. The time of one iteration is spent on • frequency counting • data structure reconstruction • the closure operation • pruning, ... Goal: clarify the features of • enumeration algorithms • real-world datasets. For which cases (parameters) is which technique good? “Theoretical intuitions/evidences” are important. We focus on the data structure and the computation within one iteration.

  3. Motivation. Several data structures have been proposed for storing huge datasets and accelerating the computation (frequency counting). Each has its own advantages and disadvantages:
  1. Bitmap (Good: dense data, large support. Bad: sparse data, small support)
  2. Prefix Tree (Good: non-sparse data, structured data. Bad: sparse data, non-structured data)
  3. Array List, with deletion of duplications (Good: non-dense data. Bad: very dense data)
  Real datasets have both a dense part and a sparse part. How can we fit?

  4. Observations. • Usually, databases satisfy a power law: the part consisting of a few items is dense, and the rest is very sparse. • Using reduced conditional databases, in almost all iterations (deep in the recursion) the size of the database is very small, so quick operations for small databases are very efficient.

  5. Idea of Combination. Use bitmaps and array lists for the dense and sparse parts, and a prefix tree of constant size for frequency counting:
  • Choose a constant c
  • Let F = the c items of largest frequency
  • Split each transaction T into two parts: a dense part composed of the items in F, and a sparse part composed of the items not in F
  • Store the dense part as a bitmap, and the sparse part as an array list
  In this way we can take all their advantages.
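The splitting step above can be sketched in Python as follows (a minimal illustration, not the actual LCM v3 C code; the function and variable names are our own):

```python
from collections import Counter

def split_transactions(transactions, c):
    """Split each transaction into a dense part (bitmap over the c most
    frequent items F) and a sparse part (array list of the other items)."""
    freq = Counter(item for t in transactions for item in t)
    dense_items = [item for item, _ in freq.most_common(c)]   # F: top-c items
    bit_of = {item: i for i, item in enumerate(dense_items)}  # item -> bit position

    split = []
    for t in transactions:
        bitmap = 0
        sparse = []
        for item in t:
            if item in bit_of:
                bitmap |= 1 << bit_of[item]   # dense part: one bit per item of F
            else:
                sparse.append(item)           # sparse part: plain array list
        split.append((bitmap, sorted(sparse)))
    return dense_items, split
```

For the example transactions {a,b,c}, {a,c,d}, {c,d} with c = 2, item c (frequency 3) is certainly placed in F, and the remaining infrequent items stay in the sparse array lists.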

  6. Complete Prefix Tree. We use a complete prefix tree: a prefix tree that includes all patterns.

  7. Complete Prefix Tree. We use a complete prefix tree: a prefix tree that includes all patterns. We construct the complete prefix tree for the dense parts of the transactions (Ex: transactions {a,b,c}, {a,c,d}, {c,d}). The parent of a pattern is obtained by clearing its highest bit (e.g. 1101 → 0101), so no pointers are needed. If c is small, then the tree’s size 2^c is not huge.

  8. Complete Prefix Tree. Moreover, any prefix tree is a subtree of the complete prefix tree (Ex: transactions {a,b,c,d}, {a}, {a,d}). Again the parent of a pattern is obtained by clearing its highest bit, no pointers are needed, and for small c the size 2^c is not huge.
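The pointer-free parent rule from the two slides above is a one-liner (our own Python sketch):

```python
def parent(pattern):
    """Parent of a nonzero pattern in the complete prefix tree:
    clear the highest set bit. No parent pointer is stored anywhere."""
    return pattern & ~(1 << (pattern.bit_length() - 1))
```

Walking `parent` repeatedly from any pattern removes its items from the highest bit downward and reaches the root 0.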

  9. Frequency Counting. In the complete prefix tree, • the frequency of a pattern (vertex) = the number of its descendant leaves, and • the occurrences obtained by adding item i = the patterns whose i-th bit is 1. Therefore a bottom-up sweep works well: it counts frequencies in time linear in the size of the prefix tree.
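One way such a bottom-up sweep can be realized is sketched below (our own Python illustration, assuming transactions are already reduced to their dense-part bitmaps over c items): sweeping vertices from the largest value downward, each vertex passes its accumulated weight to its parent (highest bit cleared) and contributes that weight to the frequency of its highest item.

```python
def item_frequencies(bitmaps, c):
    """Count, for every item bit i < c, how many transactions contain item i,
    using one bottom-up sweep of the complete prefix tree of size 2^c."""
    weight = [0] * (1 << c)
    for b in bitmaps:                  # leaves: multiplicity of each dense pattern
        weight[b] += 1
    freq = [0] * c
    for v in range((1 << c) - 1, 0, -1):   # bottom-up: larger value = deeper vertex
        i = v.bit_length() - 1             # highest set bit of v
        freq[i] += weight[v]               # every transaction routed through v has item i
        weight[v & ~(1 << i)] += weight[v] # pass weight to parent (clear highest bit)
    return freq
```

Each of the 2^c vertices is touched exactly once, so after the initial leaf weighting the sweep is linear in the size of the tree.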

  10. “Constant Size” Dominates. How many iterations take a “constant size database” as input? Almost all of them: the “small” iterations dominate the computation time, and the strategy change is not a heavy task.

  11. More Advantages.
  • Reconstruction of prefix trees is a heavy task; the complete prefix tree needs no reconstruction.
  • Coding prefix trees is not easy; the complete prefix tree is easy to code.
  • The radix sort used for detecting identical transactions is heavy when the data is dense; bitmaps for the dense parts accelerate the radix sort.

  12. For Closed/Maximal Itemsets.
  • We compute the closure/maximality by storing the previously obtained itemsets, so no additional function is needed.
  • With depth-first search (closure-extension type), we need the prefix of each itemset.
  By taking the intersection/weighted union of the prefixes at each node of the prefix tree, we can compute them efficiently (as in LCM v2).
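The closure-by-intersection idea can be illustrated on the dense part as follows (a simplified Python sketch under our own naming: it intersects occurrence bitmaps directly, whereas the scheme above shares this work through the prefixes stored at the nodes of the prefix tree):

```python
def dense_closure(occurrence_bitmaps):
    """Closure of the dense part of a pattern: the bitwise AND of the
    dense bitmaps of all transactions containing the pattern, i.e. the
    items common to every occurrence."""
    if not occurrence_bitmaps:
        return 0
    closure = occurrence_bitmaps[0]
    for b in occurrence_bitmaps[1:]:
        closure &= b                 # keep only items present in every occurrence
    return closure
```

An item survives the AND exactly when it appears in every occurrence, which is the defining property of the closure.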

  13. Experiments. We applied the data structure to LCM2. CPU, memory, OS: Pentium4 3.2GHz, 2GB memory, Linux. Compared with: FP-growth, afopt, Mafia, Patriciamine, kDCI, nonodrfp, aim2, DCI-closed (all of which scored highly at the FIMI’04 competition), on 14 datasets from the FIMI repository. Memory usage decreased to half for dense datasets, but not for sparse datasets.

  14. Experimental Results

  15. Experimental Results

  16. Discussion and Future Work.
  • The combination of bitmaps and array lists reduces memory usage efficiently for dense datasets.
  • Using a prefix tree over a constant number of items is sufficient to speed up frequency counting for non-sparse datasets.
  • The data structure is orthogonal to other methods for closed/maximal itemset mining: maximality check, pruning, closure operations, etc.
  Future work: other pattern mining problems.
  • Bitmaps and prefix trees are not so efficient for semi-structured data (semi-structure gives huge variations, hardly represented by bits, and hard to share).
  • Simplify the techniques so that they can be applied easily.
  • Stable memory allocation (no need for dynamic allocation).
