Parallel Mining of Maximal Frequent Itemsets from Databases. Soon M. Chung and Congnan Luo. Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '03).
Outline • Introduction • Max-Miner Algorithm • Parallel Max-Miner (PMM) Algorithm • Performance Evaluation • Conclusion
Introduction (1) • In mining association rules, the most time-consuming job is finding all frequent itemsets in a large database with respect to a given minimum support • In Apriori, the subset-infrequency based pruning step prevents many candidate k-itemsets from being counted in each pass k • However, if there is a frequent itemset of length l, Apriori-like algorithms will generate and count all 2^l of its subsets
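To make the 2^l growth concrete, the following is a minimal Python sketch that enumerates the non-empty subsets of an itemset of length l (there are 2^l - 1 of them; the empty set accounts for the remaining one). The helper name `nonempty_subsets` and the chosen lengths are illustrative assumptions, not part of the paper.

```python
from itertools import combinations


def nonempty_subsets(itemset):
    """Yield every non-empty subset of `itemset` as a frozenset."""
    items = sorted(itemset)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


# Number of non-empty subsets an Apriori-like algorithm would have to
# count for one frequent itemset of length l (illustrative lengths only).
subset_counts = {
    l: sum(1 for _ in nonempty_subsets(range(l))) for l in (4, 10, 16)
}

for l, n in subset_counts.items():
    print(f"l = {l:2d}: {n} non-empty subsets")
```

The count doubles with every additional item, which is why avoiding the enumeration of subsets of a long frequent itemset matters.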
Introduction (2) • Our basic idea is that if we find a large frequent itemset early, we can avoid counting all its subsets, because they are all frequent • We propose a parallel algorithm, named Parallel Max-Miner (PMM), for mining maximal frequent itemsets • Like the Count Distribution algorithm, PMM requires multiple passes over the database and needs synchronization between nodes at the end of every pass
Max-Miner algorithm • Unlike Apriori, the Max-Miner algorithm extracts only the maximal frequent itemsets • It uses superset-frequency based pruning • Max-Miner always attempts to look ahead in order to identify large frequent itemsets early • All subsets of these discovered frequent itemsets can then be pruned from the search space
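As an illustration of what "maximal" means here, the sketch below filters a collection of frequent itemsets down to those that have no frequent proper superset. It is a brute-force definition check, not the Max-Miner search itself; the function name `maximal_itemsets` and the sample data are assumptions made for the example.

```python
def maximal_itemsets(frequent):
    """Return the itemsets in `frequent` that have no proper superset in it."""
    sets = {frozenset(s) for s in frequent}
    # `s < other` is Python's proper-subset test on (frozen)sets.
    return {s for s in sets if not any(s < other for other in sets)}


# Hypothetical collection of frequent itemsets.
frequent = [{1}, {2}, {3}, {4}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
maximal = maximal_itemsets(frequent)
print(sorted(sorted(s) for s in maximal))
```

Every frequent itemset is a subset of some maximal one, so the maximal itemsets are a compact summary of the full collection.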
Set-enumeration tree of Max-Miner (2) • Each node in the tree is called a candidate group • A candidate group g consists of two itemsets • The first itemset is called the head of the group and is denoted by h(g) • The second itemset is called the tail of the group and is denoted by t(g) • t(g) is an ordered set that contains all the items not in h(g) that can potentially appear in any subnode derived from node g
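A candidate group can be represented as a small record holding h(g) and t(g). The sketch below is one possible representation; the class name `CandidateGroup`, the method `full_itemset`, and the helper `level_one_groups` are hypothetical names chosen for illustration, and the items 1 to 4 are an assumed example ordering.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class CandidateGroup:
    head: FrozenSet[int]   # h(g): items already fixed at this node
    tail: Tuple[int, ...]  # t(g): ordered items that may still be added

    def full_itemset(self) -> FrozenSet[int]:
        """Return h(g) united with t(g), the largest itemset under this node."""
        return self.head | frozenset(self.tail)


def level_one_groups(ordered_items: Sequence[int]) -> List[CandidateGroup]:
    """Build one group per item; its tail holds the items that follow it."""
    return [
        CandidateGroup(frozenset([item]), tuple(ordered_items[k + 1:]))
        for k, item in enumerate(ordered_items)
    ]


groups = level_one_groups([1, 2, 3, 4])
for g in groups:
    print(sorted(g.head), list(g.tail))
```

With this ordering the first group has head {1} and tail (2, 3, 4), while the last group has head {4} and an empty tail.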
The main procedure of Max-Miner (1) • At the root of the tree (level 0), count the support of the 1-itemsets • Only the frequent 1-itemsets are enumerated at level 1 • 4 nodes are generated at level 1 if 1, 2, 3, and 4 are all frequent 1-itemsets • For the node g1, we need to count the support of h(g1) ∪ t(g1) = {1,2,3,4} • If the support of h(g1) ∪ t(g1) is greater than or equal to minsup, we do not need to expand the tree from node g1 any further
The main procedure of Max-Miner (2) • At any node g, if h(g) ∪ t(g) is not frequent, then for each item i in t(g) we check whether h(g) ∪ {i} is frequent • If h(g) ∪ {i} is frequent, a corresponding subnode is generated • For a candidate group node g, an item that appears last in the ordering of the tail of g will appear in most of the offspring of node g • To discover the maximal frequent itemsets early, we should order the subnodes of each node in ascending order of their support
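The procedure on these two slides can be sketched in Python as follows. This is a simplified, single-machine illustration under stated assumptions, not the authors' implementation: `minsup` is taken as an absolute transaction count, support is counted by a naive scan, items are ordered once by ascending 1-itemset support, and the names `support` and `max_miner` are hypothetical.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def max_miner(transactions, minsup):
    """Return the maximal frequent itemsets (simplified Max-Miner sketch)."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # Level 0: count 1-itemsets and keep the frequent ones, ordered by
    # ascending support so the most frequent items end up in most tails.
    sup1 = {i: support(frozenset([i]), transactions) for i in items}
    ordered = sorted((i for i in items if sup1[i] >= minsup),
                     key=lambda i: (sup1[i], i))
    # Level 1: one candidate group (head, tail) per frequent item.
    groups = [(frozenset([i]), tuple(ordered[k + 1:]))
              for k, i in enumerate(ordered)]
    found = []  # frequent itemsets that may be maximal
    while groups:
        next_groups = []
        for head, tail in groups:
            full = head | frozenset(tail)
            # Superset-frequency pruning: skip groups already covered.
            if any(full <= f for f in found):
                continue
            if support(full, transactions) >= minsup:
                found.append(full)  # no need to expand this node
                continue
            # Keep only tail items i for which head ∪ {i} is frequent.
            kept = [i for i in tail
                    if support(head | {i}, transactions) >= minsup]
            if not kept:
                found.append(head)  # head has no frequent extension here
            for k, i in enumerate(kept):
                next_groups.append((head | {i}, tuple(kept[k + 1:])))
        groups = next_groups
    # Drop itemsets that turned out to be subsets of later discoveries.
    return [s for s in set(found) if not any(s < o for o in found)]


# Hypothetical toy database of five transactions.
db = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2, 3}, {2, 4}, {4, 5}]
result = max_miner(db, minsup=2)
print(sorted(sorted(s) for s in result))
```

On this toy database the group with head {4} and tail (2) is found frequent as a whole, so the group with head {2} is pruned without any counting.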
Parallel Max-Miner (PMM) algorithm • The database is evenly divided into N partitions {D0, D1, D2, …, DN-1}, one for each of the N nodes {P0, P1, P2, …, PN-1} • Each node is allocated the same number of transactions • PMM requires multiple passes over the database • For each pass k, all the nodes have exactly the same set of candidate groups, Ck • Each node counts the support of Ck in its local database partition independently • At the end of each pass, all nodes exchange their count information so that they can generate the same set Ck+1 for the next pass
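The counting and exchange step of one pass can be simulated sequentially as below. This is only a sketch of the count-distribution idea: the nodes are simulated by a loop rather than real processes, candidates are simplified to plain itemsets instead of full candidate groups, and the names `partition`, `local_counts`, and `exchange_counts` are assumptions made for the example.

```python
from collections import Counter


def partition(transactions, n_nodes):
    """Split the database into n_nodes contiguous, near-equal partitions."""
    size, rem = divmod(len(transactions), n_nodes)
    parts, start = [], 0
    for p in range(n_nodes):
        end = start + size + (1 if p < rem else 0)
        parts.append(transactions[start:end])
        start = end
    return parts


def local_counts(candidates, local_db):
    """Count each candidate's support within one node's partition only."""
    return Counter({c: sum(1 for t in local_db if c <= t) for c in candidates})


def exchange_counts(per_node_counts):
    """Simulate the end-of-pass exchange: sum the local counts of all nodes."""
    total = Counter()
    for counts in per_node_counts:
        total.update(counts)
    return total


# Hypothetical database and candidate set for a single pass.
db = [frozenset(t) for t in [{1, 2, 3, 4}, {1, 2, 3}, {1, 2, 3}, {2, 4}, {4, 5}]]
candidates = [frozenset({1, 2, 3}), frozenset({2, 4})]

parts = partition(db, n_nodes=2)
global_counts = exchange_counts(local_counts(candidates, d) for d in parts)
for c in candidates:
    print(sorted(c), global_counts[c])
```

Because every node ends the pass with the same global counts, each one can derive the next candidate set locally without further communication.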
Performance Evaluation • Figure: Speedup of PMM • Figure: Sizeup of PMM
Conclusion • We proposed a parallel maximal frequent itemset mining algorithm, Parallel Max-Miner, for shared-nothing multiprocessor systems • Drawback: it requires synchronization between nodes to exchange the count information at the end of every pass