
Fast Algorithms for Mining Frequent Itemsets



  1. Fast Algorithms for Mining Frequent Itemsets (挖掘頻繁項目集合之快速演算法研究) PhD Dissertation Proposal Advisor: Prof. 張真誠 Student: 李育強 Dept. of Computer Science and Information Engineering, National Chung Cheng University Date: January 20, 2005

  2. Outline • Introduction • Background and Related Work • A New FP-Tree for Mining Frequent Itemsets • Efficient Algorithms for Mining Share-Frequent Itemsets • Conclusions

  3. Introduction • Data mining techniques have been developed to find small sets of precious nuggets in reams of data • Mining association rules constitutes one of the most important data mining problems • Two sub-problems • Identifying all frequent itemsets • Using these frequent itemsets to generate association rules • The first sub-problem plays an essential role in mining association rules • This proposal studies two tasks: mining frequent itemsets and mining share-frequent itemsets

  4. Background and Related Work • Support-Confidence Framework • Each item is a binary variable denoting whether the item was purchased • Apriori (Agrawal & Srikant, 1994) and Apriori-like algorithms • Pattern-growth algorithms (Han et al., 2000; Han et al., 2004) • Share-Confidence Framework (Carter et al., 1997) • The support-confidence framework does not analyze the exact number of products purchased • The support count does not measure the profit or cost of an itemset • Exhaustive search algorithms • Fast algorithms (but with errors)

  5. Support-Confidence Framework (1/3) Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%
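
To make the level-wise idea on this slide concrete, the following is a minimal C++ sketch of Apriori's join and support-counting steps; the data layout (transactions as sorted vectors of item IDs) and all names are illustrative, not the implementation evaluated in this proposal.

    // Minimal Apriori sketch: level-wise generation of frequent itemsets.
    // Transactions are sorted vectors of item IDs; minSup is an absolute count.
    #include <algorithm>
    #include <iterator>
    #include <map>
    #include <set>
    #include <vector>

    using Itemset = std::vector<int>;
    using Database = std::vector<Itemset>;

    // Count the transactions that contain every item of the (sorted) candidate.
    int support(const Database& db, const Itemset& cand) {
        int n = 0;
        for (const Itemset& t : db)
            if (std::includes(t.begin(), t.end(), cand.begin(), cand.end())) ++n;
        return n;
    }

    std::vector<Itemset> apriori(const Database& db, int minSup) {
        std::vector<Itemset> result;
        // Level 1: frequent single items.
        std::map<int, int> singles;
        for (const Itemset& t : db)
            for (int item : t) ++singles[item];
        std::set<Itemset> level;
        for (const auto& [item, cnt] : singles)
            if (cnt >= minSup) level.insert({item});
        // Level k: join pairs of frequent (k-1)-itemsets sharing a (k-2)-prefix.
        while (!level.empty()) {
            result.insert(result.end(), level.begin(), level.end());
            std::set<Itemset> next;
            for (auto a = level.begin(); a != level.end(); ++a)
                for (auto b = std::next(a); b != level.end(); ++b) {
                    if (!std::equal(a->begin(), a->end() - 1, b->begin()))
                        break;  // lexicographic order: later b's share no prefix either
                    Itemset cand = *a;
                    cand.push_back(b->back());
                    if (support(db, cand) >= minSup) next.insert(cand);
                }
            level.swap(next);
        }
        return result;
    }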

  6. Support-Confidence Framework (2/3) • FP-growth algorithm (Han et al., 2000; Han et al., 2004)

  7. Support-Confidence Framework (3/3) Conditional FP-tree of “D” Conditional FP-tree of “BD”
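
Slides 6 and 7 depend on the FP-tree structure. Below is a minimal sketch of an FP-tree node and header-table entry, assuming the standard layout of Han et al. (2000): item, count, parent pointer, children, and a node-link chaining nodes that carry the same item. Names are illustrative.

    // Minimal FP-tree node following the standard layout of Han et al. (2000).
    #include <map>
    #include <memory>

    struct FPNode {
        int item = -1;                  // item ID (-1 for the root)
        int count = 0;                  // transactions passing through this node
        FPNode* parent = nullptr;       // for walking up conditional pattern bases
        FPNode* nodeLink = nullptr;     // next node with the same item (header chain)
        std::map<int, std::unique_ptr<FPNode>> children;
    };

    // Header-table entry: total support of the item plus the head of its node chain.
    struct HeaderEntry {
        int totalCount = 0;
        FPNode* head = nullptr;
    };

    // Insert one frequency-ordered transaction prefix into the tree.
    void insert(FPNode* node, const int* items, int n,
                std::map<int, HeaderEntry>& header) {
        for (int i = 0; i < n; ++i) {
            auto& child = node->children[items[i]];
            if (!child) {
                child = std::make_unique<FPNode>();
                child->item = items[i];
                child->parent = node;
                // Link the new node into the header chain for this item.
                child->nodeLink = header[items[i]].head;
                header[items[i]].head = child.get();
            }
            ++child->count;
            ++header[items[i]].totalCount;
            node = child.get();
        }
    }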

  8. Share-Confidence Framework (1/6) • Measure value mv(ip, Tq): the quantity of item ip purchased in transaction Tq • mv({D}, T01) = 1 • mv({C}, T03) = 3 • Transaction measure value: tmv(Tq) = Σip∈Tq mv(ip, Tq) • tmv(T02) = 9 • Total measure value: Tmv(DB) = ΣTq∈DB tmv(Tq) • Tmv(DB) = 44 • Itemset measure value: imv(X, Tq) = Σip∈X mv(ip, Tq) • imv({A, E}, T02) = 4 • Local measure value: lmv(X) = ΣTq∈dbX imv(X, Tq), where dbX is the set of transactions containing X • lmv({BC}) = 2 + 4 + 5 = 11

  9. Share-Confidence Framework (2/6) • Itemset share: SH(X) = lmv(X) / Tmv(DB) • SH({BC}) = 11/44 = 25% • SH-frequent: if SH(X) ≥ minShare, X is a share-frequent (SH-frequent) itemset • Example threshold: minShare = 30%
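
To make the measures on slides 8 and 9 concrete, here is a small C++ sketch that computes Tmv(DB), lmv(X), and SH(X) from a quantity-annotated transaction database; the container layout is an assumption for illustration.

    // Sketch of the share measures on slides 8-9. A transaction maps each
    // item ID to its measure value mv (e.g., the purchased quantity).
    #include <map>
    #include <vector>

    using Transaction = std::map<int, int>;   // item -> mv(item, Tq)
    using Database = std::vector<Transaction>;
    using Itemset = std::vector<int>;

    // Tmv(DB): total measure value of the whole database.
    int totalMeasureValue(const Database& db) {
        int tmv = 0;
        for (const Transaction& t : db)
            for (const auto& [item, mv] : t) tmv += mv;
        return tmv;
    }

    // lmv(X): sum of imv(X, Tq) over the transactions that contain all of X.
    int localMeasureValue(const Database& db, const Itemset& x) {
        int lmv = 0;
        for (const Transaction& t : db) {
            int imv = 0;
            bool containsAll = true;
            for (int item : x) {
                auto it = t.find(item);
                if (it == t.end()) { containsAll = false; break; }
                imv += it->second;
            }
            if (containsAll) lmv += imv;
        }
        return lmv;
    }

    // SH(X) = lmv(X) / Tmv(DB); X is SH-frequent when SH(X) >= minShare.
    double itemsetShare(const Database& db, const Itemset& x) {
        return static_cast<double>(localMeasureValue(db, x)) / totalMeasureValue(db);
    }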

  10. Share-Confidence Framework (3/6) • ZP (Zero Pruning) and ZSP (Zero Subset Pruning) • variants of exhaustive search • prune only the candidate itemsets whose local measure values are exactly zero • SIP (Share Infrequent Pruning) • Apriori-like • with errors • CAC (Combine All Counted) and PCAC (Parametric CAC) • derived from ZSP, using a prediction function • with errors • IAB (Item Add-Back) and PIAB (Parametric IAB) • join each SH-frequent itemset with each 1-itemset • with errors • Existing algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets

  11. Share-Confidence Framework (4/6) ZP Algorithm SIP & IAB Algorithms

  12. Share-Confidence Framework (5/6) ZSP Algorithm

  13. Share-Confidence Framework (6/6) • Prediction function of the CAC algorithm: • PSH(XY) = SH(X) + SH(Y) × |dbX|/|DB|, if |dbX| < |dbY| …(1) • PSH(XY) = SH(Y) + SH(X) × |dbY|/|DB|, if |dbY| < |dbX| …(2) • PSH(XY) = ((1) + (2))/2, if |dbX| = |dbY| • PSH(AB) = (22.7% + 18.2% × 4/6 + 18.2% + 22.7% × 4/6)/2 = 34.1% • PSH(AE) = 9.1% + 22.7% × 2/6 = 16.7% < 30% • CAC Algorithm
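
The prediction function above transcribes directly into code. In this sketch, shX/shY stand for SH(X)/SH(Y) and dbX/dbY for |dbX|/|dbY|; it is a minimal illustration, not the CAC implementation itself.

    // CAC's predicted share PSH(XY) from slide 13. dbSize is |DB|.
    double predictedShare(double shX, double shY, int dbX, int dbY, int dbSize) {
        double byX = shX + shY * static_cast<double>(dbX) / dbSize;  // case (1)
        double byY = shY + shX * static_cast<double>(dbY) / dbSize;  // case (2)
        if (dbX < dbY) return byX;
        if (dbY < dbX) return byY;
        return (byX + byY) / 2.0;                                    // |dbX| = |dbY|
    }

    // Worked examples from the slide: with |dbA| = |dbB| = 4 and |DB| = 6,
    // PSH(AB) = (22.7% + 18.2%*4/6 + 18.2% + 22.7%*4/6)/2 = 34.1%;
    // with |dbE| = 2 < |dbA|, PSH(AE) = 9.1% + 22.7%*2/6 = 16.7%.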

  14. A New FP-Tree for Mining Frequent Itemsets (1/3) • NFP-growth Algorithm • NFP-tree construction

  15. A New FP-Tree for Mining Frequent Itemsets (2/3)

  16. A New FP-Tree for Mining Frequent Itemsets (3/3) Conditional NFP-tree of “D(3,4)”
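
The NFP-tree figures are not reproduced in this transcript. Based on the conclusion slide's note that the NFP-tree keeps two counters per tree node, and the node label "D(3,4)" above, a plausible node layout is sketched below; the exact semantics of the two counters are not spelled out on these slides, so this layout is only an illustrative guess, not the thesis's actual definition.

    // Hypothetical NFP-tree node: like an FP-tree node, but with two
    // per-node counters (cf. the "D(3,4)" label), which is what lets the
    // NFP-tree get by with fewer nodes and a smaller header table.
    #include <map>
    #include <memory>

    struct NFPNode {
        int item = -1;
        int count1 = 0;                 // first per-node counter
        int count2 = 0;                 // second per-node counter
        NFPNode* parent = nullptr;
        NFPNode* nodeLink = nullptr;    // chains nodes of the same item
        std::map<int, std::unique_ptr<NFPNode>> children;
    };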

  17. Experimental Results (1/4) • PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional • All algorithms were coded in VC++ 6.0 • Datasets: • Real: BMS-WebView-1, BMS-WebView-2, Connect-4 • Artificial: generated by the IBM synthetic data generator

  18. Experimental Results (2/4)

  19. Experimental Results (3/4)

  20. Experimental Results (4/4)

  21. A Fast Algorithm for Mining Share-Frequent Itemsets • FSM: Fast Share Measure algorithm • ML: maximum transaction length in DB • MV: maximum measure value in DB • min_lmv = minShare × Tmv(DB) • Level Closure Property: given a minShare and a k-itemset X • Theorem 1. If lmv(X) + (lmv(X)/k) × MV < min_lmv, all supersets of X with length k+1 are infrequent • Theorem 2. If lmv(X) + (lmv(X)/k) × MV × k' < min_lmv, all supersets of X with length k+k' are infrequent • Corollary 1. If lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv, all supersets of X are infrequent

  22. FSM: Fast Share Measure algorithm • minShare = 30% • Let CF(X) = lmv(X) + (lmv(X)/k) × MV × (ML - k) • Prune X if CF(X) < min_lmv • Ex. CF({ABC}) = 3 + (3/3) × 3 × (6 - 3) = 12 < 14 = min_lmv, so {ABC} and all of its supersets are pruned
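
The pruning test of FSM is a one-liner once lmv(X) is known (for instance, computed as in the sketch after slide 9); the following transcribes Corollary 1 and the worked example above.

    // FSM pruning test from slides 21-22 (Corollary 1): if
    // CF(X) = lmv(X) + (lmv(X)/k) * MV * (ML - k) < min_lmv,
    // no superset of the k-itemset X can be SH-frequent, so X is pruned.
    bool fsmPrune(double lmvX, int k, int MV, int ML, double minLmv) {
        double cf = lmvX + (lmvX / k) * MV * (ML - k);  // critical function CF(X)
        return cf < minLmv;                             // prune X and all supersets
    }

    // Worked example from slide 22: lmv({ABC}) = 3, k = 3, MV = 3, ML = 6,
    // min_lmv = 14 gives CF = 3 + 1*3*3 = 12 < 14, so {ABC} is pruned.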

  23. Experimental Results (1/2) • T4.I2.D100k.N50.S10 • minShare = 0.8% • ML = 14

  24. Experimental Results (2/2)

  25. Efficient Algorithms for Mining Share-Frequent Itemsets • EFSM (Enhanced FSM): instead of joining two arbitrary itemsets in RCk-1, EFSM joins each itemset of RCk-1 with a single item in RC1 to generate Ck efficiently, as sketched below • Reduces the time complexity of candidate generation from O(n^(2k-2)) to O(n^k)
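
A minimal sketch of the EFSM join referenced above: each remaining candidate (k-1)-itemset is extended by one item of RC1 that is larger than its last item, which keeps candidates sorted and duplicate-free. With |RCk-1| = O(n^(k-1)) and |RC1| = n, this inspects O(n^k) pairs rather than the O(n^(2k-2)) pairs of joining RCk-1 with itself. Container names are illustrative.

    // EFSM candidate generation (slide 25): extend each remaining candidate
    // (k-1)-itemset with one single item from RC1 instead of joining the
    // candidate set with itself.
    #include <set>
    #include <vector>

    using Itemset = std::vector<int>;

    std::set<Itemset> efsmJoin(const std::set<Itemset>& rcKminus1,
                               const std::vector<int>& rc1) {
        std::set<Itemset> ck;                   // candidate k-itemsets
        for (const Itemset& x : rcKminus1)
            for (int item : rc1)
                if (item > x.back()) {          // keep itemsets sorted, no duplicates
                    Itemset cand = x;
                    cand.push_back(item);
                    ck.insert(cand);
                }
        return ck;
    }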

  26. Efficient Algorithms for Mining Share-Frequent Itemsets • Xk+1: an arbitrary superset of X with length k+1 in DB • S(Xk+1): the set containing all Xk+1 in DB • dbS(Xk+1): the set of transactions each of which contains at least one Xk+1 • SuFSM and ShFSM extend EFSM and prune candidates more effectively than FSM • SuFSM (Support-counted FSM): • Theorem 3. If lmv(X) + Sup(S(Xk+1)) × MV × (ML - k) < min_lmv, all supersets of X are infrequent

  27. SuFSM (Support-counted FSM) • lmv(X)/k ≥ Sup(X) ≥ Sup(S(Xk+1)) ≥ maxSup(Xk+1) • Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}k+1)) = 2, maxSup({BCD}k+1) = 1 • Each of the following four conditions guarantees that no superset of X is SH-frequent; by the chain above, each condition implies the next, so the later tests prune more candidates • lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv • lmv(X) + Sup(X) × MV × (ML - k) < min_lmv • lmv(X) + Sup(S(Xk+1)) × MV × (ML - k) < min_lmv • lmv(X) + maxSup(Xk+1) × MV × (ML - k) < min_lmv

  28. ShFSM (Share-counted FSM) • Theorem 4. If Tmv(dbS(Xk+1)) < min_lmv, all supersets of X are infrequent • Comparison of the pruning conditions: • FSM: lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv • SuFSM: lmv(X) + Sup(S(Xk+1)) × MV × (ML - k) < min_lmv • ShFSM: Tmv(dbS(Xk+1)) < min_lmv

  29. ShFSM (Share-counted FSM) • Ex. X = {AB} • Tmv(dbS(Xk+1)) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv, so no superset of {AB} can be SH-frequent
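
A sketch of the ShFSM test of Theorem 4, reusing the Transaction/Database/Itemset types from the sketch after slide 9. A transaction is treated as belonging to dbS(Xk+1) when it contains all of X plus at least one further item; allowing any extra item is a simplifying assumption here, whereas the algorithm restricts extensions to the remaining candidate items.

    // ShFSM pruning test (Theorem 4, slides 28-29): if the total measure value
    // of the transactions containing some (k+1)-superset of X is already below
    // min_lmv, no superset of X can be SH-frequent.
    int tmvOfSupersetTransactions(const Database& db, const Itemset& x) {
        int total = 0;
        for (const Transaction& t : db) {
            bool containsX = true;
            for (int item : x)
                if (!t.count(item)) { containsX = false; break; }
            if (containsX && t.size() > x.size()) {       // at least one extra item
                for (const auto& [item, mv] : t) total += mv;  // add tmv(Tq)
            }
        }
        return total;
    }

    bool shfsmPrune(const Database& db, const Itemset& x, double minLmv) {
        return tmvOfSupersetTransactions(db, x) < minLmv;
    }

    // Worked example from slide 29: for X = {AB}, only T01 and T05 contain
    // {AB} plus another item, so the total is 6 + 6 = 12 < 14 = min_lmv.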

  30. Experimental Results (1/4)

  31. Experimental Results (2/4) • minShare = 0.3%

  32. Experimental Results (3/4) • minShare = 0.3%

  33. Experimental Results (4/4) • T6.I4.D100k.N200.S10 • minShare = 0.1% • ML = 20

  34. Conclusions • Support measure • The proposed NFP-tree uses two counters per tree node to reduce the number of tree nodes • It applies a smaller tree and header table to discover frequent itemsets efficiently • Future work: develop superior data structures and further extend the pattern-growth approach

  35. Share measure • The proposed algorithms efficiently decrease the number of candidates to be counted • ShFSM achieves the best performance • Future work: develop superior algorithms to accelerate the identification of all SH-frequent itemsets

  36. ShFSM: Tmv(dbS(Xk+1)) < min_lmv

  37. Thank You!
