This doctoral thesis explores fast algorithms for mining frequent itemsets, introducing the NFP-Tree structure and the Fast Share Measure (FSM) Algorithm. It includes efficient strategies like Direct Candidate Generate (DCG) and Maximum Item Conflict First (MICF). The study compares methods for mining association rules and utility, emphasizing privacy-preserving techniques. Experimental results and algorithm properties are discussed.
Fast Algorithms for Mining Frequent Itemsets • Doctoral dissertation draft • Advisor: Prof. Chin-Chen Chang (張真誠) • Graduate student: Yu-Chiang Li (李育強) • Dept. of Computer Science and Information Engineering, National Chung Cheng University • Date: May 31, 2007
Outline • Introduction • Background and Related Work • NFP-Tree Structure • Fast Share Measure (FSM) Algorithm • Three Efficient Algorithms • Direct Candidate Generate (DCG) Algorithm • Isolated Items Discarding Strategy (IIDS) • Maximum Item Conflict First (MICF) Sanitization Method • Conclusions
Introduction • Data mining techniques have been developed to find a small set of precious nuggets in reams of data (Cabena et al., 1998; Kantardzic, 2002) • Mining association rules is one of the most important data mining problems • It consists of two sub-problems (Agrawal & Srikant, 1994) • Identifying all frequent itemsets • Using these frequent itemsets to generate association rules • The first sub-problem plays an essential role in mining association rules
Introduction (cont'd) • Mining frequent itemsets • Mining share-frequent itemsets • Mining high utility itemsets • Hiding sensitive patterns
Support-Confidence Framework (1/4) Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%
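The level-wise idea behind Apriori can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the function name `apriori` and the five toy transactions are assumptions made for this example, and only the 40% threshold comes from the slide.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for all itemsets with support >= min_sup."""
    n = len(transactions)
    # Level 1 candidates: every single item that occurs in the data.
    level = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while level:
        # Count each candidate with one pass over the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: cnt for c, cnt in counts.items() if cnt / n >= min_sup}
        frequent.update(survivors)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets that form a k-itemset.
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        level = {c for c in joined
                 if all(frozenset(s) in survivors for s in combinations(c, k - 1))}
    return frequent

# Hypothetical toy database (not the one shown on the slide).
toy_db = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "D"}, {"A", "B", "C"}]
result = apriori(toy_db, min_sup=0.40)
```

With this toy data, item D occurs in only one of five transactions and is dropped at level 1, so no candidate containing D is ever counted.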
Support-Confidence Framework (2/4) • FP-growth algorithm (Han et al., 2000; Han et al., 2004)
Support-Confidence Framework (4/4) Conditional FP-tree of “D” Conditional FP-tree of “BD”
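Share-Confidence Framework (1/4) • Measure value: mv(ip, Tq), the measure value of item ip in transaction Tq • mv({D}, T01) = 1 • mv({C}, T03) = 3 • Transaction measure value: $tmv(T_q) = \sum_{i_p \in T_q} mv(i_p, T_q)$ • tmv(T02) = 10 • Total measure value: $Tmv(DB) = \sum_{T_q \in DB} tmv(T_q)$ • Tmv(DB) = 47 • Itemset measure value: $imv(X, T_q) = \sum_{i_p \in X} mv(i_p, T_q)$ for $X \subseteq T_q$ • imv({A, E}, T02) = 5 • Local measure value: $lmv(X) = \sum_{T_q \in db_X} imv(X, T_q)$, where $db_X$ is the set of transactions containing X • lmv({BC}) = 2 + 5 + 5 = 12
Share-Confidence Framework (2/4) • Itemset share: $SH(X) = lmv(X) / Tmv(DB)$ • SH({BC}) = 12/47 = 25.5% • SH-frequent: if SH(X) >= minShare, X is a share-frequent (SH-frequent) itemset • minShare = 30%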
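The share measures can be computed directly from their definitions. The sketch below assumes a database stored as a dictionary of item-to-measure-value maps; the three transactions are invented for illustration and are not the example database from the slides.

```python
# Hypothetical toy database: transaction id -> {item: measure value (e.g. quantity)}.
DB = {
    "T01": {"A": 1, "B": 2, "C": 1},
    "T02": {"B": 3, "C": 2, "D": 1},
    "T03": {"A": 2, "D": 4},
}

def tmv(tid):
    """Transaction measure value: sum of all measure values in one transaction."""
    return sum(DB[tid].values())

def total_mv():
    """Total measure value Tmv(DB)."""
    return sum(tmv(tid) for tid in DB)

def imv(itemset, tid):
    """Itemset measure value of X in Tq; 0 when Tq does not contain all of X."""
    t = DB[tid]
    if not all(i in t for i in itemset):
        return 0
    return sum(t[i] for i in itemset)

def lmv(itemset):
    """Local measure value: imv summed over the transactions containing X."""
    return sum(imv(itemset, tid) for tid in DB)

def share(itemset):
    """Itemset share SH(X) = lmv(X) / Tmv(DB)."""
    return lmv(itemset) / total_mv()

bc_share = share({"B", "C"})   # (2+1) + (3+2) = 8 out of a total of 16
```

In this toy data {B, C} has a share of 50%, so it would be SH-frequent under a 30% minShare.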
Share-Confidence Framework (3/4) • ZP (Zero Pruning) and ZSP (Zero Subset Pruning) (Barber & Hamilton, 2003) • variants of exhaustive search • prune only the candidate itemsets whose local measure values are exactly zero • SIP (Share Infrequent Pruning) (Barber & Hamilton, 2003) • Apriori-like level-wise search • its output may contain errors • These three algorithms are either inefficient or fail to discover the complete set of share-frequent (SH-frequent) itemsets
Share-Confidence Framework (4/4) ZSP Algorithm SIP Algorithm
Utility Mining (1/2) • Internal utility: iu(ip, Tq) • iu({D}, T01) = 1 • iu({C}, T03) = 3 • External utility: eu(ip) • eu({D}) = 3 • eu({C}) = 1 • Utility value in a transaction: $util(X, T_q) = \sum_{i_p \in X} iu(i_p, T_q) \times eu(i_p)$ • util({C, E, F}, T02) = util(C, T02) + util(E, T02) + util(F, T02) = 3×1 + 1×5 + 2×2 = 12 • Local utility: $Lutil(X) = \sum_{T_q \in db_X} util(X, T_q)$, where $db_X$ is the set of transactions containing X • Lutil({C, D}) = util({C, D}, T01) + util({C, D}, T04) + util({C, D}, T06) = 4 + 7 + 5 = 16
Utility Mining (2/2) • Total utility: $Tutil(DB) = \sum_{T_q \in DB} util(T_q, T_q)$ • Tutil(DB) = 122 • The utility value of X in DB: $UTIL(X) = Lutil(X) / Tutil(DB)$ • UTIL({C, D}) = 16/122 = 13.1% • High utility itemset: if UTIL(X) >= minUtil, X is a high utility itemset
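The utility definitions translate into a few short functions. In this sketch the quantities of C, E and F in T02, the quantities of C and D in T01, and the unit utilities are taken from the worked examples on the slides; the rest of the database (including the absence of other items in T01 and T02) is an assumption made to keep the example small, so the totals differ from the slide's Tutil(DB) = 122.

```python
# External (unit) utilities: eu(C)=1 and eu(D)=3 are stated on the slide;
# eu(E)=5 and eu(F)=2 are read off the worked example 3x1 + 1x5 + 2x2.
EU = {"C": 1, "D": 3, "E": 5, "F": 2}

# Hypothetical, reduced database: transaction id -> {item: internal utility}.
DB = {
    "T01": {"C": 1, "D": 1},
    "T02": {"C": 3, "E": 1, "F": 2},
    "T03": {"C": 3},
}

def util(itemset, tid):
    """Utility of X in Tq: sum of iu * eu over items of X; 0 if X is not in Tq."""
    t = DB[tid]
    if not all(i in t for i in itemset):
        return 0
    return sum(t[i] * EU[i] for i in itemset)

def lutil(itemset):
    """Local utility: util(X, Tq) summed over the transactions containing X."""
    return sum(util(itemset, tid) for tid in DB)

def tutil():
    """Total utility of the database: every transaction valued in full."""
    return sum(util(set(DB[tid]), tid) for tid in DB)

def utility_share(itemset):
    """UTIL(X) = Lutil(X) / Tutil(DB)."""
    return lutil(itemset) / tutil()

u_cef = util({"C", "E", "F"}, "T02")   # 3*1 + 1*5 + 2*2
```

An itemset X would then be reported as a high utility itemset when `utility_share(X) >= minUtil`.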
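Privacy Preservation in Mining Frequent Itemsets • Optimal sanitization is an NP-hard problem (Atallah et al., 1999) • DB: database, DB': released database • RI: the set of restrictive itemsets • ~RI: the set of non-restrictive itemsets • $\text{Misses cost} = \frac{|{\sim}RI(DB)| - |{\sim}RI(DB')|}{|{\sim}RI(DB)|}$, where $|{\sim}RI(\cdot)|$ is the number of non-restrictive itemsets discovered in the given database • Sanitization algorithms (Oliveira and Zaïane, 2002; Oliveira and Zaïane, 2003; Saygin et al., 2001)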
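A small sketch of how a misses cost could be measured, assuming it is defined as the fraction of non-restrictive frequent itemsets of DB that can no longer be mined from DB' (the usual definition in Oliveira and Zaïane's work). The brute-force miner, the function names, and the four-transaction database are illustrative assumptions only.

```python
from itertools import combinations

def frequent_itemsets(db, min_count):
    """Brute-force enumeration; only suitable for tiny illustrative databases."""
    items = sorted({i for t in db for i in t})
    found = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            if sum(1 for t in db if set(cand) <= t) >= min_count:
                found.add(frozenset(cand))
    return found

def misses_cost(db, released, restrictive, min_count):
    """Fraction of non-restrictive frequent itemsets lost by sanitization."""
    before = frequent_itemsets(db, min_count) - restrictive
    after = frequent_itemsets(released, min_count) - restrictive
    if not before:
        return 0.0
    return (len(before) - len(after)) / len(before)

# Hypothetical data: hide {A, B} by deleting B from the first transaction.
db = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
released = [{"A", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
restrictive = {frozenset({"A", "B"})}
mc = misses_cost(db, released, restrictive, min_count=2)
```

Here deleting B hides {A, B} but also pushes {B, C} below the threshold, so one of the five non-restrictive itemsets is lost. Deleting B from the second transaction instead would hide {A, B} with no misses, which is the kind of difference a sanitization heuristic tries to exploit.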
NFP-Tree (1/4) • NFP-growth Algorithm • NFP-tree construction
NFP-Tree (2/4)
NFP-Tree (3/4)
NFP-Tree (4/4) Conditional NFP-tree of “D(3,4)”
Experimental Results (1/3) • PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional • All algorithms were coded in VC++ 6.0 • Datasets: • Real: BMS-WebView-1, BMS-WebView-2, Connect-4 • Artificial: generated by the IBM synthetic data generator
Fast Share Measure (FSM) Algorithm • FSM: Fast Share Measure algorithm • ML: maximum transaction length in DB • MV: maximum measure value in DB • min_lmv = minShare × Tmv(DB) • Level Closure Property: given a minShare and a k-itemset X • Theorem 1. If lmv(X) + (lmv(X)/k) × MV < min_lmv, all supersets of X with length k+1 are infrequent • Theorem 2. If lmv(X) + (lmv(X)/k) × MV × k' < min_lmv, all supersets of X with length k+k' are infrequent • Corollary 1. If lmv(X) + (lmv(X)/k) × MV × (ML − k) < min_lmv, all supersets of X are infrequent
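One way to see why the bound in Theorem 1 holds is sketched below. The argument assumes that every measure value is an integer of at least 1 (for example a purchase quantity) and at most MV, and that Sup(·) denotes a support count; both are assumptions of this sketch rather than statements from the slides. Under them, each transaction containing a k-itemset X contributes at least k to lmv(X), so Sup(X) ≤ lmv(X)/k. For any superset X' = X ∪ {i} of length k+1:

```latex
\begin{align*}
\mathrm{lmv}(X') &= \sum_{T_q \in db_{X'}} \bigl(\mathrm{imv}(X, T_q) + \mathrm{mv}(i, T_q)\bigr) \\
                 &\le \mathrm{lmv}(X) + \mathrm{Sup}(X') \cdot MV \\
                 &\le \mathrm{lmv}(X) + \mathrm{Sup}(X) \cdot MV \\
                 &\le \mathrm{lmv}(X) + \frac{\mathrm{lmv}(X)}{k} \cdot MV .
\end{align*}
```

If the last expression is below min_lmv, no superset of length k+1 can reach the threshold. Repeating the step for each additional item gives the factor k' in Theorem 2, and taking k' = ML − k gives Corollary 1.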
minShare = 30% • Let CF(X) = lmv(X) + (lmv(X)/k) × MV × (ML − k) • Prune X if CF(X) < min_lmv • CF({ABC}) = 3 + (3/3) × 3 × (6 − 3) = 12 < 14.1 = min_lmv, so {ABC} is pruned
Experimental Results (1/2) • T4.I2.D100k.N50.S10 • minShare = 0.8% • ML = 14
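The pruning test on this slide is a single arithmetic check. The sketch below reproduces the {ABC} example with the values given on the slides (lmv = 3, k = 3, MV = 3, ML = 6, minShare = 30%, Tmv(DB) = 47); the function names are chosen for this example only.

```python
def critical_function(lmv_x, k, mv_max, ml):
    """CF(X) = lmv(X) + (lmv(X)/k) * MV * (ML - k): an upper bound on the
    local measure value of any superset of the k-itemset X."""
    return lmv_x + (lmv_x / k) * mv_max * (ml - k)

def can_prune(lmv_x, k, mv_max, ml, min_share, total_mv):
    """True when no superset of X can reach min_lmv = minShare * Tmv(DB)."""
    min_lmv = min_share * total_mv
    return critical_function(lmv_x, k, mv_max, ml) < min_lmv

cf_abc = critical_function(lmv_x=3, k=3, mv_max=3, ml=6)
prune_abc = can_prune(3, 3, 3, 6, min_share=0.30, total_mv=47)
```

Because CF only needs lmv(X) and three database-wide constants, the check costs nothing beyond the counting pass that already produced lmv(X).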
Three Efficient Algorithms • EFSM (Enhanced FSM): instead of joining two arbitrary itemsets in RC_{k-1}, EFSM joins each itemset in RC_{k-1} with a single item in RC_1 to generate C_k efficiently • Reduces the time complexity from O(n^{2k-2}) to O(n^k)
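The EFSM join step can be pictured as extending each remaining (k-1)-itemset by one remaining single item. The sketch below is only an illustration of that idea: the function name, the use of frozensets, and the sample candidates are assumptions, and the thesis implementation may order or deduplicate candidates differently.

```python
def generate_candidates(rc_prev, rc1):
    """Extend every itemset in RC_{k-1} with each single item of RC_1 that it
    does not already contain; a set removes duplicates such as AB+C and BC+A."""
    candidates = set()
    for itemset in rc_prev:
        for item in rc1:
            if item not in itemset:
                candidates.add(itemset | {item})
    return candidates

# Hypothetical remaining candidates after level 2.
rc2 = {frozenset("AB"), frozenset("BC")}
rc1 = {"A", "B", "C", "D"}
c3 = generate_candidates(rc2, rc1)
```

Each itemset is paired with at most |RC_1| items instead of with every other itemset in RC_{k-1}, which is where the reduction in join cost comes from.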
X^{k+1}: an arbitrary superset of X with length k+1 in DB • S(X^{k+1}): the set of all X^{k+1} in DB • dbS(X^{k+1}): the set of transactions that each contain at least one X^{k+1} • SuFSM and ShFSM are derived from EFSM and prune candidates more efficiently than FSM • SuFSM (Support-counted FSM): • Theorem 3. If lmv(X) + Sup(S(X^{k+1})) × MV × (ML − k) < min_lmv, all supersets of X are infrequent
SuFSM (Support-counted FSM) • lmv(X)/k ≥ Sup(X) ≥ Sup(S(X^{k+1})) • Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}^{k+1})) = 2 • If no superset of X is an SH-frequent itemset, then the following inequalities hold • lmv(X) + (lmv(X)/k) × MV × (ML − k) < min_lmv • lmv(X) + Sup(X) × MV × (ML − k) < min_lmv • lmv(X) + Sup(S(X^{k+1})) × MV × (ML − k) < min_lmv
ShFSM (Share-counted FSM) • Theorem 4. If Tmv(dbS(X^{k+1})) < min_lmv, all supersets of X are infrequent • Pruning conditions compared: • FSM: lmv(X) + (lmv(X)/k) × MV × (ML − k) < min_lmv • SuFSM: lmv(X) + Sup(S(X^{k+1})) × MV × (ML − k) < min_lmv • ShFSM: Tmv(dbS(X^{k+1})) < min_lmv
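The FSM and SuFSM pruning conditions differ only in the multiplier applied to MV × (ML − k), which makes them easy to compare side by side. The sketch below uses the {BCD} figures from the SuFSM slide (lmv = 15, k = 3, support of its (k+1)-supersets = 2) and assumes MV = 3 and ML = 6 carry over from the earlier FSM example; that carry-over, like the function names, is an assumption of this example.

```python
def fsm_bound(lmv_x, k, mv_max, ml):
    """FSM upper bound: uses lmv(X)/k as a stand-in for the support of X."""
    return lmv_x + (lmv_x / k) * mv_max * (ml - k)

def sufsm_bound(lmv_x, sup_supersets, k, mv_max, ml):
    """SuFSM upper bound: uses Sup(S(X^{k+1})), the number of transactions
    that contain X together with at least one more item."""
    return lmv_x + sup_supersets * mv_max * (ml - k)

def shfsm_prunable(tmv_dbs, min_lmv):
    """ShFSM test: prune when the total measure value of the transactions
    holding a (k+1)-superset of X is already below min_lmv."""
    return tmv_dbs < min_lmv

fsm_value = fsm_bound(15, 3, mv_max=3, ml=6)
sufsm_value = sufsm_bound(15, 2, 3, mv_max=3, ml=6)
```

Under these assumed constants the SuFSM bound is markedly smaller than the FSM bound, so it falls below min_lmv for more candidates and prunes them earlier.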
ShFSM (Share-counted FSM) • Ex. X = {AB} • Tmv(dbS(X^{k+1})) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14.1 = min_lmv, so all supersets of {AB} are infrequent
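The ShFSM quantity can be computed in one scan: a transaction belongs to dbS(X^{k+1}) exactly when it contains X plus at least one other item. The sketch below uses an invented database in which two transactions with totals of 6 each contain supersets of {A, B}, mirroring the slide's arithmetic; the individual item quantities are hypothetical.

```python
# Hypothetical database: transaction id -> {item: measure value}.
DB = {
    "T01": {"A": 1, "B": 1, "C": 1, "D": 1, "E": 2},   # tmv = 6, contains AB + more
    "T03": {"A": 2, "B": 3},                           # contains AB only
    "T04": {"C": 2, "D": 2},                           # does not contain AB
    "T05": {"A": 2, "B": 1, "D": 3},                   # tmv = 6, contains AB + more
}

def tmv_of_superset_transactions(db, itemset):
    """Tmv(dbS(X^{k+1})): total measure value of the transactions that contain
    X and at least one further item."""
    total = 0
    for items in db.values():
        if itemset <= set(items) and len(items) > len(itemset):
            total += sum(items.values())
    return total

bound_ab = tmv_of_superset_transactions(DB, {"A", "B"})
min_lmv = 0.30 * 47   # minShare and Tmv(DB) as given on the earlier slides
prune_ab = bound_ab < min_lmv
```

T03 is skipped because it holds {A, B} and nothing else, so it cannot support any longer itemset.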
Experimental Results (1/3) minShare=0.3%
Experimental Results (2/3) minShare=0.3%
Experimental Results (3/3) • T6.I4.D100k.N200.S10 • minShare = 0.1% • ML=20
IIDS (1/2) ShFSM minUtil=30%
IIDS (2/2) FUM minUtil=30%
Experimental Results (5/5) minUtil = 0.12% minUtil = 0.12%
Maximum Item Conflict First (MICF) Sanitization Method • Tdegree(Tq): the degree of conflict of a sensitive transaction Tq, defined as the number of restrictive itemsets included in Tq • If Tdegree(Tq) > 1, Tq is a conflicting transaction
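The transaction-level conflict degree is a simple containment count. The sketch below covers only Tdegree as defined on this slide; the restrictive itemsets and transactions are hypothetical examples, and the item-level selection rule of MICF is not modelled here.

```python
def tdegree(transaction, restrictive_itemsets):
    """Number of restrictive itemsets fully contained in the transaction."""
    return sum(1 for r in restrictive_itemsets if r <= transaction)

def is_conflicting(transaction, restrictive_itemsets):
    """A sensitive transaction is conflicting when it holds more than one
    restrictive itemset."""
    return tdegree(transaction, restrictive_itemsets) > 1

# Hypothetical restrictive itemsets and transactions.
RI = [frozenset({"D", "F"}), frozenset({"C", "D"})]
t_a = {"A", "C", "D", "F"}   # contains both restrictive itemsets
t_b = {"D", "F"}             # contains only {D, F}
```

In `t_a`, item D belongs to both restrictive itemsets, so deleting that one item would stop the transaction from supporting either of them.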
Idegree({D}, {D, F}, T05)=1 • Idegree({F}, {D, F}, T05)=0 • MaxIdegree: store the maximum value of the conflict degree among items in a transaction • MICF: select an item with MaxIdegree to delete in each iteration
Idegree({D}, {D, F}, T06) = 1 • Idegree({F}, {D, F}, T06) = 0