Finding Recent Frequent Itemsets Adaptively over Online Data Streams

J. H. Chang and W. S. Lee, in Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
Adviser: Jia-Ling Koh
Speaker: Shu-Ning Shin
Date: 2004.8.12

Introduction
• This paper proposes a method for finding recent frequent itemsets:
  • Significant itemsets are maintained in a prefix-tree lattice structure called a monitoring lattice.
  • The old occurrence count of each itemset is decayed as time goes by.
• The number of significant itemsets is minimized by:
  • delayed insertion
  • pruning operations

Preliminaries (1)
• A data stream is defined as follows:
  • $I = \{i_1, i_2, \dots, i_n\}$: the set of current items.
  • $e$: an itemset, i.e. a set of items.
  • TID: transaction identifier; $T_k$ is the transaction generated at the $k$th turn.
  • $D_k = \langle T_1, T_2, \dots, T_k \rangle$: the current data stream when a new transaction $T_k$ is generated.
  • $|D|_k$: the number of transactions in $D_k$.
  • $C_k(e)$: the number of transactions in $D_k$ that contain the itemset $e$.
  • $S_k(e)$: the support of itemset $e$ in $D_k$, i.e. $C_k(e) / |D|_k$.

Preliminaries (2)
• Decay rate $d$: the rate at which a weight is reduced per decay-unit.
  • $d = b^{-(1/h)}$, where $b > 1$, $h \geq 1$, and $b^{-1} \leq d < 1$.
• Decay-unit: the chunk of information that is decayed together.
• Decay-base $b$: the amount of weight reduction per decay-unit; it is greater than 1.
• Decay-base-life $h$: the number of decay-units after which the current weight becomes $b^{-1}$.
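The relation between the decay-base, the decay-base-life and the decay rate can be checked numerically. The following is a minimal sketch; the function name `decay_rate` and the sample values b = 2, h = 1000 are illustrative assumptions, not taken from the slides.

```python
def decay_rate(b: float, h: float) -> float:
    """Return d = b ** (-1 / h) for decay-base b > 1 and decay-base-life h >= 1."""
    if b <= 1 or h < 1:
        raise ValueError("need b > 1 and h >= 1")
    return b ** (-1.0 / h)


# Illustrative values (assumed): the weight halves every 1000 decay-units.
d = decay_rate(b=2, h=1000)

# After h decay-units the accumulated weight d ** h is (up to rounding) 1 / b.
weight_after_h = d ** 1000
```

With h = 1 the rate reaches its lower bound 1/b, and a larger h pushes d toward 1, so old information fades more slowly.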

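Preliminaries (3)
• The total number of transactions $|D|_k$ in the current data stream $D_k$:
  • $|D|_1 = 1$ and $|D|_k = |D|_{k-1} \times d + 1$ for $k \geq 2$.
  • The value of $|D|_k$ converges to $1/(1-d)$ as $k$ increases without bound.
• The count $C_k(e)$ of an itemset $e$ in the current data stream $D_k$:
  • $C_k(e) = C_{k-1}(e) \times d + W_k(e)$, where $W_k(e) = 1$ if $e \subseteq T_k$ and $0$ otherwise.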
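A small sketch of how the decayed totals behave, assuming the decay-unit is a single transaction: every older value is multiplied by d once per new transaction and the new transaction adds weight 1. That assumption is consistent with the stated limit 1/(1-d). The function name `decayed_totals` and the sample stream are hypothetical.

```python
def decayed_totals(transactions, itemset, d):
    """Return (decayed |D|_k, decayed C_k(itemset)) after reading all transactions.

    Assumption: each arriving transaction first decays the running values by d,
    then adds 1 to the total, and 1 to the count if it contains the itemset.
    """
    total = 0.0
    count = 0.0
    for t in transactions:
        total = total * d + 1.0
        count = count * d + (1.0 if itemset <= t else 0.0)
    return total, count


# Hypothetical three-transaction stream with d = 0.5.
stream = [{"a", "b"}, {"a"}, {"a", "b"}]
total, count = decayed_totals(stream, {"a", "b"}, d=0.5)

# A long stream approaches the limit 1 / (1 - d); for d = 0.9 that is 10.
long_total, _ = decayed_totals([{"a"}] * 1000, {"a"}, d=0.9)
```

Because the total is bounded by 1/(1-d), the decay rate effectively fixes how many recent transactions carry noticeable weight.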

Count Estimation of an itemset (1)
• The maximum possible count of an itemset is estimated as the minimum among the maximum possible counts of all of its subsets.

Count Estimation of an itemset (2)
• Definition 1 introduces, for an itemset $e$:
  • the set of $e$'s subsets
  • the set of $e$'s $m$-subsets
  • the set of counts of $e$'s $m$-subsets
• Definition 2, for two itemsets $e_1$ and $e_2$:
  • The union-itemset $e_1 \cup e_2$ is composed of all items that are members of either $e_1$ or $e_2$.
  • The intersection-itemset $e_1 \cap e_2$ is composed of all items that are members of both $e_1$ and $e_2$.

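Count Estimation of an itemset (3)
• Least exclusively distributed (LED): the items of an itemset appear together in as many transactions as possible.
• Most exclusively distributed (MED): the items of an itemset appear exclusively of one another in as many transactions as possible.
• The maximum count of an $n$-itemset $e$, reached under LED, is the smallest count among its $(n-1)$-subsets:
  • $C^{max}(e) = \min \{ C(\alpha) : \alpha \subset e, |\alpha| = n-1 \}$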

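Count Estimation of an itemset (4)
• For two itemsets $e_1$ and $e_2$, the minimum count of their union-itemset is:
  • $C^{min}(e_1 \cup e_2) = \max(0, C(e_1) + C(e_2) - C(e_1 \cap e_2))$ if $e_1 \cap e_2 \neq \emptyset$
  • $C^{min}(e_1 \cup e_2) = \max(0, C(e_1) + C(e_2) - |D|)$ if $e_1 \cap e_2 = \emptyset$
• The minimum count $C^{min}(e)$ of an $n$-itemset $e$ can be estimated from unions of its $(n-1)$-subsets:
  • $C^{min}(e) = \max \{ C^{min}(\alpha_i \cup \alpha_j) : \alpha_i, \alpha_j \subset e, |\alpha_i| = |\alpha_j| = n-1, i \neq j \}$
• Estimation error:
  • $E(e) = C^{max}(e) - C^{min}(e)$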
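The bounds can be illustrated with a short sketch. It assumes the upper bound is the smallest count among the (n-1)-subsets and the lower bound follows from inclusion-exclusion on pairs of (n-1)-subsets, with the total number of transactions standing in for the count of an empty intersection. The function names, the `counts` dictionary and the sample numbers are hypothetical.

```python
from itertools import combinations


def sub_itemsets(e):
    """All (n-1)-subsets of the n-itemset e (requires len(e) >= 2)."""
    return [frozenset(s) for s in combinations(sorted(e), len(e) - 1)]


def c_max(e, counts):
    """Upper bound: the smallest count among the (n-1)-subsets of e."""
    return min(counts[s] for s in sub_itemsets(e))


def c_min_pair(e1, e2, counts, total):
    """Lower bound on the count of e1 ∪ e2 by inclusion-exclusion."""
    inter = e1 & e2
    c_inter = counts[inter] if inter else total  # empty set is in every transaction
    return max(0, counts[e1] + counts[e2] - c_inter)


def c_min(e, counts, total):
    """Lower bound: the best pairwise bound over the (n-1)-subsets of e."""
    return max(c_min_pair(a, b, counts, total)
               for a, b in combinations(sub_itemsets(e), 2))


# Hypothetical counts over 12 transactions.
counts = {frozenset(k): v for k, v in
          {"a": 8, "b": 9, "c": 8, "ab": 6, "ac": 5, "bc": 7}.items()}
e = frozenset("abc")
upper = c_max(e, counts)          # min(6, 5, 7)
lower = c_min(e, counts, 12)      # max(6+5-8, 6+7-9, 5+7-8)
error = upper - lower
```

For the 2-itemset {a, b} in the same data the bounds are 5 and 8, which bracket the stored count of 6.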

estDec Method (1)
• Every node in a monitoring lattice maintains a triple (cnt, err, MRtid) for its corresponding itemset $e$:
  • cnt: the count of $e$.
  • err: the maximum error count of $e$.
  • MRtid: the id of the most recent transaction that contains $e$.

estDec Method (2)
• The estDec method is composed of four phases:
  • Phase I: parameter updating phase
  • Phase II: count updating phase
  • Phase III: delayed-insertion phase
  • Phase IV: frequent itemset selection phase

estDec Method (3)
• Phase I: $|D|_k$ is updated.
• Phase II: the counts of those itemsets in the monitoring lattice (ML) that appear in $T_k$ are updated.
  • $S_{prn}$: the threshold for pruning.
  • If a 1-itemset is pruned from ML, it is impossible to estimate its count later.
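One way to picture the count update is sketched below. It assumes a node is touched only when its itemset appears in the new transaction, so the decay for the skipped transactions is applied in one step using k - MRtid; the same factor appears in the Phase IV support formula. The class `Node`, the functions `update_node` and `is_prunable`, and the exact pruning rule (support below the pruning threshold, 1-itemsets kept) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    cnt: float   # decayed count of the itemset
    err: float   # maximum error of the count
    mrtid: int   # id of the most recent transaction containing the itemset


def update_node(node: Node, k: int, d: float) -> None:
    """Apply the pending decay, then count the occurrence in transaction k."""
    factor = d ** (k - node.mrtid)
    node.cnt = node.cnt * factor + 1.0
    node.err = node.err * factor
    node.mrtid = k


def is_prunable(node: Node, size: int, total: float, s_prn: float) -> bool:
    """Assumed rule: prune when support drops below s_prn, but keep 1-itemsets."""
    return size > 1 and node.cnt / total < s_prn


# Hypothetical node last seen at transaction 3, updated at transaction 5.
node = Node(cnt=2.0, err=1.0, mrtid=3)
update_node(node, k=5, d=0.5)
```

Keeping 1-itemsets reflects the remark that a pruned 1-itemset can no longer have its count estimated from subsets.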

estDec Method (4)
• Phase III: find new itemsets that have a high possibility of becoming frequent. A new itemset is inserted into ML in two cases:
  • A new 1-itemset; the cnt of a 1-itemset is its actual count.
  • An itemset $e$ with $C^{max}(e) / |D|_k \geq S_{ins}$, where $S_{ins}$ is the threshold for delayed insertion.
• Upper bound on the count of a newly inserted itemset $e$:
  • cnt_for_subsets $= (1 - d^{|e|-1}) / (1 - d)$
  • max_cnt_before_subsets $= S_{ins} \times (|D|_k - (|e|-1)) \times d^{|e|-1}$
  • $C^{upper}(e) =$ max_cnt_before_subsets $+$ cnt_for_subsets
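The delayed-insertion quantities can be computed directly from the three formulas on this slide. The sketch below is a plain transcription under the assumption that |e| is the itemset size and |D|k is the decayed transaction total; the function names and sample values are hypothetical.

```python
def c_upper(size: int, total: float, d: float, s_ins: float) -> float:
    """Upper bound on the count of a newly inserted itemset of the given size."""
    cnt_for_subsets = (1 - d ** (size - 1)) / (1 - d)
    max_cnt_before_subsets = s_ins * (total - (size - 1)) * d ** (size - 1)
    return max_cnt_before_subsets + cnt_for_subsets


def passes_insertion_test(c_max: float, total: float, s_ins: float) -> bool:
    """Delayed-insertion condition: estimated maximum support reaches s_ins."""
    return c_max / total >= s_ins


# Hypothetical values: a 3-itemset, |D|k = 10, d = 0.5, s_ins = 0.1.
bound = c_upper(size=3, total=10.0, d=0.5, s_ins=0.1)
```

For these values cnt_for_subsets is 1.5 and max_cnt_before_subsets is 0.2, so the bound is about 1.7.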

estDec Method (5)
• Phase IV: produces all current frequent itemsets in ML.
  • An itemset $e$ is frequent if its current support $(cnt \times d^{(k - MRtid)}) / |D|_k$ is greater than $S_{min}$.
  • Its current support error is $(err \times d^{(k - MRtid)}) / |D|_k$.
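The selection step can be sketched as a single pass over the lattice entries. The dictionary layout (itemset mapped to a `(cnt, err, MRtid)` tuple) and the function name `select_frequent` are assumptions made for illustration; the two formulas are the ones given on this slide.

```python
def select_frequent(lattice, k, total, d, s_min):
    """Return {itemset: (support, support_error)} for the frequent itemsets.

    lattice maps each itemset to its (cnt, err, mrtid) triple; the stored
    values are decayed up to the current transaction id k before comparison.
    """
    result = {}
    for itemset, (cnt, err, mrtid) in lattice.items():
        factor = d ** (k - mrtid)
        support = cnt * factor / total
        if support > s_min:
            result[itemset] = (support, err * factor / total)
    return result


# Hypothetical lattice at transaction 10 with |D|k = 8 and d = 0.5.
lattice = {
    frozenset("a"): (4.0, 0.0, 10),   # seen in the current transaction
    frozenset("ab"): (2.0, 1.0, 8),   # last seen two transactions ago
}
frequent = select_frequent(lattice, k=10, total=8.0, d=0.5, s_min=0.1)
```

Here {a} has support 0.5 and is reported, while {a, b} decays to 0.0625 and falls below the threshold.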

estDec Method (6)
• Force-pruning operation:
  • All insignificant itemsets in ML can be pruned.
  • It is performed when the current size of ML reaches a threshold.

Experimental (1)
• Performance of the estDec method on the data set T10.I4.D1000K.
• $S_{ins}$ is denoted as p%; its actual value is $S_{min} \times$ p%.
• The force-pruning operation is performed every 1,000 transactions.
• Figure panels: (a) memory usage, (b) processing time of Phases I to III, (c) processing time of Phase IV.

Experimental (2)
• Accuracy of the mining result
• Average support error
  • ASE(RestDec|RdApriori)

Experimental (3)
• Adaptability of the estDec method to changes of information in a data stream.
• Coverage rate CR(X)
• |R|: the total number of frequent itemsets in ML
