
Verifying and mining frequent patterns from large windows over data streams

Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo. ICDE 2008.


Presentation Transcript


  1. Verifying and mining frequent patterns from large windows over data streams. Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo. ICDE 2008

  2. Outline • Introduction and motivation • SWIM algorithm • DTV and DFV algorithms • Experiments • Conclusion

  3. Introduction and motivation • Conditional counting • Verifiers: DTV and DFV verify the frequency of previously frequent itemsets over newly arriving windows • Fast verifiers enable incremental frequent itemset mining: the Sliding Window Incremental Miner (SWIM)

  4. SWIM algorithm • The difficulty: when a new pattern is added to the pattern tree for the first time, its true frequency in the whole window is not known, since this pattern wasn't frequent in the previous n-1 slides • W: window • PT (pattern tree): a superset of the frequent patterns over W • aux_array: stores the partial frequency of a pattern for each window in which its full frequency is not yet known • p.fi: the frequency of p in the i-th slide • p.freq: p's cumulative frequency in the current window

  5. SWIM algorithm (cont.) • Example (each window spans three slides; pattern p is first added to PT at slide S4): slides … S2 S3 S4 S5 S6 S7 …, windows W4 W5 W6 W7 • W4: aux_array = <p.f4, p.f4>, p.freq = p.f4 • W5: aux_array = <p.f2+p.f4, p.f4+p.f5>, p.freq = p.f4+p.f5 • W6: aux_array = <p.f2+p.f3+p.f4, p.f3+p.f4+p.f5>, p.freq = p.f4+p.f5+p.f6 • W7: p.freq = p.f5+p.f6+p.f7
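The bookkeeping in this example can be traced with a small Python sketch. It is only an illustration of the update pattern shown on the slide, not the authors' implementation: the function name `swim_trace`, the dictionary `f` of per-slide frequencies, and the sample numbers are all assumptions made for the example. Each aux_array entry belongs to one window whose count is still incomplete; at every step it receives either the new slide's frequency or the frequency of the slide that just expired.

```python
def swim_trace(f, first, n, last):
    """Trace p.freq and aux_array for one pattern p (illustrative sketch).

    f     -- hypothetical map: slide index -> frequency of p in that slide
    first -- slide at which p is first added to the pattern tree
    n     -- number of slides per window
    last  -- last slide to process
    """
    freq = f[first]
    # One entry per window W_first .. W_(first+n-2); only slide `first` is counted so far.
    aux = [f[first]] * (n - 1)
    trace = {first: (freq, tuple(aux))}
    for t in range(first + 1, last + 1):
        leaving = t - n          # slide that drops out of the window at time t
        freq += f[t]
        if leaving >= first:
            freq -= f[leaving]   # this slide was part of p.freq, so subtract it
        if t < first + n:
            for j in range(n - 1):
                # Entry j tracks window W_(first+j): it contains the new slide t
                # when j >= t - first, otherwise it contains the expired slide.
                aux[j] += f[t] if j >= t - first else f[leaving]
            trace[t] = (freq, tuple(aux))
        else:
            trace[t] = (freq, None)   # all delayed windows resolved; aux_array dropped
    return trace


# Made-up per-slide frequencies for slides S2..S7.
f = {2: 1, 3: 2, 4: 5, 5: 4, 6: 3, 7: 2}
trace = swim_trace(f, first=4, n=3, last=7)
for window, (freq, aux) in trace.items():
    print(f"W{window}: p.freq={freq} aux_array={aux}")
```

With these made-up frequencies, the aux_array entries at W6 equal p.f2+p.f3+p.f4 and p.f3+p.f4+p.f5, which are the true frequencies of p in W4 and W5, available two slides and one slide late respectively.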

  6. Analysis of SWIM algorithm • Delay: a pattern is reported late when its frequency in a window turns out to be larger than the minimum support only after its counts over the earlier slides become known • Maximum delay: n-1 slides (n: number of slides per window) • Bottleneck: counting frequencies of itemsets over a given dataset (delay = L, n-L+1 slides)

  7. Conditional counting • Goal: verify counts for a given set of patterns. For each pattern p, the verifier either • 1. returns p's true frequency in D if p has occurred at least min_freq times, or • 2. reports that p has occurred fewer than min_freq times (the exact frequency is not required in this case, so the verifier can skip any pattern whose frequency is below min_freq)

  8. Conditional counting (cont.) • Verification • Given a set of transactions T, a set of patterns P and a threshold s • Goal: find the exact frequency of each p ∈ P w.r.t. T if and only if its frequency is ≥ s • If s = 0, verification = counting, but if s > 0 extra computation can be avoided • Proposed fast verifiers • DTV, DFV, hybrid
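The verification contract above can be made concrete with a naive Python sketch. This is a plain scan with early termination, not DTV or DFV; the function name `verify` and the sample data are assumptions for illustration. It returns the exact frequency of a pattern when that frequency is at least s, and `None` otherwise, and it stops scanning a pattern as soon as the remaining transactions can no longer lift it to s.

```python
def verify(transactions, patterns, s):
    """Naive conditional counting (illustrative sketch, not DTV/DFV).

    Returns {pattern: exact frequency} for patterns with frequency >= s,
    and {pattern: None} for patterns that fall below s.
    """
    tx = [frozenset(t) for t in transactions]
    result = {}
    for p in patterns:
        items = frozenset(p)
        count = 0
        aborted = False
        for seen, t in enumerate(tx):
            # Even if every remaining transaction matched, could p still reach s?
            if count + (len(tx) - seen) < s:
                aborted = True   # no: skip the rest, the exact count is not needed
                break
            if items <= t:       # p is contained in transaction t
                count += 1
        result[tuple(p)] = None if aborted or count < s else count
    return result


transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
patterns = [("a", "b"), ("a", "b", "c"), ("c",)]
print(verify(transactions, patterns, s=2))
print(verify(transactions, patterns, s=0))   # s = 0 degenerates to plain counting
```

With s = 0 the early-exit test never fires, so the function counts every pattern in full, matching the slide's remark that verification with s = 0 is ordinary counting.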

  9. Double-Tree Verifier (DTV) • [Figure: FP-tree side: original fp-tree; fp-tree conditionalized on g; fp-tree |g conditionalized on d. Pattern-tree side: initial pattern tree; pattern tree | "g"; pattern tree | "g" after verification against the FP-tree; filling the original pattern tree using reverse pointers.]

  10. Double-Tree Verifier (DTV) • For very small min_freq values, it becomes impossible to run FP-growth due to the exponential number of paths • Advantage: DTV remains useful when the minimum support decreases

  11. Depth-First Verifier (DFV) • Ancestor Failure: if a path in the fp-tree has already been proved not to contain a prefix of the pattern p, then it does not contain p itself either (Apriori property) • Smaller Sibling Equivalence: if a path in the fp-tree has already been marked as containing (or not containing) a smaller sibling of the pattern p, then it does (or does not) contain p itself too • Parent Success: if a path in the fp-tree has already been marked as containing the parent pattern of p, then it also contains p
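The Ancestor Failure rule can be illustrated in isolation with a short Python sketch. This is not DFV itself: it walks a pattern tree against plain transactions rather than fp-tree paths, and it omits the sibling and parent marks. The names `Node`, `build_pattern_tree`, `count_with_ancestor_failure` and `frequencies` are assumptions made for the example. The point is only that once a prefix is missing from a transaction, the whole subtree below that prefix is skipped.

```python
class Node:
    """One node of a pattern tree (prefix tree over sorted items)."""
    def __init__(self):
        self.children = {}       # item -> Node
        self.count = 0           # transactions containing the prefix ending here
        self.is_pattern = False  # True if a requested pattern ends at this node


def build_pattern_tree(patterns):
    root = Node()
    for p in patterns:
        node = root
        for item in sorted(p):
            node = node.children.setdefault(item, Node())
        node.is_pattern = True
    return root


def count_with_ancestor_failure(root, transactions):
    for t in transactions:
        items = set(t)
        stack = [root]
        while stack:
            node = stack.pop()
            for item, child in node.children.items():
                if item in items:
                    child.count += 1
                    stack.append(child)
                # Otherwise the prefix is missing from t, so no pattern in
                # this subtree can occur in t and the subtree is never visited.


def frequencies(root):
    out = {}
    stack = [((), root)]
    while stack:
        prefix, node = stack.pop()
        if node.is_pattern:
            out[prefix] = node.count
        for item, child in node.children.items():
            stack.append((prefix + (item,), child))
    return out


tree = build_pattern_tree([("a", "b"), ("a", "b", "c"), ("b", "d")])
count_with_ancestor_failure(tree, [["a", "b", "c"], ["a", "c"], ["b", "d"], ["a", "b"]])
print(frequencies(tree))
```

For the transaction `["a", "c"]`, the walk reaches the node for `a`, finds that `b` is absent, and never examines the longer pattern `("a", "b", "c")`.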

  12. Hybrid Version • Many transactions in the fp-tree and many patterns in the pattern tree: DTV is faster than DFV • Trees are small: DFV is faster than DTV • Hybrid: start with DTV until the conditionalized trees are "small enough", and after that point switch to DFV

  13. Experiments

  14. Experiments (cont.) • transactions = 100k

  15. Conclusion • Verifiers can speed up many other applications: • incremental mining (SWIM) • enhancing static algorithms (counting phase) • privacy-preserving techniques (long transactions) • monitoring / concept shift detection • Hybrid: there is no exact point at which to switch from DTV to DFV
