
Mining Compressed Frequent-Pattern Sets

Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng. Proceedings of the International Conference on Very Large Data Bases (VLDB), 2005. Presenter: 吳建良. Outline: Introduction, Problem Statement and Analysis, Discovering Representative Patterns, Performance Study.


Presentation Transcript


  1. Mining Compressed Frequent-Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Proceedings of the International Conference on Very Large Data Bases (VLDB), 2005 Presenter: 吳建良

  2. Outline • Introduction • Problem Statement and Analysis • Discovering Representative Patterns • Performance Study

  3. Introduction • Frequent Pattern Mining • Given a database, find all the frequent itemsets • Challenges • Efficiency? • Many scalable mining algorithms are available now • Usability? • High minimum support: only common-sense patterns • Low minimum support: an explosive number of results

  4. Existing Compressing Techniques • Lossless compression • Closed frequent patterns • Non-derivable frequent itemsets • Lossy approximation • Maximal frequent patterns

  5. A Motivating Example • Closed frequent patterns • Report P1, P2, P3, P4, P5 • Emphasize support too much • No compression • Maximal frequent patterns • Report only P3 • Care only about the expression • Lose the support information • A desirable output: P2, P3, P4 • High-quality compression needs to consider both expression and support (example: a subset of frequent itemsets in the Accidents dataset)

  6. Compressing Frequent Patterns • Compressing framework • Clustering frequent patterns by pattern similarity • Pick a representative pattern for each cluster • Key Problems • Similarity measure • Quality of the clustering • Expression and support • Efficiency

  7. Distance Measure • Let P1 and P2 be two closed frequent patterns, and let T(P) be the set of transactions containing P; the distance between P1 and P2 is D(P1, P2) = 1 − |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)| • Let T(P1) = {t1, t2, t3, t4, t5} and T(P2) = {t1, t2, t3, t4, t6}; then D(P1, P2) = 1 − 4/6 = 1/3 • D characterizes the support but ignores the expression
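The distance measure on this slide can be sketched directly from supporting transaction-id sets; the sets below are the slide's own example.

```python
def pattern_distance(t1, t2):
    """Jaccard-style distance between two patterns, given their
    supporting transaction-id sets t1 = T(P1) and t2 = T(P2):
    D(P1, P2) = 1 - |T(P1) & T(P2)| / |T(P1) | T(P2)|."""
    union = t1 | t2
    if not union:
        return 0.0  # both patterns support no transactions
    return 1.0 - len(t1 & t2) / len(union)

T_P1 = {"t1", "t2", "t3", "t4", "t5"}
T_P2 = {"t1", "t2", "t3", "t4", "t6"}
print(pattern_distance(T_P1, T_P2))  # 1 - 4/6 = 0.333...
```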

  8. Representative Patterns • Incorporate expression into the representative pattern • The representative pattern should be able to express all the other patterns in the same cluster (i.e., be their superset) • Here the representative pattern is Pr = {38, 16, 18, 12, 17} • The representative pattern is also good w.r.t. distance • D(Pr, P1) ≤ D(P1, P2) and D(Pr, P2) ≤ D(P1, P2) • Since P ⊆ Pr implies T(Pr) ⊆ T(P), the distance can be computed from supports alone

  9. δ-Clustering • δ-cover • A pattern P is δ-covered by another pattern P′ if • P ⊆ P′ and D(P, P′) ≤ δ • δ-cluster • There exists a representative pattern Pr in the cluster • Each pattern P in the cluster is δ-covered by Pr
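Because a representative is a superset of every pattern it covers, the δ-cover test reduces to a support ratio; a minimal sketch (the supports 100 and 80 below are made up for illustration, not taken from the paper):

```python
def delta_covered(p, p_prime, sup_p, sup_p_prime, delta):
    """Check whether pattern p is delta-covered by p_prime.
    Requires p to be a subset of p_prime; then T(p_prime) is a subset
    of T(p), so D(p, p_prime) = 1 - sup(p_prime)/sup(p)."""
    if not set(p) <= set(p_prime):
        return False  # the covering pattern must be a superset
    return 1.0 - sup_p_prime / sup_p <= delta

# Hypothetical supports: P = {38,16,18} with sup 100,
# Pr = {38,16,18,12,17} with sup 80 -> distance 0.2 <= 0.35
print(delta_covered({38, 16, 18}, {38, 16, 18, 12, 17}, 100, 80, 0.35))
```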

  10. Characteristics of δ-Clustering • P is δ-covered by a representative pattern Pr • Since P ⊆ Pr, T(Pr) ⊆ T(P), so D(P, Pr) = 1 − sup(Pr)/sup(P) ≤ δ • Hence sup(Pr) ≤ sup(P) ≤ sup(Pr)/(1 − δ); with min_sup M, every representative pattern has support at least (1 − δ)M

  11. Pattern Compression Problem • Find the minimum number of clusters (representative patterns) such that • every frequent pattern is δ-covered by at least one representative pattern • NP-hard • By reduction from the set-covering problem

  12. Discovering Representative Patterns • RPglobal • Assume all the frequent patterns are mined • Directly apply greedy set-covering algorithm • Guaranteed bounds w.r.t. optimal solution • RPlocal • Directly mine from raw data set • Gain in efficiency, lose in bound guarantee • RPcombine • Combine above two methods • Trade-off w.r.t. efficiency and performance

  13. RPglobal • Algorithm • Collect the complete coverage information • Find the set of representative patterns by greedily maximizing ∣Set(RP)∣ • Example: δ = 0.35, M = 3, FP(M) = {A, B, C, D, AB, BD}; mining with support ≥ (1 − δ)M = 2 gives the closed patterns {B, C, AB, BD, ABC, ABD}

  14. RPglobal (cont.) • Step 1: the complete coverage information • Set(A)={A}, Set(B)={B}, Set(C)={C}, Set(D)={D} • Set(AB)={A, B}, Set(AC)={A, C}, Set(AD)={A, D}, Set(BC)={C}, Set(BD)={B, D} • Set(ABC)={A, C, AB}, Set(ABD)={A, D, BD} • Step 2: greedily pick the representative patterns to cover FP(M) = {A, B, C, D, AB, BD} • Pick Set(ABC), then Set(ABD), then Set(AB) or Set(BD)
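The greedy step above is the standard greedy set-cover heuristic; a sketch that replays the slide's example (the tie-break toward longer patterns is an assumption made so the trace matches the slide, not a rule stated in the paper):

```python
def greedy_representatives(coverage, to_cover):
    """Greedy set cover: repeatedly pick the candidate whose coverage
    set hits the most still-uncovered patterns; ties are broken in
    favor of longer patterns (an assumed tie-break)."""
    uncovered = set(to_cover)
    chosen = []
    while uncovered:
        best = max(coverage, key=lambda r: (len(coverage[r] & uncovered), len(r)))
        if not coverage[best] & uncovered:
            break  # remaining patterns cannot be covered
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen

# Coverage information from the slide (Step 1)
coverage = {
    "A": {"A"}, "B": {"B"}, "C": {"C"}, "D": {"D"},
    "AB": {"A", "B"}, "AC": {"A", "C"}, "AD": {"A", "D"},
    "BC": {"C"}, "BD": {"B", "D"},
    "ABC": {"A", "C", "AB"}, "ABD": {"A", "D", "BD"},
}
fp = {"A", "B", "C", "D", "AB", "BD"}
print(greedy_representatives(coverage, fp))  # ['ABC', 'ABD', 'AB']
```

With a different tie-break the last pick could equally be BD, which is exactly the "Set(AB) or Set(BD)" choice on the slide.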

  15. RPlocal • Depth-first search strategy • A pattern P can only be covered by its sons or by patterns visited before it • Integrate pattern compression into the frequent-pattern mining process (FP-growth) • Benefits • No need to store all the output patterns • More efficient pruning methods (figure: a pattern P = {a, c}, its sons, and the visited patterns covering P)

  16. RPlocal (cont.) • Algorithm • FP-growth mining with a depth-first search strategy • Remember all previously discovered representative patterns • For each pattern P not yet covered • Use the closed_index for closed-pattern pruning • Select as representative the pattern Pr that covers P and has the largest coverage

  17. RPlocal (cont.) • Example: δ = 0.35, M = 3 • Reorder all itemsets by item frequency (B:4, A:3, C:3, D:3), build the FP-tree from the reordered dataset, and depth-first search the pattern space

  18. RPlocal (cont.) • FP-tree∣D — Pr set: Set(D)={D}; extensions DC, DA, DB • sup(DC)=1 → infrequent, pruned • FP-tree∣DA — Pr set: Set(DA)={D} • DAB — Pr set: Set(DAB)={D, DA} • After visiting DB — Pr set: Set(DAB)={D, DA, DB}

  19. RPlocal (cont.) • FP-tree∣C — Pr set: Set(DAB)={D, DA, DB}, Set(C)={C}; extensions CA, CB • FP-tree∣CA — Pr set: Set(DAB)={D, DA, DB}, Set(CA)={C} • CB — closed pruning • CAB — Pr set: Set(DAB)={D, DA, DB}, Set(CAB)={C, CA} • Finally — Pr set: Set(DAB)={D, DA, DB}, Set(CAB)={C, CA, CB}

  20. Closed Pruning • During the depth-first search, the single items are partitioned into three disjoint sets: • conditional set: the current pattern • todo-set: items still to be expanded • done-set: all the other items • closed_index example: current pattern CB • conditional set: {C, B}, todo-set: {}, done-set: {D, A} • closed_index over items (D, C, A, B): CB = 0 1 1 1 • The bit for "A" is set and "A" belongs to the done-set, so CB is not closed
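The closed_index idea above can be sketched as one bit per item, set when that item occurs in every transaction containing the current pattern; if such an item lies in the done-set, a superset with equal support exists, so the pattern is not closed. The toy database below is made up to mirror the slide's CB = 0 1 1 1 example, not taken from the paper.

```python
def closed_index(db, pattern, items):
    """One bit per item: 1 iff the item occurs in every
    transaction that contains `pattern`."""
    supporting = [t for t in db if pattern <= t]
    return {i: int(all(i in t for t in supporting)) for i in items}

def closed_pruned(db, pattern, done_set, items):
    """Prune `pattern` as non-closed if some done-set item has
    its closed_index bit set."""
    idx = closed_index(db, pattern, items)
    return any(idx[i] for i in done_set)

# Hypothetical transactions in which A co-occurs with every CB
db = [{"A", "C", "B"}, {"A", "C", "B"}, {"A", "B", "D"}, {"B", "D"}]
items = ["D", "C", "A", "B"]
print(closed_index(db, {"C", "B"}, items))   # {'D': 0, 'C': 1, 'A': 1, 'B': 1}
print(closed_pruned(db, {"C", "B"}, {"D", "A"}, items))  # True: CB is not closed
```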

  21. Experimental Setting • Data • Frequent itemset mining dataset repository (http://fimi.cs.helsinki.fi/data/) • Accidents, Chess, Connect, and Pumsb_star datasets • Implementation Environment • Pentium 4, 2.6 GHz, with 1 GB memory, under Linux

  22. Performance Study • Number of representative patterns (figures: Accidents dataset, δ = 0.1; Chess dataset, δ = 0.1)

  23. Performance Study • Running time (figures: Pumsb_star dataset, δ = 0.1; Accidents dataset, δ = 0.1)

  24. Performance Study • Quality of representative patterns (figures: Accidents dataset, min_sup = 0.4, δ = 0.1; Accidents dataset, δ = 0.2)
