
Frequent Item Based Clustering


Presentation Transcript


  1. Frequent Item Based Clustering M.Sc. Student: Homayoun Afshar Supervisor: Martin Ester

  2. Contents • Introduction and motivation • Frequent item sets • Text data as transactional data • Cluster set definition • Our approach • Test data set, results, challenges • Related works • Conclusion

  3. Introduction and Motivation • Huge amount of information online • Much of this information is in text format • e.g., emails, web pages, newsgroup postings, … • Need to group related documents • A nontrivial task

  4. Frequent Item Sets • Given a dataset D = {t1, t2, …, tn} • Each ti is a transaction, ti ⊆ I, where I is the set of all items • Given a threshold min_sup • An item set i ⊆ I such that |{t : i ⊆ t and t ∈ D}| > min_sup is a frequent item set with respect to minimum support min_sup
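
A minimal Python sketch of this definition (the function and variable names are mine, and the brute-force enumeration is for illustration only; a real miner would use Apriori [AS94]):

from itertools import combinations

def frequent_item_sets(transactions, min_sup):
    # Return every item set i with |{t in D : i is a subset of t}| > min_sup.
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support > min_sup:
                frequent.append((candidate, support))
                found_any = True
        if not found_any:
            break  # no frequent k-set means no frequent (k+1)-set can exist
    return frequent

D = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]
print(frequent_item_sets(D, min_sup=2))  # [(('a',), 3), (('b',), 3), (('c',), 3)]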

  5. Text Data as Transactional Data • Treat each word as an item • and each document as a transaction • Using a minimum support, find the frequent item sets (frequent word sets) • Frequent word sets = frequent item sets
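
A short sketch of this mapping, under the assumption that tokenization is a plain whitespace split (the talk's actual preprocessing, described later, also removes stop words and stems):

def document_to_transaction(text):
    # Each distinct word is an item, so a document becomes a set of items.
    return {word.lower().strip(".,;:!?") for word in text.split()}

docs = ["Net profit rose, Reuter said.",
        "Profit vs loss per share, Reuter said."]
transactions = [document_to_transaction(d) for d in docs]
# Frequent word sets are now ordinary frequent item sets over these transactions,
# e.g. frequent_item_sets(transactions, min_sup=1) finds ('profit', 'reuter', 'said').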

  6. Cluster Set Definition • f = {X1, X2, …, Xn} is the set of all frequent item sets with respect to some minimum support • c = {C1, C2, …, Cm} is a cluster set, where Ci is the set of documents covered by some Xk ∈ f • And…

  7. Cluster Set Definition … • An optimal cluster set has to: • Cover the whole data set • Minimize the mutual overlap between its clusters • Keep the clusters roughly the same size (see the sketch below)
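
Assuming Cov(·) denotes the set of covered documents, these three criteria can be evaluated directly; a minimal sketch (all names are mine):

def cluster_set_stats(clusters, all_docs):
    # clusters: one set of document ids per frequent word set in the cluster set.
    covered = set().union(*clusters)
    overlap = sum(1 for d in covered if sum(d in c for c in clusters) > 1)
    return len(covered), covered == all_docs, overlap, [len(c) for c in clusters]

clusters = [{1, 2, 3}, {3, 4}, {4, 5, 6}]
print(cluster_set_stats(clusters, all_docs={1, 2, 3, 4, 5, 6}))
# (6, True, 2, [3, 2, 3]): full coverage, docs 3 and 4 are in two clusters each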

  8. Our Approach: Frequent-Item Based Clustering … • Find all the frequent word sets • Form cluster sets with just one cluster • Overlap is zero • Coverage is the support of the frequent item set representing the cluster • Form cluster sets with two clusters • Find their overlap and coverage

  9. Our Approach: Frequent-Item Based Clustering … • Prune the candidate list of cluster sets • If Cov(ci) ⊆ Cov(cj) and Overlap(ci) > Overlap(cj), where ci and cj are candidates at the same level, remove ci • Also remove ci if Overlap(ci) ≥ |Cov(ci)| • Generate the next level • Find overlap and coverage, then prune • Stop when no candidates are left (sketch below)
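
A hedged sketch of the level-wise search (the subset relation in the first prune rule is my reading of the slide, and candidate generation is simplified to plain k-subsets of the frequent word sets):

from itertools import combinations

def level_wise_search(word_set_covers, max_level):
    # word_set_covers maps each frequent word set to the set of doc ids it covers.
    names = list(word_set_covers)
    survivors = []
    for k in range(1, max_level + 1):
        level = []
        for combo in combinations(names, k):
            clusters = [word_set_covers[n] for n in combo]
            covered = set().union(*clusters)
            overlap = sum(1 for d in covered
                          if sum(d in c for c in clusters) > 1)
            if overlap < len(covered):  # slide rule: drop ci if Overlap(ci) >= |Cov(ci)|
                level.append((combo, covered, overlap))
        # dominance pruning within a level: drop ci when some cj covers at least
        # as much (Cov(ci) subset of Cov(cj)) with strictly less overlap
        kept = [ci for ci in level
                if not any(ci[1] <= cj[1] and ci[2] > cj[2] for cj in level)]
        if not kept:
            break
        survivors.extend(kept)
    return survivors

covers = {("reuter",): {1, 2, 3, 4}, ("ct", "vs"): {3, 5}, ("net",): {5, 6}}
print(level_wise_search(covers, max_level=2))
# keeps all three 1-cluster sets and, at level 2, only {(reuter), (net)}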

  10. Our Approach: Coverage and Overlap … • Use a bit matrix • Each column is a document • Each row is a frequent word set • Coverage: OR the rows, then count the 1s • Overlap: a combination of AND, XOR, and OR, then count the 1s (worked examples on the next two slides)

  11. Our Approach: Coverage and Overlap …
  10110010 (1st)
  10001010 (2nd)
  10101100 (3rd)
  ------------
  Coverage: OR all three = 10111110; count the 1s -> coverage = 6
  Cost: 2 ORs + counting the 1s; cost for counting the 1s = 8 (shifts, ANDs, adds)

  12. Our Approach: Coverage and Overlap …
  Overlap:
  10110010 (1st)
  10001010 (2nd)
  ------------
  AND of the first two = 10000010 (i)
  XOR of the first two = 00111000 (ii)
  10101100 (3rd)
  ------------
  AND of the 3rd with (ii) = 00101000 (iii)
  ------------
  OR of (i) and (iii) = 10101010
  Count the 1s -> overlap = 4
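
The same bookkeeping with Python integers as bit rows reproduces the numbers above; the incremental once/twice formulation below is equivalent to the slide's AND/XOR/OR sequence (a sketch, names mine):

rows = [0b10110010, 0b10001010, 0b10101100]  # one bit row per frequent word set

union = 0                                    # Coverage: OR all rows, count the 1s
for r in rows:
    union |= r
print(bin(union), bin(union).count("1"))     # 0b10111110 6

once = twice = 0                             # Overlap: bits set in two or more rows
for r in rows:
    twice |= once & r                        # already seen once and set again
    once |= r
print(bin(twice), bin(twice).count("1"))     # 0b10101010 4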

  13. Test Data, Results, Challenges • Test data set • Reuters-21578: 21,578 Reuters news documents • 8655 of them have exactly one topic • Remove stop words • Stem all the words (sketch below) • Number of frequent word sets • 5% min_sup: 10678 • 10% min_sup: 1217 • 20% min_sup: 78
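
A hedged preprocessing sketch, assuming NLTK for the stop word list and Porter stemming (the slides do not name the tools actually used):

from nltk.corpus import stopwords   # requires nltk.download("stopwords") once
from nltk.stem import PorterStemmer

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # Strip punctuation, drop stop words, stem, and return the document's transaction.
    words = (w.strip(".,;:!?") for w in text.lower().split())
    return {stemmer.stem(w) for w in words if w and w not in stop}

print(preprocess("The company included the profits, Reuter said"))
# {'compani', 'includ', 'profit', 'reuter', 'said'}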

  14. Test Data, Results, Challenges • With 20% minimum support • Sample 2-cluster candidate set • {(said, reuter) (line, ct, vs)} • Overlap = 1 • Coverage = 5259 • Sample 5-cluster candidate set • {(reuter) (vs) (net) (line, ct, net) (vs, net, shr)} • Overlap = 3303 • Coverage = 8609

  15. Test Data, Results, Challenges • More results, with min_sup = 10% • {(reuter) (includ) (mln, includ) (mln, profit) (year, ct) (year, mln, net)} • 6-cluster cluster set • Coverage = 8616 • Overlap = 2553 • {(reuter) (loss) (profit) (year, 1986) (mln, profit) (year, ct) (year, mln, net)} • 7-cluster cluster set • Coverage = 8611 • Overlap = 2705 • {(reuter) (loss) (profit) (year, 1986) (mln, includ) (mln, profit) (year, ct) (year, mln, net)} • 8-cluster cluster set • Coverage = 8616 • Overlap = 3033

  16. Test Data, Results, Challenges • At lower support values, pruning is very slow • 2-cluster set with min_sup = 20% • Creating = 0.010 seconds • Updating (overlap and coverage) = 1.853 seconds • Pruning = 11.767 seconds • Sorting = 0.000 seconds • Number of candidates • Before pruning = 3003 • After pruning = 73

  17. Test Data, Results, Challenges • Hierarchical clustering • Clustering quality • On our test data set: entropy • On real data sets the classes are not known • Make the pruning more efficient • Define an upper threshold • Prune candidates using ratios such as overlap to coverage • Use only maximal item sets
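
For the entropy measure mentioned above, a minimal sketch assuming each document carries one known class label, as in the single-topic Reuters subset (names mine):

from collections import Counter
from math import log2

def cluster_entropy(labels):
    # Entropy of the class distribution inside one cluster; 0 means pure.
    counts = Counter(labels)
    return -sum(c / len(labels) * log2(c / len(labels)) for c in counts.values())

def clustering_entropy(clusters):
    # Size-weighted average entropy over all clusters; lower is better.
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

print(clustering_entropy([["earn", "earn", "acq"], ["crude", "crude"]]))  # ~0.551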

  18. Related Works • A similar idea: Frequent Term-Based Text Clustering [BEX02] • Florian Beil, Martin Ester, Xiaowei Xu • FTC focuses on finding one optimal, non-overlapping clustering • HFTC is its hierarchical, overlapping variant

  19. Conclusion • To get an optimal clustering • Reduce the minimum support • Reduce the number of frequent items • Introduce a maximum support • Use only maximal item sets • Faster pruning • Hierarchical clustering

  20. References
  [AS94] R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proc. 1994 Int. Conf. on Very Large Data Bases (VLDB'94), pages 487-499, Santiago, Chile, Sept. 1994.
  [BEX02] F. Beil, M. Ester, X. Xu. Frequent Term-Based Text Clustering. In Proc. 2002 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'02).
  [HK01] J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
