
Clustering for web documents

Clustering for web documents. 박흠. Contents. Cluto Criterion Functions for Document Clustering* Experiments and Analysis (2002) by Ying Zhao and George Karypis Department of Computer Science, University of Minnesota, Minneapolis, MN 55455 Feature selection for web documents (2004).


Presentation Transcript


  1. Clustering for web documents 박흠

  2. Contents • Cluto • Criterion Functions for Document Clustering: Experiments and Analysis (2002) • by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455 • Feature selection for web documents (2004)

  3. Cluto • Clustering Toolkit. 2.1.1 • Department of Computer Science, University of Minnesota, Minneapolis • http://www-users.cs.umn.edu/~karypis/ • platform • Linux 2.4.18 • Sun OS 5.7 • Win32 • programs • CLUTO's user callable library • vcluster • scluster

  4. Cluto • What is Cluto? (1/2) • Clustering algorithms • partitional clustering • agglomerative clustering • graph-partitioning clustering • Clustering criterion functions • provides seven different criterion functions that can be used by both the partitional and the agglomerative clustering algorithms • also provides some of the more traditional local criteria (e.g., single-link, complete-link, and UPGMA) for agglomerative clustering

  5. Cluto • What is Cluto? (2/2) • Analyzes the discovered clusters • relations between the objects assigned to each cluster • relations between the different clusters • identifies the features that best describe and/or discriminate each cluster • relationships between the clusters, objects, and features • Operates on very large datasets, in terms of both • the number of objects • the number of dimensions

  6. Cluto • Programs • vcluster • operates in the objects' feature space • scluster • operates in the objects' similarity space • Interface: vcluster [optional parameters] MatrixFile Ncluster • MatrixFile: an n×m matrix whose rows correspond to objects and whose columns correspond to features • Ncluster: the number of clusters
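The interface on this slide can be sketched as a small command builder. This is only an illustration under stated assumptions: the file name docs.mat and the helper build_vcluster_command are hypothetical, the -clmethod option is taken from the parameter slides of this deck, and nothing is executed, so the sketch does not depend on CLUTO being installed.

```python
def build_vcluster_command(matrix_file, n_clusters, clmethod=None):
    """Return the argument list for a vcluster run (nothing is executed).

    matrix_file and n_clusters are the two positional arguments from the
    slide; clmethod is optional and is passed as -clmethod=<value>.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be a positive integer")
    cmd = ["vcluster"]
    if clmethod is not None:
        cmd.append(f"-clmethod={clmethod}")
    cmd += [matrix_file, str(n_clusters)]
    return cmd


# "docs.mat" is a hypothetical matrix file name used only for illustration.
command = build_vcluster_command("docs.mat", 20, clmethod="direct")
print(" ".join(command))
```

The resulting list could be handed to subprocess.run on a machine where vcluster is on the PATH.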

  7. Cluto • Parameters of Algorithms • rb, rbr • k-1 repeated bisections (rbr: additionally optimizes the criterion function) • direct • computed by simultaneously finding all k clusters • agglo • the agglomerative paradigm • graph • using a nearest-neighbor graph • bagglo

  8. Cluto • Parameters of the similarity function • cos: the cosine function (default) • corr: the correlation coefficient • dist: the Euclidean distance (applicable only when -clmethod=graph) • jacc: the extended Jaccard coefficient (applicable only when -clmethod=graph)

  9. Cluto • Parameters of the criterion function • i1, i2, e1, g1, g1p, h1, h2

  10. Cluto • Parameters of the criterion function • slink single link • wslink weighted single link • clink complete link • wclink weighted complete link • upgma UPGMA • cstype • fulltree • rowmodel, colmodel • showfeatures

  11. Criterion Functions for Document Clustering Experiments and Analysis (2002) by Ying Zhao and George Karypis Department of Computer Science, University of Minnesota, Minneapolis, MN 55455

  12. Data Clustering A.K. JAIN Michigan State University M.N. MURTY Indian Institute of Science AND P.J. FLYNN The Ohio State University ACM Computing Surveys

  13. Introduction(1/2) • Clustering algorithms • Agglomerative algorithms • UPGMA, single-link, complete-link, CURE, ROCK, Chameleon • Partitional algorithms • K-means, K-medoids, Autoclass, graph-partitioning-based, spectral-partitioning-based • well suited to large datasets because they are fast • Seven criterion functions • measure intra-cluster similarity, inter-cluster similarity, or a combination of the two: i1, i2, e1, g1, g1p, h1, h2

  14. Introduction(2/2) • Datasets • 15 different data sets

  15. Preliminaries(1/3) • Document Representation • uses the vector space model for each document • d: document, tf: term frequency, tf_i: frequency of the i-th term in the document • terms are weighted by idf or tf*idf, where N is the total number of documents • Similarity Measures • the similarity between two documents d_i and d_j • Cosine function • ||d||: the length of the document vector, used for normalization • 1: identical, 0: nothing in common
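A minimal sketch of the representation and the cosine measure on this slide, in plain Python. The formulas on the original slide are not preserved in the transcript, so the weighting tf_i * log(N / df_i) used here is an assumption (a common form of tf*idf), and the toy documents are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one dict per doc: term -> weight.

    Assumed weighting: tf_i * log(N / df_i), where N is the number of
    documents and df_i is the number of documents containing term i.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts); 0.0 for empty vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents (illustrative only): the first two are identical,
# the third shares no terms with them.
docs = [["web", "cluster", "cluster"], ["web", "cluster", "cluster"], ["graph", "link"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Identical documents score (up to rounding) 1 and documents with no shared terms score 0, matching the two end points named on the slide.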

  16. Preliminaries(2/3) • Euclidean distance function • if dis = 0, the documents are identical; if dis = √2 (for unit-length document vectors), they have nothing in common • Definitions • S: the set of documents • S_1, S_2, …, S_k: the sets of documents in each of the k clusters • k: number of clusters • n_1, n_2, …, n_k: sizes of the corresponding clusters • A: a set of documents • composite vector D_A: the sum of all document vectors in A • centroid vector C_A: the average of the term weights of the documents in A

  17. Preliminaries(3/3) • Vector Properties • S_i, S_j: two sets of documents containing n_i and n_j documents • D_i, D_j: the composite vectors; C_i, C_j: the centroid vectors • The sum of the pairwise similarities between the documents in S_i and the documents in S_j is D_i^t D_j • The sum of the pairwise similarities between the documents in S_i is ||D_i||^2
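The two vector properties can be checked numerically. The sketch below assumes unit-length document vectors (so that a dot product equals a cosine similarity) and uses invented toy vectors; the within-set sum counts every ordered pair, including each document with itself.

```python
import math


def normalize(v):
    """Scale a dense vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def composite(docs):
    """Composite vector: component-wise sum of all document vectors."""
    return [sum(col) for col in zip(*docs)]


def centroid(docs):
    """Centroid vector: composite vector divided by the number of documents."""
    return [x / len(docs) for x in composite(docs)]


# Toy document sets (illustrative only), normalized to unit length.
S_i = [normalize(v) for v in ([1, 0, 2], [0, 1, 1])]
S_j = [normalize(v) for v in ([2, 1, 0], [1, 1, 1], [0, 3, 1])]
D_i, D_j = composite(S_i), composite(S_j)

# Sum of pairwise similarities between the two sets vs. D_i^t D_j.
pairwise_between = sum(dot(a, b) for a in S_i for b in S_j)
# Sum of pairwise similarities inside S_i vs. ||D_i||^2.
pairwise_within = sum(dot(a, b) for a in S_i for b in S_i)

print(pairwise_between, dot(D_i, D_j))
print(pairwise_within, dot(D_i, D_i))
```

Each printed pair agrees up to floating-point rounding, which is why the criterion functions can be evaluated from composite vectors alone instead of from all document pairs.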

  18. Criterion Functions(1/5) • Internal Criterion Functions • I1: maximizes the sum of the average pairwise similarities between the documents assigned to each cluster, using the cosine function; it is similar to hierarchical agglomerative clustering that uses the group-average heuristic to decide which clusters to merge • I2: uses the cosine function; it is the criterion used by the vector-space variant of the K-means algorithm • C_r: the centroid vector of cluster r
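A small sketch of the two internal criterion functions. The slide's formulas did not survive in the transcript, so the forms below are assumptions based on the composite-vector properties of the preliminaries: I1 is taken as the sum over clusters of ||D_r||^2 / n_r and I2 as the sum over clusters of ||D_r||, both for unit-length document vectors and both to be maximized. The toy clusterings are invented.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def composite(docs):
    """Composite vector D_r: component-wise sum of a cluster's documents."""
    return [sum(col) for col in zip(*docs)]


def i1(clusters):
    """Assumed form of I1: sum over clusters of ||D_r||^2 / n_r."""
    return sum(dot(composite(c), composite(c)) / len(c) for c in clusters)


def i2(clusters):
    """Assumed form of I2: sum over clusters of ||D_r||."""
    return sum(math.sqrt(dot(composite(c), composite(c))) for c in clusters)


# Three unit-length toy documents: two identical, one orthogonal to them.
a, b, c = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
good = [[a, b], [c]]   # identical documents share a cluster
bad = [[a, c], [b]]    # identical documents are split

print(i1(good), i1(bad))
print(i2(good), i2(bad))
```

Both functions score the clustering that keeps the identical documents together higher, which is the behavior a maximized internal criterion should show.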

  19. Criterion Functions (2/5) • External Criterion Functions. E1, E2 • optimize a function based on how different the clusters are from each other • the external function is derived so that the centroid vectors of the different clusters are as orthogonal as possible • C: the centroid vector of the entire collection • D: the composite vector of the entire collection; 1/||D|| is constant

  20. Criterion Functions (3/5) • defined with the Euclidean distance function • Hybrid Criterion Functions. H1, H2 • maximize the similarity of the documents within each cluster while minimizing the similarity between each cluster's documents and the entire collection • H1: combines criterion functions I1 and E1

  21. Criterion Functions (4/5) • H2: combines criterion functions I2 and E1 • Graph-Based Criterion Functions • an alternative way to view the relations between documents is to use graphs • G1: based on pairwise similarities between the documents • G2: based on pairwise similarities between the documents and the terms • S: the given collection of n documents • G_s: the similarity graph
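The external and hybrid criterion functions described on the preceding slides can be sketched as follows. The formulas are not preserved in the transcript, so these forms are assumptions: E1 is taken as the sum over clusters of n_r * D_r^t D / ||D_r|| (to be minimized, with the constant 1/||D|| dropped), and the hybrids as the ratios H1 = I1 / E1 and H2 = I2 / E1 (to be maximized). Document vectors are assumed to be unit length, and the toy data is invented.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def composite(docs):
    return [sum(col) for col in zip(*docs)]


def norm(v):
    return math.sqrt(dot(v, v))


def i1(clusters):
    """Assumed I1: sum of ||D_r||^2 / n_r."""
    return sum(dot(composite(c), composite(c)) / len(c) for c in clusters)


def i2(clusters):
    """Assumed I2: sum of ||D_r||."""
    return sum(norm(composite(c)) for c in clusters)


def e1(clusters):
    """Assumed E1: sum of n_r * D_r^t D / ||D_r||, D = composite of all docs."""
    D = composite([d for c in clusters for d in c])
    return sum(len(c) * dot(composite(c), D) / norm(composite(c)) for c in clusters)


def h1(clusters):
    return i1(clusters) / e1(clusters)


def h2(clusters):
    return i2(clusters) / e1(clusters)


# Same toy setup as a quick sanity check: two identical docs, one orthogonal.
a, b, c = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
good = [[a, b], [c]]
bad = [[a, c], [b]]

print(e1(good), e1(bad))   # lower is better
print(h1(good), h1(bad))   # higher is better
print(h2(good), h2(bad))   # higher is better
```

On this toy data the clustering that keeps identical documents together gets the lower E1 and the higher H1 and H2, as expected from a minimized external criterion and maximized hybrids.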

  22. Criterion Functions (5/5) • G1. • G2.

  23. Experimental Results • Direct k-way Clustering

  24. Experimental Results

  25. Experimental Results

  26. Data Sets • the 'Natural Science' category in the Naver directory (http://dir.naver.com) • 6 subcategories in the corpus • 1,215 docs, 17,223 terms, 20 clusters, 5 features per doc, idf

  27. Experimental parameters • Algorithms • rb, rbr • k-1 repeated bisections (rbr: additionally optimizes the criterion function) • direct • computed by simultaneously finding all k clusters • agglo • the agglomerative paradigm • graph • using a nearest-neighbor graph

  28. Experimental parameters • Criterion Functions • i1, i2, e1, g1, g1p, h1, h2, clink, slink • Similarity Functions • cosine measure

  29. Experimental results • Entropy

  30. Entropy
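The entropy results on these slides are not preserved in the transcript. As a reminder of what the measure computes, here is a minimal sketch under an assumed, commonly used definition: the entropy of a cluster is -(1 / log q) * sum_i p_i log p_i over the q classes, and the overall entropy is the size-weighted average over clusters (lower is better). The labels below are invented.

```python
import math
from collections import Counter, defaultdict


def clustering_entropy(cluster_ids, class_labels):
    """Size-weighted, normalized entropy of a clustering (0 = pure clusters)."""
    n = len(class_labels)
    q = len(set(class_labels))
    if q < 2:
        return 0.0  # a single class cannot be mixed
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)
    total = 0.0
    for labels in members.values():
        n_r = len(labels)
        # Entropy of the class distribution inside this cluster, scaled to [0, 1].
        e_r = -sum((k / n_r) * math.log(k / n_r)
                   for k in Counter(labels).values()) / math.log(q)
        total += (n_r / n) * e_r
    return total


# Illustrative labels only.
classes = ["a", "a", "b", "b"]
print(clustering_entropy([0, 0, 1, 1], classes))  # each cluster holds one class
print(clustering_entropy([0, 1, 0, 1], classes))  # each cluster mixes both classes
```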

  31. Experimental results • Purity

  32. Purity
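The purity figures themselves are likewise missing from the transcript. A minimal sketch of the measure, assuming the usual definition: each cluster contributes the count of its most frequent class, and the overall purity is the sum of those counts divided by the number of documents (higher is better). The labels are invented.

```python
from collections import Counter, defaultdict


def clustering_purity(cluster_ids, class_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)
    majority_total = sum(max(Counter(labels).values()) for labels in members.values())
    return majority_total / len(class_labels)


# Illustrative labels only.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
print(clustering_purity(clusters, classes))
```

In this example cluster 0 has a majority of 2 and cluster 1 a majority of 3, so 5 of the 6 documents sit with their cluster's dominant class.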

  33. Best results
