1 / 48

TECHNIQIE OF CLUSTERING ( Panorama of methods, Works of Benno Stein group )

TECHNIQIE OF CLUSTERING ( Panorama of methods, Works of Benno Stein group ) Mikhail Alexandrov Centro de Investigación en Computación Instituto Politécnico Nacional de México (+ 52) 5729-6000 ext. 56544; dyner1950 @mail.ru. Results of Investigation of Benno Stein Group.

lavonn
Download Presentation

TECHNIQIE OF CLUSTERING ( Panorama of methods, Works of Benno Stein group )

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. TECHNIQIE OF CLUSTERING • ( Panorama of methods, Works of Benno Stein group ) • Mikhail Alexandrov • Centro de Investigación en Computación • Instituto Politécnico Nacional de México • (+ 52) 5729-6000 ext. 56544; dyner1950@mail.ru

  2. Results of Investigation of Benno Stein Group Nature of data structure Definition of the best data structure based on graph presentation Numerical method of clustering Development of MajorClust method from the density based group of methods Validity and usability of clustering F-measure for cluster usability (expert opinion) Density indexρfor cluster validity (formal estimation) Numerical experiments Comparison of different methods of clustering and different ways of index selection

  3. Subject of Grouping TEXTUAL DATA NON TEXTUAL DATA Local Terminology It is not important what is the source of data: textual o non textual. Data: work in the space of parameters Texts: work in the space of words

  4. Presentation of Textual Data TEXTS (‘indexed’) TEXTS (parametrized) Terminology local Indexed texts are parametrized texts in the space of words

  5. Types of Grouping Clustering Classificación Characteristics: Caracteristics: Absence of patterns or descriptions Presence of patterns o descripciones of classes, so the results are of classes, so the results are defined by the nature of defined by the user ( N>=1 ) the data themselves ( N >1 ) Sinonyms: Sinonyms: Classification without teacher Classification with teacher Unsupervised classification Supervised classificacion Number of clustersSpecials terms: [ ] is knownexactly Categorization (of documents) [x] is known approximately Diagnostics (technics, medicine) [ ] is not known => searching Recognition (technics, science)

  6. Objectives of Grouping 1. Organization (structuring) of object set Process is named data structuring Tecnologies: clustering,categorización 2. Improvement of the process of searching interesting patterns Process is named navigation Tecnology: cluster-based search 3. Groupng for other applications: - Knowledge discovering Tecnology: clustering - Summarization (generalization) of documents Tecnologies: clustering,categorización Nota: do not mix the type of groupng and its objective

  7. Example 1. Clusteringof opinionsof the experts invited by the Moscow City Government

  8. Example 2. Categorizationof letters and complaints of Moscow dwellers directed to the Major of the city

  9. Learning General books about Information Retrieval 1. Baeza-Yates, R., Ribero-Neto, B. Modern Information Retrieval. Addison Wesley, 1999 2. Manning, D.C., Schutze, H. Foundations of statistical natural language processing. MIT Press, 1999. Journals and Congresses about Information Retrieval 1. Journal “Knowledge and Information Systems”, Springer 2. KDD - Knowledge Discovery from Data Base 3. NLDB – Natural Language and Data Bases 4. XX-AI – National Congresses on Artificial Intelligence (MICAI, ….)

  10. Learning General books and articles about clustering 1. Hartigan J. , Clustering Algorithms, Wiley, 1975 <= ! 2. Kaufman L. and Rousseuw P., Finding Group in Data, Wiley, 1990 3. Gordon D. “How many clusters?”, Proc. of IFCS-96, Springer, serie “Studies in classification, data analysis…..”, 1996 Articles about clustering texts 1. Eissen S., Stein B. “Analysis of clustering algorithms for Web based search”, Springer, LNAI, N_2569 2. Stein B., Eissen S., Wibrock F. “On cluster validity and the information need of users”, Proc. 3-rd Conf. Artif. Intel. Appl., IASTED, 2003 Universal packets with algorithms of clustering 1. ClustAn (Scotland) www. clustan.com Clustan Graphics-6 (2003) 2. MatLab Descriptions are in Internet 3. Statistica Descriptions are in Internet

  11. General Scheme of Clustering Process Here: Rude and good matrixes are Matrix Obj/Atr Principal idea : To transform texts to numerical form to use matematical tools( the problem is text grouping documents but not their undestanding) Preprocessing <= Processing

  12. General Scheme of Clustering Process Preprocessing Processing <= Here: Instead of matrix Obj/Obj matrix Atr/Atr can be used

  13. Matrixes to be Considered

  14. Clustering for Categorization Colour matrix “words-words” before clustering

  15. Clustering for Categorizatión Colour matriz “words-words” after clustering

  16. Importance of Preprocessing

  17. Definitions Def. 1 “Let us V be the set of objects. Clustering C = { Ci|Ci є V }ofV is division of V on subsets, for which we have :UiCi = V and Ci ∩Cj = 0 i ≠j“ Def. 2 “Let us V be the set of nodes, E be arcs, φ is weight function that reflects the distance between objects, so we have a weighted graph G = { V,E, φ}.In this case C is named as clustering of G.” In the framework of the second definition every Ci produced subgraph G(Ci). Both subsets Ci and subgraphs G(Ci) are named clusters. Set Clique Graph

  18. Definitions Principal note Both definitions SAYS NOTHING: - about quality of clusters - about numbers of clusters Reason of difficulties Nowadays there is no any general agreement about any universal defintion of the term cluster What means that clustering is good ? 1. Closeness between objets inside clusters is essentially more than the closeness between clusters themselves 2. Constructed clusters correspond to intuitive presentations of users ( they are natural clusters)

  19. Classification of methods Based on the way of grouping 1. Hierarchical methods Any neighbors ( N is not given) 2. Methods oriented on examples K-means ( N is given) 3.Methods oriented on density MajorClust (N is calculated automatically)

  20. Based on belonging to cluster Exclusive methods Every object belongs only to one cluster. Methods are named hard grouping methods Non-exclusive methods Every object can belong to several clusters. Methods are named soft grouping methods. Classification of methods Based on data presentation Methods oriented on sets in free metric space Every object is presented as a point in a free space Methods oriented on graphs Every object is presented as an element on graph

  21. Fuzzy Grouping Hard classification Hard clustering Hard categorization Soft classification Soft clustering Soft categorization Example. The distribution of letters of Moscovites to the Government is soft categorization (numbers in the table reflect the relative weight of each theme)

  22. Hierarchical methods • Aglomerative methods (bottom=>top) • - Relations with neighbors • - Ward’s method • División methods (top => bottom) • Min cut method • Methods based on dissimilarity he I if this the his a

  23. Hierarchical aglomerative methods Neighbors 1.Nearest neighbor (single-linkage) 2. Farthest neighbor (complete linkage) 3. Averaged neighbor (average linkage) King - Unweighted Pair-Group Average UPGMA - Weighted Pair-Group Average WPGMA - Unweighted Pair-Group Centroid WPGMC <= King method Density Ward método Difference between methods Evaluation of closeness between clusters

  24. Hierarchical methods Neighbors. Every object is cluster

  25. Hierarchical methods • General algorithm • Initially every object is one • cluster • The series of steps are performed. On every step the pair of cluster to be the closest are searched. They are merged. • At the end we have one cluster. Neighbors. Every object is cluster

  26. Hierarchical methods Nearest neighbor method (NN)

  27. Hierarchical methods Farthest neighbor method

  28. Hierarchical methods Averaged neighbor (Unweighted Pair-Group Centroid WPGMC)

  29. Hierarchical methods • Averaged neighbors • Unweighted Pair-Group • Centroid WPGMC (a) • - Unweighted Pair-Group • Average UPGMA (b) • Weighted Pair-Group • Average WPGMA • Difference • Evaluation the closeness • between clusters (a) (b)

  30. Methods oriented on examples Variants of K-means methods K-means, centroid (a) K-means, medoid (b) K-means fuzzy Difference Election of centers (a) (b)

  31. Methods oriented on examples K - means, centroid

  32. Methods oriented on examples Method K-means • General algorithm • Initially K centers are selected by any random way • Series of stepsare performed. On every step the objects are distributed between centers according the criterion of the nearestcenter. Then all centers are recalculated. • The end of searching is fixed when clusters are not changed.

  33. Natural Structure of Data u Graph decomposition C=(C1, C2,….Cn) is decomposition of graph Λ(C)is weighted partial connectivity of C λ(Ci) = λiis edge connectivity of G(Ci) H(C*, E(C*)) is connectivity structure

  34. MajorClust method Principal idea Un object belongs to the cluster to which the majority of his neighbors belong Suboptimal solution Only part of neighbors are considered on every step. Density based methods

  35. Density based methods Comparason of Nearest Neighbor method and MajorClust NN-method does not change belonging of objects. MajorClust changes belonging of objects

  36. Λ- criterion and MajorClust

  37. Validity Principal question: What of the formal validiy measures reflects a user opinion by the best way? Dunn index Davies- Bouldin index etc. Dunn index (to be max)

  38. Validity and Usability Conclusion Expected density measure corresponds to F-measure reflecting expert’s opinion

  39. Meta methods (Tecnology) They construct separated data sets using criteria of optimization and limitations: Neither much nor small number of clusteres Neither large nor small size of clusters etc. Visual methods (Tecnology) They present visual images to a user in order to select manually the clusters Using different methods Comparing results Classification of Tecnologies

  40. Meta Methods Algorithm 1. Initially the number of clusters Kini and theircentersCi is given (aleatoriamente) 2. Method K-medoid (or any other one) is completed 3. If there is N> Nmax in any cluster, than this cluster is divided along with its diagonal. Go to p.2 4. If there is N < Nmin in any cluster, than the closest clusters are joined. Go to p.2 5. If there is D > Dmax in any cluster, than this cluster is divided along with its diagonal. Go to p.2 6. If there is D< Dmin in any cluster, than the closest clusters are joined. Go to p.2 7. When the number of iteration I > Imax, Stop Here: N is the number of objects in a given cluster D is the diagonal of a given cluster

  41. Technique of Visual Clustering Clustering on dendrite Clustering in space of factors

  42. Demographic Map of the World,clusters in space of factors and on dendrite

  43. Searching Leaders in Clusteres • Objectives ( Leader = Representive ) • To reduce time for elaboration of a document corpus • To improve the procedure of navigation in Internet ( new) • etc. • Typical documents • A. Typical document (the averaged one) • It reflects the principal idea of a given cluster • B. The least typical document • It is good to reach the consensus between clusters • C. The most typical document • It reflects the idea of difference between clusters

  44. Variants of Leaders Examples Typical: the most popularcostume in Chine is jeans. The most typical: the most typical costume in China is kimono The least typical: the least typical costume in China is European siut (it is the closest one to the other clusters) Typical, the most typical and the least typical documents form so called semantic sphere (boundaries)of cluster

  45. Clustering Metods of Clustering

  46. Open Problems 1. Clustering with filtering (a) Given object distribution reflects: real structure (nature) + noise (alien objects) 2. Validation of clusters (b) What is realy inside the black box? 3. Statistical clustering (c) How take into account the randomness of data? 4. Visual presentation (d) Facility for visual clustering

  47. Certain Observations The numbers of clustering methods is a little bit more than the numbers of researchers working in this area. Problem does not consist in searching the best method for all cases. Problem consists in searching the method being relevant for your data. Only you know what methods are the best for you own data. To be sure that results are really good and do not depend on the method used it is recommended to test these results using any other methods. And these methods are desirable to be antopodes. Solomon G, 1977. The most antipodes are:NN-methodand K-means Principal problem consists in choice of measure of closeness being adecuate to a given problem and given data Frecuently the results are bad because of the bad measure but not the bad method !

  48. CLUSTERING DOCUMENTS ( General information, Works of Benno Stein group ) End

More Related