
SET (8)



Presentation Transcript


  1. SET(8) Prof. Dragomir R. Radev radev@umich.edu

  2. SET Fall 2013 … 13. Clustering …

  3. Clustering • Exclusive/overlapping clusters • Hierarchical/flat clusters • The cluster hypothesis • Documents in the same cluster are relevant to the same query • How do we use it in practice?

  4. Representations for document clustering • Typically: vector-based • Words: “cat”, “dog”, etc. • Features: document length, author name, etc. • Each document is represented as a vector in an n-dimensional space • Similar documents appear nearby in the vector space (distance measures are needed)
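As a concrete illustration of the vector representation, the sketch below builds simple term-frequency vectors for three short documents and compares them with cosine similarity. The documents, vocabulary, and function names are invented for this example; they are not from the slides.

```python
# Minimal bag-of-words vectors plus cosine similarity (illustrative sketch).
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stock prices fell sharply today"]

# Build a shared vocabulary so every document maps into the same n-dimensional space.
vocab = sorted({w for d in docs for w in d.split()})

def to_vector(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]          # term-frequency vector

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = [to_vector(d) for d in docs]
print(cosine(vectors[0], vectors[1]))  # the two "cat" documents: relatively similar
print(cosine(vectors[0], vectors[2]))  # unrelated topics: similarity 0.0
```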

  5. Scatter-gather • Introduced by Cutting, Karger, Pedersen, and Tukey • Iterative process • Show terms for each cluster • User picks some of them • System produces new clusters • Example: • http://web.archive.org/web/20070614084152/http://www.ischool.berkeley.edu/~hearst/images/sg-example1.html

  6. k-means • Iteratively determine which cluster each point belongs to, then adjust the cluster centroids, and repeat • Needed: a small number k of desired clusters • Makes hard (exclusive) assignments • Example: Weka

  7. k-means
  initialize cluster centroids to arbitrary vectors
  while further improvement is possible do
    for each document d do
      find the cluster c whose centroid is closest to d
      assign d to cluster c
    end for
    for each cluster c do
      recompute the centroid of cluster c based on its documents
    end for
  end while
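A compact, runnable sketch of the same loop, assuming squared Euclidean distance and randomly chosen initial centroids (both choices are illustrative, not prescribed by the slide):

```python
# Basic k-means on numeric vectors (illustrative sketch of the pseudocode above).
import random

def kmeans(points, k, iterations=100, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)            # arbitrary initial centroids
    for _ in range(iterations):                     # "while further improvement is possible"
        clusters = [[] for _ in range(k)]
        for p in points:                            # assign each point to its nearest centroid
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        new_centroids = []
        for c, members in zip(centroids, clusters): # recompute each centroid from its members
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(sum(m[i] for m in members) / len(members)
                                           for i in range(dim)))
            else:
                new_centroids.append(c)             # keep an empty cluster's old centroid
        if new_centroids == centroids:              # no change: converged
            break
        centroids = new_centroids
    return centroids, clusters
```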

  8. K-means (cont’d) • In practice (to avoid suboptimal clusters), run hierarchical agglomerative clustering on a sample of size sqrt(N) and then use the resulting clusters as seeds for k-means.

  9. Example • Cluster the following vectors into two groups: • A = <1,6> • B = <2,2> • C = <4,0> • D = <3,3> • E = <2,5> • F = <2,1>
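Applying the k-means sketch above to the six example vectors (with k = 2; assumes the kmeans function defined earlier is in scope, and the exact grouping depends on initialization):

```python
points = [(1, 6), (2, 2), (4, 0), (3, 3), (2, 5), (2, 1)]   # A..F from the slide
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
# One plausible outcome: {A, E} (large y values) versus {B, C, D, F}.
```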

  10. Weka • A general environment for machine learning (e.g. for classification and clustering) • Book by Witten and Frank • www.cs.waikato.ac.nz/ml/weka • cd /data2/tools/weka-3-4-7 • export CLASSPATH=$CLASSPATH:./weka.jar • java weka.clusterers.SimpleKMeans -t ~/e.arff • java weka.clusterers.SimpleKMeans -p 1-2 -t ~/e.arff

  11. Demos • http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html • http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means • http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster • http://www.cc.gatech.edu/~dellaert/FrankDellaert/Software.html • http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf • http://web.archive.org/web/20110223234358/http://www.ece.neu.edu/groups/rpl/projects/kmeans/

  12. Probability and likelihood • Example: what is J in this case?

  13. Bayesian formulation • Posterior ∝ likelihood × prior
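In symbols (standard Bayes rule; the symbols θ for the parameters and D for the data are my notation, not the slide's):

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)} \;\propto\; P(D \mid \theta)\, P(\theta)
```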

  14. E-M algorithms • Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. Given a model of data generation and data with some missing values, EM alternately uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the generative parameters giving estimates for the missing values. [Dempster et al. 77] [McCallum & Nigam 98]

  15. E-M algorithm • Initialize the probability model • Repeat • E-step: use the best available current classifier to classify some data points • M-step: modify the classifier based on the classes produced by the E-step • Until convergence • EM is a soft clustering method
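A minimal sketch of EM for a two-component 1-D Gaussian mixture with soft (fractional) cluster assignments. The data points, fixed unit variances, and initial means are assumptions made for the example, not taken from the slides.

```python
# EM for a mixture of two 1-D Gaussians with fixed unit variance (illustrative sketch).
import math

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]        # made-up points with two obvious groups
means = [0.0, 6.0]                            # assumed initial component means
weights = [0.5, 0.5]                          # mixing proportions

def gauss(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

for _ in range(50):
    # E-step: soft responsibilities of each component for each point
    resp = []
    for x in data:
        p = [w * gauss(x, m) for w, m in zip(weights, means)]
        s = sum(p)
        resp.append([pi / s for pi in p])
    # M-step: re-estimate means and mixing weights from the responsibilities
    for k in range(2):
        nk = sum(r[k] for r in resp)
        means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        weights[k] = nk / len(data)

print(means)    # ends up near the two group centres (about 1.0 and 5.0)
```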

  16.-19. EM example (figures from Chris Bishop)

  20. Demos • http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/mixture.html • http://lcn.epfl.ch/tutorial/english/gaussian/html/ • http://www.cs.cmu.edu/~alad/em/ • http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html • http://people.csail.mit.edu/mcollins/papers/wpeII.4.ps

  21. “Online” centroid method

  22. Centroid method

  23. Online centroid-based clustering • If a new document's similarity to the closest centroid is sim ≥ T, add it to that cluster • If sim < T, start a new cluster
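A sketch of that single-pass rule, assuming cosine similarity and a hypothetical threshold T; everything beyond the sim ≥ T / sim < T decision is an assumption made for the example.

```python
# Single-pass ("online") centroid clustering with a similarity threshold T (sketch).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def online_cluster(vectors, T=0.8):
    clusters = []                       # each cluster: {"centroid": [...], "members": [...]}
    for v in vectors:
        best, best_sim = None, -1.0
        for c in clusters:              # find the most similar existing centroid
            s = cosine(v, c["centroid"])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= T:          # sim >= T: join the closest cluster
            best["members"].append(v)
            n = len(best["members"])
            best["centroid"] = [sum(m[i] for m in best["members"]) / n
                                for i in range(len(v))]
        else:                                           # sim < T: start a new cluster
            clusters.append({"centroid": list(v), "members": [v]})
    return clusters
```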

  24. Sample centroids

  25. Evaluation of clustering • Formal definition • Objective function • Purity (considering the majority class in each cluster)
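A small worked sketch of purity: each cluster is scored by its majority class and those counts are summed over all clusters. The cluster assignments and class labels below are invented for the example.

```python
# Purity: fraction of documents assigned to the majority class of their cluster (sketch).
from collections import Counter

# Hypothetical gold class labels of the documents in each of three clusters.
clusters = [["A", "A", "A", "B"],      # majority class A (3 of 4)
            ["B", "B", "A"],           # majority class B (2 of 3)
            ["C", "C", "C", "A", "B"]] # majority class C (3 of 5)

total = sum(len(c) for c in clusters)
majority_sum = sum(Counter(c).most_common(1)[0][1] for c in clusters)
print(majority_sum / total)            # (3 + 2 + 3) / 12 = 0.666...
```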

  26. RAND index • Pairwise accuracy: how often the clustering and the gold standard agree on whether two objects belong together • RI = (TP + TN) / (TP + FP + FN + TN) • In the example:

  27. RAND index RI = 0.68
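A sketch of the pair counting behind RI: for every pair of items, check whether the clustering and the gold standard agree on putting them together or apart. The two labelings in the usage line are invented; they are not the example from the slides.

```python
# Rand index: agreement between two labelings, counted over all item pairs (sketch).
from itertools import combinations

def rand_index(gold, predicted):
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = predicted[i] == predicted[j]
        if same_gold and same_pred:        # together in both: TP
            tp += 1
        elif not same_gold and same_pred:  # together only in the clustering: FP
            fp += 1
        elif same_gold and not same_pred:  # together only in the gold standard: FN
            fn += 1
        else:                              # apart in both: TN
            tn += 1
    return (tp + tn) / (tp + fp + fn + tn)

print(rand_index([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))   # 0.8 for these made-up labelings
```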

  28. Hierarchical clustering methods • Single-linkage: one close pair is sufficient to merge two clusters • Disadvantage: long, straggly chains • Complete-linkage: the farthest pair decides, so all pairs have to be close • Disadvantage: too conservative • Average-linkage • Demo
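A small sketch of agglomerative clustering in which only the linkage function differs between the single- and complete-linkage variants; the distance measure, the data, and stopping at two clusters are choices made for the example.

```python
# Agglomerative clustering with a pluggable linkage function (illustrative sketch).
import math

def single_link(c1, c2):
    return min(math.dist(p, q) for p in c1 for q in c2)    # closest pair decides

def complete_link(c1, c2):
    return max(math.dist(p, q) for p in c1 for q in c2)    # farthest pair decides

def agglomerate(points, linkage, target=2):
    clusters = [[p] for p in points]                        # start with singleton clusters
    while len(clusters) > target:
        # find and merge the two closest clusters under the chosen linkage
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(1, 6), (2, 2), (4, 0), (3, 3), (2, 5), (2, 1)]      # the A..F vectors from slide 9
print(agglomerate(pts, single_link))
print(agglomerate(pts, complete_link))
```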

  29. Non-hierarchical methods • Also known as flat clustering • Centroid method (online) • K-means • Expectation maximization

  30. Hierarchical clustering (figure: dendrogram over points 1-8 omitted) • Single link produces straggly clusters (e.g., ((1 2) (5 6)))

  31. Hierarchical agglomerative clustering: dendrograms • E.g., language similarity: http://odur.let.rug.nl/~kleiweg/clustering/clustering.html • /data2/tools/clustering

  32. Clustering using dendrograms • Example: cluster the following sentences: A B C B A A D C C A D E C D E F C D A E F G F D A A C D A B A • REPEAT: compute pairwise similarities; identify the closest pair; merge the pair into a single node • UNTIL only one node is left • Q: what is the equivalent Venn diagram representation?
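The REPEAT/UNTIL loop above is exactly what standard hierarchical clustering libraries implement. The sketch below draws a dendrogram for the slide-9 vectors using scipy; the choice of data, single linkage, and the matplotlib display are assumptions for the example.

```python
# Dendrogram via scipy's agglomerative clustering (illustrative sketch).
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

points = [(1, 6), (2, 2), (4, 0), (3, 3), (2, 5), (2, 1)]   # A..F from slide 9
Z = linkage(points, method="single")      # repeatedly merge the closest pair of clusters
dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.show()
```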
