Clustering



Presentation Transcript


  1. Clustering • Shallow Processing Techniques for NLP • Ling570 • November 30, 2011

  2. Roadmap • Clustering • Motivation & Applications • Clustering Approaches • Evaluation

  3. Clustering • Task: Given a set of objects, create a set of clusters over those objects • Applications:

  4. Clustering • Task: Given a set of objects, create a set of clusters over those objects • Applications: • Exploratory data analysis • Document clustering • Language modeling • Generalization for class-based LMs • Unsupervised Word Sense Disambiguation • Automatic thesaurus creation • Unsupervised Part-of-Speech Tagging • Speaker clustering, …

  5. Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering:

  6. Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering: • Category: news, sports, weather, entertainment

  7. Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering: • Category: news, sports, weather, entertainment • Genre clustering: Similar styles: blogs, tweets, newswire

  8. Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering: • Category: news, sports, weather, entertainment • Genre clustering: Similar styles: blogs, tweets, newswire • Author clustering

  9. Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering: • Category: news, sports, weather, entertainment • Genre clustering: Similar styles: blogs, tweets, newswire • Author clustering • Language ID: language clusters

  10. Example: Document Clustering • Input: Set of individual documents • Output: Sets of document clusters • Many different types of clustering: • Category: news, sports, weather, entertainment • Genre clustering: Similar styles: blogs, tweets, newswire • Author clustering • Language ID: language clusters • Topic clustering: documents on the same topic • OWS, debt supercommittee, Seattle Marathon, Black Friday, …

  11. Example: Word Clustering • Input: Words • Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats • Output: Word clusters

  12. Example: Word Clustering • Input: Words • Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats • Output: Word clusters • Example clusters:

  13. Example: Word Clustering • Input: Words • Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats • Output: Word clusters • Example clusters: (from NYT) • ballot, polls, Gov, seats • profit, finance, payments • NFL, Reds, Sox, inning, quarterback, scored, score • researchers, science • Scott, Mary, Barbara, Edward

  14. Questions • What should a cluster represent? Due to F. Xia

  15. Questions • What should a cluster represent? • Similarity among objects • How can we create clusters? Due to F. Xia

  16. Questions • What should a cluster represent? • Similarity among objects • How can we create clusters? • How can we evaluate clusters? Due to F. Xia

  17. Questions • What should a cluster represent? • Similarity among objects • How can we create clusters? • How can we evaluate clusters? • How can we improve NLP with clustering? Due to F. Xia

  18. Similarity • Between two instances

  19. Similarity • Between two instances • Between an instance and a cluster

  20. Similarity • Between two instances • Between an instance and a cluster • Between clusters

  21. Similarity Measures • Given x=(x1,x2,…,xn) and y=(y1,y2,…,yn)

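  22. Similarity Measures • Given x=(x1,x2,…,xn) and y=(y1,y2,…,yn) • Euclidean distance: $d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

  23. Similarity Measures • Given x=(x1,x2,…,xn) and y=(y1,y2,…,yn) • Euclidean distance: $d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ • Manhattan distance: $d(x,y) = \sum_{i=1}^{n} |x_i - y_i|$

  24. Similarity Measures • Given x=(x1,x2,…,xn) and y=(y1,y2,…,yn) • Euclidean distance: $d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ • Manhattan distance: $d(x,y) = \sum_{i=1}^{n} |x_i - y_i|$ • Cosine similarity: $\cos(x,y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$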
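
The three measures named on this slide can be sketched in plain Python using their standard textbook definitions. The function names are illustrative choices, not part of the original slides, and the vectors are assumed to be equal-length sequences of numbers (with non-zero norm for the cosine case).

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan_distance(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Dot product divided by the product of the vector lengths.

    Assumes neither vector is all zeros (otherwise the norm is 0).
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

origin = (0.0, 0.0)
point = (3.0, 4.0)
print(euclidean_distance(origin, point))  # 5.0
print(manhattan_distance(origin, point))  # 7.0
```

Note that the first two are distances (smaller means more similar) while cosine is a similarity (larger means more similar), so a clustering routine has to know which direction it is optimizing.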

  25. Clustering Algorithms

  26. Types of Clustering • Flat vs Hierarchical Clustering: • Flat: partition data into k clusters

  27. Types of Clustering • Flat vs Hierarchical Clustering: • Flat: partition data into k clusters • Hierarchical: Nodes form hierarchy

  28. Types of Clustering • Flat vs Hierarchical Clustering: • Flat: partition data into k clusters • Hierarchical: Nodes form hierarchy • Hard vs Soft Clustering • Hard: Each object assigned to exactly one cluster

  29. Types of Clustering • Flat vs Hierarchical Clustering: • Flat: partition data into k clusters • Hierarchical: Nodes form hierarchy • Hard vs Soft Clustering • Hard: Each object assigned to exactly one cluster • Soft: Allows degrees of membership and membership in more than one cluster • Often probability distribution over cluster membership
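
The hard/soft distinction can be illustrated with a small sketch. It assumes each object already has a non-negative score per cluster; the scores, the cluster names, and both function names are hypothetical. Soft assignment normalizes the scores into a probability distribution over clusters, while hard assignment keeps only the single best cluster.

```python
def soft_assignment(scores):
    """Normalize non-negative cluster scores into a probability distribution."""
    total = sum(scores.values())
    return {cluster: score / total for cluster, score in scores.items()}

def hard_assignment(scores):
    """Assign the object to exactly one cluster: the highest-scoring one."""
    return max(scores, key=scores.get)

# Hypothetical scores for one document against two clusters.
scores = {"sports": 3.0, "politics": 1.0}
print(soft_assignment(scores))  # {'sports': 0.75, 'politics': 0.25}
print(hard_assignment(scores))  # sports
```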

  30. Hierarchical Clustering

  31. Hierarchical Vs. Flat • Hierarchical clustering:

  32. Hierarchical Vs. Flat • Hierarchical clustering: • More informative • Good for data exploration • Many algorithms, none good for all data • Computationally expensive

  33. Hierarchical Vs. Flat • Hierarchical clustering: • More informative • Good for data exploration • Many algorithms, none good for all data • Computationally expensive • Flat clustering:

  34. Hierarchical Vs. Flat • Hierarchical clustering: • More informative • Good for data exploration • Many algorithms, none good for all data • Computationally expensive • Flat clustering: • Fairly efficient • Simple baseline algorithm: K-means • Probabilistic models use EM algorithm

  35. Clustering Algorithms • Flat clustering: • K-means clustering • K-medoids clustering • Hierarchical clustering: • Greedy, bottom-up clustering

  36. K-Means Clustering • Initialize: • Randomly select k initial centroids

  37. K-Means Clustering • Initialize: • Randomly select k initial centroids • Center (mean) of cluster • Iterate until clusters stop changing

  38. K-Means Clustering • Initialize: • Randomly select k initial centroids • Center (mean) of cluster • Iterate until clusters stop changing • Assign each instance to the nearest cluster • Cluster is nearest if cluster centroid is nearest

  39. K-Means Clustering • Initialize: • Randomly select k initial centroids • Center (mean) of cluster • Iterate until clusters stop changing • Assign each instance to the nearest cluster • Cluster is nearest if cluster centroid is nearest • Recompute cluster centroids • Mean of instances in the cluster
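
The initialize/assign/recompute loop above can be sketched in plain Python as follows. This is a minimal illustration, not the course's reference implementation: it assumes points are equal-length tuples of numbers, uses squared Euclidean distance, and adds a `seed`, a `max_iters` cap, and an empty-cluster rule that the slides do not mention.

```python
import random

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iters=100):
    rng = random.Random(seed)
    # Initialize: randomly select k instances as the initial centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iters):
        # Assign each instance to the cluster whose centroid is nearest.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once the clusters no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of the instances in its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = mean(members)
    return assignment, centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, centroids = kmeans(points, k=2)
print(sorted(centroids))  # [(0.0, 0.5), (10.0, 10.5)]
```

On data this well separated the result does not depend on the random start; in general, different seeds can lead to different local optima.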

  40. K-Means: 1 step

  41. K-Means • Running time:

  42. K-Means • Running time: • O(kn) distance computations per iteration, where n is the number of instances and k is the number of clusters • Converges in finite number of steps • Issues:

  43. K-Means • Running time: • O(kn) distance computations per iteration, where n is the number of instances and k is the number of clusters • Converges in finite number of steps • Issues: • Need to pick # clusters k • Can find only local optimum • Sensitive to outliers • Requires Euclidean distance: • What about enumerable classes (e.g. colors)?

  44. Medoid • Medoid: Element in cluster with highest average similarity to other elements in cluster

  45. Medoid • Medoid: Element in cluster with highest average similarity to other elements in cluster • Finding the medoid: • For each element p in cluster C compute: $f(p) = \sum_{p' \in C} \mathrm{sim}(p, p')$

  46. Medoid • Medoid: Element in cluster with highest average similarity to other elements in cluster • Finding the medoid: • For each element p in cluster C compute: $f(p) = \sum_{p' \in C} \mathrm{sim}(p, p')$ • Select the element with highest f(p)
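
Finding the medoid can be sketched as below, taking f(p) to be the summed similarity between p and the other elements of the cluster, which ranks elements the same way as the average. The function names and the example similarity (negative absolute difference on numbers) are assumptions made for illustration.

```python
def total_similarity(i, cluster, sim):
    """f(p) for the element at index i: summed similarity to all other elements."""
    return sum(sim(cluster[i], q) for j, q in enumerate(cluster) if j != i)

def find_medoid(cluster, sim):
    """Return the element of the cluster with the highest f(p)."""
    best = max(range(len(cluster)), key=lambda i: total_similarity(i, cluster, sim))
    return cluster[best]

def neg_abs_diff(a, b):
    # Hypothetical similarity: closer numbers are more similar.
    return -abs(a - b)

cluster = [1.0, 2.0, 3.0, 4.0, 100.0]
print(find_medoid(cluster, neg_abs_diff))  # 3.0
print(sum(cluster) / len(cluster))         # 22.0
```

The outlier 100.0 pulls the mean to 22.0, while the medoid stays at 3.0, an actual member of the cluster. The medoid also needs only a pairwise similarity, so it remains defined for objects that cannot be averaged.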

  47. K-Medoids • Initialize: • Select k instances at random as medoids

  48. K-Medoids • Initialize: • Select k instances at random as medoids • Iterate until no changes • Assign instances to cluster with nearest medoid

  49. K-Medoids • Initialize: • Select k instances at random as medoids • Iterate until no changes • Assign instances to cluster with nearest medoid • Recompute medoid for each cluster
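
A minimal sketch of the K-medoids loop described above follows. It is written in terms of a distance function rather than a similarity, so "nearest" means smallest distance and the medoid is the member with the smallest total distance to its cluster. The `seed` and `max_iters` parameters and the empty-cluster rule are assumptions added for the sketch.

```python
import random

def k_medoids(points, k, dist, seed=0, max_iters=100):
    rng = random.Random(seed)
    # Initialize: select k instances at random as medoids.
    medoids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # Assign each instance to the cluster with the nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, medoids[c]))
            clusters[nearest].append(p)
        # Recompute the medoid of each cluster: the member with the
        # smallest total distance to the other members.
        new_medoids = [
            min(members, key=lambda m: sum(dist(m, q) for q in members))
            if members else medoids[c]  # assumption: empty cluster keeps its medoid
            for c, members in enumerate(clusters)
        ]
        # Iterate until no changes.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
medoids, clusters = k_medoids(points, k=2, dist=lambda a, b: abs(a - b))
print(sorted(medoids))   # [2.0, 11.0]
print(sorted(clusters))  # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

Because only pairwise distances are used, `dist` can be any function over any kind of object, including categorical values for which a mean is undefined.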

  50. Greedy, Bottom-Up Hierarchical Clustering • Initialize: • Make an individual cluster for each instance
