
Efficient Clustering of High-Dimensional Data Sets


Presentation Transcript


  1. Efficient Clustering of High-Dimensional Data Sets Andrew McCallum WhizBang! Labs & CMU Kamal Nigam WhizBang! Labs Lyle Ungar UPenn

  2. Large Clustering Problems • Many examples • Many clusters • Many dimensions Example Domains • Text • Images • Protein Structure

  3. The Citation Clustering Data • Over 1,000,000 citations • About 100,000 unique papers • About 100,000 unique vocabulary words • Over 1 trillion distance calculations

  4. Reduce number of distance calculations • [Bradley, Fayyad, Reina KDD-98] • Sample to find initial starting points for k-means or EM • [Moore 98] • Use multi-resolution kd-trees to group similar data points • [Omohundro 89] • Balltrees

  5. The Canopies Approach • Two distance metrics: cheap & expensive • First Pass • very inexpensive distance metric • create overlapping canopies • Second Pass • expensive, accurate distance metric • canopies determine which distances calculated

  6. Illustrating Canopies

  7. Overlapping Canopies

  8. Creating canopies with two thresholds • Put all points in D • Loop: • Pick a point X from D • Put points within K_loose of X in a canopy • Remove points within K_tight of X from D
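A minimal Python sketch of this loop, assuming a cheap distance function and thresholds k_tight < k_loose (all names here are illustrative, not from the paper):

```python
def make_canopies(points, cheap_dist, k_loose, k_tight):
    """Two-threshold canopy creation, following the loop on this slide.

    Returns a list of canopies (lists of point indices); canopies may overlap.
    """
    remaining = set(range(len(points)))          # D: candidate canopy centers
    canopies = []
    while remaining:
        center = remaining.pop()                 # pick a point X from D
        dists = {i: cheap_dist(points[center], points[i]) for i in range(len(points))}
        canopy = [i for i, d in dists.items() if d < k_loose]       # within K_loose -> canopy
        canopies.append(canopy)
        remaining -= {i for i, d in dists.items() if d < k_tight}   # within K_tight leave D
    return canopies
```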

  9. Canopies • Two distance metrics • cheap and approximate • expensive and accurate • Two-pass clustering • create overlapping canopies • full clustering with limited distances • Canopy property • points in same cluster will be in same canopy

  10. Using canopies with GAC (greedy agglomerative clustering) • Calculate expensive distances between points in the same canopy • All other distances default to infinity • Sort finite distances and iteratively merge closest
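A sketch of how the canopies gate the expensive metric: only pairs that share a canopy ever receive a finite distance, and those are sorted so the greedy merging can proceed (the merging step itself is not shown; function and variable names are assumptions, not the authors' code):

```python
from itertools import combinations

def canopy_restricted_distances(points, canopies, expensive_dist):
    """Compute expensive distances only for pairs that share a canopy.

    Pairs that never co-occur in a canopy are implicitly at infinite distance.
    Returns (distance, i, j) triples sorted so GAC can merge the closest pairs first.
    """
    finite = {}
    for canopy in canopies:
        for i, j in combinations(sorted(canopy), 2):
            if (i, j) not in finite:             # a pair may share several canopies
                finite[(i, j)] = expensive_dist(points[i], points[j])
    return sorted((d, i, j) for (i, j), d in finite.items())
```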

  11. Computational Savings • inexpensive metric << expensive metric • number of canopies: c (large) • canopies overlap: each point in f canopies • roughly f·n/c points per canopy • O(f²n²/c) expensive distance calculations • complexity reduction: O(f²/c) • n = 10⁶; k = 10⁴; c = 1000; f small: computation reduced by a factor of ~1000
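Spelling out the counting behind these bullets: with c canopies of roughly fn/c points each, the within-canopy work is

c \cdot \left( \frac{fn}{c} \right)^{2} \;=\; \frac{f^{2} n^{2}}{c}

expensive distance calculations, a fraction f²/c of the n² an all-pairs comparison would need; with c = 1000 and f close to 1, that is roughly a 1000-fold reduction.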

  12. Experimental Results

      Method         F1     Minutes
      Canopies GAC   0.838     7.65
      Complete GAC   0.835   134.09

  13. Preserving Good Clustering • Small, disjoint canopies → big time savings • Large, overlapping canopies → original accurate clustering • Goal: fast and accurate • requires a good, cheap distance metric

  14. Reduced Dimension Representations

  15. Clustering finds groups of similar objects • Understanding clusters can be difficult • Important to understand/interpret results • Patterns waiting to be discovered

  16. A picture is worth 1000 clusters

  17. Feature Subset Selection • Find n features that work best for prediction • Find n features such that distance on them best correlates with distance on all features • Minimize:
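The objective on this slide did not survive the transcript; a plausible reconstruction (an assumption, not the slide's exact formula) is to choose the feature subset S, |S| = n, minimizing

\sum_{i<j} \bigl( d_S(x_i, x_j) - d(x_i, x_j) \bigr)^{2}

where d_S measures distance using only the features in S and d uses all features.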

  18. Feature Subset Selection • Suppose all features are relevant • Does that mean dimensionality can’t be reduced? • No! • The manifold in feature space is what counts, not the relevance of individual features • The manifold can be lower-dimensional than the feature space

  19. PCA: Principal Component Analysis • Given data in d dimensions • Compute: • d-dim mean vector M • d×d covariance matrix C • eigenvectors and eigenvalues • Sort by eigenvalues • Select top k < d eigenvalues • Project data onto the k corresponding eigenvectors

  20. PCA Mean vector M:
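The formula itself is missing from the transcript; for data points x_1, …, x_n in d dimensions, the sample mean is

M = \frac{1}{n} \sum_{i=1}^{n} x_i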

  21. PCA Covariance C:
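Likewise, the covariance matrix (written here with the 1/n convention; the slide may use 1/(n−1)):

C = \frac{1}{n} \sum_{i=1}^{n} (x_i - M)(x_i - M)^{\mathsf{T}}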

  22. PCA • Eigenvectors • Unit vectors in directions of maximum variance • Eigenvalues • Magnitude of the variance in the direction of each eigenvector

  23. PCA • Find the largest eigenvalues and their corresponding eigenvectors • Project each point onto the k principal components • where A is a d × k matrix whose columns are the k principal components
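Slides 19–23 condensed into a few lines of NumPy (a sketch, not the original course code; X and k are assumed inputs):

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (n x d) onto the top-k principal components."""
    M = X.mean(axis=0)                        # d-dim mean vector
    C = np.cov(X - M, rowvar=False)           # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh since C is symmetric
    order = np.argsort(eigvals)[::-1][:k]     # sort by eigenvalue, keep the top k
    A = eigvecs[:, order]                     # d x k matrix of principal components
    return (X - M) @ A                        # n x k projected data

# Example: reduce 50-dimensional points to 2 dimensions
Y = pca_project(np.random.randn(200, 50), k=2)
```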

  24. PCA via Autoencoder ANN

  25. Non-Linear PCA by Autoencoder

  26. PCA • need vector representation • 0-d: sample mean • 1-d: y = mx + b • 2-d: y₁ = mx + b; y₂ = m′x + b′

  27. MDS: Multidimensional Scaling • PCA requires a vector representation • What if we are given only pairwise distances between n points? • Find coordinates for the points in d-dimensional space s.t. distances are preserved “best”

  28. MDS • Assign points to coordinates x_i in d-dim space • random coordinate values • principal components • dimensions with greatest variance • Do gradient descent on the coordinates x_i of each point i until the distortion is minimized

  29. Distortion

  30. Distortion

  31. Distortion
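The distortion measures on slides 29–31 are lost in this transcript; a standard choice (an assumption, the slides may show several variants) is the squared stress between the given dissimilarities δ_ij and the embedded distances:

\mathrm{Distortion} \;=\; \sum_{i<j} \bigl( \lVert x_i - x_j \rVert - \delta_{ij} \bigr)^{2}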

  32. Gradient Descent on Coordinates
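A minimal NumPy sketch of gradient descent on the coordinates under the squared-stress distortion above (the learning rate and iteration count are arbitrary choices, not from the slides):

```python
import numpy as np

def mds_gradient_descent(delta, d=2, lr=0.01, iters=2000, seed=0):
    """Place n points in d dimensions so pairwise distances approximate delta (n x n)."""
    rng = np.random.default_rng(seed)
    n = delta.shape[0]
    X = rng.normal(size=(n, d))                 # random initial coordinates
    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]    # pairwise coordinate differences
        dist = np.linalg.norm(diff, axis=-1)    # current embedded distances
        np.fill_diagonal(dist, 1.0)             # avoid division by zero on the diagonal
        # gradient of sum_{i<j} (dist_ij - delta_ij)^2 with respect to each x_i
        grad = (2 * (dist - delta) / dist)[:, :, None] * diff
        X -= lr * grad.sum(axis=1)
    return X
```

Fed a matrix of subjective country dissimilarities like the one behind the next slide, this kind of procedure yields the 2-D maps MDS is typically used to draw.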

  33. Subjective Distances • Brazil • USA • Egypt • Congo • Russia • France • Cuba • Yugoslavia • Israel • China

  34. How Many Dimensions? • D too large • perfect fit, no distortion • not easy to understand/visualize • D too small • poor fit, much distortion • easy to visualize, but the pattern may be misleading • D just right?

  35. Agglomerative Clustering of Proteins
