Efficient Clustering of High-Dimensional Data Sets
Andrew McCallum (WhizBang! Labs & CMU) • Kamal Nigam (WhizBang! Labs) • Lyle Ungar (UPenn)
Large Clustering Problems
• Many examples
• Many clusters
• Many dimensions

Example Domains
• Text
• Images
• Protein structure
The Citation Clustering Data
• Over 1,000,000 citations
• About 100,000 unique papers
• About 100,000 unique vocabulary words
• Over 1 trillion distance calculations
Reduce Number of Distance Calculations
• [Bradley, Fayyad, Reina, KDD-98]: sample to find initial starting points for k-means or EM
• [Moore 98]: use multi-resolution kd-trees to group similar data points
• [Omohundro 89]: balltrees
The Canopies Approach
• Two distance metrics: cheap & expensive
• First pass
  • very inexpensive distance metric
  • create overlapping canopies
• Second pass
  • expensive, accurate distance metric
  • canopies determine which distances are calculated
Creating Canopies with Two Thresholds
• Put all points in D
• Loop until D is empty:
  • Pick a point X from D
  • Put all points within distance K_loose of X into a new canopy
  • Remove all points within distance K_tight of X from D
(A minimal code sketch of this loop follows.)
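A minimal sketch of the canopy-creation loop, assuming a caller-supplied cheap_distance function and thresholds with k_loose > k_tight (the function name and index-based representation are choices of this sketch, not from the slides):

```python
import random

def create_canopies(points, cheap_distance, k_loose, k_tight):
    """Greedy canopy creation with two thresholds (requires k_loose > k_tight).

    Returns a list of canopies, each a list of indices into `points`.
    Canopies may overlap: a point within k_loose but outside k_tight of a
    center stays in D and can join later canopies as well.
    """
    remaining = set(range(len(points)))           # the set D from the slide
    canopies = []
    while remaining:
        center = random.choice(tuple(remaining))  # pick a point X from D
        # All points cheaply close to X join this canopy.
        canopy = [i for i in remaining
                  if cheap_distance(points[center], points[i]) <= k_loose]
        canopies.append(canopy)
        # Remove points within k_tight of X from D (X itself has distance
        # zero, so it is always removed and the loop terminates).
        remaining -= {i for i in canopy
                      if cheap_distance(points[center], points[i]) <= k_tight}
    return canopies
```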
Canopies
• Two distance metrics
  • cheap and approximate
  • expensive and accurate
• Two-pass clustering
  • create overlapping canopies
  • full clustering with limited distances
• Canopy property
  • points in the same cluster will be in the same canopy
Using Canopies with GAC (Greedy Agglomerative Clustering)
• Calculate expensive distances only between points in the same canopy
• All other distances default to infinity
• Sort the finite distances and iteratively merge the closest pair
(A sketch of the distance computation appears below.)
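A sketch of the sparse distance computation, assuming canopies are lists of point indices as produced by the create_canopies sketch above (canopy_distances is a hypothetical helper name):

```python
import itertools

def canopy_distances(canopies, points, expensive_distance):
    """Expensive distances only for pairs that share at least one canopy.

    Returns {(i, j): distance} with i < j; any pair absent from the dict
    is treated as infinitely far apart by the clusterer.
    """
    distances = {}
    for canopy in canopies:
        for i, j in itertools.combinations(sorted(canopy), 2):
            if (i, j) not in distances:   # overlapping canopies share pairs
                distances[(i, j)] = expensive_distance(points[i], points[j])
    return distances
```

Greedy agglomerative clustering then sorts distances.items() by value and merges the closest remaining pair first.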
Computational Savings
• inexpensive metric << expensive metric
• number of canopies: c (large)
• canopies overlap: each point is in f canopies
• roughly f·n/c points per canopy
• O(f²·n²/c) expensive distance calculations
• complexity reduction: O(f²/c)
• n = 10⁶; k = 10⁴; c = 1000; f small: computation reduced by a factor of 1000
(A short derivation of these counts follows.)
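The per-canopy arithmetic behind these counts, as a short worked derivation:

```latex
% c canopies, each holding roughly fn/c points; summing the pairwise
% comparisons within each canopy:
\[
  c \cdot \binom{fn/c}{2}
  \;\approx\; \frac{c}{2}\left(\frac{fn}{c}\right)^{2}
  \;=\; \frac{f^{2} n^{2}}{2c}
  \;=\; O\!\left(\frac{f^{2} n^{2}}{c}\right)
\]
% Relative to the O(n^2) all-pairs cost this is a factor of f^2/c;
% with c = 1000 and f close to 1, roughly a 1000x saving.
```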
Experimental Results

               F1      Minutes
Canopies GAC   0.838   7.65
Complete GAC   0.835   134.09
Preserving Good Clustering
• Small, disjoint canopies → big time savings
• Large, overlapping canopies → the original, accurate clustering
• Goal: fast and accurate
  • requires a good, cheap distance metric
Clustering Finds Groups of Similar Objects
• Understanding clusters can be difficult
• It is important to understand/interpret the results
• Patterns are waiting to be discovered
Feature Subset Selection
• Find the n features that work best for prediction
• Find n features such that distance on them best correlates with distance on all features
• Minimize the distortion between the two distances (a candidate objective is sketched below)
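The objective on this slide did not survive extraction; a plausible form of the quantity to minimize, assuming D is the distance over all features and D_S the distance over the selected subset S, would be:

```latex
% Assumed reconstruction: squared distortion between full-feature
% distance D and subset distance D_S, summed over all point pairs.
\[
  \min_{S,\;|S| = n} \;\sum_{i < j}
    \bigl( D(x_i, x_j) - D_S(x_i, x_j) \bigr)^{2}
\]
```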
Feature Subset Selection
• Suppose all features are relevant
• Does that mean dimensionality can't be reduced?
• No!
• The manifold the data lies on in feature space is what counts, not the relevance of individual features
• The manifold can have lower dimension than the feature space
PCA: Principal Component Analysis
• Given data in d dimensions
• Compute:
  • the d-dimensional mean vector M
  • the d×d covariance matrix C
  • the eigenvectors and eigenvalues of C
• Sort by eigenvalue
• Select the top k < d eigenvalues
• Project the data onto the corresponding k eigenvectors
PCA
Mean vector M:
  M = (1/n) · Σᵢ xᵢ
PCA
Covariance C:
  C = (1/n) · Σᵢ (xᵢ − M)(xᵢ − M)ᵀ
PCA
• Eigenvectors: unit vectors in the directions of maximum variance
• Eigenvalues: the magnitude of the variance in the direction of each eigenvector
PCA
• Find the k largest eigenvalues and their corresponding eigenvectors
• Project each point x onto the k principal components: y = Aᵀ(x − M)
• where A is the d×k matrix whose columns are the k principal components
(A numpy sketch of the whole pipeline follows.)
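A minimal numpy sketch of the steps above (the function name and return signature are choices of this sketch, not from the slides):

```python
import numpy as np

def pca(X, k):
    """Project n points in d dimensions onto the top-k principal components.

    X: (n, d) data matrix. Returns the (n, k) projected coordinates,
    the mean vector M, and the (d, k) projection matrix A.
    """
    M = X.mean(axis=0)                       # d-dimensional mean vector
    centered = X - M
    C = centered.T @ centered / len(X)       # d x d covariance (1/n normalization)
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]        # sort by descending eigenvalue
    A = eigvecs[:, order[:k]]                # columns = top-k eigenvectors
    return centered @ A, M, A
```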
PCA
• Needs a vector representation of the data
• 0-d representation: the sample mean
• 1-d representation: a line, y = mx + b
• 2-d representation: y₁ = mx + b; y₂ = m′x + b′
MDS: Multidimensional Scaling
• PCA requires a vector representation
• What if we are given only pairwise distances between n points?
• Find coordinates for the points in d-dimensional space such that the distances are preserved "best"
MDS
• Assign each point i coordinates xᵢ in d-dimensional space, initialized from:
  • random coordinate values, or
  • the principal components, or
  • the dimensions with greatest variance
• Do gradient descent on the coordinates xᵢ until the distortion is minimized
(A minimal sketch of the descent step follows.)
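A minimal sketch of the gradient descent, assuming a precomputed symmetric target distance matrix D; the raw-stress objective, learning rate, and step count are assumptions of this sketch, not from the slides:

```python
import numpy as np

def mds(D, d=2, lr=0.01, steps=2000, seed=0):
    """Gradient descent on point coordinates to match a distance matrix.

    D: (n, n) symmetric matrix of target pairwise distances.
    Minimizes the raw stress: sum over pairs of (||x_i - x_j|| - D_ij)^2.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(size=(n, d))               # random initial coordinates
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]  # (n, n, d) pairwise deltas
        dist = np.linalg.norm(diff, axis=-1)  # (n, n) current distances
        np.fill_diagonal(dist, 1.0)           # avoid division by zero
        coeff = 2.0 * (dist - D) / dist       # gradient weight per pair
        np.fill_diagonal(coeff, 0.0)          # no self-pair contribution
        grad = (coeff[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X
```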
Subjective Distances
MDS can also embed subjective pairwise similarity judgments, e.g. between countries:
• Brazil
• USA
• Egypt
• Congo
• Russia
• France
• Cuba
• Yugoslavia
• Israel
• China
How Many Dimensions?
• D too large:
  • perfect fit, no distortion
  • not easy to understand/visualize
• D too small:
  • poor fit, much distortion
  • easy to visualize, but the pattern may be misleading
• D just right?