Efficient Clustering of High-Dimensional Data Sets
Andrew McCallum (WhizBang! Labs & CMU) • Kamal Nigam (WhizBang! Labs) • Lyle Ungar (UPenn)
Large Clustering Problems
• Many examples
• Many clusters
• Many dimensions

Example Domains
• Text
• Images
• Protein structure
The Citation Clustering Data
• Over 1,000,000 citations
• About 100,000 unique papers
• About 100,000 unique vocabulary words
• Over 1 trillion distance calculations
Reduce Number of Distance Calculations
• [Bradley, Fayyad, Reina, KDD-98]: sample to find initial starting points for k-means or EM
• [Moore 98]: use multi-resolution kd-trees to group similar data points
• [Omohundro 89]: balltrees
The Canopies Approach
• Two distance metrics: cheap & expensive
• First pass
  • very inexpensive distance metric
  • create overlapping canopies
• Second pass
  • expensive, accurate distance metric
  • canopies determine which distances are calculated
Creating Canopies with Two Thresholds
• Put all points in D
• Loop until D is empty:
  • Pick a point X from D
  • Put all points within distance K_loose of X into a new canopy
  • Remove all points within distance K_tight of X from D
(A minimal code sketch of this loop follows.)
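A minimal sketch of the canopy-creation loop, assuming a caller-supplied cheap_distance function and thresholds with k_loose > k_tight (the function name and index-based representation are choices of this sketch, not from the slides):

```python
import random

def create_canopies(points, cheap_distance, k_loose, k_tight):
    """Greedy canopy creation with two thresholds (requires k_loose > k_tight).

    Returns a list of canopies, each a list of indices into `points`.
    Canopies may overlap: a point within k_loose but outside k_tight of a
    center stays in D and can join later canopies as well.
    """
    remaining = set(range(len(points)))           # the set D from the slide
    canopies = []
    while remaining:
        center = random.choice(tuple(remaining))  # pick a point X from D
        # All points cheaply close to X join this canopy.
        canopy = [i for i in remaining
                  if cheap_distance(points[center], points[i]) <= k_loose]
        canopies.append(canopy)
        # Remove points within k_tight of X from D (X itself has distance
        # zero, so it is always removed and the loop terminates).
        remaining -= {i for i in canopy
                      if cheap_distance(points[center], points[i]) <= k_tight}
    return canopies
```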
Canopies
• Two distance metrics
  • cheap and approximate
  • expensive and accurate
• Two-pass clustering
  • create overlapping canopies
  • full clustering with limited distances
• Canopy property
  • points in the same cluster will be in the same canopy
Using Canopies with GAC (Greedy Agglomerative Clustering)
• Calculate expensive distances only between points in the same canopy
• All other distances default to infinity
• Sort the finite distances and iteratively merge the closest pair
(A sketch of the distance computation appears below.)
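A sketch of the sparse distance computation, assuming canopies are lists of point indices as produced by the create_canopies sketch above (canopy_distances is a hypothetical helper name):

```python
import itertools

def canopy_distances(canopies, points, expensive_distance):
    """Expensive distances only for pairs that share at least one canopy.

    Returns {(i, j): distance} with i < j; any pair absent from the dict
    is treated as infinitely far apart by the clusterer.
    """
    distances = {}
    for canopy in canopies:
        for i, j in itertools.combinations(sorted(canopy), 2):
            if (i, j) not in distances:   # overlapping canopies share pairs
                distances[(i, j)] = expensive_distance(points[i], points[j])
    return distances
```

Greedy agglomerative clustering then sorts distances.items() by value and merges the closest remaining pair first.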
Computational Savings
• inexpensive metric << expensive metric
• number of canopies: c (large)
• canopies overlap: each point is in f canopies
• roughly f·n/c points per canopy
• O(f²·n²/c) expensive distance calculations
• complexity reduction: O(f²/c)
• n = 10⁶; k = 10⁴; c = 1000; f small: computation reduced by a factor of 1000
(A short derivation of these counts follows.)
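The per-canopy arithmetic behind these counts, as a short worked derivation:

```latex
% c canopies, each holding roughly fn/c points; summing the pairwise
% comparisons within each canopy:
\[
  c \cdot \binom{fn/c}{2}
  \;\approx\; \frac{c}{2}\left(\frac{fn}{c}\right)^{2}
  \;=\; \frac{f^{2} n^{2}}{2c}
  \;=\; O\!\left(\frac{f^{2} n^{2}}{c}\right)
\]
% Relative to the O(n^2) all-pairs cost this is a factor of f^2/c;
% with c = 1000 and f close to 1, roughly a 1000x saving.
```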
Experimental Results

               F1      Minutes
Canopies GAC   0.838   7.65
Complete GAC   0.835   134.09
Preserving Good Clustering
• Small, disjoint canopies → big time savings
• Large, overlapping canopies → the original, accurate clustering
• Goal: fast and accurate
  • requires a good, cheap distance metric
Clustering Finds Groups of Similar Objects
• Understanding clusters can be difficult
• It is important to understand/interpret the results
• Patterns are waiting to be discovered
Feature Subset Selection
• Find the n features that work best for prediction
• Find n features such that distance on them best correlates with distance on all features
• Minimize the distortion between the two distances (a candidate objective is sketched below)
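The objective on this slide did not survive extraction; a plausible form of the quantity to minimize, assuming D is the distance over all features and D_S the distance over the selected subset S, would be:

```latex
% Assumed reconstruction: squared distortion between full-feature
% distance D and subset distance D_S, summed over all point pairs.
\[
  \min_{S,\;|S| = n} \;\sum_{i < j}
    \bigl( D(x_i, x_j) - D_S(x_i, x_j) \bigr)^{2}
\]
```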
Feature Subset Selection
• Suppose all features are relevant
• Does that mean dimensionality can't be reduced?
• No!
• The manifold the data lies on in feature space is what counts, not the relevance of individual features
• The manifold can have lower dimension than the feature space
PCA: Principal Component Analysis
• Given data in d dimensions
• Compute:
  • the d-dimensional mean vector M
  • the d×d covariance matrix C
  • the eigenvectors and eigenvalues of C
• Sort by eigenvalue
• Select the top k < d eigenvalues
• Project the data onto the corresponding k eigenvectors
PCA
Mean vector M:
  M = (1/n) · Σᵢ xᵢ
PCA
Covariance C:
  C = (1/n) · Σᵢ (xᵢ − M)(xᵢ − M)ᵀ
PCA
• Eigenvectors: unit vectors in the directions of maximum variance
• Eigenvalues: the magnitude of the variance in the direction of each eigenvector
PCA
• Find the k largest eigenvalues and their corresponding eigenvectors
• Project each point x onto the k principal components: y = Aᵀ(x − M)
• where A is the d×k matrix whose columns are the k principal components
(A numpy sketch of the whole pipeline follows.)
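A minimal numpy sketch of the steps above (the function name and return signature are choices of this sketch, not from the slides):

```python
import numpy as np

def pca(X, k):
    """Project n points in d dimensions onto the top-k principal components.

    X: (n, d) data matrix. Returns the (n, k) projected coordinates,
    the mean vector M, and the (d, k) projection matrix A.
    """
    M = X.mean(axis=0)                       # d-dimensional mean vector
    centered = X - M
    C = centered.T @ centered / len(X)       # d x d covariance (1/n normalization)
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]        # sort by descending eigenvalue
    A = eigvecs[:, order[:k]]                # columns = top-k eigenvectors
    return centered @ A, M, A
```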
PCA
• Needs a vector representation of the data
• 0-d representation: the sample mean
• 1-d representation: a line, y = mx + b
• 2-d representation: y₁ = mx + b; y₂ = m′x + b′
MDS: Multidimensional Scaling
• PCA requires a vector representation
• What if we are given only pairwise distances between n points?
• Find coordinates for the points in d-dimensional space such that the distances are preserved "best"
MDS
• Assign each point i coordinates xᵢ in d-dimensional space, initialized from:
  • random coordinate values, or
  • the principal components, or
  • the dimensions with greatest variance
• Do gradient descent on the coordinates xᵢ until the distortion is minimized
(A minimal sketch of the descent step follows.)
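A minimal sketch of the gradient descent, assuming a precomputed symmetric target distance matrix D; the raw-stress objective, learning rate, and step count are assumptions of this sketch, not from the slides:

```python
import numpy as np

def mds(D, d=2, lr=0.01, steps=2000, seed=0):
    """Gradient descent on point coordinates to match a distance matrix.

    D: (n, n) symmetric matrix of target pairwise distances.
    Minimizes the raw stress: sum over pairs of (||x_i - x_j|| - D_ij)^2.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(size=(n, d))               # random initial coordinates
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]  # (n, n, d) pairwise deltas
        dist = np.linalg.norm(diff, axis=-1)  # (n, n) current distances
        np.fill_diagonal(dist, 1.0)           # avoid division by zero
        coeff = 2.0 * (dist - D) / dist       # gradient weight per pair
        np.fill_diagonal(coeff, 0.0)          # no self-pair contribution
        grad = (coeff[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X
```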
Subjective Distances
MDS can also embed subjective pairwise similarity judgments, e.g. between countries:
• Brazil
• USA
• Egypt
• Congo
• Russia
• France
• Cuba
• Yugoslavia
• Israel
• China
How Many Dimensions?
• D too large:
  • perfect fit, no distortion
  • not easy to understand/visualize
• D too small:
  • poor fit, much distortion
  • easy to visualize, but the pattern may be misleading
• D just right?