A Survey on Distance Metric Learning (Part 1) Gerry Tesauro IBM T.J.Watson Research Center
Acknowledgement • Lecture material shamelessly adapted/stolen from the following sources: • Kilian Weinberger: • “Survey on Distance Metric Learning” slides • IBM summer intern talk slides (Aug. 2006) • Sam Roweis slides (NIPS 2006 workshop on “Learning to Compare Examples”) • Yann LeCun talk slides (NIPS 2006 workshop on “Learning to Compare Examples”)
Outline Part 1 • Motivation and Basic Concepts • ML tasks where it’s useful to learn a dist. metric • Overview of Dimensionality Reduction • Mahalanobis Metric Learning for Clustering with Side Info (Xing et al.) • Pseudo-metric online learning (Shalev-Shwartz et al.) • Neighbourhood Components Analysis (Goldberger et al.), Metric Learning by Collapsing Classes (Globerson & Roweis) • Metric Learning for Kernel Regression (Weinberger & Tesauro) • Metric learning for RL basis function construction (Keller et al.) • Similarity learning for image processing (LeCun et al.) Part 2
Motivation • Many ML algorithms and tasks require a distance metric (equivalently, “dissimilarity” metric) • Clustering (e.g. k-means) • Classification & regression: • Kernel methods • Nearest neighbor methods • Document/text retrieval • Find most similar fingerprints in DB to given sample • Find most similar web pages to document/keywords • Nonlinear dimensionality reduction methods: • Isomap, Maximum Variance Unfolding, Laplacian Eigenmaps, etc.
Motivation (2) • Many problems may lack a well-defined, relevant distance metric • Incommensurate features → Euclidean distance not meaningful • Side information → Euclidean distance not relevant • Learning distance metrics may thus be desirable • A sensible similarity/distance metric may be highly task-dependent or semantic-dependent • What do these data points “mean”? • What are we using the data for?
It depends ... (right / centered / left)
It depends ... (male / female)
... what you are looking for (student / professor)
... what you are looking for (nature background / plain background)
Key DML Concept: Mahalanobis distance metric • The simplest mapping is a linear transformation x ↦ Lx, which induces the squared distance D²(xi, xj) = (Lxi – Lxj)ᵀ(Lxi – Lxj) = (xi – xj)ᵀ M (xi – xj), where M = LᵀL
Mahalanobis distance metric (cont’d) • Because M = LᵀL, the matrix M is positive semi-definite (PSD) by construction • Algorithms can learn either matrix: the linear map L or the metric matrix M directly
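To make the linear-map view concrete, here is a minimal numpy sketch (not from the original slides) showing that the Mahalanobis distance under M = LᵀL equals the Euclidean distance after mapping the points with L; the 2×2 map L and the two test points are arbitrary illustrative values.

```python
# Minimal sketch (not from the slides): Mahalanobis distance under a learned
# linear map L, where M = L^T L is PSD by construction.
import numpy as np

def mahalanobis_sq(xi, xj, M):
    """Squared Mahalanobis distance D^2 = (xi - xj)^T M (xi - xj)."""
    d = xi - xj
    return float(d @ M @ d)

rng = np.random.default_rng(0)
L = rng.normal(size=(2, 2))        # illustrative 2x2 linear map (arbitrary values)
M = L.T @ L                        # PSD Mahalanobis matrix

xi, xj = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# Same distance either way: apply M directly, or map with L and use Euclidean distance.
assert np.isclose(mahalanobis_sq(xi, xj, M), np.sum((L @ xi - L @ xj) ** 2))
```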
How can the dimensionality be reduced? • eliminate redundant features • eliminate irrelevant features • extract low-dimensional structure
Notation • Input: x1, …, xN ∈ R^d, with d large • Output: y1, …, yN ∈ R^r, with r ≪ d • Embedding principle: nearby points remain nearby, distant points remain distant • Estimate r.
Two classes of DR algorithms: linear and non-linear
Principal Component Analysis (Jolliffe 1986) • Project the data onto the subspace of maximum variance.
Facts about PCA • Principal directions are eigenvectors of the covariance matrix C • Minimizes sum-of-squared reconstruction error • Dimensionality r can be estimated from the eigenvalues of C • PCA requires meaningful scaling of input features
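As a concrete reference for the facts above, a minimal PCA sketch (assumed implementation, not taken from the slides): center the data, eigendecompose the covariance matrix, and project onto the top-r eigenvectors; the returned spectrum can be inspected to estimate r.

```python
# Minimal PCA sketch (assumed implementation): eigendecompose the covariance
# matrix C and project the centered data onto the top-r eigenvectors.
import numpy as np

def pca(X, r):
    Xc = X - X.mean(axis=0)               # center the data
    C = np.cov(Xc, rowvar=False)          # covariance matrix
    evals, evecs = np.linalg.eigh(C)      # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]       # sort descending by variance
    W = evecs[:, order[:r]]               # top-r principal directions
    return Xc @ W, evals[order]           # embedding + spectrum (to help estimate r)
```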
Multidimensional Scaling (MDS) • equivalent to PCA • uses eigenvectors of the inner-product (Gram) matrix • requires only pairwise distances
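A minimal classical-MDS sketch (assumed implementation) that starts from a pairwise distance matrix only: double-center the squared distances to recover an inner-product matrix, then embed with its top eigenvectors.

```python
# Minimal classical-MDS sketch (assumed implementation): recover an
# inner-product matrix from pairwise distances by double centering, then
# embed with its top eigenvectors.
import numpy as np

def classical_mds(D, r):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # inner-product (Gram) matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:r]        # top-r eigenpairs
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))
```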
From subspace to submanifold We assume the data is sampled from a manifold with lower-dimensional degrees of freedom. How can we find a faithful embedding?
Isomap Tenenbaum et al 2000 • Build a neighbourhood graph and compute shortest paths between all inputs • Create the geodesic distance matrix • Perform MDS with the geodesic distances
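A minimal Isomap-style sketch (assumed implementation, not the authors' code) built from the three steps above; the neighbourhood size k is a hypothetical default, and classical_mds is the MDS sketch shown earlier.

```python
# Minimal Isomap-style sketch (assumed implementation, not the authors' code):
# k-NN graph -> all-pairs shortest paths (geodesic distances) -> classical MDS.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap(X, r, k=10):                              # k is a hypothetical default
    D = squareform(pdist(X))                         # Euclidean distances
    G = np.zeros_like(D)                             # 0 = "no edge" for csgraph
    for i in range(len(X)):
        nn = np.argsort(D[i])[1:k + 1]               # k nearest neighbours (skip self)
        G[i, nn] = D[i, nn]
    G = np.maximum(G, G.T)                           # symmetrize the neighbourhood graph
    geo = shortest_path(G, method="D", directed=False)   # geodesic distance matrix
    return classical_mds(geo, r)                     # MDS on geodesic distances
```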
Maximum Variance Unfolding (MVU) Weinberger and Saul 2004
Optimization problem • Unfold the data by maximizing the total pairwise output distance Σij ||yi – yj||² • Preserve local distances: ||yi – yj||² = ||xi – xj||² for all neighbouring pairs (i, j)
Optimization problem • Center the output, Σi yi = 0 (removes the translation invariance)
Optimization problem • Problem: the optimization over the outputs yi is non-convex → multiple local minima
Optimization problem • Solution: a change of variables to the Gram matrix Kij = yi · yj makes the problem convex → single global minimum
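A minimal sketch of this convex reformulation, assuming the cvxpy modelling library: the variable is the Gram matrix Kij = yi · yj, the local-distance and centering requirements become linear constraints, and maximizing the trace of K maximizes the total pairwise output distance.

```python
# Minimal MVU sketch, assuming the cvxpy modelling library: optimize over the
# Gram matrix K_ij = y_i . y_j instead of the outputs, which yields a convex
# semidefinite program with a single global optimum.
import numpy as np
import cvxpy as cp

def mvu(D_sq, neighbor_pairs):
    """D_sq: squared input distances; neighbor_pairs: list of (i, j) index pairs."""
    n = D_sq.shape[0]
    K = cp.Variable((n, n), PSD=True)                    # Gram matrix of the outputs
    constraints = [cp.sum(K) == 0]                       # centered outputs
    for i, j in neighbor_pairs:                          # preserve local distances
        constraints.append(K[i, i] - 2 * K[i, j] + K[j, j] == D_sq[i, j])
    cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve()   # "unfold": maximize variance
    evals, evecs = np.linalg.eigh(K.value)               # embed as in classical MDS
    return evecs[:, ::-1] * np.sqrt(np.maximum(evals[::-1], 0.0))
```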
Mahalanobis Metric Learning for Clustering with Side Information (Xing et al. 2003) • Exemplars {xi, i = 1, …, N} plus two types of side info: • “Similar” set S = { (xi, xj) } s.t. xi and xj are “similar” (e.g. same class) • “Dissimilar” set D = { (xi, xj) } s.t. xi and xj are “dissimilar” • Learn the optimal Mahalanobis matrix M: D²ij = (xi – xj)ᵀ M (xi – xj) (global dist. fn.) • Goal: keep all pairs of “similar” points close, while separating all “dissimilar” pairs • Formulate as a constrained convex programming problem: • minimize the distance between the data pairs in S • subject to the data pairs in D being well separated
MMC-SI (Cont’d) • Objective of learning: • M is positive semi-definite • Ensures non-negativity and the triangle inequality of the metric • The number of parameters is quadratic in the number of features • Difficult to scale to a large number of features • Significant danger of overfitting small datasets
Mahalanobis Metric for Clustering (MMC-SI) Xing et al., NIPS 2002
MMC-SI Move similarly labeled inputs together
MMC-SI Move differently labeled inputs apart
Convex optimization problem target: Mahalanobis matrix
Convex optimization problem pushing differently labeled inputs apart
Convex optimization problem pulling similar points together
Convex optimization problem ensuring positive semi-definiteness
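Collecting the pieces from the preceding slides, one standard way to write the Xing et al. optimization problem is (restated here for reference; notation as above):

```latex
\max_{M \succeq 0}\; \sum_{(x_i, x_j) \in \mathcal{D}} \sqrt{(x_i - x_j)^\top M\, (x_i - x_j)}
\qquad \text{s.t.} \qquad \sum_{(x_i, x_j) \in \mathcal{S}} (x_i - x_j)^\top M\, (x_i - x_j) \;\le\; 1
```

The objective is concave in M, the similarity constraint is linear in M, and M ⪰ 0 defines a convex cone, which is why the overall problem is convex.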
Convex optimization problem • The resulting optimization problem is CONVEX
Gradient Alternating Projection • Take a step along the gradient. • Project onto the constraint-satisfying subspace. • Project onto the PSD cone. • Repeat until convergence.
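A minimal numpy sketch of one iteration of this gradient + alternating-projection scheme (assumed pseudocode, not the authors' implementation); project_psd clips negative eigenvalues, and the similarity-constraint step here is a simple rescaling rather than the exact projection used in the paper.

```python
# Minimal sketch of one gradient + alternating-projection iteration (assumed
# pseudocode, not the authors' implementation).
import numpy as np

def project_psd(M):
    """Project onto the PSD cone by clipping negative eigenvalues to zero."""
    evals, evecs = np.linalg.eigh((M + M.T) / 2)
    return evecs @ np.diag(np.maximum(evals, 0.0)) @ evecs.T

def gap_step(M, grad, similar_pairs, X, lr=0.1):          # lr is a hypothetical step size
    M = M + lr * grad                                      # 1) step along the gradient
    # 2) restore the similarity constraint sum_S (xi - xj)^T M (xi - xj) <= 1.
    #    A simple rescaling is used here; the paper uses an exact projection.
    s = sum((X[i] - X[j]) @ M @ (X[i] - X[j]) for i, j in similar_pairs)
    if s > 1.0:
        M = M / s
    return project_psd(M)                                  # 3) project onto the PSD cone
```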