A Survey on Distance Metric Learning (Part 2)

Gerry Tesauro

IBM T.J.Watson Research Center

Acknowledgement
  • Lecture material shamelessly adapted from the following sources:
    • Kilian Weinberger:
      • “Survey on Distance Metric Learning” slides
      • IBM summer intern talk slides (Aug. 2006)
    • Sam Roweis slides (NIPS 2006 workshop on “Learning to Compare Examples”)
    • Yann LeCun talk slides (CVPR 2005, 2006)
Outline – Part 2
  • Neighborhood Components Analysis (Goldberger et al.), Metric Learning by Collapsing Classes (Globerson & Roweis)
  • Metric Learning for Kernel Regression (Weinberger & Tesauro)
  • Metric learning for RL basis function construction (Keller et al.)
  • Similarity learning for image processing (LeCun et al.)
Neighborhood Component Analysis

Distance metric for visualization and kNN

(Goldberger et al., 2004)

Killing three birds with one stone:

We construct a method for linear dimensionality reduction that generates a meaningful distance metric, optimally tuned for distance-based kernel regression.

Kernel Regression
  • Given training set {(xj , yj), j=1,…,N} where x is a d-dimensional vector and y is real-valued, estimate the value of a test point xi by a weighted average of the samples:

ŷi = Σj≠i kij yj / Σj≠i kij

where kij = kD (xi, xj) is a distance-based kernel function using distance metric D

Choice of Kernel
  • Many functional forms for kij can be used in MLKR; our empirical work uses the Gaussian kernel

kij = exp( −D(xi, xj) / σ² )

where σ is a kernel width parameter (can set σ = 1 w.l.o.g. since we learn D)

The softmax regression estimate is similar to Roweis’ softmax classifier.
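The estimate and kernel above can be combined into a short sketch (the data, metric matrix, and query point below are illustrative stand-ins, not from the talk):

```python
import numpy as np

def kernel_regression(X, y, x_query, M):
    """Nadaraya-Watson estimate with a Gaussian kernel over the
    Mahalanobis distance D(x, x') = (x - x')^T M (x - x'), sigma = 1."""
    diffs = X - x_query                              # (N, d) differences
    D = np.einsum('nd,de,ne->n', diffs, M, diffs)    # squared distances
    k = np.exp(-D)                                   # Gaussian kernel weights
    return k @ y / k.sum()                           # weighted average of samples

# toy usage: with the identity metric this is plain Euclidean kernel regression
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])
print(kernel_regression(X, y, np.array([1.0]), np.eye(1)))   # ≈ 1.0 (symmetric neighbors)
```

At training time the sum excludes j = i (leave-one-out); for a fresh test point that exclusion is automatic.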


Distance Metric for Nearest Neighbor Regression

Learn a linear transformation that allows us to estimate the value of a test point from its nearest neighbors

Mahalanobis Metric

Distance function is a pseudo-Mahalanobis metric, D(xi, xj) = (xi − xj)ᵀ M (xi − xj) with M = AᵀA. (Generalizes Euclidean distance; M is only positive semidefinite, so distinct points can be distance zero apart.)
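A minimal sketch of this distance (the vectors here are illustrative); the "pseudo" qualifier shows up when A is rank-deficient, since distinct points can then be distance zero apart:

```python
import numpy as np

def mahalanobis_sq(xi, xj, A):
    # D(xi, xj) = (xi - xj)^T (A^T A) (xi - xj) = ||A (xi - xj)||^2
    diff = A @ (xi - xj)
    return float(diff @ diff)

xi, xj = np.array([1.0, 0.0]), np.array([0.0, 0.0])
print(mahalanobis_sq(xi, xj, np.eye(2)))               # identity A: squared Euclidean, 1.0
print(mahalanobis_sq(xi, xj, np.array([[0.0, 1.0]])))  # rank-1 A drops this direction: 0.0
```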

General Metric Learning Objective
  • Find a parameterized distance function Dθ that minimizes the total leave-one-out cross-validation loss L = Σi (ŷi − yi)²
    • e.g. params θ = elements Aij of the A matrix
  • Since we’re solving for A, not M, the optimization is non-convex → use gradient descent
Gradient Computation

∂L/∂A = 4A Σi (ŷi − yi) Σj (ŷi − yj) kij xij xijᵀ / Σl kil

where xij = xi – xj

  • For fast implementation:
    • Don’t sum over all i-j pairs, only go up to ~1000 nearest neighbors for each sample i
    • Maintain nearest neighbors in a heap-tree structure, update heap tree every 15 gradient steps
    • Ignore sufficiently small values of kij (kij < e^−34)
    • Even better data structures: cover trees, k-d trees
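Putting the pieces together, a plain O(N²) NumPy sketch of the loss and gradient, without the speed tricks above (the gradient expression is the one derived from the leave-one-out loss with σ = 1; the data in any usage would be the caller's):

```python
import numpy as np

def mlkr_loss_grad(A, X, y):
    """Leave-one-out loss L = sum_i (yhat_i - y_i)^2 for Gaussian-kernel
    regression under the metric induced by A, plus its gradient w.r.t. A."""
    Z = X @ A.T                                            # transformed inputs
    d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)     # pairwise ||A x_ij||^2
    K = np.exp(-d)
    np.fill_diagonal(K, 0.0)                               # leave sample i out
    yhat = K @ y / K.sum(1)
    err = yhat - y
    L = float(err @ err)
    # S_ij = err_i * (yhat_i - y_j) * k_ij / sum_l k_il
    S = (err / K.sum(1))[:, None] * (yhat[:, None] - y[None, :]) * K
    # sum_ij S_ij x_ij x_ij^T  =  X^T (diag(row+col sums) - S - S^T) X
    M = np.diag(S.sum(1) + S.sum(0)) - S - S.T
    G = 4.0 * A @ X.T @ M @ X
    return L, G
```

A training loop is then just repeated `A -= lr * G`; the tricks listed above only change how K is assembled, not the math.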
Learned Distance Metric example

[Figure: neighborhoods of a test point, comparing points with orig. Euclidean D < 1 vs. points with learned D < 1]

“Twin Peaks” test

Training: n = 8000
  • we added 3 dimensions with 1000% noise
  • we rotated 5 dimensions randomly

Input Variance

[Figure: per-input-dimension variance, split into Noise vs. Signal dimensions]

Output Variance

[Figure: per-output-dimension variance, split into Signal vs. Noise dimensions]

dimreduction with mlkr
DimReduction with MLKR
  • FG-NET face data: 82 persons, 984 face images w/age
DimReduction with MLKR

PowerManagement data (d=21)

  • Force A to be rectangular
  • Project onto eigenvectors of A
  • Allows visualization of data
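One way to read the projection step as a sketch (the data and learned transform below are random stand-ins for the d = 21 PowerManagement case): take the metric M = AᵀA induced by the learned A and project onto its leading eigenvectors to get 2-D coordinates for plotting:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 21))     # stand-in for the d=21 PowerManagement data
A = rng.normal(size=(21, 21))      # stand-in for a learned transform

M = A.T @ A                        # induced (pseudo-)metric
evals, evecs = np.linalg.eigh(M)   # eigenvalues in ascending order
proj = evecs[:, -2:]               # the two most heavily weighted directions
Z = X @ proj                       # 2-D embedding for visualization
print(Z.shape)                     # (100, 2)
```

Forcing A itself to be rectangular (2 × 21) achieves the same effect directly, with Z = X @ A.T.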
Unity Data Center Prototype

[Diagram: a Resource Arbiter allocating a pool of servers among per-application Application Managers]
  • Objective: Learn long-range resource value estimates for each application manager
  • State Variables (~48):
    • Arrival rate
    • ResponseTime
    • QueueLength
    • iatVariance
    • rtVariance
  • Action: # of servers allocated by Arbiter
  • Reward: SLA(Resp. Time)

Maximize Total SLA Revenue

[Diagram: testbed of 8 xSeries servers running Trade3 on WebSphere 5.1 with DB2 (two instances) plus a Batch application; each application converts demand (HTTP req/sec) and SLA(Resp. Time) into Value(#srvrs) estimates every 5 sec]

(Tesauro, AAAI 2005; Tesauro et al., ICAC 2006)

Power & Performance Management
  • Objective: manage systems to objectives spanning multiple disciplines: minimize Resp. Time and minimize Power Usage
  • State Variables (21):
    • Power Cap
    • Power Usage
    • CPU Utilization
    • Temperature
    • # of requests arrived
    • Workload intensity (# Clients)
    • Response Time
  • Action: Power Cap
  • Reward: SLA(Resp. Time) – Power Usage

(Kephart et al., ICAC 2007)

Metric Learning for RL basis function construction (Keller et al. ICML 2006)
  • RL Dataset of state-action-reward tuples {(si, ai, ri), i=1,…,N}
Value Iteration
  • Define an iterative “bootstrap” calculation:

Vt+1(s) = maxa [ R(s, a) + γ Σs′ P(s′ | s, a) Vt(s′) ]
  • Each round of VI must iterate over all states in the state space
  • Try to speed this up using state aggregation (Bertsekas & Castanon, 1989)
  • Idea: Use NCA to aggregate states:
    • project states into lower-dim rep; keep states with similar Bellman error close together
    • use projected states to define a set of basis functions {φi}
    • learn a linear value function over the basis functions: V = Σi θi φi
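The bootstrap update above can be sketched for a toy tabular MDP (the transition tensor and rewards below are made up for illustration):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=100):
    """V_{t+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V_t(s') ].
    P has shape (A, S, S) with P[a, s, s'] = P(s'|s,a); R has shape (S, A)."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # backup over successors
        V = Q.max(axis=1)                              # greedy over actions
    return V

# toy 2-state MDP: action 0 = stay, action 1 = switch states
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],    # state 0: switching pays 1
              [1.0, 0.0]])   # state 1: staying pays 1
print(value_iteration(P, R))  # ≈ [10. 10.]  (geometric sum 1/(1-0.9))
```

Each sweep touches every state, which is exactly the cost that state aggregation via NCA is meant to reduce.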
Chopra et al. 2005

Similarity metric for image verification.

Problem: Given a pair of face images, decide if they are from the same person.


Too difficult for linear mapping!
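Their remedy is a nonlinear (convolutional) siamese embedding trained on image pairs. As a hedged sketch of the pair-based objective only, here is a contrastive-style loss on already-embedded points; this simplified margin form follows later formulations, Chopra et al.'s exact loss differs in detail, and the margin m = 1.0 is an arbitrary illustrative choice:

```python
import numpy as np

def contrastive_loss(z1, z2, same, m=1.0):
    """Pull embeddings of same-person pairs together; push different-person
    pairs at least margin m apart (simplified contrastive form)."""
    d = float(np.linalg.norm(z1 - z2))
    return d ** 2 if same else max(0.0, m - d) ** 2

z_a, z_b = np.array([0.0, 0.0]), np.array([3.0, 4.0])   # hypothetical embedded faces
print(contrastive_loss(z_a, z_b, same=True))    # 25.0: far apart yet same person, big loss
print(contrastive_loss(z_a, z_b, same=False))   # 0.0: already farther than the margin
```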