Learning Embeddings for Similarity-Based Retrieval

Vassilis Athitsos

Computer Science Department

Boston University


Overview

  • Background on similarity-based retrieval and embeddings.

  • BoostMap.

    • Embedding optimization using machine learning.

  • Query-sensitive embeddings.

    • Ability to preserve non-metric structure.


Problem Definition

[Figure: a database of n objects x1, x2, x3, …, xn and a query object q.]

  • Goals:

    • find the k nearest neighbors of query q.

  • Brute-force time is linear in:

    • n (size of database).

    • time it takes to measure a single distance.
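As a minimal sketch of the brute-force baseline described above, assuming the distance measure is available as an ordinary function (the names `brute_force_knn`, `database`, and `distance` are illustrative, not from the slides):

```python
import heapq

def brute_force_knn(query, database, distance, k):
    """Return the k database objects closest to query by scanning all n objects."""
    # One exact distance computation per database object: n calls in total.
    return heapq.nsmallest(k, database, key=lambda x: distance(query, x))

# Toy example: numbers compared with the absolute difference.
database = [10, 3, 8, 1, 7]
neighbors = brute_force_knn(6, database, lambda a, b: abs(a - b), k=2)
```

The cost is n calls to `distance`, so an expensive distance measure dominates the running time.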


Applications

Nearest neighbor classification.

Similarity-based retrieval.

Image/video databases.

Biological databases.

Time series.

Web pages.

Browsing music or movie catalogs.

[Figure: example images of faces, letters/digits, and handshapes.]


Expensive Distance Measures

  • Comparing d-dimensional vectors is efficient:

    • O(d) time.

[Figure: vectors (x1, x2, x3, x4, …, xd) and (y1, y2, y3, y4, …, yd) compared coordinate by coordinate.]

  • Comparing strings of length d with the edit distance is more expensive:

    • O(d^2) time.

  • Reason: alignment.

[Figure: alignment of the strings "immigration" and "imitation".]
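The quadratic cost comes from filling a table with one cell per pair of string positions. A minimal sketch, assuming unit costs for insertion, deletion, and substitution (the slides do not state the costs):

```python
def edit_distance(s, t):
    """Levenshtein distance by dynamic programming, O(len(s) * len(t)) time."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

d = edit_distance("immigration", "imitation")  # the pair shown on the slide
```

Under these unit costs the slide's pair is at distance 3: delete one "m", substitute "g" with "t", delete "r".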





Shape Context Distance

  • Proposed by Belongie et al. (2001).

    • Error rate: 0.63%, with database of 20,000 images.

    • Uses bipartite matching (cubic complexity!).

    • 22 minutes/object, heavily optimized.

    • Result preview: 5.2 seconds, 0.61% error rate.


More Examples

  • DNA and protein sequences:

    • Smith-Waterman.

  • Time series:

    • Dynamic Time Warping.

  • Probability distributions:

    • Kullback-Leibler Distance.

  • These measures are non-Euclidean, sometimes non-metric.


Indexing Problem

  • Vector indexing methods NOT applicable.

    • PCA.

    • R-trees, X-trees, SS-trees.

    • VA-files.

    • Locality Sensitive Hashing.


Metric Methods

  • Pruning-based methods.

    • VP-trees, MVP-trees, M-trees, Slim-trees,…

    • Use triangle inequality for tree-based search.

  • Filtering methods.

    • AESA, LAESA…

    • Use the triangle inequality to compute upper/lower bounds of distances.

  • Suffer from curse of dimensionality.

  • Heuristic in non-metric spaces.

  • In many datasets, bad empirical performance.
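To illustrate how the triangle inequality yields distance bounds, here is a simplified single-pivot sketch. It is not AESA or LAESA themselves; the function names and the one-pivot setup are assumptions made for illustration, and the pruning is only valid when the distance is a metric.

```python
def distance_bounds(d_q_r, d_x_r):
    """Bounds on D(q, x) from distances to a shared pivot r (metric D only)."""
    lower = abs(d_q_r - d_x_r)
    upper = d_q_r + d_x_r
    return lower, upper

def pivot_filter_nn(query, database, pivot, distance):
    """Exact 1-NN search that skips objects whose lower bound is too large."""
    d_x_r = [distance(x, pivot) for x in database]  # can be precomputed offline
    d_q_r = distance(query, pivot)
    best, best_dist, computed = None, float("inf"), 1  # 1 = query-to-pivot distance
    for x, dxr in zip(database, d_x_r):
        lower, _ = distance_bounds(d_q_r, dxr)
        if lower >= best_dist:
            continue  # x cannot beat the current best, so skip the exact distance
        computed += 1
        dist = distance(query, x)
        if dist < best_dist:
            best, best_dist = x, dist
    return best, best_dist, computed

result = pivot_filter_nn(19, [1, 2, 9, 20, 21, 35], pivot=0,
                         distance=lambda a, b: abs(a - b))
```

In a non-metric space the lower bound is no longer guaranteed, which is why these methods become heuristic there.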


Embeddings

[Figure: embedding F maps the database objects x1, x2, x3, …, xn and the query q to vectors in R^d.]

  • Measure distances between vectors (typically much faster).

  • Caveat: the embedding must preserve similarity structure.



Reference Object Embeddings

[Figure: a database containing reference objects r1, r2, r3 and an object x.]

F(x) = (D(x, r1), D(x, r2), D(x, r3))


F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))

F(Sacramento)....= ( 386, 1543, 2920)

F(Las Vegas).....= ( 262, 1232, 2405)

F(Oklahoma City).= (1345, 437, 1291)

F(Washington DC).= (2657, 1207, 853)

F(Jacksonville)..= (2422, 1344, 141)
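A reference-object embedding can be sketched in a few lines. The helper name `make_embedding` and the toy planar points are assumptions for illustration; any black-box distance function could take the place of `math.dist`.

```python
import math

def make_embedding(reference_objects, distance):
    """Return F with F(x) = (D(x, r1), ..., D(x, rd))."""
    def embed(x):
        return tuple(distance(x, r) for r in reference_objects)
    return embed

# Toy example: points in the plane with Euclidean distance.
refs = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
F = make_embedding(refs, math.dist)
vec = F((3.0, 4.0))
```

Embedding one object costs d exact distance computations, one per reference object.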


Existing Embedding Methods

  • FastMap, MetricMap, SparseMap, Lipschitz embeddings.

    • Use distances to reference objects (prototypes).

  • Question: how do we directly optimize an embedding for nearest neighbor retrieval?

    • FastMap & MetricMap assume Euclidean properties.

    • SparseMap optimizes stress.

      • Large stress may be inevitable when embedding non-metric spaces into a metric space.

    • In practice often worse than random construction.


BoostMap

  • BoostMap: A Method for Efficient Approximate Similarity Rankings. Athitsos, Alon, Sclaroff, and Kollios, CVPR 2004.

  • BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval. Athitsos, Alon, Sclaroff, and Kollios, PAMI 2007 (to appear).


Key Features of BoostMap

  • Maximizes amount of nearest neighbor structure preserved by the embedding.

  • Based on machine learning, not on geometric assumptions.

    • Principled optimization, even in non-metric spaces.

  • Can capture non-metric structure.

    • Query-sensitive version of BoostMap.

  • Better results in practice, in all datasets we have tried.


Ideal Embedding Behavior

[Figure: embedding F maps the original space X to R^d; q is a query, a = NN(q), and b is another database object.]

For any query q: we want F(NN(q)) = NN(F(q)).

For any database object b besides NN(q), we want F(q) closer to F(NN(q)) than to F(b).


Embeddings Seen As Classifiers

[Figure: a query q, its nearest neighbor a, and another database object b.]

For triples (q, a, b) such that:

- q is a query object

- a = NN(q)

- b is a database object

Classification task: is q closer to a or to b?

  • Any embedding F defines a classifier F’(q, a, b).

    • F’ checks if F(q) is closer to F(a) or to F(b).


Classifier Definition

  • Given embedding F: X → R^d:

    • F’(q, a, b) = ||F(q) - F(b)|| - ||F(q) - F(a)||.

  • F’(q, a, b) > 0 means “q is closer to a.”

  • F’(q, a, b) < 0 means “q is closer to b.”
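The classifier induced by an embedding can be written directly from this definition. This sketch assumes the norm is Euclidean (the slide writes ||·|| without specifying it); `triple_classifier` and the toy 1D embedding are illustrative names.

```python
import math

def triple_classifier(F):
    """Return F'(q, a, b) = ||F(q) - F(b)|| - ||F(q) - F(a)|| for embedding F."""
    def classify(q, a, b):
        fq, fa, fb = F(q), F(a), F(b)
        return math.dist(fq, fb) - math.dist(fq, fa)
    return classify

# Hypothetical 1D embedding of numbers: distance to the reference object 0.
F = lambda x: (abs(x - 0),)
F_prime = triple_classifier(F)
score = F_prime(5, 6, 9)  # positive: q = 5 is embedded closer to a = 6 than to b = 9
```

Swapping the roles of a and b flips the sign of the output.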


Key Observation

[Figure: embedding F maps the original space X to R^d; q is a query, a = NN(q), and b is another database object.]

  • If classifier F’ is perfect, then for every q, F(NN(q)) = NN(F(q)).

    • If F(q) is closer to F(b) than to F(NN(q)), then triple (q, a, b) is misclassified.

  • Classification error on triples (q, NN(q), b) measures how well F preserves nearest neighbor structure.


Optimization Criterion

  • Goal: construct an embedding F optimized for k-nearest neighbor retrieval.

  • Method: maximize accuracy of F’ on triples (q, a, b) of the following type:

    • q is any object.

    • a is a k-nearest neighbor of q in the database.

    • b is in database, but NOT a k-nearest neighbor of q.

  • If F’ is perfect on those triples, then F perfectly preserves k-nearest neighbors.
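Training triples of this type could be generated as follows. This is a sketch under assumptions: uniform random sampling, a database larger than k, and the helper name `sample_triples` are all illustrative; the slides do not describe the sampling procedure.

```python
import random

def sample_triples(queries, database, distance, k, num_triples, seed=0):
    """Sample triples (q, a, b): a is a k-nearest neighbor of q, b is not."""
    rng = random.Random(seed)
    triples = []
    for _ in range(num_triples):
        q = rng.choice(queries)
        ranked = sorted(database, key=lambda x: distance(q, x))
        a = rng.choice(ranked[:k])  # one of the k nearest neighbors of q
        b = rng.choice(ranked[k:])  # a database object outside the k nearest
        triples.append((q, a, b))
    return triples

dist = lambda x, y: abs(x - y)
triples = sample_triples([12, 47, 88], list(range(0, 100, 10)), dist,
                         k=2, num_triples=5)
```

Each sampled triple costs a full ranking of the database here, so in practice the exact distances would be computed once and cached.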


1D Embeddings as Weak Classifiers

  • 1D embeddings define weak classifiers.

    • Better than a random classifier (50% error rate).

[Figure: 1D embedding example using cities: Lincoln, Detroit, LA, Chicago, New York, Cleveland.]

  • We can define lots of different classifiers.

    • Every object in the database can be a reference object.

Question: how do we combine many such classifiers into a single strong classifier?

Answer: use AdaBoost.

  • AdaBoost is a machine learning method designed for exactly this problem.


Using AdaBoost

[Figure: 1D embeddings F1, F2, …, Fn map the original space X to the real line.]

  • Output: H = w1F’1 + w2F’2 + … + wdF’d.

    • AdaBoost chooses 1D embeddings and weighs them.

    • Goal: achieve low classification error.

    • AdaBoost trains on triples chosen from the database.
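The selection-and-weighting loop can be sketched with textbook discrete AdaBoost. This is an illustration under stated assumptions, not the training procedure from the BoostMap papers: the slides do not say which AdaBoost variant is used, and here each 1D embedding is thresholded to a yes/no vote, which drops the real-valued output F’i that the weighted-L1 interpretation relies on. All function names are hypothetical.

```python
import math

def weak_output(distance, r, q, a, b):
    """F'_r(q, a, b) for the 1D embedding F_r(x) = D(x, r)."""
    fq = distance(q, r)
    return abs(fq - distance(b, r)) - abs(fq - distance(a, r))

def adaboost_select(triples, candidates, distance, rounds):
    """Pick reference objects and classifier weights with discrete AdaBoost.

    Every triple (q, a, b) has label +1 because q is closer to a than to b.
    """
    n = len(triples)
    sample_w = [1.0 / n] * n
    chosen = []  # list of (reference object, classifier weight)
    for _ in range(rounds):
        best_r, best_err, best_correct = None, None, None
        for r in candidates:
            correct = [weak_output(distance, r, *t) > 0 for t in triples]
            err = sum(w for w, ok in zip(sample_w, correct) if not ok)
            if best_err is None or err < best_err:
                best_r, best_err, best_correct = r, err, correct
        if best_err >= 0.5:
            break  # no candidate beats random guessing on the reweighted triples
        err = max(best_err, 1e-10)  # avoid division by zero for a perfect classifier
        alpha = 0.5 * math.log((1.0 - err) / err)
        chosen.append((best_r, alpha))
        # Increase the weight of misclassified triples, decrease the rest.
        sample_w = [w * math.exp(-alpha if ok else alpha)
                    for w, ok in zip(sample_w, best_correct)]
        total = sum(sample_w)
        sample_w = [w / total for w in sample_w]
    return chosen

dist = lambda x, y: abs(x - y)
triples = [(5, 6, 9), (20, 18, 30), (1, 2, 8)]
chosen = adaboost_select(triples, candidates=[0, 10, 25], distance=dist, rounds=2)
```

On this toy input the reference object 0 classifies every triple correctly, so it is selected first.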


From Classifier to Embedding

AdaBoost output: H = w1F’1 + w2F’2 + … + wdF’d

What embedding should we use?

What distance measure should we use?

BoostMap embedding: F(x) = (F1(x), …, Fd(x)).

Distance measure:

$$D((u_1, \ldots, u_d), (v_1, \ldots, v_d)) = \sum_{i=1}^{d} w_i |u_i - v_i|$$

Claim: Let q be closer to a than to b. H misclassifies triple (q, a, b) if and only if, under distance measure D, F maps q closer to b than to a.


Proof

$$
\begin{aligned}
H(q, a, b) &= \sum_{i=1}^{d} w_i F'_i(q, a, b) \\
&= \sum_{i=1}^{d} w_i \left( |F_i(q) - F_i(b)| - |F_i(q) - F_i(a)| \right) \\
&= \sum_{i=1}^{d} \left( w_i |F_i(q) - F_i(b)| - w_i |F_i(q) - F_i(a)| \right) \\
&= D(F(q), F(b)) - D(F(q), F(a)) = F'(q, a, b)
\end{aligned}
$$
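The identity can be checked numerically on a small example. The weights and embedded vectors below are made-up values chosen only to exercise the formula.

```python
def weighted_l1(u, v, w):
    """D(u, v) = sum_i w_i * |u_i - v_i|."""
    return sum(wi * abs(ui - vi) for wi, ui, vi in zip(w, u, v))

def H(Fq, Fa, Fb, w):
    """sum_i w_i * F'_i(q, a, b), with F'_i = |F_i(q) - F_i(b)| - |F_i(q) - F_i(a)|."""
    return sum(wi * (abs(q - b) - abs(q - a))
               for wi, q, a, b in zip(w, Fq, Fa, Fb))

w = [0.5, 2.0, 1.0]
Fq, Fa, Fb = (1.0, 4.0, 2.0), (2.0, 3.0, 2.0), (5.0, 0.0, 7.0)
lhs = H(Fq, Fa, Fb, w)
rhs = weighted_l1(Fq, Fb, w) - weighted_l1(Fq, Fa, w)
```

Both sides evaluate to 12.5 here, so the classifier output and the difference of weighted L1 distances agree.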


Significance of Proof

  • AdaBoost optimizes a direct measure of embedding quality.

  • We optimize an indexing structure for similarity-based retrieval using machine learning.

    • Take advantage of training data.


How Do We Use It?

Filter-and-refine retrieval:

  • Offline step: compute embedding F of entire database.

  • Given a query object q:

    • Embedding step:

      • Compute distances from query to reference objects → F(q).

    • Filter step:

      • Find top p matches of F(q) in vector space.

    • Refine step:

      • Measure exact distance from q to top p matches.
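The three online steps can be sketched as one function. This is a minimal illustration with hypothetical names and a toy numeric dataset; the filter step here is a plain sort over all embedded vectors rather than any particular index.

```python
def filter_and_refine(query, database, db_vectors, embed, vector_distance,
                      exact_distance, p):
    """Return the best match among the top p candidates of the filter step."""
    fq = embed(query)  # embedding step: d exact distances to reference objects
    # Filter step: rank database objects by the cheap vector distance.
    order = sorted(range(len(database)),
                   key=lambda i: vector_distance(fq, db_vectors[i]))
    candidates = order[:p]
    # Refine step: p exact distances, only for the surviving candidates.
    best = min(candidates, key=lambda i: exact_distance(query, database[i]))
    return database[best]

exact = lambda x, y: abs(x - y)
embed = lambda x: (exact(x, 0), exact(x, 100))  # reference objects 0 and 100
l1 = lambda u, v: sum(abs(a - b) for a, b in zip(u, v))

database = [3, 14, 15, 92, 65]
db_vectors = [embed(x) for x in database]  # offline step
match = filter_and_refine(16, database, db_vectors, embed, l1, exact, p=2)
```

Per query, the exact distance is evaluated d + p times instead of n times.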


Evaluating Embedding Quality

How often do we find the true nearest neighbor?

How many exact distance computations do we need?

What is the nearest neighbor classification error?

  • Embedding step:

    • Compute distances from query to reference objects → F(q).

  • Filter step:

    • Find top p matches of F(q) in vector space.

  • Refine step:

    • Measure exact distance from q to top p matches.


Results on Hand Dataset

[Figure: a query image and its nearest neighbor in the database (80,640 images).]

Chamfer distance: 112 seconds per query

Database: 80,640 synthetic images of hands.

Query set: 710 real images of hands.


Results on MNIST Dataset

  • MNIST: 60,000 database objects, 10,000 queries.

  • Shape context (Belongie 2001):

    • 0.63% error, 20,000 distances, 22 minutes.

    • 0.54% error, 60,000 distances, 66 minutes.



Query-Sensitive Embeddings

  • Richer models.

    • Capture non-metric structure.

    • Better embedding quality.

  • References:

    • Athitsos, Hadjieleftheriou, Kollios, and Sclaroff, SIGMOD 2005.

    • Athitsos, Hadjieleftheriou, Kollios, and Sclaroff, TODS, June 2007.


Capturing Non-Metric Structure

  • A human is not similar to a horse.

  • A centaur is similar both to a human and a horse.

  • Triangle inequality is violated:

    • Using human ratings of similarity (Tversky, 1982).

    • Using k-median Hausdorff distance.


Capturing Non-Metric Structure

  • Mapping to a metric space presents dilemma:

    • If D(F(centaur), F(human)) = D(F(centaur), F(horse)) = C, then D(F(human), F(horse)) <= 2C.

  • Query-sensitive embeddings:

    • Have the modeling power to preserve non-metric structure.


Local Importance of Coordinates

  • How important is each coordinate in comparing embeddings?

[Figure: embedding F maps the database objects x1, x2, …, xn and the query q to the vectors (x11, …, x1d), (x21, …, x2d), …, (xn1, …, xnd), and (q1, …, qd) in R^d.]


F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))

F(Sacramento)....= ( 386, 1543, 2920)

F(Las Vegas).....= ( 262, 1232, 2405)

F(Oklahoma City).= (1345, 437, 1291)

F(Washington DC).= (2657, 1207, 853)

F(Jacksonville)..= (2422, 1344, 141)


General Intuition

[Figure: reference objects 1, 2, and 3 in the original space X.]

  • Classifier: H = w1F’1 + w2F’2 + … + wjF’j.

  • Observation: accuracy of weak classifiers depends on query.

    • F’1 is perfect for (q, a, b) where q = reference object 1.

    • F’1 is good for queries close to reference object 1.

  • Question: how can we capture that?


Query-Sensitive Weak Classifiers

[Figure: reference objects 1, 2, …, j in the original space X.]


Applying AdaBoost

[Figure: 1D embeddings F1, F2, …, Fd map the original space X to the real line.]

  • AdaBoost forms classifiers Q_{Fi,Vi}.

    • Fi: 1D embedding.

    • Vi: area of influence for Fi.

  • Output: H = w1 Q_{F1,V1} + w2 Q_{F2,V2} + … + wd Q_{Fd,Vd}.

  • Empirical observation:

    • At late stages of the training, query-sensitive weak classifiers are still useful, whereas query-insensitive classifiers are not.


From Classifier to Embedding

AdaBoost output:

$$H(q, a, b) = \sum_{i=1}^{d} w_i Q_{F_i,V_i}(q, a, b)$$

What embedding should we use?

What distance measure should we use?

BoostMap embedding: F(x) = (F1(x), …, Fd(x))

Distance measure:

$$D(F(q), F(x)) = \sum_{i=1}^{d} w_i S_{F_i,V_i}(q) \, |F_i(q) - F_i(x)|$$

  • Distance measure is query-sensitive.

    • Weighted L1 distance, weights depend on q.

    • S_{F,V}(q) = 1 if F(q) is in V, 0 otherwise.
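A query-sensitive distance of this form can be sketched as follows. Representing each area of influence V as a closed interval [lo, hi] on the real line is an assumption made for illustration, as are the function names and numbers.

```python
def splitter(interval):
    """S_{F,V} on a 1D embedding value: 1 if it lies in V = [lo, hi], else 0."""
    lo, hi = interval
    return lambda value: 1 if lo <= value <= hi else 0

def query_sensitive_distance(fq, fx, weights, intervals):
    """Weighted L1 distance whose per-coordinate weights depend on the query."""
    total = 0.0
    for q_i, x_i, w_i, v_i in zip(fq, fx, weights, intervals):
        total += w_i * splitter(v_i)(q_i) * abs(q_i - x_i)
    return total

weights = (2.0, 1.0)
intervals = ((0.0, 3.0), (0.0, 5.0))
u, v = (1.0, 8.0), (4.0, 3.0)
d_uv = query_sensitive_distance(u, v, weights, intervals)  # u plays the query
d_vu = query_sensitive_distance(v, u, weights, intervals)  # v plays the query
```

Because the active coordinates depend on which argument is the query, the two results differ: the measure is not symmetric in general.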


Centaurs Revisited

  • Reference objects: human, horse, centaur.

    • For centaur queries, use weights (0,0,1).

    • For human queries, use weights (1,0,0).

  • Query-sensitive distances are non-metric.

    • Combine efficiency of L1 distance and ability to capture non-metric structure.
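The centaur example can be worked through with made-up numbers. The distance values are hypothetical, and the weights for horse queries are assumed by symmetry with the human case; only the centaur and human weights come from the slide.

```python
# Hypothetical non-metric distances: the triangle inequality fails because
# D(human, horse) = 10 > D(human, centaur) + D(centaur, horse) = 2.
D = {("human", "human"): 0, ("horse", "horse"): 0, ("centaur", "centaur"): 0,
     ("human", "horse"): 10, ("human", "centaur"): 1, ("horse", "centaur"): 1}

def dist(x, y):
    return D[(x, y)] if (x, y) in D else D[(y, x)]

refs = ("human", "horse", "centaur")
F = {x: tuple(dist(x, r) for r in refs) for x in refs}

# Query-dependent weights; the horse row is an assumption, not from the slide.
weights = {"human": (1, 0, 0), "horse": (0, 1, 0), "centaur": (0, 0, 1)}

def qs_distance(q, x):
    """Weighted L1 distance between F(q) and F(x), with weights chosen by q."""
    return sum(w * abs(fq - fx) for w, fq, fx in zip(weights[q], F[q], F[x]))
```

With these numbers the embedded distances reproduce the original ones: human to horse is 10, while human to centaur and centaur to horse are both 1, so the violation of the triangle inequality survives the embedding.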


F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))

F(Sacramento)....= ( 386, 1543, 2920)

F(Las Vegas).....= ( 262, 1232, 2405)

F(Oklahoma City).= (1345, 437, 1291)

F(Washington DC).= (2657, 1207, 853)

F(Jacksonville)..= (2422, 1344, 141)


Recap of Advantages

  • Capturing non-metric structure.

  • Finding most informative reference objects for each query.

  • Richer model overall.

    • Choosing a weak classifier now also involves choosing an area of influence.


Dynamic Time Warping on Time Series

Database: 31818 time series.

Query set: 1000 time series.


Dynamic Time Warping on Time Series

Database: 32768 time series.

Query set: 50 time series.


BoostMap Recap - Theory

  • Machine-learning method for optimizing embeddings.

    • Explicitly maximizes amount of nearest neighbor structure preserved by embedding.

    • Optimization method is independent of underlying geometry.

    • Query-sensitive version can capture non-metric structure.


BoostMap Recap - Practice

  • BoostMap can significantly speed up nearest neighbor retrieval and classification.

    • Useful in real-world datasets:

      • Hand shape classification.

      • Optical character recognition (MNIST, UNIPEN).

    • In all four datasets, better results than other methods.

      • In three benchmark datasets, better than methods custom-made for those distance measures.

    • Domain-independent formulation.

      • Distance measures are used as a black box.

      • Application to proteins/DNA matching…


