Learning Embeddings for Similarity-Based Retrieval

Vassilis Athitsos

Computer Science Department

Boston University

Overview
• Background on similarity-based retrieval and embeddings.
• BoostMap.
• Embedding optimization using machine learning.
• Query-sensitive embeddings.
• Ability to preserve non-metric structure.

Problem Definition

[Figure: a database of n objects x1, x2, x3, …, xn and a query object q.]

• Goals:
• find the k nearest neighbors of query q.
• Brute-force time is linear in:
• n (size of database).
• time it takes to measure a single distance.

Applications
• Nearest neighbor classification.
• Similarity-based retrieval.
• Image/video databases.
• Biological databases.
• Time series.
• Web pages.
• Browsing music or movie catalogs.

[Figure: example images of faces, letters/digits, and handshapes.]

Expensive Distance Measures
• Comparing d-dimensional vectors is efficient:
• O(d) time.
• Comparing strings of length d with the edit distance is more expensive:
• O(d^2) time.
• Reason: alignment.

[Figure: vectors (x1, …, xd) and (y1, …, yd) are compared coordinate by coordinate; the strings "immigration" and "imitation" must first be aligned.]
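The cost gap between the two comparisons can be sketched as follows. This is a minimal illustration, assuming the standard Levenshtein edit distance with unit costs (the slides do not say which variant is meant); the function names are chosen here for illustration.

```python
def l1_distance(x, y):
    """Vector comparison: one pass over the d coordinates, O(d) time."""
    return sum(abs(a - b) for a, b in zip(x, y))


def edit_distance(s, t):
    """Levenshtein distance by dynamic programming, O(len(s) * len(t)) time.

    The table is needed because the best alignment of the two strings
    is not known in advance.
    """
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # cost of deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("immigration", "imitation"))
```

For two strings of length d the inner loop runs d times for each of the d outer iterations, which is where the quadratic cost comes from.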

Shape Context Distance

• Proposed by Belongie et al. (2001).
• Error rate: 0.63%, with database of 20,000 images.
• Uses bipartite matching (cubic complexity!).
• 22 minutes/object, heavily optimized.
• Result preview: 5.2 seconds, 0.61% error rate.

More Examples

• DNA and protein sequences:
• Smith-Waterman.
• Time series:
• Dynamic Time Warping.
• Probability distributions:
• Kullback-Leibler Distance.
• These measures are non-Euclidean, sometimes non-metric.
Indexing Problem
• Vector indexing methods NOT applicable.
• PCA.
• R-trees, X-trees, SS-trees.
• VA-files.
• Locality Sensitive Hashing.
Metric Methods
• Pruning-based methods.
• VP-trees, MVP-trees, M-trees, Slim-trees,…
• Use triangle inequality for tree-based search.
• Filtering methods.
• AESA, LAESA…
• Use the triangle inequality to compute upper/lower bounds of distances.
• Suffer from curse of dimensionality.
• Heuristic in non-metric spaces.
• In many datasets, bad empirical performance.
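A rough sketch of how filtering methods use the triangle inequality: distances from the query and from a database object to shared pivot objects bound the unknown distance between them. The function names and the toy numbers are assumptions for illustration, not the actual AESA or LAESA procedures, and the bounds hold only when the distance is a metric.

```python
def pivot_bounds(dq, dx):
    """Bounds on D(q, x) from distances to shared pivots.

    dq[i] = D(q, r_i) and dx[i] = D(x, r_i). For a metric D:
    |D(q, r_i) - D(x, r_i)| <= D(q, x) <= D(q, r_i) + D(x, r_i).
    """
    lower = max(abs(a - b) for a, b in zip(dq, dx))
    upper = min(a + b for a, b in zip(dq, dx))
    return lower, upper


def filter_candidates(dq, pivot_table, best_so_far):
    """Keep objects whose lower bound does not exceed the best distance found so far."""
    return [name for name, dx in pivot_table.items()
            if pivot_bounds(dq, dx)[0] <= best_so_far]


# Toy example: points on a line, pivots at 2 and 10, query at 0.
dq = [2, 10]
pivot_table = {"x": [3, 5],   # x sits at 5
               "y": [1, 9]}   # y sits at 1
print(pivot_bounds(dq, pivot_table["x"]))
print(filter_candidates(dq, pivot_table, best_so_far=2))
```

In a non-metric space the lower bound can exceed the true distance, so pruning on it may discard the true nearest neighbor; this is why the slide calls these methods heuristic there.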

Embeddings

[Figure: an embedding F maps the database objects x1, x2, x3, x4, …, xn and the query q to vectors in R^d.]

• Measure distances between vectors (typically much faster).
• Caveat: the embedding must preserve similarity structure.


Reference Object Embeddings

[Figure: a database with reference objects r1, r2, r3 and an object x.]

F(x) = (D(x, r1), D(x, r2), D(x, r3))

F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))

F(Sacramento)....= ( 386, 1543, 2920)

F(Las Vegas).....= ( 262, 1232, 2405)

F(Oklahoma City).= (1345, 437, 1291)

F(Washington DC).= (2657, 1207, 853)

F(Jacksonville)..= (2422, 1344, 141)
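A small sketch of how such an embedding is built and used. The embedded city vectors are copied from the table above; comparing them with the L1 distance is an assumption made here for illustration, as is every function name.

```python
def embed(x, reference_objects, distance):
    """F(x) = (D(x, r1), ..., D(x, rd)): one exact distance per reference object."""
    return tuple(distance(x, r) for r in reference_objects)


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


# Embedded cities from the table: distances to LA, Lincoln, and Orlando.
EMBEDDED = {
    "Sacramento": (386, 1543, 2920),
    "Las Vegas": (262, 1232, 2405),
    "Oklahoma City": (1345, 437, 1291),
    "Washington DC": (2657, 1207, 853),
    "Jacksonville": (2422, 1344, 141),
}


def nearest_in_embedding(name, embedded=EMBEDDED):
    """Nearest other city, using only the embedded vectors."""
    others = (c for c in embedded if c != name)
    return min(others, key=lambda c: l1(embedded[name], embedded[c]))


print(nearest_in_embedding("Sacramento"))
print(embed(5, [0, 10], lambda a, b: abs(a - b)))  # toy objects on a line
```

Embedding a new object costs one exact distance per reference object; after that, all comparisons are cheap vector operations.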

Existing Embedding Methods
• FastMap, MetricMap, SparseMap, Lipschitz embeddings.
• Use distances to reference objects (prototypes).
• Question: how do we directly optimize an embedding for nearest neighbor retrieval?
• FastMap & MetricMap assume Euclidean properties.
• SparseMap optimizes stress.
• Large stress may be inevitable when embedding non-metric spaces into a metric space.
• In practice often worse than random construction.
BoostMap
• BoostMap: A Method for Efficient Approximate Similarity Rankings. Athitsos, Alon, Sclaroff, and Kollios, CVPR 2004.
• BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval. Athitsos, Alon, Sclaroff, and Kollios, PAMI 2007 (to appear).
Key Features of BoostMap
• Maximizes amount of nearest neighbor structure preserved by the embedding.
• Based on machine learning, not on geometric assumptions.
• Principled optimization, even in non-metric spaces.
• Can capture non-metric structure.
• Query-sensitive version of BoostMap.
• Better results in practice, in all datasets we have tried.

Ideal Embedding Behavior

[Figure: F maps the original space X to R^d; shown are a query q, its nearest neighbor a, and another database object b.]

For any query q: we want F(NN(q)) = NN(F(q)).

For any database object b besides NN(q), we want F(q) closer to F(NN(q)) than to F(b).

Embeddings Seen As Classifiers

[Figure: a query q and two database objects a and b.]

For triples (q, a, b) such that:

- q is a query object

- a = NN(q)

- b is a database object

closer to a or to b?

• Any embedding F defines a classifier F’(q, a, b).
• F’ checks if F(q) is closer to F(a) or to F(b).

Classifier Definition

• Given embedding F: X → R^d:
• F’(q, a, b) = ||F(q) - F(b)|| - ||F(q) - F(a)||.
• F’(q, a, b) > 0 means “q is closer to a.”
• F’(q, a, b) < 0 means “q is closer to b.”
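The definition translates almost directly into code. The sketch below assumes the Euclidean norm for ||·|| and reuses three of the embedded city vectors from the Reference Object Embeddings slide; `triple_classifier` is a name chosen here.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triple_classifier(F, q, a, b, dist=euclidean):
    """F'(q, a, b) = ||F(q) - F(b)|| - ||F(q) - F(a)||.

    Positive: F places q closer to a. Negative: F places q closer to b.
    """
    fq = F(q)
    return dist(fq, F(b)) - dist(fq, F(a))


# Embedded vectors (distances to LA, Lincoln, Orlando) from the earlier slide.
EMBEDDED = {
    "Sacramento": (386, 1543, 2920),
    "Las Vegas": (262, 1232, 2405),
    "Oklahoma City": (1345, 437, 1291),
}
F = EMBEDDED.__getitem__

score = triple_classifier(F, "Sacramento", "Las Vegas", "Oklahoma City")
print("closer to a" if score > 0 else "closer to b")
```

Swapping a and b flips the sign of the output, so the same function answers both orderings of a triple.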

Key Observation

[Figure: F maps the original space X to R^d; shown are a query q and database objects a and b.]

• If classifier F’ is perfect, then for every q, F(NN(q)) = NN(F(q)).
• If F(q) is closer to F(b) than to F(NN(q)), then triple (q, a, b) is misclassified.
• Classification error on triples (q, NN(q), b) measures how well F preserves nearest neighbor structure.

Optimization Criterion

• Goal: construct an embedding F optimized for k-nearest neighbor retrieval.
• Method: maximize accuracy of F’ on triples (q, a, b) of the following type:
• q is any object.
• a is a k-nearest neighbor of q in the database.
• b is in database, but NOT a k-nearest neighbor of q.
• If F’ is perfect on those triples, then F perfectly preserves k-nearest neighbors.
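Training triples of this type might be generated as in the sketch below. It assumes distinct objects, excludes q itself from its own neighbor list, and picks a and b uniformly at random; the sampling strategy used in the BoostMap papers may differ, and `sample_triples` is a hypothetical helper.

```python
import random


def sample_triples(objects, database, distance, k, n_triples, seed=0):
    """Return triples (q, a, b): a is a k-nearest neighbor of q, b is not."""
    if len(database) < k + 2:
        raise ValueError("database must hold at least k + 2 objects")
    rng = random.Random(seed)
    triples = []
    for _ in range(n_triples):
        q = rng.choice(objects)
        # Rank the database by exact distance to q (q itself is skipped).
        ranked = sorted((x for x in database if x != q),
                        key=lambda x: distance(q, x))
        a = rng.choice(ranked[:k])   # one of the k nearest neighbors
        b = rng.choice(ranked[k:])   # any object outside the k nearest
        triples.append((q, a, b))
    return triples


# Toy example: numbers on a line with absolute difference as the distance.
db = [0, 10, 20, 30, 40, 50]
triples = sample_triples(db, db, lambda x, y: abs(x - y), k=2, n_triples=5)
print(triples)
```

Each triple carries the same label by construction (q is at least as close to a as to b), so training only needs the triples themselves.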

1D Embeddings as Weak Classifiers

• 1D embeddings define weak classifiers.
• Better than a random classifier (50% error rate).
• We can define lots of different classifiers.
• Every object in the database can be a reference object.

[Figure: cities (Lincoln, Detroit, LA, Chicago, New York, Cleveland) and a 1D embedding that places some of them on a line.]

Question: how do we combine many such classifiers into a single strong classifier?

• AdaBoost is a machine learning method designed for exactly this problem.


[Figure: 1D embeddings F1, F2, …, Fn map the original space X to the real line.]

• Output: H = w1F’1 + w2F’2 + … + wdF’d.
• AdaBoost chooses 1D embeddings and weighs them.
• Goal: achieve low classification error.
• AdaBoost trains on triples chosen from the database.
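The selection-and-weighting loop can be sketched with textbook discrete AdaBoost. This is a simplified illustration under stated assumptions, not the training procedure from the BoostMap papers: each F’ is thresholded to its sign, a tie counts as half an error, and the candidates are reference-object embeddings over hypothetical 2D points.

```python
import math


def sign(v):
    return (v > 0) - (v < 0)


def weak_output(F, q, a, b):
    """1D weak classifier F'(q, a, b) = |F(q) - F(b)| - |F(q) - F(a)|."""
    return abs(F(q) - F(b)) - abs(F(q) - F(a))


def adaboost(triples, candidates, rounds):
    """Return (weight, candidate index) pairs chosen by discrete AdaBoost.

    Every triple (q, a, b) has q closer to a, so the correct label is +1.
    """
    n = len(triples)
    sample_w = [1.0 / n] * n
    preds = [[sign(weak_output(F, q, a, b)) for q, a, b in triples]
             for F in candidates]
    chosen = []
    for _ in range(rounds):
        # Weighted error of each candidate; a tie (0) counts as half an error.
        errors = [sum(w * (1.0 if p < 0 else 0.5 if p == 0 else 0.0)
                      for w, p in zip(sample_w, pred))
                  for pred in preds]
        j = min(range(len(candidates)), key=errors.__getitem__)
        if errors[j] >= 0.5:
            break  # no candidate beats random guessing
        err = max(errors[j], 1e-10)  # guard against a perfect candidate
        alpha = 0.5 * math.log((1.0 - err) / err)
        chosen.append((alpha, j))
        # Misclassified triples gain weight, correctly classified ones lose it.
        sample_w = [w * math.exp(-alpha * p)
                    for w, p in zip(sample_w, preds[j])]
        total = sum(sample_w)
        sample_w = [w / total for w in sample_w]
    return chosen


# Toy data (hypothetical): 2D points with Euclidean distance as D.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
candidates = [lambda x, r=r: math.dist(x, r) for r in points]
triples = [(q, a, b) for q in points for a in points for b in points
           if len({q, a, b}) == 3 and math.dist(q, a) < math.dist(q, b)]
model = adaboost(triples, candidates, rounds=5)
print(model)
```

Each chosen pair supplies one coordinate Fi and its weight wi. The slides combine the real-valued outputs F’i rather than their signs, so this sketch only mirrors the overall structure.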
From Classifier to Embedding

H = w1F’1 + w2F’2 + … + wdF’d

What embedding should we use?

What distance measure should we use?

BoostMap embedding: F(x) = (F1(x), …, Fd(x)).

Distance measure: D((u_1, …, u_d), (v_1, …, v_d)) = \sum_{i=1}^{d} w_i |u_i - v_i|

Claim: Let q be closer to a than to b. H misclassifies triple (q, a, b) if and only if, under distance measure D, F maps q closer to b than to a.


Proof

H(q, a, b) = \sum_{i=1}^{d} w_i F’_i(q, a, b)

= \sum_{i=1}^{d} w_i (|F_i(q) - F_i(b)| - |F_i(q) - F_i(a)|)

= \sum_{i=1}^{d} (w_i |F_i(q) - F_i(b)| - w_i |F_i(q) - F_i(a)|)

= D(F(q), F(b)) - D(F(q), F(a)) = F’(q, a, b)
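The identity can be checked numerically. The sketch below uses hypothetical reference objects and weights on a line; it only confirms that the weighted sum of 1D classifier outputs equals the difference of two weighted L1 distances.

```python
import math


def classifier_output(ws, Fs, q, a, b):
    """H(q, a, b) = sum_i w_i * (|F_i(q) - F_i(b)| - |F_i(q) - F_i(a)|)."""
    return sum(w * (abs(F(q) - F(b)) - abs(F(q) - F(a)))
               for w, F in zip(ws, Fs))


def embed(Fs, x):
    """F(x) = (F_1(x), ..., F_d(x))."""
    return tuple(F(x) for F in Fs)


def weighted_l1(ws, u, v):
    """D(u, v) = sum_i w_i * |u_i - v_i|."""
    return sum(w * abs(ui - vi) for w, ui, vi in zip(ws, u, v))


# Hypothetical setup: objects are numbers, F_i(x) = |x - r_i|.
refs = [2, 7, 11]
Fs = [lambda x, r=r: abs(x - r) for r in refs]
ws = [0.5, 1.5, 0.25]
q, a, b = 3, 4, 9

lhs = classifier_output(ws, Fs, q, a, b)
rhs = (weighted_l1(ws, embed(Fs, q), embed(Fs, b))
       - weighted_l1(ws, embed(Fs, q), embed(Fs, a)))
print(lhs, rhs)
```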

Significance of Proof
• AdaBoost optimizes a direct measure of embedding quality.
• We optimize an indexing structure for similarity-based retrieval using machine learning.
• Take advantage of training data.
How Do We Use It?

Filter-and-refine retrieval:

• Offline step: compute embedding F of entire database.
• Given a query object q:
• Embedding step:
• Compute distances from query to reference objects → F(q).
• Filter step:
• Find top p matches of F(q) in vector space.
• Refine step:
• Measure exact distance from q to top p matches.
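The steps can be sketched end to end on a toy string database. The edit distance stands in for the expensive measure, the reference strings are picked by hand, and an unweighted L1 distance is used in the filter step; all of these are assumptions for illustration rather than a trained BoostMap embedding.

```python
import heapq


def edit_distance(s, t):
    """Unit-cost Levenshtein distance (the 'expensive' exact measure here)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def embed(x, refs, distance):
    return tuple(distance(x, r) for r in refs)


def filter_and_refine(q, database, embedded_db, refs, distance, p, k=1):
    fq = embed(q, refs, distance)  # embedding step: len(refs) exact distances
    # Filter step: top p matches using cheap vector distances only.
    top = heapq.nsmallest(p, range(len(database)),
                          key=lambda i: l1(fq, embedded_db[i]))
    # Refine step: p exact distances, then keep the k best.
    top.sort(key=lambda i: distance(q, database[i]))
    return [database[i] for i in top[:k]]


database = ["immigration", "imitation", "irrigation",
            "integration", "migration", "limitation"]
refs = ["imitation", "migration"]
embedded_db = [embed(x, refs, edit_distance) for x in database]  # offline step

query = "imigration"
print(filter_and_refine(query, database, embedded_db, refs, edit_distance, p=3))
```

With p equal to the database size the result matches brute-force search; smaller p trades accuracy for fewer exact distance computations.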

Evaluating Embedding Quality

How often do we find the true nearest neighbor?

How many exact distance computations do we need?

What is the nearest neighbor classification error?

• Embedding step:
• Compute distances from query to reference objects → F(q).
• Filter step:
• Find top p matches of F(q) in vector space.
• Refine step:
• Measure exact distance from q to top p matches.
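One way to measure the first two questions is to record, for each query, the rank at which the true nearest neighbor appears in the filter ordering. The sketch below assumes a reference-object embedding, so that the per-query cost is the number of reference objects plus p; the helper names and the toy data are hypothetical.

```python
def rank_of_true_nn(q, database, embedded_db, embed, vector_distance, exact_distance):
    """Smallest p for which the filter step retains the true nearest neighbor of q."""
    true_nn = min(range(len(database)),
                  key=lambda i: exact_distance(q, database[i]))
    fq = embed(q)
    order = sorted(range(len(database)),
                   key=lambda i: vector_distance(fq, embedded_db[i]))
    return order.index(true_nn) + 1


def retrieval_accuracy(ranks, p):
    """Fraction of queries whose true nearest neighbor is among the top p."""
    return sum(r <= p for r in ranks) / len(ranks)


def exact_distances_per_query(num_reference_objects, p):
    """Embedding step plus refine step."""
    return num_reference_objects + p


# Toy example: numbers on a line, one reference object at 0.
exact = lambda x, y: abs(x - y)
embed = lambda x: (abs(x - 0),)
vec = lambda u, v: abs(u[0] - v[0])
database = [-5, -1, 2, 6]
embedded_db = [embed(x) for x in database]

ranks = [rank_of_true_nn(q, database, embedded_db, embed, vec, exact)
         for q in (-4, 5)]
print(ranks, retrieval_accuracy(ranks, 1), retrieval_accuracy(ranks, 2))
```

The query at 5 illustrates a filter-step miss: its embedding coincides with that of -5, so the true neighbor 6 only appears at rank 2.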

Results on Hand Dataset

[Figure: a query hand image and its nearest neighbor in the database (80,640 images).]

Chamfer distance: 112 seconds per query

Database: 80,640 synthetic images of hands.

Query set: 710 real images of hands.

Results on MNIST Dataset
• MNIST: 60,000 database objects, 10,000 queries.
• Shape context (Belongie 2001):
• 0.63% error, 20,000 distances, 22 minutes.
• 0.54% error, 60,000 distances, 66 minutes.
Query-Sensitive Embeddings
• Richer models.
• Capture non-metric structure.
• Better embedding quality.
• References:
• Athitsos, Hadjieleftheriou, Kollios, and Sclaroff, SIGMOD 2005.
• Athitsos, Hadjieleftheriou, Kollios, and Sclaroff, TODS, June 2007.
Capturing Non-Metric Structure
• A human is not similar to a horse.
• A centaur is similar both to a human and a horse.
• Triangle inequality is violated:
• Using human ratings of similarity (Tversky, 1982).
• Using k-median Hausdorff distance.
Capturing Non-Metric Structure
• Mapping to a metric space presents dilemma:
• If D(F(centaur), F(human)) = D(F(centaur), F(horse)) = C, then D(F(human), F(horse)) <= 2C.
• Query-sensitive embeddings:
• Have the modeling power to preserve non-metric structure.
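The dilemma can be made concrete with made-up numbers. The dissimilarities below are hypothetical (they are not Tversky's ratings); the sketch simply lists every triple that violates the triangle inequality.

```python
from itertools import permutations

# Hypothetical dissimilarities: the centaur is close to both, human and horse are far apart.
DISS = {
    frozenset(("human", "centaur")): 1,
    frozenset(("horse", "centaur")): 1,
    frozenset(("human", "horse")): 5,
}


def D(x, y):
    return 0 if x == y else DISS[frozenset((x, y))]


def triangle_violations(objects, distance):
    """Triples (x, y, z) with distance(x, z) > distance(x, y) + distance(y, z)."""
    return [(x, y, z) for x, y, z in permutations(objects, 3)
            if distance(x, z) > distance(x, y) + distance(y, z)]


print(triangle_violations(["human", "horse", "centaur"], D))
```

With C = 1, any metric embedding forces the human-horse distance down to at most 2, while these dissimilarities put it at 5, so some of the structure has to be lost.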

Local Importance of Coordinates
• How important is each coordinate in comparing embeddings?

[Figure: the embedding F maps database objects x1, x2, …, xn and the query q to vectors (xi1, …, xid) and (q1, …, qd) in R^d.]

F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))

F(Sacramento)....= ( 386, 1543, 2920)

F(Las Vegas).....= ( 262, 1232, 2405)

F(Oklahoma City).= (1345, 437, 1291)

F(Washington DC).= (2657, 1207, 853)

F(Jacksonville)..= (2422, 1344, 141)

General Intuition

[Figure: reference objects 1, 2, 3 in the original space X.]

• Classifier: H = w1F’1 + w2F’2 + … + wjF’j.
• Observation: accuracy of weak classifiers depends on query.
• F’1 is perfect for (q, a, b) where q = reference object 1.
• F’1 is good for queries close to reference object 1.
• Question: how can we capture that?

Query-Sensitive Weak Classifiers

[Figure: reference objects 1, 2, …, j in the original space X.]

• V: area of influence (interval of real numbers).
• Q_{F,V}(q, a, b) = F’(q, a, b) if F(q) is in V, “I don’t know” if F(q) is not in V.
• If V includes all real numbers, Q_{F,V} = F’.
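A query-sensitive weak classifier can be sketched as a wrapper around a 1D embedding. Encoding "I don't know" as an output of 0 is an assumption made here so that abstaining classifiers drop out of a weighted sum; the interval V is taken to be closed, and the toy reference object is hypothetical.

```python
import math


def query_sensitive(F, V):
    """Build Q_{F,V} from a 1D embedding F and an area of influence V = (lo, hi)."""
    lo, hi = V

    def Q(q, a, b):
        fq = F(q)
        if not (lo <= fq <= hi):
            return 0.0  # "I don't know": abstain for queries outside V
        return abs(fq - F(b)) - abs(fq - F(a))  # otherwise behave like F'

    return Q


# Toy example: objects on a line, reference object at 10, so F(x) = |x - 10|.
F = lambda x: abs(x - 10)
local = query_sensitive(F, (0, 3))                   # trusts only queries near 10
everywhere = query_sensitive(F, (-math.inf, math.inf))

print(local(9, 8, 20))       # query near the reference object
print(local(0, 1, 20))       # query far away: abstains
print(everywhere(0, 1, 20))  # plain F' on the same triple
```

In the last call the unrestricted classifier gives the wrong sign (0 is closer to 1 than to 20), which is the kind of far-from-reference error that restricting to V avoids.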

[Figure: 1D embeddings F1, F2, …, Fd map the original space X to the real line.]

• Fi: 1D embedding.
• Vi: area of influence for Fi.
• Output: H = w_1 Q_{F_1,V_1} + w_2 Q_{F_2,V_2} + … + w_d Q_{F_d,V_d}.

• Empirical observation:
• At late stages of the training, query-sensitive weak classifiers are still useful, whereas query-insensitive classifiers are not.
From Classifier to Embedding

Classifier output: H(q, a, b) = \sum_{i=1}^{d} w_i Q_{F_i,V_i}(q, a, b)

What embedding should we use?

What distance measure should we use?

BoostMap embedding: F(x) = (F1(x), …, Fd(x))

Distance measure: D(F(q), F(x)) = \sum_{i=1}^{d} w_i S_{F_i,V_i}(q) |F_i(q) - F_i(x)|

• Distance measure is query-sensitive.
• Weighted L1 distance, weights depend on q.
• S_{F,V}(q) = 1 if F(q) is in V, 0 otherwise.
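A minimal sketch of the query-sensitive distance, assuming closed intervals for the areas of influence and made-up weights and vectors. The first argument is always the embedded query, since only its coordinates decide which terms are switched on.

```python
def query_sensitive_distance(ws, Vs, fq, fx):
    """D(F(q), F(x)) = sum_i w_i * S_i(q) * |F_i(q) - F_i(x)|.

    S_i(q) is 1 when the query coordinate fq[i] lies in V_i = (lo, hi), else 0.
    """
    return sum(w * abs(a - b)
               for w, (lo, hi), a, b in zip(ws, Vs, fq, fx)
               if lo <= a <= hi)


# Hypothetical 2D example.
ws = [1.0, 1.0]
Vs = [(0, 1), (0, 10)]
u = (0.5, 3)
v = (5, 4)

print(query_sensitive_distance(ws, Vs, u, v))  # u as the query
print(query_sensitive_distance(ws, Vs, v, u))  # v as the query
```

The two calls give different values because the active coordinates depend on which vector plays the role of the query; the measure is therefore not symmetric, and not a metric.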
Centaurs Revisited
• Reference objects: human, horse, centaur.
• For centaur queries, use weights (0,0,1).
• For human queries, use weights (1,0,0).
• Query-sensitive distances are non-metric.
• Combine efficiency of L1 distance and ability to capture non-metric structure.
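The centaur example can be worked through with made-up dissimilarities. The numbers are hypothetical, and the weights for horse queries, (0, 1, 0), are an assumption added by symmetry; the slide only gives the weights for centaur and human queries.

```python
# Hypothetical dissimilarities that violate the triangle inequality.
DISS = {
    frozenset(("human", "centaur")): 1,
    frozenset(("horse", "centaur")): 1,
    frozenset(("human", "horse")): 5,
}
REFS = ("human", "horse", "centaur")  # reference objects


def D(x, y):
    return 0 if x == y else DISS[frozenset((x, y))]


def F(x):
    """Reference-object embedding: (D(x, human), D(x, horse), D(x, centaur))."""
    return tuple(D(x, r) for r in REFS)


WEIGHTS = {
    "centaur": (0, 0, 1),  # from the slide
    "human": (1, 0, 0),    # from the slide
    "horse": (0, 1, 0),    # assumption, by symmetry with the human case
}


def query_sensitive_l1(q, x):
    """Weighted L1 distance whose weights depend on the query q."""
    return sum(w * abs(a - b) for w, a, b in zip(WEIGHTS[q], F(q), F(x)))


for q in REFS:
    print(q, {x: query_sensitive_l1(q, x) for x in REFS if x != q})
```

Because each query here is itself a reference object, putting all the weight on its own coordinate reproduces the original dissimilarities exactly, including the human-horse value that no metric could preserve.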

F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))

F(Sacramento)....= ( 386, 1543, 2920)

F(Las Vegas).....= ( 262, 1232, 2405)

F(Oklahoma City).= (1345, 437, 1291)

F(Washington DC).= (2657, 1207, 853)

F(Jacksonville)..= (2422, 1344, 141)

• Capturing non-metric structure.
• Finding most informative reference objects for each query.
• Richer model overall.
• Choosing a weak classifier now also involves choosing an area of influence.

Dynamic Time Warping on Time Series

Database: 31818 time series.

Query set: 1000 time series.

Dynamic Time Warping on Time Series

Database: 32768 time series.

Query set: 50 time series.

BoostMap Recap - Theory
• Machine-learning method for optimizing embeddings.
• Explicitly maximizes amount of nearest neighbor structure preserved by embedding.
• Optimization method is independent of underlying geometry.
• Query-sensitive version can capture non-metric structure.
BoostMap Recap - Practice
• BoostMap can significantly speed up nearest neighbor retrieval and classification.
• Useful in real-world datasets:
• Hand shape classification.
• Optical character recognition (MNIST, UNIPEN).
• In all four datasets, better results than other methods.
• In three benchmark datasets, better than methods custom-made for those distance measures.
• Domain-independent formulation.
• Distance measures are used as a black box.
• Application to proteins/DNA matching…