Learning Embeddings for Similarity-Based Retrieval

Vassilis Athitsos

Computer Science Department

Boston University

- Background on similarity-based retrieval and embeddings.
- BoostMap.
- Embedding optimization using machine learning.

- Query-sensitive embeddings.
- Ability to preserve non-metric structure.

[Figure: a database of n objects x1, x2, x3, ..., xn, and a query object q.]

- Goals:
- find the k nearest neighbors of query q.

- Brute-force search time is linear in:
- n (size of database).
- the time it takes to measure a single distance.
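
As a minimal sketch of the brute-force baseline described above, the following Python function scans the whole database. The name `knn_brute_force` and the toy data are illustrative assumptions, not part of the original slides; `dist` stands for any black-box distance measure.

```python
import heapq

def knn_brute_force(query, database, dist, k):
    """Return the k database objects closest to `query`.

    Calls `dist` once per database object, so the cost grows linearly
    with both the database size n and the cost of a single distance.
    """
    return heapq.nsmallest(k, database, key=lambda x: dist(query, x))

# Toy usage (hypothetical data): numbers compared by absolute difference.
neighbors = knn_brute_force(6, [1, 5, 9, 12], lambda a, b: abs(a - b), k=2)
```

With an expensive distance measure, the n calls to `dist` dominate the running time, which is the cost that embeddings try to avoid.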

Nearest neighbor classification.

Similarity-based retrieval.

Image/video databases.

Biological databases.

Time series.

Web pages.

Browsing music or movie catalogs.

[Figure: example images of faces, letters/digits, and handshapes.]

Comparing d-dimensional vectors is efficient:

O(d) time.

[Figure: two d-dimensional vectors (x1, ..., xd) and (y1, ..., yd), compared coordinate by coordinate.]

Comparing strings of length d with the edit distance is more expensive:

O(d^2) time.

Reason: alignment.
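
To illustrate where the quadratic cost comes from, here is a small sketch of the standard dynamic-programming edit distance. It assumes unit costs for insertion, deletion, and substitution (the slides do not state the cost model), and the function name is illustrative.

```python
def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, O(len(s) * len(t)) time."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

distance = edit_distance("immigration", "imitation")
```

Every character of one string is compared against every character of the other, because the best alignment is not known in advance.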

[Figure: aligning the characters of two strings, "immigration" and "imitation".]

Shape Context Distance

- Proposed by Belongie et al. (2001).
- Error rate: 0.63%, with database of 20,000 images.
- Uses bipartite matching (cubic complexity!).
- 22 minutes/object, heavily optimized.
- Result preview: 5.2 seconds, 0.61% error rate.

More Examples

- DNA and protein sequences:
- Smith-Waterman.

- Time series:
- Dynamic Time Warping.

- Probability distributions:
- Kullback-Leibler Distance.

- These measures are non-Euclidean, sometimes non-metric.

- Vector indexing methods NOT applicable.
- PCA.
- R-trees, X-trees, SS-trees.
- VA-files.
- Locality Sensitive Hashing.

- Pruning-based methods.
- VP-trees, MVP-trees, M-trees, Slim-trees,…
- Use triangle inequality for tree-based search.

- Filtering methods.
- AESA, LAESA…
- Use the triangle inequality to compute upper/lower bounds of distances.

- Suffer from curse of dimensionality.
- Heuristic in non-metric spaces.
- In many datasets, bad empirical performance.

Embeddings

[Figure: an embedding F maps the database objects x1, ..., xn and the query q into R^d.]

- Measure distances between vectors (typically much faster).
- Caveat: the embedding must preserve similarity structure.

Reference Object Embeddings

[Figure: a database with three reference objects r1, r2, r3 and an object x.]

F(x) = (D(x, r1), D(x, r2), D(x, r3))

F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))

F(Sacramento)    = ( 386, 1543, 2920)
F(Las Vegas)     = ( 262, 1232, 2405)
F(Oklahoma City) = (1345,  437, 1291)
F(Washington DC) = (2657, 1207,  853)
F(Jacksonville)  = (2422, 1344,  141)
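
A reference-object embedding can be sketched in a few lines of Python. The vectors below are copied from the city example above; the helper names and the choice of an unweighted L1 distance for comparing embedded vectors are assumptions made for illustration.

```python
def embed(x, references, dist):
    """Map x to the vector of its distances to the reference objects."""
    return tuple(dist(x, r) for r in references)

def l1(u, v):
    """Unweighted L1 distance between two embedded vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Embedded vectors from the example: distances to LA, Lincoln, and Orlando.
embedded = {
    "Sacramento": (386, 1543, 2920),
    "Las Vegas": (262, 1232, 2405),
    "Oklahoma City": (1345, 437, 1291),
    "Washington DC": (2657, 1207, 853),
    "Jacksonville": (2422, 1344, 141),
}

def nearest_in_embedding(name):
    """Nearest other city, comparing only the embedded vectors."""
    others = (c for c in embedded if c != name)
    return min(others, key=lambda c: l1(embedded[name], embedded[c]))
```

For instance, `nearest_in_embedding("Sacramento")` compares three numbers per city instead of evaluating the original distance measure.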

- FastMap, MetricMap, SparseMap, Lipschitz embeddings.
- Use distances to reference objects (prototypes).

- Question: how do we directly optimize an embedding for nearest neighbor retrieval?
- FastMap & MetricMap assume Euclidean properties.
- SparseMap optimizes stress.
- Large stress may be inevitable when embedding non-metric spaces into a metric space.

- In practice often worse than random construction.

- BoostMap: A Method for Efficient Approximate Similarity Rankings. Athitsos, Alon, Sclaroff, and Kollios, CVPR 2004.
- BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval. Athitsos, Alon, Sclaroff, and Kollios, PAMI 2007 (to appear).

- Maximizes amount of nearest neighbor structure preserved by the embedding.
- Based on machine learning, not on geometric assumptions.
- Principled optimization, even in non-metric spaces.

- Can capture non-metric structure.
- Query-sensitive version of BoostMap.

- Better results in practice, in all datasets we have tried.

Ideal Embedding Behavior

[Figure: F maps the original space X into R^d; shown are a query q, its nearest neighbor a, and another database object b.]

For any query q: we want F(NN(q)) = NN(F(q)).

For any database object b besides NN(q), we want F(q) closer to F(NN(q)) than to F(b).

Embeddings Seen As Classifiers

For triples (q, a, b) such that:

- q is a query object

- a = NN(q)

- b is a database object

Classification task: is q closer to a or to b?

- Any embedding F defines a classifier F’(q, a, b).
- F’ checks if F(q) is closer to F(a) or to F(b).

Classifier Definition

- Given embedding F: X → R^d:
- F’(q, a, b) = ||F(q) – F(b)|| – ||F(q) – F(a)||.

- F’(q, a, b) > 0 means “q is closer to a.”
- F’(q, a, b) < 0 means “q is closer to b.”
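
The classifier induced by an embedding can be written directly from this definition. The sketch below assumes the Euclidean norm for ||·|| and takes already-embedded vectors as input; the function names are illustrative.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triple_classifier(fq, fa, fb, vec_dist=euclidean):
    """F'(q, a, b) computed from the embedded vectors F(q), F(a), F(b).

    Positive output: the embedding places q closer to a.
    Negative output: the embedding places q closer to b.
    """
    return vec_dist(fq, fb) - vec_dist(fq, fa)

score = triple_classifier((0.0, 0.0), (3.0, 4.0), (6.0, 8.0))
```

Only the sign of the output is used for classification; the magnitude can be read as a confidence.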

Key Observation

- If classifier F’ is perfect, then for every q, F(NN(q)) = NN(F(q)).
- If F(q) is closer to F(b) than to F(NN(q)), then triple (q, a, b) is misclassified.

- Classification error on triples (q, NN(q), b) measures how well F preserves nearest neighbor structure.

Optimization Criterion

- Goal: construct an embedding F optimized for k-nearest neighbor retrieval.
- Method: maximize accuracy of F’ on triples (q, a, b) of the following type:
- q is any object.
- a is a k-nearest neighbor of q in the database.
- b is in database, but NOT a k-nearest neighbor of q.

- If F’ is perfect on those triples, then F perfectly preserves k-nearest neighbors.
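
The criterion above can be sketched as two helpers: one that builds triples (q, a, b) from exact distances, and one that measures how often an embedding misclassifies them. All names are illustrative assumptions, ties (score 0) are counted as errors, and a real training set would sample triples rather than enumerate them.

```python
def make_triples(queries, database, dist, k):
    """Triples (q, a, b): a is a k-nearest neighbor of q, b is not."""
    triples = []
    for q in queries:
        ranked = sorted(database, key=lambda x: dist(q, x))
        for a in ranked[:k]:
            for b in ranked[k:]:
                triples.append((q, a, b))
    return triples

def triple_error(triples, embed, vec_dist):
    """Fraction of triples where F(q) is not strictly closer to F(a)."""
    wrong = 0
    for q, a, b in triples:
        fq = embed(q)
        score = vec_dist(fq, embed(b)) - vec_dist(fq, embed(a))
        if score <= 0:
            wrong += 1
    return wrong / len(triples)

# Toy data (hypothetical): points on a line, one reference object each.
absdiff = lambda x, y: abs(x - y)
l1 = lambda u, v: sum(abs(s - t) for s, t in zip(u, v))
triples = make_triples([3], [1, 2, 6, 9], absdiff, k=1)
error_ref0 = triple_error(triples, lambda x: (abs(x - 0),), l1)
error_ref4 = triple_error(triples, lambda x: (abs(x - 4),), l1)
```

In this toy case the reference object at 0 preserves all three triples, while the reference object at 4 maps 2 and 6 to the same value and so loses one of them.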

1D Embeddings as Weak Classifiers

- 1D embeddings define weak classifiers.
- Better than a random classifier (50% error rate).

[Figure: a 1D embedding of cities (Lincoln, Detroit, LA, Chicago, New York, Cleveland).]

- We can define lots of different classifiers.
- Every object in the database can be a reference object.

Question: how do we combine many such classifiers into a single strong classifier?

Answer: use AdaBoost.

- AdaBoost is a machine learning method designed for exactly this problem.

Using AdaBoost

[Figure: 1D embeddings F1, F2, …, Fn map the original space X to the real line.]

- Output: H = w1F’1 + w2F’2 + … + wdF’d.
- AdaBoost chooses 1D embeddings and weighs them.
- Goal: achieve low classification error.
- AdaBoost trains on triples chosen from the database.

AdaBoost output: H = w1F’1 + w2F’2 + … + wdF’d

What embedding should we use?

What distance measure should we use?

BoostMap embedding: F(x) = (F1(x), …, Fd(x)).

Distance measure:

$$D((u_1, \dots, u_d), (v_1, \dots, v_d)) = \sum_{i=1}^{d} w_i |u_i - v_i|$$

Claim: Let q be closer to a than to b. H misclassifies triple (q, a, b) if and only if, under distance measure D, F maps q closer to b than to a.

$$
\begin{aligned}
H(q, a, b) &= \sum_{i=1}^{d} w_i F'_i(q, a, b) \\
&= \sum_{i=1}^{d} w_i \left( |F_i(q) - F_i(b)| - |F_i(q) - F_i(a)| \right) \\
&= \sum_{i=1}^{d} \left( w_i |F_i(q) - F_i(b)| - w_i |F_i(q) - F_i(a)| \right) \\
&= D(F(q), F(b)) - D(F(q), F(a)) = F'(q, a, b)
\end{aligned}
$$

- AdaBoost optimizes a direct measure of embedding quality.
- We optimize an indexing structure for similarity-based retrieval using machine learning.
- Take advantage of training data.
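
The equivalence between the AdaBoost output H and the weighted L1 distance D can be checked numerically. The sketch below uses made-up weights and embedded vectors (assumptions for illustration only) and compares the two sides of the identity.

```python
def weighted_l1(u, v, w):
    """D(u, v) = sum_i w_i * |u_i - v_i|."""
    return sum(wi * abs(ui - vi) for wi, ui, vi in zip(w, u, v))

def adaboost_output(fq, fa, fb, w):
    """H(q, a, b) = sum_i w_i * F'_i(q, a, b), with 1D classifiers F'_i."""
    return sum(wi * (abs(qi - bi) - abs(qi - ai))
               for wi, qi, ai, bi in zip(w, fq, fa, fb))

# Hypothetical weights and embedded vectors F(q), F(a), F(b).
w = (0.5, 0.25, 0.25)
fq, fa, fb = (1.0, 4.0, 2.0), (2.0, 3.0, 2.5), (5.0, 0.0, 2.0)

h = adaboost_output(fq, fa, fb, w)
d_diff = weighted_l1(fq, fb, w) - weighted_l1(fq, fa, w)
```

Here `h` and `d_diff` agree, so a positive H means exactly that F(q) is closer to F(a) than to F(b) under D.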

Filter-and-refine retrieval:

- Offline step: compute embedding F of entire database.
- Given a query object q:
- Embedding step:
- Compute F(q) by measuring the distances from the query to the reference objects.

- Filter step:
- Find the top p matches of F(q) in the vector space.

- Refine step:
- Measure the exact distance from q to the top p matches.
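
The three online steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the filter step uses an unweighted L1 distance and a full sort, and all names and the toy data are hypothetical.

```python
def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def filter_and_refine(q, database, embedded_db, references, dist, p, k=1):
    """Approximate k-NN search using len(references) + p exact distances."""
    # Embedding step: exact distances from q to the reference objects.
    fq = [dist(q, r) for r in references]
    # Filter step: rank database objects by cheap vector distance, keep top p.
    order = sorted(range(len(database)), key=lambda i: l1(fq, embedded_db[i]))
    candidates = order[:p]
    # Refine step: exact distances only for the p candidates.
    refined = sorted(candidates, key=lambda i: dist(q, database[i]))
    return [database[i] for i in refined[:k]]

# Toy usage (hypothetical): points on a line, one reference object at 0.
absdiff = lambda a, b: abs(a - b)
database = [1, 4, 7, 10, 15]
references = [0]
embedded_db = [[absdiff(x, r) for r in references] for x in database]  # offline
result = filter_and_refine(8, database, embedded_db, references, absdiff, p=2)
```

The parameter p trades accuracy for speed: a larger p makes it more likely that the true nearest neighbor survives the filter step, at the cost of more exact distance computations.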

Evaluating Embedding Quality

How often do we find the true nearest neighbor?

What is the nearest neighbor classification error?

How many exact distance computations do we need?

Results on Hand Dataset

[Figure: a query image and its nearest neighbor in the database (80,640 images).]

Chamfer distance: 112 seconds per query.

Database: 80,640 synthetic images of hands.

Query set: 710 real images of hands.

- MNIST: 60,000 database objects, 10,000 queries.
- Shape context (Belongie 2001):
- 0.63% error, 20,000 distances, 22 minutes.
- 0.54% error, 60,000 distances, 66 minutes.

- Richer models.
- Capture non-metric structure.
- Better embedding quality.

- References:
- Athitsos, Hadjieleftheriou, Kollios, and Sclaroff, SIGMOD 2005.
- Athitsos, Hadjieleftheriou, Kollios, and Sclaroff, TODS, June 2007.

- A human is not similar to a horse.
- A centaur is similar both to a human and a horse.
- Triangle inequality is violated:
- Using human ratings of similarity (Tversky, 1982).
- Using k-median Hausdorff distance.

- Mapping to a metric space presents a dilemma:
- If D(F(centaur), F(human)) = D(F(centaur), F(horse)) = C, then D(F(human), F(horse)) <= 2C.

- Query-sensitive embeddings:
- Have the modeling power to preserve non-metric structure.

[Figure: the coordinates (q1, ..., qd) of the embedded query shown next to the coordinates (xi1, ..., xid) of each embedded database object xi.]

- How important is each coordinate in comparing embeddings?

F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))

F(Sacramento)    = ( 386, 1543, 2920)
F(Las Vegas)     = ( 262, 1232, 2405)
F(Oklahoma City) = (1345,  437, 1291)
F(Washington DC) = (2657, 1207,  853)
F(Jacksonville)  = (2422, 1344,  141)

General Intuition

[Figure: reference objects 1, 2, and 3 in the original space X.]

- Classifier: H = w1F’1 + w2F’2 + … + wjF’j.
- Observation: accuracy of weak classifiers depends on query.
- F’1 is perfect for (q, a, b) where q = reference object 1.
- F’1 is good for queries close to reference object 1.

- Question: how can we capture that?

Query-Sensitive Weak Classifiers

- V: area of influence (interval of real numbers).

$$
Q_{F,V}(q, a, b) =
\begin{cases}
F'(q, a, b) & \text{if } F(q) \in V \\
\text{“I don’t know”} & \text{if } F(q) \notin V
\end{cases}
$$

- If V includes all real numbers, QF,V = F’.

[Figure: reference objects 1, 2, ..., j in the original space X.]
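
A query-sensitive weak classifier can be sketched as a thin wrapper around a 1D classifier. In this illustration `None` stands for “I don’t know”, V is a closed interval given as a pair, and the function name is an assumption.

```python
import math

def query_sensitive_classifier(F, V, q, a, b):
    """Q_{F,V}(q, a, b) for a 1D embedding F and area of influence V."""
    lo, hi = V
    fq = F(q)
    if not (lo <= fq <= hi):
        return None  # "I don't know": q is outside the area of influence
    return abs(fq - F(b)) - abs(fq - F(a))

# Toy 1D embedding (hypothetical): distance to a reference object at 0.
F = lambda x: abs(x - 0)
inside = query_sensitive_classifier(F, (0, 5), 3, 2, 6)
outside = query_sensitive_classifier(F, (0, 5), 8, 7, 1)
everywhere = query_sensitive_classifier(F, (-math.inf, math.inf), 8, 7, 1)
```

With V covering the whole real line, the wrapper never abstains and reduces to the ordinary classifier F’.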

Applying AdaBoost

[Figure: 1D embeddings F1, F2, …, Fd map the original space X to the real line.]

- AdaBoost forms classifiers QFi,Vi.
- Fi: 1D embedding.
- Vi: area of influence for Fi.

- Output: H = w1 QF1,V1 + w2 QF2,V2 + … + wd QFd,Vd.

- Empirical observation:
- At late stages of the training, query-sensitive weak classifiers are still useful, whereas query-insensitive classifiers are not.

AdaBoost output:

$$H(q, a, b) = \sum_{i=1}^{d} w_i \, Q_{F_i,V_i}(q, a, b)$$

What embedding should we use?

What distance measure should we use?

BoostMap embedding: F(x) = (F1(x), …, Fd(x))

Distance measure:

$$D(F(q), F(x)) = \sum_{i=1}^{d} w_i \, S_{F_i,V_i}(q) \, |F_i(q) - F_i(x)|$$

- Distance measure is query-sensitive.
- Weighted L1 distance, weights depend on q.
- SF,V(q) = 1 if F(q) is in V, 0 otherwise.

- Reference objects: human, horse, centaur.
- For centaur queries, use weights (0,0,1).
- For human queries, use weights (1,0,0).

- Query-sensitive distances are non-metric.
- Combine efficiency of L1 distance and ability to capture non-metric structure.
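
The centaur example can be turned into a small numeric sketch of the query-sensitive distance. The original distances (10 between human and horse, 1 from the centaur to each), the unit weights, and the areas of influence are all made-up values chosen for illustration.

```python
def query_sensitive_l1(fq, fx, weights, areas):
    """D(F(q), F(x)) = sum_i w_i * S_i(q) * |F_i(q) - F_i(x)|."""
    total = 0.0
    for qi, xi, wi, (lo, hi) in zip(fq, fx, weights, areas):
        s = 1 if lo <= qi <= hi else 0  # S_{Fi,Vi}(q): is F_i(q) inside V_i?
        total += wi * s * abs(qi - xi)
    return total

# Hypothetical embeddings: distances to (human, horse, centaur).
F = {
    "human": (0, 10, 1),
    "horse": (10, 0, 1),
    "centaur": (1, 1, 0),
}
weights = (1, 1, 1)
# Each coordinate is only trusted for queries very close to its reference.
areas = [(0, 0.5), (0, 0.5), (0, 0.5)]

def d(query, x):
    return query_sensitive_l1(F[query], F[x], weights, areas)
```

For a centaur query only the third coordinate is active, matching the weights (0,0,1) above, and for a human query only the first, matching (1,0,0). The resulting distances violate the triangle inequality just as the original ones do: `d("human", "horse")` is 10, while `d("human", "centaur") + d("centaur", "horse")` is 2.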

- Capturing non-metric structure.
- Finding most informative reference objects for each query.
- Richer model overall.
- Choosing a weak classifier now also involves choosing an area of influence.

Dynamic Time Warping on Time Series

Database: 31,818 time series.

Query set: 1,000 time series.

Dynamic Time Warping on Time Series

Database: 32,768 time series.

Query set: 50 time series.

- Machine-learning method for optimizing embeddings.
- Explicitly maximizes amount of nearest neighbor structure preserved by embedding.
- Optimization method is independent of underlying geometry.
- Query-sensitive version can capture non-metric structure.

- BoostMap can significantly speed up nearest neighbor retrieval and classification.
- Useful in real-world datasets:
- Hand shape classification.
- Optical character recognition (MNIST, UNIPEN).

- In all four datasets, better results than other methods.
- In three benchmark datasets, better than methods custom-made for those distance measures.

- Domain-independent formulation.
- Distance measures are used as a black box.
- Application to proteins/DNA matching…
