Compact data representations and their applications
Download
1 / 50

Compact Data Representations and their Applications - PowerPoint PPT Presentation


  • 399 Views
  • Updated On :

Compact Data Representations and their Applications. Moses Charikar Princeton University. sketch. complex object. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 0. Sketching Paradigm. Construct compact representation ( sketch) of data such that

Related searches for Compact Data Representations and their Applications

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Compact Data Representations and their Applications' - Solomon


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Compact data representations and their applications l.jpg

Compact Data Representations and their Applications

Moses Charikar

Princeton University


Sketching paradigm l.jpg

sketch

complex object

0

1

0

1

1

0

0

1

1

0

0

0

1

0

1

1

0

0

1

0

Sketching Paradigm

  • Construct compact representation (sketch) of data such that

  • Interesting functions of data can be computed from compact representation

estimated


Why care about compact representations l.jpg
Why care about compact representations ?

  • Practical motivations

    • Algorithmic techniques for massive data sets

    • Compact representations lead to reduced space, time requirements

    • Make impractical tasks feasible

  • Theoretical Motivations

    • Interesting mathematical problems

    • Connections to many areas of research


Questions l.jpg
Questions

  • What is the data ?

  • What functions do we want to compute on the data ?

  • How do we estimate functions on the sketches ?

  • Different considerations arise from different combinations of answers

  • Compact representation schemes are functions of the requirements


What is the data l.jpg
What is the data ?

  • Sets, vectors, points in Euclidean space, points in a metric space, vertices of a graph.

  • Mathematical representation of objects (e.g. documents, images, customer profiles, queries).


What functions do we want to compute on the data l.jpg
What functions do we want to compute on the data ?

  • Local functions : pairs of objectse.g. distance between objects

  • Sketch of each object, such that function can be estimated from pairs of sketches

  • Global functions : entire data sete.g. statistical properties of data

  • Sketch of entire data set, ability to update, combine sketches


Local functions distance similarity l.jpg
Local functions: distance/similarity

  • Distance is a general metric, i.e satisfies triangle inequality

  • Normed spacex = (x1, x2, …, xd) y = (y1, y2, …, yd)

  • Other special metrics (e.g. Earth Mover Distance)


Estimating distance from sketches l.jpg
Estimating distance from sketches

  • Arbitrary function of sketches

    • Information theory, communication complexity question.

  • Sketches are points in normed space

    • Embedding original distance function in normed space. [Bourgain ’85] [Linial,London,Rabinovich ’94]

  • Original metric is (same) normed space

    • Original data points are high dimensional

    • Sketches are points low dimensions

    • Dimension reduction in normed spaces[Johnson Lindenstrauss ’84]


Global functions l.jpg
Global functions

  • Statistical properties of entire data set

  • Frequency moments

  • Sortedness of data

  • Set membership

  • Size of join of relations

  • Histogram representation

  • Most frequent items in data set

  • Clustering of data


Streaming algorithms l.jpg
Streaming algorithms

  • Perform computation in one (or constant) pass(es) over data using a small amount of storage space

  • Availability of sketch function facilitates streaming algorithm

  • Additional requirements - sketch should allow:

    • Update to incorporate new data items

    • Combination of sketches for different data sets

storage

input


Goals l.jpg
Goals

  • Glimpse of sketching techniques, especially in geometric settings.

  • Basic theoretical ideas, no messy details

  • Concrete application


Talk outline l.jpg
Talk Outline

  • Dimension reduction

  • Similarity preserving hash functions

    • sketching vector norms

    • sketching Earth Mover Distance (EMD)

  • Application to image retrieval


Low distortion embeddings l.jpg

f

http://humanities.ucsd.edu/courses/kuchtahum4/pix/earth.jpg

http://www.physast.uga.edu/~jss/1010/ch10/earth.jpg

Low Distortion Embeddings

  • Given metric spaces (X1,d1) & (X2,d2),embedding f: X1 X2 has distortion D if ratio of distances changes by at most D

  • “Dimension Reduction” –

    • Original space high dimensional

    • Make target space be of “low” dimension, while maintaining small distortion


Dimension reduction in l 2 l.jpg
Dimension Reduction in L2

  • n points in Euclidean space (L2 norm) can be mapped down to O((log n)/2) dimensions with distortion at most 1+.[Johnson Lindenstrauss ‘84]

  • Two interesting properties:

    • Linear mapping

    • Oblivious – choice of linear mapping does not depend on point set

    • Quite simple [JL84, FM88, IM98, DG99, Ach01]: Even a random +1/-1 matrix works…

  • Many applications…


Dimension reduction for l 1 l.jpg
Dimension reduction for L1

  • [C,Sahai ‘02]Linear embeddings are not good for dimension reduction in L1

  • There exist O(n) points in L1in n dimensions, such that any linear mapping with distortion  needs n/2dimensions


Dimension reduction for l 116 l.jpg
Dimension reduction for L1

  • [C, Brinkman ‘03]Strong lower boundsfor dimension reduction in L1

  • There exist n points in L1, such that anyembedding with constant distortion  needs n1/2dimensions

  • Simpler proof by [Lee,Naor ’04]

  • Does not rule out other sketching techniques


Talk outline17 l.jpg
Talk Outline

  • Dimension reduction

  • Similarity preserving hash functions

    • sketching vector norms

    • sketching Earth Mover Distance (EMD)

  • Application to image retrieval


Similarity preserving hash functions l.jpg

sketch

complex object

0

1

0

1

1

0

0

1

1

0

0

0

1

0

1

1

0

0

1

0

Similarity Preserving Hash Functions

  • Similarity function sim(x,y), distance d(x,y)

  • Family of hash functions F with probability distribution such that


Applications l.jpg
Applications

  • Compact representation scheme for estimating similarity

  • Approximate nearest neighbor search [Indyk,Motwani ’98] [Kushilevitz,Ostrovsky,Rabani ‘98]


Relaxations of sph l.jpg
Relaxations of SPH

  • Estimate distance measure, not similarity measure in [0,1].

  • Measure E[f(h(x),h(y)].

  • Estimator will approximate distance function.


Sketching set similarity minwise independent permutations l.jpg
Sketching Set Similarity:Minwise Independent Permutations

[Broder,Manasse,Glassman,Zweig ‘97]

[Broder,C,Frieze,Mitzenmacher ‘98]


Sketching l 1 l.jpg
Sketching L1

  • Design sketch for vectors to estimate L1 norm

  • Hash function to distinguish between small and large distances [KOR ’98]

    • Map L1 to Hamming space

    • Bit vectors a=(a1,a2,…,an) and b=(b1,b2,…,bn)

    • Distinguish between distances  (1-ε)n/k versus  (1+ε)n/k

    • XOR random set of k bits

    • Pr[h(a)=h(b)] differs by constant in two cases


Sketching l 1 via stable distributions l.jpg
Sketching L1 via stable distributions

  • a=(a1,a2,…,an) and b=(b1,b2,…,bn)

  • Sketching L2

    • f(a) = Σi ai Xi f(b) = Σi bi XiXi independent Gaussian

    • f(a)-f(b) has Gaussian distribution scaled by |a-b|2

    • Form many coordinates, estimate |a-b|2 by taking L2 norm

  • Sketching L1

    • f(a) = Σi ai Xi f(b) = Σi bi XiXi independent Cauchy distributed

    • f(a)-f(b) has Cauchy distribution scaled by |a-b|1

    • Form many coordinates, estimate |a-b|1 by taking median[Indyk ’00] -- streaming applications



Bipartite bichromatic matching l.jpg

0.2

0.3

0.2

0.1

0.4

0.2

0.2

0.1

0.2

0.4

0.2

0.1

0.2

0.1

0.1

Bipartite/Bichromatic matching

  • Minimum cost matching between two sets of points.

  • Point weights  multiple copies of points

Fast estimation of bipartite matching [Agarwal,Varadarajan ’04]

Goal: Sketch point set to enable estimation of min cost matching


Detour classification with pairwise relationships kleinberg tardos 99 l.jpg

separation cost

Assignment cost

Detour: Classification with pairwise relationships [Kleinberg,Tardos ’99]

we


Lp relaxation and rounding l.jpg

P

Q

Separation cost measured byEMD(P,Q)

Rounding algorithm guarantees

E[d(h(P),h(Q)]  O(log n) EMD(P,Q) [C ’02]

LP Relaxation and Rounding

[Kleinberg,Tardos ’99]

[Chekuri,Khanna,Naor,Zosin ‘01]


Approximating metrics by trees l.jpg

Single tree may have high distortion

Use probability distribution over trees

d(u,v)  E[dT(u,v)]  O(log n) d(u,v)

[Bartal ’96,’98, FRT ’03]

Approximating metrics by trees


Emd on trees embedding into l 1 l.jpg

T

T

v(P) = {ℓTwT(P)}T

v(Q) = {ℓTwT(Q)}T

EMD(P,Q) =|v(P)-v(Q)|1

EMD on trees: embedding into L1

[suggested by

Piotr Indyk]

|wT(P)-wT(Q)|

EMD(P,Q) = ΣTℓT|wT(P)-wT(Q)|


Emd on general metrics l.jpg
EMD on general metrics

  • Approximate metric by probability distribution on trees

  • Sample tree from distribution and compute L1representation

  • EMD(P,Q)  E[d(v(P),v(Q))]  O(log n) EMD(P,Q)


Tree approximations for euclidean points l.jpg
Tree approximations for Euclidean points

distortion O(d log Δ) [Bartal ’96, CCGGP ’98]

proposed by [Indyk,Thaper ’03] for estimating EMD


Talk outline32 l.jpg
Talk Outline

  • Dimension reduction

  • Similarity preserving hash functions

    • sketching vector norms

    • sketching Earth Mover Distance (EMD)

  • Application to image retrieval


Motivation l.jpg
Motivation

  • Apply sketching techniques in complex setting

  • Compact data structures for high-quality and efficient image retrieval ?

  • Evaluate effectiveness of sketching techniques

  • [Lv,C,Li ’04]


Region based image retrieval rbir l.jpg
Region Based Image Retrieval (RBIR)

  • Region representation

    • color (histogram, moments, fourier coefficients)

    • position, shape

  • Region based image similarity measure

    • Independent best match [Blobworld, NETRA]

      • each region in one matched to best region in other

    • One-to-one match [Windsurf, WALRUS]

      • one-to-one matching between two sets of regions

    • EMD match


Overview l.jpg

Pre-processing

Query time

Overview


Region representation l.jpg
Region Representation

  • Color moments

    • First three moments in HSV color space

       9-D vector

  • Bounding box

    • Aspect ratio

    • Bounding box size

    • Area ratio

    • Region centroid

    •  5-D vector

  • Weighted L1 distance


Addressing problems with emd match l.jpg
Addressing problems with EMD match

  • Region weights: proportional to region size?

    • Big background has disproportionate effect

  • Ground distance: region distance?

    • Pair of different regions can have large effect

    • distance meaningless beyond certain point

  • EMD*-match

    • Region weights = Square root of region size

    • Ground distance= Thresholded region distance


Compact region representation l.jpg

x1

y1

x = (x1,x2,x3,x4)

x2

y2

y = (y1,y2,y3,y4)

y3

x3

x4

y4

0

1

Compact region representation

  • 14D region vectors → NK bit vectors

    • hamming distance  (weighted) L1 distance

  • XOR groups of K bits → N bit vector

    • hamming distance  thresholded L1 distance


Thresholding distance by xoring bits l.jpg
Thresholding distance by XORing bits

Number of bits XORed control shape of flattened curve


Emd embedding combining region vectors l.jpg
EMD embedding:combining region vectors

  • Pick random pattern (bit positions and bit values)

  • Add region weights matching pattern

  • M such patterns: M coordinates of image vector

  • Related to [Indyk Thaper 04]


Evaluation criteria l.jpg
Evaluation Criteria

  • Effectiveness of EMD* match

  • Compactness of data structureswithout compromising quality

  • Efficiency and effectiveness of embedding and filtering algorithm


Evaluation methodology l.jpg
Evaluation Methodology

  • 10,000 images

  • 32 queries with similar images identifiedhttp://dbvis.inf.uni-konstanz.de/research/projects/SimSearch/effpics.html

  • Segmentation via JSEGavg. #regions = 7.16, min = 1, max = 57

  • Effectiveness measured by average precisionAverage of precision values at positions of k target images


Effectiveness of emd match l.jpg
Effectiveness of EMD* Match

SIMPLIcity: avg. precision = 0.331




Effect of number of random patterns raw image vector size l.jpg
Effect of number of random patterns(raw image vector size)


Search quality effect of image bit vector size and filtering l.jpg
Search quality: Effect of image bit vector size and filtering


Average query time effect of image bit vector size and filtering l.jpg
Average query time: Effect of image bit vector size and filtering


Conclusions l.jpg
Conclusions

  • Compact representations at the heart of several algorithmic techniques for large data sets

    • Compact representations tailored to applications

    • Effective for region based image retrieval

  • Many interesting research questions

    • sketching EMD over points in R2

    • upper bounds for dimension reduction in L1


ad