Summer School on Hashing’14: Locality Sensitive Hashing
Alex Andoni (Microsoft Research)

Nearest Neighbor Search (NNS)

Preprocess: a set P of n points

Query: given a query point q, report a point p ∈ P with the smallest distance to q

Approximate NNS

  • c-approximate r-near neighbor: given a new point q, report a point p ∈ P s.t. ||p − q|| ≤ cr if there exists a point at distance ≤ r from q
  • Randomized: a point is returned with 90% probability
Heuristic for Exact NNS

  • c-approximate r-near neighbor: given a new point q, report a set S of points:
    • S contains all points p s.t. ||p − q|| ≤ r (each with 90% probability)
    • S may also contain some approximate near neighbors p s.t. ||p − q|| ≤ cr
  • Can filter out bad answers by computing the true distances to the reported points
Locality-Sensitive Hashing

[Indyk-Motwani’98]

  • Random hash function h on R^d s.t. for any points p, q:
    • Close: when ||p − q|| ≤ r,
      Pr[h(q) = h(p)] = P1 is “high” (“not-so-small”)
    • Far: when ||p − q|| > cr,
      Pr[h(q) = h(p)] = P2 is “small”
  • Use several hash tables: n^ρ of them, where ρ = log(1/P1) / log(1/P2)

Locality sensitive hash functions
  • Hash function g is usually a concatenation of “primitive” functions: g(p) = (h1(p), h2(p), …, hk(p))
  • Example: Hamming space {0,1}^d
    • h(p) = p_j, i.e., choose the j-th bit for a random j
    • g(p) chooses k bits at random
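To make the bit-sampling example concrete, here is a minimal Python sketch (my illustration, not code from the talk; make_g and all names are made up):

```python
import random

def make_g(d, k, seed=None):
    """One concatenated hash g(p) = (p_{j1}, ..., p_{jk}):
    sample k random coordinates of a d-bit vector."""
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(k)]
    def g(p):                        # p: 0/1 sequence of length d
        return tuple(p[j] for j in coords)
    return g

# Points at Hamming distance D collide under one primitive h with
# probability 1 - D/d, hence under g with probability (1 - D/d)^k.
g = make_g(d=16, k=4, seed=0)
p = [0] * 16
q = list(p); q[3] = 1                # Hamming distance 1 from p
print(g(p) == g(q))                  # True with probability (1 - 1/16)^4
```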
Full algorithm
  • Data structure is just L = n^ρ hash tables:
    • Each hash table uses a fresh random function g(p) = (h1(p), …, hk(p))
    • Hash all dataset points into the table
  • Query:
    • Check for collisions in each of the L hash tables
    • until we encounter a point within distance cr
  • Guarantees:
    • Space: O(nL), plus space to store points
    • Query time: O(L·(k + d)) (in expectation)
    • 50% probability of success.
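A hedged end-to-end sketch of this data structure in Python (Hamming case, reusing make_g from the sketch above; the class and its names are illustrative, not the talk's code):

```python
class HammingLSH:
    def __init__(self, points, d, k, L):
        self.points = points
        self.gs = [make_g(d, k, seed=i) for i in range(L)]   # fresh g per table
        self.tables = []
        for g in self.gs:
            table = {}
            for i, p in enumerate(points):                   # hash all points
                table.setdefault(g(p), []).append(i)
            self.tables.append(table)

    def query(self, q, r, c):
        """Scan colliding points table by table; stop at the first
        point within distance cr of q."""
        for g, table in zip(self.gs, self.tables):
            for i in table.get(g(q), []):
                if sum(x != y for x, y in zip(self.points[i], q)) <= c * r:
                    return self.points[i]
        return None
```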
Analysis of LSH Scheme

  • Choice of parameters?
    • L hash tables with g = (h1, …, hk)
    • set k s.t. P2^k = 1/n
  • Pr[collision of far pair] = P2^k = 1/n
  • Pr[collision of close pair] = P1^k = (P2^k)^ρ = n^(−ρ)
  • Hence L = n^ρ “repetitions” (tables) suffice!
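Written out in LaTeX, the standard [IM’98] calculation behind these bullets:

```latex
k = \frac{\log n}{\log(1/P_2)} \;\Rightarrow\; P_2^{k} = \frac{1}{n},
\qquad
P_1^{k} = \left(P_2^{k}\right)^{\frac{\log(1/P_1)}{\log(1/P_2)}} = n^{-\rho},
\qquad
\rho = \frac{\log(1/P_1)}{\log(1/P_2)} .
```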
Analysis: Correctness
  • Let p* be an r-near neighbor of q
    • If p* does not exist, the algorithm can output anything
  • Algorithm fails when:
    • the near neighbor p* is not in the searched buckets g1(q), …, gL(q)
  • Probability of failure:
    • Probability q, p* do not collide in one hash table: ≤ 1 − P1^k = 1 − n^(−ρ)
    • Probability they do not collide in any of the L = n^ρ hash tables: at most (1 − n^(−ρ))^(n^ρ) ≤ 1/e
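Spelled out as one chain (this is where the constant success probability in the guarantees comes from):

```latex
\Pr[\text{failure}] \le \left(1 - P_1^{k}\right)^{L}
= \left(1 - n^{-\rho}\right)^{n^{\rho}}
\le e^{-1} \approx 0.37,
```

so each query succeeds with constant probability (at least the 50% claimed earlier), which can be amplified by independent repetition.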
Analysis: Runtime
  • Runtime dominated by:
    • Hash function evaluation: O(L·k) time
    • Distance computations to points in buckets
  • Distance computations:
    • Care only about far points, at distance > cr
    • In one hash table:
      • Probability a far point collides is at most P2^k = 1/n
      • Expected number of far points in a bucket: n · (1/n) = 1
    • Over L hash tables, the expected number of far points is L
  • Total: n^ρ · O(d + log n) in expectation
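As one line (using k = O(log n)):

```latex
\mathbb{E}[\text{query time}]
= \underbrace{O(L \cdot k)}_{\text{hashing}}
+ \underbrace{O(L)\cdot O(d)}_{\text{distance computations}}
= n^{\rho}\cdot O(d + \log n).
```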
NNS for Euclidean space

[Datar-Immorlica-Indyk-Mirrokni’04]

  • Hash function g is a concatenation of “primitive” functions: g(p) = (h1(p), …, hk(p))
  • LSH function h:
    • pick a random line ℓ, and quantize
    • project the point onto ℓ: h(p) = ⌊(a·p + b) / w⌋
      • a is a random Gaussian vector
      • b is random in [0, w)
      • w is a parameter (e.g., 4)
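A minimal numpy sketch of this primitive (my illustration; the function names are made up):

```python
import numpy as np

def make_euclidean_h(d, w=4.0, rng=None):
    """One [DIIM'04] primitive: h(p) = floor((a . p + b) / w)."""
    rng = rng or np.random.default_rng()
    a = rng.standard_normal(d)          # random Gaussian vector (the random line)
    b = rng.uniform(0.0, w)             # random offset in [0, w)
    return lambda p: int(np.floor((a @ p + b) / w))

rng = np.random.default_rng(0)
hs = [make_euclidean_h(32, rng=rng) for _ in range(10)]   # k = 10 primitives
g = lambda p: tuple(h(p) for h in hs)                     # concatenated hash
print(g(np.zeros(32)))
```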
Optimal* LSH

[A-Indyk’06]

  • Regular grid → grid of balls
    • p can hit empty space, so take more such grids until p is in a ball
  • Need (too) many grids of balls
    • Start by projecting into dimension t (R^d → R^t)
  • Analysis gives ρ = 1/c² + O(log t / √t)
  • Choice of reduced dimension t?
    • Tradeoff between
      • # hash tables, n^ρ, and
      • time to hash, t^O(t)
    • Total query time: d·n^(1/c² + o(1))
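A toy numpy illustration of the “grid of balls” idea (a heavy simplification, not the actual [AI’06] partition, which also involves the dimension reduction and careful parameter choices; spacing 4R is arbitrary here):

```python
import numpy as np

def ball_grid_hash(p, R, seed=0, max_grids=100000):
    """Try randomly shifted grids (spacing 4R) of balls of radius R until p
    lands inside a ball; hash = (index of that grid, ball's cell).
    The seed fixes the shared sequence of grids for all points."""
    rng = np.random.default_rng(seed)
    for u in range(max_grids):
        shift = rng.uniform(0.0, 4 * R, size=len(p))
        cell = np.round((p - shift) / (4 * R))    # nearest ball center's cell
        if np.linalg.norm(p - (cell * 4 * R + shift)) <= R:
            return (u, tuple(int(c) for c in cell))
    raise RuntimeError("no covering ball found; increase max_grids")

print(ball_grid_hash(np.array([0.3, -0.2]), R=1.0))
```

Points close to each other tend to fall into the same ball of the same (early) grid; the expected number of grids needed grows like t^O(t), which is the cost in the tradeoff above.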

Proof idea

  • Claim: ρ = 1/c², i.e., P(r) ≥ P(cr)^(1/c²)
    • P(r) = probability of collision when ||p − q|| = r
  • Intuitive proof:
    • The projection approximately preserves distances [JL]
    • P(r) = intersection / union of the two balls around p and q
    • P(r) ≈ probability that a random point u in one ball falls beyond the dashed line (the bisecting hyperplane)
    • Fact (high dimensions): the x-coordinate of u has a nearly Gaussian distribution
      → P(r) ≈ exp(−A·r²)
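The last fact gives the claim directly:

```latex
P(r) \approx e^{-A r^{2}}
\;\Rightarrow\;
P(cr) \approx e^{-A c^{2} r^{2}} = P(r)^{c^{2}}
\;\Rightarrow\;
\rho = \frac{\log(1/P(r))}{\log(1/P(cr))} = \frac{1}{c^{2}} .
```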

LSH Zoo

[Example: the sentences “To be or not to be” and “To Simons or not to Simons” over the vocabulary (be, to, Simons, or, not), as word sets {be,not,or,to} and {not,or,to,Simons}, as 0/1 incidence vectors …01111… and …11101…, and as count vectors …01122… and …21102…]

  • Hamming distance:
    • h: pick a random coordinate(s) [IM98]
  • Manhattan distance:
    • h: random grid [AI’06]
  • Jaccard distance between sets: J(A, B) = 1 − |A ∩ B| / |A ∪ B|
    • h: pick a random permutation π on the universe; h(A) = min_{a ∈ A} π(a), so Pr[h(A) = h(B)] = |A ∩ B| / |A ∪ B| (min-wise hashing [Bro’97, BGMZ’97])
  • Angle: sim-hash [Cha’02, …]
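Hedged Python sketches of the last two zoo members (illustrative implementations, not code from the talk), reusing the example sets from the figure above:

```python
import random

def minhash(A, universe, seed=0):
    """Jaccard LSH [Bro'97, BGMZ'97]: h(A) = element of A with the smallest
    random priority (equivalent to a random permutation of the universe).
    Pr[h(A) == h(B)] = |A & B| / |A | B|."""
    rng = random.Random(seed)
    priority = {x: rng.random() for x in sorted(universe)}
    return min(A, key=lambda x: priority[x])

def simhash_bit(p, a):
    """One bit of sim-hash [Cha'02]: the side of a random hyperplane a.
    Pr[bit agrees for p, q] = 1 - angle(p, q) / pi."""
    return int(sum(x * y for x, y in zip(a, p)) >= 0)

A = {"be", "not", "or", "to"}
B = {"not", "or", "to", "Simons"}
print(minhash(A, A | B), minhash(B, A | B))  # equal with prob. 3/5 over the permutation
```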
LSH in the wild

  • If we want exact NNS, what is c?
    • Can choose any parameters k, L (“safety not guaranteed”)
    • Correct as long as the near neighbor still collides with 90% probability
    • Performance:
      • trade-off between # tables and false positives: fewer tables ↔ more false positives
      • will depend on dataset “quality”
      • Can tune k, L to optimize for a given dataset
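One way to read the tuning bullet: estimate the primitive collision probability P1 on a sample of near pairs from the dataset, then pick the smallest L that keeps the failure probability below 10%. A rough sketch (illustrative; p1 is assumed to be such an empirical estimate):

```python
import math

def pick_L(p1, k, fail=0.1):
    """Smallest number of tables L with (1 - p1**k)**L <= fail."""
    return math.ceil(math.log(fail) / math.log(1.0 - p1 ** k))

print(pick_L(p1=0.9, k=10))   # -> 6 tables for this (made-up) estimate
```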
Time-Space Trade-offs

query time             space
low (1 mem lookup)     high
medium                 medium
high                   low

Data-dependent Hashing!

[A-Indyk-Nguyen-Razenshteyn’14]

  • NNS in Hamming space ({0,1}^d) with query time n^ρ, space and preprocessing n^(1+ρ), for ρ ≈ 7/(8c)
    • optimal LSH: ρ = 1/c of [IM’98]
  • NNS in Euclidean space (ℓ2) with: ρ ≈ 7/(8c²)
    • optimal LSH: ρ = 1/c² of [AI’06]
A look at LSH lower bounds
  • LSH lower bounds in Hamming space
    • Fourier analytic
  • [Motwani-Naor-Panigrahy’06]:
    • holds for any distribution over hash functions
    • Far pair: random points, at distance ≈ d/2
    • Close pair: random pair at distance d/(2c)
    • Get ρ ≥ 1/(2c), roughly
  • [O’Donnell-Wu-Zhou’11]: strengthened to ρ ≥ 1/c − o(1), matching the [IM’98] upper bound

Why not NNS lower bound?
  • Suppose we try to generalize [OWZ’11] to NNS
    • Pick a random instance as above
    • All the “far points” are concentrated in a small ball
    • Easy to see at preprocessing: the actual near neighbor is close to the center of the minimum enclosing ball
Intuition
  • Data-dependent LSH:
    • hashing that depends on the entire given dataset!
  • Two components:
    • a “nice” geometric configuration with a better exponent ρ
    • a reduction from the general case to this “nice” geometric configuration
Nice Configuration: “sparsity”
  • All points are on a sphere of radius R
    • Random pairs of points are at distance ≈ R√2 (nearly orthogonal)
  • Lemma 1: “spherical LSH” beats the 1/c² exponent in this configuration
  • “Proof”:
    • Obtained via “cap carving”
    • Similar to “ball carving” [KMS’98, AI’06]
  • Lemma 1’: the same holds for points inside a ball of radius O(cr)
Reduction: into spherical LSH
  • Idea: apply a few rounds of “regular” LSH first
    • Ball carving [AI’06]
    • with parameters set as if the target approximation were 5c ⇒ this step contributes only n^(1/(5c)²) = n^(1/(25c²)) to the query time
  • Intuitively:
    • far points are unlikely to collide
    • this partitions the data into buckets of small diameter
    • find the minimum enclosing ball of each bucket
    • finally apply spherical LSH on this ball!
Two-level algorithm
  • L hash tables, each with:
    • hash function g(p) = (g1(p), g2(p))
    • the g1’s are “ball carving LSH” (data independent)
    • the g2’s are “spherical LSH” (data dependent)
    • The final exponent ρ is an “average” of the exponents from levels 1 and 2
Details
  • Inside a bucket, we need to ensure the “sparse” case holds:
    • 1) drop all “far pairs”
    • 2) find the minimum enclosing ball (MEB)
    • 3) partition by “sparsity” (distance from the center)
1) Far points
  • In a level-1 bucket:
    • Set parameters as if looking for approximation 5c
    • The probability that a pair survives (collides at level 1) decreases with its distance: highest at distance r, lower at distance cr, and roughly 1/n at distance 5cr
    • So the expected number of collisions with points at distance ≥ 5cr is constant
      • Throw out all pairs that violate this
2) Minimum Enclosing Ball
  • Use Jung’s theorem:
    • A point set with diameter D has a MEB of radius at most D/√2
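For reference, the classical bound being invoked:

```latex
\text{(Jung)}\quad
R_{\mathrm{MEB}} \;\le\; D\,\sqrt{\frac{d}{2(d+1)}} \;<\; \frac{D}{\sqrt{2}}
\quad\text{for any set of diameter } D \text{ in } \mathbb{R}^{d}.
```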
3) Partition by “sparsity”
  • Inside a bucket, points are within a bounded distance of each other
  • “Sparse” LSH does not work directly in this case: the points lie inside the MEB, not on its surface
    • Need to partition by the distance to the center
    • Partition into spherical shells of width 1
Practice of NNS
  • Data-dependent partitions…
  • Practice:
    • Trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees…
    • often with no guarantees
  • Theory?
    • assuming more about the data: PCA-like algorithms “work” [Abdullah-A-Kannan-Krauthgamer’14]
Finale
  • LSH: geometric hashing
  • Data-dependent hashing can do better!
    • Better upper bound?
      • Multi-level improves a bit, but not too much
      • what is the best achievable exponent ρ?
  • Best partitions in theory and practice?
  • LSH for other metrics:
    • Edit distance, Earth-Mover Distance, etc.
  • Lower bounds (cell probe?)
Open question:
  • Practical variant of ball-carving hashing?
  • Design a space partition of R^t that is
    • efficient: point location in poly(t) time
    • qualitative: regions are “sphere-like”, i.e.,
      Pr[needle of length 1 is not cut] ≥ Pr[needle of length c is not cut]^(1/c²)