Summer School on Hashing’14 Locality Sensitive Hashing


### Summer School on Hashing’14: Locality Sensitive Hashing

Alex Andoni

(Microsoft Research)

Nearest Neighbor Search (NNS)

Preprocess: a set P of n points in a d-dimensional space

Query: given a query point q, report a point p* ∈ P with the smallest distance to q

Approximate NNS

• c-approximate r-near neighbor: given a new point q, report a point p′ ∈ P with ‖p′ − q‖ ≤ cr if there exists a point p at distance ‖p − q‖ ≤ r
• Randomized: such a point is returned with 90% probability
Heuristic for Exact NNS

• c-approximate r-near neighbor: given a new point q, report a set S ⊆ P with
• all points p s.t. ‖p − q‖ ≤ r (each with 90% probability)
• S may contain some approximate neighbors p′ s.t. ‖p′ − q‖ ≤ cr
Locality-Sensitive Hashing

[Indyk-Motwani’98]

• Random hash function h on the point space s.t. for any points p, q:
• Close: when ‖p − q‖ ≤ r, the collision probability Pr[h(q) = h(p)] = P1 is “high” (“not-so-small”)
• Far: when ‖p − q‖ > cr, the collision probability Pr[h(q) = h(p)] = P2 is “small”
• quality: ρ = log(1/P1) / log(1/P2) < 1
• Use several hash tables
Locality sensitive hash functions
• Hash function g is usually a concatenation of k “primitive” functions: g(p) = ⟨h1(p), …, hk(p)⟩
• Example: Hamming space {0,1}^d
• h(p) = p_j, i.e., choose the j-th bit for a random coordinate j
• g chooses k bits at random
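A minimal sketch of this bit-sampling family in Python (the function names and the tiny example are illustrative, not from the talk):

```python
import random

def make_hamming_lsh(d, k, seed=None):
    """g(p) concatenates k 'primitive' hashes, each reading one random bit."""
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(k)]  # k random bit positions
    def g(p):
        return tuple(p[j] for j in coords)  # bucket key = the sampled bits
    return g

# Two points at Hamming distance 1 (out of d = 8) collide unless a sampled
# coordinate happens to be the differing one: Pr[g(p) = g(q)] = (1 - 1/8)^k.
g = make_hamming_lsh(d=8, k=3, seed=42)
p = (0, 1, 1, 0, 1, 0, 0, 1)
q = (0, 1, 1, 0, 1, 0, 0, 0)  # differs from p only in the last bit
```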
Full algorithm
• Data structure is just L hash tables:
• Each hash table uses a fresh random function g_i(p) = ⟨h_{i,1}(p), …, h_{i,k}(p)⟩
• Hash all dataset points into the table
• Query:
• Check for collisions in each of the L hash tables
• until we encounter a point within distance cr
• Guarantees:
• Space: O(nL), plus the space to store the points
• Query time: O(L · (k + d)) (in expectation)
• 50% probability of success.
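The full algorithm can be sketched end-to-end for Hamming space with bit-sampling hashes (a sketch; the helper names are ours, not from the talk):

```python
import random

def build_lsh_index(points, d, k, L, seed=0):
    """L hash tables; each uses a fresh g_i built from k random bit positions."""
    rng = random.Random(seed)
    tables = []
    for _ in range(L):
        coords = [rng.randrange(d) for _ in range(k)]
        table = {}
        for idx, p in enumerate(points):
            table.setdefault(tuple(p[j] for j in coords), []).append(idx)
        tables.append((coords, table))
    return tables

def lsh_query(tables, points, q, radius):
    """Scan the colliding buckets, table by table; stop at the first point
    within the given distance, as in the algorithm above."""
    for coords, table in tables:
        for idx in table.get(tuple(q[j] for j in coords), []):
            if sum(a != b for a, b in zip(points[idx], q)) <= radius:
                return idx
    return None

points = [(0, 0, 0, 0), (1, 1, 1, 1), (1, 1, 0, 0)]
tables = build_lsh_index(points, d=4, k=2, L=5)
```

With k and L set as in the analysis that follows, this gives the stated space and expected query time; each query succeeds with constant probability, boosted by repetitions.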
Analysis of LSH Scheme
• Choice of parameters k, L?
• L hash tables with hash functions g(p) = ⟨h1(p), …, hk(p)⟩
• set k s.t.
• Pr[collision of far pair] = P2^k = 1/n
• Pr[collision of close pair] = P1^k = (P2^k)^ρ = n^(−ρ)
• Hence L = n^ρ “repetitions” (tables) suffice!
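Plugging in the bit-sampling collision probabilities P1 = 1 − r/d and P2 = 1 − cr/d, the parameter recipe above can be computed directly (a sketch under those assumptions; variable names are ours):

```python
import math

def lsh_params(n, d, r, c):
    """Set k so that P2^k = 1/n, and use L = n^rho tables (Hamming bit sampling)."""
    p1 = 1.0 - r / d        # collision prob. of a close pair (distance <= r)
    p2 = 1.0 - c * r / d    # collision prob. of a far pair (distance >= cr)
    rho = math.log(1.0 / p1) / math.log(1.0 / p2)
    k = math.ceil(math.log(n) / math.log(1.0 / p2))  # so that P2^k ~ 1/n
    L = math.ceil(n ** rho)                          # number of hash tables
    return rho, k, L

rho, k, L = lsh_params(n=10**6, d=400, r=20, c=2)
# here rho falls below 1/c = 0.5, so well under sqrt(n) tables suffice
```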
Analysis: Correctness
• Let p* be an r-near neighbor of q
• If p* does not exist, the algorithm can output anything
• Algorithm fails when:
• the near neighbor p* is not in the searched buckets g1(q), …, gL(q)
• Probability of failure:
• Probability q, p* do not collide in one hash table: at most 1 − n^(−ρ)
• Probability they do not collide in L = n^ρ hash tables: at most (1 − n^(−ρ))^(n^ρ) ≤ 1/e
Analysis: Runtime
• Runtime dominated by:
• Hash function evaluation: O(L · k) time
• Distance computations to points in buckets
• Distance computations:
• Care only about far points, at distance > cr
• In one hash table, the probability a far point collides with the query is at most P2^k = 1/n
• Expected number of far points in a bucket: n · 1/n = 1
• Over L hash tables, the expected number of far points is L
• Total: O(L · (k + d)) = O(n^ρ · (k + d)) in expectation
NNS for Euclidean space

[Datar-Immorlica-Indyk-Mirrokni’04]

• Hash function g is a concatenation of k “primitive” functions: g(p) = ⟨h1(p), …, hk(p)⟩
• LSH function h(p) = ⌊(a · p + b) / w⌋:
• pick a random line (direction a), and quantize it into cells of width w
• project the point p onto a
• a is a random Gaussian vector
• b is random in [0, w]
• w is a parameter (e.g., 4)
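A sketch of this Gaussian-projection hash in Python, following the ⌊(a · p + b)/w⌋ form above (function name is ours):

```python
import math
import random

def make_euclidean_lsh(d, w=4.0, seed=None):
    """h(p) = floor((a . p + b) / w): project onto a random Gaussian
    direction a, shift by b uniform in [0, w), quantize into cells of width w."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random Gaussian vector
    b = rng.uniform(0.0, w)                      # random offset in [0, w)
    def h(p):
        return math.floor((sum(ai * pi for ai, pi in zip(a, p)) + b) / w)
    return h

h = make_euclidean_lsh(d=3, w=4.0, seed=7)
# nearby points usually land in the same cell; distant ones usually do not
```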
Optimal* LSH

[A-Indyk’06]

• Regular grid → grid of balls
• p can hit empty space, so take more such grids until p is in a ball
• Need (too) many grids of balls
• Start by projecting into dimension t
• Analysis gives ρ → 1/c² as t grows
• Choice of reduced dimension t? A trade-off between:
• the number of hash tables needed (a function of n), and
• the time to hash, t^O(t)
• Total query time: d · n^(1/c² + o(1))

Proof idea
• Claim: ρ = log(1/P(1)) / log(1/P(c)) → 1/c², where
• P(r) = probability of collision when ‖p − q‖ = r
• Intuitive proof:
• projection approximately preserves distances [JL]
• P(r) = intersection / union (of the two balls that can capture both points)
• P(r) ≈ probability that a random point u of one ball lands beyond the dashed line (in the intersection)
• Fact (high dimensions): the x-coordinate of u has a nearly Gaussian distribution

→ P(r) ≈ exp(−A · r²), hence ρ ≈ (A · 1²) / (A · c²) = 1/c²

LSH Zoo

(Slide figure: the phrases “to be or not to be” and “to Simons or not to Simons”, their word sets {be, not, or, to} and {not, or, to, Simons}, bit vectors …01111… and …11101…, and count vectors …01122… and …21102… — the set and vector representations that the hash families below operate on.)

• Hamming distance
• h: pick a random coordinate(s) [IM98]
• Manhattan distance (ℓ1):
• h: random grid [AI’06]
• Jaccard distance between sets: J(A, B) = 1 − |A ∩ B| / |A ∪ B|
• h: pick a random permutation π on the universe; h(A) = min over a ∈ A of π(a)

min-wise hashing [Bro’97, BGMZ’97]

• Angle: sim-hash [Cha’02, …]
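Min-wise hashing can be sketched on the slide’s own example (a sketch: permutations are shared between the two sets via a common seed; names are ours):

```python
import random

def minhash_signature(s, universe, num_hashes=200, seed=0):
    """One random permutation pi per hash; h(S) = min over x in S of pi(x)."""
    rng = random.Random(seed)
    universe = sorted(universe)
    sig = []
    for _ in range(num_hashes):
        perm = rng.sample(universe, len(universe))  # random permutation
        rank = {x: i for i, x in enumerate(perm)}   # pi(x) = position of x
        sig.append(min(rank[x] for x in s))
    return sig

A = {"be", "not", "or", "to"}
B = {"not", "or", "to", "Simons"}
U = A | B
sa = minhash_signature(A, U)
sb = minhash_signature(B, U)
# the fraction of matching hashes estimates the Jaccard
# similarity |A ∩ B| / |A ∪ B| = 3/5
match = sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Because Pr[h(A) = h(B)] equals the Jaccard similarity for each permutation, averaging over many permutations concentrates around 0.6 here.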
LSH in the wild

• If we want exact NNS, what is c?
• Can choose any parameters k, L (“safety not guaranteed”: fewer tables mean fewer false positives, but also a higher chance of missing the true nearest neighbor)
• Correct as long as the nearest neighbor still collides with the query in some table
• Performance:
• trade-off between the number of tables and false positives
• will depend on the dataset’s “quality”
• Can tune k, L to optimize for a given dataset

| query time | space |
| --- | --- |
| low (1 memory lookup) | high |
| medium | medium |
| high | low |

### LSH is tight… leave the rest to cell-probe lower bounds?

Data-dependent Hashing!

[A-Indyk-Nguyen-Razenshteyn’14]

• NNS in Hamming space ({0,1}^d) with query time n^ρ, and space and preprocessing n^(1+ρ), for ρ = 7/(8c) + o(1/c)
• optimal LSH: ρ = 1/c of [IM’98]
• NNS in Euclidean space (ℓ2) with: ρ = 7/(8c²) + o(1/c²)
• optimal LSH: ρ = 1/c² of [AI’06]
A look at LSH lower bounds
• LSH lower bounds in Hamming space
• Fourier analytic
• [Motwani-Naor-Panigrahy’06]
• holds for any distribution over hash functions
• Far pair: random, at distance cr = d/2
• Close pair: random, at distance r = d/(2c)
• Get a lower bound of Ω(1/c) on ρ; the tight ρ ≥ 1/c − o(1) is due to

[O’Donnell-Wu-Zhou’11]

Why not an NNS lower bound?
• Suppose we try to generalize [OWZ’11] to NNS
• Pick a random dataset
• All the “far points” end up concentrated in a small ball
• Easy to see at preprocessing: the actual near neighbor is close to the center of the minimum enclosing ball
Intuition
• Data-dependent LSH:
• hashing that depends on the entire given dataset!
• Two components:
• a “nice” geometric configuration, with a better exponent ρ
• a reduction from the general case to this “nice” geometric configuration
Nice Configuration: “sparsity”
• All points are on a sphere of radius R
• Random pairs of points are at distance ≈ R·√2
• Lemma 1: spherical LSH achieves ρ = 1/(2c² − 1)
• “Proof”:
• obtained via “cap carving”
• similar to “ball carving” [KMS’98, AI’06]
• Lemma 1′: the same holds for radius R = O(cr)
Reduction: into spherical LSH
• Idea: apply a few rounds of “regular” (data-independent) LSH
• ball carving [AI’06]
• with parameters set as if the target approximation were 5c ⇒ only a small contribution to the query time
• Intuitively:
• far points are unlikely to collide
• this partitions the data into buckets of small diameter
• find the minimum enclosing ball of each bucket
• finally apply spherical LSH on this ball!
Two-level algorithm
• L hash tables, each with:
• a hash function g(p) = ⟨f1(p), f2(p), …⟩
• the level-1 f’s are “ball carving LSH” (data independent)
• the level-2 f’s are “spherical LSH” (data dependent)
• the final exponent ρ is an “average” of the exponents from levels 1 and 2
Details
• Inside a bucket, we need to ensure the “sparse” configuration:
• 1) drop all “far pairs”
• 2) find the minimum enclosing ball (MEB)
• 3) partition by “sparsity” (distance from the center)
1) Far points
• In a level-1 bucket:
• set parameters as if looking for a coarser approximation
• the probability that a pair survives the level-1 rounds drops sharply with the pair’s distance (the slide tabulates it at distance r, at cr, and beyond)
• the expected number of surviving collisions between far pairs is constant
• throw out all pairs that violate this
2) Minimum Enclosing Ball
• Use Jung’s theorem:
• a point set with diameter Δ has an MEB of radius at most Δ/√2
3) Partition by “sparsity”
• Inside a bucket, points are at pairwise distance at most the bucket diameter
• “sparse” (spherical) LSH does not directly apply in this case
• need to partition by the distance to the center
• partition into spherical shells of width 1
Practice of NNS
• Data-dependent partitions…
• Practice:
• trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees…
• often no guarantees
• Theory?
• assuming more about the data: PCA-like algorithms “work” [Abdullah-A-Kannan-Krauthgamer’14]
Finale
• LSH: geometric hashing
• Data-dependent hashing can do better!
• Better upper bound?
• multi-level improves a bit, but not too much
• what is the best achievable exponent?
• Best partitions in theory and practice?
• LSH for other metrics:
• edit distance, Earth-Mover Distance, etc.
• Lower bounds (cell probe?)
Open question:
• Practical variant of ball-carving hashing?
• Design a space partition of R^t that is
• efficient: point location in poly(t) time
• qualitative: regions are “sphere-like”, i.e.,

Pr[needle of length 1 is not cut] ≥ (Pr[needle of length c is not cut])^(1/c²)