
Sketching and Nearest Neighbor Search (2)


Presentation Transcript


  1. Sketching and Nearest Neighbor Search (2) Alex Andoni (Columbia University) MADALGO Summer School on Streaming Algorithms 2015

  2. Sketching • short bit-strings sk(x), sk(y) • given sk(x) and sk(y), should be able to estimate some function of x and y • ℓ1, ℓ2 norms: O(1/ε²) words • Decision version: given r in advance, distinguish ||x − y|| ≤ r from ||x − y|| > (1 + ε)r • ℓ1, ℓ2 norms: O(1/ε²) bits

  3. Sketching: decision version • Consider Hamming space: x, y ∈ {0,1}^d • Lemma: for any ε > 0, can achieve an O(1/ε²)-bit sketch.
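[Note: the following is a minimal illustrative sketch in Python, not necessarily the construction from the original lemma. Each sketch bit is the parity of x restricted to a random subset, where each coordinate is kept with probability p (think p ≈ 1/(2r) for distance threshold r). Two sketch bits differ with probability (1 − (1 − 2p)^Δ)/2 for Hamming distance Δ, which is monotone in Δ, so a threshold test on O(1/ε²) bits separates Δ ≤ r from Δ > (1 + ε)r.]

```python
import random

def hamming_sketch(x, num_bits, p, seed=0):
    """One parity bit per repetition: keep each coordinate of x with
    probability p and XOR the kept bits together.  The shared seed makes
    the sampled subsets identical when sketching x and y."""
    bits = []
    for rep in range(num_bits):
        rng = random.Random(seed * 1_000_003 + rep)
        bits.append(sum(xi for xi in x if rng.random() < p) % 2)
    return bits

def differ_fraction(sx, sy):
    """Fraction of differing sketch bits; its expectation,
    (1 - (1 - 2p)**Delta) / 2, grows monotonically with the Hamming
    distance Delta, so comparing it to a threshold decides the promise
    problem with num_bits = O(1/eps**2) repetitions (Chernoff)."""
    return sum(a != b for a, b in zip(sx, sy)) / len(sx)
```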

  4. Nearest Neighbor Search (NNS) • Preprocess: a set P of n points • Query: given a query point q, report a point p* ∈ P with the smallest distance to q

  5. Motivation • Generic setup: • Points model objects (e.g. images) • Distance models (dis)similarity measure • Application areas: • machine learning: k-NN rule • image/video/music recognition, deduplication, bioinformatics, etc… • Distance can be: • Hamming, Euclidean, … • Primitive for other problems: • find the similar pairs, clustering…

  6. Curse of Dimensionality • All exact algorithms degrade rapidly with the dimension

  7. Approximate NNS • c-approximate r-near neighbor: given a query point q, report a point p ∈ P s.t. ||p − q|| ≤ cr • assuming there is a point within distance r of q • Practice: use for exact NNS • Filtering: gives a set of candidates (hopefully small)

  8. NNS algorithms Dependence on dimension: • Exponential [Arya-Mount’93], [Clarkson’94], [Arya-Mount-Netanyahu-Silverman-Wu’98], [Kleinberg’97], [Har-Peled’02], [Arya-Fonseca-Mount’11], … • Linear/polynomial [Kushilevitz-Ostrovsky-Rabani’98], [Indyk-Motwani’98], [Indyk’98, ’01], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-Regev’04], [Panigrahy’06], [Ailon-Chazelle’06], [A.-Indyk’06], [A.-Indyk-Nguyen-Razenshteyn’14], [A.-Razenshteyn’15]

  9. NNS via sketching: Approach 1 • Boosted sketch: • Let sk = sketch for the decision version (90% success probability) • new sketch sk′: O(log n) copies of sk; the estimator is the median of the O(log n) estimators • Sketch size: O((1/ε²) · log n) bits • Success probability: 1 − 1/poly(n) (Chernoff bound) • Preprocess: compute sketches sk′(p) for all the points p • Query: compute sketch sk′(q), and compute the distance to all points using sketches • Time: improved from O(nd) to O(n · (1/ε²) · log n)
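[Note: a toy illustration of the boosting step; the names and the noisy base estimator are stand-ins. Running independent copies and taking the median turns a 90%-reliable estimator into one that fails with probability exponentially small in the number of copies: the median is off only when more than half the copies are off.]

```python
import random
import statistics

def boosted_estimate(x, y, base_estimator, copies):
    """Median of independent copies of a 90%-reliable estimator; fails
    only if more than half the copies fail, i.e. with probability
    exp(-Omega(copies)) by a Chernoff bound."""
    return statistics.median(base_estimator(x, y) for _ in range(copies))

# toy base estimator: true Hamming distance, grossly wrong 10% of the time
_rng = random.Random(0)
def noisy_hamming(x, y):
    d = sum(a != b for a, b in zip(x, y))
    return d if _rng.random() < 0.9 else d + 100

x, y = [0, 1, 1, 0, 1], [1, 1, 0, 0, 1]
print(boosted_estimate(x, y, noisy_hamming, copies=25))  # 2, almost surely
```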

  10. NNS via sketching: Approach 2 • Query time below n? • Theorem [KOR’98]: Õ(d) query time and n^{O(1/ε²)} space for (1 + ε)-approximation. • Proof: • Note that sk′(q) has O((1/ε²) · log n) bits • Only n^{O(1/ε²)} possible sketches! • Store an answer for each of the n^{O(1/ε²)} possible inputs • If a distance has a constant-size sketch, it admits a poly-space NNS data structure! • Space closer to linear? • approach 3 will require more specialized sketches…
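[Note: the proof can be acted out directly in a toy sketch. The helper below is my own stand-in: `sketch_fn` is any function returning b sketch bits, and "store the point whose sketch best matches" is an illustrative answer rule, not the rule from [KOR’98]. The point is only that a b-bit sketch takes 2^b values, so one precomputed answer per value is affordable.]

```python
from itertools import product

def build_answer_table(points, sketch_fn, sketch_bits):
    """Precompute one answer per possible sketch value; a query then
    costs a single dictionary lookup on tuple(sketch_fn(q))."""
    table = {}
    for bits in product((0, 1), repeat=sketch_bits):
        # store the point whose own sketch best matches this sketch value
        table[bits] = min(
            points,
            key=lambda p: sum(a != b for a, b in zip(bits, sketch_fn(p))),
        )
    return table
```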

  11. Approach 3: Locality-Sensitive Hashing [Indyk-Motwani’98] Random hash function h on R^d satisfying: • for a close pair (when ||p − q|| ≤ r), P1 = Pr[h(p) = h(q)] is “high” (“not-so-small”) • for a far pair (when ||p − q′|| > cr), P2 = Pr[h(p) = h(q′)] is “small” • Use several hash tables: n^ρ, where ρ = log(1/P1) / log(1/P2) < 1

  12. Locality-sensitive hash functions • Hash function g is usually a concatenation of “primitive” functions: g(p) = ⟨h1(p), h2(p), …, hk(p)⟩ • Example: Hamming space • h(p) = p_j, i.e., choose the j-th bit for a random j • g(p) chooses k bits at random
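[Note: in code, the Hamming-space example might look as follows; a minimal sketch, function names are mine.]

```python
import random

def make_g(d, k, rng):
    """g(p) = (h_1(p), ..., h_k(p)), where each primitive h reads one
    uniformly random coordinate of the d-bit point p."""
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[j] for j in coords)

g = make_g(d=16, k=4, rng=random.Random(1))
print(g([i % 2 for i in range(16)]))  # the 4-bit bucket label of one point
```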

  13. Full algorithm • Data structure is just L hash tables: • Each hash table uses a fresh random function g_i(p) = ⟨h_{i,1}(p), …, h_{i,k}(p)⟩ • Hash all dataset points into the table • Query: • Check for collisions in each of the L hash tables • until we encounter a point within distance cr of q • Guarantees: • Space: O(nL), plus the space to store the points • Query time: O(L · (k + d)) (in expectation) • 50% probability of success.
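[Note: a toy end-to-end version of the data structure, assembled from the slide's description (L tables, a fresh random g per table, scan colliding points until one within cr is found). The class name and details are my own illustration, not a canonical implementation.]

```python
import random

class HammingLSH:
    """L hash tables over {0,1}^d, each keyed by k random coordinates."""

    def __init__(self, points, k, L, seed=0):
        rng = random.Random(seed)
        d = len(points[0])
        self.points = points
        # one fresh random g per table: a list of k coordinate indices
        self.gs = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = []
        for coords in self.gs:
            table = {}
            for p in points:
                table.setdefault(tuple(p[j] for j in coords), []).append(p)
            self.tables.append(table)

    def query(self, q, cr):
        """Scan colliding points table by table; return the first point
        within distance cr of q, or None if no table yields one."""
        for coords, table in zip(self.gs, self.tables):
            for p in table.get(tuple(q[j] for j in coords), []):
                if sum(a != b for a, b in zip(p, q)) <= cr:
                    return p
        return None
```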

  14. Analysis of LSH scheme • Choice of parameters k, L? • L hash tables with g = ⟨h1, …, hk⟩ • Set k s.t. Pr[collision of far pair] = P2^k = 1/n • Then Pr[collision of close pair] = P1^k = (P2^k)^ρ = n^{−ρ} • Hence L = n^ρ “repetitions” (tables) suffice!
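[Note: the parameter choice is a two-line computation; the numbers in the comment are hypothetical values of n, P1, P2 picked for illustration.]

```python
import math

def lsh_parameters(n, p1, p2):
    """Choose k so that p2**k = 1/n; then rho = log(1/p1)/log(1/p2)
    and L = n**rho tables suffice, per the slide's analysis."""
    k = math.ceil(math.log(n) / math.log(1.0 / p2))
    rho = math.log(1.0 / p1) / math.log(1.0 / p2)
    L = math.ceil(n ** rho)
    return k, rho, L

print(lsh_parameters(10**6, 0.9, 0.8))  # -> k = 62, rho ~ 0.47, L ~ 681
```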

  15. Analysis: Correctness • Let p* be an r-near neighbor of q • If p* does not exist, the algorithm can output anything • Algorithm fails when: • the near neighbor p* is not in the searched buckets • Probability of failure: • Probability q and p* do not collide in one hash table: at most 1 − n^{−ρ} • Probability they do not collide in L = n^ρ hash tables: at most (1 − n^{−ρ})^{n^ρ} ≈ 1/e
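[Note: the last step is the standard limit (1 − 1/L)^L → 1/e, easy to check numerically; n and ρ below are illustrative values.]

```python
import math

# with L = n**rho tables, failure probability is (1 - n**-rho) ** (n**rho)
n, rho = 10**6, 0.5
L = round(n ** rho)        # 1000 tables in this toy setting
print((1 - 1 / L) ** L)    # 0.3677...
print(1 / math.e)          # 0.3678..., the limiting value
```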

  16. Analysis: Runtime • Runtime dominated by: • Hash function evaluation: O(L · k) time • Distance computations to points in buckets • Distance computations: • Care only about far points, at distance > cr • In one hash table: • Probability a far point collides is at most P2^k = 1/n • Expected number of far points in a bucket: n · (1/n) = 1 • Over L hash tables, the expected number of far points is L • Total: O(L · (k + d)) = O(n^ρ · (k + d)) in expectation
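[Note: the key line, E[far points per bucket] = n · P2^k = n · (1/n) = 1, can be sanity-checked with a tiny simulation in which each far point independently collides with probability P2^k; the model and numbers are mine.]

```python
import random

rng = random.Random(0)
n = 1000
collide_prob = 1 / n       # plays the role of P2**k after the choice of k
trials = 2000
total = sum(
    sum(rng.random() < collide_prob for _ in range(n))  # collisions, one table
    for _ in range(trials)
)
print(total / trials)      # ~1.0 far point per table, matching n * P2**k = 1
```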

  17. NNS for Euclidean space [Datar-Immorlica-Indyk-Mirrokni’04] • Hash function g is a concatenation of “primitive” functions: g(p) = ⟨h1(p), …, hk(p)⟩ • LSH function h: • pick a random line ℓ, and quantize • project point p onto ℓ • h(p) = ⌊(⟨u, p⟩ + b) / w⌋ • u is a random Gaussian vector • b is random in [0, w) • w is a parameter (e.g., 4)
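[Note: a minimal version of this primitive; variable names are mine, the Gaussian direction, uniform offset, and width-w quantization follow the slide's description.]

```python
import math
import random

def make_euclidean_h(d, w, rng):
    """h(p) = floor((<u, p> + b) / w): project p onto a random Gaussian
    direction u, shift by b uniform in [0, w), cut into width-w cells."""
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)
    return lambda p: math.floor((sum(ui * pi for ui, pi in zip(u, p)) + b) / w)

h = make_euclidean_h(d=8, w=4.0, rng=random.Random(2))
print(h([0.5] * 8))  # cell index of one point; nearby points often share it
```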

  18. Optimal Euclidean LSH [A.-Indyk’06] • Regular grid → grid of balls (in 2D, then in R^t) • p can hit empty space, so take more such grids until p is in a ball • Need (too) many grids of balls • Start by projecting into dimension t • Analysis gives ρ = 1/c² + o(1) • Choice of reduced dimension t? • Tradeoff between: • # hash tables, n^ρ, and • time to hash, t^{O(t)} • Total query time: d · n^{1/c² + o(1)}

  19. Proof idea • Claim: ρ = 1/c², i.e., P(r) ≈ exp(−A · r²), where • P(r) = probability of collision when ||p − q|| = r • (then ρ = log(1/P(1)) / log(1/P(c)) = A / (A · c²) = 1/c²) • Intuitive proof: • Projection approximately preserves distances [JL] • P(r) = |intersection| / |union| of the two balls • P(r) ≈ probability a random point u lands beyond the dashed line (figure: points p, q at distance r, with u beyond the cap boundary) • Fact (in high dimensions): the x-coordinate of u has a nearly Gaussian distribution → P(r) ≈ exp(−A · r²)

  20. Open question: • More practical variant of the above hashing? • Design a space partitioning of R^t that is • efficient: point location in poly(t) time • qualitative: regions are “sphere-like”: Pr[needle of length 1 is not cut]^{c²} ≥ Pr[needle of length c is not cut]

  21. LSH Zoo • Hamming distance [IM’98]: h = pick a random coordinate (or several) • Manhattan (ℓ1) distance [AI’06]: h = cell in a randomly shifted grid • Jaccard distance between sets A, B: 1 − |A ∩ B| / |A ∪ B| • h = pick a random permutation π on the universe; h(A) = min_{a ∈ A} π(a): min-wise hashing [Bro’97] • Pr[h(A) = h(B)] = |A ∩ B| / |A ∪ B| • Running example (from the figure): “to be or not to be” vs. “to sketch or not to sketch”, i.e., the sets {be, not, or, to} and {not, or, to, sketch} over the universe {be, to, sketch, or, not}
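[Note: min-wise hashing is easy to state in code; the toy implementation below reuses the slide's running example sets and empirically recovers their Jaccard similarity 3/5.]

```python
import random

def minhash(s, perm):
    """min-wise hashing [Bro'97]: h(A) = min of perm over A; two sets
    collide with probability exactly |A intersect B| / |A union B|."""
    return min(perm[x] for x in s)

universe = ["be", "to", "sketch", "or", "not"]
A = {"be", "not", "or", "to"}        # "to be or not to be"
B = {"not", "or", "to", "sketch"}    # "to sketch or not to sketch"

rng = random.Random(0)
hits = 0
for _ in range(10000):
    order = universe[:]
    rng.shuffle(order)               # a fresh random permutation
    perm = {w: i for i, w in enumerate(order)}
    hits += minhash(A, perm) == minhash(B, perm)
print(hits / 10000)                  # ~0.6 = |A ∩ B| / |A ∪ B| = 3/5
```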

  22. LSH in practice (“safety not guaranteed”) • If we want exact NNS, what is c? • Can choose any parameters k, L • Correct as long as the near neighbor collides with the query in some table • Performance: trade-off between the number of tables and false positives: a larger k gives fewer false positives, a smaller k needs fewer tables • will depend on dataset “quality”
