
Sketching, Sampling and other Sublinear Algorithms: Nearest Neighbor Search

This lecture discusses sketching, sampling, and other sublinear algorithms for nearest neighbor search (NNS). It covers the approximate formulation of NNS, locality-sensitive hashing and its analysis, LSH as a sketch, and a first look at embeddings.


Presentation Transcript


  1. Sketching, Sampling and other Sublinear Algorithms: Nearest Neighbor Search
  Alex Andoni (MSR SVC)

  2. Nearest Neighbor Search (NNS)
  • Preprocess: a set D of n points
  • Query: given a query point q, report a point p* ∈ D with the smallest distance to q
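As a baseline for the problem statement above, here is the trivial linear scan that the rest of the lecture improves on (an illustrative Python sketch, not from the slides; `nearest_neighbor` is a hypothetical helper):

```python
import math

# Baseline exact NNS: scan all n points and keep the closest one.
# Cost is O(n * d) per query, which the sublinear methods below improve on.
def nearest_neighbor(points, q):
    return min(points, key=lambda p: math.dist(p, q))

points = [(0, 0), (3, 4), (1, 1)]
print(nearest_neighbor(points, (1, 2)))  # (1, 1) is closest to the query
```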

  3. Motivation
  • Generic setup:
  • Points model objects (e.g. images)
  • Distance models (dis)similarity measure
  • Application areas:
  • machine learning: k-NN rule
  • speech/image/video/music recognition, vector quantization, bioinformatics, etc…
  • Distance can be:
  • Hamming, Euclidean, edit distance, Earth-mover distance, etc…
  • Primitive for other problems: find the similar pairs in a set D, clustering…
  [figure: two example binary strings compared bit by bit, illustrating objects represented as bit vectors under Hamming distance]

  4. Lecture Plan
  • Locality-Sensitive Hashing
  • LSH as a Sketch
  • Towards Embeddings

  5. 2D case
  • Compute Voronoi diagram
  • Given query q, perform point location
  • Performance:
  • Space: O(n)
  • Query time: O(log n)

  6. High-dimensional case
  • All exact algorithms degrade rapidly with the dimension d
  • In practice:
  • When d is "low-medium", kd-trees work reasonably
  • When d is "high", state-of-the-art is unsatisfactory

  7. Approximate NNS
  • c-approximate r-near neighbor: given a query point q, report a point p' ∈ D s.t. dist(p', q) ≤ cr, as long as there exists a point p with dist(p, q) ≤ r
  • Randomized: a point returned with 90% probability

  8. Heuristic for Exact NNS
  • r-near neighbor: given a query point q, report a set R with:
  • all points p s.t. dist(p, q) ≤ r (each with 90% probability)
  • R may also contain some c-approximate neighbors p' s.t. dist(p', q) ≤ cr
  • Can filter out bad answers

  9. Approximation Algorithms for NNS
  • A vast literature:
  • milder dependence on dimension: [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], … [Aiger-Kaplan-Sharir'13]
  • little to no dependence on dimension: [Indyk-Motwani'98], [Kushilevitz-Ostrovsky-Rabani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A-Indyk'06], … [A-Indyk-Nguyen-Razenshteyn'??]

  10. Locality-Sensitive Hashing [Indyk-Motwani'98]
  • Random hash function g s.t. for any points p, q:
  • Pr[g(p) = g(q)] is "high" when p and q are close (distance ≤ r)
  • Pr[g(p) = g(q)] is "small", but "not-so-small", when p and q are far (distance > cr)
  • Use several hash tables: L = n^ρ of them, where ρ < 1

  11. Locality sensitive hash functions
  • Hash function g is usually a concatenation of "primitive" functions: g(p) = (h1(p), h2(p), …, hk(p))
  • Example: Hamming space {0,1}^d
  • h(p) = p_j, i.e., choose the j-th bit for a random j
  • g chooses k bits at random
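The concatenation g = (h1, …, hk) for Hamming space can be sketched as follows (an illustrative Python sketch, not from the slides; `make_g` is a hypothetical helper):

```python
import random

# g(p) = (h1(p), ..., hk(p)): each primitive hash reads one random bit of p.
def make_g(d, k, rng):
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[j] for j in coords)

rng = random.Random(42)
g = make_g(d=8, k=3, rng=rng)
p = [0, 1, 1, 0, 0, 1, 0, 1]
q = [0, 1, 1, 0, 0, 1, 1, 1]  # differs from p in one of 8 bits
print(g(p), g(q))
```

For this p and q, Pr[g(p) = g(q)] over the random choice of coordinates is (7/8)^3 ≈ 0.67: each of the k = 3 sampled bits avoids the single differing coordinate with probability 7/8.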

  12. Formal description
  • Data structure is just L hash tables:
  • Each hash table uses a fresh random function g_i(p) = (h_i,1(p), …, h_i,k(p))
  • Hash all dataset points into the table
  • Query:
  • Check for collisions in each of the L hash tables
  • until we encounter a point within distance cr
  • Guarantees:
  • Space: O(nL), plus space to store points
  • Query time: O(L·(k + d)) (in expectation)
  • 50% probability of success.
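The data structure described on this slide might look like the following Python sketch for Hamming space (the class name `HammingLSH` and the concrete parameter values are illustrative, not from the slides):

```python
import random
from collections import defaultdict

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

class HammingLSH:
    """L hash tables; table i keys each point by k random bit positions."""
    def __init__(self, points, d, k, L, seed=0):
        rng = random.Random(seed)
        self.coords = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for cs, table in zip(self.coords, self.tables):
            for p in points:
                table[tuple(p[j] for j in cs)].append(p)

    def query(self, q, cr):
        # Scan q's bucket in each table; stop at the first point within cr.
        for cs, table in zip(self.coords, self.tables):
            for p in table.get(tuple(q[j] for j in cs), ()):
                if hamming(p, q) <= cr:
                    return p
        return None

d = 16
rng = random.Random(1)
data = [[rng.randrange(2) for _ in range(d)] for _ in range(50)]
lsh = HammingLSH(data, d=d, k=4, L=8)
q = list(data[0])          # query identical to one dataset point
print(lsh.query(q, cr=3))  # returns data[0]: it collides with q in every table
```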

  13. Analysis of LSH Scheme
  • How did we pick k?
  • For fixed p, q, we have:
  • Pr[collision of close pair] = (1 − r/d)^k = P1^k
  • Pr[collision of far pair] = (1 − cr/d)^k = P2^k
  • Want to make Pr[collision of far pair] ≤ 1/n
  • Set k = log_{1/P2} n
  • Then Pr[collision of close pair] = P1^k = n^{−ρ}, where ρ = log(1/P1) / log(1/P2) ≈ 1/c
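Under the Hamming-space collision probabilities above, the parameter choices can be computed numerically (a Python sketch; `lsh_params` and the example values of n, r, c, d are hypothetical, not from the slides):

```python
import math

# Hypothetical helper: derive Hamming-LSH parameters from n, r, c, d.
def lsh_params(n, r, c, d):
    P1 = 1 - r / d          # per-bit collision probability at distance r
    P2 = 1 - c * r / d      # per-bit collision probability at distance cr
    k = math.ceil(math.log(n) / math.log(1 / P2))   # ensures P2**k <= 1/n
    rho = math.log(1 / P1) / math.log(1 / P2)       # exponent, roughly 1/c
    L = math.ceil(n ** rho)                         # number of hash tables
    return P1, P2, k, rho, L

P1, P2, k, rho, L = lsh_params(n=10**6, r=10, c=2, d=400)
print(k, round(rho, 3), L)
```

Note that ρ comes out slightly below 1/c = 0.5 here, matching the slide's ρ ≈ 1/c.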

  14. Analysis: Correctness
  • Let p be an r-near neighbor of q
  • If p does not exist, the algorithm can output anything
  • Algorithm fails when:
  • the near neighbor p is not in the searched buckets
  • Probability of failure:
  • Probability p, q do not collide in one hash table: ≤ 1 − P1^k = 1 − n^{−ρ}
  • Probability they do not collide in L = n^ρ hash tables: at most (1 − n^{−ρ})^{n^ρ} ≤ 1/e

  15. Analysis: Runtime
  • Runtime dominated by:
  • Hash function evaluation: O(L·k) time
  • Distance computations to points in buckets
  • Distance computations:
  • Care only about far points, at distance > cr
  • In one hash table, probability a far point collides with q is at most P2^k = 1/n
  • Expected number of far points in a bucket: n · (1/n) = 1
  • Over L hash tables, expected number of far points is L
  • Total: O(L·(k + d)) in expectation

  16. LSH in the wild
  • If we want exact NNS, what is c?
  • Can choose any parameters k, L
  • Correct as long as the near neighbor collides with the query in some table
  • Performance:
  • trade-off between # tables and false positives (fewer tables ⇒ fewer false positives, but more missed neighbors)
  • will depend on dataset "quality"
  • Can tune k, L to optimize for given dataset (safety not guaranteed)
  • Further advantages:
  • Point insertions/deletions easy
  • natural to distribute computation/hash tables in a cluster

  17. LSH Zoo
  • Hamming distance [IM'98]: h picks a random coordinate (or several)
  • Manhattan distance: homework
  • Jaccard distance between sets: h picks a random permutation π on the universe, and hashes a set to its minimum element under π: min-wise hashing [Bro'97]
  • Euclidean distance: next lecture
  [figure: the phrases "to be or not to be" and "to sketch or not to sketch", shown both as bit strings and as the word sets {be, not, or, to} and {not, or, to, sketch}]
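Min-wise hashing can be illustrated on the slide's word-set example: two sets collide under a random-permutation min-hash with probability exactly their Jaccard similarity (1 minus the Jaccard distance). A Python sketch, estimating that probability by repeated trials:

```python
import random

def jaccard(a, b):
    return len(a & b) / len(a | b)

def minhash(s, rank):
    # min-wise hash: smallest rank of any element of s under a random permutation
    return min(rank[x] for x in s)

A = {"be", "not", "or", "to"}        # "to be or not to be"
B = {"not", "or", "to", "sketch"}    # "to sketch or not to sketch"
universe = sorted(A | B)

rng = random.Random(0)
trials = 2000
hits = 0
for _ in range(trials):
    perm = rng.sample(range(len(universe)), len(universe))
    rank = dict(zip(universe, perm))
    hits += minhash(A, rank) == minhash(B, rank)

print(hits / trials, jaccard(A, B))  # the estimate concentrates around 0.6
```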
