This lecture explores sublinear algorithms for nearest neighbor search (NNS), focusing on preprocessing point sets and querying for the closest point based on various distance metrics, such as Hamming and Euclidean distances. Applications in machine learning, image/music recognition, and bioinformatics are discussed. Key concepts include Locality-Sensitive Hashing (LSH), sketching techniques, and approximation algorithms, aimed at improving performance in high-dimensional spaces. The analysis covers correctness, runtime efficiency, and trade-offs in algorithm design, providing insights into effective nearest neighbor solutions.
Sketching, Sampling and other Sublinear Algorithms: Nearest Neighbor Search. Alex Andoni (MSR SVC)
Nearest Neighbor Search (NNS) • Preprocess: a set P of n points • Query: given a query point q, report a point p* in P with the smallest distance to q
Motivation • Generic setup: • Points model objects (e.g. images) • Distance models (dis)similarity measure • Application areas: • machine learning: k-NN rule • speech/image/video/music recognition, vector quantization, bioinformatics, etc… • Distance can be: • Hamming, Euclidean, edit distance, Earth-mover distance, etc… • Primitive for other problems: • finding similar pairs in a set, clustering…
Lecture Plan • Locality-Sensitive Hashing • LSH as a Sketch • Towards Embeddings
2D case • Compute the Voronoi diagram of the point set • Given a query q, perform point location in the diagram • Performance: • Space: O(n) • Query time: O(log n)
High-dimensional case • All exact algorithms degrade rapidly with the dimension d • In practice: • When d is “low-medium”, kd-trees work reasonably well • When d is “high”, the state of the art is unsatisfactory
Approximate NNS • r-near neighbor: given a query point q, report a point p' in P with ||p' − q|| ≤ cr (c-approximate), provided there exists a point within distance r of q • Randomized: such a point is returned with 90% probability
Heuristic for Exact NNS • r-near neighbor: given a query point q, report a set S containing • all points p with ||p − q|| ≤ r (each with 90% probability) • S may also contain some approximate neighbors, i.e. points p with ||p − q|| ≤ cr • Can filter out these bad answers afterwards
Approximation Algorithms for NNS • A vast literature: • milder dependence on dimension [Arya-Mount’93], [Clarkson’94], [Arya-Mount-Netanyahu-Silverman-Wu’98], [Kleinberg’97], [Har-Peled’02],…[Aiger-Kaplan-Sharir’13] • little to no dependence on dimension [Indyk-Motwani’98], [Kushilevitz-Ostrovsky-Rabani’98], [Indyk’98, ‘01], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-Regev’04], [Panigrahy’06], [Ailon-Chazelle’06], [A-Indyk’06],… [A-Indyk-Nguyen-Razenshteyn’??]
Locality-Sensitive Hashing [Indyk-Motwani’98] • Random hash function g such that for any points p, q: • Close: when ||p − q|| ≤ r, Pr[g(p) = g(q)] is “not-so-small” (at least P1) • Far: when ||p − q|| > cr, Pr[g(p) = g(q)] is “small” (at most P2), where P1 > P2 • Use several hash tables to boost the collision probability for close pairs
Locality sensitive hash functions • Hash function g is usually a concatenation of “primitive” functions: g(p) = ⟨h1(p), h2(p), …, hk(p)⟩ • Example: Hamming space {0,1}^d • h(p) = p_j, i.e., choose the j-th bit for a random j • g(p) chooses k bits at random
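To make the bit-sampling primitive concrete, here is a minimal Python sketch; the function name make_bit_sampling_hash and its interface are illustrative, not from the lecture:

```python
import random

def make_bit_sampling_hash(d, k, seed=None):
    """Hamming-space primitive: g(p) concatenates k randomly chosen
    coordinates of the d-bit point p (a sketch, not a library API)."""
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(k)]   # k random bit positions
    def g(p):                                       # p: string or list of d bits
        return tuple(p[j] for j in coords)
    return g

# Example: close points collide more often than far ones.
g = make_bit_sampling_hash(d=6, k=2, seed=0)
print(g("011100"), g("010100"), g("111111"))
```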
Formal description • Data structure is just L hash tables: • Each hash table uses a fresh random function g_i(p) = ⟨h_{i,1}(p), …, h_{i,k}(p)⟩ • Hash all dataset points into the table • Query: • Check for collisions in each of the L hash tables • until we encounter a point within distance cr • Guarantees: • Space: O(nL), plus space to store the points • Query time: O(L·(k + d)) (in expectation) • 50% probability of success. The sketch after this slide shows the structure in code.
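A compact Python sketch of this data structure under the Hamming/bit-sampling assumptions above; the class name HammingLSH and its interface are illustrative:

```python
import random
from collections import defaultdict

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

class HammingLSH:
    """L hash tables, each keyed by a fresh k-bit sampling function
    (a minimal sketch of the scheme described in the slides)."""
    def __init__(self, points, k, L, seed=0):
        rng = random.Random(seed)
        self.points = points
        d = len(points[0])
        # one fresh function g_i (a tuple of k random coordinates) per table
        self.funcs = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for p in points:
            for g, table in zip(self.funcs, self.tables):
                table[tuple(p[j] for j in g)].append(p)

    def query(self, q, cr):
        # scan colliding points table by table; stop at the first point within cr
        for g, table in zip(self.funcs, self.tables):
            for p in table.get(tuple(q[j] for j in g), []):
                if hamming(p, q) <= cr:
                    return p
        return None

pts = ["011100", "010100", "000000", "111111"]
index = HammingLSH(pts, k=2, L=4)
print(index.query("010110", cr=2))   # usually finds "010100"
```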
Analysis of LSH Scheme • How did we pick k and L? • For a fixed k, we have • Pr[collision of a close pair] ≥ P1^k • Pr[collision of a far pair] ≤ P2^k • Want to make P2^k ≈ 1/n (about one far point per bucket in expectation) • Set k = log n / log(1/P2) • Then L = 1/P1^k = n^ρ, where ρ = log(1/P1) / log(1/P2)
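For concreteness, here is this calculation written out for Hamming distance with bit sampling, a sketch of the standard derivation assuming r is much smaller than d:

```latex
% Bit sampling on \{0,1\}^d:
%   P_1 = \Pr[h(p)=h(q) \mid \|p-q\|\le r]   \ge 1 - r/d
%   P_2 = \Pr[h(p)=h(q) \mid \|p-q\| >  cr]  \le 1 - cr/d
\[
k = \frac{\log n}{\log(1/P_2)}, \qquad
L = P_1^{-k} = n^{\rho}, \qquad
\rho = \frac{\log(1/P_1)}{\log(1/P_2)}
     = \frac{\ln(1 - r/d)}{\ln(1 - cr/d)}
     \;\approx\; \frac{r/d}{cr/d} \;=\; \frac{1}{c}
  \quad (\text{for } r \ll d).
\]
```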
Analysis: Correctness • Let p* be an r-near neighbor of q • If p* does not exist, the algorithm can output anything • Algorithm fails when: • the near neighbor p* is in none of the searched buckets • Probability of failure: • Probability q and p* do not collide in one hash table: at most 1 − P1^k = 1 − 1/L • Probability they do not collide in any of the L hash tables: at most (1 − 1/L)^L ≤ 1/e < 50%
Analysis: Runtime • Runtime dominated by: • Hash function evaluation: O(L·k) time • Distance computations to points in the colliding buckets • Distance computations: • Care only about far points, at distance > cr • In one hash table: • Probability a far point collides with q is at most P2^k = 1/n • Expected number of far points in q’s bucket: n · (1/n) = 1 • Over L hash tables, expected number of far points is L • Total: O(L·(k + d)) = O(n^ρ·(k + d)) in expectation. A small numerical illustration follows.
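A small helper that carries out the parameter choice numerically; the function name lsh_parameters is illustrative, and the example plugs in the Hamming-space values P1 = 1 − r/d and P2 = 1 − cr/d:

```python
import math

def lsh_parameters(n, p1, p2):
    """Given collision probabilities p1 (close pairs) and p2 (far pairs),
    return (k, L, rho) under the standard setting P2^k ~ 1/n, L ~ P1^{-k}."""
    k = math.ceil(math.log(n) / math.log(1 / p2))
    rho = math.log(1 / p1) / math.log(1 / p2)
    L = math.ceil(p1 ** (-k))
    return k, L, rho

# Example: n = 10^6 points, Hamming distance with r/d = 0.1 and c = 2
print(lsh_parameters(n=10**6, p1=1 - 0.1, p2=1 - 0.2))   # roughly k=62, L~687, rho~0.47
```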
LSH in the wild • If we want exact NNS, what is c? • Can choose any parameters k and L • Correct as long as the near neighbor collides with the query in at least one table • Performance: • trade-off: smaller k means fewer tables, larger k means fewer false positives • the right choice will depend on dataset “quality” • Can tune k and L to optimize for a given dataset, though the worst-case guarantees then no longer apply (“safety not guaranteed”) • Further advantages: • Point insertions/deletions are easy • natural to distribute computation/hash tables in a cluster
LSH Zoo • Hamming distance [IM’98]: h(p) = a random coordinate (or several) of p • Manhattan distance: homework • Jaccard distance between sets A and B: 1 − |A∩B|/|A∪B| • h: pick a random permutation π on the universe and map a set to its minimum element under π; then Pr[h(A) = h(B)] = |A∩B|/|A∪B| (min-wise hashing [Bro’97]) • Euclidean distance: next lecture • Running example from the slide: “to be or not to be” and “to sketch or not to sketch”, viewed as the word sets {be, not, or, to} and {not, or, to, sketch}
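A minimal sketch of min-wise hashing on the slide’s running example; a seeded hash stands in for a random permutation of the universe, and the collision rate over many seeds approximates the Jaccard similarity:

```python
import hashlib

def minhash(s, seed):
    """Map a set to its minimum-ranked element, where the rank is a seeded
    hash simulating a random permutation (a sketch, not a library API)."""
    rank = lambda x: hashlib.sha1(f"{seed}:{x}".encode()).hexdigest()
    return min(s, key=rank)

A = {"to", "be", "or", "not"}          # "to be or not to be"
B = {"to", "sketch", "or", "not"}      # "to sketch or not to sketch"

# Fraction of seeds on which the two sets agree ~ |A ∩ B| / |A ∪ B| = 3/5.
trials = 1000
agree = sum(minhash(A, s) == minhash(B, s) for s in range(trials))
print(agree / trials)   # close to 0.6
```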