

  1. Sketching, Sampling and other Sublinear Algorithms: Euclidean space: dimension reduction and NNS. Alex Andoni (MSR SVC)

  2. A Sketching Problem • Sketching: a map F from objects to short bit-strings • given F(x) and F(y), we should be able to deduce whether x and y are “similar” • Why? • reduce space and time needed to compute similarity • Example from the slide: “To sketch or not to sketch” → 010110, “To be or not to be” → 010101; are the sketches (and hence the texts) similar?

  3. Sketch from LSH • LSH often has the property that Pr[h(x) = h(y)] is a function of the similarity of x and y • Sketching from LSH: F(x) = (h_1(x), …, h_k(x)) for k independent hash functions • Estimate the collision probability by the fraction of collisions between F(x) and F(y) • k controls the variance of the estimate • [Broder’97]: this approach for the Jaccard coefficient
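
A minimal Python illustration of the collision-fraction estimate (the hash construction and k = 256 are my choices, not the slides’): in the [Broder’97] setting a min-wise hash collides with probability equal to the Jaccard coefficient, so averaging collisions over k independent hashes estimates similarity, and k controls the variance.

import random

def minhash_sketch(items, k=256, seed=0):
    # k "min-wise" hash values; same seed => same hash family for all inputs
    # (hash((salt, x)) is not a truly min-wise independent family, but fine for illustration)
    rng = random.Random(seed)
    salts = [rng.randrange(1 << 30) for _ in range(k)]
    return [min(hash((s, x)) for x in items) for s in salts]

def estimated_similarity(sk_a, sk_b):
    # fraction of coordinates on which the two sketches collide
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

A = set("to sketch or not to sketch".split())
B = set("to be or not to be".split())
print(len(A & B) / len(A | B))                                    # true Jaccard = 0.6
print(estimated_similarity(minhash_sketch(A), minhash_sketch(B)))  # estimate near 0.6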

  4. General Theory: embeddings • The above map is an embedding • General motivation: given a distance (metric) d, solve a computational problem P under d • Distances d: Hamming distance, Euclidean distance (ℓ2), edit distance between two strings, Earth-Mover (transportation) distance • Problems P: compute the distance between two points, Nearest Neighbor Search, diameter/closest pair of a set S, clustering, MST, etc. • f: reduce problem <P under hard metric> to <P under simpler metric>

  5. Embeddings: landscape • Definition: an embedding is a map f of a metric (X, d_X) into a host metric (Y, d_Y) such that for any x, y ∈ X: d_X(x, y) ≤ d_Y(f(x), f(y)) ≤ c · d_X(x, y), where c is the distortion (approximation) of the embedding f • Embeddings come in all shapes and colors: • source/host spaces • distortion • can be randomized: the guarantee holds with some probability • time to compute • Types of embeddings: • from a norm into the same norm but of lower dimension (dimension reduction) • from non-norms (edit distance, Earth-Mover Distance) into a norm (ℓ1) • from a given finite metric (shortest path on a planar graph) into a norm (ℓ1) • into something that is not a metric but a computational procedure: sketches

  6. Dimension Reduction • Johnson-Lindenstrauss Lemma: there is a linear (randomized) map A: ℝ^d → ℝ^k, k = O(ε⁻² · log(1/δ)), that preserves the distance between two fixed vectors • up to distortion 1 + ε • with probability 1 − δ (δ a small constant) • Preserves distances among n points for k = O(ε⁻² · log n) • Motivation: • e.g., the diameter of a pointset S in d-dimensional Euclidean space • trivially: O(n² · d) time • using the lemma: O(n² · ε⁻² · log n) time (plus the time to project) for a 1 + ε approximation • MANY applications: nearest neighbor search, streaming, pattern matching, approximation algorithms (clustering)…
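
A small illustration of the diameter example (the dimensions, ε, and the constant inside k are my choices, and numpy is assumed): project the n points with a random Gaussian map and compute the diameter in the reduced space, so the pairwise-distance stage costs O(n² · k) instead of O(n² · d), at the price of a 1 ± ε approximation.

import numpy as np

def approx_diameter(points, eps=0.25, seed=0):
    rng = np.random.default_rng(seed)
    n, d = points.shape
    k = max(1, int(np.ceil(np.log(n) / eps ** 2)))   # k = O(eps^-2 log n); constant 1 assumed
    G = rng.standard_normal((k, d)) / np.sqrt(k)     # JL map A(x) = Gx / sqrt(k)
    low = points @ G.T                               # n x k projected points
    sq = (low ** 2).sum(axis=1)                      # brute-force diameter in R^k
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (low @ low.T)
    return float(np.sqrt(np.maximum(dist2, 0.0)).max())

pts = np.random.default_rng(1).standard_normal((500, 2000))
print(approx_diameter(pts))   # within roughly 1 +/- eps of the true diameter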

  7. Main intuition • The map A can simply be a projection onto a random subspace of dimension k (suitably scaled)

  8. 1D embedding • How about one dimension (k = 1)? • Map f: ℝ^d → ℝ • f(x) = ⟨g, x⟩ = Σ_i g_i · x_i, • where the g_i are iid normal (Gaussian) random variables • Why Gaussian? • Stability property: g_1·x_1 + g_2·x_2 + … + g_d·x_d is distributed as ‖x‖ · g, where g is also Gaussian • Equivalently: g is spherically distributed, i.e., it has a random direction, and the projection onto a random direction depends only on the length of x • Gaussian pdf: (1/√(2π)) · e^(−x²/2), with E[g] = 0, E[g²] = 1
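
A quick numerical check of the stability property (the sample size and test vector are mine, numpy assumed): ⟨g, x⟩ with iid standard Gaussian g behaves like ‖x‖ times a single standard Gaussian, so the 1D projection only “sees” the length of x.

import numpy as np

rng = np.random.default_rng(0)
x = np.array([3.0, 4.0])                               # ||x|| = 5
samples = rng.standard_normal((100_000, x.size)) @ x   # 100k draws of <g, x>
print(samples.mean())                                  # ~ 0  (E[g] = 0)
print(samples.std())                                   # ~ 5  (= ||x||, since E[g^2] = 1)
print(np.linalg.norm(x))                               # exact length for comparison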

  9. 1D embedding • Map f: ℝ^d → ℝ, f(x) = ⟨g, x⟩ • Linear: f(x) − f(y) = f(x − y) • Want: |f(x) − f(y)| ≈ ‖x − y‖ • Claim: for any z ∈ ℝ^d, we have • expectation: E[f(z)²] = ‖z‖² • standard deviation of f(z)²: Θ(‖z‖²) • Proof: • enough to prove it for ‖z‖ = 1, since f is linear • expectation: E[f(z)²] = ‖z‖² (see the worked computation right after this slide)
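
A worked version of the expectation step (notation mine, matching the slide’s claim), written in LaTeX:

\mathbb{E}\left[f(z)^2\right]
  = \mathbb{E}\left[\Big(\sum_i g_i z_i\Big)^{2}\right]
  = \sum_{i,j} z_i z_j \,\mathbb{E}[g_i g_j]
  = \sum_i z_i^2 \,\mathbb{E}[g_i^2]
  = \|z\|^2,

since \mathbb{E}[g_i g_j] = \mathbb{E}[g_i]\,\mathbb{E}[g_j] = 0 for i \neq j (independence, zero mean) and \mathbb{E}[g_i^2] = 1.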

  10. Full Dimension Reduction • Just repeat the 1D embedding k times! • A(x) = (1/√k) · G·x, where G is a k × d matrix of iid Gaussian random variables • Want to prove: ‖A(x) − A(y)‖ = (1 ± ε) · ‖x − y‖ with probability 1 − δ • By linearity, it is OK to prove this for a fixed z = x − y
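
A minimal sketch of the full map (d, k, and the test pair are my choices, numpy assumed): A(x) = G·x/√k with G a k × d Gaussian matrix, checked on one fixed pair x, y.

import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 400
G = rng.standard_normal((k, d))        # k x d iid Gaussian entries
A = lambda v: (G @ v) / np.sqrt(k)     # A(x) = Gx / sqrt(k)

x, y = rng.standard_normal(d), rng.standard_normal(d)
ratio = np.linalg.norm(A(x) - A(y)) / np.linalg.norm(x - y)
print(ratio)                           # close to 1, i.e. distortion 1 +/- eps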

  11. Concentration • For a unit vector z, ‖G·z‖² is distributed as Σ_{i=1..k} g_i², • where each g_i is distributed as a standard Gaussian • Norm: ‖A(z)‖² = (1/k) · Σ_{i=1..k} g_i² • Σ_{i=1..k} g_i² is called the chi-squared distribution with k degrees of freedom • Fact: chi-squared is very well concentrated: • equal to k·(1 ± ε) with probability 1 − e^(−Ω(ε²k)) • akin to the central limit theorem
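
A Monte Carlo look at this concentration fact (k, ε, and the number of trials are my choices, numpy assumed): the squared norm of a Gaussian projection of a unit vector is chi-squared with k degrees of freedom, and it lands in k·(1 ± ε) almost always.

import numpy as np

rng = np.random.default_rng(0)
k, eps, trials = 200, 0.3, 100_000
chi2 = (rng.standard_normal((trials, k)) ** 2).sum(axis=1)   # sum of k squared Gaussians
print(chi2.mean() / k)                                       # ~ 1
print(np.mean(np.abs(chi2 / k - 1.0) <= eps))                # fraction inside k(1 +/- eps), close to 1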

  12. Dimension Reduction: wrap-up • ‖(1/√k) · G·z‖ = (1 ± ε) · ‖z‖ with high probability, for k = O(ε⁻² · log n) over n points • Extra: • Linear: can update A(x) as x changes • Can use ±1 entries instead of Gaussians [AMS’96, Ach’01, TZ’04…] • Fast JL: can compute A(x) faster than O(k·d) time [AC’06, AL’07’09, DKS’10, KN’10’12…]
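
A rough illustration of the ±1 variant (the sizes are my choices, numpy assumed): replacing the Gaussian entries with random signs, in the spirit of [Ach’01], gives the same (1 ± ε) behavior while being cheaper to generate and store.

import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 400
S = rng.choice([-1.0, 1.0], size=(k, d))   # random +/-1 entries instead of Gaussians
x = rng.standard_normal(d)
print(np.linalg.norm(S @ x) / (np.sqrt(k) * np.linalg.norm(x)))   # close to 1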

  13. NNS for Euclidean space [Datar-Immorlica-Indyk-Mirrokni’04] • Can use dimensionality reduction to get LSH for ℓ2 • LSH function h(p): • pick a random line, and quantize • project the point onto it: h(p) = ⌊(⟨g, p⟩ + b) / w⌋ • g is a random Gaussian vector • b is random in [0, w) • w is a parameter (e.g., 4)
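
A minimal version of this hash (w = 4, the seeds, and the test points are my choices, numpy assumed): project onto a random Gaussian direction, shift by a random b ∈ [0, w), and quantize into width-w buckets.

import numpy as np

def make_lsh(d, w=4.0, seed=0):
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(d)          # random Gaussian direction (the "line")
    b = rng.uniform(0.0, w)             # random offset in [0, w)
    return lambda p: int(np.floor((g @ p + b) / w))

h = make_lsh(d=100)
p = np.random.default_rng(1).standard_normal(100)
q = p + 0.01 * np.random.default_rng(2).standard_normal(100)   # a very close point
print(h(p), h(q))                       # close points usually land in the same bucket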

  14. Near-Optimal LSH [A-Indyk’06] • Regular grid → grid of balls • p can hit empty space, so take more such grids until p is in a ball • Need (too) many grids of balls • Start by projecting into dimension t • Analysis gives ρ = 1/c² + o(1) (as t grows) • Choice of the reduced dimension t? • Tradeoff between • # hash tables, n^ρ, and • time to hash, t^O(t) • Total query time: d·n^(1/c² + o(1)) • (figure: a 2D grid of balls vs. the projected space ℝ^t)
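
A very rough sketch of the ball-carving idea (the dimension t, radius, spacing, and number of grids are my guesses, not the parameters of [A-Indyk’06]; numpy assumed): project to ℝ^t, then try randomly shifted grids of balls until the point falls inside one; the hash value is (grid index, ball center).

import numpy as np

def ball_lsh(d, t=8, radius=2.5, spacing=4.0, grids=300, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((t, d)) / np.sqrt(t)          # projection into R^t
    shifts = rng.uniform(0.0, spacing, size=(grids, t))   # one random shift per grid of balls

    def h(p):
        x = P @ p
        for u, s in enumerate(shifts):
            center = np.round((x - s) / spacing) * spacing + s   # nearest ball center in grid u
            if np.linalg.norm(x - center) <= radius:             # projection falls inside this ball
                return (u, tuple(center.round(3)))
        return None                                              # ran out of grids (rare)
    return h

h = ball_lsh(d=100)
p = np.random.default_rng(1).standard_normal(100)
print(h(p))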

  15. Open question: • More practical variant of the above hashing? • Design a space partitioning of ℝ^t that is • efficient: point location in poly(t) time • qualitative: regions are “sphere-like”, i.e. [Prob. a needle of length 1 is not cut] ≥ [Prob. a needle of length c is not cut]^(c²)

  16. Time-Space Trade-offs

  query time | space
  low (one hash table lookup!) | high
  medium | medium
  high | low

  17. NNS beyond LSH • Data-dependent partitions… • Practice: • trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees… • often no guarantees • Theory: • can improve standard LSH by random data-dependent space partitions [A-Indyk-Nguyen-Razenshteyn’??] • tree-based approach to the max-norm (ℓ∞)

  18. Finale • Dimension Reduction in Euclidean space • A(x) = (1/√k) · G·x: a random projection preserves distances • only O(ε⁻² · log n) dimensions suffice for a 1 + ε approximation of the distances among n points! • NNS for Euclidean space • random projections give LSH • even better with ball partitioning • or better with cool lattices?
