
Efficient Sketches for Earth-Mover Distance, with Applications

David Woodruff, IBM Almaden. Joint work with Alexandr Andoni, Khanh Do Ba, and Piotr Indyk.


Presentation Transcript


  1. Efficient Sketches for Earth-Mover Distance, with Applications. David Woodruff, IBM Almaden. Joint work with Alexandr Andoni, Khanh Do Ba, and Piotr Indyk.

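  2. (Planar) Earth-Mover Distance
     • For multisets A, B of points in $[\Delta]^2$ with $|A| = |B| = N$: $\mathrm{EMD}(A,B) = \min_{\pi: A \to B} \sum_{a \in A} \|a - \pi(a)\|$, where $\pi$ ranges over bijections from A to B, i.e., the minimum cost of a perfect matching between A and B.
     • Example (figure omitted): for the two point sets shown on the slide, EMD = $6 + 3\sqrt{2}$.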
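To make the definition concrete, here is a minimal brute-force sketch in Python. It assumes a Euclidean ground distance (suggested by the $\sqrt{2}$ in the slide's example, but not stated in the transcript), and the function name `emd_bruteforce` is hypothetical. It enumerates all perfect matchings, so it is only usable for very small N.

```python
from itertools import permutations
from math import dist  # Euclidean distance, Python 3.8+


def emd_bruteforce(A, B):
    """Minimum total cost of a perfect matching between multisets A and B.

    Assumption: the ground distance is Euclidean.
    Runs in O(N!) time, so this is for illustration only.
    """
    if len(A) != len(B):
        raise ValueError("A and B must have the same size")
    # Fix the order of A and try every assignment of B's points to it.
    return min(
        sum(dist(a, b) for a, b in zip(A, perm))
        for perm in permutations(B)
    )


A = [(0, 0), (1, 0)]
B = [(0, 1), (1, 1)]
print(emd_bruteforce(A, B))  # each point moves straight up by 1
```

For realistic sizes one would use a polynomial-time assignment solver instead of enumeration; the point here is only to pin down what quantity the sketches approximate.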

  3. Geometric Representation of EMD
     • Map A, B to k-dimensional vectors F(A), F(B)
     • Image space of F is “simple,” e.g., k is small
     • Can estimate EMD(A,B) from F(A), F(B) via some efficient recovery algorithm E
     • Diagram (figure omitted): F maps A, B to $F(A), F(B) \in \mathbb{R}^k$, and $E(F(A), F(B)) \approx \mathrm{EMD}(A,B)$

  4. Geometric Representation of EMD: Motivation
     • Visual search and recognition:
       • Approximate nearest neighbor (NN) under EMD
       • Reduces to approximate NN under simpler distances
       • Has been applied to fast image search and recognition in large collections of images [Indyk-Thaper’03, Grauman-Darrell’05, Lazebnik-Schmid-Ponce’06]
     • Data streaming computation:
       • Estimating the EMD between two point sets given as a stream
       • Need the mapping F to be linear: adding a new point a to A translates to adding F(a) to F(A)
       • Important open problem in streaming [“Kanpur List ’06”]
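The linearity requirement can be illustrated with a small Python sketch. Everything here is an assumption for illustration: the random ±1 matrix merely stands in for a linear map F (it is not the construction from the talk), points are 0-indexed, and the names `DELTA`, `K`, `add_point`, and `sketch_of_multiset` are hypothetical. The only point is that a linear F lets a stream update the sketch one point at a time.

```python
import random

DELTA = 4  # grid side length (hypothetical small value)
K = 3      # sketch dimension (hypothetical)

rng = random.Random(0)
# Stand-in linear map F: a K x DELTA^2 matrix of random signs.
F = [[rng.choice((-1, 1)) for _ in range(DELTA * DELTA)] for _ in range(K)]


def index(p):
    """Position of grid point p = (x, y), 0-indexed, in the characteristic vector."""
    x, y = p
    return x * DELTA + y


def add_point(sketch, p):
    """Streaming update: F(A + {p}) = F(A) + F(p), by linearity of F."""
    j = index(p)
    return [s + row[j] for s, row in zip(sketch, F)]


def sketch_of_multiset(A):
    """Batch computation: apply F to the characteristic vector of A."""
    chi = [0] * (DELTA * DELTA)
    for p in A:
        chi[index(p)] += 1
    return [sum(f * c for f, c in zip(row, chi)) for row in F]


A = [(0, 0), (1, 2), (1, 2), (3, 3)]
streamed = [0] * K
for p in A:
    streamed = add_point(streamed, p)
print(streamed == sketch_of_multiset(A))  # streaming and batch sketches agree
```

Because the update touches only one column of F per arriving point, the stream never needs to store A itself, only the K-dimensional sketch.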

  5. Prior and New Results
     • Geometric representation of EMD:
     • Main Theorem: For any $\varepsilon \in (0,1)$, there exists a distribution over linear mappings $F: \mathbb{R}^{\Delta^2} \to \mathbb{R}^{\Delta^{\varepsilon}}$ such that, for multisets $A, B \subseteq [\Delta]^2$ of equal size, we can produce an $O(1/\varepsilon)$-approximation to EMD(A,B) from F(A), F(B) with probability 2/3.

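  6. Implications
     • Streaming: (details not preserved in the transcript)
     • Approximate nearest neighbor: (details not preserved in the transcript)
     • Notation: N = number of points; s = number of data points (multisets) to preprocess; α > 1 is a free parameter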

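  7. Proof Outline
     • Old [Agarwal-Varadarajan’04, Indyk’07]:
       • Extend EMD to EEMD, which:
         • Handles sets of unequal size $|A| \le |B|$ in a grid of side length k
         • $\mathrm{EEMD}(A,B) = \min_{S \subseteq B,\ |S| = |A|} \left[ \mathrm{EMD}(A,S) + k \cdot |B \setminus S| \right]$
         • Is induced by a norm $\|\cdot\|_{\mathrm{EEMD}}$, i.e., $\mathrm{EEMD}(A,B) = \|\chi(A) - \chi(B)\|_{\mathrm{EEMD}}$, where $\chi(A) \in \mathbb{R}^{\Delta^2}$ is the characteristic vector of A
       • Decomposition of EEMD into a weighted sum of small EEMDs
         • O(1/ε) distortion
     • New:
       • Linear sketching of “sum-norms”
     • Diagram (figure omitted): EMD over $[\Delta]^2$ is approximated by (EEMD over $[\Delta^{\varepsilon}]^2$) + (EEMD over $[\Delta^{\varepsilon}]^2$) + … + (EEMD over $[\Delta^{\varepsilon}]^2$), with $\Delta^{O(1)}$ terms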
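As a reading aid for the EEMD definition, the following brute-force Python sketch evaluates it directly on tiny inputs. It assumes a Euclidean ground distance (the transcript does not fix one), and the function names `emd` and `eemd` are hypothetical; it is not how the talk computes or sketches EEMD.

```python
from itertools import combinations, permutations
from math import dist  # Euclidean distance, Python 3.8+


def emd(A, S):
    """Brute-force EMD between equal-size multisets (assumed Euclidean)."""
    return min(
        sum(dist(a, s) for a, s in zip(A, perm))
        for perm in permutations(S)
    )


def eemd(A, B, k):
    """EEMD(A, B): min over S in B with |S| = |A| of EMD(A, S) + k * |B minus S|.

    k is the side length of the grid; every unmatched point of B pays k.
    """
    if len(A) > len(B):
        raise ValueError("expects |A| <= |B|")
    unmatched = len(B) - len(A)  # |B minus S| is the same for every choice of S
    return min(
        emd(A, list(S)) + k * unmatched
        for S in combinations(B, len(A))
    )


# One point of B is matched at cost 1; the other is left over and pays k = 4.
print(eemd([(0, 0)], [(0, 1), (3, 3)], k=4))
```

When |A| = |B| the penalty term vanishes and `eemd` coincides with `emd`, which is the sense in which EEMD extends EMD.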

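  8. Old Idea [Indyk ’07]
     • Diagram (figure omitted): EMD over $[\Delta]^2$ is decomposed into a sum of EEMD problems over $[\Delta^{1/2}]^2$; applied recursively, EMD over $[\Delta]^2$ becomes a sum of $\Delta^{O(1)}$ terms, each an EEMD over $[\Delta^{\varepsilon}]^2$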

  9. Old Idea [Indyk ’07]
     • Solve EEMD in each of the ∆ cells, each a problem in $[\Delta^{1/2}]^2$
     • Figure (omitted): the grid for EMD over $[\Delta]^2$, partitioned into cells

  10. Old Idea [Indyk ’07]
     • Solve one additional EEMD problem in $[\Delta^{1/2}]^2$
     • Edge lengths should also be scaled by $\Delta^{1/2}$

  11. Old Idea [Indyk ’07]
     • Total cost is the sum of the costs of the two phases
     • The algorithm outputs a matching, so its cost is at least the EMD cost
     • Indyk shows that if we put a random shift of the $[\Delta^{1/2}]^2$ grid on top of the $[\Delta]^2$ grid, the algorithm’s cost is at most a constant factor times the true EMD cost
     • Recursive application gives multiple $[\Delta^{\varepsilon}]^2$ grids on top of each other and results in an O(1/ε)-approximation

  12. Main New Technical Theorem
     • For a normed space $X = (\mathbb{R}^t, \|\cdot\|_X)$ and $M \in X^n$, denote $\|M\|_{1,X} = \sum_i \|M_i\|_X = \|M_1\|_X + \|M_2\|_X + \dots + \|M_n\|_X$
     • Given $C > 0$ and $\lambda > 0$, if $C/\lambda \le \|M\|_{1,X} \le C$, there is a distribution over linear mappings $\mu: X^n \to X^{(\lambda \log n)^{O(1)}}$ such that we can produce an O(1)-approximation to $\|M\|_{1,X}$ from $\mu(M)$ w.h.p.

  13. Proof Outline: Sum of Norms
     • First attempt:
       • Sample (uniformly) a few $M_i$’s to compute $\|M_i\|_X$
       • Problem: the sum could be concentrated in one block
     • Second attempt:
       • Sample $M_i$ with probability proportional to $\|M_i\|_X$ [Indyk’07]
       • Problem: how to do this online?
       • Techniques from [JW09, MW10]?
       • Need to sample/retrieve blocks, not just individual coordinates
     • Figure (omitted): blocks $M_1, M_2, M_3, \dots, M_n$, where $M_2$ contains most of the mass

  14. Proof Outline: Sum of Norms (cont.)
     • Our approach, for $M = (M_1, M_2, \dots, M_n)$:
       • Split into exponential levels:
         • Assume $\|M\|_{1,X} \le C$
         • $S_k = \{ i \in [n] : \|M_i\|_X \in (T_k, 2T_k] \}$, where $T_k = C/2^k$
       • It suffices to estimate $|S_k|$ for each level k. How?
         • For each level k, subsample from [n] at a rate such that the event $E_k$ (“isolation” of level k) holds with probability proportional to $|S_k|$
         • Repeat the experiment several times and count the number of successes
     • Figure (omitted): example levels, with $S_1$ containing $M_2$; $S_2$ containing $M_4, M_7$; $S_3$ containing $M_1, M_3, M_8, M_9$; …; $S_\ell$ containing $M_5, M_{10}, M_n$. Each subsample of M is tested for $E_k$ (yes/no)
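The level decomposition can be sketched in a few lines of Python. This is an illustration under assumptions not in the talk: X is taken to be $\ell_1$, the blocks are given explicitly (no streaming), and the names `level_of` and `level_sets` are hypothetical. It also checks why the level sizes suffice: every block in $S_k$ has norm in $(T_k, 2T_k]$, so $\sum_k T_k |S_k|$ is within a factor of 2 of $\|M\|_{1,X}$.

```python
def l1_norm(v):
    return sum(abs(x) for x in v)


def level_of(norm, C):
    """Return the k >= 1 with norm in (T_k, 2*T_k], where T_k = C / 2**k.

    Requires 0 < norm <= C, which holds for every nonzero block
    when the total sum of norms is at most C.
    """
    if not 0 < norm <= C:
        raise ValueError("norm must lie in (0, C]")
    k = 1
    while norm <= C / 2 ** k:
        k += 1
    return k


def level_sets(M, C):
    """Map each level k to the list S_k of block indices at that level."""
    levels = {}
    for i, block in enumerate(M):
        n = l1_norm(block)
        if n > 0:  # zero blocks belong to no level and contribute nothing
            levels.setdefault(level_of(n, C), []).append(i)
    return levels


M = [[4, 0], [1, 1], [0.5, 0], [3, 0]]  # block norms 4, 2, 0.5, 3
C = 16
levels = level_sets(M, C)
lower = sum((C / 2 ** k) * len(S) for k, S in levels.items())
total = sum(l1_norm(b) for b in M)
print(levels)
print(lower <= total <= 2 * lower)
```

The sandwich `lower <= total <= 2 * lower` holds for any input satisfying the assumption, which is why constant-factor estimates of each $|S_k|$ translate into a constant-factor estimate of the sum of norms.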

  15. Proof Outline: Event $E_k$
     • $E_k \Leftrightarrow$ “isolation” of level k:
       • Exactly one $i \in S_k$ gets subsampled
       • Nothing from $S_{k'}$ for $k' < k$
     • Verification of trial success/failure:
       • Hash the subsampled elements
       • Each cell maintains the vector sum of the subsampled $M_i$’s that hash there
       • $E_k$ holds roughly (we “accept”) when:
         • One cell has X-norm in $(0.9\,T_k, 2.1\,T_k]$
         • All other cells have X-norm $\le 0.9\,T_k$
       • The check fails only if:
         • Elements from lighter levels contribute a lot to one cell
         • Elements from heavier levels are subsampled and collide
         • Both are unlikely if the hash table is big enough
     • Under-estimates $|S_k|$. If $|S_k| > 2^k/\mathrm{polylog}(n)$, gives an O(1)-approximation
     • Remark: the triangle inequality of the norm gives control over the impact of collisions
     • Figure (omitted): subsampled blocks $M_1, M_4, M_5, M_6, M_9, M_{11}, M_{n-1}$ hashed into cells, each cell holding the sum ∑ of its blocks

  16. Sketch and Recovery Algorithm
     • Sketch:
       • For each level k, create t hash tables
       • For each hash table:
         • Subsample from [n], including each $i \in [n]$ w.p. $p_k = 2^{-k}$
         • Each cell maintains the sum of the $M_i$’s that hash to it
     • Recovery algorithm:
       • For each level k, count the number $c_k$ of “accepting” hash tables
       • Return $\sum_k T_k \cdot (c_k/t) \cdot (1/p_k)$
     • Properties of the estimator $(c_k/t) \cdot (1/p_k)$ of $|S_k|$:
       • For every k, the estimator under-estimates $|S_k|$
       • If $|S_k| > 2^k/\mathrm{polylog}\, n$, the estimator is $\Omega(|S_k|)$
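The sketch and recovery steps can be simulated in Python roughly as follows. This is a toy model under simplifying assumptions that are not from the talk: X is taken to be $\ell_1$, each cell stores the exact vector sum rather than a sketch of its X-norm, subsampling and hashing use Python's `random` module rather than limited-independence hash functions, and the parameter names (`levels`, `t`, `width`) and their values are hypothetical. The acceptance test uses the thresholds $0.9\,T_k$ and $2.1\,T_k$ from the proof outline.

```python
import random


def l1_norm(v):
    return sum(abs(x) for x in v)


def build_sketch(M, levels, t, width, rng):
    """For each level k and table j, keep `width` cells; each cell holds the
    vector sum of the subsampled blocks hashed to it (linear in M)."""
    dim = len(M[0])
    sketch = {}
    for k in range(1, levels + 1):
        p_k = 2.0 ** -k
        for j in range(t):
            cells = [[0.0] * dim for _ in range(width)]
            for block in M:
                if rng.random() < p_k:                  # subsample w.p. 2^-k
                    cell = cells[rng.randrange(width)]  # hash to a random cell
                    for d, x in enumerate(block):
                        cell[d] += x
            sketch[(k, j)] = cells
    return sketch


def accepts(cells, T_k):
    """Accept iff exactly one cell has norm in (0.9 T_k, 2.1 T_k] and every
    other cell has norm at most 0.9 T_k."""
    norms = [l1_norm(c) for c in cells]
    in_range = sum(1 for v in norms if 0.9 * T_k < v <= 2.1 * T_k)
    too_heavy = sum(1 for v in norms if v > 2.1 * T_k)
    return in_range == 1 and too_heavy == 0


def estimate(sketch, C, levels, t):
    """Return sum_k T_k * (c_k / t) * (1 / p_k), with T_k = C / 2^k."""
    total = 0.0
    for k in range(1, levels + 1):
        T_k = C / 2 ** k
        c_k = sum(accepts(sketch[(k, j)], T_k) for j in range(t))
        total += T_k * (c_k / t) * 2 ** k
    return total


# 16 identical blocks of l1-norm 3, so the true sum of norms is 48 <= C = 64.
M = [[1.0, 2.0] for _ in range(16)]
C, levels, t, width = 64, 8, 400, 64
rng = random.Random(1)
est = estimate(build_sketch(M, levels, t, width, rng), C, levels, t)
true_value = sum(l1_norm(b) for b in M)
print(est, true_value)
```

In this toy instance the block norm sits in the middle of its level, so neighboring levels rarely accept by mistake, and the estimate is expected to fall below the true value by a constant factor, in line with the "under-estimates" remark. Block norms close to a level boundary can be accepted at two adjacent levels under these fixed thresholds; the simulation makes no attempt to handle that case.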

  17. EMD Wrapup
     • We achieve a linear embedding of EMD
       • with constant distortion, namely O(1/ε),
       • into a space of strongly sublinear dimension, namely $\Delta^{\varepsilon}$.
     • Open problems:
       • Getting a (1+ε)-approximation / proving impossibility
       • Reducing the dimension to $\log^{O(1)} \Delta$ / proving a lower bound

  18. What We Did
     • We showed that, in a data stream, one can sketch $\|M\|_{1,X} = \sum_i \|M_i\|_X$ with space about the space complexity of computing (or sketching) $\|\cdot\|_X$
     • This quantity is known as a cascaded norm, written as $L_1(X)$
     • Cascaded norms have many applications [CM, JW]
     • Can we generalize this? E.g., what about $L_2(X)$, i.e., $(\sum_i \|M_i\|_X^2)^{1/2}$?

  19. Cascaded Norms [JW09]
     • No!
     • $L_2(L_1)$, i.e., $(\sum_i \|M_i\|_1^2)^{1/2}$, requires $\Omega(n^{1/2})$ space, where n is the number of different i, but the sketching complexity of $L_1$ is O(log n)
     • More generally, for $p \ge 1$, $L_p(L_1)$, i.e., $(\sum_i \|M_i\|_1^p)^{1/p}$, requires $\Theta(n^{1-1/p})$ space
     • So, $L_1(X)$ is very special
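For readers unfamiliar with the notation, a cascaded norm $L_p(L_q)$ can be computed exactly (with no sketching and no space constraint) as in the following Python snippet. The function name `cascaded_norm` is hypothetical; the snippet only illustrates the definition, not the space bounds.

```python
def cascaded_norm(M, p, q):
    """L_p(L_q): the l_p norm of the vector of per-block l_q norms."""
    block_norms = [sum(abs(x) ** q for x in block) ** (1.0 / q) for block in M]
    return sum(v ** p for v in block_norms) ** (1.0 / p)


M = [[3, 4], [0, 0], [1, -1]]
print(cascaded_norm(M, 1, 1))  # L_1(L_1): 7 + 0 + 2
print(cascaded_norm(M, 2, 1))  # L_2(L_1): sqrt(7^2 + 0^2 + 2^2)
print(cascaded_norm(M, 1, 2))  # L_1(L_2): 5 + 0 + sqrt(2)
```

With p = 1 the outer norm is a plain sum, which is the $L_1(X)$ case covered by the talk's result.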

  20. Thank You!
