Edit Distance and Large Data Sets


# Edit Distance and Large Data Sets - PowerPoint PPT Presentation


##### Presentation Transcript

1. Edit Distance and Large Data Sets. Ravi Kumar, Robert Krauthgamer, Ziv Bar-Yossef, T.S. Jayram (IBM Almaden, Technion)

2. Motivating Example: Near-Duplicate Elimination • Syntactic clustering [Broder, Glassman, Manasse, Zweig 97] • Group pages into clusters of “similar” pages • Keep one “representative” from each cluster • [Figure: Web → Crawler → Page Repository → Duplicate elimination → Page Repository]

3. Syntactic Clustering via Sketching [Broder, Glassman, Manasse, Zweig 97] • Challenges: Corpus is huge (billions of pages, 10K/page); Streaming access; Limited main memory; Linear running time • Sketch: $p \mapsto h(p)$ • Locality Sensitive Hashes [Indyk, Motwani 98]: $\Pr_h[h(p) = h(q)] = \mathrm{sim}(p,q)$ • Can compute sketches in one pass • Sketches can be stored and processed on a single machine • Cluster: Collection of pages that have a common sketch

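4. Shingling and Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98] • w-shingling: $S_w(p)$ = all substrings of $p$ of length $w$ • $\mathrm{resemblance}_w(p,q) = \Pr_\pi[\min(\pi(S_w(p))) = \min(\pi(S_w(q)))] = \frac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$, where $\pi$ is a random permutation of the shingles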
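The min-wise estimate on this slide can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names, the number of trials, and the use of an explicit shuffle as the random permutation are all assumptions made for illustration.

```python
import random

def shingles(p, w):
    """Set of all length-w substrings (w-shingles) of p."""
    return {p[i:i + w] for i in range(len(p) - w + 1)}

def resemblance(p, q, w):
    """Exact resemblance: |S_w(p) & S_w(q)| / |S_w(p) | S_w(q)|."""
    sp, sq = shingles(p, w), shingles(q, w)
    union = sp | sq
    return len(sp & sq) / len(union) if union else 1.0

def minhash_estimate(p, q, w, trials=2000, seed=0):
    """Fraction of random permutations under which both shingle sets
    have the same minimum element (assumes both sets are non-empty)."""
    rng = random.Random(seed)
    sp, sq = shingles(p, w), shingles(q, w)
    universe = sorted(sp | sq)
    agree = 0
    for _ in range(trials):
        order = universe[:]
        rng.shuffle(order)  # a random permutation of the shingles
        rank = {s: r for r, s in enumerate(order)}
        agree += min(sp, key=rank.__getitem__) == min(sq, key=rank.__getitem__)
    return agree / trials
```

For example, `resemblance("abcd", "bcde", 2)` is 0.5 because two of the four distinct 2-shingles are shared, and `minhash_estimate` on the same inputs should land close to that value.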

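5. The Sketching Model • Alice holds $x$, Bob holds $y$; they have shared randomness • Alice sends sketch($x$) and Bob sends sketch($y$) to a Referee • $k$ vs. $r$ Gap Problem. Promise: $d(x,y) \le k$ or $d(x,y) \ge r$. Goal: Decide which of the two holds • Approximation: the Referee answers “$d(x,y) \le k$” or “$d(x,y) \ge r$”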

6. Applications of Sketching • Large data sets: Clustering; Nearest Neighbor schemes; Data streams • Management of Files over the Network: Differential backup; Synchronization • Theory: Low distortion embeddings; Simultaneous messages communication complexity

7. Known Sketching Schemes • Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98] • Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98] [Feigenbaum,Ishai,Malkin,Nissim,Strauss,Wright 01] • Cosine similarity [Charikar 02] • Earth mover distance [Charikar 02] In this talk: Edit Distance

8. Edit Distance • $x \in \Sigma^n$, $y \in \Sigma^m$ • ED(x,y): Minimum number of character insertions, deletions and substitutions that transform x to y • Examples: ED(00000, 1111) = 5, ED(01010, 10101) = 2 • Applications: Genomics; Text processing; Web search • For simplicity: $m = n$, $\Sigma = \{0,1\}$.

9. Computing Edit Distance • Exact Computation: Dynamic programming (1970) $O(n^2)$; Masek and Paterson (1980) $O(n^2/\log n)$ • Impractical for comparing two very long strings. Natural question 1: can we do it in linear time? • Impractical for handling massive document repositories. Natural question 2: are there constant size sketches of edit distance? • Focus of this talk: Can we solve the above problems if we settle for approximation?
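For reference, the quadratic dynamic program mentioned on this slide can be sketched as follows. This is the standard textbook formulation with a rolling row, written here for illustration; the function name is an assumption, and it is the exact baseline, not one of the approximation schemes of the talk.

```python
def edit_distance(x, y):
    """Exact edit distance (insertions, deletions, substitutions)
    in O(len(x) * len(y)) time and O(len(y)) space."""
    # prev[j] = distance between the current prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]
```

Running it on the two examples from the Edit Distance slide gives `edit_distance("00000", "1111") == 5` and `edit_distance("01010", "10101") == 2`.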

10. Sketching Schemes for Edit Distance: Negative Indications • No known embeddings of edit distance into a normed space. • Every embedding of edit distance into $L_1$ incurs distortion $\ge 3/2$ [Andoni, Deza, Gupta, Indyk, Raskhodnikova 03] • Weak nearest neighbor schemes [Indyk 04]

11. Hamming Distance Sketches [Kushilevitz, Ostrovsky, Rabani 98] • Ham(x,y) = # of positions in which x, y differ • Gap: $k$ vs. $2k$ • Sketch size: $O(1)$ • Shared randomness: $r_1,\dots,r_n \in \{0,1\}$ are independent and $\Pr[r_i = 1] = 1/(2k)$ • Sketch: $h(x) = (\sum_i x_i r_i) \bmod 2$, $h(y) = (\sum_i y_i r_i) \bmod 2$ • Analysis: $\Pr[h(x) \ne h(y)] = \Pr[h(x) + h(y) \equiv 1 \pmod 2] = \Pr[\sum_{i:\, x_i \ne y_i} r_i \equiv 1 \pmod 2] = \tfrac{1}{2}\bigl(1 - (1 - 1/k)^{\mathrm{Ham}(x,y)}\bigr)$ • Full sketch: sketch$(x) = (h_1(x),\dots,h_t(x))$, sketch$(y) = (h_1(y),\dots,h_t(y))$, $t = O(1)$
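The parity sketch and its analysis can be simulated directly. The following is a rough sketch under one stated assumption: each $r_i$ is taken to be 1 with probability $1/(2k)$, which is the bias that makes the closed form $\tfrac{1}{2}(1 - (1 - 1/k)^{\mathrm{Ham}(x,y)})$ exact. All function names and the trial count are illustrative.

```python
import random

def parity_sketch(x, r):
    """One sketch bit: h(x) = (sum_i x_i * r_i) mod 2, for 0/1 lists x and r."""
    return sum(xi & ri for xi, ri in zip(x, r)) % 2

def differ_probability(ham, k):
    """Closed form from the slide: 1/2 * (1 - (1 - 1/k)^ham)."""
    return 0.5 * (1.0 - (1.0 - 1.0 / k) ** ham)

def estimate_differ_rate(x, y, k, trials=5000, seed=0):
    """Empirical Pr[h(x) != h(y)] over fresh shared randomness r.
    Assumption: each r_i is 1 independently with probability 1/(2k)."""
    rng = random.Random(seed)
    p = 1.0 / (2 * k)
    differ = 0
    for _ in range(trials):
        r = [1 if rng.random() < p else 0 for _ in x]  # shared by both parties
        differ += parity_sketch(x, r) != parity_sketch(y, r)
    return differ / trials
```

With $k = 4$, inputs at Hamming distance 4 and 8 should differ on a sketch bit with probability about 0.34 and 0.45 respectively; repeating the bit $t = O(1)$ times is what lets the referee separate the two cases with constant confidence.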

12. Edit Distance Sketches: Basic Framework • Underlying Principle: ED(x,y) is small iff x and y share many common substrings at nearby positions. • $S_x$ = set of pairs of the form $(s, h(i))$, where $s$ is a substring of $x$ at position $i$ and $h(i)$ is a “locality sensitive” encoding of the substring’s position • ED(x,y) small iff intersection $S_x \cap S_y$ large • [Figure: $x$ and $y$ with common substrings at nearby positions; overlapping sets $S_x$ and $S_y$]

13. Basic Framework (cont.) • ED(x,y) small iff symmetric difference $S_x \,\Delta\, S_y$ small • Need to estimate size of symmetric difference • Hamming distance computation of characteristic vectors • Use constant size sketches [KOR] • Reduced Edit Distance to Hamming Distance

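14. General Case: Encoding Scheme • Gap: $k$ vs. $O((kn)^{2/3})$ • $B = n^{2/3}/k^{1/3}$, $W = n/B$ • $B$ windows of size $W$ each • $S_x = \{\dots, (a_i, \mathrm{win}(i)), \dots\}$, e.g. $(a_1,1), (a_2,1), (a_3,2)$, where $a_i$ is the substring of $x$ at position $i$ and $\mathrm{win}(i)$ is its window • $S_y = \{\dots, (b_i, \mathrm{win}(i)), \dots\}$, e.g. $(b_1,1), (b_2,1), (b_3,2)$ • [Figure: $x$ and $y$ with positions 1 to 14 split into windows 1, 2, 3; substrings $b_1, b_2, b_3$ marked in $y$]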
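The encoding can be illustrated on a toy input. The sketch below rests on assumptions that the transcript does not state outright: each substring is taken to be the length-$B$ substring starting at position $i$, `win(i)` is taken to be the index of the size-$W$ window containing $i$ (0-based here, 1-based on the slide), and the function names are hypothetical.

```python
def encode(x, B, W):
    """Set of (substring, window) pairs for string x.
    Assumptions: substring i is x[i:i+B]; win(i) = i // W (0-based)."""
    return {(x[i:i + B], i // W) for i in range(len(x) - B + 1)}

def set_hamming(sx, sy):
    """Hamming distance of the characteristic vectors of sx and sy,
    i.e. the size of their symmetric difference."""
    return len(sx ^ sy)
```

For instance, `encode("abcdefgh", 2, 4)` and `encode("xabcdefg", 2, 4)` (the same text shifted by one inserted character) differ in only four pairs: the substrings touching the edit, plus the one substring that the shift pushed across a window boundary.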

15. Analysis [Figure: $x$ and $y$ with positions 1 to 14; substring $i$ of $x$ aligned with substring $b_j$ of $y$] • Case 1: $\mathrm{ED}(x,y) \le k$ • If $i$ is “unmarked”, it has a matching “companion” $j$ • $(a_i, \mathrm{win}(i)) \in S_x \setminus S_y$, where $a_i$ is the substring of $x$ at position $i$, only if: either $i$ is “marked”, or $i$ is unmarked but $\mathrm{win}(i) \ne \mathrm{win}(j)$ • At most $kB$ marked substrings • At most $k \cdot n/W = kB$ companions with mismatched windows • Therefore, $\mathrm{Ham}(S_x,S_y) \le 4kB$

16. Analysis (cont.) [Figure: $x$ and $y$ with positions 1 to 14; substrings of $x$ starting at positions $1, B+1, 2B+1, \dots$; substrings $b_2, \dots, b_{B-1}$ of $y$] • Case 2: $\mathrm{Ham}(S_x,S_y) \le 8kB$ • If $i$ has a “companion” $j$ and $\mathrm{win}(i) = \mathrm{win}(j)$, can align $i$ with $j$ using at most $W$ operations • Otherwise, substitute first character of $i$ • At most $8kB$ substrings of $x$ have no companion • Therefore, $\mathrm{ED}(x,y) \le 8kB + W \cdot n/B = O((kn)^{2/3})$
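The final bound follows by substituting the parameter choices from the encoding slide, $B = n^{2/3}/k^{1/3}$ and $W = n/B$. The arithmetic, spelled out as a worked step:

```latex
\begin{align*}
kB &= k \cdot \frac{n^{2/3}}{k^{1/3}} = k^{2/3} n^{2/3} = (kn)^{2/3}, \\
W \cdot \frac{n}{B} &= \frac{n^2}{B^2} = \frac{n^2 \, k^{2/3}}{n^{4/3}} = (kn)^{2/3}, \\
8kB + W \cdot \frac{n}{B} &= 9\,(kn)^{2/3} = O\bigl((kn)^{2/3}\bigr).
\end{align*}
```

Both terms are of the same order, which suggests why $B$ is chosen this way: it balances the cost of substrings without companions against the cost of realigning the remaining ones.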

17. Non-repetitive Case: Encoding Scheme • $t \ge 1$: “non-repetitiveness” parameter, $W = O(k \cdot t)$ • No substring of length $t$ repeats within a window of size $W$ • Gap: $k$ vs. $O(kW)$ • Alice and Bob choose a sequence of “anchors” in a coordinated way • $\pi_1$: a random permutation on $\{0,1\}^t$ • First anchor of $x$: minimal length-$t$ substring of $x_1$ (under $\pi_1$) • First anchor of $y$: minimal length-$t$ substring of $y_1$ (under $\pi_1$) • [Figure: $x$ and $y$ with windows of size $W$, pieces $x_1, x_2$ and $y_1, y_2$, and numbered anchors]
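Coordinated anchor selection can be sketched as a min-wise choice over the length-$t$ substrings of a window. This is an illustrative sketch only: the function names are hypothetical, the permutation is materialized explicitly (feasible only for small $t$), and the seed stands in for the shared randomness of Alice and Bob.

```python
import random
from itertools import product

def random_permutation_rank(t, seed):
    """Rank table of a random permutation on {0,1}^t.
    The seed plays the role of shared randomness (assumption)."""
    strings = ["".join(bits) for bits in product("01", repeat=t)]
    random.Random(seed).shuffle(strings)
    return {s: r for r, s in enumerate(strings)}

def pick_anchor(window, t, rank):
    """(offset, substring) of the length-t substring of `window`
    that is minimal under the permutation given by `rank`."""
    candidates = [(rank[window[i:i + t]], i)
                  for i in range(len(window) - t + 1)]
    _, i = min(candidates)
    return i, window[i:i + t]
```

Because the window is non-repetitive, the minimum is unique, so if the two parties' windows contain the same minimal substring they pick matching anchors even when an edit has shifted its offset. For example, with the identity order `rank = {s: int(s, 2)}` on 3-bit strings, the window `"0110100"` and the same window with its first character deleted both yield the anchor `"010"`, at offsets 3 and 2.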

18. Encoding scheme (cont.) [Figure: $x$ and $y$ each split by their anchors into 8 numbered substrings] • $S_x = \{(x_1,1),\dots,(x_8,8)\}$ • $S_y = \{(y_1,1),\dots,(y_8,8)\}$

19. Analysis 2 3 7 1 4 5 6 1 3 4 7 2 5 6 8 x 3 7 7 1 6 1 2 4 5 5 4 8 6 2 3 y • Case 1: ED(x,y) · k. • All anchors are “unmarked” with probability 1 - kt/W = (1) • If i,i are unmarked, they are aligned • # of mismatching substrings · 2k • Ham(Sx,Sy) · 2k

20. Analysis (cont.) [Figure: $x$ and $y$ with numbered anchors and substrings] • Case 2: $\mathrm{Ham}(S_x,S_y) \le 4k$ • # of mismatching substrings $\le 4k$ • $\mathrm{ED}(x,y) \le 2 \cdot W \cdot 4k = O(kW)$.

21. Approximation in Linear Time [Table with columns “Arbitrary Strings” and “Non-repetitive Strings”; entries not preserved in the transcript]

22. Summary and Open Problems • Designed efficient approximation schemes for edit distance. • Best sketching and linear-time approximations to date • Subsequent work: $O(n^{2/3})$ distortion embedding of edit distance into $L_1$ [Indyk 04], [Rabani 04]; Better embeddings of edit distance into $L_1$ [Ostrovsky, Rabani 05]; Embeddings of the Ulam metric into $L_1$ [Charikar, Krauthgamer 05] • Open Problems: Sketch size lower bounds; Constant factor approximations in linear time; Better embeddings of edit distance; Sketching schemes for other distance measures

23. Thank You