
Presentation by Itai Dinur

A Sublinear Algorithm For Weakly Approximating Edit Distance. Batu, Ergun, Kilian, Magen, Raskhodnikova, Rubinfeld, Sami. Presentation by Itai Dinur. Edit Distance (Levenshtein distance).


Presentation Transcript


  1. A Sublinear Algorithm For Weakly Approximating Edit Distance • Batu, Ergun, Kilian, Magen, Raskhodnikova, Rubinfeld, Sami • Presentation by Itai Dinur

  2. Edit Distance (Levenshtein distance) • Let A, B be two strings over a fixed alphabet Σ. The edit distance D(A,B) between A and B is defined as the minimum number of character insertions, deletions, and substitutions that transform A into B, or vice versa.

  3. Applications • Bioinformatics • Text processing • Web search

  4. Algorithms • Wagner and Fischer gave a dynamic programming algorithm that runs in time O(n^2) • Masek and Paterson gave an improved algorithm that runs in time O(n^2/log n)
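
For reference, the quadratic dynamic program mentioned above can be sketched as follows. This is the textbook row-by-row formulation, not code from the presentation; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic O(|a|*|b|) dynamic program."""
    # prev[j] = distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # a[:i] versus the empty string: i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        prev = cur
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so the memory use is linear even though the running time stays quadratic.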

  5. The Edit Distance Testing Problem • On input A, B and parameters 0<α<1, C>1: • If D(A,B) ≤ n^α, output CLOSE with probability at least 2/3 • If D(A,B) > n/C, output FAR with probability at least 2/3 • Note that the output is unrestricted for n^α < D(A,B) ≤ n/C • E.g. it cannot distinguish between distances n^0.1 and n^0.9 • The algorithm presented for the problem runs in time Õ(n^max{α/2, 2α-1})
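
The three cases of the testing problem can be made concrete with a brute-force reference that computes the exact distance first. This is only a sketch of the problem statement, not the sublinear algorithm; `reference_answer` is a hypothetical name, and taking n to be the longer of the two lengths is an assumption made here.

```python
def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance (quadratic dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def reference_answer(A: str, B: str, alpha: float, C: float) -> str:
    """Which outputs a correct tester may give on (A, B)."""
    n = max(len(A), len(B))  # assumption: n is the longer input length
    dist = edit_distance(A, B)
    if dist <= n ** alpha:
        return "CLOSE"
    if dist > n / C:
        return "FAR"
    return "EITHER"  # the gap n^alpha < D <= n/C: any output is allowed


print(reference_answer("a" * 16, "b" * 6 + "a" * 10, alpha=0.5, C=2))
```

The "EITHER" case is what makes the approximation weak: inside the gap a tester is free to answer CLOSE or FAR.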

  6. Motivation • In some applications, given many pairs of strings, one is interested in computing the edit distance only for close strings • For string pairs where the edit distance is above a certain threshold, the actual value of the distance is irrelevant

  7. Lower Bound • Any probabilistic algorithm for the edit distance test problem requires Ω(n^(α/2)) queries • The algorithm presented for the problem runs in time Õ(n^max{α/2, 2α-1}), which is close to optimal for α ≤ 2/3
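
Why the threshold sits at α = 2/3 follows from comparing the two terms in the exponent: α/2 ≥ 2α-1 exactly when α ≤ 2/3. A small sketch, with helper names invented here:

```python
import math


def runtime_exponent(alpha: float) -> float:
    """Exponent e such that the running time is O~(n^e)."""
    return max(alpha / 2, 2 * alpha - 1)


def meets_lower_bound(alpha: float) -> bool:
    """True when the exponent equals alpha/2, the lower-bound exponent."""
    return alpha / 2 >= 2 * alpha - 1


for alpha in (0.3, 0.5, 0.6, 0.7, 0.9):
    print(alpha, runtime_exponent(alpha), meets_lower_bound(alpha))

# At alpha = 2/3 both terms coincide at 1/3 (up to floating-point rounding).
print(math.isclose(runtime_exponent(2 / 3), 1 / 3))
```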

  8. Other Approximations • There are several papers that give better approximation results, but none run in sublinear time • Andoni and Onak give an algorithm that approximates the edit distance between two strings up to a factor of 2^(Õ(√log n)) in n^(1+o(1)) time

  9. Algorithm Overview • A recursive divide and conquer algorithm • B is broken into substrings which are recursively matched against A • The matches are pieced together to form a matching for A • It is too expensive to match all the substrings • A small number of them are sampled and matched, relying on statistical properties of the matchings

  10. Approximate Matching • Definition 1: An interval I = B[s…e] has a (t,E)-(approximate) matching with respect to A if for some interval A[s’…e’], s’ = s+t and D(A[s’…e’], I) ≤ E • Example (figure): A = abcd1234efgh5678, B = cd02; I has a (2,1)-(approximate) matching with respect to A
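
Definition 1 can be checked by brute force on the slide's example. The sketch below assumes that I is all of B (I = B[1…4]), which the transcript does not state explicitly, and uses 1-indexed positions as on the slide; `has_matching` is a name introduced here.

```python
def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance (quadratic dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def has_matching(A: str, B: str, s: int, e: int, t: int, E: int) -> bool:
    """True if I = B[s..e] (1-indexed, inclusive) has a (t, E)-matching with A."""
    I = B[s - 1:e]
    s2 = s + t  # s' on the slide: the start in A is fixed by the shift t
    if s2 < 1 or s2 > len(A) + 1:
        return False
    # The end e' is free, so try every interval A[s'..e'] (including the empty one).
    return any(edit_distance(A[s2 - 1:e2], I) <= E
               for e2 in range(s2 - 1, len(A) + 1))


A, B = "abcd1234efgh5678", "cd02"
print(has_matching(A, B, s=1, e=4, t=2, E=1))  # cd02 vs A[3..6] = cd12
```

With t = 2 the interval lines up with A[3…6] = cd12, which differs from cd02 by one substitution.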

  11. Coordinated Matching • Definition 2: Let I = (I_1,…,I_k) be a collection of intervals. We say that I has a (t,σ,E,D)-coordinated matching with A if for all but D of the intervals I_i ∈ I, I_i has a (t_i,E)-matching with A, where |t-t_i| ≤ σ • Example (figure): A = abcd1234efgh5678, B = cd0236gjfkl5; I has a (1,1,2,1)-coordinated matching with A

  12. Coordinated Matching to Approximate Matching • We decompose an interval I of size S into k disjoint continuous subintervals, I = (I_1,…,I_k), each of size S’ = S/k (assuming k|S) • Lemma 1: If (I_1,…,I_k) has a (t,σ,εS’,δk)-coordinated matching with A, then I has a (t,βS)-(approximate) matching with A, where β = 2σ/S’ + ε + δ

  13. Approximate Matching to Coordinated Matching • Lemma 2: Let c>1 and S>cE. If I has a (t,E)-matching with A then I = (I_1,…,I_k) has a (t,E,cE/k,k/c)-coordinated matching with A • Lemma 3: If I has a (t,E)-matching with A, and k ≥ E, then I = (I_1,…,I_k) has a (t,E,0,E)-coordinated matching with A

  14. To match A and B • Decompose B into a set of continuous disjoint intervals I • Lemma 2 argues that a match for A and B gives a coordinated matching for A and I • Use a subroutine (COORD-MATCHES) to find coordinated matches for I • Lemma 1 infers the existence of good matches for B from coordinated matches for I

  15. COORD-MATCHES • COORD-MATCHES(A,I,σ,E,D,ε,c) • Let d be a constant, l = d·log(n). Choose samples i_1,…,i_l uniformly and independently from [1,…,k] • For each chosen sample i_j compute T_j = MATCHES(A, I_{i_j}, E) • Let Δ = (D/k + ε/2)·l • Return the set T, where t ∈ T iff T_j ∩ [t-σ…t+σ] = Ø for at most Δ sets T_j
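
The sampling-and-voting structure of COORD-MATCHES can be sketched as below. Several things are assumptions of this sketch rather than part of the slide: MATCHES is passed in as a callable `matches` (so A and E are captured inside it), the shifts to test come from an explicit `candidates` iterable, the constant d defaults to 4, and the natural logarithm is used.

```python
import math
import random
from typing import Callable, Iterable, Sequence, Set


def coord_matches(intervals: Sequence[str],
                  matches: Callable[[str], Set[int]],
                  candidates: Iterable[int],
                  sigma: int, D: float, eps: float, n: int,
                  d: float = 4.0, rng: random.Random = None) -> Set[int]:
    rng = rng or random.Random()
    k = len(intervals)
    l = max(1, math.ceil(d * math.log(n)))  # number of sampled intervals
    # T_j = MATCHES(A, I_{i_j}, E) for l indices drawn uniformly with repetition
    sampled = [matches(intervals[rng.randrange(k)]) for _ in range(l)]
    delta = (D / k + eps / 2) * l  # allowed number of "missing" samples
    result = set()
    for t in candidates:
        # A sample misses t if none of its shifts lies in [t - sigma, t + sigma].
        misses = sum(1 for T_j in sampled
                     if not any(abs(t - tj) <= sigma for tj in T_j))
        if misses <= delta:
            result.add(t)
    return result


# Toy usage: every interval reports shift 3, so shifts within sigma of 3 survive.
print(coord_matches(["x"] * 10, lambda I: {3}, range(7),
                    sigma=1, D=0, eps=0.1, n=1000))
```

Passing MATCHES as a parameter keeps the sketch independent of how the recursion bottoms out.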

  16. Sampling Lemma • Lemma 4: Suppose that a random element of a set S of size n has a property Z with probability p. For any positive ε and c, there exists d such that for d·log(n) random samples from S the fraction p’ of these samples with property Z satisfies p-ε/2 ≤ p’ ≤ p+ε/2 with probability 1-1/n^c
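
One standard way to obtain such a d is Hoeffding's inequality, which gives Pr[|p’-p| ≥ ε/2] ≤ 2·exp(-m·ε^2/2) for m independent samples. The transcript does not say which bound the authors use, so the following is only a sketch under that assumption; `samples_needed` is a hypothetical helper.

```python
import math


def samples_needed(n: int, eps: float, c: float) -> int:
    """Smallest m with 2*exp(-m*eps^2/2) <= n^(-c), i.e. m = O(log n)."""
    return math.ceil(2 * (c * math.log(n) + math.log(2)) / eps ** 2)


n, eps, c = 10 ** 6, 0.2, 2
m = samples_needed(n, eps, c)
failure_bound = 2 * math.exp(-m * eps ** 2 / 2)
print(m, failure_bound <= n ** -c)
```

Solving for m shows it grows like (2c/ε^2)·ln(n), which is the d·log(n) shape the lemma asks for.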

  17. COORD-MATCHES • Lemma 5: With probability 1-1/n^(c-1) over the random coins of COORD-MATCHES, the output T of COORD-MATCHES(A,I,σ,E,D,ε,c) has the following properties: • If I has a (t,σ,E,D)-coordinated matching then t ∈ T • If t ∈ T then I has a (t,σ,E,D+εk)-coordinated matching

  18. MATCHES(A,I,E) • If E ≥ 1, use a recursive call to COORD-MATCHES • If E<1 (i.e. E=0), then A must contain the interval I unchanged. The set of t values is computed directly using the algorithm SHIFTS

  19. Implementing SHIFTS • A naïve implementation of SHIFTS may give an output set T consisting of n elements • We may restrict the allowed shifts to [-n^α,…,+n^α] • However, we need a running time of o(n^α), so we must further restrict the set of possible outputs

  20. The Approximate Matching problem • Actually, we will solve the approximate matching problem: Given a block I = B[s…e] of length b = e-s+1, and a constant c_2>1, find all indexes s’ such that A[s’…(s’+b-1)] matches I, in the sense that the two substrings have Hamming distance at most b/c_2 • Note that if D(A,B) < n^α, it is enough to consider s’ in the interval [s-n^α, s+n^α]

  21. The Approximate Matching problem • Naively, we can randomly sample O(log(n)) indexes i to determine (with high probability) if the substring A[(t+1)…(t+b)] matches I, for a given t, and try all 2n^α possible shifts • Requires Ω(n^α) queries to A

  22. The Ruler Procedure • We can compare pairs of characters A[i], I[j] such that a pair is compared for every difference i-j from 0 to u = 2n^α, with √u queries to each string, given that b > √u • In A, character positions divisible by √u are queried: A[√u, 2√u, …, u]. In I, √u consecutive positions are queried: I[1…√u] • Define cen = ⌊t/√u⌋+1 and mil = t (mod √u); then for i = cen·√u and j = √u-mil we have i-j = t
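
The cen/mil decomposition can be checked directly: every shift t is the difference between one coarse mark in A and one fine mark in I. A minimal sketch, assuming √u is an integer (written `root_u` here) and checking t = 0,…,u-1:

```python
def ruler_marks(t: int, root_u: int):
    """Return (i, j): a cen mark i in A and a mil mark j in I with i - j == t."""
    cen = t // root_u + 1
    mil = t % root_u
    i = cen * root_u  # a position in A divisible by sqrt(u)
    j = root_u - mil  # one of the positions 1..sqrt(u) in I
    return i, j


root_u = 3  # u = 9, as in the slides' later example
for t in range(root_u * root_u):
    i, j = ruler_marks(t, root_u)
    print(t, i, j, i - j == t)
```

Only √u marks are needed on each side, yet their pairwise differences cover all u shifts, which is where the square-root saving comes from.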

  23. The Ruler Procedure • To test whether a block matches: pick l = Θ(log(n)) random numbers m_1, m_2, …, m_l from [0, b-√u] • For each cen and mil mark construct a fingerprint with l offsets, e.g. f(√u) = A[√u+m_1, √u+m_2, …, √u+m_l] • Detect with high probability if a block matches with shift t by comparing the cen and mil fingerprints, i.e. f(cen·√u) = A[cen·√u+m_1, …, cen·√u+m_l] and f(t (mod √u)) = I[t (mod √u)+m_1, …, t (mod √u)+m_l]

  24. The Ruler Procedure • If b≤√u we have only O(b) mil marks and Ω(u/b) cen marks • We can find all matching shifts by using O(max{√u,u/b}log(n)) queries

  25. Efficient Implementation of the Ruler • We need an efficient algorithm to compare all fingerprints and return the valid shifts • Example (figure): u = |A|-|B| = 9, √u = 3, l = 2, m_1 = 1, m_2 = 3, A = dbadaabcdabddcd, B = abcdab

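  26. Efficient Implementation of the Ruler • (Figure, continued: same example, u = |A|-|B| = 9, √u = 3, l = 2, m_1 = 1, m_2 = 3, A = dbadaabcdabddcd, B = abcdab)

  27. Efficient Implementation of the Ruler • (Figure, continued: same example, u = |A|-|B| = 9, √u = 3, l = 2, m_1 = 1, m_2 = 3, A = dbadaabcdabddcd, B = abcdab)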

  28. Quantizing the Ruler • The explicit list of all matching t can have Ω(u) values • We round the values of t to multiples of some integer Q and return all quantized shifts • The running time is O(max{√u,u/b,u/Q}log(n))
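
Quantization itself is a one-line operation. The slide does not say whether shifts are rounded down or to the nearest multiple; the sketch below assumes rounding down, and `quantize` is a name introduced here.

```python
def quantize(shifts, Q: int):
    """Round each shift down to a multiple of Q and drop duplicates."""
    return sorted({(t // Q) * Q for t in shifts})


print(quantize([0, 1, 2, 5, 7, 8], Q=3))
```

After quantization at most u/Q + 1 distinct values remain in a range of u shifts, which matches the u/Q term in the running time.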

  29. SHIFTS(A,I,Q) • Initialize the fingerprint data structure • Pick l = Θ(log(n)) random numbers m_1, m_2, …, m_l • Add all the fingerprints f(i) of A to the data structure, adding i to the A-list of f(i) • Add all the fingerprints f(j) of I to the data structure, adding j to the B-list of f(j) • Quantize all A-lists and B-lists • For each fingerprint, output the list of quantized shifts (differences)
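
The fingerprint bookkeeping can be sketched on the example from the ruler slides. This is a simplified illustration, not the authors' implementation: it tests exact equality of fingerprints, omits quantization, uses a dictionary as the fingerprint data structure, takes ⌊√u⌋ as the mark spacing, and assumes b ≥ √u. Positions are 1-indexed as on the slides; the parameter names are chosen here.

```python
import math
import random
from collections import defaultdict


def shifts(A: str, I: str, offsets=None, num_offsets: int = 8, rng=None):
    """Candidate shifts t such that I[x] lines up with A[x + t] at all sampled offsets."""
    b = len(I)
    u = len(A) - b
    if u < 0:
        raise ValueError("I must not be longer than A")
    root = max(1, math.isqrt(u))
    if root > b:
        raise ValueError("this sketch assumes b >= sqrt(u)")
    if offsets is None:
        rng = rng or random.Random()
        offsets = [rng.randint(0, b - root) for _ in range(num_offsets)]

    # A-lists: cen marks i = root, 2*root, ... grouped by their fingerprint.
    a_lists = defaultdict(list)
    for i in range(root, u + root + 1, root):
        a_lists[tuple(A[i + m - 1] for m in offsets)].append(i)

    # Compare each mil mark j = 1..root of I against the A-lists.
    found = set()
    for j in range(1, root + 1):
        fingerprint = tuple(I[j + m - 1] for m in offsets)
        for i in a_lists.get(fingerprint, ()):
            if i - j <= u:  # keep only shifts for which I fits inside A
                found.add(i - j)
    return found


A, B = "dbadaabcdabddcd", "abcdab"
print(shifts(A, B, offsets=(1, 3)))  # offsets m_1 = 1, m_2 = 3 from the slides
```

A true exact occurrence is always reported, because all of its sampled characters agree; a reported shift may still be a false positive when too few offsets are sampled.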

  30. SHIFTS(A,I,Q) • Theorem 1: Procedure SHIFTS finds all quantized shifts of interval I in A, with high probability. It runs in time O(max{√u,u/b,u/Q}log(n)), where u=|A|-b

  31. MATCHES(A,I,E) • If E<1, use SHIFTS to compute T • If E ≥ 1 • Set k = min{εn^(1-α), 2c_1E} • Decompose I into a set I of continuous disjoint intervals of size |I|/k • Compute T = COORD-MATCHES(A,I,E,c_1E/k,k/c_1) • Return T

  32. DECIDE(A,B,α,C) • Choose sufficiently small ε, and sufficiently large c_1 (given α, C) • Let the quantization parameter be Q = ε·min{n^(1-α), n^(α/2)} • Set T = MATCHES(A,B,n^α) • If T is nonempty, output CLOSE, otherwise output FAR

  33. DECIDE(A,B,α,C) • For any fixed α<1, we can choose constants ε and c_1 such that procedure DECIDE solves the edit distance testing problem with high probability

  34. Running Time Analysis • Note that when k = 2c_1E, COORD-MATCHES is called with edit distance parameter c_1E/k = 1/2 < 1, i.e. the next call to MATCHES will call SHIFTS and end the recursion • At each level, the interval input to MATCHES shrinks by a factor of k = Ω(n^(1-α)); after r = α/(1-α) levels the intervals are of length n/n^(r(1-α)) = O(n^(1-α)), E = O(n^α/n^(r(1-α))) = O(1), and SHIFTS will be called next

  35. Running Time Analysis: α<1/2 • One level of recursion • B is broken into intervals of size O(n^α) • d·log(n) calls to SHIFTS with Q = εn^(α/2) • Each call takes O(max{√u, u/b, u/Q}·log(n)) = O(max{n^(α/2), 1, n^(α/2)}·log(n)) = O(n^(α/2)·log(n)) • One merge taking O(n^(α/2)·log(n)) • Total running time O(n^(α/2)·log^2(n))

  36. Running Time Analysis: 1/2<α<2/3 • Two levels of recursion • At the last level, B is broken into intervals of size O(n^(α/2)) • log^2(n) calls to SHIFTS with Q = εn^(α/2) • Each call takes O(n^(α/2)·log(n)) • log(n) merges each taking O(n^(α/2)·log(n)) • Total running time O(n^(α/2)·log^3(n))

  37. Running Time Analysis: α>2/3 • r>2 levels of recursion • At the last level, B is broken into intervals of size O(n^(1-α)) • log^O(1)(n) calls to SHIFTS with Q = εn^(1-α) • Note that n^(1-α) < n^(α/2) • Each call takes O(max{√u, u/b, u/Q}·log(n)) = O((u/b)·log(n)) = O(n^(2α-1)·log(n)) • Total running time Õ(n^(2α-1)·log(n))

  38. Conclusion • We saw an algorithm for the edit distance test problem that runs in time Õ(n^max{α/2, 2α-1}) • Any probabilistic algorithm for the edit distance test problem requires Ω(n^(α/2)) queries
