Presentation Transcript
On Embedding Edit Distance into L_1

Robert Krauthgamer (Weizmann Institute and IBM Almaden)

Based on joint work with Moses Charikar, with Yuval Rabani, with Parikshit Gopalan and T.S. Jayram, and with Alex Andoni.

Edit Distance

x ∈ Σ^n, y ∈ Σ^m

ED(x,y) = Minimum number of character insertions, deletions and substitutions that transform x to y. [aka Levenshtein distance]

Examples:

ED(00000, 1111) = 5

ED(01010, 10101) = 2

Applications:

  • Genomics
  • Text processing
  • Web search

For simplicity: m = n.
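
As a quick illustration of this definition (not from the original slides), here is a minimal dynamic-programming sketch of edit distance, the standard Wagner-Fischer recurrence; it reproduces the two examples above.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of character insertions, deletions and substitutions turning x into y."""
    n, m = len(x), len(y)
    dp = list(range(m + 1))          # dp[j] = ED(empty prefix of x, y[:j])
    for i in range(1, n + 1):
        prev, dp[0] = dp[0], i       # prev holds the diagonal value dp[i-1][j-1]
        for j in range(1, m + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # delete x[i-1]
                        dp[j - 1] + 1,                   # insert y[j-1]
                        prev + (x[i - 1] != y[j - 1]))   # substitute (free if equal)
            prev = cur
    return dp[m]

assert edit_distance("00000", "1111") == 5
assert edit_distance("01010", "10101") == 2
```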


Embedding into L1

An embedding of (X,d) into l1 is a map f : X → l1.

It has distortion K ≥ 1 if

d(x,y) ≤ ||f(x)-f(y)||_1 ≤ K·d(x,y)   for all x,y ∈ X

Very powerful concept (when distortion is small)

Goal: Embed edit distance into l1 with small distortion

Motivation:

Reduce algorithmic problems to l1

E.g. Nearest-Neighbor Search

Study a simple metric space without norm

E.g. Hamming cube w/cyclic shifts.
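
To make the distortion definition concrete, here is a small sketch (my own toy example, not from the talk) that computes the distortion of a given map f into l1 over a finite point set, as the product of the worst-case expansion and the worst-case contraction (i.e., allowing rescaling of f).

```python
from itertools import combinations

def distortion(points, d, f):
    """Distortion of f as a map from (points, d) into l_1, allowing rescaling of f:
    the product of the worst-case expansion and the worst-case contraction."""
    expansion = contraction = 0.0
    for x, y in combinations(points, 2):
        l1 = sum(abs(a - b) for a, b in zip(f(x), f(y)))
        expansion = max(expansion, l1 / d(x, y))       # worst stretch
        contraction = max(contraction, d(x, y) / l1)   # worst shrink
    return expansion * contraction

# Toy example: the 4-cycle metric mapped to the corners of the unit square (an l_1 point set).
cycle = [0, 1, 2, 3]
d_cycle = lambda x, y: min(abs(x - y), 4 - abs(x - y))
f = lambda x: [(0, 0), (1, 0), (1, 1), (0, 1)][x]
print(distortion(cycle, d_cycle, f))  # 1.0, i.e. this embedding is isometric
```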

Known Results for Edit Distance

Embedding ({0,1}^n, ED) into L1:

  • Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani’05]
    • Previous bound: O(n^{2/3}) [Bar-Yossef-Jayram-K.-Kumar’04]
  • Lower bound: Ω(log n) [K.-Rabani’06]
    • Previous bounds: (log n)^{1/2-o(1)} [Khot-Naor’05] and 3/2 [Andoni-Deza-Gupta-Indyk-Raskhodnikova’03]

Large gap … despite significant effort!

Submetrics (Restricted Strings)
  • Why focus on submetrics of edit distance?
    • May admit smaller distortion
    • Partial progress towards general case
    • A framework for analyzing non-worst-case instances
      • Example (a la computational biology): Handle only “typical” strings
  • Class 1:
    • A string is k-non-repetitive if all of its length-k (contiguous) substrings are distinct (a simple check is sketched below)
    • A random 0-1 string is WHP (2log n)-non-repetitive
      • Yields a submetric containing 1-o(1) fraction of the strings
  • Class 2:
    • Ulam metric = edit distance on all permutations (here Σ = {1,…,n})
    • Every permutation is 1-non-repetitive
    • Note: k-non-repetitive strings embed into Ulam with distortion k.

[Figure: illustration of a k-non-repetitive string, with k = 7]

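As referenced in the Class 1 bullet above, here is a minimal check of k-non-repetitiveness (my own sketch, not from the talk); on a random 0-1 string of length n it typically succeeds already for k around 2·log2(n), in line with the w.h.p. claim.

```python
import random

def is_k_nonrepetitive(s: str, k: int) -> bool:
    """True iff all length-k contiguous substrings of s are distinct."""
    windows = [s[i:i + k] for i in range(len(s) - k + 1)]
    return len(set(windows)) == len(windows)

# Class 2: every permutation is 1-non-repetitive.
assert is_k_nonrepetitive("31425", 1)

# Class 1: a random 0-1 string of length n is w.h.p. (2 log n)-non-repetitive.
n = 1024
s = "".join(random.choice("01") for _ in range(n))
print(is_k_nonrepetitive(s, 2 * n.bit_length()))  # usually True
```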

Known Results for Ulam Metric

Embedding ({0,1}^n, ED) into L1:

  • Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani’05]
  • Lower bound: Ω(log n) [K.-Rabani’06]
  • Large gap!

Embedding the Ulam metric into L1:

  • Upper bound: O(log n) [Charikar-K.’06] (new proof by [Gopalan-Jayram-K.])
  • Lower bound: Ω(log n / loglog n) [Andoni-K.’07] (actually qualitatively stronger)
  • Near-tight!

Embedding of permutations

Theorem [Charikar-K.’06]: The Ulam metric of dimension n embeds into l1 with distortion O(log n).

Proof. Define f(P) = ( f_{a,b}(P) )_{a≠b}, with one coordinate per pair of symbols, where f_{a,b}(P) = 1/(P^{-1}(b) - P^{-1}(a)).

Intuition:

  • sign(f_{a,b}(P)) indicates whether “a appears before b” in P
  • Thus, |f_{a,b}(P)-f_{a,b}(Q)| “measures” whether {a,b} is an inversion in P vs. Q

Claim 1: ||f(P)-f(Q)||_1 ≤ O(log n) · ED(P,Q)

  • Suppose Q is obtained from P by moving one symbol, say ‘s’
    • The general case then follows by applying the triangle inequality to P, P’, P’’, …, Q
  • Total contribution of
    • coordinates with s ∈ {a,b} is 2·Σ_k (1/k) ≤ O(log n)
    • other coordinates is Σ_k k·(1/k - 1/(k+1)) ≤ O(log n)

Embedding of permutations (cont.)

Theorem [Charikar-K.’06]: The Ulam metric of dimension n embeds into l1 with distortion O(log n).

Proof (cont.). Recall f(P) = ( f_{a,b}(P) )_{a≠b} with f_{a,b}(P) = 1/(P^{-1}(b) - P^{-1}(a)).

Claim 1: ||f(P)-f(Q)||_1 ≤ O(log n) · ED(P,Q)

Claim 2: ||f(P)-f(Q)||_1 ≥ ½·ED(P,Q)

  • Assume w.l.o.g. that P = identity
  • Edit Q into an increasing sequence (thus into P) using quicksort:
    • Choose a random pivot,
    • Delete all characters inverted w.r.t. the pivot
    • Repeat recursively on the left and right portions
  • Now argue ||f(P)-f(Q)||_1 ≥ E[ #quicksort deletions ] ≥ ½·ED(P,Q)

The surviving subsequence is increasing, hence

ED(P,Q) ≤ 2·(#deletions)

For every inversion (a,b) in Q:

Pr[a deleted “by” pivot b] ≤ 1/(|Q^{-1}(a)-Q^{-1}(b)|+1) ≤ 2·|f_{a,b}(P) - f_{a,b}(Q)|
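
To make the construction concrete, here is a small numerical sketch (my reconstruction from this proof sketch; the exact normalization in [Charikar-K.’06] may differ) of the coordinates f_{a,b}, comparing ||f(P)-f(Q)||_1 against ED(P,Q) on random permutations.

```python
import random
from itertools import combinations

def f(P):
    """One coordinate per pair of symbols {a,b} (a < b): value 1/(pos_P(b) - pos_P(a)),
    so the sign records which symbol appears first and the magnitude is 1/distance."""
    pos = {v: i for i, v in enumerate(P)}
    return {(a, b): 1.0 / (pos[b] - pos[a]) for a, b in combinations(sorted(P), 2)}

def l1(u, v):
    return sum(abs(u[c] - v[c]) for c in u)

def edit_distance(P, Q):
    """Standard DP for insertions/deletions/substitutions (here applied to permutations)."""
    n, m = len(P), len(Q)
    dp = list(range(m + 1))
    for i in range(1, n + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, m + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (P[i - 1] != Q[j - 1]))
            prev = cur
    return dp[m]

n = 40
P = list(range(n))                  # w.l.o.g. the identity permutation, as in Claim 2
for _ in range(5):
    Q = P[:]
    random.shuffle(Q)
    ed, emb = edit_distance(P, Q), l1(f(P), f(Q))
    if ed == 0:
        continue                    # Q happened to equal P
    # Claims 1 and 2 predict: 0.5 * ed <= emb <= O(log n) * ed
    print(f"ED = {ed:3d}   ||f(P)-f(Q)||_1 = {emb:7.2f}   ratio = {emb / ed:.2f}")
```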

Lower bound for 0-1 strings

Theorem [K.-Rabani’06]: Embedding ({0,1}^n, ED) into L1 requires distortion Ω(log n)

Proof sketch:

  • Suppose ({0,1}^n, ED) embeds into L1 with distortion D ≥ 1, and let V = {0,1}^n.
  • By the cut-cone characterization of L1, the embedding f can be written as a nonnegative combination of cut metrics:

    ||f(x)-f(y)||_1 = Σ_{A⊆V} w_A · δ_A(x,y),   with all w_A ≥ 0.

  • Hence, averaging this decomposition under any two symmetric probability distributions μ and ν over V×V and comparing term-by-term,

    min_{A⊆V} E_ν[δ_A(x,y)] / E_μ[δ_A(x,y)]  ≤  E_ν[||f(x)-f(y)||_1] / E_μ[||f(x)-f(y)||_1]  ≤  D · E_ν[ED(x,y)] / E_μ[ED(x,y)]      (*)

Lower bound for 0-1 strings

Theorem [K.-Rabani’06]: Embedding ({0,1}^n, ED) into L1 requires distortion Ω(log n)

Proof sketch:

  • Suppose ({0,1}^n, ED) embeds into L1 with distortion D ≥ 1, and let V = {0,1}^n.
  • By the cut-cone characterization of L1, inequality (*) holds for all symmetric probability distributions μ and ν over V×V.
  • We choose:
    • μ = uniform over V×V
    • ν = ½(H+S) where
      • H = random point + random bit flip (uniform over E_H = {(x,y): ||x-y||_1 = 1})
      • S = random point + a cyclic shift (uniform over E_S = {(x, S(x))})
  • The RHS of (*) evaluates to O(D/n) by a counting argument.
  • Main Lemma: For all A ⊆ V, the ratio E_ν[δ_A]/E_μ[δ_A] on the LHS of (*) is Ω(log n)/n.
    • Analysis of Boolean functions on the hypercube
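
A small sketch (mine, with S taken to be a cyclic shift by one position) of how one might sample from the two distributions chosen above; it is handy for sanity-checking that E_ν[ED] = O(1) while E_μ[ED] = Θ(n), which is what makes the RHS of (*) evaluate to O(D/n).

```python
import random

n = 16

def random_point():
    return [random.randint(0, 1) for _ in range(n)]

def sample_mu():
    """mu: a uniformly random pair (x, y) in V x V."""
    return random_point(), random_point()

def sample_H():
    """H: a random point plus a random single bit flip (a Hamming-cube edge)."""
    x = random_point()
    y = x[:]
    y[random.randrange(n)] ^= 1
    return x, y

def sample_S():
    """S: a random point paired with its cyclic shift by one position."""
    x = random_point()
    return x, x[-1:] + x[:-1]

def sample_nu():
    """nu = 1/2 (H + S): a fair coin flip between the two edge distributions."""
    return sample_H() if random.random() < 0.5 else sample_S()
```

Under H the edit distance of the sampled pair is exactly 1, and under S it is at most 2 (delete the last symbol and reinsert it at the front), whereas two uniformly random strings have edit distance Θ(n).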

Lower bound for 0-1 strings – cont.
  • Recall ν = ½(H+S) where
    • H = random point + random bit flip
    • S = random point + a cyclic shift
  • Lemma: For all A ⊆ V, the ratio E_ν[δ_A]/E_μ[δ_A] on the LHS of (*) is Ω(log n)/n.
  • Proof sketch:
    • Assume the contrary, and define f = 1_A.

Lower bound for 0-1 strings – cont.

  • Claim: I_j ≥ 1/n^{1/8}  ⇒  I_{j+1} ≥ 1/(2n^{1/8})
  • Proof:

[Figure: commutative diagram showing that flipping bit j and then applying the cyclic shift S is the same as applying S and then flipping bit j+1, i.e., S(x + e_j) = S(x) + e_{j+1}.]
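
A tiny sanity check (mine) of the identity from the figure: with S the cyclic shift by one position to the right and e_j the j-th standard basis vector over GF(2), flipping bit j and then shifting is the same as shifting and then flipping bit j+1 (indices mod n).

```python
import random

n = 16

def S(x):
    """Cyclic shift by one position to the right: bit j of x moves to position j+1 (mod n)."""
    return x[-1:] + x[:-1]

def flip(x, j):
    """Add e_j over GF(2), i.e. flip bit j."""
    y = x[:]
    y[j] ^= 1
    return y

x = [random.randint(0, 1) for _ in range(n)]
for j in range(n):
    assert S(flip(x, j)) == flip(S(x), (j + 1) % n)   # S(x + e_j) = S(x) + e_{j+1}
```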

Communication Complexity Approach

[Figure: Alice holds x ∈ Σ^n, Bob holds y ∈ Σ^n; using shared randomness they exchange CC_A bits.]

Communication complexity model:

  • Two-party protocol (Alice and Bob)
  • Shared randomness
  • Promise (gap) version
  • A = approximation factor
  • CC_A = min. # bits to decide w.h.p.

Previous communication lower bounds:

  • l_∞ [Saks-Sun’02, Bar-Yossef-Jayram-Kumar-Sivakumar’04]
  • l_1 [Woodruff’04]
  • Earthmover [Andoni-Indyk-K.’07]

Distance Estimation Problem:

decide whether d(x,y) ≥ R or d(x,y) ≤ R/A

Communication Bounds for Edit Distance

A tradeoff between approximation and communication

  • Theorem [Andoni-K.’07]:

Corollary 1: Approximation A = O(1) requires CC_A ≥ Ω(loglog n)

Corollary 2: Communication CC_A = O(1) requires A ≥ Ω*(log n)

For Hamming distance: CC_{1+ε} = O(1/ε²)

[Kushilevitz-Ostrovsky-Rabani’98], [Woodruff’04]
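
For contrast with edit distance, here is a hedged sketch in the spirit of the [Kushilevitz-Ostrovsky-Rabani’98] Hamming sketch (my simplified rendering, not the exact protocol; the constant 32 and the helper names are illustrative): both parties use shared randomness to compute parities over random coordinate subsets, and the disagreement rate of the two bit-strings is monotone in the Hamming distance.

```python
import random

def parity_sketch(x, seeds, p):
    """For each shared seed, the parity of x over a random coordinate subset,
    where each coordinate is kept independently with probability p."""
    bits = []
    for seed in seeds:
        rng = random.Random(seed)        # shared randomness: both parties draw the same subsets
        kept = [i for i in range(len(x)) if rng.random() < p]
        bits.append(sum(x[i] for i in kept) % 2)
    return bits

def decide_gap_hamming(x, y, R, eps, public_seed="public-coin"):
    """Distinguish Hamming(x,y) >= R from Hamming(x,y) <= R/(1+eps)
    by exchanging O(1/eps^2) sketch bits (the constant 32 is illustrative)."""
    k = int(32 / eps ** 2)
    seeds = [f"{public_seed}:{i}" for i in range(k)]
    p = 1.0 / (2 * R)
    sx = parity_sketch(x, seeds, p)      # computed by Alice
    sy = parity_sketch(y, seeds, p)      # computed by Bob
    disagree = sum(a != b for a, b in zip(sx, sy)) / k
    # If d = Hamming(x,y), each pair of parities differs with probability (1-(1-2p)^d)/2,
    # which is strictly increasing in d, so thresholding at the midpoint of the gap works.
    mid_d = R / (1 + eps / 2)
    threshold = (1 - (1 - 2 * p) ** mid_d) / 2
    return "far (d >= R)" if disagree > threshold else "close (d <= R/(1+eps))"
```

By contrast, Corollary 1 above says that no constant-communication protocol achieves a constant-factor approximation for edit distance.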

First computational model where edit is provably harder than Hamming!

Implications for embeddings:

  • Embedding ED into L1 (or squared-L2) requires distortion Ω*(log n)
  • Furthermore, this holds both for 0-1 strings and for permutations (Ulam)

Proof Outline

Step 1 [Yao’s minimax theorem]: Reduce to distributional complexity

If CC_A ≤ k, then for every two distributions μ_far, μ_close there is a k-bit deterministic protocol with success probability ≥ 2/3.

Step 2 [Andoni-Indyk-K.’07]: Reduce to 1-bit protocols

Further to the above, there are Boolean functions s_A, s_B : Σ^n → {0,1} with advantage

Pr_{(x,y)~μ_far}[s_A(x) ≠ s_B(y)] - Pr_{(x,y)~μ_close}[s_A(x) ≠ s_B(y)] ≥ Ω(2^{-k})

Step 3 [Fourier expansion]: Reduce to one Fourier level ℓ

Furthermore, s_A, s_B depend only on fixed positions j_1, …, j_ℓ

Step 4 [Choose distribution]: Analyze (x,y) ~ μ_close/μ_far projected on these positions

Let μ_close, μ_far include ε-noise ⇒ handle a high level ℓ

Let μ_close, μ_far include (few/more) block rotations ⇒ handle a low level ℓ

Step 5: Reduce Ulam to {0,1}^n

A random mapping Σ → {0,1} works

Compare this additive analysis to our previous analysis:

Key property: the distribution of (x_{j_1}, …, x_{j_ℓ}, y_{j_1}, …, y_{j_ℓ}) is “statistically close” under μ_far vs. under μ_close

Summary of Known Results

Embedding ({0,1}^n, ED) into L1:

  • Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani’05]
  • Lower bound: Ω(log n) [K.-Rabani’06]

Embedding the Ulam metric into L1:

  • Upper bound: O(log n) [Charikar-K.’06] (new proof by [Gopalan-Jayram-K.])
  • Lower bound: Ω(log n / loglog n) [Andoni-K.’07] (qualitatively much stronger)

Concluding Remarks
  • The computational lens
    • Study Distance Estimation problems rather than embeddings
  • Open problems:
    • Still large gap for 0-1 strings
    • Variants of edit distance (e.g. edit distance with block-moves)‏
    • Rule out other algorithms (e.g. a “CC model” capturing Indyk’s NNS for l_∞)
  • Recent progress:
    • Bypass L1-embedding by devising new techniques
      • E.g. using max (l1) product for NNS under the Ulam metric [Andoni-Indyk-K.]
    • Analyze/design “good” heuristics
      • E.g. smoothed analysis [Andoni-K.]

Thank you!