# On Embedding Edit Distance into L1


Robert Krauthgamer (Weizmann Institute and IBM Almaden)

Based on joint work with Moses Charikar; with Yuval Rabani; with Parikshit Gopalan and T.S. Jayram; and with Alex Andoni.

## Edit Distance

x 2n, y 2m

ED(x,y) = Minimum number of character insertions, deletions and substitutions that transform x to y. [aka Levenshtein distance]

Examples:

ED(00000, 1111) = 5

ED(01010, 10101) = 2

Applications:

• Genomics
• Text processing
• Web search

For simplicity: m = n.
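The definition above is computable by the textbook dynamic program; a minimal sketch (illustrative, not part of the talk):

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the classic O(nm) dynamic program."""
    n, m = len(x), len(y)
    # dp[i][j] = ED(x[:i], y[:j])
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # delete all of x[:i]
    for j in range(m + 1):
        dp[0][j] = j          # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1,                           # delete x[i-1]
                dp[i][j - 1] + 1,                           # insert y[j-1]
                dp[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # substitute/match
            )
    return dp[n][m]

# The examples above:
# edit_distance("00000", "1111") == 5
# edit_distance("01010", "10101") == 2
```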


## Embedding into L1

An embedding of (X,d) into ℓ1 is a map f : X → ℓ1.

It has distortion K ≥ 1 if

d(x,y) ≤ ||f(x)-f(y)||1 ≤ K·d(x,y) for all x,y ∈ X

Very powerful concept (when distortion is small)

Goal: Embed edit distance into l1 with small distortion

Motivation:

Reduce algorithmic problems to l1

E.g. Nearest-Neighbor Search

Study a simple metric space without norm

E.g. Hamming cube w/cyclic shifts.
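The distortion of a concrete map can be checked by brute force over all pairs; a small illustrative sketch (the helper names are hypothetical, not from the talk):

```python
from itertools import combinations, product

def distortion(points, d, f):
    """Smallest K such that, after rescaling f, the map is non-contracting
    and expands every pair by at most K: the ratio between the largest and
    smallest per-pair expansion ||f(x)-f(y)||_1 / d(x,y)."""
    ratios = [
        sum(abs(a - b) for a, b in zip(f(x), f(y))) / d(x, y)
        for x, y in combinations(points, 2)
    ]
    return max(ratios) / min(ratios)

# The identity map embeds the Hamming cube into l1 isometrically:
cube = list(product([0, 1], repeat=4))
hamming = lambda x, y: sum(a != b for a, b in zip(x, y))
# distortion(cube, hamming, lambda x: x) == 1.0
```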

## Known Results for Edit Distance

Embed ({0,1}ⁿ, ED) into L1 (with previous bounds):

| | Bound | Previous bound |
|---|---|---|
| Upper | 2^O(√log n) [Ostrovsky-Rabani'05] | O(n^{2/3}) [Bar-Yossef-Jayram-K.-Kumar'04] |
| Lower | Ω(log n) [K.-Rabani'06] | (log n)^{1/2-o(1)} [Khot-Naor'05] |

Large gap … despite significant effort!

## Submetrics (Restricted Strings)
• Why focus on submetrics of edit distance?
• May admit smaller distortion
• Partial progress towards general case
• A framework for analyzing non-worst-case instances
• Example (a la computational biology): Handle only “typical” strings
• Class 1:
• A string is k-non-repetitive if all its k-substrings are distinct
• A random 0-1 string is WHP (2 log n)-non-repetitive
• Yields a submetric containing 1-o(1) fraction of the strings
• Class 2:
• Ulam metric = edit distance on all permutations (here Σ = {1,…,n})
• Every permutation is 1-non-repetitive
• Note: k-non-repetitive strings embed into Ulam with distortion k.
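The k-non-repetitive condition is easy to test directly; a short illustrative check (hypothetical helper, not from the talk):

```python
def is_k_nonrepetitive(s, k: int) -> bool:
    """True iff all length-k contiguous substrings of s are distinct."""
    subs = [tuple(s[i:i + k]) for i in range(len(s) - k + 1)]
    return len(subs) == len(set(subs))

# Every permutation is 1-non-repetitive (all symbols are distinct):
# is_k_nonrepetitive([3, 1, 4, 2], 1) == True
# A periodic 0-1 string is not 2-non-repetitive:
# is_k_nonrepetitive("010101", 2) == False
```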


Theory of Computation Seminar, Computer Science Department

## Known Results for Ulam Metric

| | Embed ({0,1}ⁿ, ED) into L1 | Embed Ulam metric into L1 |
|---|---|---|
| Upper bound | 2^O(√log n) [Ostrovsky-Rabani'05] | O(log n) [Charikar-K.'06] (new proof by [Gopalan-Jayram-K.]) |
| Lower bound | Ω(log n) [K.-Rabani'06] | Ω(log n / log log n) [Andoni-K.'07] (actually qualitatively stronger) |
| | Large gap … | Near-tight! |

## Embedding of permutations

Theorem [Charikar-K.’06]: The Ulam metric of dimension n embeds into l1 with distortion O(log n).

Proof. Define f with one coordinate f_{a,b} per pair of distinct symbols a < b, where f_{a,b}(P) = 1/(P⁻¹(b) − P⁻¹(a)) and P⁻¹(s) is the position of symbol s in P.

Intuition:

• sign(f_{a,b}(P)) is an indicator for “a appears before b” in P
• Thus, |f_{a,b}(P) − f_{a,b}(Q)| “measures” whether {a,b} is an inversion in P vs. Q

Claim 1: ||f(P)-f(Q)||1 ≤ O(log n) ED(P,Q)‏

• Suppose Q is obtained from P by moving one symbol, say ‘s’
• General case then follows by applying triangle inequality on P,P’,P’’,…,Q
• Total contribution of:
• coordinates with s ∈ {a,b} is 2·Σ_k 1/k ≤ O(log n)
• all other coordinates is Σ_k k·(1/k − 1/(k+1)) ≤ O(log n)
## Embedding of permutations – cont.


Claim 2: ||f(P)-f(Q)||1 ≥ ½ ED(P,Q)

• Assume wlog that P=identity
• Edit Q into an increasing sequence (thus into P) using quicksort:
• Choose a random pivot,
• Delete all characters inverted w.r.t. the pivot
• Repeat recursively on the left and right portions
• Now argue ||f(P)-f(Q)||1 ≥ E[#quicksort deletions] ≥ ½ ED(P,Q)

Surviving subsequence is increasing

ED(P,Q) ≤ 2 #deletions

For every inversion (a,b) in Q:

Pr[a deleted “by” pivot b] ≤ 1/(|Q⁻¹[a] − Q⁻¹[b]| + 1) ≤ 2·|f_{a,b}(P) − f_{a,b}(Q)|
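The quicksort deletion process of Claim 2 translates directly into code; an illustrative single run (assuming, as in the proof, that P is the identity permutation):

```python
import random

def quicksort_deletions(Q, rng=None):
    """One run of the proof's process: pick a random pivot, delete every
    element inverted w.r.t. it, and recurse on the left/right survivors.
    The survivors form an increasing sequence, so ED(P,Q) <= 2 * deletions
    (each deleted symbol is re-inserted at its sorted position)."""
    rng = rng or random.Random(0)
    if len(Q) <= 1:
        return 0
    p = rng.randrange(len(Q))
    pivot = Q[p]
    left = [v for v in Q[:p] if v < pivot]       # not inverted with pivot
    right = [v for v in Q[p + 1:] if v > pivot]  # not inverted with pivot
    deleted = len(Q) - 1 - len(left) - len(right)
    return (deleted
            + quicksort_deletions(left, rng)
            + quicksort_deletions(right, rng))

# quicksort_deletions((1, 2, 3, 4)) == 0   (already increasing)
# quicksort_deletions((2, 1)) == 1         (one symbol must be deleted)
```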

## Lower bound for 0-1 strings

Theorem [K.-Rabani'06]: Embedding ({0,1}ⁿ, ED) into L1 requires distortion Ω(log n)

Proof sketch:

• Suppose it embeds with distortion D ≥ 1, and let V = {0,1}ⁿ.
• By the cut-cone characterization of L1:
• For every pair of symmetric probability distributions μ and ν over V × V,

The embedding f into L1 can be written as

Hence,

## Lower bound for 0-1 strings – cont.

• We choose:
• ν = uniform over V × V
• μ = ½(μH + μS), where
• μH = random point + random bit flip (uniform over EH = {(x,y) : ||x-y||1 = 1})
• μS = random point + a cyclic shift (uniform over ES = {(x, S(x))})
• The RHS of (*) evaluates to O(D/n) by a counting argument.
• Main Lemma: For all A ⊆ V, the LHS of (*) is Ω(log n) / n.
• Analysis of Boolean functions on the hypercube
## Lower bound for 0-1 strings – cont.
• Recall =½(H+S) where
• H=random point+random bit flip
• S=random point+a cyclic shift
• Lemma: For all AµV, the LHS of (*) is
• Proof sketch:
• Assume to contrary, and define f = 1A.
## Lower bound for 0-1 strings – cont.
• Claim: I_j ≥ 1/n^{1/8} ⇒ I_{j+1} ≥ 1/(2n^{1/8})
• Proof: the diagram commutes: flipping bit j and then applying the cyclic shift S yields the same string as applying S and then flipping bit j+1, i.e. S(x + e_j) = S(x) + e_{j+1}.

## Communication Complexity Approach

[Figure: Alice holds x ∈ Σⁿ, Bob holds y ∈ Σⁿ; they share randomness and exchange CC_A bits.]

Communication complexity model:

• Two-party protocol
• Shared randomness
• Promise (gap) version
• A = approximation factor
• CC_A = min. # of bits needed to decide w.h.p.


Previous communication lower bounds:

• ℓ1 [Saks-Sun'02, Bar-Yossef-Jayram-Kumar-Sivakumar'04]
• ℓ1 [Woodruff'04]
• Earthmover distance [Andoni-Indyk-K.'07]

Distance Estimation Problem:

decide whether d(x,y) ≥ R or d(x,y) ≤ R/A

## Communication Bounds for Edit Distance

A tradeoff between approximation and communication

• Theorem [Andoni-K.’07]:

Corollary 1: Approximation A = O(1) requires CC_A ≥ Ω(log log n)

Corollary 2: Communication CC_A = O(1) requires A ≥ Ω*(log n)

For Hamming distance: CC_{1+ε} = O(1/ε²)

[Kushilevitz-Ostrovsky-Rabani’98], [Woodruff’04]

First computational model where edit is provably harder than Hamming!
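The Hamming bound cited above comes from sketching; a toy rendition of a sampled-parity sketch in the spirit of [Kushilevitz-Ostrovsky-Rabani'98] (all parameter choices here are illustrative assumptions, not the paper's):

```python
import random

def parity_sketch(x, R, reps, seed=0):
    """reps bits; bit i is the parity of x over a random coordinate set,
    each coordinate sampled with probability 1/(2R). For d = Ham(x,y),
    sketches of x and y disagree on a bit with probability
    (1 - (1 - 1/R)**d) / 2, which grows monotonically with d, so
    O(1/eps**2) bits suffice to separate d <= R/(1+eps) from d >= R."""
    rng = random.Random(seed)  # shared randomness: same sets for x and y
    p = 1.0 / (2 * R)
    bits = []
    for _ in range(reps):
        b = 0
        for xj in x:
            sampled = rng.random() < p  # draw for every coordinate, so the
            if sampled and xj:          # random stream matches across inputs
                b ^= 1
        bits.append(b)
    return bits

def disagreement(x, y, R, reps=500, seed=0):
    sx = parity_sketch(x, R, reps, seed)
    sy = parity_sketch(y, R, reps, seed)
    return sum(a != b for a, b in zip(sx, sy)) / reps

# Far pairs disagree on many more sketch bits than close pairs:
x = [0] * 100
y_close = [1] * 2 + [0] * 98    # Hamming distance 2
y_far = [1] * 50 + [0] * 50     # Hamming distance 50
```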

Implications to embeddings:

• Embedding ED into L1 (or squared-L2) requires distortion Ω*(log n)
• Furthermore, holds for both 0-1 strings and permutations (Ulam)‏
## Proof Outline

Step 1 [Yao’s minimax Theorem]: Reduce to distributional complexity

If CC_A ≤ k, then for every two distributions μ_far, μ_close there is a k-bit deterministic protocol with success probability ≥ 2/3

Step 2 [Andoni-Indyk-K.’07]: Reduce to 1-bit protocols

Further to the above, there are Boolean functions s_A, s_B : Σⁿ → {0,1} with advantage

Pr_{(x,y)∼μ_far}[s_A(x) ≠ s_B(y)] − Pr_{(x,y)∼μ_close}[s_A(x) ≠ s_B(y)] ≥ Ω(2^{−k})

Step 3 [Fourier expansion]: Reduce to one Fourier level ℓ

Furthermore, s_A, s_B depend only on ℓ fixed positions j_1,…,j_ℓ

Step 4 [Choose distribution]: Analyze (x,y) projected on these positions

Let μ_close, μ_far include ε-noise ⇒ handles a high level ℓ

Let μ_close, μ_far include (few/more) block rotations ⇒ handles a low level ℓ

Step 5: Reduce Ulam to {0,1}ⁿ

A random mapping {0,1} works

Compare this additive analysis to our previous analysis:

Key property: the distribution of (x_{j_1},…,x_{j_ℓ}, y_{j_1},…,y_{j_ℓ}) is

“statistically close” under far vs. under close

## Summary of Known Results

| | Embed ({0,1}ⁿ, ED) into L1 | Embed Ulam metric into L1 |
|---|---|---|
| Upper bound | 2^O(√log n) [Ostrovsky-Rabani'05] | O(log n) [Charikar-K.'06] (new proof by [Gopalan-Jayram-K.]) |
| Lower bound | Ω(log n) [K.-Rabani'06] | Ω(log n / log log n) [Andoni-K.'07] (qualitatively much stronger) |

## Concluding Remarks
• The computational lens
• Study Distance Estimation problems rather than embeddings
• Open problems:
• Still large gap for 0-1 strings
• Variants of edit distance (e.g. edit distance with block-moves)‏
• Rule out other algorithms (e.g. “CC model” capturing Indyk’s NNS for l1)‏
• Recent progress:
• Bypass L1-embedding by devising new techniques
• E.g. using max (l1) product for NNS under Ulam metric [Andoni- Indyk-K.]
• Analyze/design “good” heuristics
• E.g. smoothed analysis [Andoni-K.]

Thank you!