## On Embedding Edit Distance into L_1

Robert Krauthgamer (Weizmann Institute and IBM Almaden)

Based on joint works with Moses Charikar, with Yuval Rabani, with Parikshit Gopalan and T.S. Jayram, and with Alex Andoni.


### Edit Distance

x ∈ Σ^n, y ∈ Σ^m

ED(x,y) = Minimum number of character insertions, deletions and substitutions that transform x to y. [aka Levenshtein distance]

Examples:

ED(00000, 1111) = 5

ED(01010, 10101) = 2

Applications:

- Genomics
- Text processing
- Web search

For simplicity: m = n.
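The definition above translates directly into the textbook dynamic program; a minimal sketch (function name is ours), which reproduces the two examples:

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the classic O(nm) dynamic program."""
    n, m = len(x), len(y)
    prev = list(range(m + 1))          # prev[j] = ED(x[:i-1], y[:j])
    for i in range(1, n + 1):
        curr = [i] + [0] * m           # curr[j] = ED(x[:i], y[:j])
        for j in range(1, m + 1):
            sub = prev[j - 1] + (x[i - 1] != y[j - 1])   # substitution / match
            curr[j] = min(sub, prev[j] + 1, curr[j - 1] + 1)  # deletion, insertion
        prev = curr
    return prev[m]

print(edit_distance("00000", "1111"))   # 5
print(edit_distance("01010", "10101"))  # 2
```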


### Embedding into L_1

An embedding of (X,d) into ℓ_1 is a map f : X → ℓ_1.

It has distortion K ≥ 1 if

d(x,y) ≤ ‖f(x) − f(y)‖_1 ≤ K·d(x,y)  for all x,y ∈ X.

Very powerful concept (when distortion is small)

Goal: Embed edit distance into ℓ_1 with small distortion

Motivation:

Reduce algorithmic problems to ℓ_1

E.g. Nearest-Neighbor Search

Study a simple metric space without norm

E.g., the Hamming cube with cyclic shifts.
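The distortion definition can be made concrete with a small sketch that measures the distortion of a given map on a finite metric space (helper names are ours):

```python
from itertools import combinations

def l1_distortion(points, d, f):
    """Smallest K such that, after optimally rescaling f,
    d(x,y) <= ||f(x)-f(y)||_1 <= K*d(x,y) for all pairs x,y."""
    ratios = []
    for x, y in combinations(points, 2):
        l1 = sum(abs(a - b) for a, b in zip(f(x), f(y)))
        ratios.append(l1 / d(x, y))
    # the best rescaling makes the smallest ratio equal to 1
    return max(ratios) / min(ratios)

# The identity embedding of the Hamming cube {0,1}^2 into R^2 is isometric:
cube = [(0, 0), (0, 1), (1, 0), (1, 1)]
ham = lambda x, y: sum(a != b for a, b in zip(x, y))
print(l1_distortion(cube, ham, lambda x: x))  # 1.0
```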

### Known Results for Edit Distance

Embedding ({0,1}^n, ED) into L_1:

| | Current bound | Previous bound |
|---|---|---|
| Upper bound | 2^{O(√log n)} [Ostrovsky-Rabani’05] | O(n^{2/3}) [Bar-Yossef-Jayram-K.-Kumar’04] |
| Lower bound | Ω(log n) [K.-Rabani’06] | (log n)^{1/2−o(1)} [Khot-Naor’05] and 3/2 [Andoni-Deza-Gupta-Indyk-Raskhodnikova’03] |

Large gap … despite significant effort!


### Submetrics (Restricted Strings)

- Why focus on submetrics of edit distance?
  - May admit smaller distortion
  - Partial progress towards the general case
  - A framework for analyzing non-worst-case instances
- Example (à la computational biology): handle only “typical” strings
- Class 1:
  - A string is k-non-repetitive if all its k-substrings are distinct
  - A random 0-1 string is WHP (2 log n)-non-repetitive
  - Yields a submetric containing a 1−o(1) fraction of the strings
- Class 2:
  - Ulam metric = edit distance on all permutations (here Σ = {1,…,n})
  - Every permutation is 1-non-repetitive
- Note: k-non-repetitive strings embed into Ulam with distortion k.
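The Class-1 condition is easy to check directly; a quick sketch (function name is ours):

```python
def is_k_nonrepetitive(s: str, k: int) -> bool:
    """True iff all length-k substrings of s are distinct."""
    grams = [s[i:i + k] for i in range(len(s) - k + 1)]
    return len(grams) == len(set(grams))

# Every permutation is 1-non-repetitive (all symbols distinct):
print(is_k_nonrepetitive("52314", 1))   # True
# "0101" repeats the 2-substring "01", so it is not 2-non-repetitive:
print(is_k_nonrepetitive("0101", 2))    # False
```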


Theory of Computation Seminar, Computer Science Department

### Known Results for Ulam Metric

| | Embedding ({0,1}^n, ED) into L_1 | Embedding the Ulam metric into L_1 |
|---|---|---|
| Upper bound | 2^{O(√log n)} [Ostrovsky-Rabani’05] | O(log n) [Charikar-K.’06] (new proof by [Gopalan-Jayram-K.]) |
| Lower bound | Ω(log n) [K.-Rabani’06] | Ω(log n / log log n) [Andoni-K.’07] (actually qualitatively stronger) |
| | Large gap … | Near-tight! |


### Embedding of permutations

Theorem [Charikar-K.’06]: The Ulam metric of dimension n embeds into ℓ_1 with distortion O(log n).

Proof. For every pair of distinct symbols a, b, define the coordinate f_{a,b}(P) = 1/(P^{-1}(b) − P^{-1}(a)), where P^{-1}(a) denotes the position of symbol a in P (the formula was not recoverable from the transcript; this is the signed inverse distance suggested by the intuition below, and the exact normalization in the paper may differ).

Intuition:

- sign(f_{a,b}(P)) is an indicator for “a appears before b” in P
- Thus, |f_{a,b}(P) − f_{a,b}(Q)| “measures” whether {a,b} is an inversion in P vs. Q

Claim 1: ‖f(P) − f(Q)‖_1 ≤ O(log n) · ED(P,Q)

- Suppose Q is obtained from P by moving one symbol, say ‘s’
- The general case then follows by applying the triangle inequality to P, P’, P’’, …, Q
- Total contribution of:
  - coordinates with s ∈ {a,b} is 2·Σ_k 1/k ≤ O(log n)
  - other coordinates is Σ_k k·(1/k − 1/(k+1)) ≤ O(log n)
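As a hedged illustration only: the intuition above suggests coordinates of the form f_{a,b}(P) = 1/(P^{-1}(b) − P^{-1}(a)), a signed inverse distance between the positions of a and b. This is our reconstruction, not necessarily the exact normalization of [Charikar-K.’06]:

```python
from itertools import combinations

def embed(P):
    """One coordinate per pair a<b: signed inverse distance between positions.
    NOTE: reconstructed form; the paper's normalization may differ."""
    pos = {sym: i for i, sym in enumerate(P)}   # pos plays the role of P^{-1}
    return {(a, b): 1.0 / (pos[b] - pos[a])
            for a, b in combinations(sorted(P), 2)}

def l1(f, g):
    return sum(abs(f[c] - g[c]) for c in f)

P = (1, 2, 3, 4, 5)
Q = (2, 3, 4, 5, 1)            # one symbol moved, so ED(P, Q) <= 2
f, g = embed(P), embed(Q)
# sign(f_{a,b}) indicates whether a appears before b:
assert all((f[(a, b)] > 0) == (P.index(a) < P.index(b)) for (a, b) in f)
print(l1(f, g))
```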

### Embedding of permutations – cont.

Claim 2: ‖f(P) − f(Q)‖_1 ≥ ½·ED(P,Q)

- Assume w.l.o.g. that P = identity
- Edit Q into an increasing sequence (and thus into P) using quicksort:
  - Choose a random pivot
  - Delete all characters inverted w.r.t. the pivot
  - Recurse on the left and right portions
- Now argue ‖f(P) − f(Q)‖_1 ≥ E[#quicksort deletions] ≥ ½·ED(P,Q)

The surviving subsequence is increasing, hence ED(P,Q) ≤ 2·#deletions.

For every inversion (a,b) in Q:

Pr[a deleted “by” pivot b] ≤ 1/(|Q^{-1}[a] − Q^{-1}[b]| + 1) ≤ 2·|f_{a,b}(P) − f_{a,b}(Q)|
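The quicksort-deletion process can be simulated directly. The survivors always form an increasing subsequence, so every run deletes at least n − LIS(Q) elements, and the deletions can be undone by reinsertions, certifying ED(identity, Q) ≤ 2·#deletions. A sketch with helper names of our choosing:

```python
import random

def quicksort_deletions(Q, rng):
    """Pick a random pivot, delete every element inverted w.r.t. it,
    recurse on both sides; returns the total number of deletions.
    The surviving elements form an increasing subsequence of Q."""
    if len(Q) <= 1:
        return 0
    j = rng.randrange(len(Q))
    pivot = Q[j]
    left = [x for x in Q[:j] if x < pivot]        # not inverted with the pivot
    right = [x for x in Q[j + 1:] if x > pivot]   # not inverted with the pivot
    deleted = len(Q) - 1 - len(left) - len(right)
    return (deleted + quicksort_deletions(left, rng)
                    + quicksort_deletions(right, rng))

def lis_length(Q):
    """Longest increasing subsequence via the O(n^2) dynamic program."""
    best = [1] * len(Q)
    for i in range(len(Q)):
        for j in range(i):
            if Q[j] < Q[i]:
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)

rng = random.Random(7)
Q = list(range(30))
rng.shuffle(Q)
d = quicksort_deletions(Q, rng)
assert d >= len(Q) - lis_length(Q)   # survivors are an increasing subsequence
print(d)
```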

### Lower bound for 0-1 strings

Theorem [K.-Rabani’06]: Embedding ({0,1}^n, ED) into L_1 requires distortion Ω(log n).

Proof sketch:

- Suppose ED embeds into L_1 with distortion D ≥ 1, and let V = {0,1}^n.
- By the cut-cone characterization of L_1, the embedding f can be written as ‖f(x) − f(y)‖_1 = Σ_{A⊆V} α_A·|1_A(x) − 1_A(y)| with all α_A ≥ 0.
- Hence, for every two symmetric probability distributions μ and ν over V×V, there exists a cut A ⊆ V with

E_{(x,y)∼ν}[|1_A(x) − 1_A(y)|] / E_{(x,y)∼μ}[|1_A(x) − 1_A(y)|] ≤ D · E_ν[ED(x,y)] / E_μ[ED(x,y)]   (*)

### Lower bound for 0-1 strings – cont.

- We choose in (*):
  - μ = uniform over V×V
  - ν = ½(ν_H + ν_S), where
    - ν_H = a random point plus a random bit flip (uniform over E_H = {(x,y): ‖x−y‖_1 = 1})
    - ν_S = a random point plus a cyclic shift (uniform over E_S = {(x, S(x))})
- The RHS of (*) evaluates to O(D/n) by a counting argument.
- Main Lemma: For all A ⊆ V, the LHS of (*) is Ω(log n)/n.
  - Via analysis of Boolean functions on the hypercube

### Lower bound for 0-1 strings – cont.

- Recall ν = ½(ν_H + ν_S), where
  - ν_H = a random point plus a random bit flip
  - ν_S = a random point plus a cyclic shift
- Lemma: For all A ⊆ V, the LHS of (*) is Ω(log n)/n.
- Proof sketch:
  - Assume the contrary, and define f = 1_A.

### Lower bound for 0-1 strings – cont.

- Claim: I_j ≥ 1/n^{1/8} ⇒ I_{j+1} ≥ 1/(2n^{1/8})   (here I_j denotes the influence of coordinate j on f)
- Proof: the cyclic shift S maps the bit-flip edge {x, x+e_j} to the bit-flip edge {S(x), S(x)+e_{j+1}}, since S(x+e_j) = S(x)+e_{j+1}; so influence propagates from position j to position j+1.
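The commuting relation used in the proof can be checked directly: flipping bit j and then shifting equals shifting and then flipping bit j+1 (indices mod n). Helper names are ours:

```python
def cyclic_shift(x: str) -> str:
    """Shift the bit string one position to the right, cyclically."""
    return x[-1:] + x[:-1]

def flip(x: str, j: int) -> str:
    """Flip bit j of the bit string x."""
    return x[:j] + ('1' if x[j] == '0' else '0') + x[j + 1:]

# S(x + e_j) = S(x) + e_{j+1}, indices taken mod n:
x = '01101001'
n = len(x)
for j in range(n):
    assert cyclic_shift(flip(x, j)) == flip(cyclic_shift(x), (j + 1) % n)
print("commutes")
```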

### Communication Complexity Approach

(Figure: Alice holds x ∈ Σ^n, Bob holds y ∈ Σ^n; using shared randomness, they exchange CC_A bits.)

Communication complexity model:

- Two-party protocol between Alice and Bob
- Shared randomness
- Promise (gap) version
- A = approximation factor
- CC_A = minimum number of bits to decide w.h.p.

Previous communication lower bounds:

- ℓ_∞ [Saks-Sun’02, Bar-Yossef-Jayram-Kumar-Sivakumar’04]
- ℓ_1 [Woodruff’04]
- Earthmover distance [Andoni-Indyk-K.’07]

Distance Estimation Problem: decide whether d(x,y) ≥ R or d(x,y) ≤ R/A.

### Communication Bounds for Edit Distance

A tradeoff between approximation and communication

- Theorem [Andoni-K.’07]:

Corollary 1: Approximation A = O(1) requires CC_A ≥ Ω(log log n)

Corollary 2: Communication CC_A = O(1) requires A ≥ Ω(log n / log log n)

For Hamming distance: CC_{1+ε} = O(1/ε²)

[Kushilevitz-Ostrovsky-Rabani’98], [Woodruff’04]
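For contrast, the Hamming upper bound can be illustrated with a [Kushilevitz-Ostrovsky-Rabani’98]-style one-bit sketch: each sketch bit is the XOR of the string over a random index set that keeps every coordinate with probability about 1/(2R). The probability that the two parties' bits disagree is monotone increasing in the Hamming distance, so O(1/ε²) independent repetitions separate the two cases of the gap problem. A simplified sketch under these assumptions, not the exact protocol from the paper:

```python
import random

def make_masks(n, R, reps, seed):
    """Shared randomness: `reps` index sets, each index kept w.p. 1/(2R)."""
    rng = random.Random(seed)
    return [[i for i in range(n) if rng.random() < 1 / (2 * R)]
            for _ in range(reps)]

def sketch(x, masks):
    """One parity bit per mask: XOR of x's bits over the mask."""
    return [sum(x[i] for i in S) % 2 for S in masks]

def disagreement(x, y, masks):
    sx, sy = sketch(x, masks), sketch(y, masks)
    return sum(a != b for a, b in zip(sx, sy)) / len(masks)

n, R = 1000, 50
masks = make_masks(n, R, reps=2000, seed=1)
x = [0] * n
far = [1] * R + [0] * (n - R)                # Hamming distance R
close = [1] * (R // 4) + [0] * (n - R // 4)  # Hamming distance R/4
# Pr[one sketch bit differs] ~ (1 - (1 - 1/R)^d)/2, increasing in d = Ham(x,y),
# so the empirical disagreement rate separates the two cases:
print(disagreement(x, far, masks), disagreement(x, close, masks))
```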

First computational model where edit distance is provably harder than Hamming distance!

Implications for embeddings:

- Embedding ED into L_1 (or squared-L_2) requires distortion Ω(log n / log log n)
- Furthermore, this holds for both 0-1 strings and permutations (Ulam)

### Proof Outline

Step 1 [Yao’s minimax theorem]: Reduce to distributional complexity

- If CC_A ≤ k, then for every two distributions μ_far, μ_close there is a k-bit deterministic protocol with success probability ≥ 2/3.

Step 2 [Andoni-Indyk-K.’07]: Reduce to 1-bit protocols

- Further to the above, there are Boolean functions s_A, s_B : Σ^n → {0,1} with advantage Pr_{(x,y)∼μ_far}[s_A(x) ≠ s_B(y)] − Pr_{(x,y)∼μ_close}[s_A(x) ≠ s_B(y)] ≥ Ω(2^{−k}).

Step 3 [Fourier expansion]: Reduce to one Fourier level

- Furthermore, s_A, s_B depend only on fixed positions j_1, …, j_ℓ.

Step 4 [Choose distributions]: Analyze (x,y) projected on these positions

- Let μ_close, μ_far include ε-noise → handles a high Fourier level
- Let μ_close, μ_far include (few/more) block rotations → handles a low Fourier level

Step 5: Reduce Ulam to {0,1}^n

- A random mapping Σ → {0,1} works.

Compare this additive analysis to our previous analysis. Key property: the distribution of (x_{j_1}, …, x_{j_ℓ}, y_{j_1}, …, y_{j_ℓ}) is “statistically close” under μ_far vs. under μ_close.

### Summary of Known Results

| | Embedding ({0,1}^n, ED) into L_1 | Embedding the Ulam metric into L_1 |
|---|---|---|
| Upper bound | 2^{O(√log n)} [Ostrovsky-Rabani’05] | O(log n) [Charikar-K.’06] (new proof by [Gopalan-Jayram-K.]) |
| Lower bound | Ω(log n) [K.-Rabani’06] | Ω(log n / log log n) [Andoni-K.’07] (qualitatively much stronger) |


### Concluding Remarks

- The computational lens: study distance estimation problems rather than embeddings
- Open problems:
  - Still a large gap for 0-1 strings
  - Variants of edit distance (e.g., edit distance with block moves)
  - Rule out other algorithms (e.g., a “CC model” capturing Indyk’s NNS for ℓ_∞)
- Recent progress:
  - Bypass ℓ_1-embeddings by devising new techniques
  - E.g., using the max (ℓ_∞) product of ℓ_1 for NNS under the Ulam metric [Andoni-Indyk-K.]
  - Analyze/design “good” heuristics, e.g., smoothed analysis [Andoni-K.]

Thank you!
