Presentation Transcript
On Embedding Edit Distance into L_1

Robert Krauthgamer (Weizmann Institute and IBM Almaden)

Based on joint work with Moses Charikar, with Yuval Rabani, with Parikshit Gopalan and T.S. Jayram, and with Alex Andoni.

Edit Distance

x ∈ Σ^n, y ∈ Σ^m

ED(x,y) = Minimum number of character insertions, deletions and substitutions that transform x to y. [aka Levenshtein distance]

Examples:

ED(00000, 1111) = 5

ED(01010, 10101) = 2

Applications:

  • Genomics
  • Text processing
  • Web search

For simplicity: m = n.
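
As a quick illustration of this definition (not from the original slides), here is a minimal dynamic-programming sketch of edit distance, the standard Wagner-Fischer recurrence; it reproduces the two examples above.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of character insertions, deletions and substitutions turning x into y."""
    n, m = len(x), len(y)
    dp = list(range(m + 1))          # dp[j] = ED(empty prefix of x, y[:j])
    for i in range(1, n + 1):
        prev, dp[0] = dp[0], i       # prev holds the diagonal value dp[i-1][j-1]
        for j in range(1, m + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # delete x[i-1]
                        dp[j - 1] + 1,                   # insert y[j-1]
                        prev + (x[i - 1] != y[j - 1]))   # substitute (free if equal)
            prev = cur
    return dp[m]

assert edit_distance("00000", "1111") == 5
assert edit_distance("01010", "10101") == 2
```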


Embedding into L1

An embedding of (X,d) into l1 is a map f : X → l1.

It has distortion K ≥ 1 if

d(x,y) ≤ ||f(x)-f(y)||_1 ≤ K·d(x,y)   for all x,y ∈ X

Very powerful concept (when distortion is small)

Goal: Embed edit distance into l1 with small distortion

Motivation:

Reduce algorithmic problems to l1

E.g. Nearest-Neighbor Search

Study a simple metric space without norm

E.g. Hamming cube w/cyclic shifts.
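
To make the distortion definition concrete, here is a small sketch (my own toy example, not from the talk) that computes the distortion of a given map f into l1 over a finite point set, as the product of the worst-case expansion and the worst-case contraction (i.e., allowing rescaling of f).

```python
from itertools import combinations

def distortion(points, d, f):
    """Distortion of f as a map from (points, d) into l_1, allowing rescaling of f:
    the product of the worst-case expansion and the worst-case contraction."""
    expansion = contraction = 0.0
    for x, y in combinations(points, 2):
        l1 = sum(abs(a - b) for a, b in zip(f(x), f(y)))
        expansion = max(expansion, l1 / d(x, y))       # worst stretch
        contraction = max(contraction, d(x, y) / l1)   # worst shrink
    return expansion * contraction

# Toy example: the 4-cycle metric mapped to the corners of the unit square (an l_1 point set).
cycle = [0, 1, 2, 3]
d_cycle = lambda x, y: min(abs(x - y), 4 - abs(x - y))
f = lambda x: [(0, 0), (1, 0), (1, 1), (0, 1)][x]
print(distortion(cycle, d_cycle, f))  # 1.0, i.e. this embedding is isometric
```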

Known Results for Edit Distance

Embedding ({0,1}^n, ED) into L1:

  • Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani’05]
    • Previous bound: O(n^{2/3}) [Bar-Yossef-Jayram-K.-Kumar’04]
  • Lower bound: Ω(log n) [K.-Rabani’06]
    • Previous bounds: (log n)^{1/2-o(1)} [Khot-Naor’05] and 3/2 [Andoni-Deza-Gupta-Indyk-Raskhodnikova’03]

Large gap … despite significant effort!

Submetrics (Restricted Strings)
  • Why focus on submetrics of edit distance?
    • May admit smaller distortion
    • Partial progress towards general case
    • A framework for analyzing non-worst-case instances
      • Example (a la computational biology): Handle only “typical” strings
  • Class 1:
    • A string is k-non-repetitive if all of its length-k (contiguous) substrings are distinct (a simple check is sketched below)
    • A random 0-1 string is WHP (2log n)-non-repetitive
      • Yields a submetric containing 1-o(1) fraction of the strings
  • Class 2:
    • Ulam metric = edit distance on all permutations (here Σ = {1,…,n})
    • Every permutation is 1-non-repetitive
    • Note: k-non-repetitive strings embed into Ulam with distortion k.

[Figure: illustration of a k-non-repetitive string, with k = 7]

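As referenced in the Class 1 bullet above, here is a minimal check of k-non-repetitiveness (my own sketch, not from the talk); on a random 0-1 string of length n it typically succeeds already for k around 2·log2(n), in line with the w.h.p. claim.

```python
import random

def is_k_nonrepetitive(s: str, k: int) -> bool:
    """True iff all length-k contiguous substrings of s are distinct."""
    windows = [s[i:i + k] for i in range(len(s) - k + 1)]
    return len(set(windows)) == len(windows)

# Class 2: every permutation is 1-non-repetitive.
assert is_k_nonrepetitive("31425", 1)

# Class 1: a random 0-1 string of length n is w.h.p. (2 log n)-non-repetitive.
n = 1024
s = "".join(random.choice("01") for _ in range(n))
print(is_k_nonrepetitive(s, 2 * n.bit_length()))  # usually True
```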

Known Results for Ulam Metric

Embedding ({0,1}^n, ED) into L1:

  • Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani’05]
  • Lower bound: Ω(log n) [K.-Rabani’06]
  • Large gap!

Embedding the Ulam metric into L1:

  • Upper bound: O(log n) [Charikar-K.’06] (new proof by [Gopalan-Jayram-K.])
  • Lower bound: Ω(log n / loglog n) [Andoni-K.’07] (actually qualitatively stronger)
  • Near-tight!

Embedding of permutations

Theorem [Charikar-K.’06]: The Ulam metric of dimension n embeds into l1 with distortion O(log n).

Proof. Define f(P) = ( f_{a,b}(P) )_{a≠b}, with one coordinate per pair of symbols, where f_{a,b}(P) = 1/(P^{-1}(b) - P^{-1}(a)).

Intuition:

  • sign(f_{a,b}(P)) indicates whether “a appears before b” in P
  • Thus, |f_{a,b}(P)-f_{a,b}(Q)| “measures” whether {a,b} is an inversion in P vs. Q

Claim 1: ||f(P)-f(Q)||_1 ≤ O(log n) · ED(P,Q)

  • Suppose Q is obtained from P by moving one symbol, say ‘s’
    • The general case then follows by applying the triangle inequality to P, P’, P’’, …, Q
  • Total contribution of
    • coordinates with s ∈ {a,b} is 2·Σ_k (1/k) ≤ O(log n)
    • other coordinates is Σ_k k·(1/k - 1/(k+1)) ≤ O(log n)

Embedding of permutations (cont.)

Theorem [Charikar-K.’06]: The Ulam metric of dimension n embeds into l1 with distortion O(log n).

Proof (cont.). Recall f(P) = ( f_{a,b}(P) )_{a≠b} with f_{a,b}(P) = 1/(P^{-1}(b) - P^{-1}(a)).

Claim 1: ||f(P)-f(Q)||_1 ≤ O(log n) · ED(P,Q)

Claim 2: ||f(P)-f(Q)||_1 ≥ ½·ED(P,Q)

  • Assume w.l.o.g. that P = identity
  • Edit Q into an increasing sequence (thus into P) using quicksort:
    • Choose a random pivot,
    • Delete all characters inverted w.r.t. the pivot
    • Repeat recursively on the left and right portions
  • Now argue ||f(P)-f(Q)||_1 ≥ E[ #quicksort deletions ] ≥ ½·ED(P,Q)

The surviving subsequence is increasing, hence

ED(P,Q) ≤ 2·(#deletions)

For every inversion (a,b) in Q:

Pr[a deleted “by” pivot b] ≤ 1/(|Q^{-1}(a)-Q^{-1}(b)|+1) ≤ 2·|f_{a,b}(P) - f_{a,b}(Q)|
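
To make the construction concrete, here is a small numerical sketch (my reconstruction from this proof sketch; the exact normalization in [Charikar-K.’06] may differ) of the coordinates f_{a,b}, comparing ||f(P)-f(Q)||_1 against ED(P,Q) on random permutations.

```python
import random
from itertools import combinations

def f(P):
    """One coordinate per pair of symbols {a,b} (a < b): value 1/(pos_P(b) - pos_P(a)),
    so the sign records which symbol appears first and the magnitude is 1/distance."""
    pos = {v: i for i, v in enumerate(P)}
    return {(a, b): 1.0 / (pos[b] - pos[a]) for a, b in combinations(sorted(P), 2)}

def l1(u, v):
    return sum(abs(u[c] - v[c]) for c in u)

def edit_distance(P, Q):
    """Standard DP for insertions/deletions/substitutions (here applied to permutations)."""
    n, m = len(P), len(Q)
    dp = list(range(m + 1))
    for i in range(1, n + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, m + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (P[i - 1] != Q[j - 1]))
            prev = cur
    return dp[m]

n = 40
P = list(range(n))                  # w.l.o.g. the identity permutation, as in Claim 2
for _ in range(5):
    Q = P[:]
    random.shuffle(Q)
    ed, emb = edit_distance(P, Q), l1(f(P), f(Q))
    if ed == 0:
        continue                    # Q happened to equal P
    # Claims 1 and 2 predict: 0.5 * ed <= emb <= O(log n) * ed
    print(f"ED = {ed:3d}   ||f(P)-f(Q)||_1 = {emb:7.2f}   ratio = {emb / ed:.2f}")
```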

Lower bound for 0-1 strings

Theorem [K.-Rabani’06]: Embedding ({0,1}^n, ED) into L1 requires distortion Ω(log n)

Proof sketch:

  • Suppose ({0,1}^n, ED) embeds into L1 with distortion D ≥ 1, and let V = {0,1}^n.
  • By the cut-cone characterization of L1, the embedding f can be written as a nonnegative combination of cut metrics:

    ||f(x)-f(y)||_1 = Σ_{A⊆V} w_A · δ_A(x,y),   with all w_A ≥ 0.

  • Hence, averaging this decomposition under any two symmetric probability distributions μ and ν over V×V and comparing term-by-term,

    min_{A⊆V} E_ν[δ_A(x,y)] / E_μ[δ_A(x,y)]  ≤  E_ν[||f(x)-f(y)||_1] / E_μ[||f(x)-f(y)||_1]  ≤  D · E_ν[ED(x,y)] / E_μ[ED(x,y)]      (*)

Lower bound for 0-1 strings

Theorem [K.-Rabani’06]: Embedding ({0,1}^n, ED) into L1 requires distortion Ω(log n)

Proof sketch:

  • Suppose ({0,1}^n, ED) embeds into L1 with distortion D ≥ 1, and let V = {0,1}^n.
  • By the cut-cone characterization of L1, inequality (*) holds for all symmetric probability distributions μ and ν over V×V.
  • We choose:
    • μ = uniform over V×V
    • ν = ½(H+S) where
      • H = random point + random bit flip (uniform over E_H = {(x,y): ||x-y||_1 = 1})
      • S = random point + a cyclic shift (uniform over E_S = {(x, S(x))})
  • The RHS of (*) evaluates to O(D/n) by a counting argument.
  • Main Lemma: For all A ⊆ V, the ratio E_ν[δ_A]/E_μ[δ_A] on the LHS of (*) is Ω(log n)/n.
    • Analysis of Boolean functions on the hypercube
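
A small sketch (mine, with S taken to be a cyclic shift by one position) of how one might sample from the two distributions chosen above; it is handy for sanity-checking that E_ν[ED] = O(1) while E_μ[ED] = Θ(n), which is what makes the RHS of (*) evaluate to O(D/n).

```python
import random

n = 16

def random_point():
    return [random.randint(0, 1) for _ in range(n)]

def sample_mu():
    """mu: a uniformly random pair (x, y) in V x V."""
    return random_point(), random_point()

def sample_H():
    """H: a random point plus a random single bit flip (a Hamming-cube edge)."""
    x = random_point()
    y = x[:]
    y[random.randrange(n)] ^= 1
    return x, y

def sample_S():
    """S: a random point paired with its cyclic shift by one position."""
    x = random_point()
    return x, x[-1:] + x[:-1]

def sample_nu():
    """nu = 1/2 (H + S): a fair coin flip between the two edge distributions."""
    return sample_H() if random.random() < 0.5 else sample_S()
```

Under H the edit distance of the sampled pair is exactly 1, and under S it is at most 2 (delete the last symbol and reinsert it at the front), whereas two uniformly random strings have edit distance Θ(n).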

Lower bound for 0-1 strings – cont.
  • Recall ν = ½(H+S) where
    • H = random point + random bit flip
    • S = random point + a cyclic shift
  • Lemma: For all A ⊆ V, the ratio E_ν[δ_A]/E_μ[δ_A] on the LHS of (*) is Ω(log n)/n.
  • Proof sketch:
    • Assume the contrary, and define f = 1_A.

Lower bound for 0-1 strings – cont.

  • Claim: I_j ≥ 1/n^{1/8}  ⇒  I_{j+1} ≥ 1/(2n^{1/8})
  • Proof:

[Figure: commutative diagram showing that flipping bit j and then applying the cyclic shift S is the same as applying S and then flipping bit j+1, i.e., S(x + e_j) = S(x) + e_{j+1}.]
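
A tiny sanity check (mine) of the identity from the figure: with S the cyclic shift by one position to the right and e_j the j-th standard basis vector over GF(2), flipping bit j and then shifting is the same as shifting and then flipping bit j+1 (indices mod n).

```python
import random

n = 16

def S(x):
    """Cyclic shift by one position to the right: bit j of x moves to position j+1 (mod n)."""
    return x[-1:] + x[:-1]

def flip(x, j):
    """Add e_j over GF(2), i.e. flip bit j."""
    y = x[:]
    y[j] ^= 1
    return y

x = [random.randint(0, 1) for _ in range(n)]
for j in range(n):
    assert S(flip(x, j)) == flip(S(x), (j + 1) % n)   # S(x + e_j) = S(x) + e_{j+1}
```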

Communication Complexity Approach

[Figure: Alice holds x ∈ Σ^n, Bob holds y ∈ Σ^n; using shared randomness they exchange CC_A bits.]

Communication complexity model:

  • Two-party protocol (Alice and Bob)
  • Shared randomness
  • Promise (gap) version
  • A = approximation factor
  • CC_A = min. # bits to decide w.h.p.

Previous communication lower bounds:

  • l_∞ [Saks-Sun’02, Bar-Yossef-Jayram-Kumar-Sivakumar’04]
  • l_1 [Woodruff’04]
  • Earthmover [Andoni-Indyk-K.’07]

Distance Estimation Problem:

decide whether d(x,y) ≥ R or d(x,y) ≤ R/A

Communication Bounds for Edit Distance

A tradeoff between approximation and communication

  • Theorem [Andoni-K.’07]:

Corollary 1: Approximation A = O(1) requires CC_A ≥ Ω(loglog n)

Corollary 2: Communication CC_A = O(1) requires A ≥ Ω*(log n)

For Hamming distance: CC_{1+ε} = O(1/ε²)

[Kushilevitz-Ostrovsky-Rabani’98], [Woodruff’04]
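
For contrast with edit distance, here is a hedged sketch in the spirit of the [Kushilevitz-Ostrovsky-Rabani’98] Hamming sketch (my simplified rendering, not the exact protocol; the constant 32 and the helper names are illustrative): both parties use shared randomness to compute parities over random coordinate subsets, and the disagreement rate of the two bit-strings is monotone in the Hamming distance.

```python
import random

def parity_sketch(x, seeds, p):
    """For each shared seed, the parity of x over a random coordinate subset,
    where each coordinate is kept independently with probability p."""
    bits = []
    for seed in seeds:
        rng = random.Random(seed)        # shared randomness: both parties draw the same subsets
        kept = [i for i in range(len(x)) if rng.random() < p]
        bits.append(sum(x[i] for i in kept) % 2)
    return bits

def decide_gap_hamming(x, y, R, eps, public_seed="public-coin"):
    """Distinguish Hamming(x,y) >= R from Hamming(x,y) <= R/(1+eps)
    by exchanging O(1/eps^2) sketch bits (the constant 32 is illustrative)."""
    k = int(32 / eps ** 2)
    seeds = [f"{public_seed}:{i}" for i in range(k)]
    p = 1.0 / (2 * R)
    sx = parity_sketch(x, seeds, p)      # computed by Alice
    sy = parity_sketch(y, seeds, p)      # computed by Bob
    disagree = sum(a != b for a, b in zip(sx, sy)) / k
    # If d = Hamming(x,y), each pair of parities differs with probability (1-(1-2p)^d)/2,
    # which is strictly increasing in d, so thresholding at the midpoint of the gap works.
    mid_d = R / (1 + eps / 2)
    threshold = (1 - (1 - 2 * p) ** mid_d) / 2
    return "far (d >= R)" if disagree > threshold else "close (d <= R/(1+eps))"
```

By contrast, Corollary 1 above says that no constant-communication protocol achieves a constant-factor approximation for edit distance.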

First computational model where edit is provably harder than Hamming!

Implications for embeddings:

  • Embedding ED into L1 (or squared-L2) requires distortion Ω*(log n)
  • Furthermore, this holds both for 0-1 strings and for permutations (Ulam)

Proof Outline

Step 1 [Yao’s minimax theorem]: Reduce to distributional complexity

If CC_A ≤ k, then for every two distributions μ_far, μ_close there is a k-bit deterministic protocol with success probability ≥ 2/3.

Step 2 [Andoni-Indyk-K.’07]: Reduce to 1-bit protocols

Further to the above, there are Boolean functions s_A, s_B : Σ^n → {0,1} with advantage

Pr_{(x,y)~μ_far}[s_A(x) ≠ s_B(y)] - Pr_{(x,y)~μ_close}[s_A(x) ≠ s_B(y)] ≥ Ω(2^{-k})

Step 3 [Fourier expansion]: Reduce to one Fourier level ℓ

Furthermore, s_A, s_B depend only on fixed positions j_1, …, j_ℓ

Step 4 [Choose distribution]: Analyze (x,y) ~ μ_close/μ_far projected on these positions

Let μ_close, μ_far include ε-noise ⇒ handle a high level ℓ

Let μ_close, μ_far include (few/more) block rotations ⇒ handle a low level ℓ

Step 5: Reduce Ulam to {0,1}^n

A random mapping Σ → {0,1} works

Compare this additive analysis to our previous analysis:

Key property: the distribution of (x_{j_1}, …, x_{j_ℓ}, y_{j_1}, …, y_{j_ℓ}) is “statistically close” under μ_far vs. under μ_close

Summary of Known Results

Embedding ({0,1}^n, ED) into L1:

  • Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani’05]
  • Lower bound: Ω(log n) [K.-Rabani’06]

Embedding the Ulam metric into L1:

  • Upper bound: O(log n) [Charikar-K.’06] (new proof by [Gopalan-Jayram-K.])
  • Lower bound: Ω(log n / loglog n) [Andoni-K.’07] (qualitatively much stronger)

Concluding Remarks
  • The computational lens
    • Study Distance Estimation problems rather than embeddings
  • Open problems:
    • Still large gap for 0-1 strings
    • Variants of edit distance (e.g. edit distance with block-moves)‏
    • Rule out other algorithms (e.g. a “CC model” capturing Indyk’s NNS for l_∞)
  • Recent progress:
    • Bypass L1-embedding by devising new techniques
      • E.g. using max (l1) product for NNS under the Ulam metric [Andoni-Indyk-K.]
    • Analyze/design “good” heuristics
      • E.g. smoothed analysis [Andoni-K.]

Thank you!