Combinatorial aspects of the Burrows-Wheeler transform

Sabrina Mantaci, Antonio Restivo, Marinella Sciortino. University of Palermo.

Burrows-Wheeler Transform
  • In 1994 M. Burrows and D. Wheeler introduced a new data compression method based on a preprocessing of the input string. This preprocessing, named after them the Burrows-Wheeler Transform (BWT), produces a permutation of the letters of the input string such that:
      • the transformed string is easier to compress than the original one;
      • the original string can be recovered.
  • This preprocessing makes it possible to define a class of lossless data compression algorithms that:
      • achieve speed comparable to that of algorithms based on the techniques of Lempel and Ziv;
      • obtain a compression ratio close to that of the best statistical modelling techniques.
How does BWT work?
  • INPUT: w = abraca
  • Lexicographically sort the cyclic rotations of w; F and L denote the first and last columns of the sorted matrix:

        F         L
    0   a a b r a c
    1   a b r a c a   ← I
    2   a c a a b r
    3   b r a c a a
    4   c a a b r a
    5   r a c a a b

  • OUTPUT: BWT(w) = L = caraab and the index I = 1, denoting the position of the original word w after the lexicographic ordering.
  • The following properties hold:
      • the character L[i] is followed in w (read cyclically) by F[i];
      • for each character ch, the i-th occurrence of ch in F corresponds to the i-th occurrence of ch in L.
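The construction above can be sketched in a few lines of Python. This is a minimal illustration that sorts all rotations explicitly, assuming a non-empty input word; the function name `bwt` is a hypothetical helper, not something defined in the slides, and practical implementations would use suffix sorting instead.

```python
def bwt(w):
    """Return (L, I) for a non-empty word w.

    L is the last column of the lexicographically sorted matrix of
    cyclic rotations of w, and I is the row where w itself appears.
    """
    n = len(w)
    # Build and sort all cyclic rotations (quadratic space, for illustration only).
    rotations = sorted(w[i:] + w[:i] for i in range(n))
    last_column = "".join(rotation[-1] for rotation in rotations)
    return last_column, rotations.index(w)


print(bwt("abraca"))  # ('caraab', 1)
```

If w is not primitive, several rows equal w; `index` then returns the first of them.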
Reversibility

The Burrows-Wheeler transform is reversible: given BWT(w) and the index I, it is possible to recover w.

  • Given L = BWT(w) = caraab and I = 1:
  • Construct F by alphabetically sorting the letters of L.
  • Define a permutation τ on {0,1,…,n-1} that maps the position of each letter in F to the position of the same letter occurrence in L.
  • Starting from position I, we can recover $w = w_0 \cdots w_{n-1}$ as follows:
      • $w_i = F[\tau^i(I)]$, where $\tau^0(x) = x$ and $\tau^{i+1}(x) = \tau(\tau^i(x))$.

    i      0 1 2 3 4 5
    F[i]   a a a b c r
    L[i]   c a r a a b
    τ(i)   1 3 4 5 0 2

Starting from I = 1, the positions visited are 1, 3, 5, 2, 4, 0, which gives w = abraca.
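The recovery procedure can be sketched as follows. This is an illustrative sketch, not the authors' code: the name `inverse_bwt` is hypothetical, and the permutation is called `tau` here as an assumption about the symbol lost in the transcript. It relies on Python's `sorted` being stable, so that the i-th occurrence of a letter in F is matched with its i-th occurrence in L.

```python
def inverse_bwt(L, I):
    """Recover w from L = BWT(w) and the index I.

    Returns (w, tau), where tau[j] is the position in L of the letter
    occurrence found at position j of F.
    """
    n = len(L)
    F = sorted(L)
    # Stable sort of the positions of L by letter: equal letters keep
    # their relative order, which is exactly the F-to-L correspondence.
    tau = sorted(range(n), key=lambda k: L[k])
    letters, position = [], I
    for _ in range(n):
        letters.append(F[position])  # w_i = F[tau^i(I)]
        position = tau[position]
    return "".join(letters), tau


print(inverse_bwt("caraab", 1))  # ('abraca', [1, 3, 4, 5, 0, 2])
```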
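REMARK: Two words x and y are conjugate ⇔ BWT(x) = BWT(y).

  • PROPOSITION:
      • If $u = v^d$ and $BWT(v) = a_0 a_1 \cdots a_{n-1}$, then $BWT(u) = a_0^d a_1^d \cdots a_{n-1}^d$;
      • If $BWT(v) = a_0 a_1 \cdots a_{n-1}$ and $BWT(u) = a_0^d a_1^d \cdots a_{n-1}^d$, then there exists a conjugate $u'$ of $u$ such that $u' = v^d$.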

We can deduce that:

Therefore we can study combinatorial properties of the BWT by studying the conjugacy classes of primitive words.
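A small numerical check of the remark and of the proposition, under the assumption that the proposition reads: if u = v^d and BWT(v) = a_0 a_1 … a_{n-1}, then BWT(u) = a_0^d a_1^d … a_{n-1}^d. The helper names `bwt_string` and `repeat_letters` are hypothetical.

```python
def bwt_string(w):
    """Last column of the sorted matrix of cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rotation[-1] for rotation in rotations)


def repeat_letters(s, d):
    """Map a_0 a_1 ... a_{n-1} to a_0^d a_1^d ... a_{n-1}^d."""
    return "".join(letter * d for letter in s)


v = "abraca"
# Conjugate words have the same transform.
print(bwt_string(v) == bwt_string("bracaa"))
# The transform of a power v^d repeats each letter of BWT(v) d times.
for d in range(1, 4):
    print(d, bwt_string(v * d) == repeat_letters(bwt_string(v), d))
```

A check on a few words is only evidence, of course, not a proof of the statement.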

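Standard Words

Let $d_1, d_2, \ldots, d_n, \ldots$ be a sequence of natural numbers with $d_1 \ge 0$ and $d_i > 0$ for $i \ge 2$.

Consider the sequence $\{s_n\}_{n \ge 0}$ defined as:

$s_0 = b$, $s_1 = a$, $s_{n+1} = s_n^{d_n} s_{n-1}$ for $n \ge 1$.

  • $s = \lim_{n \to \infty} s_n$ is a characteristic Sturmian word
  • $\{s_n\}_{n \ge 0}$ is called the approximating sequence of s
  • $(d_1, d_2, \ldots, d_n, \ldots)$ is the directive sequence of s
  • Each finite word $s_n$ is a standard word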
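As an illustration, the approximating sequence can be generated from a finite prefix of the directive sequence. The sketch assumes the usual recurrence s_0 = b, s_1 = a, s_{n+1} = s_n^{d_n} s_{n-1} for n ≥ 1; the function name `standard_words` is hypothetical.

```python
def standard_words(directive):
    """Return [s_0, s_1, ..., s_{k+1}] for a directive prefix (d_1, ..., d_k).

    Assumed recurrence: s_0 = b, s_1 = a, s_{n+1} = s_n^{d_n} s_{n-1}.
    """
    words = ["b", "a"]
    for d in directive:
        words.append(words[-1] * d + words[-2])
    return words


# The directive sequence (1, 1, 1, ...) yields the finite Fibonacci words.
print(standard_words([1, 1, 1, 1]))
# ['b', 'a', 'ab', 'aba', 'abaab', 'abaababa']
```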
Characterization of standard words
  • A word w is standard if and only if it is a letter or w = vab (or equivalently w = vba), where v has periods p, q such that gcd(p,q) = 1 and |v| = p+q-2 (extremal case of the Fine and Wilf theorem).
  • A word w is standard if and only if it is a letter or there exist palindromes P, Q, R such that w = QR = Pxy, where {x,y} = {a,b}.
  • Standard words correspond to an extremal case of the Knuth-Morris-Pratt algorithm.
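The first characterization lends itself to a direct test. The sketch below is a brute-force check over all pairs (p, q) with p + q = |w|, assuming words over {a,b} and assuming the convention that an integer p ≥ |v| counts as a period of v (needed, for example, for w = aab, where v = a has periods 1 and 2). The names `has_period` and `is_standard` are hypothetical.

```python
from math import gcd


def has_period(v, p):
    """True if v[i] == v[i + p] wherever both positions exist."""
    return all(v[i] == v[i + p] for i in range(len(v) - p))


def is_standard(w):
    """Test w = vab or w = vba with v having coprime periods p, q, |v| = p + q - 2."""
    if len(w) == 1:
        return True  # a single letter is standard
    if set(w[-2:]) != {"a", "b"}:
        return False  # w must end with ab or ba
    v = w[:-2]
    total = len(v) + 2  # candidate pairs satisfy p + q = |v| + 2
    return any(
        gcd(p, total - p) == 1 and has_period(v, p) and has_period(v, total - p)
        for p in range(1, total)
    )


for word in ["abaababa", "abaabaab", "abab", "abba"]:
    print(word, is_standard(word))
```

For abaababa the witness is v = abaaba with periods 3 and 5.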
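Rotations

Standard words can also be generated by rotations.

Let p, q ≥ 2 be such that gcd(p,q) = 1 and n = p+q.

$\rho_p : \{0,1,\ldots,n-1\} \to \{0,1,\ldots,n-1\}$ defined as $\rho_p(z) = z + p \pmod{n}$

$I_a = \{0,1,\ldots,q-1\}$, $I_b = \{q,q+1,\ldots,n-1\}$

$\gamma : \{0,1,\ldots,n-1\} \to \{a,b\}$ defined as: $\gamma(x) = a$ if $x \in I_a$, $b$ otherwise.

If n=8, p=3, q=5,…

[Figure: the points 0, 1, …, 7 arranged on a circle, each labelled a or b.]

w=abaababa

  • THEOREM: Let $w = x_0 x_1 \cdots x_{n-1}$ in {a,b}*, $|w|_a = q$ and $|w|_b = p$.
      • w is a standard word with suffix ba ⇔ $x_i = \gamma(\rho_p^i(p))$
      • w is a standard word with suffix ab ⇔ $x_i = \gamma(\rho_p^i(p-1))$

REMARK: Let $u = u_0 u_1 \cdots u_{n-1}$, $v = v_0 v_1 \cdots v_{n-1}$.

If $u_i = \gamma(\rho_p^i(h))$ and $v_i = \gamma(\rho_p^i(k))$ for some starting points h, k, then u and v are conjugate.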
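The rotation construction can be sketched as follows. The starting points p (suffix ba) and p-1 (suffix ab) are an assumption taken from the later statement that I = p and I = p-1 identify the two standard words; the function name `word_from_rotation` is hypothetical.

```python
def word_from_rotation(p, q, start):
    """Read the labels along the orbit of z -> z + p (mod n), n = p + q.

    Points 0..q-1 are labelled a and points q..n-1 are labelled b.
    """
    n = p + q
    z, letters = start, []
    for _ in range(n):
        letters.append("a" if z < q else "b")
        z = (z + p) % n
    return "".join(letters)


p, q = 3, 5
print(word_from_rotation(p, q, p))      # abaababa (suffix ba)
print(word_from_rotation(p, q, p - 1))  # abaabaab (suffix ab)
print(word_from_rotation(p, q, 0))      # aabaabab (a conjugate of both)
```

Since gcd(p, n) = 1, the orbit visits every point exactly once, so different starting points give conjugate words.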

A new characterization of standard words

THEOREM: Let u be a word over the alphabet {a,b}.

$BWT(u) = b^p a^q$ with gcd(p,q)=1 if and only if u is a conjugate of a standard word.

In particular, in order to reconstruct u from BWT(u) and the index I:

if I=p then u is a standard word with suffix ba

if I=p-1 then u is a standard word with suffix ab

COROLLARY: $BWT(u) = b^k a^h$ with gcd(k,h)=d if and only if $u = v^d$, where v is a conjugate of a standard word.
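The theorem can be checked on the running example w = abaababa (p = 3, q = 5). This is only a sanity check on one conjugacy class, using a hypothetical helper `bwt_with_index` that sorts rotations explicitly.

```python
def bwt_with_index(w):
    """Return (L, I): last column of the sorted rotations and the row of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rotation[-1] for rotation in rotations), rotations.index(w)


standard = "abaababa"  # 5 occurrences of a, 3 of b, so p = 3 and q = 5
conjugates = [standard[i:] + standard[:i] for i in range(len(standard))]

# Every conjugate has the transform b^3 a^5.
print({bwt_with_index(c)[0] for c in conjugates})  # {'bbbaaaaa'}
# The index tells the two standard words apart.
print(bwt_with_index("abaababa"))  # ('bbbaaaaa', 3): I = p, suffix ba
print(bwt_with_index("abaabaab"))  # ('bbbaaaaa', 2): I = p-1, suffix ab
```

For the corollary, `bwt_with_index("abab")[0]` gives `bbaa`, with gcd(2,2) = 2 and abab = (ab)^2.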
Idea of the proof:

    i      0 1 2 3 4 5 6 7
    F[i]   a a a a a b b b
    L[i]   b b b a a a a a

The permutation τ giving the correspondence between the positions of characters in F and L is $\tau(z) = z + p \pmod{n}$.

Starting, for example, from the position I=p we can recover the word u: $u_i = F[\tau^i(p)]$.
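The key step of the proof idea can be verified numerically: for L = b^p a^q the letter correspondence between F and L is the rotation by p. The sketch below assumes the permutation is called tau and uses the hypothetical helper names `letter_correspondence` and `recover`.

```python
def letter_correspondence(L):
    """tau[j] = position in L of the letter occurrence at position j of F."""
    # Stable sort keeps equal letters in their original order.
    return sorted(range(len(L)), key=lambda k: L[k])


def recover(p, q, start):
    """Rebuild u from L = b^p a^q, reading F along the orbit of start."""
    L = "b" * p + "a" * q
    F = sorted(L)
    tau = letter_correspondence(L)
    position, letters = start, []
    for _ in range(p + q):
        letters.append(F[position])
        position = tau[position]
    return "".join(letters)


p, q = 3, 5
n = p + q
tau = letter_correspondence("b" * p + "a" * q)
print(tau == [(z + p) % n for z in range(n)])  # True
print(recover(p, q, p))  # abaababa
```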

Further Research

  • Study extremal cases of the BWT for k-letter alphabets with k>2.
      • For instance, for k=3, characterize the words w such that BWT(w) belongs to c*a*b* or b*c*a*.
      • This property holds neither for 3-standard words nor for balanced words.
  • Does a relation exist between the complexity function of a word w and the structure of BWT(w)?
  • Given a language L, one can define BWT(L)={BWT(w) | w in L}. One can ask whether BWT preserves some properties of L, such as belonging to a certain family of languages in the Chomsky hierarchy.
      • We found negative results:

$L_1 = (ab)^*$, $BWT(L_1) = \{b^n a^n \mid n \ge 0\}$, a context-free language

$L_2 = (abc)^*$, $BWT(L_2) = \{c^n a^n b^n \mid n \ge 0\}$, a context-sensitive language
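The two negative examples are easy to reproduce for small n. A minimal sketch with a hypothetical helper `bwt_string`; it confirms the shape of the transformed words, not the language-theoretic classification itself.

```python
def bwt_string(w):
    """Last column of the sorted matrix of cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rotation[-1] for rotation in rotations)


for n in range(0, 5):
    # (ab)^n is mapped to b^n a^n, and (abc)^n to c^n a^n b^n.
    print(n, bwt_string("ab" * n), bwt_string("abc" * n))
```

Both source languages are regular, while their images are not, which is why the results are described as negative.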

  • Is it possible to characterize interesting families of words in terms of their BWT?

Consider for instance the words generated by finite iterations of the Thue-Morse morphism m(a)=ab, m(b)=ba.

Denote by $v^R$ the reversal of v and by $\bar{v}$ the word obtained from v by interchanging a and b.

Then:

$BWT(m^n(a)) = v \bar{v}^R$

where

$v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} \cdots b^{2^0} a$ if n is even

$v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} \cdots a^{2^0} b$ if n is odd
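The Thue-Morse formula can be compared with a direct computation for small n. The sketch assumes the formula reads BWT(m^n(a)) = v followed by the reversal of the letter-swapped v, with v = b^{2^{n-2}} a^{2^{n-3}} … ending in b^{2^0} a (n even) or a^{2^0} b (n odd), for n ≥ 2. The names `thue_morse`, `predicted_bwt` and `bwt_string` are hypothetical.

```python
def bwt_string(w):
    """Last column of the sorted matrix of cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rotation[-1] for rotation in rotations)


def thue_morse(n):
    """Apply the morphism m(a) = ab, m(b) = ba to the letter a, n times."""
    w = "a"
    for _ in range(n):
        w = "".join("ab" if letter == "a" else "ba" for letter in w)
    return w


def predicted_bwt(n):
    """Build v followed by the reversal of the letter-swapped v (n >= 2)."""
    blocks, letter = [], "b"
    for exponent in range(n - 2, -1, -1):
        blocks.append(letter * 2 ** exponent)  # alternating runs b, a, b, ...
        letter = "a" if letter == "b" else "b"
    blocks.append(letter)  # the final single letter
    v = "".join(blocks)
    swapped = v.translate(str.maketrans("ab", "ba"))
    return v + swapped[::-1]


for n in range(2, 5):
    print(n, bwt_string(thue_morse(n)), predicted_bwt(n))
```

For n = 2 both columns read baba, and for n = 3 both read bbababaa.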