Combinatorial aspects of the Burrows-Wheeler transform

Sabrina Mantaci

Antonio Restivo

Marinella Sciortino

University of Palermo


Burrows-Wheeler Transform

  • In 1994 M. Burrows and D. Wheeler introduced a new data compression method based on preprocessing the input string. This preprocessing, named after them the Burrows-Wheeler Transform (BWT), produces a permutation of the letters of the input string such that:

  • the transformed string is easier to compress than the original one;

  • the original string can be recovered.

  • This preprocessing makes it possible to define a class of lossless data compression algorithms that:

  • achieve speed comparable to that of algorithms based on the techniques of Lempel and Ziv;

  • obtain a compression ratio close to that of the best statistical modelling techniques.


How does BWT work?

  • INPUT: w = abraca

  • Lexicographically sort the cyclic rotations of w. F is the first column of the sorted matrix and L is the last one:

      F         L
  0   a a b r a c
  1   a b r a c a   ← I
  2   a c a a b r
  3   b r a c a a
  4   c a a b r a
  5   r a c a a b

  • OUTPUT: BWT(w) = L = caraab and the index I = 1, the position of the original word w in the lexicographically sorted list of rotations.

  • The following properties hold:

  • the character L[i] is followed in w by F[i] (cyclically, since every row is a rotation of w);

  • for each character ch, the i-th occurrence of ch in F corresponds to the i-th occurrence of ch in L.
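The construction above can be sketched in a few lines of Python. This is a naive illustration that sorts all rotations explicitly (practical implementations typically use suffix sorting instead); the function name `bwt` is chosen here for illustration and the word is assumed to be non-empty.

```python
def bwt(w):
    """Return (L, I) for a non-empty word w, via sorted cyclic rotations."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    last_column = "".join(row[-1] for row in rotations)
    # I is the first row of the sorted matrix that equals w itself.
    return last_column, rotations.index(w)


w = "abraca"
L, I = bwt(w)
F = "".join(sorted(L))  # first column = letters of L in sorted order
print(L, I, F)  # caraab 1 aaabcr

# First property: L[i] is followed (cyclically) in w by F[i].
cyclic = w + w[0]
assert all(L[i] + F[i] in cyclic for i in range(len(w)))
```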


Reversibility

The Burrows-Wheeler transform is reversible: given BWT(w) and the index I, it is possible to recover w.

  • Given L = BWT(w) = caraab and I = 1:

  • Construct F by alphabetically sorting the letters of L.

  • Define a permutation τ on {0,1,…,n-1} that establishes the correspondence between the positions of the same letters in F and in L: if F[j] is the k-th occurrence of a letter in F, then τ(j) is the position of the k-th occurrence of that letter in L.

  • Starting from position I, recover $w = w_0 \cdots w_{n-1}$ as follows:

  • $w_i = F[\tau^i(I)]$, where $\tau^0(x) = x$ and $\tau^{i+1}(x) = \tau(\tau^i(x))$.

  i     0 1 2 3 4 5
  F[i]  a a a b c r
  L[i]  c a r a a b
  τ(i)  1 3 4 5 0 2

Starting from I = 1, the positions visited are 1, 3, 5, 2, 4, 0, which spell w = abraca.
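A minimal sketch of this inversion, assuming the permutation is built by matching the k-th occurrence of each letter in F with its k-th occurrence in L. The names `correspondence` and `inverse_bwt` are illustrative; the stable sort of the positions of L does the matching.

```python
def correspondence(L):
    """tau[j] = position in L of the letter found at position j of F.

    sorted() is stable, so equal letters keep their relative order and the
    k-th occurrence in F is matched with the k-th occurrence in L.
    """
    return sorted(range(len(L)), key=lambda j: L[j])


def inverse_bwt(L, I):
    """Recover w from L = BWT(w) and the index I."""
    tau = correspondence(L)
    F = sorted(L)
    letters = []
    pos = I
    for _ in range(len(L)):
        letters.append(F[pos])  # w_i = F[tau^i(I)]
        pos = tau[pos]
    return "".join(letters)


print(correspondence("caraab"))  # [1, 3, 4, 5, 0, 2]
print(inverse_bwt("caraab", 1))  # abraca
```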


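  • REMARK: Two words x and y are conjugate if and only if BWT(x) = BWT(y).

  • PROPOSITION:

  • If $u = v^d$ and $BWT(v) = a_0 a_1 \cdots a_{n-1}$, then $BWT(u) = a_0^d a_1^d \cdots a_{n-1}^d$;

  • If $BWT(v) = a_0 a_1 \cdots a_{n-1}$ and $BWT(u) = a_0^d a_1^d \cdots a_{n-1}^d$, then there exists a conjugate u' of u such that $u' = v^d$.

We can deduce that the BWT of any word $u = v^d$ is obtained from the BWT of its primitive root v, and that it depends only on the conjugacy class of v. Therefore we can study combinatorial properties of the BWT by studying the conjugacy classes of primitive words.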
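The remark and the first part of the proposition can be checked on small examples. The sketch below assumes the proposition reads BWT(v^d) = a_0^d a_1^d … a_{n-1}^d, i.e. every letter of BWT(v) is repeated d times; `bwt` and `repeat_letters` are illustrative helper names.

```python
def bwt(w):
    """Last column of the sorted cyclic rotations of w (index I omitted)."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(row[-1] for row in rotations)


def repeat_letters(s, d):
    """Map a_0 a_1 ... a_{n-1} to a_0^d a_1^d ... a_{n-1}^d."""
    return "".join(ch * d for ch in s)


v = "abraca"

# Conjugate words have the same transform.
assert all(bwt(v[i:] + v[:i]) == bwt(v) for i in range(len(v)))

# The transform of a power repeats every letter of BWT(v) d times.
for d in range(1, 5):
    assert bwt(v * d) == repeat_letters(bwt(v), d)

print(bwt(v * 2))  # ccaarraaaabb
```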


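Standard Words

Let $d_1, d_2, \ldots, d_n, \ldots$ be a sequence of natural numbers with $d_1 \ge 0$ and $d_i > 0$ for $i = 2, \ldots, n, \ldots$

Consider the sequence $\{s_n\}_{n \ge 0}$ defined as:

$s_0 = b$, $s_1 = a$, $s_{n+1} = s_n^{d_n} s_{n-1}$ for $n \ge 1$.

  • $s = \lim_{n \to \infty} s_n$ is a characteristic Sturmian word

  • $\{s_n\}_{n \ge 0}$ is called the approximating sequence of s

  • $(d_1, d_2, \ldots, d_n, \ldots)$ is the directive sequence of s

  • Each finite word $s_n$ is a standard word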
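As an illustration, the approximating sequence can be generated directly from a finite prefix of the directive sequence. The sketch assumes the usual recurrence s_0 = b, s_1 = a, s_{n+1} = s_n^{d_n} s_{n-1}; the function name `standard_words` is hypothetical.

```python
def standard_words(directive):
    """Return [s_0, s_1, ..., s_{k+1}] for a directive prefix (d_1, ..., d_k)."""
    words = ["b", "a"]  # s_0 = b, s_1 = a
    for d in directive:
        # s_{n+1} = s_n^{d_n} s_{n-1}
        words.append(words[-1] * d + words[-2])
    return words


# The directive sequence (1, 1, 1, ...) gives the finite Fibonacci words.
print(standard_words([1, 1, 1, 1]))
# ['b', 'a', 'ab', 'aba', 'abaab', 'abaababa']
```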


Characterization of standard words

  • A word w is standard if and only if it is a letter or w = vab (or equivalently w = vba), where v has periods p, q such that gcd(p,q) = 1 and |v| = p+q-2 (the extremal case of the Fine and Wilf theorem).

  • A word w is standard if and only if it is a letter or there exist palindromes P, Q, R such that w = QR = Pxy, where {x,y} = {a,b}.

  • Standard words correspond to an extremal case of the Knuth-Morris-Pratt algorithm.
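The first characterization translates into a direct test. The sketch below works under two assumptions that the slide does not spell out: the input is a word over {a,b}, and any integer p ≥ |v| counts as a (trivial) period of v, so that for example the empty word has period 1. The names `has_period` and `is_standard` are illustrative.

```python
from math import gcd


def has_period(v, p):
    """p is a period of v if v[i] == v[i + p] wherever both positions exist."""
    return all(v[i] == v[i + p] for i in range(len(v) - p))


def is_standard(w):
    """Test w = v·ab or v·ba with coprime periods p, q of v and |v| = p + q - 2."""
    if len(w) == 1:
        return True  # single letters are standard
    if len(w) < 2 or w[-2:] not in ("ab", "ba"):
        return False
    v = w[:-2]
    total = len(v) + 2  # required value of p + q
    return any(
        gcd(p, total - p) == 1 and has_period(v, p) and has_period(v, total - p)
        for p in range(1, total)
    )


print(is_standard("abaababa"))  # True  (v = abaaba has periods 3 and 5)
print(is_standard("abab"))      # False (v = ab has no suitable pair of periods)
```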


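Rotations

Standard words can also be generated by rotations.

Let $p, q \ge 2$ be such that gcd(p,q) = 1, and let n = p+q.

$\rho_p : \{0,1,\ldots,n-1\} \to \{0,1,\ldots,n-1\}$ is defined as $\rho_p(z) = z + p \pmod{n}$.

$I_a = \{0,1,\ldots,q-1\}$, $I_b = \{q,q+1,\ldots,n-1\}$

$\lambda : \{0,1,\ldots,n-1\} \to \{a,b\}$ is defined as $\lambda(x) = a$ if $x \in I_a$, and $\lambda(x) = b$ otherwise.

If n = 8, p = 3, q = 5, the points 0,…,7 lie on a circle, with 0,…,4 labelled a and 5, 6, 7 labelled b. Reading the labels along the orbit 3, 6, 1, 4, 7, 2, 5, 0 of $\rho_3$ gives

w = abaababa

  • THEOREM: Let $w = x_0 x_1 \cdots x_{n-1}$ in {a,b}*, with $|w|_a = q$ and $|w|_b = p$.

  • w is a standard word with suffix ba ⇔ $x_i = \lambda(\rho_p^i(p))$ for i = 0,…,n-1

  • w is a standard word with suffix ab ⇔ $x_i = \lambda(\rho_p^i(p-1))$ for i = 0,…,n-1

REMARK: Let $u = u_0 u_1 \cdots u_{n-1}$ and $v = v_0 v_1 \cdots v_{n-1}$.

If $u_i = \lambda(\rho_p^i(h))$ and $v_i = \lambda(\rho_p^i(k))$ for some h, k in {0,1,…,n-1}, then u and v are conjugate.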
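The rotation construction can be sketched as follows. The starting points used here (p for the suffix ba, p-1 for the suffix ab) are an assumption taken from the reconstruction index I given later in the talk, and the function name `rotation_word` is illustrative.

```python
from math import gcd


def rotation_word(p, q, start):
    """Labels read along the orbit start, start+p, start+2p, ... (mod n)."""
    n = p + q
    if gcd(p, q) != 1:
        raise ValueError("p and q must be coprime")

    def label(x):
        # I_a = {0, ..., q-1} is labelled a, I_b = {q, ..., n-1} is labelled b.
        return "a" if x < q else "b"

    return "".join(label((start + i * p) % n) for i in range(n))


print(rotation_word(3, 5, 3))  # abaababa  (suffix ba)
print(rotation_word(3, 5, 2))  # abaabaab  (suffix ab)
```

Since gcd(p,n) = 1 the orbit visits every point exactly once, so a different starting point only produces a conjugate of the same word.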


A new characterization of standard words

THEOREM: Let u be a word over the alphabet {a,b}.

$BWT(u) = b^p a^q$ with gcd(p,q) = 1 if and only if u is a conjugate of a standard word.

In particular, when reconstructing u from BWT(u) and the index I:

if I = p then u is a standard word with suffix ba;

if I = p-1 then u is a standard word with suffix ab.

COROLLARY: $BWT(u) = b^k a^h$ with gcd(k,h) = d if and only if $u = v^d$, where v is a conjugate of a standard word.
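The theorem can be observed on the running example with p = 3 and q = 5. The sketch uses a naive rotation-sorting `bwt` helper (an illustrative name) and only checks the stated properties on a few words; it is not a proof.

```python
def bwt(w):
    """Return (L, I) for a non-empty word w, via sorted cyclic rotations."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(row[-1] for row in rotations), rotations.index(w)


standard = "abaababa"  # standard word with 5 a's and 3 b's

# Every conjugate of the standard word is mapped to b^3 a^5.
for i in range(len(standard)):
    conjugate = standard[i:] + standard[:i]
    assert bwt(conjugate)[0] == "bbbaaaaa"

print(bwt("abaababa"))  # ('bbbaaaaa', 3)  I = p,   suffix ba
print(bwt("abaabaab"))  # ('bbbaaaaa', 2)  I = p-1, suffix ab

# Corollary with d = 2: abab = (ab)^2 and BWT(abab) = b^2 a^2.
print(bwt("abab")[0])   # bbaa
# aabb is not a power of a conjugate of a standard word.
print(bwt("aabb")[0])   # baba
```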



Idea of the proof

The permutation τ giving the correspondence between the positions of characters in F and L is $\tau(z) = z + p \pmod{n}$. For n = 8, p = 3, q = 5:

  i     0 1 2 3 4 5 6 7
  F[i]  a a a a a b b b
  L[i]  b b b a a a a a
  τ(i)  3 4 5 6 7 0 1 2

Starting, for example, from the position I = p, we can recover the word u: $u_i = F[\tau^i(p)]$.


Further Research

  • Study extremal cases of the BWT for k-letter alphabets with k > 2.

  • For instance, for k = 3, characterize the words w such that BWT(w) belongs to c*a*b* or b*c*a*.

  • This property holds neither for 3-standard words nor for balanced words.

  • Does a relation exist between the complexity function of a word w and the structure of BWT(w)?

  • Given a language L, one can define BWT(L) = {BWT(w) | w in L}. One can ask whether the BWT preserves some properties of a language L, such as belonging to a certain family of languages in the Chomsky hierarchy.

  • We found negative results: both languages below are regular, but their images are not.

$L_1 = (ab)^*$, $BWT(L_1) = \{b^n a^n \mid n \ge 0\}$, a context-free language

$L_2 = (abc)^*$, $BWT(L_2) = \{c^n a^n b^n \mid n \ge 0\}$, a context-sensitive language
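The two images can be confirmed for small n with a naive rotation-sorting transform. This is a sketch: `bwt` is an illustrative helper, and only n from 1 to 5 is checked.

```python
def bwt(w):
    """Last column of the sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(row[-1] for row in rotations)


for n in range(1, 6):
    # (ab)^n is mapped to b^n a^n
    assert bwt("ab" * n) == "b" * n + "a" * n
    # (abc)^n is mapped to c^n a^n b^n
    assert bwt("abc" * n) == "c" * n + "a" * n + "b" * n

print(bwt("ababab"), bwt("abcabc"))  # bbbaaa ccaabb
```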


  • Is it possible to characterize interesting families of words in terms of their BWT?

Consider for instance the words generated by finite iterations of the Thue-Morse morphism μ(a) = ab, μ(b) = ba.

Denote by $v^R$ the reversal of v and by $\bar{v}$ the word obtained from v by interchanging a and b.

Then:

$BWT(\mu^n(a)) = v \bar{v}^R$

where

$v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} \cdots b^{2^0} a$ if n is even

$v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} \cdots a^{2^0} b$ if n is odd
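The Thue-Morse example can be checked numerically for small n. The sketch takes the right-hand side to be v followed by the reversal of the letter-swapped copy of v, with v built from the exponents 2^(n-2), …, 2^0 and a final single letter; `thue_morse`, `predicted_v` and `swap_and_reverse` are hypothetical helper names, and only n = 2, 3, 4 are tested.

```python
def bwt(w):
    """Last column of the sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(row[-1] for row in rotations)


def thue_morse(n):
    """n-th iterate of the morphism a -> ab, b -> ba applied to a."""
    w = "a"
    for _ in range(n):
        w = "".join("ab" if ch == "a" else "ba" for ch in w)
    return w


def predicted_v(n):
    """b^(2^(n-2)) a^(2^(n-3)) ... with alternating letters, then one more letter."""
    parts, letter = [], "b"
    for e in range(n - 2, -1, -1):
        parts.append(letter * 2 ** e)
        letter = "a" if letter == "b" else "b"
    parts.append(letter)  # final single letter: a if n is even, b if n is odd
    return "".join(parts)


def swap_and_reverse(v):
    """Interchange a and b in v, then reverse the result."""
    return v.translate(str.maketrans("ab", "ba"))[::-1]


for n in range(2, 5):
    v = predicted_v(n)
    assert bwt(thue_morse(n)) == v + swap_and_reverse(v)

print(thue_morse(3), bwt(thue_morse(3)))  # abbabaab bbababaa
```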

