Combinatorial aspects of the Burrows-Wheeler transform


Sabrina Mantaci

Antonio Restivo

Marinella Sciortino

University of Palermo

Burrows-Wheeler Transform

- In 1994 M. Burrows and D. Wheeler introduced a new data compression method based on a preprocessing of the input string. This preprocessing, named the Burrows-Wheeler Transform (BWT) after them, produces a permutation of the letters of the input string such that:
  - the transformed string is easier to compress than the original one;
  - the original string can be recovered.
- This preprocessing made it possible to define a class of lossless data compression algorithms that:
  - achieve speed comparable to algorithms based on the techniques of Lempel and Ziv;
  - obtain a compression ratio close to that of the best statistical modelling techniques.

- INPUT: w = abraca

- Lexicographically sort the cyclic rotations of w. F is the first column of the sorted matrix and L is the last one:

      F         L
  0   a a b r a c
  1   a b r a c a   ← I
  2   a c a a b r
  3   b r a c a a
  4   c a a b r a
  5   r a c a a b

- OUTPUT: BWT(w) = L = caraab and the index I = 1, the position of the original word w in the lexicographically sorted list of rotations.

- The following properties hold:
  - for each i, the character L[i] is followed in w by F[i] (cyclically: in row I, L[I] is the last letter of w and F[I] is the first);
  - for each character ch, the i-th occurrence of ch in F corresponds to the i-th occurrence of ch in L.
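The construction above can be sketched directly in Python. This is a naive illustration that materializes all rotations (quadratic space); the function name `bwt` is a label chosen here, not something from the slides, and the input is assumed to be non-empty.

```python
from typing import Tuple


def bwt(w: str) -> Tuple[str, int]:
    """Return (L, I): the last column of the sorted rotation matrix of w
    and the row index at which w itself appears. Assumes w is non-empty."""
    n = len(w)
    # All cyclic rotations of w, sorted lexicographically.
    rotations = sorted(w[i:] + w[:i] for i in range(n))
    last_column = "".join(r[-1] for r in rotations)
    # For a non-primitive w this is the first of several equal rows.
    index = rotations.index(w)
    return last_column, index


L, I = bwt("abraca")
print(L, I)  # caraab 1

# F is simply the sorted multiset of letters of L.
F = "".join(sorted(L))
print(F)  # aaabcr
```

Practical implementations usually avoid building the rotation matrix and work from a suffix array instead; the sketch only mirrors the definition.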

Reversibility

The Burrows-Wheeler transform is reversible, in the sense that given BWT(w) and an index I, it is possible to recover w.

- Given L = BWT(w) = caraab and I = 1:
  - Construct F by alphabetically sorting the letters of L.
  - Define a permutation $\tau$ on {0,1,…,n-1} that establishes a correspondence between the positions of the same letters in F and in L: the i-th occurrence of a letter in F is mapped to its i-th occurrence in L.
  - Starting from position I, recover $w = w_0 w_1 \dots w_{n-1}$ as follows:
    $w_i = F[\tau^i(I)]$, where $\tau^0(x) = x$ and $\tau^{i+1}(x) = \tau(\tau^i(x))$.

For L = caraab:

  position z   0 1 2 3 4 5
  F[z]         a a a b c r
  L[z]         c a r a a b
  $\tau(z)$    1 3 4 5 0 2

Starting from I = 1, the orbit 1, 3, 5, 2, 4, 0 spells w = F[1] F[3] F[5] F[2] F[4] F[0] = abraca.
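A minimal sketch of this inversion procedure follows. The names `tau_permutation` and `inverse_bwt` are illustrative choices; the permutation is called tau here as an assumption about the symbol lost from the slide. The sketch relies on Python's `sorted` being stable, so that equal letters keep their order of occurrence in L.

```python
from typing import List


def tau_permutation(L: str) -> List[int]:
    """Map each position of F (the sorted letters of L) to the position in L
    of the matching occurrence of the same letter."""
    # Stable sort of the positions of L by letter: the i-th occurrence of a
    # letter in F is paired with its i-th occurrence in L.
    return sorted(range(len(L)), key=lambda j: L[j])


def inverse_bwt(L: str, I: int) -> str:
    """Recover w from L = BWT(w) and the index I, using w_i = F[tau^i(I)]."""
    F = sorted(L)
    tau = tau_permutation(L)
    letters = []
    x = I
    for _ in range(len(L)):
        letters.append(F[x])
        x = tau[x]
    return "".join(letters)


print(tau_permutation("caraab"))  # [1, 3, 4, 5, 0, 2]
print(inverse_bwt("caraab", 1))   # abraca
```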

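- REMARK: Two words x and y are conjugate if and only if BWT(x) = BWT(y).
- PROPOSITION:
  - If $u = v^d$ and $\mathrm{BWT}(v) = a_0 a_1 \dots a_{n-1}$, then $\mathrm{BWT}(u) = a_0^d a_1^d \dots a_{n-1}^d$;
  - If $\mathrm{BWT}(v) = a_0 a_1 \dots a_{n-1}$ and $\mathrm{BWT}(u) = a_0^d a_1^d \dots a_{n-1}^d$, then there exists a conjugate u' of u such that $u' = v^d$.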

We can deduce that:

Therefore we can study combinatorial properties of the BWT by studying the conjugacy classes of primitive words.
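The remark and the first part of the proposition above can be checked on small examples. The sketch below uses a naive transform that returns only the word L (no index); `bwt_word` and `expand` are hypothetical helper names introduced for this illustration.

```python
def bwt_word(w: str) -> str:
    """Last column of the lexicographically sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)


def expand(word: str, d: int) -> str:
    """Replace every letter a_i of word by the block a_i^d."""
    return "".join(ch * d for ch in word)


v, d = "abraca", 3
print(bwt_word(v))      # caraab
print(bwt_word(v * d))  # cccaaarrraaaaaabbb
print(bwt_word(v * d) == expand(bwt_word(v), d))  # True

# Conjugate words have the same transform.
print(bwt_word("racaab") == bwt_word("abraca"))   # True
```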

Standard Words

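Let $d_1, d_2, \dots, d_n, \dots$ be a sequence of natural numbers, with $d_1 \ge 0$ and $d_i > 0$ for $i \ge 2$.

Consider the sequence $\{s_n\}_{n \ge 0}$ defined as:

$s_0 = b, \quad s_1 = a, \quad s_{n+1} = s_n^{d_n} s_{n-1} \ \text{for } n \ge 1.$

The sequence converges to an infinite word s, and:

- s is a characteristic Sturmian word
- $\{s_n\}_{n \ge 0}$ is called the approximating sequence of s
- $(d_1, d_2, \dots, d_n, \dots)$ is the directive sequence of s
- Each finite word $s_n$ is a standard word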
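As an illustration, the sketch below generates a finite prefix of the approximating sequence from a finite prefix of the directive sequence. It assumes the usual recurrence s_0 = b, s_1 = a, s_{n+1} = s_n^{d_n} s_{n-1}; the function name `standard_words` is chosen here for the example.

```python
from typing import List, Sequence


def standard_words(directive: Sequence[int]) -> List[str]:
    """Return [s_0, s_1, ..., s_{k+1}] for the directive prefix (d_1, ..., d_k),
    assuming s_0 = 'b', s_1 = 'a' and s_{n+1} = s_n^{d_n} s_{n-1}."""
    s_prev, s_cur = "b", "a"
    words = [s_prev, s_cur]
    for d in directive:
        s_prev, s_cur = s_cur, s_cur * d + s_prev
        words.append(s_cur)
    return words


# Directive sequence (1, 1, 1, 1): the finite Fibonacci words.
print(standard_words([1, 1, 1, 1]))
# ['b', 'a', 'ab', 'aba', 'abaab', 'abaababa']

print(standard_words([1, 1, 2])[-1])  # abaabaab
```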

Characterization of standard words

- A word w is standard if and only if it is a letter or w = vab (or equivalently w = vba), where v has periods p, q such that gcd(p,q) = 1 and |v| = p+q-2 (the extremal case of the Fine and Wilf theorem).
- A word w is standard if and only if it is a letter or there exist palindromes P, Q, R such that w = QR = Pxy, where {x,y} = {a,b}.
- Standard words correspond to an extremal case of the Knuth-Morris-Pratt algorithm.
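The first characterization can be turned into a brute-force test. The sketch below is an assumption-laden illustration: it treats p as a period of v whenever v[i] = v[i+p] for all valid i, so any p ≥ |v| counts as a (trivial) period, which is the convention needed for cases such as v = a^k or the empty word. The helper names are hypothetical.

```python
from math import gcd


def has_period(v: str, p: int) -> bool:
    """True if v[i] == v[i + p] wherever both sides exist (vacuous if p >= len(v))."""
    return all(v[i] == v[i + p] for i in range(len(v) - p))


def is_central(v: str) -> bool:
    """True if v has coprime periods p, q with len(v) == p + q - 2."""
    n = len(v)
    # Periods larger than n + 1 can never satisfy p + q - 2 == n.
    periods = [p for p in range(1, n + 2) if has_period(v, p)]
    return any(gcd(p, q) == 1 and p + q - 2 == n for p in periods for q in periods)


def is_standard(w: str) -> bool:
    """Test a word over {a, b} with the periodicity characterization."""
    if len(w) == 1:
        return w in ("a", "b")
    return w[-2:] in ("ab", "ba") and is_central(w[:-2])


print(is_standard("abaababa"))  # True  (v = abaaba has periods 3 and 5)
print(is_standard("abab"))      # False
```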

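Rotations

Standard words can also be generated by rotations.

Let $p, q \ge 2$ be such that gcd(p,q) = 1, and let n = p+q.

$\rho_p : \{0,1,\dots,n-1\} \to \{0,1,\dots,n-1\}$ is defined as $\rho_p(z) = z + p \pmod n$.

$I_a = \{0,1,\dots,q-1\}$, $I_b = \{q,q+1,\dots,n-1\}$

$\varphi : \{0,1,\dots,n-1\} \to \{a,b\}$ is defined as: $\varphi(x) = a$ if $x \in I_a$, b otherwise.

[Figure: the points 0, 1, …, 7 on a circle, labelled with the letters a and b, for n=8, p=3, q=5.]

If n=8, p=3, q=5, then starting from p and applying $\rho_p$ repeatedly visits 3, 6, 1, 4, 7, 2, 5, 0, which spells w = abaababa.

- THEOREM: Let $w = x_0 x_1 \dots x_{n-1}$ in {a,b}*, with $|w|_a = q$ and $|w|_b = p$.
  - w is a standard word with suffix ba $\iff x_i = \varphi(\rho_p^{i}(p))$ for all i;
  - w is a standard word with suffix ab $\iff x_i = \varphi(\rho_p^{i}(p-1))$ for all i.

REMARK: Let $u = u_0 u_1 \dots u_{n-1}$ and $v = v_0 v_1 \dots v_{n-1}$.

If $u_i = \varphi(\rho_p^{i}(j))$ and $v_i = \varphi(\rho_p^{i}(k))$ for some starting points j, k in {0,1,…,n-1}, then u and v are conjugate.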
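The rotation construction can be sketched as follows. The function name `rotation_word` and the `start` parameter are introduced here for illustration; the letter at each step is a if the current point is smaller than q and b otherwise, and the point is then advanced by p modulo n.

```python
from math import gcd


def rotation_word(p: int, q: int, start: int) -> str:
    """Read n = p + q letters along the orbit of `start` under z -> z + p (mod n),
    writing 'a' for points in {0, ..., q-1} and 'b' for points in {q, ..., n-1}."""
    if gcd(p, q) != 1:
        raise ValueError("p and q must be coprime")
    n = p + q
    z = start
    letters = []
    for _ in range(n):
        letters.append("a" if z < q else "b")
        z = (z + p) % n
    return "".join(letters)


print(rotation_word(3, 5, 3))  # abaababa (suffix ba)
print(rotation_word(3, 5, 2))  # abaabaab (suffix ab)
```

Since p and n are coprime, the orbit visits every point exactly once, so different starting points yield conjugates of the same word.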

A new characterization of standard words

THEOREM: Let u be a word over the alphabet {a,b}.

$\mathrm{BWT}(u) = b^p a^q$ with gcd(p,q) = 1 if and only if u is a conjugate of a standard word.

In particular, in order to reconstruct u from BWT(u) and the index I:

if I = p then u is a standard word with suffix ba

if I = p-1 then u is a standard word with suffix ab

COROLLARY: $\mathrm{BWT}(u) = b^k a^h$ with gcd(k,h) = d if and only if $u = v^d$, where v is a conjugate of a standard word.

Idea of the proof: for $L = b^p a^q$ we have $F = a^q b^p$. With n=8, p=3, q=5:

  position z   0 1 2 3 4 5 6 7
  F[z]         a a a a a b b b
  L[z]         b b b a a a a a

The permutation giving the correspondence between the positions of characters in F and L is $\tau(z) = z + p \pmod n$.

Starting, for example, from the position I = p we can recover the word u: $u_i = F[\tau^i(p)]$.
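The theorem suggests a simple test, sketched below under the assumption that the naive rotation-sorting transform is acceptable for short words. The names `bwt` and `is_conjugate_of_standard` are hypothetical and chosen for this example.

```python
import re
from math import gcd
from typing import Tuple


def bwt(w: str) -> Tuple[str, int]:
    """Naive BWT: last column of the sorted rotations and the row index of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations), rotations.index(w)


def is_conjugate_of_standard(u: str) -> bool:
    """Check whether BWT(u) has the form b^p a^q with gcd(p, q) = 1."""
    L, _ = bwt(u)
    if not u or re.fullmatch(r"b*a*", L) is None:
        return False
    p, q = L.count("b"), L.count("a")
    return gcd(p, q) == 1


print(bwt("abaababa"))  # ('bbbaaaaa', 3): I = p, suffix ba
print(bwt("abaabaab"))  # ('bbbaaaaa', 2): I = p - 1, suffix ab
print(is_conjugate_of_standard("baababaa"))  # True, a conjugate of abaababa
print(is_conjugate_of_standard("aabb"))      # False, BWT is baba
```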

Further Research

- Study the extremal case of the BWT for k-letter alphabets with k>2.
  - For instance, for k=3, characterize the words w such that BWT(w) belongs to c*a*b* or b*c*a*.
  - This property holds neither for 3-standard words nor for balanced words.

- Is there a relation between the complexity function of a word w and the structure of BWT(w)?

- Given a language L, one can define BWT(L) = {BWT(w) | w in L}. One can ask whether the BWT preserves properties of a language L, such as membership in a given family of the Chomsky hierarchy.
  - We found negative results:

    $L_1 = (ab)^*$ is regular, but $\mathrm{BWT}(L_1) = \{b^n a^n \mid n \ge 0\}$ is a context-free language;

    $L_2 = (abc)^*$ is regular, but $\mathrm{BWT}(L_2) = \{c^n a^n b^n \mid n \ge 0\}$ is a context-sensitive language.
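Both examples can be confirmed for small n with a naive transform. The helper name `bwt_word` is an illustrative choice, and the check below only covers the first few values of n.

```python
def bwt_word(w: str) -> str:
    """Last column of the lexicographically sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)


for n in range(4):
    # (ab)^n is mapped to b^n a^n, and (abc)^n to c^n a^n b^n.
    print(n, bwt_word("ab" * n), bwt_word("abc" * n))
```

For n = 2, for instance, the two transforms are bbaa and ccaabb.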

Further Research

- Is it possible to characterize interesting families of words in terms of their BWT?

Consider for instance the words generated by finite iterations of the Thue-Morse morphism $\mu(a) = ab$, $\mu(b) = ba$.

Denote by $v^R$ the reversal of v and by $\bar v$ the word obtained from v by interchanging a and b.

Then:

$\mathrm{BWT}(\mu^n(a)) = v\,\bar v^R$

where

$v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} \cdots b^{2^0} a$ if n is even,

$v = b^{2^{n-2}} a^{2^{n-3}} b^{2^{n-4}} \cdots a^{2^0} b$ if n is odd.

For example, $\mu^2(a) = abba$ gives v = ba and $\mathrm{BWT}(abba) = baba$.
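The Thue-Morse formula stated in this section can be compared against a direct computation for small n. The sketch assumes n ≥ 2 and reads the formula as BWT(mu^n(a)) = v followed by the reversal of the letter-swapped v; all function names are hypothetical.

```python
def bwt_word(w: str) -> str:
    """Last column of the lexicographically sorted cyclic rotations of w."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(r[-1] for r in rotations)


def thue_morse(n: int) -> str:
    """Apply the morphism a -> ab, b -> ba to the letter 'a' n times."""
    w = "a"
    for _ in range(n):
        w = "".join("ab" if ch == "a" else "ba" for ch in w)
    return w


def predicted_bwt(n: int) -> str:
    """Build v and return v followed by the reversed, letter-swapped v (n >= 2)."""
    blocks = []
    # Exponents 2^(n-2), 2^(n-3), ..., 2^0 on alternating letters b, a, b, ...
    for j, e in enumerate(range(n - 2, -1, -1)):
        blocks.append(("b" if j % 2 == 0 else "a") * 2 ** e)
    blocks.append("a" if n % 2 == 0 else "b")
    v = "".join(blocks)
    swapped = v.translate(str.maketrans("ab", "ba"))
    return v + swapped[::-1]


print(thue_morse(3))       # abbabaab
print(bwt_word(thue_morse(3)))  # bbababaa
print(predicted_bwt(3))    # bbababaa
```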
