Optimal Partitions of Strings: A new class of Burrows-Wheeler Compression Algorithms

Raffaele Giancarlo (raffaele@math.unipa.it) and Marinella Sciortino (mari@math.unipa.it)
University of Palermo, Italy

The Burrows-Wheeler Transform [BW94]
• abraca → bacraa, 1
• Pipeline: I → BWT → MTF → H/AC → O

Outline of the Talk
• BWT Compression Algorithms
• A New Class of Algorithms
• Combinatorial Dependency [BCCFM00, BFG02]
• Lower Bound on Compression Performance
• Conjecture by Manzini [M01]
• Universal Encoding of Integers [L68, E75]
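The transform on the slide can be sketched in a few lines of Python. This is a naive illustration, not the authors' implementation. The function names are hypothetical. The textbook form of the transform (sort all rotations, output the last column) maps abraca to (caraab, 1). The pair (bacraa, 1) shown on the slide is reproduced by a variant that sorts the rotations by their right-to-left reading and outputs the first column; that this is the convention used in the talk is an inference from the example.

```python
def bwt_last_column(s):
    """Textbook BWT: sort rotations left to right, output the last column.

    Returns (transformed string, row index of the original string).
    """
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    last = "".join(s[(i - 1) % n] for i in order)
    return last, order.index(0)


def bwt_first_column_reversed_sort(s):
    """Variant: sort rotations by their reversal, output the first column.

    Each output character follows (cyclically) the context used for sorting,
    so characters that follow similar contexts end up next to each other.
    """
    n = len(s)
    order = sorted(range(n), key=lambda i: (s[i:] + s[:i])[::-1])
    first = "".join(s[i] for i in order)
    return first, order.index(0)


print(bwt_last_column("abraca"))                 # ('caraab', 1)
print(bwt_first_column_reversed_sort("abraca"))  # ('bacraa', 1)
```

Both functions build every rotation explicitly, so they take quadratic space and are meant for small examples only.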

Why BWT is Useful?
INTUITION: Let us consider the effect on a single letter in a common word in a block of English text:
w = …The…the…The…the…those…the…the…that…the…The…
The characters following "th" are grouped together inside BWT(w).
Extensive experimental work confirms this “clustering effect” [BW94, F96].
[Figure: columns F and L of the sorted rotations of w.]

Why Useful: “Clustering” of Symbols and MTF
Move-to-Front Coding (MTF) [BeSTaWe86]:
• Encodes an instance of a character x by an integer that counts the number of distinct symbols seen since the latest occurrence of x.
• EXAMPLE (initial list a, b, c): abaaaabbbbbcccccaaaaa → 011000100002000020000
• BWT + MTF = many runs of zeroes, which is good for order-0 encoders.
• Relation between compressibility of files and a high percentage of zeroes [F96].
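A minimal Python sketch of move-to-front coding, to make the example concrete. The function names are hypothetical, and the initial list is assumed to be the alphabet in sorted order (a, b, c for the example string). The encoder emits one integer per input symbol, so the 21-symbol example yields 21 codes.

```python
def mtf_encode(s, alphabet=None):
    """Replace each symbol by its current position in a recency list."""
    table = sorted(set(s)) if alphabet is None else list(alphabet)
    codes = []
    for ch in s:
        rank = table.index(ch)            # distinct symbols seen since last ch
        codes.append(rank)
        table.insert(0, table.pop(rank))  # move ch to the front
    return codes


def mtf_decode(codes, alphabet):
    """Invert mtf_encode given the same initial list."""
    table = list(alphabet)
    out = []
    for rank in codes:
        ch = table.pop(rank)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


codes = mtf_encode("abaaaabbbbbcccccaaaaa")
print("".join(map(str, codes)))  # 011000100002000020000
```

Runs of equal symbols become runs of zeroes, which is the property the slide relies on.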

Two Main Research Questions
• Is MTF an essential step for the successful use of BWT [F96]?
  • Experiments [AM97, BK98, WM01];
  • Theory?
• Analysis of the compression performance of BWT-based algorithms.
  • Experiments (see DCC)
  • Information Theory [Ef99, Sa98]
  • Worst Case Setting
  • Empirical Entropy of Strings [M01] - No Assumptions

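Zero-th Order Empirical Entropy
• s is a string over the alphabet $\Sigma = \{a_1, a_2, \ldots, a_h\}$.
• $n_i$ is the number of occurrences of $a_i$ in s. Assume that $n_i \ge n_{i+1}$.
• The zero-th order empirical entropy of s (with $0 \log 0 = 0$):
  $H_0(s) = -\sum_{i=1}^{h} \frac{n_i}{|s|} \log \frac{n_i}{|s|}$
• The zero-th order modified empirical entropy [M01]:
  $H_0^*(s) = 0$ if $|s| = 0$;
  $H_0^*(s) = (1 + \lfloor \log |s| \rfloor) / |s|$ if $|s| \ne 0$ and $H_0(s) = 0$;
  $H_0^*(s) = H_0(s)$ otherwise.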
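The two zero-th order quantities can be computed directly from symbol counts. The sketch below assumes base-2 logarithms (entropy in bits per symbol) and uses the modified entropy as defined in [M01]: for a non-empty string made of a single repeated symbol it charges $(1 + \lfloor \log |s| \rfloor)/|s|$, the cost of writing down the length. Function names are hypothetical.

```python
import math
from collections import Counter


def h0(s):
    """Zero-th order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def h0_star(s):
    """Zero-th order modified empirical entropy of s."""
    n = len(s)
    if n == 0:
        return 0.0
    h = h0(s)
    if h == 0:
        # 1 + floor(log2 n) is exactly the number of bits of n.
        return n.bit_length() / n
    return h


print(h0("aabb"), h0_star("aabb"))  # 1.0 1.0
print(h0("aaaa"), h0_star("aaaa"))  # H0 is zero, H0* is 0.75
```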

k-th Order Empirical Entropy
• $\Sigma^k$ is the set of all strings of length k over $\Sigma$.
• $\Sigma^{\le k}$ is the set of all strings of length at most k over $\Sigma$.
• Fix an integer $k \ge 0$. For any string $y \in \Sigma^k$, $y_s$ is the string consisting of the characters following y in s.
• The k-th order empirical entropy of s is
  $H_k(s) = \frac{1}{|s|} \sum_{y \in \Sigma^k} |y_s| H_0(y_s)$
• The k-th order modified empirical entropy:
  $H_k^*(s) = \frac{1}{|s|} \sum_{y \in T_k} |y_s| H_0^*(y_s)$
  where $T_k$ denotes a set of strings in $\Sigma^{\le k}$ such that each string in $\Sigma^k$ has a unique suffix in $T_k$ and such that, among the sets having this property, $T_k$ is the one minimizing the right-hand side.
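A small Python sketch of the (unmodified) k-th order empirical entropy, following the definition above: group the characters of s by the k symbols that precede them, then average the zero-th order entropies of the groups. Base-2 logarithms and the function names are assumptions. The modified version $H_k^*$ is not computed here because it requires minimizing over the sets $T_k$.

```python
import math
from collections import Counter, defaultdict


def h0(s):
    """Zero-th order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def followers(s, k):
    """Map each length-k substring y of s to y_s, the characters following y."""
    groups = defaultdict(list)
    for i in range(len(s) - k):
        groups[s[i:i + k]].append(s[i + k])
    return {y: "".join(chars) for y, chars in groups.items()}


def hk(s, k):
    """k-th order empirical entropy of s, in bits per symbol."""
    if not s:
        return 0.0
    return sum(len(ys) * h0(ys) for ys in followers(s, k).values()) / len(s)


print(followers("abababab", 1))             # {'a': 'bbbb', 'b': 'aaa'}
print(hk("abababab", 0), hk("abababab", 1))
```

For abababab the order-0 entropy is 1 bit per symbol, while the order-1 entropy is zero because each symbol is fully determined by its predecessor.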

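Results by Manzini
• Let BW0 be a BWT-based algorithm with an arithmetic coder as zero-th order compressor. Then, $\forall k \ge 0$: [bound: formula not preserved in the source]
• Let BW0RL be a BWT-based algorithm using run-length encoding with an arithmetic coder as zero-th order compressor. Then, $\forall k \ge 0$, $\exists g_k' \ge 0$ such that: [bound: formula not preserved in the source]
• where the constant appearing in the bound equals $10^{-2}$.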

Insights by Manzini
THEOREM (Manzini): Let s be a string. For each $k \ge 0$, there exist an $f \le h^k$ and a partition $s'_1, s'_2, \ldots, s'_f$ of BWT(s) such that
  $|s| H_k(s) = \sum_{i=1}^{f} |s'_i| H_0(s'_i)$.
An analogous result holds for $H_k^*(s)$.
REMARK: If there existed an ideal compressor A such that, for any partition $s_1, s_2, \ldots, s_p$ of a string s,
  $A(s) \le \sum_{i=1}^{p} |s_i| H_0(s_i)$,
then $A(BWT(s)) \le |s| H_k(s)$. Analogously for $H_k^*(s)$.

Open Problems by Manzini
• Conjectures by Manzini:
  • No BWT-compression method can get to a bound of the form $|s| H_k^*(s) + g_k$ for $k \ge 0$ and $g_k \ge 0$ constant.
  • The ideal algorithm A does not exist.
We show that A does not exist, but we can approximate it. So, we prove that both conjectures are true.

Our Contributions
• We provide a new class of BWT-based algorithms, based on partitions of strings, that do not use MTF as a part of the compression process.
• We analyze two of those new methods in the worst case setting. We obtain better theoretical bounds than Manzini.
• Under a natural hypothesis on the inner working of the algorithm, no BWT-compressor using that type of algorithm can achieve $|s| H_k^*(s) + g_k$.

Algorithms That Use Optimal Partitions of Strings (rather than MTF)
• Compute BWT(s);
• Optimally partition the transformed string with respect to a suitable cost function;
• Compress each piece separately.

Combinatorial Dependency
Fix a data compressor C that adds a special end-of-string marker # before compressing the string.
• DEFINITION: Two strings x and y are combinatorially dependent with respect to the data compressor C if |C(xy#)| < |C(x#)| + |C(y#)|.
• OPTIMAL PARTITION IN TERMS OF THE BASE COMPRESSOR C: by Dynamic Programming.
• Techniques by Buchsbaum et al. [BCCFM00, BFG02] for Table Compression.
• Surprisingly, it specializes to strings.
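The dynamic program behind the optimal partition can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the base compressor C is stood in for by zlib (whose output stream is self-terminating, playing the role of the marker #), costs are measured in bits, and all function names are hypothetical. The recurrence is best[i] = min over j < i of best[j] + cost(s[j:i]), where best[i] is the cheapest way to compress the prefix of length i as a sequence of separately compressed pieces.

```python
import zlib


def zlib_cost(x):
    """Stand-in for |C(x#)|: size in bits of the zlib encoding of x."""
    return 8 * len(zlib.compress(x.encode()))


def combinatorially_dependent(x, y, cost=zlib_cost):
    """True if compressing xy together beats compressing x and y apart."""
    return cost(x + y) < cost(x) + cost(y)


def optimal_partition(s, cost=zlib_cost):
    """Return (total cost, pieces) of a cheapest partition of s under cost."""
    n = len(s)
    best = [0] + [float("inf")] * n   # best[i]: cheapest cost for s[:i]
    cut = [0] * (n + 1)               # cut[i]: start of the last piece of s[:i]
    for i in range(1, n + 1):
        for j in range(i):
            candidate = best[j] + cost(s[j:i])
            if candidate < best[i]:
                best[i], cut[i] = candidate, j
    pieces, i = [], n
    while i > 0:                      # walk the cuts back from the end
        pieces.append(s[cut[i]:i])
        i = cut[i]
    return best[n], pieces[::-1]


total, pieces = optimal_partition("a" * 20 + "b" * 20)
print(total, pieces)
```

The double loop evaluates the cost function on a quadratic number of substrings, so the running time is dominated by the base compressor.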

The new class BWTOPT
• ASSUMPTIONS: Let C be a data compressor such that:
  • given an input string x, it adds a special end-of-string marker # and compresses x#;
  • either # is really appended at the end of the string, or the length of x is explicitly stored as a prefix of the compressed string (universal encoding of integers).
Given the input string s:
1. Compute BWT(s);
2. Optimally partition BWT(s) using C as the base compressor;
3. Compress each piece of the partition separately.
TIME COMPLEXITY of BWTOPT: It depends critically on that of C and it is $\Omega(n^2)$. Fortunately, if C has a linear time decompression algorithm, then BWTOPT also admits a linear time decompression algorithm.
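The assumption that each compressed piece carries its own end (a real # or a stored length) is what lets the decoder split the output without any extra directory. The sketch below illustrates this with zlib as a hypothetical stand-in for C: each zlib stream is self-terminating, so the pieces can be concatenated and decoded left to right in one pass. The inverse BWT that would follow is omitted, and the function names are not from the talk.

```python
import zlib


def compress_pieces(pieces):
    """Compress each piece separately and concatenate the results."""
    return b"".join(zlib.compress(p.encode()) for p in pieces)


def decompress_pieces(blob):
    """Decode concatenated self-terminating streams, one after another."""
    out = []
    while blob:
        stream = zlib.decompressobj()
        out.append(stream.decompress(blob).decode())
        blob = stream.unused_data     # bytes past the end of this stream
    return "".join(out)


blob = compress_pieces(["aaaa", "bcbc", "zz"])
print(decompress_pieces(blob))  # aaaabcbczz
```

Each piece is decoded once, so the total decoding work is the sum of the per-piece decoding costs.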

Lower Bound
• ASSUMPTIONS:
  • Given a compressor C, we assume that $\{C(a^n) \mid n > 0\}$ is a codeword set for the integers.
  • For technical reasons we also assume that $|C(a^n)|$ is a non-decreasing function of n.
The lower bound comes from a theorem in [Levenshtein, 1968], which we restate in our notation:
THEOREM: There exists a countably infinite number of strings s such that $|C(s)| \ge |s| H_k^*(s) + \delta(|s|)$, where $\delta(n)$ is a diverging function of n.
COROLLARY: No compression algorithm satisfying the previous assumptions can achieve the bound formulated in the conjecture by Manzini, i.e. $|s| H_k^*(s) + g_k$ for $k \ge 0$ and $g_k \ge 0$ constant. This result holds independently of whether or not BWT is applied as a preprocessing step.
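To see why a codeword set for the integers forces a diverging overhead, it helps to look at one concrete universal code. The sketch below uses the Elias gamma code; the talk cites [E75] for universal encodings of integers, but treating gamma as the code for $C(a^n)$ is an assumption made only for illustration. For $s = a^n$ we have $|s| H_0^*(s) = 1 + \lfloor \log n \rfloor$ bits, while the gamma codeword for n takes $2\lfloor \log n \rfloor + 1$ bits, so the excess $\lfloor \log n \rfloor$ grows without bound.

```python
def elias_gamma(n):
    """Elias gamma codeword of a positive integer, as a string of bits."""
    if n < 1:
        raise ValueError("gamma code is defined for positive integers")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode one codeword from the front of bits; return (value, rest)."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    end = 2 * zeros + 1
    return int(bits[zeros:end], 2), bits[end:]


for n in (1, 5, 1024, 10**6):
    ideal = n.bit_length()            # 1 + floor(log2 n) = |s| H0*(s) for s = a^n
    used = len(elias_gamma(n))
    print(n, ideal, used, used - ideal)
```

The last column of the printed table is the excess over $|s| H_0^*(s)$; no constant $g_k$ can cover it for all n.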

A prefix code compressor HC
• # is an end-of-string marker.
• The base compressor C is a modification of Huffman encoding so that we can encode # basically for free.
THEOREM: Consider a string s. Let $p_1, p_2, \ldots, p_h$ be the empirical probability distribution of s. Then: [bound: formula not preserved in the source]

A compressor RHC based on Prefix and Run Length Encoding
• It combines Huffman encoding with run-length encoding (RLE).
• It uses knowledge about the symbol frequencies in a string. For low-entropy strings it is essential to use RLE.
• The RLE scheme we use depends critically on a variable-length encoding of a sequence of integers.
PROBLEM: Given two positive integers t and w, t < w, and the increasing sequence of integers $d_1, d_2, \ldots, d_t$ in [1, w], find an algorithm to produce a binary encoding of $d_1, d_2, \ldots, d_t$ and w.
The solution we propose works well in conjunction with CD, where the strings we need to compress may consist of only a few symbols.
THEOREM: Consider a string s. Let $p_1, p_2, \ldots, p_h$ be the empirical probability distribution of s. Then: [bound: formula not preserved in the source]
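The talk does not spell out its solution to the PROBLEM above, so the following is only a baseline sketch of one simple way to encode an increasing sequence together with w; it should not be read as the authors' scheme. It writes w, then t, then the gaps between consecutive values, each with the Elias gamma code. All function names are hypothetical.

```python
def gamma(n):
    """Elias gamma codeword of a positive integer, as a string of bits."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def read_gamma(bits, pos):
    """Decode one gamma codeword starting at pos; return (value, new pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2), end


def encode_positions(d, w):
    """Encode w, t = len(d), and the gaps of the increasing sequence d."""
    if not d or d[0] < 1 or d[-1] > w or any(a >= b for a, b in zip(d, d[1:])):
        raise ValueError("d must be a non-empty increasing sequence in [1, w]")
    out = [gamma(w), gamma(len(d))]
    prev = 0
    for x in d:
        out.append(gamma(x - prev))   # every gap is at least 1
        prev = x
    return "".join(out)


def decode_positions(bits):
    """Invert encode_positions; return (d, w)."""
    w, pos = read_gamma(bits, 0)
    t, pos = read_gamma(bits, pos)
    d, prev = [], 0
    for _ in range(t):
        gap, pos = read_gamma(bits, pos)
        prev += gap
        d.append(prev)
    return d, w


code = encode_positions([2, 3, 7], 10)
print(code, decode_positions(code))
```

Small gaps get short codewords, which is why a gap-based layout is a natural starting point for run-length style data.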
