
Compressing and Indexing Strings and (labeled) Trees



  1. Compressing and Indexing Strings and (labeled) Trees Paolo Ferragina Dipartimento di Informatica, Università di Pisa Paolo Ferragina, Università di Pisa

  2. Two types of data
  • String = raw sequence of symbols from an alphabet Σ: texts, DNA sequences, executables, audio files, ...
  • Labeled tree = tree of arbitrary shape and depth whose nodes are labeled with strings drawn from an alphabet Σ: XML files, parse trees, tries and suffix trees, compiler intermediate representations, execution traces, ...
  [Figure: an example labeled tree with a book root, chapter children and section leaves]
  Paolo Ferragina, Università di Pisa

  3. What do we mean by “Indexing” ?
  Typical queries on strings: substring searches, string statistics, motif extraction. Typical queries on labeled trees: i-th child with some label constraint, parent or ancestor, labeled path anchored anywhere (a subset of XPath [W3C]).
  • Word-based indexes, here a notion of “word” must be devised ! Inverted files, Signature files, Bitmaps.
  • Full-text indexes, no constraint on text and queries ! Suffix Array, Suffix tree, ...
  • Path indexes that also support navigational operations ! See next...
  Paolo Ferragina, Università di Pisa

  4. What do we mean by “Compression” ?
  Folk tale: it is more economical to store data in compressed form than uncompressed.
  Data compression has two positive effects:
  • Space saving (or, enlarge memory at the same cost)
  • Performance improvement: better use of the memory levels closer to the CPU, increased network, disk and memory bandwidth, reduced (mechanical) seek time
  Paolo Ferragina, Università di Pisa

  5. Paolo Ferragina, Università di Pisa

  6. Paolo Ferragina, Università di Pisa

  7. Paolo Ferragina, Università di Pisa

  8. Study the interplay of Compression and Indexing
  Do we witness a paradoxical situation ?
  • An index injects redundant data, in order to speed up the pattern searches
  • Compression removes redundancy, in order to squeeze the space occupancy
  NO, new results proved a mutual reinforcement behaviour !
  • Better indexes can be designed by exploiting compression techniques (in terms of space occupancy)
  • Better compressors can be designed by exploiting indexing techniques (also in terms of compression ratio)
  More surprisingly, strings and labeled trees are closer than expected !
  • Labeled-tree compression can be reduced to string compression
  • Labeled-tree indexing can be reduced to “special” string indexing problems
  Paolo Ferragina, Università di Pisa

  9. Our journey over “string data”
  A timeline joining index design (Weiner ’73) and compressor design (Shannon ’48):
  • Suffix Array (’87 and ’90) and the Burrows-Wheeler Transform (1994)
  • Compressed Index: space close to gzip/bzip, query time close to P’s length [Ferragina-Manzini, Focs ’00 + JACM ’05]
  • Compression Booster: a combinatorial tool to transform poor compressors into better compressors [Ferragina, Giancarlo, Manzini, Sciortino, JACM ’05]
  • Wavelet Tree [Grossi-Gupta-Vitter, Soda ’03]
  • Improved indexes and compressors for strings [Ferragina-Manzini-Makinen-Navarro, ’04]
  • And many other papers of many other authors...
  Paolo Ferragina, Università di Pisa

  10. The Suffix Array [BaezaYates-Gonnet, 87 and Manber-Myers, 90]
  Example: T = mississippi#, P = si. SA = [12 11 8 5 2 1 10 9 7 4 6 3] stores the pointers to the suffixes of T in lexicographic order: #, i#, ippi#, issippi#, ississippi#, mississippi#, pi#, ppi#, sippi#, sissippi#, ssippi#, ssissippi#.
  • Substring search: O(|P| + log2 N) time [Manber-Myers, 90], O(|P|/B + logB N) I/Os [Ferragina-Grossi, JACM 99], non-uniform access costs [Navarro et al, Algorithmica 00], self-adjusting SA on disk [Ferragina et al, FOCS 02]
  • SA + T occupy Θ(N log2 N + N log2 |Σ|) bits
  • The suffix permutation cannot be any of {1,...,N}: # binary texts = 2^N ≪ N! = # permutations on {1, 2, ..., N}
  • Ω(N) bits is the worst-case lower bound; Ω(N H(T)) bits for compressible texts
  • Several papers on characterizing the SA’s permutation [Duval et al, 02; Bannai et al, 03; Munro et al, 05; Stoye et al, 05]
  Paolo Ferragina, Università di Pisa
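A minimal Python sketch of the suffix array just described (my helper names, not code from the talk): plain sorting builds the SA of mississippi#, and a binary search delimits the SA range of a pattern, in the spirit of the O(|P| + log N) scheme of Manber-Myers.

    import bisect

    def suffix_array(t):
        # Sort the suffix starting positions by the suffixes they point to.
        # Plain sorting is enough for a toy example; real constructions
        # (skew/DC3, SA-IS, ...) run in O(N) time.
        return sorted(range(len(t)), key=lambda i: t[i:])

    def locate(t, sa, p):
        # Binary search for the leftmost suffix >= p; the matching suffixes form
        # one contiguous SA range (a second binary search would find its right
        # end and give the O(|P| + log N) bound; here we scan forward for clarity).
        suffixes = [t[i:] for i in sa]
        lo = bisect.bisect_left(suffixes, p)
        hi = lo
        while hi < len(suffixes) and suffixes[hi].startswith(p):
            hi += 1
        return sa[lo:hi]                    # text positions of the occurrences

    T = "mississippi#"
    SA = suffix_array(T)
    print([i + 1 for i in SA])              # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
    print(sorted(i + 1 for i in locate(T, SA, "si")))   # [4, 7]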

  11. Can we compress the Suffix Array ? [Ferragina-Manzini, Focs ’00] [Ferragina-Manzini, JACM ‘05]
  The FM-index is a data structure that mixes the best of:
  • Suffix array data structure
  • Burrows-Wheeler Transform
  The theoretical result:
  • Query complexity: O(p + occ log^ε N) time
  • Space occupancy: O(N Hk(T)) + o(N) bits, which is o(N) if T is compressible
  • The index does not depend on k, and the bound holds for all k simultaneously
  The corollary is that the Suffix Array is compressible, and the FM-index is a self-index.
  New concept: the FM-index is an opportunistic data structure that takes advantage of the repetitiveness in the input data to achieve compressed space occupancy, and still efficient query performance.
  Paolo Ferragina, Università di Pisa

  12. The Burrows-Wheeler Transform (1994)
  Given the text T = mississippi#, build the matrix of all its cyclic rotations and sort the rows lexicographically; F is the first column, L the last one:
  #mississipp i
  i#mississip p
  ippi#missis s
  issippi#mis s
  ississippi# m
  mississippi #
  pi#mississi p
  ppi#mississ i
  sippi#missi s
  sissippi#mi s
  ssippi#miss i
  ssissippi#m i
  The output of the transform is the last column L = ipssm#pissii.
  Paolo Ferragina, Università di Pisa
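A minimal Python sketch of the transform exactly as described (sort all rotations, take the last column); it is my illustration, not the talk's code, and the quadratic-space rotation sorting is meant only for small examples.

    def bwt(t):
        # t must end with a unique sentinel ('#') smaller than every other symbol
        assert t.endswith("#") and t.count("#") == 1
        rotations = sorted(t[i:] + t[:i] for i in range(len(t)))   # sorted BWT matrix
        return "".join(row[-1] for row in rotations)               # last column L

    print(bwt("mississippi#"))   # -> ipssm#pissii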

  13. Why is L so interesting for compression ?
  A key observation: L is locally homogeneous, hence highly compressible.
  Algorithm Bzip:
  • Move-to-Front coding of L (sketched below)
  • Run-Length coding
  • Statistical coder: Arithmetic, Huffman
  Bzip vs. Gzip: 20% vs. 33% compression ratio ! [Some theory behind: Manzini, JACM ’01]
  Building the BWT ⟺ SA construction; inverting the BWT ⟺ an array visit. Overall Θ(N) time, but slower than gzip...
  Paolo Ferragina, Università di Pisa
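To illustrate the first stage of the bzip pipeline (my sketch, not the talk's code): move-to-front turns the locally homogeneous column L into a sequence dominated by small integers, which run-length and statistical coders then squeeze well.

    def move_to_front(s, alphabet):
        # Encode each symbol by its current position in a self-adjusting list:
        # recently seen symbols sit near the front, so homogeneous runs map to 0s.
        table = list(alphabet)
        out = []
        for c in s:
            i = table.index(c)
            out.append(i)
            table.insert(0, table.pop(i))   # move the symbol to the front
        return out

    L = "ipssm#pissii"                      # BWT of mississippi#
    print(move_to_front(L, sorted(set(L)))) # [1, 3, 4, 0, 4, 4, 3, 4, 4, 0, 1, 0]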

  14. Is L helpful for full-text searching ?
  L includes SA and T: from the last column of the sorted-rotations matrix one can recover both the suffix array SA = [12 11 8 5 2 1 10 9 7 4 6 3] and the text T = mississippi#. Can we search within L ?
  Paolo Ferragina, Università di Pisa

  15. A useful tool: the L → F mapping
  How do we map L’s chars onto F’s chars ? We need to distinguish equal chars: the occurrences of a symbol appear in the same relative order in L and in F. For example, occ(“i”, 11) = 3 says that the ‘i’ in L[11] is the 3rd ‘i’ of L, hence it maps to the 3rd ‘i’ of F.
  To implement the LF-mapping we need an oracle occ(c, j) = rank of char c in L[1, j].
  Paolo Ferragina, Università di Pisa
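A minimal Python sketch (mine, not the talk's) of the occ oracle and the resulting LF-mapping on the example L; scanning L stands in for a compressed rank structure, and as a usage example the same mapping inverts the BWT by a simple array visit, as the bzip slide anticipated.

    from collections import Counter

    L = "ipssm#pissii"                     # BWT of mississippi#, 0-based here

    def occ(c, j):
        # rank of character c in L[0..j-1]; a wavelet tree answers this in O(log sigma)
        return L[:j].count(c)

    def C(c):
        # number of text characters strictly smaller than c
        counts = Counter(L)
        return sum(n for ch, n in counts.items() if ch < c)

    def LF(i):
        # L[i] and F[C(L[i]) + occ(L[i], i)] are the same text character:
        # equal symbols keep their relative order in L and in F.
        c = L[i]
        return C(c) + occ(c, i)

    # Walking the LF-mapping from the row that ends with '#' (the row equal to T
    # itself) reconstructs the text right to left:
    i, out = L.index("#"), []
    for _ in range(len(L) - 1):
        i = LF(i)
        out.append(L[i])
    print("".join(reversed(out)) + "#")    # -> mississippi#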

  16. Substring search in T (count the pattern occurrences)
  Available info: the occ() oracle over L and a small table C[c] telling, for each char c, where the block of rows prefixed by c starts in F. The text T = mississippi# is unknown to the searcher; only its (compressed) L is stored. Example: P = si.
  First step: the rows prefixed by the last pattern char ‘i’ form a contiguous range [fr, lr].
  Inductive step: given fr, lr for P[j+1, p],
  • take c = P[j];
  • find the first c in L[fr, lr];
  • find the last c in L[fr, lr];
  • the L-to-F mapping of these chars gives the new range (in the example c = ‘s’, and the final range has occ = lr - fr + 1 = 2 occurrences).
  The occ() oracle is enough (i.e. Rank/Select primitives over L); the full backward search is sketched below.
  Paolo Ferragina, Università di Pisa
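A compact Python sketch of this backward search (counting only); the names are mine, the occ scan stands in for a wavelet tree, and it reuses the BWT string of the earlier sketches.

    L = "ipssm#pissii"                        # BWT of mississippi#

    def occ(c, j): return L[:j].count(c)      # rank of c in L[0..j-1]
    def C(c):      return sum(1 for ch in L if ch < c)

    def backward_count(p):
        # [fr, lr) = range of sorted rows prefixed by the pattern suffix matched so far
        fr, lr = 0, len(L)
        for c in reversed(p):                 # scan the pattern right to left
            fr = C(c) + occ(c, fr)
            lr = C(c) + occ(c, lr)
            if fr >= lr:
                return 0                      # this suffix does not occur in T
        return lr - fr                        # number of occurrences of p

    print(backward_count("si"), backward_count("iss"), backward_count("xyz"))   # 2 2 0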

  17. Many details are missing...
  • The column L is actually kept compressed: FM-index + LZ78 parsing [see also the LZ-index by Navarro] still guarantees O(p) time to count P’s occurrences, achieves logarithmic time to locate each pattern occurrence, and can achieve O(p + occ) time... but it loses a sub-logarithmic factor in front of Hk
  • What about a large Σ ? Wavelet Tree and variations [Grossi et al, Soda 03; F.M.-Makinen-Navarro, Spire 04], new approaches to Rank/Select primitives [Munro et al. Soda ’06]
  • Efficient and succinct index construction [Hon et al., Focs 03]
  • In practice, lightweight algorithms: (5+ε)N bytes of space [see Manzini-Ferragina, Algorithmica 04]
  Paolo Ferragina, Università di Pisa

  18. Five years of history...
  • FM-index (Ferragina-Manzini, Focs 00). Space: 5 N Hk(T) + o(N) bits, for any k. Search: O(p + occ log^ε N).
  • Compact Suffix Array (Grossi-Vitter, Stoc 00). Space: Θ(N) bits [+ text]. Search: O(p + polylog(N) + occ log^ε N) [o(p) time with Patricia Tree, O(occ) for short P].
  • Self-indexed CSA (Sadakane, Isaac 00 + Soda 02). Space: O(N H0(T)) bits. Search: O(p log N + occ log^ε N).
  • FM-index + LZ78-algorithm (JACM ’05). Space: O(N Hk(T) log^ε N) bits. Search: O(p + occ), for any p and occ.
  • High-order entropy CSA (GV and Gupta, Soda 03). Space: N Hk(T) + o(N) bits, for |Σ| = polylog(N). Search: O(log |Σ| (p + polylog(N) + occ log(N)/loglog(N))) [o(p) time with Patricia Tree].
  • Alphabet-friendly FM-index (FM and Makinen-Navarro, Spire 04). Space: N Hk(T) + o(N) bits, for |Σ| = polylog(N) and any k. Search: O(p + occ log(N)/loglog(N)).
  • Related results: Wavelet Tree and WT variants, q-gram index [Kärkkäinen-Ukkonen, 96], Succinct Suffix Tree in N log N + Θ(N) bits [Munro et al., 97ss], LZ-index in Θ(N) bits with fast occ retrieval [Navarro, 03], variations over CSA and FM-index [Navarro, Makinen].
  Look at the survey by Gonzalo Navarro and Veli Makinen.
  Paolo Ferragina, Università di Pisa

  19. What’s next ?
  Interesting issues:
  • What about a large Σ: fast Rank/Select in entropy-space bounds ? [Sadakane et al., Soda 06; Munro et al. Soda 06]
  • What about disk-aware or cache-oblivious versions ? [Brodal et al., Soda 06]
  • Applications showing that this is a technological breakthrough...
  Paolo Ferragina, Università di Pisa

  20. What about their practicality ? [December 2003] [January 2005] Paolo Ferragina, Università di Pisa

  21. Paolo Ferragina, Università di Pisa

  22. Is this a technological breakthrough ?
  How do we turn these challenging and mature theoretical achievements into a technological breakthrough ?
  • Engineered implementations
  • A flexible API to allow reuse and development
  • A framework for extensive testing
  Paolo Ferragina, Università di Pisa

  23. Joint effort of Navarro’s group and mine, hence two mirrors.
  • The best known indexes have been implemented with care!!
  • All implemented indexes follow a carefully designed API which offers: build, count, locate, extract, ...
  • A group of varied texts is available; their sizes range from 50MB to 2GB
  • Some tools have been designed to automatically plan, execute and check the index performance over the text collections
  Paolo Ferragina, Università di Pisa

  24. Where we are...
  [Diagram: data types (strings, labeled trees) against the state of the art in indexing and compressed indexing. Labeled trees ?]
  Paolo Ferragina, Università di Pisa

  25. Why we care about labeled trees ? Paolo Ferragina, Università di Pisa

  26. An XML excerpt
  <dblp>
    <book>
      <author> Donald E. Knuth </author>
      <title> The TeXbook </title>
      <publisher> Addison-Wesley </publisher>
      <year> 1986 </year>
    </book>
    <article>
      <author> Donald E. Knuth </author>
      <author> Ronald W. Moore </author>
      <title> An Analysis of Alpha-Beta Pruning </title>
      <pages> 293-326 </pages>
      <year> 1975 </year>
      <volume> 6 </volume>
      <journal> Artificial Intelligence </journal>
    </article>
    ...
  </dblp>
  Paolo Ferragina, Università di Pisa

  27. A tree interpretation...
  • XML document exploration ⟺ tree navigation
  • XML document search ⟺ labeled subpath searches (a subset of XPath [W3C])
  Paolo Ferragina, Università di Pisa

  28. Our problem
  Consider a rooted, ordered, static tree T of arbitrary shape, whose t nodes are labeled with symbols from an alphabet Σ. We wish to devise a succinct representation for T that efficiently supports some operations over T’s structure:
  • Navigational operations: parent(u), child(u, i), child(u, i, c)
  • Subpath searches over a sequence of k labels
  Previous work:
  • The seminal work by Jacobson [Focs ’90] dealt with binary unlabeled trees, achieving O(1) time per navigational operation within 2t + o(t) bits (a parentheses-encoding sketch follows below).
  • Munro-Raman [Focs ’97], and then many others, extended this to unlabeled trees of arbitrary degree and to a richer set of navigational ops: subtree size, ancestor, ...
  • Geary et al [Soda ’04] were the first to deal with labeled trees and navigational operations, but the space is Θ(t |Σ|) bits. Yet, subpath searches are unexplored.
  Paolo Ferragina, Università di Pisa
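To give a concrete feel for "succinct", here is a tiny Python sketch (mine) of the classical balanced-parentheses encoding behind the 2t + o(t)-bit representations cited above: the shape of a t-node tree fits in 2t bits, and navigation is then supported by rank/select over the parentheses.

    def balanced_parentheses(node):
        # node = (label, [children]); labels are ignored, only the shape is
        # encoded, at 2 bits per node: '(' on entering a node, ')' on leaving it.
        label, children = node
        return "(" + "".join(balanced_parentheses(c) for c in children) + ")"

    tree = ("C", [("B", [("c", []), ("a", [])]), ("a", []), ("D", [("b", [])])])
    bp = balanced_parentheses(tree)
    print(bp, len(bp), "bits for", bp.count("("), "nodes")   # ((()())()(())) 14 bits for 7 nodes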

  29. Our journey over “labeled trees” [Ferragina et al, Focs ’05]
  • We propose the XBW-transform, which mimics on trees the nice structural properties of the BW-transform on strings.
  • The XBW-transform linearizes the tree T in such a way that:
  • the indexing of T reduces to implementing simple rank/select operations over a string of symbols from Σ;
  • the compression of T reduces to using any k-th order entropy compressor (gzip, bzip, ...) over a string of symbols from Σ.
  Paolo Ferragina, Università di Pisa

  30. The XBW-Transform
  Step 1. Visit the tree in pre-order. For each node, write down its label in the column Sα and the labels on its upward path in the column Sπ (a permutation of the tree nodes paired with their upward labeled paths).
  [Figure: an example tree with labels A, B, C, D, a, b, c and the resulting columns Sα and Sπ]
  Paolo Ferragina, Università di Pisa

  31. The XBW-Transform
  Step 2. Stably sort the pairs ⟨Sα, Sπ⟩ according to Sπ.
  [Figure: the columns Sα and Sπ of the example tree after the stable sort]
  Paolo Ferragina, Università di Pisa

  32. The XBW-Transform
  Step 3. Add a binary array Slast marking the rows that correspond to last children; the transform is the pair ⟨Slast, Sα⟩.
  Key facts:
  • Nodes correspond to items in ⟨Slast, Sα⟩
  • The node numbering has useful properties for compression and indexing
  • XBW takes optimal t log |Σ| + 2t bits
  • XBW can be built and inverted in optimal O(t) time
  [Figure: the sorted columns Slast, Sα, Sπ of the example tree]
  Paolo Ferragina, Università di Pisa
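To make the three steps concrete, here is a small Python sketch on a toy labeled tree of my own (not the tree drawn on the slides); function and variable names are mine.

    # Toy XBW construction: each node is (label, [children]).
    # Example tree (mine):       C
    #                          / | \
    #                         B  a  D
    #                        / \     \
    #                       c   a     b
    tree = ("C", [("B", [("c", []), ("a", [])]),
                  ("a", []),
                  ("D", [("b", [])])])

    def xbw(root):
        rows = []                                   # (S_pi, S_alpha, S_last) triples
        def visit(node, upward_path, is_last):
            label, children = node
            # Step 1: pre-order visit; each node records its label and the labels
            # on its upward path (nearest ancestor first, root's path is empty).
            rows.append((upward_path, label, is_last))
            for i, child in enumerate(children):
                visit(child, label + upward_path, i == len(children) - 1)
        visit(root, "", True)
        rows.sort(key=lambda r: r[0])               # Step 2: stable sort by S_pi
        # Step 3: S_last marks last children; the pair (S_last, S_alpha) is the XBW.
        S_last  = [int(last) for (_, _, last) in rows]
        S_alpha = [label for (_, label, _) in rows]
        S_pi    = [p for (p, _, _) in rows]         # kept here only for display
        return S_last, S_alpha, S_pi

    S_last, S_alpha, S_pi = xbw(tree)
    for l, a, p in zip(S_last, S_alpha, S_pi):
        print(l, a, p or "(empty)")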

  33. The XBW-Transform is highly compressible
  • Sα is locally homogeneous (like the BWT for strings)
  • Slast has some structure (because of T’s structure)
  [Figure: the columns Slast, Sα, Sπ of the example tree]
  Paolo Ferragina, Università di Pisa

  34. XML Compression: XBW + PPMdi !
  [Chart: compression ratios of XBW against XMLPPM, lzcs and scmppm on several XML datasets]
  String compressors are not so bad !?!
  Paolo Ferragina, Università di Pisa

  35. Structural properties of XBW
  • The relative order among nodes having the same leading path reflects the pre-order visit of T
  • Children are contiguous in XBW (delimited by 1s in Slast)
  • Children reflect the order of their parents
  [Figure: the columns Slast, Sα, Sπ of the example tree]
  Paolo Ferragina, Università di Pisa

  36. The XBW is searchable
  XBW indexing [reduction to string indexing]: store succinct and efficient Rank and Select data structures over these three arrays.
  [Figure: the XBW-index arrays Slast, SΣ, Sα of the example tree, with the rows grouped by the first symbol (A, B, C, D) of their upward path]
  Paolo Ferragina, Università di Pisa

  37. Subpath search in XBW: P = B D
  The rows whose Sπ starts with ‘B’ form a contiguous range [fr, lr]: they are the children of the B-labeled nodes.
  Inductive step:
  • pick the next char of P, i.e. ‘D’;
  • search for the first and last ‘D’ in Sα[fr, lr];
  • jump to their children: they are the rows whose upward path starts with ‘D B’.
  Paolo Ferragina, Università di Pisa

  38. Subpath search in XBW: P = B D (continued)
  • Two ‘D’s (the 2nd and the 3rd ‘D’ of Sα) fall inside Sα[fr, lr]; look at Slast to find the 2nd and 3rd group of children, i.e. the rows whose Sπ starts with ‘D B’.
  • Two occurrences of the subpath, because of the two 1s delimiting these groups.
  A simplified sketch of the whole search follows below.
  Paolo Ferragina, Università di Pisa
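Here is a deliberately simplified Python sketch of this inductive subpath search, run on the toy tree of the earlier XBW sketch; it states the semantics of the search but replaces the Slast and rank/select mechanics of the slides with a direct scan of Sπ, so it is an illustration only (names are mine).

    # Arrays produced by the earlier toy XBW sketch (C-rooted example tree):
    S_alpha = ["C", "c", "a", "B", "a", "D", "b"]
    S_pi    = ["",  "BC", "BC", "C", "C", "C", "DC"]

    def subpath_count(path):
        # Count the downward occurrences of `path` (a list of labels).
        if len(path) == 1:
            return S_alpha.count(path[0])
        rev = path[0]                          # reversed prefix matched so far
        hits = 0
        for c in path[1:]:
            # [fr, lr) = rows whose upward path starts with the reversed prefix;
            # the real XBW-index derives this range via rank/select on S_last and
            # S_alpha alone (S_pi is never stored; we scan it here for clarity).
            rows = [i for i, p in enumerate(S_pi) if p.startswith(rev)]
            if not rows:
                return 0
            fr, lr = rows[0], rows[-1] + 1
            hits = S_alpha[fr:lr].count(c)     # next path symbol inside the range
            if hits == 0:
                return 0
            rev = c + rev                      # extend the matched path downward
        return hits

    print(subpath_count(["C", "B"]))           # -> 1  (one B-child of a C node)
    print(subpath_count(["C", "B", "a"]))      # -> 1  (the path C -> B -> a)
    print(subpath_count(["B", "D"]))           # -> 0  (no D below a B in the toy tree)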

  39. XML Compressed Indexing
  What about XPress and XGrind ? XPress ≈ 30% (dblp 50%), XGrind ≈ 50%; no software running.
  Paolo Ferragina, Università di Pisa

  40. In summary [Ferragina et al, Focs ’05]
  • The XBW-transform takes optimal space, 2t + t log |Σ| bits, and can be computed in optimal linear time.
  • We can compress and index the XBW-transform so that:
  • its space occupancy is the optimal t H0(T) + 2t + o(t) bits
  • navigational operations take O(log |Σ|) time
  • subpath searches take O(p log |Σ|) time
  • If |Σ| = polylog(t), no log |Σ|-factor (loglog |Σ| for general Σ [Munro et al, Soda 06]). New bread for Rank/Select people !!
  • It is possible to extend these ideas to other XPath queries, like //path[text()=“substring”], //path1//path2, ...
  Paolo Ferragina, Università di Pisa

  41. The overall picture on Compressed Indexing...
  This is a powerful paradigm to design compressed indexes for both strings and labeled trees: first transform the input, then use simple rank/select primitives over compressed strings.
  [Diagram: data types against indexing and compressed indexing; the strong connection between strings and labeled trees goes back to [Kosaraju, Focs ‘89]]
  Paolo Ferragina, Università di Pisa

  42. Mutual reinforcement relationship...
  We investigated the reinforcement relation: compression ideas → index design.
  Let’s now turn to the other direction: indexing ideas → compressor design (the Booster).
  Paolo Ferragina, Università di Pisa

  43. Compression Boosting for strings [Ferragina et al., J.ACM 2005]
  Technically, the booster takes a compressor A achieving, on any string s, the bound |c| ≤ λ |s| H0(s) + μ, and turns it into a compressor achieving the new performance bound |c’| ≤ λ |s| Hk(s) + log2 |s| + μ’ k.
  Qualitatively, the booster offers various properties:
  • The more compressible s is, the shorter c’ is wrt c
  • It deploys compressor A as a black-box, hence no change to A’s structure is needed
  • There is no loss in time efficiency, actually it is optimal
  • Its performance holds for any string s, and it turns out better than Gzip and Bzip
  • It is fully combinatorial, hence it does not require any parameter estimations
  Researchers may now concentrate on the “apparently” simpler task of designing 0-th order compressors [see e.g. Landau-Verbin, 05].
  Paolo Ferragina, Università di Pisa

  44. An interesting compression paradigm: PPC (Permutation, Partition, Compression)
  Given a string S: compute a permutation P(S), partition P(S) into substrings, compress each substring and concatenate the results.
  • Problem 1. Fix a permutation P. Find a partitioning strategy and a compressor that minimize the number of compressed bits. (If P = Id, this is classic data compression !)
  • Problem 2. Fix a compressor C. Find a permutation P and a partitioning strategy that minimize the number of compressed bits. (Taking P = Id, PPC cannot be worse than compressor C alone; our booster showed that a “good” P can make PPC far better.)
  • Other contexts: Tables [AT&T people], Graphs [Boldi-Vigna, WWW 04]
  A theory is missing, here!
  Paolo Ferragina, Università di Pisa

  45. Compression of labeled trees [Ferragina et al., Focs ‘05]
  Extend the definition of Hk to labeled trees by taking as the k-context of a node its leading path of length k (this is related to Markov random fields over trees).
  A new paradigm for compressing the tree T: compute XBW(T), then apply the Booster on top of any string compressor.
  The compression performance with Arithmetic coding is t Hk(T) + 2.01 t + o(t) bits.
  This is a powerful paradigm for compressing both strings and labeled trees: first transform the input, then use the Booster over any known string compressor.
  Paolo Ferragina, Università di Pisa

  46. Thanks !! Paolo Ferragina, Università di Pisa

  47. Where we are ...
  We investigated the reinforcement relation: compression ideas → index design.
  Let’s now turn to the other direction: indexing ideas → compressor design (the Booster).
  Paolo Ferragina, Università di Pisa

  48. What do we mean by “boosting” ?
  It is a technique that takes a poor compressor A and turns it into a compressor with better performance guarantees.
  A memoryless compressor is poor in that it assigns codewords to symbols according only to their frequencies (e.g. Huffman). It incurs some obvious limitations: T = a^n b^n (highly compressible) and T’ = a random string of n ‘a’s and n ‘b’s (incompressible) have exactly the same symbol frequencies, so such a compressor treats them identically (see the small check below).
  Paolo Ferragina, Università di Pisa
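A tiny Python check of that limitation (my example, not the talk's): a 0-th order model sees the same frequencies, hence the same H0, for both strings.

    import random
    from collections import Counter
    from math import log2

    def H0(s):
        # empirical 0-th order entropy, in bits per symbol
        freq = Counter(s)
        return -sum((n / len(s)) * log2(n / len(s)) for n in freq.values())

    n = 1000
    T  = "a" * n + "b" * n                                  # highly compressible
    T2 = "".join(random.sample("a" * n + "b" * n, 2 * n))   # incompressible shuffle
    print(H0(T), H0(T2))   # both exactly 1.0: frequencies alone cannot tell them apart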

  49. The empirical entropy Hk
  Hk(T) = (1/|T|) ∑_{|w|=k} |T[w]| H0(T[w]), where T[w] = the string of symbols that precede the occurrences of the context w in T.
  Example: given T = “mississippi”, we have T[i] = mssp and T[is] = ms.
  Compressing T up to Hk(T) ⟺ compressing every T[w] up to its H0 (use Huffman or Arithmetic), for every k-long context w.
  Problems with this approach:
  • How do we go from all the T[w] back to the string T ? (BWT, Suffix Tree)
  • How do we choose the best k efficiently ?
  Paolo Ferragina, Università di Pisa
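A short Python sketch (mine) that computes H0 and the Hk of the definition above, reproducing the T[i] and T[is] example; probabilities are empirical frequencies.

    from collections import Counter
    from math import log2

    def H0(s):
        # empirical 0-th order entropy: -sum p_c log2 p_c over the symbols of s
        if not s:
            return 0.0
        freq = Counter(s)
        return -sum((n / len(s)) * log2(n / len(s)) for n in freq.values())

    def context_strings(t, k):
        # T[w] = symbols that precede each occurrence of the k-long context w in T
        ctx = {}
        for i in range(1, len(t) - k + 1):
            ctx.setdefault(t[i:i + k], []).append(t[i - 1])
        return {w: "".join(chars) for w, chars in ctx.items()}

    def Hk(t, k):
        # Hk(T) = (1/|T|) * sum over k-contexts w of |T[w]| * H0(T[w])
        return sum(len(tw) * H0(tw) for tw in context_strings(t, k).values()) / len(t)

    T = "mississippi"
    print(context_strings(T, 1)["i"], context_strings(T, 2)["is"])   # mssp ms
    print(round(H0(T), 2), round(Hk(T, 1), 2), round(Hk(T, 2), 2))   # 1.82 1.09 0.18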

  50. Use the BWT to approximate Hk
  Remember that Hk(T) = (1/|T|) ∑_{|w|=k} |T[w]| H0(T[w]), and that compressing T up to Hk(T) ⟺ compressing every T[w] up to its H0.
  Key fact: bwt(T) is the concatenation of permutations of the strings T[w], because the rows of the sorted-rotations matrix that start with a given context w group together exactly the symbols preceding w in T. Hence compressing the pieces of bwt(T) delimited by the k-long contexts up to their H0 compresses T down to about Hk(T) bits per symbol, and this holds for H1(T), H2(T), ... simultaneously.
  We finally have a workable way to approximate Hk.
  Paolo Ferragina, Università di Pisa
