Compressed indices for text based on Ziv-Lempel compression

Compressed indices for text based on Ziv-Lempel compression LZ-Index Variations

Introduction Search methods • Sequential search. • Search over an indexed text, which is worthwhile: • When the text is very large. • When the text changes very little. • When there is enough space to store the index while keeping it efficient.

Introduction Indices • Classical • Suffix trees • Suffix arrays • Succinct • Ziv-Lempel (n log n) • Self-indexes (n log σ)

Introduction Problem • T - the text, a sequence of n characters • Σ - the alphabet, of size σ • P - the search pattern, a sequence of m characters of the alphabet Σ. • Find all the R (or occ) occurrences of P in T. • Locate the occurrences.

Ziv-Lempel compression • Original • Splits the text into phrases; when a repeated phrase appears, it is replaced by a pointer. • LZ78 • Starts the partitioning with an empty block. • At each position of the text not yet covered, it finds the longest prefix of the remaining text that is equal to an earlier block and appends the next character to it, forming a new block (phrase).
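The LZ78 partitioning described on this slide can be sketched as follows. This is a minimal Python illustration, not code from the presentation: the name `lz78_parse` is made up for the example, and the text is assumed to end with a unique terminator such as "$" so that the last block is always new.

```python
def lz78_parse(text):
    """Split text into LZ78 blocks; block 0 is the initial empty block."""
    known = {""}            # every block produced so far
    blocks = [""]
    current = ""            # longest known prefix matched so far
    for ch in text:
        if current + ch in known:
            current += ch                   # keep extending the match
        else:
            blocks.append(current + ch)     # known prefix plus one new character
            known.add(current + ch)
            current = ""
    if current:                             # text ended inside a known block
        blocks.append(current)
    return blocks


blocks = lz78_parse("alabar_a_la_alabarda$")
print(blocks[1:])
# ['a', 'l', 'ab', 'ar', '_', 'a_', 'la', '_a', 'lab', 'ard', 'a$']
```

On the example text "alabar_a_la_alabarda$" this gives the same eleven phrases that the FM-LZI slides separate with "#".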


Ziv-Lempel compression • Rank • The position of a phrase when all the phrases are sorted in lexicographical order.

Ziv-Lempel compression Data Structures • LZTrie – A trie formed by all the phrase blocks • Supports the operations: • idt(x) • leftrankt(x) and rightrankt(x) • parentt(x) • childt(x, c) • rtht(rank)
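A pointer-based sketch of the LZTrie operations listed above, in Python. It only illustrates what each operation returns and is not the compact representation the indexes use. The class and method names are assumptions made for this example, as are the conventions that a node is identified by its block number (so `id` is the identity), that ranks are 0-based with the empty block at rank 0, that children are ordered by plain character code, and that the text ends with a unique terminator.

```python
class LZTrie:
    """Toy LZ78 trie; node k represents block k, node 0 the empty block."""

    def __init__(self, text):
        self.kids = [{}]       # kids[x] maps a character to a child node
        self.par = [None]      # par[x] is the parent of node x
        node = 0
        for ch in text:
            nxt = self.kids[node].get(ch)
            if nxt is None:                    # new block: add a leaf, restart at root
                self.kids[node][ch] = len(self.kids)
                self.kids.append({})
                self.par.append(node)
                node = 0
            else:
                node = nxt
        self.left = [0] * len(self.kids)
        self.right = [0] * len(self.kids)
        self.order = []                        # order[r] is the node with rank r
        self._rank(0)

    def _rank(self, x):
        # Preorder with sorted children visits the blocks in lexicographical order.
        self.left[x] = len(self.order)
        self.order.append(x)
        for c in sorted(self.kids[x]):
            self._rank(self.kids[x][c])
        self.right[x] = len(self.order) - 1

    def id(self, x):
        return x                               # block number of node x

    def leftrank(self, x):
        return self.left[x]                    # smallest rank in the subtree of x

    def rightrank(self, x):
        return self.right[x]                   # largest rank in the subtree of x

    def parent(self, x):
        return self.par[x]

    def child(self, x, c):
        return self.kids[x].get(c)             # None if x has no child by c

    def rth(self, rank):
        return self.order[rank]


trie = LZTrie("alabar_a_la_alabarda$")
a = trie.child(0, "a")
print(a, trie.leftrank(a), trie.rightrank(a))   # 1 3 8
```

The interval leftrank(x)…rightrank(x) covers exactly the blocks that start with the string spelled by x, which is what the search steps rely on.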

Ziv-Lempel compression Data Structures • RevTrie – A trie formed by all the phrase blocks, but reversed. • Supports the same operations as the LZTrie: • idr(x) • leftrankr(x) and rightrankr(x) • parentr(x) • childr(x, c) • rthr(rank)

Ziv-Lempel compression Data Structures • Node – A mapping from block identifiers to their nodes in the LZTrie. • Range – A two-dimensional range search data structure.

Ziv-Lempel compression Search Algorithm • There are 3 types of occurrence: • The occurrence lies entirely inside one block. • The occurrence spans two blocks. • The occurrence spans 3 or more blocks. (Figure: P inside 1 block; P spanning 2 blocks; P spanning 4 blocks.)

Ziv-Lempel compression Occurrences inside 1 block • Search for the reversed P in the RevTrie, reaching a node x. • Compute leftrankr(x) and rightrankr(x), obtaining the lexicographical interval of the blocks that end with P. • For every rank r in leftrankr(x)…rightrankr(x), get the corresponding node in the LZTrie, y=Node(rthr(r)).
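The steps above can be sketched in Python with plain dictionaries standing in for the RevTrie and the LZTrie. The function names are made up for this example, positions are 0-based, and the text is assumed to end with a unique terminator. The final step, walking the LZTrie subtree of each node y and reporting every block in it, is an assumption about how the procedure continues; it follows the pioneer-node idea the FM-LZI slides describe for secondary occurrences.

```python
def lz78_trie(text):
    """LZ78 trie as arrays indexed by block number, plus each block's start in T."""
    parent, char, kids, start = [None], [""], [{}], [0]
    node, begin = 0, 0
    for i, ch in enumerate(text):
        nxt = kids[node].get(ch)
        if nxt is None:
            kids[node][ch] = len(parent)
            parent.append(node)
            char.append(ch)
            kids.append({})
            start.append(begin)
            node, begin = 0, i + 1
        else:
            node = nxt
    return parent, char, kids, start


def occurrences_in_one_block(text, pattern):
    parent, char, kids, start = lz78_trie(text)

    # RevTrie: a trie over the reversed blocks.
    rev_root = {"kids": {}, "block": None}
    for k in range(1, len(parent)):
        node, j = rev_root, k
        while j:                      # walking up the LZTrie spells block k backwards
            node = node["kids"].setdefault(char[j], {"kids": {}, "block": None})
            j = parent[j]
        node["block"] = k

    # Step 1: search for the reversed pattern, reaching x.
    x = rev_root
    for ch in reversed(pattern):
        x = x["kids"].get(ch)
        if x is None:
            return []

    # Step 2: the subtree of x holds the blocks that end with the pattern.
    ending, stack = [], [x]
    while stack:
        n = stack.pop()
        if n["block"] is not None:
            ending.append(n["block"])
        stack.extend(n["kids"].values())

    # Step 3: map each such block to its LZTrie node y and report its subtree.
    hits = []
    for y in ending:
        depth, j = 0, y
        while j:
            depth += 1
            j = parent[j]
        stack = [y]
        while stack:
            k = stack.pop()           # block k has block y as a prefix
            hits.append(start[k] + depth - len(pattern))
            stack.extend(kids[k].values())
    return sorted(hits)


print(occurrences_in_one_block("alabar_a_la_alabarda$", "la"))   # [9, 13]
```

The occurrence of "la" at position 1 is not reported here because it spans two blocks ("l" and "ab").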

Ziv-Lempel compression Occurrences inside 2 blocks • For each split point i (1 ≤ i < m), divide P into pref=P1…i and suff=Pi+1…m and carry out the following steps. • Search for the reversed pref in the RevTrie, obtaining x. Search for suff in the LZTrie, obtaining y. • Search for the range [leftrankr(x)…rightrankr(x)] × [leftrankt(y)…rightrankt(y)] using the Range data structure. • For each pair (k,k+1) found, report k, knowing that Pi is aligned with the end of Bk.
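A small Python sketch of this two-block search. It is an illustration under several assumptions: ranks are obtained by simply sorting the blocks and the reversed blocks (which gives the same order as the tries with children sorted by character code), Range is a plain list of points scanned linearly rather than a real two-dimensional structure, positions are 0-based, and the text ends with a unique terminator. All function names are made up for the example.

```python
from bisect import bisect_left


def lz78_blocks(text):
    """LZ78 blocks of text, with the empty block at index 0."""
    known, blocks, current = {""}, [""], ""
    for ch in text:
        current += ch
        if current not in known:      # first time this string is seen: close the block
            known.add(current)
            blocks.append(current)
            current = ""
    return blocks


def prefix_interval(sorted_strings, s):
    """Inclusive rank interval of the strings that start with s (empty if lo > hi)."""
    lo = bisect_left(sorted_strings, s)
    hi = lo
    while hi < len(sorted_strings) and sorted_strings[hi].startswith(s):
        hi += 1
    return lo, hi - 1


def occurrences_in_two_blocks(text, pattern):
    blocks = lz78_blocks(text)
    lz_sorted = sorted(blocks)                       # rank order in the LZTrie
    rev_sorted = sorted(b[::-1] for b in blocks)     # rank order in the RevTrie
    lz_rank = {b: r for r, b in enumerate(lz_sorted)}
    rev_rank = {b: r for r, b in enumerate(rev_sorted)}

    start, pos = [], 0                               # start[k] = position of block k in T
    for b in blocks:
        start.append(pos)
        pos += len(b)

    # Range: one point per pair of consecutive blocks (k, k+1).
    points = [(rev_rank[blocks[k][::-1]], lz_rank[blocks[k + 1]], k)
              for k in range(1, len(blocks) - 1)]

    hits = []
    for i in range(1, len(pattern)):                 # pref = P[:i], suff = P[i:]
        r_lo, r_hi = prefix_interval(rev_sorted, pattern[:i][::-1])
        t_lo, t_hi = prefix_interval(lz_sorted, pattern[i:])
        for r, t, k in points:                       # naive rectangle query
            if r_lo <= r <= r_hi and t_lo <= t <= t_hi:
                hits.append(start[k + 1] - i)        # P starts i characters before block k+1
    return sorted(hits)


print(occurrences_in_two_blocks("alabar_a_la_alabarda$", "bar"))   # [3, 15]
```

Both reported occurrences of "bar" end block k with "b" and begin block k+1 with "ar": blocks ab|ar and lab|ard.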

Ziv-Lempel compression Occurrences inside 3 or more blocks • Identify all the blocks that match a substring Pi…j, keeping the block numbers in m arrays Ai. • Try to find concatenations of successive blocks Bk, Bk+1, … • For each maximal concatenation of blocks Pi…j=Bk…Bℓ of the search pattern, check whether Bk-1 ends with P1…i-1 and Bℓ+1 starts with Pj+1…m. If so, an occurrence is reported.

Indices based on Ziv-Lempel • The LZ-Index of Karkkainen and Ukkonen (KU-LZI) • Uses an index tree that indexes only the beginnings of the phrases of a Ziv-Lempel-type partition of T. • Uses an array S[p] of phrase numbers, sorted by the position of their origin in the text T. • Uses a bit array B[1,n] that indicates at which positions of the text the origins of the phrases begin.

Indices based on Ziv-Lempel • The LZ-Index of Karkkainen and Ukkonen (KU-LZI) • Primary occurrences - span 2 or more blocks. • Secondary occurrences - are contained in a single block. • The secondary occurrences are repetitions of other primary or secondary occurrences.

Indices based on Ziv-Lempel • The LZ-Index of Karkkainen and Ukkonen (KU-LZI) • Primary occurrences • P1,i is a suffix of a phrase. • Pi+1,m starts at the beginning of the next phrase. • The occurrences of the part Pi+1,m are looked up in a suffix tree that indexes the beginnings of the phrases. • The occurrences of P1,i are searched for in a RevTrie, starting from the last letter: Pi, Pi-1, …, P1.

Indices based on Ziv-Lempel • The LZ-Index of Karkkainen and Ukkonen (KU-LZI) • Secondary occurrences • Are found from each one of the phrases B(p). • That is, given a primary occurrence Tj,j+m-1, it finds all the phrases p whose origin contains [j,j+m-1].

Indices based on Ziv-Lempel • The LZ-Index of Ferragina and Manzini (FM-LZI) • It is based on the partitioning used by LZ78. • Define T# as the text T with the special character “#” inserted after each phrase. • Then |T#|=n+n’, where n’ is the number of phrases. • Example: • T=“alabar_a_la_alabarda$” becomes • T#=“a#l#ab#ar#_#a_#la#_a#lab#ard#a$#”.

Indices based on Ziv-Lempel • The LZ-Index of Ferragina and Manzini (FM-LZI) • The reverse text TR is T# written backwards. • TR=“#$a#dra#bal#a_#al#_a#_#ra#ba#l#a”. • A position t of T that belongs to phrase number p corresponds to one position in TR.
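The construction of T# and TR can be checked with a few lines of Python. The function names are made up for this example and the text is assumed to end with a unique terminator. The mapping `rpos` is a derivation added here under the assumption of 1-based positions (t moves to t+p-1 in T#, and reversing a string of length n+n’ mirrors that position); it is not necessarily the expression shown on the original slide.

```python
def lz78_blocks(text):
    """LZ78 phrases of text, in order (the initial empty block is omitted)."""
    known, blocks, current = set(), [], ""
    for ch in text:
        current += ch
        if current not in known:      # first time this string is seen: close the phrase
            known.add(current)
            blocks.append(current)
            current = ""
    return blocks


def t_hash(text):
    """T#: the text with '#' inserted after each phrase."""
    return "".join(b + "#" for b in lz78_blocks(text))


def rpos(text, t):
    """1-based position in TR of the character at 1-based position t of T."""
    blocks = lz78_blocks(text)
    p, covered = 0, 0
    while covered < t:                # find the phrase number p that contains t
        covered += len(blocks[p])
        p += 1
    n, n_prime = len(text), len(blocks)
    return (n + n_prime) - (t + p - 1) + 1


T = "alabar_a_la_alabarda$"
TH = t_hash(T)
TR = TH[::-1]
print(TH)   # a#l#ab#ar#_#a_#la#_a#lab#ard#a$#
print(TR)   # #$a#dra#bal#a_#al#_a#_#ra#ba#l#a
```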

Indices based on Ziv-Lempel • The LZ-Index of Ferragina and Manzini (FM-LZI) • The FM-LZI has 4 components: • The FMI of the text T (a self-index based on the relationship between the Burrows-Wheeler compression algorithm and the suffix array data structure); • The FMI of the text TR; • The Ziv-Lempel LZTrie of T; • Range, which is similar to the one in the KU-LZI.

Indices based on Ziv-Lempel • The LZ-Index of Ferragina and Manzini (FM-LZI) • Occurrences of P are divided into primary and secondary, as in the KU-LZI. • Secondary occurrences • A node p of the trie is a pioneer for P if P is a suffix of Z[p]. The pioneer nodes for P=“a” are 1, 7 and 8. • All secondary occurrences correspond to LZTrie nodes that descend from pioneer nodes. • Find the pioneer nodes and traverse all of their subtrees, reporting every text position found.


Indices based on Ziv-Lempel • The LZ-Index of Ferragina and Manzini (FM-LZI) • Primary occurrences • The same idea as in the KU-LZI is used: find P1,i at the end of a phrase and Pi+1,m at the start of the next phrase. • The search is carried out differently, and the FMI is used here in a way that is much better suited to this problem. • First search for P using the FMI of T, which gives all the ranges in A (from the FMI) corresponding to the occurrences of Pi+1,m. • Then search for PR in the FMI of TR, which gives the other ranges.

Indices based on Ziv-Lempel • The LZ-Index of Navarro (Nav-LZI) • Relies essentially on the LZTrie and the RevTrie of the LZ78 partition. • It is the only self-index that does not use the suffix array concept. • The Nav-LZI uses 4 main structures: • The Ziv-Lempel LZTrie, to represent T. • The RevTrie, to keep the reversed strings Z(p). • The two-dimensional data structure Range, as seen before. • Node, which keeps the LZTrie node numbers of the phrases.

Indices based on Ziv-Lempel • The LZ-Index of Navarro (Nav-LZI) • Uses balanced parentheses in the LZTrie and RevTrie instead of pointers. • Example: “( ( ( ) ( ) ( ( ) ) ( ) ) ( ( ( ) ) ) ( ( ) ) )”, or equivalently as bits. • Advantages: • Less space • Constant time operations

Indices based on Ziv-Lempel • The LZ-Index of Navarro (Nav-LZI) • Type 1 occurrences • Use the same line of ideas as the secondary occurrences in the FM-LZI. • The reversed P is searched for in the RevTrie, arriving at a node v. • Any node v’ that descends from v, including v itself (v’=v), corresponds to a phrase that ends with P.

Indices based on Ziv-Lempel • The LZ-Index of Navarro (Nav-LZI) • Occurrences of type 2 • They are found through the Range data structure, as in the FM-LZI, except that this time the reversed P1,i is searched for in the RevTrie and Pi+1,m is searched for in the LZTrie, giving two nodes, vrev and vlz, respectively. • The phrases of interest are the p that descend from vrev whose successors p+1 descend from vlz.

Indices based on Ziv-Lempel • The LZ-Index of Navarro (Nav-LZI) • Occurrences of type 2 • Range stores, for each phrase, the point (pos(v’), pos(v)) such that v’ (which belongs to the RevTrie) has phrase number p and v (which belongs to the LZTrie) has phrase number p+1. Therefore, a range search over the areas that descend from vrev and vlz returns all the occurrences that we want.

Indices based on Ziv-Lempel • The LZ-Index of Navarro (Nav-LZI) • Occurrences of type 3 • To display a substring of the text: • First locate its phrase p. • Then walk up the LZTrie from Node(p) to obtain all of its characters in reverse order. • If more phrases are needed, repeat the process with p–1 or with p+1, and so on.
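The display procedure above can be sketched in Python with parent pointers in place of the compact trie. The names `lz78_trie`, `block_text` and `show` are made up for this example; Node(p) is taken to be simply p, because nodes are numbered by phrase here, and the text is assumed to end with a unique terminator.

```python
def lz78_trie(text):
    """Parent and edge-character arrays of the LZ78 trie; node k is phrase k."""
    parent, char, kids = [None], [""], [{}]
    node = 0
    for ch in text:
        nxt = kids[node].get(ch)
        if nxt is None:               # new phrase: add a leaf and restart at the root
            kids[node][ch] = len(parent)
            parent.append(node)
            char.append(ch)
            kids.append({})
            node = 0
        else:
            node = nxt
    return parent, char


def block_text(parent, char, p):
    """Walk up from the node of phrase p; characters come out in reverse order."""
    out = []
    while p != 0:
        out.append(char[p])
        p = parent[p]
    return "".join(reversed(out))


def show(parent, char, first, last):
    """Text covered by phrases first..last, obtained one phrase at a time."""
    return "".join(block_text(parent, char, p) for p in range(first, last + 1))


parent, char = lz78_trie("alabar_a_la_alabarda$")
print(block_text(parent, char, 9))   # lab
print(show(parent, char, 7, 9))      # la_alab
```

Showing all phrases from 1 to the last one reproduces the whole text, which is what makes the structure a self-index: the text itself does not need to be stored.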

Indices based on Ziv-Lempel Conclusion
