
Compressed indices for text based on Ziv-Lempel compression


naida




Presentation Transcript


  1. Compressed indices for text based on Ziv-Lempel compression: LZ-Index Variations

  2. Introduction: Search methods • Sequential search. • Search over indexed text, which pays off: • When the text is very large. • When the text changes very little. • When there is enough space to store the index and keep it efficient.

  3. Introduction: Indices • Classical • Suffix trees • Suffix arrays • Succinct • Ziv-Lempel (n log n) • Self-indexes (n log σ)

  4. Introduction: Problem • Text T: a sequence of n characters. • Σ: the alphabet, of size σ. • Search pattern P: a sequence of m characters over the alphabet Σ. • Find all R (or occ) occurrences of P in T. • Locate the occurrences, that is, report their positions in T.
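As a baseline for the problem statement above, here is a minimal sketch of sequential search. The function name `locate` and the 1-based reporting convention are assumptions made for illustration; the sample text is the one used later in the slides.

```python
def locate(text: str, pattern: str) -> list[int]:
    """Return the 1-based start position of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    # Try every alignment of the pattern against the text: O(n * m) in the worst case.
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


positions = locate("alabar_a_la_alabarda$", "la")
occ = len(positions)  # number of occurrences
```

An index trades extra space for avoiding this full scan on every query.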

  5. Ziv-Lempel compression • Original • Splits the text into phrases; whenever a repeated phrase appears, it is replaced by a pointer. • LZ78 • Starts the partitioning with an empty block. • At each position of the text, it finds the longest prefix of the remaining text that matches an existing block and appends one more character to it, forming a new block (phrase).
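The LZ78 rule above can be sketched as follows. This is an illustrative sketch, not the presenter's code: the name `lz78_parse` and the dictionary-based trie are assumptions, and the handling of a text that ends in the middle of a repeated phrase is one possible choice.

```python
def lz78_parse(text: str) -> list[str]:
    """Split text into LZ78 phrases; phrase k is element k-1 of the result."""
    trie = {}        # (parent phrase number, character) -> phrase number
    phrases = [""]   # phrase 0 is the initial empty block
    node = 0         # phrase matched so far for the block being built
    for c in text:
        nxt = trie.get((node, c))
        if nxt is not None:
            node = nxt                          # keep extending the longest match
        else:
            phrases.append(phrases[node] + c)   # longest match plus one character
            trie[(node, c)] = len(phrases) - 1
            node = 0
    if node != 0:
        phrases.append(phrases[node])           # text ended inside a repeated phrase
    return phrases[1:]


phrases = lz78_parse("alabar_a_la_alabarda$")
```

Ending the text with a unique terminator such as `$` guarantees that the last phrase is new, so all phrases are distinct.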

  6. Ziv-Lempel compression

  7. Ziv-Lempel compression • Rank • The position of a phrase when all phrases are sorted in lexicographical order.

  8. Ziv-Lempel compression: Data Structures • LZTrie: the trie formed by all the phrase blocks. • Supports the operations: • idt(x) • leftrankt(x) and rightrankt(x) • parentt(x) • childt(x, c) • rtht(rank)
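To make the operation list concrete, here is a pointer-based sketch of an LZTrie. The semantics are assumptions inferred from the names and from the later use y = Node(rthr(r)): nodes are identified directly by their block number, ranks are preorder positions with children sorted by character, leftrank/rightrank bound the ranks in a node's subtree, and rth returns the block at a given rank. The class and attribute names are hypothetical.

```python
class LZTrie:
    def __init__(self, phrases):
        # Assumes distinct LZ78 phrases, so every phrase minus its last
        # character is an earlier phrase (or the empty block 0).
        self.children = {0: {}}
        self.parent_of = {0: None}
        index = {"": 0}
        for k, z in enumerate(phrases, start=1):
            p = index[z[:-1]]
            self.children[p][z[-1]] = k
            self.children[k] = {}
            self.parent_of[k] = p
            index[z] = k
        self.rank_of, self.last_rank, self.node_at = {}, {}, []
        self._number(0)

    def _number(self, x):
        # Preorder numbering with sorted children equals lexicographical order.
        self.rank_of[x] = len(self.node_at)
        self.node_at.append(x)
        for c in sorted(self.children[x]):
            self._number(self.children[x][c])
        self.last_rank[x] = len(self.node_at) - 1

    def id(self, x):
        return x                          # node and block number coincide here

    def leftrank(self, x):
        return self.rank_of[x]

    def rightrank(self, x):
        return self.last_rank[x]

    def parent(self, x):
        return self.parent_of[x]

    def child(self, x, c):
        return self.children[x].get(c)    # None if there is no such child

    def rth(self, rank):
        return self.node_at[rank]


trie = LZTrie(["a", "l", "ab", "ar", "_", "a_", "la", "_a", "lab", "ard", "a$"])
x = trie.child(0, "a")                    # node of phrase "a"
interval = (trie.leftrank(x), trie.rightrank(x))
```

The interval covers exactly the phrases that start with "a", which is what makes subtree queries a single rank range.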

  9. Ziv-Lempel compression

  10. Ziv-Lempel compression: Data Structures • RevTrie: the trie formed by all the phrase blocks, but reversed. • Supports the same operations as LZTrie: • idr(x) • leftrankr(x) and rightrankr(x) • parentr(x) • childr(x, c) • rthr(rank)

  11. Ziv-Lempel compression

  12. Ziv-Lempel compression: Data Structures • Node: a mapping from block identifiers to their nodes in the LZTrie. • Range: a two-dimensional range search data structure.

  13. Ziv-Lempel compression

  14. Ziv-Lempel compression: Search Algorithm • There are 3 types of occurrence: • The occurrence lies entirely inside one block. • The occurrence spans two blocks. • The occurrence spans 3 or more blocks. (Figure labels: P inside 1 block; P spans 2 blocks; P spans 4 blocks.)

  15. Ziv-Lempel compression: Occurrences inside 1 block • Search for P reversed in RevTrie, reaching a node x. • Evaluate leftrankr(x) and rightrankr(x) to get the lexicographical interval of the blocks that end with P. • For every rank r in leftrankr(x)…rightrankr(x), get the corresponding node in LZTrie, y = Node(rthr(r)).

  16. Ziv-Lempel compression

  17. Ziv-Lempel compression

  18. Ziv-Lempel compression

  19. Ziv-Lempel compression: Occurrences spanning 2 blocks • For each i, 1 ≤ i < m, split P into pref = P1…i and suff = Pi+1…m and do the following steps. • Search for pref reversed in RevTrie, obtaining x. Search for suff in LZTrie, obtaining y. • Search for the range [leftrankr(x)…rightrankr(x)] × [leftrankt(y)…rightrankt(y)] using the Range data structure. • For each pair (k, k+1) found, report k, knowing that Pi is aligned with the end of Bk.
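The two-block case can be illustrated with a brute-force sketch that replaces RevTrie, LZTrie and Range by a linear scan over the phrase list; it finds the same pairs (k, k+1), only more slowly. The function name and the returned (k, position) pairs are assumptions for illustration.

```python
def two_block_occurrences(phrases, pattern):
    """Return (k, start) for every occurrence of pattern spanning blocks k and k+1.

    Block numbers and text positions are 1-based.
    """
    # starts[k] is the text position where block k+1 begins.
    starts, pos = [], 1
    for z in phrases:
        starts.append(pos)
        pos += len(z)

    found = []
    for i in range(1, len(pattern)):            # split point: pref = P[1..i]
        pref, suff = pattern[:i], pattern[i:]
        for k in range(1, len(phrases)):        # blocks k and k+1 must both exist
            # Block k must end with pref and block k+1 must start with suff.
            if phrases[k - 1].endswith(pref) and phrases[k].startswith(suff):
                found.append((k, starts[k] - i))
    return sorted(found)


phrases = ["a", "l", "ab", "ar", "_", "a_", "la", "_a", "lab", "ard", "a$"]
hits = two_block_occurrences(phrases, "la")
```

In the index, the two inner string tests become one rectangle query: the rank interval of x in RevTrie on one axis and the rank interval of y in LZTrie on the other.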

  20. Ziv-Lempel compression

  21. Ziv-Lempel compression

  22. Ziv-Lempel compression

  23. Ziv-Lempel compression: Occurrences spanning 3 or more blocks • Identify all the blocks that match a substring Pi…j exactly, keeping the block numbers in m arrays Ai. • Try to find concatenations of successive blocks Bk, Bk+1,… • For each maximal concatenation of blocks Pi…j = Zk…Zℓ inside the pattern, determine whether Zk-1 ends with P1…i-1 and Zℓ+1 starts with Pj+1…m. If so, an occurrence is reported.

  24. Ziv-Lempel compression

  25. Ziv-Lempel compression

  26. Ziv-Lempel compression

  27. Ziv-Lempel compression

  28. Indices based on Ziv-Lempel • The LZ-Index of Karkkainen and Ukkonen (KU-LZI) • Uses an index tree that indexes only the beginnings of the phrases of a Ziv-Lempel-type partition of T. • Uses an array S[p] of phrase numbers ordered by the position of their origin in the text T. • Uses a bit array B[1,n] that indicates at which positions of the text the phrase origins begin.

  29. Indices based on Ziv-Lempel • The LZ-Index of Karkkainen and Ukkonen (KU-LZI) • Primary occurrences: those that span 2 or more blocks. • Secondary occurrences: those contained in a single block. • Secondary occurrences are repetitions of other primary or secondary occurrences.

  30. Indices based on Ziv-Lempel • The LZ-Index of Karkkainen and Ukkonen (KU-LZI) • Primary occurrences • P1,i is a suffix of a phrase. • Pi+1,m starts at the next phrase. • The occurrences of the part Pi+1,m are looked up in a suffix tree that indexes the beginnings of the phrases. • The occurrences of P1,i are searched for in a RevTrie, starting from the last letter: Pi, Pi-1…P1.

  31. Indices based on Ziv-Lempel • The LZ-Index of Karkkainen and Ukkonen (KU-LZI) • Secondary occurrences • Obtained from each of the phrases B(p). • That is, given a primary occurrence Tj,j+m-1, find all phrases p whose origin contains [j, j+m-1].

  32. Indices based on Ziv-Lempel • The LZ-Index of Ferragina and Manzini (FM-LZI) • It is based on the partitioning used by LZ78. • Define T# as the text T with the special character “#” inserted after each phrase. • Then |T#| = n + n’, where n’ is the number of phrases. • Example: • T=“alabar_a_la_alabarda$” becomes • T#=“a#l#ab#ar#_#a_#la#_a#lab#ard#a$#”.

  33. Indices based on Ziv-Lempel • The LZ-Index of Ferragina and Manzini (FM-LZI) • The reverse text TR is T# written backwards. • TR=“#$a#dra#bal#a_#al#_a#_#ra#ba#l#a”. • The position t in T, belonging to phrase number p, corresponds to a position in TR that can be computed from t and p.
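The construction of T# and its reverse can be checked with a short sketch. The function names are hypothetical, and the position formulas are derived here from the definition (phrase p is preceded by p-1 markers), not taken from the slides.

```python
def marked_text(phrases, marker="#"):
    """Build T#: every phrase followed by the marker character."""
    return "".join(z + marker for z in phrases)


def position_in_marked(t, p):
    """1-based position in T# of text position t, which lies in phrase p."""
    return t + (p - 1)              # p-1 markers precede phrase p


def position_in_reverse(t, p, total_length):
    """1-based position in the reversed T# of text position t in phrase p."""
    return total_length - position_in_marked(t, p) + 1


phrases = ["a", "l", "ab", "ar", "_", "a_", "la", "_a", "lab", "ard", "a$"]
t_hash = marked_text(phrases)
t_rev = t_hash[::-1]
# Text position 10 is the "l" that opens phrase 7 ("la").
q = position_in_reverse(10, 7, len(t_hash))
```

Here n = 21 and n’ = 11, so both strings have length 32.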

  34. Indices based on Ziv-Lempel • The LZ-Index of Ferragina and Manzini (FM-LZI) • The FM-LZI has 4 components: • The FMI of the text T (a self-index based on the relationship between the Burrows-Wheeler compression algorithm and the suffix array data structure); • The FMI of the text TR; • The Ziv-Lempel LZTrie of T; • Range, which is similar to the one in KU-LZI.

  35. Indices based on Ziv-Lempel • The LZ-Index of Ferragina and Manzini (FM-LZI) • Occurrences of P are divided into primary and secondary, as in the KU-LZI. • Secondary occurrences • A node p of the trie is a pioneer for P if P is a suffix of Z[p]. The pioneer nodes for P=“a” are 1, 7 and 8. • All secondary occurrences correspond to LZTrie nodes that descend from pioneer nodes. • Find the pioneer nodes and traverse the subtrees of these nodes, reporting all the text positions found.
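The pioneer idea can be sketched on the running example. This is an assumption-laden illustration: it scans the phrase list instead of using the FMI and the LZTrie, and it treats "q descends from p" as "Z[p] is a prefix of Z[q]", which holds for an LZ78 trie.

```python
def pioneers(phrases, pattern):
    """Phrase numbers (1-based) whose phrase ends with the pattern."""
    return [p for p, z in enumerate(phrases, start=1) if z.endswith(pattern)]


def descendants(phrases, p):
    """Phrase numbers in the LZTrie subtree of phrase p, including p itself."""
    root = phrases[p - 1]
    return [q for q, z in enumerate(phrases, start=1) if z.startswith(root)]


def secondary_occurrences(phrases, pattern):
    """1-based text positions of the occurrences lying inside a single phrase."""
    starts, pos = {}, 1
    for q, z in enumerate(phrases, start=1):
        starts[q] = pos
        pos += len(z)
    found = []
    for p in pioneers(phrases, pattern):
        depth = len(phrases[p - 1])        # the pattern ends at this offset
        for q in descendants(phrases, p):
            found.append(starts[q] + depth - len(pattern))
    return sorted(found)


phrases = ["a", "l", "ab", "ar", "_", "a_", "la", "_a", "lab", "ard", "a$"]
pioneer_nodes = pioneers(phrases, "a")
positions = secondary_occurrences(phrases, "a")
```

Each occurrence is reported once, because the prefix of a phrase that ends exactly where the pattern ends is itself a unique phrase, namely the pioneer.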

  36. Indices based on Ziv-Lempel • The LZ-Index of Ferragina and Manzini (FM-LZI)

  37. Indices based on Ziv-Lempel • The LZ-Index of Ferragina and Manzini (FM-LZI) • Primary occurrences • The same idea as in the KU-LZI is used: find P1,i at the end of a phrase and Pi+1,m at the start of the next phrase. • The search is done differently, and the FMI is used here in a way that is much more advantageous for this problem. • First search for P using the FMI of T, which gives all the ranges in A (from the FMI) corresponding to the occurrences of Pi+1,m. • Likewise, search for PR in the FMI of TR to obtain its ranges.

  38. Indices based on Ziv-Lempel • The LZ-Index of Navarro (Nav-LZI) • Essentially uses the LZTrie and the RevTrie built on the LZ78 partition. • It is the only self-index that doesn’t use the suffix array concept. • The Nav-LZI uses 4 main structures: • The Ziv-Lempel LZTrie, to represent T. • The RevTrie, to keep the reversed strings Z(p). • The two-dimensional data structure Range, as seen before. • Node, which keeps the LZTrie node of each phrase number.

  39. Indices based on Ziv-Lempel • The LZ-Index of Navarro (Nav-LZI) • Uses parentheses in the LZTrie and RevTrie instead of pointers. • Example: “( ( ( ) ( ) ( ( ) ) ( ) ) ( ( ( ) ) ) ( ( ) ) )”, or the same sequence written as bits. • Advantages: • Less space • Constant-time operations
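A sketch of how a trie shape turns into a parentheses sequence: each node opens a parenthesis, emits its children, and closes. The child order below (by character code) is an assumption, so the output for the running example need not match the string on the slide, which appears to use a different order. The auxiliary structures that give constant-time navigation are not shown.

```python
def parentheses(phrases):
    """Encode the shape of the LZTrie as balanced parentheses (preorder)."""
    children = {"": []}
    for z in phrases:                 # LZ78: z minus its last char already exists
        children[z[:-1]].append(z)
        children[z] = []

    def encode(node):
        inner = "".join(encode(c) for c in sorted(children[node]))
        return "(" + inner + ")"

    return encode("")


def as_bits(parens):
    """Write '(' as 1 and ')' as 0: two bits per trie node."""
    return "".join("1" if ch == "(" else "0" for ch in parens)


phrases = ["a", "l", "ab", "ar", "_", "a_", "la", "_a", "lab", "ard", "a$"]
shape = parentheses(phrases)
bits = as_bits(shape)
```

With 11 phrases plus the root, the encoding takes 24 bits, against one pointer per node in a conventional trie.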

  40. Indices based on Ziv-Lempel • The LZ-Index of Navarro (Nav-LZI) • Type 1 occurrences • Uses the same line of ideas as the secondary occurrences in FM-LZI. • P is searched for reversed in RevTrie, arriving at a node v. • Any node v’ that descends from v, including v’ = v itself, corresponds to a phrase ending in P.

  41. Indices based on Ziv-Lempel • The LZ-Index of Navarro (Nav-LZI) • Occurrences of type 2 • These are found through the Range data structure, as in the FM-LZI, with the difference that this time P1,i is searched for reversed in RevTrie and Pi+1,m is searched for in LZTrie, yielding two nodes, vrev and vlz, respectively. • The phrases of interest are the p that descend from vrev whose successors p+1 descend from vlz.

  42. Indices based on Ziv-Lempel • The LZ-Index of Navarro (Nav-LZI) • Occurrences of type 2 • Range stores, for each phrase, the point (pos(v’), pos(v)), where v’ (in RevTrie) has phrase number p and v (in LZTrie) has phrase number p+1. Therefore, a range search over the areas of the arrays that descend from vrev and vlz returns all the occurrences we want.

  43. Indices based on Ziv-Lempel • The LZ-Index of Navarro (Nav-LZI) • Occurrences of type 3 • To display a substring of the text: • First locate its phrase p. • Then walk up the LZTrie from Node(p) to get all its characters in reverse order. • If more phrases are needed, repeat the process with p–1 or with p+1, and so on.
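The display procedure above can be sketched with parent pointers. The names are hypothetical, and in this sketch a node is identified by its phrase number, so Node(p) is simply p.

```python
def build_parents(phrases):
    """Map phrase number -> (parent phrase number, edge character)."""
    index = {"": 0}                   # the empty block is the root, number 0
    parent = {}
    for k, z in enumerate(phrases, start=1):
        parent[k] = (index[z[:-1]], z[-1])
        index[z] = k
    return parent


def extract_phrase(parent, p):
    """Recover phrase p by walking up to the root, then reversing."""
    chars = []
    while p != 0:
        p, c = parent[p]
        chars.append(c)               # characters arrive in reverse order
    return "".join(reversed(chars))


def extract(parent, first, last):
    """Text covered by phrases first..last (1-based, inclusive)."""
    return "".join(extract_phrase(parent, p) for p in range(first, last + 1))


phrases = ["a", "l", "ab", "ar", "_", "a_", "la", "_a", "lab", "ard", "a$"]
parent = build_parents(phrases)
snippet = extract(parent, 9, 11)
```

Since the index replaces the text, this walk is the only way to read text back, and its cost is proportional to the number of characters extracted.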

  44. Indices based on Ziv-Lempel Conclusion
