Efficient LZ78 factorization of grammar compressed text

Efficient LZ78 factorization of grammar compressed text. Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda (Kyushu University, Japan). Outline: Background; LZ78 Factorization; Straight Line Programs (SLP); Algorithms (LZ78 factorization using suffix trees, SLP to LZ78, Improvements).

Presentation Transcript
SPIRE 2012 @ Cartagena, Colombia

Efficient LZ78 factorization of grammar compressed text

Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda

Kyushu University, Japan

Outline
  • Background
  • LZ78 Factorization
  • Straight Line Programs (SLP)
  • Algorithms
    • LZ78 factorization using suffix trees
    • SLP to LZ78
    • Improvements
Background

Compressed String Processing (CSP)

  • Compress a string for storage, but don’t decompress all of it when using it!
  • Can be faster than processing the uncompressed text, by exploiting regularities identified by compression
    • Regard compression as generic preprocessing!

[Figure: a big string is compressed; pattern matching, edit distance, pattern mining, etc. are processed directly on the compressed representation of the string.]

This work: LZ78 factorization of grammar compressed strings

LZ78 Factorization [Ziv & Lempel ’78]

The LZ78 factorization of a string S is a factorization

S = f_1 f_2 ... f_m

where f_i is the longest prefix of f_i ... f_m such that

f_i = f_j c for some 0 ≤ j < i and some character c (let f_0 = ε)

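  • S = a l a b a r a l a l a b a r d a $

    f_1 = a     (0,a)
    f_2 = l     (0,l)
    f_3 = ab    (1,b)
    f_4 = ar    (1,r)
    f_5 = al    (1,l)
    f_6 = ala   (5,a)
    f_7 = b     (0,b)
    f_8 = ard   (4,d)
    f_9 = a$    (1,$)

[Figure: LZ78 trie of S, with root 0 and one node per factor, numbered 1 to 9.]

The LZ78 trie of S can be computed in O(N log σ) time and O(m) space.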
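The definition above can be sketched directly on an uncompressed string. This is a minimal illustration, not the authors' code: the function name `lz78_factorize` and the dictionary-based trie are assumptions made for the example, and a Python dict gives expected constant-time child lookup rather than the O(log σ) of a balanced search tree.

```python
def lz78_factorize(s):
    """Return the LZ78 factors of s as (index of earlier factor, new character) pairs."""
    # trie maps (parent factor index, character) -> factor index; 0 is the empty factor f_0.
    trie = {}
    factors = []
    node = 0
    for ch in s:
        child = trie.get((node, ch))
        if child is not None:
            # Still inside an existing factor: keep walking down the trie.
            node = child
        else:
            # Longest known prefix ends here: emit f_j c and add it to the trie.
            factors.append((node, ch))
            trie[(node, ch)] = len(factors)
            node = 0
    if node != 0:
        # The text ended inside a known factor (incomplete last factor).
        factors.append((node, ""))
    return factors


if __name__ == "__main__":
    for k, (j, c) in enumerate(lz78_factorize("alabaralalabarda$"), start=1):
        print(f"f_{k} = ({j},{c})")
```

Run on the example string, the pairs agree with the factor list above; the trie holds one entry per factor, which is where the O(m) space comes from.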

Straight Line Programs

Straight Line Program

  • A CFG in Chomsky normal form that derives a single string.
  • Can efficiently model the outputs of many compression algorithms: RE-PAIR, SEQUITUR, LZ78, etc.

X1 = a

X2 = b

X3 = X1 X2

X4 = X1 X3

X5 = X4 X3

X6 = X4 X5

X7 = X6 X5

SLP, n = 7

[Figure: derivation tree of X7; its leaves spell S = aabaababaabab (N = 13).]
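As a small illustration, the example SLP can be written down and expanded in a few lines. The encoding below (a dict `RULES` whose values are either a terminal character or a pair of smaller variable indices) and the helper names are assumptions chosen for this sketch. Expanding is only for checking; it takes time proportional to the uncompressed length, which is exactly what compressed string processing tries to avoid.

```python
# Hypothetical encoding of the slide's SLP: X_i -> terminal character, or (left, right) indices.
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def slp_lengths(rules):
    """Length of the string derived by each variable, computed bottom-up in O(n)."""
    lengths = {}
    for i in sorted(rules):  # assumes children always have smaller indices
        rhs = rules[i]
        lengths[i] = 1 if isinstance(rhs, str) else lengths[rhs[0]] + lengths[rhs[1]]
    return lengths


def slp_expand(rules, root):
    """Fully decompress the variable `root` (for checking only)."""
    derived = {}
    for i in sorted(rules):
        rhs = rules[i]
        derived[i] = rhs if isinstance(rhs, str) else derived[rhs[0]] + derived[rhs[1]]
    return derived[root]


if __name__ == "__main__":
    print(slp_expand(RULES, 7), slp_lengths(RULES)[7])
```

Note that the lengths of all variables are available in O(n) time without ever building S.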

Problem: SLP to LZ78

Input: SLP

X1 = a, X2 = b, X3 = X1 X2, X4 = X1 X3, X5 = X4 X3, X6 = X4 X5, X7 = X6 X5

Output: LZ78 Factorization (Trie)

[Figure: LZ78 trie of S = aabaababaabab, with root 0 and nodes 1 to 6.]

  • Why “re-compress” a compressed representation?
  • Convert the representation: some CSP algorithms require a specific compression format
  • Re-compress an SLP modified by ad-hoc edits: dynamic compressed texts
  • Compute the Normalized Compression Distance [Li et al. 2004]: clustering & classification without decompression, using C_LZ78(x), C_LZ78(y), C_LZ78(xy) computed from SLPs of x and y

Computer Scientist: Make Sleeping Files Walk in their Sleep!

Our Results

Algorithms to compute LZ78 from SLP

  • N : length of the uncompressed string S
  • n : size of the SLP representing S
  • m : # of LZ78 factors (O(N / log N) for constant σ)
  • σ : alphabet size
  • L : length of the longest LZ78 factor
  • N_α = N – α ≤ N, where α ≥ 0 is a quantity that represents the amount of redundancy in the string that is captured by the SLP
Suffix Tree & LZ78

The LZ78 trie can be superimposed on the suffix tree

[Figure: the LZ78 trie of S and the suffix tree of S, for S = aabaababaabab (positions 1 to 13); the nodes of the LZ78 trie are shown superimposed on the suffix tree.]
LZ78 Factorization on Suffix Tree

[Figure: suffix tree of S = aabaababaabab with the LZ78 trie built on top of it; i marks the current position in S.]

  • Build the LZ78 trie on top of the suffix tree ST. Nodes corresponding to the LZ78 trie are marked.
  • The next factor is a prefix of S[i:N]. Find the node in ST corresponding to S[i:N].
  • Find the longest prefix of S[i:N] in the LZ78 trie: O(1) time by dynamic nearest marked ancestor (nma) queries [Westbrook ’92].
  • Make a new node of the LZ78 trie on ST: O(1) time by a level ancestor (la) query on ST [Berkman & Vishkin ’94].
  • Compute the next position: i ← i + |f_i|.
  • LZ78 factorization in O(m) time, given a suffix tree preprocessed for nma & la queries.
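The marking idea in the steps above can be imitated with much simpler machinery. The sketch below is an assumption-laden stand-in, not the slide's data structures: it uses an uncompacted suffix trie (quadratic size) instead of a suffix tree, and plain parent-pointer walks instead of the constant-time nearest-marked-ancestor and level-ancestor structures. All class and function names are hypothetical.

```python
class Node:
    """Node of an uncompacted suffix trie; `factor_id` is set when the node is marked."""
    def __init__(self, parent, depth):
        self.parent = parent
        self.depth = depth
        self.children = {}
        self.factor_id = None  # None means "not a node of the LZ78 trie"


def build_suffix_trie(s):
    root = Node(None, 0)
    ends = []  # ends[i] is the node spelling the suffix s[i:]
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node(node, node.depth + 1))
        ends.append(node)
    return root, ends


def lz78_on_suffix_trie(s):
    root, ends = build_suffix_trie(s)
    root.factor_id = 0  # the empty factor f_0
    factors = []
    i = 0
    while i < len(s):
        # Naive nearest marked ancestor of the node for s[i:]: longest prefix in the LZ78 trie.
        anc = ends[i]
        while anc.factor_id is None:
            anc = anc.parent
        if anc is ends[i]:
            factors.append((anc.factor_id, ""))  # text ends inside a known factor
            break
        # Naive level ancestor: the ancestor of the suffix node one level below `anc`.
        new = ends[i]
        while new.depth > anc.depth + 1:
            new = new.parent
        new.factor_id = len(factors) + 1  # mark the new LZ78 trie node
        factors.append((anc.factor_id, s[i + anc.depth]))
        i += new.depth  # i <- i + |f_i|
    return factors


if __name__ == "__main__":
    print(lz78_on_suffix_trie("aabaababaabab"))
```

Because the LZ78 trie is prefix-closed, the marked nodes always form a connected region around the root, which is why the nearest marked ancestor of the suffix node is the longest matching factor.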

Our algorithm: SLP to LZ78

Key Observation

For any string of length N, the length of any LZ78 factor f_i satisfies:

|f_i| ≤ c_N = (2N + 1/4)^{1/2} – 1/2 = O(N^{1/2})

Main Idea

  • We only need a suffix tree that contains all distinct substrings of S with length at most c_N
  • ⇒ Build a generalized suffix tree (GST) from a set of substrings of S that contain all distinct length-c_N substrings of S
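One way to see the key observation: a factor of length k extends an earlier factor of length k – 1, which extends one of length k – 2, and so on, so the text must contain distinct factors of lengths 1, 2, ..., k, and therefore k(k + 1)/2 ≤ N. Solving for k gives the bound above. The helper below (its name is an assumption for this sketch) computes the largest integer satisfying that inequality using exact integer arithmetic.

```python
from math import isqrt


def max_factor_length_bound(n):
    """Largest integer k with k(k+1)/2 <= n, i.e. floor(sqrt(2n + 1/4) - 1/2)."""
    # k(k+1)/2 <= n  <=>  2k + 1 <= sqrt(8n + 1); isqrt keeps the computation exact.
    return (isqrt(8 * n + 1) - 1) // 2


if __name__ == "__main__":
    # For the running example of length 13 this is the value c_N = 4 used on the later slides.
    print(max_factor_length_bound(13))
```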
Important Concept: Stabbing

X_i stabs an interval [u:v] of S when it is the shortest variable that derives the interval, i.e. the lowest node of the derivation tree whose derived substring covers [u:v] (any interval is stabbed by a unique variable).

e.g.: aaba at [9:12] is stabbed by X5

[Figure: derivation tree of the SLP X1 = a, X2 = b, X3 = X1 X2, X4 = X1 X3, X5 = X4 X3, X6 = X4 X5, X7 = X6 X5 over S = aabaababaabab, positions 1 to 13.]
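Finding the stabbing variable amounts to walking down from the root until the interval straddles the boundary between the two children. The sketch below assumes a hypothetical dict encoding of the SLP and 1-based inclusive positions as on the slide; it takes time proportional to the height of the derivation tree, not the O(log N) of the random access structure cited later in the talk.

```python
# Hypothetical encoding of the slide's SLP: X_i -> terminal character, or (left, right) indices.
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def slp_lengths(rules):
    lengths = {}
    for i in sorted(rules):  # assumes children always have smaller indices
        rhs = rules[i]
        lengths[i] = 1 if isinstance(rhs, str) else lengths[rhs[0]] + lengths[rhs[1]]
    return lengths


def stabbing_variable(rules, root, u, v):
    """Index i of the variable X_i that stabs the 1-based inclusive interval [u:v]."""
    lengths = slp_lengths(rules)
    if not 1 <= u <= v <= lengths[root]:
        raise ValueError("interval out of range")
    var = root
    while not isinstance(rules[var], str):
        left, right = rules[var]
        split = lengths[left]
        if v <= split:          # interval lies entirely in the left child
            var = left
        elif u > split:         # interval lies entirely in the right child
            var, u, v = right, u - split, v - split
        else:                   # interval crosses the boundary: this variable stabs it
            break
    return var


if __name__ == "__main__":
    print(stabbing_variable(RULES, 7, 9, 12))  # the slide's example interval [9:12]
```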

Substrings stabbed by X_i

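All length-q substrings stabbed by X_i are contained in a string t_i(q) of length at most 2(q – 1).

[Figure: X_i with children X_l(i) and X_r(i); t_i(q) consists of the last q – 1 characters of X_l(i) followed by the first q – 1 characters of X_r(i), so every length-q window crossing the boundary lies inside it.]

Any length-q substring of S is stabbed by some unique variable X_i, and therefore is a substring of some t_i(q).

  • { t_i(c_N) : |X_i| ≥ c_N, 1 ≤ i ≤ n } will contain all distinct length-c_N substrings of S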
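The strings t_i(q) can be computed bottom-up without decompressing S, by keeping only the first and last q – 1 characters of every variable. The following is a sketch under assumptions: the dict encoding of the SLP and the function name `boundary_strings` are made up for the example, and t_i(q) is taken to be the length-(q – 1) suffix of the left child followed by the length-(q – 1) prefix of the right child (shorter when a child is shorter), which matches the example on the later slide.

```python
# Hypothetical encoding of the slide's SLP: X_i -> terminal character, or (left, right) indices.
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def boundary_strings(rules, q):
    """Return {i: t_i(q)} for every variable X_i with |X_i| >= q, in O(n q) time."""
    if q < 2:
        raise ValueError("q must be at least 2")
    k = q - 1
    length, prefix, suffix, t = {}, {}, {}, {}
    for i in sorted(rules):  # assumes children always have smaller indices
        rhs = rules[i]
        if isinstance(rhs, str):
            length[i], prefix[i], suffix[i] = 1, rhs, rhs
            continue
        left, right = rhs
        length[i] = length[left] + length[right]
        # First / last k characters of X_i, built from the children's k-character ends.
        prefix[i] = (prefix[left] + prefix[right])[:k]
        suffix[i] = (suffix[left] + suffix[right])[-k:]
        if length[i] >= q:
            t[i] = suffix[left] + prefix[right]
    return t


if __name__ == "__main__":
    for i, ti in boundary_strings(RULES, 4).items():
        print(f"t_{i}(4) = {ti}")
```

Each t_i(q) has length at most 2(q – 1), so the total size of the set is O(n q).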
LZ78 Factorization from SLP

Algorithm:

  • Compute { t_i(c_N) : |X_i| ≥ c_N, 1 ≤ i ≤ n }
  • Build a generalized suffix tree (GST) for the strings { t_i(c_N) : |X_i| ≥ c_N, 1 ≤ i ≤ n }
  • Run the LZ78 factorization algorithm using the GST

Computing the strings and building the GST: O(n c_N) time/space

Example
  • N = 13, c_N = 4, n = 7
  • { t_5(4), t_6(4), t_7(4) } = { aabab, aabaab, babaab }

[Figure: derivation tree of X7 over S = aabaababaabab, positions 1 to 13.]

GST & LZ78 Factors

The LZ78 trie superimposed on the GST of { t_5(4), t_6(4), t_7(4) }

[Figure: the LZ78 trie of S = aabaababaabab and the GST of { t_5(4), t_6(4), t_7(4) }; the GST is built over the concatenation aabab · aabaab · babaab, positions 1 to 17.]
LZ78 Factorization on GST

[Figure: GST of { t_5(4), t_6(4), t_7(4) } with the LZ78 trie built on top of it, next to the derivation tree of X7 over S = aabaababaabab; i marks the current position in S and c_N = 4.]

  • The next factor is a prefix of S[i:N]. Find the node in the GST corresponding to S[i:N].
  • Find the longest prefix of S[i:N] in the LZ78 trie
    • O(log N) time with random access on the SLP [Bille et al. 2011]
  • Make a new node for the LZ78 trie on the GST
    • O(1) time with dynamic nma queries
  • Compute the next position: i ← i + |f_i|
  • LZ78 factorization can be computed in O(m log N) time, given the GST preprocessed for nma & la queries, and the SLP preprocessed for random access queries
Summary of Basic Algorithm

Extreme Cases (recall c_N = O(N^{1/2})):

  • If the string is compressible, n = O(log N), m = O(N^{1/2}), so O(n c_N + m log N) = O(N^{1/2} log N) = o(N)
  • If the string is not compressible, n, m = O(N) and O(n c_N + m log N) = O(N^{1.5})

Can we do better than just reverting to decompress & process?

(1) Improving the n c_N term to nL ≤ n c_N

Let L denote the length of the longest LZ78 factor of S

  • We built the GST for distinct substrings of length at most c_N, but actually we only need substrings of length at most L
  • However, L is not known beforehand…
  • Doubling Technique:
  • Assume L = 2 and run the algorithm.
  • If the LZ78 trie expands beyond the GST, set L ← 2 × L, rebuild the GST and LZ78 trie, and continue
    • Total time complexity for rebuilds: Σ_{i=1..log L} O(n 2^i + m) = O(nL + m log L)
  • O(n c_N + m log N) time, O(n c_N + m) space
  • ⇒ O(nL + m log N) time, O(nL + m) space
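The rebuild cost is a geometric sum. As a worked check, assume the guesses are 2, 4, ..., 2^k with k = ⌈log₂ L⌉ (so the last guess satisfies 2^k < 2L), and that a rebuild with guess 2^i costs at most n·2^i for the GST plus m for re-inserting the factors found so far:

```latex
\sum_{i=1}^{k} \bigl(n\,2^{i} + m\bigr)
  = n\,\bigl(2^{k+1} - 2\bigr) + m\,k
  < 4nL + m\,(\log_2 L + 1)
  = O(nL + m \log L),
\qquad k = \lceil \log_2 L \rceil .
```

The last rebuild dominates the GST cost, which is why doubling loses only a constant factor compared with knowing L in advance.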
(2) Improving the n c_N term to N_α ≤ N

We can replace the GST with the suffix tree of a trie for q = c_N

Lemma [Goto et al. CPM 2012]

Given an SLP for string S, the set of length-q substrings of S can be represented as paths in a reverse trie of size N_α = N – α(q) ≤ N, where

α(q) = Σ_{i : |X_i| ≥ q} (vOcc(X_i) – 1)(|t_i(q)| – (q – 1)) ≥ 0

and vOcc(X_i) is the # of times X_i occurs in the derivation tree. The trie can be computed in time linear in its size.

Lemma [Shibuya 2003]

The suffix tree of a reverse trie can be constructed in linear time.

  • O(n c_N + m log N) time, O(n c_N + m) space
  • ⇒ O(N_α + m log N) time, O(N_α + m) space

N_α = O(n c_N)
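The quantity α(q) is cheap to evaluate from the SLP alone. The sketch below is an illustration under assumptions: the dict encoding and function name are hypothetical, vOcc is obtained by pushing counts from the root down to the children, and |t_i(q)| is taken as min(q – 1, |X_l(i)|) + min(q – 1, |X_r(i)|), which agrees with the example strings given earlier.

```python
# Hypothetical encoding of the slide's SLP: X_i -> terminal character, or (left, right) indices.
RULES = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def redundancy_alpha(rules, root, q):
    """alpha(q) = sum over |X_i| >= q of (vOcc(X_i) - 1) * (|t_i(q)| - (q - 1))."""
    order = sorted(rules)  # assumes children always have smaller indices
    length = {}
    for i in order:
        rhs = rules[i]
        length[i] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]

    # vOcc: number of nodes labelled X_i in the derivation tree of the root.
    vocc = {i: 0 for i in rules}
    vocc[root] = 1
    for i in reversed(order):
        rhs = rules[i]
        if not isinstance(rhs, str):
            vocc[rhs[0]] += vocc[i]
            vocc[rhs[1]] += vocc[i]

    alpha = 0
    for i in order:
        rhs = rules[i]
        if isinstance(rhs, str) or length[i] < q or vocc[i] == 0:
            continue
        t_len = min(q - 1, length[rhs[0]]) + min(q - 1, length[rhs[1]])
        alpha += (vocc[i] - 1) * (t_len - (q - 1))
    return alpha


if __name__ == "__main__":
    alpha = redundancy_alpha(RULES, 7, 4)
    print(alpha, 13 - alpha)  # alpha(4) and N_alpha for the running example
```

For the running example only X5 occurs more than once among the variables of length at least 4, so it is the only one that contributes to α(4).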

Example: Trie of size N_α for q = 4

[Figure: derivation tree of X7 over S = aabaababaabab and the reverse trie that aggregates t_5(4), t_6(4), t_7(4).]

Σ|t_i(q)| : 17

Text size: 13

Trie size: 11

We can aggregate all t_i(q) into a trie of size at most the text size

Summary
  • Showed an algorithm for SLP → LZ78 factorization
    • at least as fast as naïve decompress & process
    • better when the string is compressible

  • N : length of the uncompressed string S
  • n : size of the SLP representing S
  • m : # of LZ78 factors (O(N / log N) for constant σ)
  • σ : alphabet size
  • L : length of the longest LZ78 factor
  • N_α = N – α(c_N) ≤ N