
Simple Linear Work Suffix Array Construction

Simple Linear Work Suffix Array Construction. J. Kärkkäinen, P. Sanders. Proc. 30th International Conference on Automata, Languages and Programming, 2003. Work: when analyzing a parallel algorithm, two complexity measures are commonly used: time and work complexity. Time t(n): the number of steps that must be executed.


Presentation Transcript


  1. Simple Linear Work Suffix Array Construction. J. Kärkkäinen, P. Sanders. Proc. 30th International Conference on Automata, Languages and Programming, 2003.

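  2. Work. When analyzing a parallel algorithm, two complexity measures are commonly used: time and work complexity. • Time t(n): the number of steps that must be executed. • Work w(n): t(n) × (the number of processors used). The main contribution of this paper is that its method is also optimal in the External Memory and Cache Oblivious models, while in the BSP and EREW-PRAM models it matches the work complexity of existing algorithms with a better time complexity. The rest of this presentation, however, analyzes only the time complexity in the RAM model.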

  3. Today’s Work • Suffix Array • Depth Array • Suffix Tree

  4. Model of Alphabet • Constant alphabet: The size of alphabet is constant. • Integer alphabet: Characters are integers in [1 … n], where n is the number of input characters.

  5. Topic 1: Suffix Array • A suffix array SA of s is the result of sorting the suffixes of s lexicographically. Ex: s = [ a b a ] (indices 0 1 2), with suffixes s0 = a b a, s1 = b a, s2 = a => SA = [ s2 s0 s1 ], stored as [ 2 0 1 ] in an implementation. Some conventions: we call the suffix starting at index i the i-th suffix. Suffixes with index not divisible by 3 = { i-th suffix | i ≠ 0 mod 3 }. Suffixes with index divisible by 3 = { i-th suffix | i = 0 mod 3 }.
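
The definition on this slide can be checked with a direct, non-linear-time sketch. The function name `naive_suffix_array` is an illustrative assumption, and the comparison sort below is only a baseline for checking results, not the linear-time method the slides go on to describe.

```python
def naive_suffix_array(s):
    """Sort suffix start indices by comparing the suffixes themselves.

    This is a baseline that follows the definition directly; it is far
    slower than the O(n) construction discussed in the presentation.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


if __name__ == "__main__":
    # Reproduces the slide's example s = "aba".
    print(naive_suffix_array("aba"))
```

Python treats a proper prefix as smaller than the longer string, which matches the usual convention of an implicit terminator smaller than every character.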

  6. Suffix Array Problem • Input: a string s with length n • Output: a suffix array SA of s • Time: O(n)

  7. GetSA Algorithm Outline • Step 1: SA≠0 = sort the suffixes starting at positions i ≠ 0 mod 3. • Step 2: SA=0 = sort the suffixes starting at positions i = 0 mod 3. • Step 3: SA = merge SA=0 and SA≠0.

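  8. Step 1: SA≠0 = sort the suffixes starting at positions i ≠ 0 mod 3. • Choose representatives: pad s with $ $ and take the three-character block starting at each position i ≠ 0 mod 3. s = m i s s i s s i p p i $ $ (indices 0 … 12). Positions i: 1 4 7 10 2 5 8. Blocks: iss iss ipp i$$ ssi ssi ppi. Radix sort the blocks and replace each block by its rank: 3 3 2 1 5 5 4. Let the representative string be R = [ 3 3 2 1 5 5 4 ] => getSA(R) = SA_R = [ 10 7 4 1 8 5 2 ] (reported as positions in s), in time T(2n/3). Claim: SA≠0 = SA_R.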
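
A small sketch of the block-naming step may help when following the example on this slide. The names `name_blocks` and `sort_sample_suffixes` are illustrative assumptions. The sketch uses Python's comparison sort where the slide uses a radix sort, and a naive sort stands in for the recursive getSA call. It also assumes `$` is smaller than every input character, and it does not handle the boundary case in which the last block of the i = 1 mod 3 group contains no padding character.

```python
def name_blocks(s, pad="$"):
    """Return (positions, names) for the blocks starting at i != 0 mod 3.

    Positions are listed as on the slide: first i = 1 mod 3, then i = 2 mod 3.
    Each block of three characters is replaced by its rank among the
    distinct blocks, starting from 1.
    """
    n = len(s)
    padded = s + pad * 2
    positions = list(range(1, n, 3)) + list(range(2, n, 3))
    blocks = [padded[i:i + 3] for i in positions]
    # Comparison sort of the distinct blocks (the slide uses radix sort).
    order = {b: r for r, b in enumerate(sorted(set(blocks)), start=1)}
    return positions, [order[b] for b in blocks]


def sort_sample_suffixes(s):
    """Sort the suffixes at i != 0 mod 3 via the string of block names."""
    positions, names = name_blocks(s)
    # Naive stand-in for the recursive call on the shorter string.
    sa_names = sorted(range(len(names)), key=lambda k: names[k:])
    return [positions[k] for k in sa_names]


if __name__ == "__main__":
    print(name_blocks("mississippi"))
    print(sort_sample_suffixes("mississippi"))
```

For `mississippi` the names come out as 3 3 2 1 5 5 4 and the sorted positions as 10 7 4 1 8 5 2, matching the slide.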

  9. Why SA_R = SA≠0? Representative string R = [ 3 3 2 1 5 5 4 ] (positions 1 4 7 10 2 5 8), s = m i s s i s s i p p i (indices 0 … 10). Each suffix of R corresponds to a suffix of s: • R1 = 3 3 2 1 5 5 4, s1 = i s s i s s i p p i • R4 = 3 2 1 5 5 4, s4 = i s s i p p i • R7 = 2 1 5 5 4, s7 = i p p i • R10 = 1 5 5 4, s10 = i • R2 = 5 5 4, s2 = s s i s s i p p i • R5 = 5 4, s5 = s s i p p i • R8 = 4, s8 = p p i. SA_R = SA≠0 = [ 10 7 4 1 8 5 2 ]. It suffices to show that R_i < R_j <=> s_i < s_j.

  10. R_i < R_j <=> s_i < s_j, where R is the representative string. Case 1: i = j mod 3. R = [ 4 4 3 2 6 6 5 ] (positions 1 4 7 10 2 5 8), s = m i s s i s s i p p i $ $ (indices 0 … 12). Ex: R4 = [ 4 3 2 6 6 5 ] (positions 4 7 10 2 5 8), s4 = [ i s s i p p i $ $ ] (indices 4 … 12); R1 = [ 4 4 3 2 6 6 5 ] (positions 1 4 7 10 2 5 8), s1 = [ i s s i s s i p p i $ $ ] (indices 1 … 12). s4 < s1 and R4 < R1.

  11. R_i < R_j <=> s_i < s_j, where R is the representative string. Case 2: i ≠ j mod 3. R = [ 4 4 3 2 6 6 5 ] (positions 1 4 7 10 2 5 8), s = m i s s i s s i p p i $ $ (indices 0 … 12). Ex: R4 = [ 4 3 2 6 6 5 ] (positions 4 7 10 2 5 8), s4 = [ i s s i p p i $ $ ] (indices 4 … 12); R5 = [ 6 5 ] (positions 5 8), s5 = [ s s i p p i ] (indices 5 … 10). R4 < R5 and s4 < s5.

  12. Step 2: SA=0 = sort the suffixes starting at positions i = 0 mod 3. ∵ The rank of s_j among { s_k | k ≠ 0 mod 3 } was determined in Step 1 for all j ≠ 0 mod 3. ∴ Let rank≠0(s_j) = the rank of s_j among { s_k | k ≠ 0 mod 3 } for all j ≠ 0 mod 3. SA=0 = radix sort { (s[i], rank≠0(s_{i+1})) | i = 0 mod 3 }.
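
The pair-based sort on this slide can be sketched as follows. The function name `sort_mod0_suffixes` and the convention of giving rank 0 to a position past the end of the string are assumptions made for illustration, and a comparison sort is used in place of the radix sort named on the slide.

```python
def sort_mod0_suffixes(s, sa12):
    """Sort the suffixes at i = 0 mod 3 by the pair (s[i], rank of s_{i+1}).

    `sa12` is the sorted list of suffix positions i != 0 mod 3 from Step 1.
    """
    # rank12[p] = 1-based rank of suffix p among the suffixes in sa12.
    rank12 = {pos: r for r, pos in enumerate(sa12, start=1)}

    def key(i):
        # i + 1 is 1 mod 3, so its rank is known unless it is past the end;
        # the empty suffix is given rank 0 (an assumption of this sketch).
        return (s[i], rank12.get(i + 1, 0))

    # Comparison sort on the pairs; the slide uses radix sort for O(n) time.
    return sorted(range(0, len(s), 3), key=key)


if __name__ == "__main__":
    print(sort_mod0_suffixes("mississippi", [10, 7, 4, 1, 8, 5, 2]))
```

Two suffixes with the same first character are ordered by the ranks of their tails, which is why a pair is enough and no further character comparisons are needed.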

  13. Step 3: SA = merge SA=0 and SA≠0. Example for s = m i s s i s s i p p i $ (the terminator $ is at index 11): • SA=0 = [ s0 s9 s6 s3 ] • SA≠0 = [ s11 s10 s7 s4 s1 s8 s5 s2 ] • SA = merge SA=0 and SA≠0 = [ s11 s10 s7 s4 s1 s0 s9 s8 s6 s3 s5 s2 ]. The merge runs in O(n) time if we can determine the relative order of any s_i ∈ SA=0 and s_j ∈ SA≠0 in constant time.

  14. Compare s_i and s_j where i = 0 mod 3 and j ≠ 0 mod 3: • Case 1: j = 1 mod 3. ∵ i+1 = 1 mod 3 and j+1 = 2 mod 3 ∴ compare (s[i], rank≠0(s_{i+1})) with (s[j], rank≠0(s_{j+1})) in constant time. • Case 2: j = 2 mod 3. ∵ i+2 = 2 mod 3 and j+2 = 1 mod 3 ∴ compare (s[i], s[i+1], rank≠0(s_{i+2})) with (s[j], s[j+1], rank≠0(s_{j+2})) in constant time.
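
The two comparison cases can be turned into a merge routine roughly as below. This is a sketch under stated assumptions: the names `merge_suffix_arrays`, `char`, and `rank` are illustrative, positions past the end of the string are treated as an empty character with rank 0, and the two input arrays are taken as already sorted.

```python
def merge_suffix_arrays(s, sa0, sa12):
    """Merge sorted suffix lists for i = 0 mod 3 (sa0) and i != 0 mod 3 (sa12)."""
    n = len(s)
    rank12 = {pos: r for r, pos in enumerate(sa12, start=1)}

    def char(i):
        # "" compares smaller than any character (sketch convention).
        return s[i] if i < n else ""

    def rank(i):
        # Rank among suffixes at positions != 0 mod 3; 0 if past the end.
        return rank12.get(i, 0)

    def smaller(i, j):
        """True if s_i < s_j, for i = 0 mod 3 and j != 0 mod 3."""
        if j % 3 == 1:
            # i+1 and j+1 are both != 0 mod 3, so their ranks are known.
            return (char(i), rank(i + 1)) < (char(j), rank(j + 1))
        # j = 2 mod 3: i+2 and j+2 are both != 0 mod 3.
        return (char(i), char(i + 1), rank(i + 2)) < \
               (char(j), char(j + 1), rank(j + 2))

    merged, a, b = [], 0, 0
    while a < len(sa0) and b < len(sa12):
        if smaller(sa0[a], sa12[b]):
            merged.append(sa0[a])
            a += 1
        else:
            merged.append(sa12[b])
            b += 1
    return merged + sa0[a:] + sa12[b:]


if __name__ == "__main__":
    print(merge_suffix_arrays("mississippi$", [0, 9, 6, 3],
                              [11, 10, 7, 4, 1, 8, 5, 2]))
```

Each call to `smaller` inspects at most two characters and one rank per side, so the whole merge does a constant amount of work per output element.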

  15. Time complexity analysis • Step1: O(n) + T(2n/3) • Step2: O(n) • Step3: O(n) • T(n) = O(n) + T(2n/3) = O(n)
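
For readers who want the recurrence unrolled, a short derivation follows. The constant c (an upper bound on the non-recursive work per character) is an assumption introduced only for this derivation.

```latex
\begin{align*}
T(n) &\le cn + T\!\left(\tfrac{2}{3}n\right) \\
     &\le cn + c\,\tfrac{2}{3}n + T\!\left(\left(\tfrac{2}{3}\right)^{2} n\right) \\
     &\le cn \sum_{k=0}^{\infty} \left(\tfrac{2}{3}\right)^{k}
      = \frac{cn}{1 - \tfrac{2}{3}} = 3cn = O(n).
\end{align*}
```

The geometric series converges because the recursive call receives only two thirds of the input.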

  16. Topic 2: Depth array • Definition of depth array: let SA (indices 0 … n−1) hold s_k at position i−1 and s_j at position i. The depth array DA (indices 1 … n−1) is defined by DA[i] = the length of the longest common prefix of s_j and s_k.

  17. Depth array problem • Input: a string s and its suffix array SA. • Output: a depth array DA of s. • Time: O(|s|) = O(n)

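  18. Lemma 1: d_i ≥ d_{i−1} − 1. Notation: for the suffix s_i of S (indices 0 … n−1), rank(i) is the position of s_i in SA, and s_i' is the suffix immediately before s_i in SA (at position rank(i) − 1). Then d_i = DA[rank(i)] = the length of the longest common prefix of s_i' and s_i.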

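  19. Lemma 1: d_i ≥ d_{i−1} − 1. [Figure: S contains the adjacent suffixes s_{i−1} and s_i. In SA, s_i' immediately precedes s_i at position rank(i), and s_(i−1)' immediately precedes s_{i−1} at position rank(i−1); DA holds d_i and d_{i−1} at those positions. s_(i−1)' and s_{i−1} agree on their first d_{i−1} characters, so they agree on d_{i−1} − 1 characters once the first character is dropped; s_i' and s_i agree on their first d_i characters.]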

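  20. Lemma 1: d_i ≥ d_{i−1} − 1. Pf: [Figure: s_(i−1)' and s_{i−1} aligned, agreeing on d_{i−1} − 1 characters after the first; s_i' and s_i aligned, agreeing on characters 1 … d_i.]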

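  21. Lemma 1: d_i ≥ d_{i−1} − 1. Pf: Let s_{(i−1)'+1} be s_(i−1)' with its first character removed. It is smaller than s_i and agrees with s_i on d_{i−1} − 1 characters. If d_i < d_{i−1} − 1, then s_i' < s_{(i−1)'+1} < s_i, which contradicts the fact that s_i' immediately precedes s_i in SA.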

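  22. How to compute d_i when d_{i−1} is given? • By Lemma 1 (d_i ≥ d_{i−1} − 1), it suffices to compare s_i and s_i' starting from the d_{i−1}-th character. [Figure: s_i' and s_i aligned; the first d_{i−1} − 1 characters are already known to match, and the comparison continues up to d_i.]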

  23. Algorithm GetDepth. Input: a string s and its suffix array SA. 1. d_1 = computed by naively comparing s_1 and s_1'; 2. For i := 2 to n−1 do 3. d_i = computed by comparing s_i and s_i' starting from the d_{i−1}-th character; 4. End for. Time complexity analysis: iteration i costs (d_i − d_{i−1} + 1) + 1 = d_i − d_{i−1} + 2 comparisons. Total = Σ_{i=2}^{n−1} (d_i − d_{i−1} + 2) = d_{n−1} − d_1 + 2(n − 2) = O(n).
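
GetDepth can be sketched in Python as below. The function name `depth_array`, the 0-based indexing, and the convention DA[0] = 0 are assumptions of this sketch; the slide indexes DA from 1 and starts its loop at the first suffix.

```python
def depth_array(s, sa):
    """Compute DA, where DA[r] is the LCP length of the suffixes at SA[r-1], SA[r].

    Suffixes are processed in text order, so by Lemma 1 the previous match
    length minus one can be reused as the starting offset.
    """
    n = len(s)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r

    da = [0] * n  # da[0] stays 0: the first suffix in SA has no predecessor
    h = 0         # number of characters already known to match
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # start of the suffix just before s_i in SA
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        da[rank[i]] = h
        if h > 0:
            h -= 1  # Lemma 1: the next depth is at least h - 1
    return da


if __name__ == "__main__":
    print(depth_array("banana", [5, 3, 1, 0, 4, 2]))
```

The counter `h` rises at most n times in total and falls by at most one per iteration, which is the same telescoping argument as the slide's analysis.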

  24. Topic 3: Suffix Tree Problem • Input: a string s with length n. • Output: a suffix tree ST of s. • Time: O(|s|) = O(n)

  25. GetST Algorithm Outline. Algorithm GetST(s) 1. SA = suffix array of s; 2. DA = depth array of s; 3. For i := 0 to n−1: ST_i = ST_{i−1} with the SA[i]-th suffix added (ST_{−1} is the empty tree); 4. End for 5. Return ST_{n−1};

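  26. How to add the SA[i]-th suffix into ST_{i−1}? Let SA (indices 0 … n−1) hold s_k at position i−1 and s_j at position i. Observation: the SA[i−1]-th suffix is the rightmost path RP of ST_{i−1}, so the length of the longest common prefix of RP and the SA[i]-th suffix is DA[i]. [Figure: the new leaf branches off RP at depth DA[i]; its edge is labelled [ (SA[i] + DA[i]), − ], i.e. the rest of the suffix from that position to the end.]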

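  27. Each node is traversed at most once. [Figure: the new leaf, with edge label [ (SA[i] + DA[i]), − ], branches off the rightmost path at depth DA[i].] Nodes on the part of the path below the branching point will not be traversed again.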

  28. Time Complexity Analysis • Because each node is traversed at most once and there are at most 2n nodes in the tree, the time complexity is O(n).
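
The insertion along the rightmost path can be sketched as follows. This is an illustrative sketch, not the presenter's code: the `Node` class, the function names, and the choice to store only string depths (not edge labels) are assumptions. It also assumes that s ends with a unique terminator such as `$`, so that no suffix is a prefix of another.

```python
class Node:
    def __init__(self, depth, suffix=None):
        self.depth = depth    # length of the path label from the root
        self.children = []    # kept in lexicographic (insertion) order
        self.suffix = suffix  # start index for a leaf, None for internal nodes


def build_suffix_tree(s, sa, da):
    """Build the tree shape by adding suffixes in SA order along the rightmost path."""
    n = len(s)
    root = Node(0)
    path = [root]  # current rightmost path, root first
    for k, start in enumerate(sa):
        lcp = da[k] if k > 0 else 0
        last = None
        # Walk up; popped nodes are never visited again.
        while path[-1].depth > lcp:
            last = path.pop()
        top = path[-1]
        if top.depth < lcp:
            # Split the edge top -> last with a new node at depth lcp.
            mid = Node(lcp)
            top.children[-1] = mid  # `last` was the rightmost child of `top`
            mid.children.append(last)
            path.append(mid)
            top = mid
        leaf = Node(n - start, suffix=start)
        top.children.append(leaf)
        path.append(leaf)
    return root


def leaves(node):
    """Return leaf suffix indices in left-to-right order."""
    if node.suffix is not None:
        return [node.suffix]
    return [x for child in node.children for x in leaves(child)]


if __name__ == "__main__":
    tree = build_suffix_tree("banana$", [6, 5, 3, 1, 0, 4, 2],
                             [0, 0, 1, 3, 0, 0, 2])
    print(leaves(tree))
```

Reading the leaves from left to right gives back the suffix array, which is a convenient sanity check on the construction.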

  29. Conclusions • Advantages: • Alphabet restrictions • Hard-disk I/O • Easy to show • Disadvantages: • Lacks the incremental property
