
On the Sorting-Complexity of Suffix Tree Construction

Martin Farach-Colton, Paolo Ferragina, S. Muthukrishnan


Presentation Transcript


  1. On the Sorting-Complexity of Suffix Tree Construction. Martin Farach-Colton, Paolo Ferragina, S. Muthukrishnan

  2. Fact From the Previous Talk [Harel and Tarjan 1984; Bender and Farach-Colton 2000]: A tree T with m nodes can be preprocessed in O(m) time so that, for any pair of its nodes u, v, lca(u, v) can be computed in constant time.

  3. What's in This Paper • Bounds depend on the alphabet • Constant-size alphabet: O(n) (Weiner 1973) • Unbounded alphabet: Θ(n log n) • Integer alphabet {1, …, n}: linear time • RAM algorithm • DAM algorithm (I/O optimal) • The algorithm also works for PRAM and PDAM

  4. Talk Outline • Suffix trees • Reminder • Tools • RAM algorithm for suffix tree construction • Conclusion

  5. Suffix Trees [Figure: the suffix tree of the running example S = 121112212221$, n = 13, with leaves numbered 1 to 13 by suffix start position. Example suffix: S[8,13] = 12221$.]

  6. Suffix Tree Representation [Figure: the suffix tree of S with leaves labeled 1 to 13.]

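  7. Properties of Suffix Trees: lcp(σ(v), σ(w)) = |σ(lca(v, w))|, where σ(u) is the string spelled from the root to u and L(u) = |σ(u)|. [Figure: the suffix tree of S with two nodes v and w and lca(v, w) marked; sample nodes are annotated σ=1, L=1; σ=11, L=2; σ=12, L=2.]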

  8. Suffix Links. Lemma [Weiner 1973]: Let a ∈ Σ and α ∈ Σ*. If there is a node v in T_S such that σ(v) = aα, then there is a node w in T_S such that σ(w) = α. Define the suffix link as sl(v) = w.

  9. Suffix Links [Figure: suffix links drawn on the suffix tree of S; annotated nodes: σ=1, L=1; σ=2, L=1; σ=12, L=2; σ=122, L=3.]

  10. Suffix Links Example [Figure: the suffix tree of S with its suffix links.]

  11. Suffix Arrays • Let Δ = {S_i | S_i ∈ Σ*, |S_i| = n_i} • T = compacted trie of Δ • An in-order traversal of the leaves gives the strings in lexicographic order: S_{p_1}, …, S_{p_|Δ|} • Sort array: A_T[i] = p_i • Longest common prefix array: LCP_T[i] = lcp(S_{p_i}, S_{p_{i+1}})
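The two arrays defined on this slide can be illustrated with a deliberately naive sketch. It sorts the strings directly instead of traversing a compacted trie, so it shows what the arrays contain, not how the paper builds them. The function names are hypothetical, and the sketch assumes that `$` sorts before every other character (true for ASCII digits).

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def sort_and_lcp_arrays(strings: list[str]) -> tuple[list[int], list[int]]:
    """Return (A, LCP) using 1-based string indices, built by plain sorting."""
    order = sorted(range(len(strings)), key=lambda i: strings[i])
    A = [i + 1 for i in order]
    # LCP[k] is the lcp of the k-th and (k+1)-th strings in sorted order.
    LCP = [lcp(strings[order[k]], strings[order[k + 1]])
           for k in range(len(order) - 1)]
    return A, LCP


# Running example of the talk; suffixes[i] is the suffix S[i+1, n].
S = "121112212221$"
suffixes = [S[i:] for i in range(len(S))]
A, LCP = sort_and_lcp_arrays(suffixes)
```

Sorting whole suffixes this way takes quadratic time or worse, which is why it only serves as a reference to compare faster constructions against.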

  12. Suffix Array Example [Figure: the suffix tree of S with its arrays A_T and LCP_T; one node is annotated σ=11, L=2.]

  13. RAM Algorithm. Input: string S. Output: T_S. Divide and conquer: • Recursively compute T_o, the compacted trie of the suffixes beginning at odd positions • Recursively compute T_e, the compacted trie of the suffixes beginning at even positions • Merge T_e and T_o to get T_S

  14. Divide and Conquer Scheme [Figure: recursion tree. Divide: A(n) splits into A(n/2), A(n/2), then into four A(n/4). Conquer and merge: S(n/2), S(n/2) combine into S(n).]

  15. RAM Algorithm Scheme [Diagram with steps numbered 1 to 6] Input: |S| = n, Σ = [n]. Divide: |S'| = n/2, Σ' = [n/2]. Conquer: T_S' (n/2); A_Ts', LCP_Ts' (n/2); A_To, LCP_To (n/2); A_Te, LCP_Te (n/2). Merge: A_T, LCP_T (n/2); T_S (n).

  16. Switching Representations [Diagram: the algorithm scheme of slide 15, repeated to introduce this step.]

  17. Suffix Tree → Suffix Array [Figure: the suffix tree of S with its arrays A_T and LCP_T; one node is annotated σ=11, L=2.]

  18. Suffix Array → Suffix Tree [Figure: the arrays A_T and LCP_T with the suffix tree of S; one node is annotated σ=11, L=2.]

  19. Compressing S [Diagram: the algorithm scheme of slide 15, repeated to introduce this step.]

  20. Compressing S • Input: |S| = n, Σ = [n] • Map character pairs to single characters: • For i = 1 to n/2, form the pair (S[2i-1], S[2i]) • Sort the pairs lexicographically by radix sort in O(n) time • Remove duplicates • S'[i] = rank of the pair (S[2i-1], S[2i]) • Now |S'| = n/2 and Σ' = [n/2]

  21. Example: S = 121112212221$, Σ = [13] • Pairs: (1,2) (1,1) (1,2) (2,1) (2,2) (2,1) • Ordered pairs: (1,1) (1,2) (1,2) (2,1) (2,1) (2,2) • Duplicates removed: (1,1) (1,2) (2,1) (2,2) • S' = 212343$, Σ' = [4]
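The compression step and its example can be sketched in a few lines. This is an illustration under assumptions, not the paper's code: the function name `compress` is hypothetical, Python's built-in `sorted` stands in for the O(n) radix sort named on the slide, and the terminating `$` is left out and assumed to be re-appended by the caller.

```python
def compress(S: list[int]) -> list[int]:
    """Replace each pair (S[2i-1], S[2i]) by its rank among the distinct pairs."""
    # Consecutive, non-overlapping pairs; a trailing unpaired character is ignored.
    pairs = [(S[k], S[k + 1]) for k in range(0, len(S) - 1, 2)]
    # sorted() stands in for the radix sort; set() removes duplicates.
    rank = {p: r for r, p in enumerate(sorted(set(pairs)), start=1)}
    return [rank[p] for p in pairs]


# Running example without the final $.
S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1]
S_prime = compress(S)
```

Because ranks are assigned in lexicographic order of the pairs, comparing two characters of S' gives the same result as comparing the two pairs of S they encode.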

  22. Decompressing S [Diagram: the algorithm scheme of slide 15, repeated to introduce this step.]

  23. Decompressing S • Input: A_Ts', LCP_Ts' • Notice: the suffix S[2i-1]···S[n]$ of S corresponds to the suffix S'[i]···S'[n/2]$ of S' • Hence A_To[i] = 2 · A_Ts'[i] - 1
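The index mapping on this slide can be checked on the running example. The sketch below is only a sanity check under assumptions: `suffix_array` is a hypothetical naive helper (plain sorting, not the recursive construction), `$` is assumed to be the smallest character, and in S' it is encoded as 0.

```python
def suffix_array(seq) -> list[int]:
    """1-based start positions of the suffixes of seq in lexicographic order (naive)."""
    return sorted(range(1, len(seq) + 1), key=lambda i: seq[i - 1:])


S = "121112212221$"
# S' = 212343$ from the compression example, with $ encoded as 0 (assumed smallest).
S_prime = [2, 1, 2, 3, 4, 3, 0]

A_Ts_prime = suffix_array(S_prime)
# Position i of S' corresponds to the odd position 2i - 1 of S.
A_To = [2 * p - 1 for p in A_Ts_prime]

# Reference: the odd suffixes of S in sorted order, computed directly.
odd_reference = [p for p in suffix_array(S) if p % 2 == 1]
```

The slide's input also lists LCP_Ts'; deriving LCP_To from it is not covered by this sketch.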

  24. Building the Even Tree [Diagram: the algorithm scheme of slide 15, repeated to introduce this step.]

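  25. Building the Even Tree • Input: A_To, LCP_To • Observation: if P is an even suffix of S, then P = aP', where a is a single character and P' is an odd suffix of S • To get A_Te, radix sort the even suffixes S[2i, n] using the key pair (S[2i], S[2i+1, n]); the order of the odd suffixes S[2i+1, n] is already given by A_To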
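A minimal sketch of this step, assuming the odd sort array is already available. The name `even_sort_array` is hypothetical, `sorted` with a two-component key stands in for the two-pass radix sort, and the sketch computes only A_Te, not LCP_Te.

```python
def even_sort_array(S: str, A_To: list[int]) -> list[int]:
    """Sort the even suffixes of S (1-based positions) using the odd sort array."""
    n = len(S)
    # rank_odd[p] = position of the odd suffix starting at p in sorted order.
    rank_odd = {pos: r for r, pos in enumerate(A_To)}
    evens = range(2, n + 1, 2)
    # Key: first character, then the rank of the odd suffix that follows it.
    # If n is even, the suffix after position n is empty and gets rank -1.
    return sorted(evens, key=lambda p: (S[p - 1], rank_odd.get(p + 1, -1)))


S = "121112212221$"
A_To = [13, 3, 1, 5, 11, 7, 9]   # odd sort array of the running example
A_Te = even_sort_array(S, A_To)

# Reference: even suffixes sorted directly by their text.
even_reference = sorted(range(2, len(S) + 1, 2), key=lambda p: S[p - 1:])
```

Each key has two components of bounded size (a character and a rank below n), which is what lets a radix sort finish in linear time.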

  26. Merging T_o and T_e [Diagram: the algorithm scheme of slide 15, repeated to introduce this step.]

  27. Merging T_o and T_e • Input: A_To, LCP_To and A_Te, LCP_Te • Trivial method: sort the suffixes lexicographically, Θ(n²) • What if we have an oracle for lcp(S[2i, n], S[2j-1, n])? • Merge A_To and A_Te directly (like sorted lists) • Compute LCP_T from the previous results: • lcp of adjacent odd suffixes by LCP_To • lcp of adjacent even suffixes by LCP_Te • lcp of an odd suffix and an even suffix by the oracle
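The oracle-based merge can be sketched as an ordinary two-list merge in which each comparison asks for an lcp and then inspects a single character. In this sketch the oracle is a naive character scan, which is an assumption made for illustration only; the talk builds a constant-time oracle later. The function names are hypothetical.

```python
def lcp_oracle(S: str, i: int, j: int) -> int:
    """Naive stand-in for the oracle: lcp of the suffixes at 1-based positions i and j."""
    k = 0
    while i + k <= len(S) and j + k <= len(S) and S[i + k - 1] == S[j + k - 1]:
        k += 1
    return k


def merge_sort_arrays(S: str, A_To: list[int], A_Te: list[int]) -> list[int]:
    """Merge the odd and even sort arrays like two sorted lists."""
    merged, a, b = [], 0, 0
    while a < len(A_To) and b < len(A_Te):
        i, j = A_To[a], A_Te[b]
        k = lcp_oracle(S, i, j)
        # The character right after the common prefix decides the order.
        # The unique final $ guarantees that such a character exists.
        if S[i + k - 1] < S[j + k - 1]:
            merged.append(i)
            a += 1
        else:
            merged.append(j)
            b += 1
    merged.extend(A_To[a:])
    merged.extend(A_Te[b:])
    return merged


S = "121112212221$"
A_To = sorted(range(1, len(S) + 1, 2), key=lambda p: S[p - 1:])
A_Te = sorted(range(2, len(S) + 1, 2), key=lambda p: S[p - 1:])
A_T = merge_sort_arrays(S, A_To, A_Te)
```

With a constant-time oracle every comparison costs O(1), so the merge itself is linear in n.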

  28-33. Coupled-DFS (the uncompacted case) [Figure, six animation frames: tries T1 and T2, with edge labels 1, 2, 3 and leaves A to F, are traversed together to build the merged trie TM one edge at a time; leaves reached by the same path are combined, e.g. A+D and C+F.]

  34-35. Coupled-DFS (the compacted case) [Figure, two frames: T1 and T2 with multi-character edge labels (2, 12, 34, 1234) and the merged trie TM.]

  36. Over-Merging T_o and T_e • How do we merge compacted tries? • An over-merge is like a merge, but: • Only the first characters of edges are compared • If two matched edges have different lengths k < l, the edge of length l is split into pieces of length k and l-k • Edges are identified by their first letter only

  37. Over-Merge Example [Figure: T1 and T2 with edge labels such as 13, 12, 34 and 1234, and the over-merged trie TM, in which one edge is labeled 1x.]

  38. Over-Merge of Running Example: T_o [Figure: the odd tree T_o of S = 121112212221$, with leaves 1, 3, 5, 7, 9, 11, 13.]

  39. Over-Merge of Running Example: T_e [Figure: the even tree T_e of S = 121112212221$, with leaves 2, 4, 6, 8, 10, 12.]

  40. Over-Merge of Running Example: T_M [Figure: the over-merged tree T_M of S = 121112212221$.]

  41. Building the lcp Oracle • Definitions: • A node in both T_M and T_o is odd • A node in both T_M and T_e is even • A node with both odd and even descendants is odd/even • For every odd/even node u, find leaves l_{2i} and l_{2j-1} such that u = lca(l_{2i}, l_{2j-1}) • Compute d(u) = lca(l_{2i+1}, l_{2j}) • Compute L̂(u) = depth of u in the tree of d-pointers

  42. Over-Merge of Running Example: T_M [Figure: the over-merged tree T_M of S = 121112212221$, shown again.]

  43. Main Theorem: The function d defines a tree on the odd/even nodes of T_M, and for any l_{2i} and l_{2j-1} we have L̂(lca(l_{2i}, l_{2j-1})) = lcp(S[2i,n], S[2j-1,n])

  44. Helpful Observations: Let u be an odd/even node in T_M. u is either even or odd, and so L(u) is defined. Let u be an even node: 1. For l_{2i} and l_{2j} below u, lcp(S[2i,n], S[2j,n]) ≥ L(u). 2. For l_{2i'-1} and l_{2j'-1} below u, lcp(S[2i'-1,n], S[2j'-1,n]) ≥ L(u). 3. For l_{2i''} and l_{2j''-1} below u, lcp(S[2i'',n], S[2j''-1,n]) ≤ L(u). The proof is symmetric if u is an odd node.

  45. Lemma: The lcp value of any odd and even pair of leaves whose lca is u must be the same. Proof: Suppose lca(l_{2i'}, l_{2j'-1}) = lca(l_{2i''}, l_{2j''-1}) = u. Then lcp(S[2i',n], S[2j'-1,n]) = k ≤ L(u) and lcp(S[2i',n], S[2i'',n]) ≥ L(u) ≥ k, so lcp(S[2i'',n], S[2j'-1,n]) = k. [Figure: the strings S[2i',n], S[2j'-1,n] and S[2i'',n] aligned, with the lengths k and L(u) marked.]

  46. Induction on the lcp: Pick a pair of odd and even suffixes S[2i',n] and S[2j'-1,n]. Base: if S[2i'] ≠ S[2j'-1], then lca = root (recall the merge procedure) ⇒ lcp = 0. Assumption: suppose the theorem is true for lcp < k. Induction step: lcp(S[2i,n], S[2j-1,n]) = k > 0 and u = lca(l_{2i}, l_{2j-1}) ⇒ u ≠ root. Suppose d(u) = lca(l_{2i'+1}, l_{2j'}); then: L̂(u) =(1) 1 + L̂(d(u)) =(2) 1 + lcp(S[2i'+1,n], S[2j',n]) =(3) lcp(S[2i,n], S[2j-1,n])
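The induction rests on a simple string identity: an even suffix and an odd suffix that start with the same character have an lcp one larger than the lcp of the suffixes obtained by dropping that character, and an lcp of 0 otherwise. The brute-force check below illustrates only this identity on the running example; it does not build T_M or the d-pointers, and the function names are hypothetical.

```python
def lcp(S: str, i: int, j: int) -> int:
    """lcp of the suffixes of S starting at 1-based positions i and j (naive scan)."""
    k = 0
    while i + k <= len(S) and j + k <= len(S) and S[i + k - 1] == S[j + k - 1]:
        k += 1
    return k


def check_recurrence(S: str) -> bool:
    """Check lcp(even, odd) = 0 or 1 + lcp(next odd, next even) for all pairs."""
    n = len(S)
    for i in range(2, n + 1, 2):          # even start positions
        for j in range(1, n + 1, 2):      # odd start positions
            if S[i - 1] != S[j - 1]:
                if lcp(S, i, j) != 0:     # base case: first characters differ
                    return False
            elif lcp(S, i, j) != 1 + lcp(S, i + 1, j + 1):
                # i + 1 is odd and j + 1 is even, matching d(u) on slide 41.
                return False
    return True


S = "121112212221$"
holds = check_recurrence(S)
```

Following a d-pointer corresponds to one application of this identity, which is why the depth in the d-pointer tree counts matching characters.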

  47. Done! [Diagram: the complete algorithm scheme of slide 15, steps 1 to 6.]

  48. The End
