
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU

Linear Time Construction of Suffix Tree


Presentation Transcript


  1. Linear Time Construction of Suffix Tree. Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU

  2. Suffix tree. S = xabxac. Suffixes by starting position: 1 = xabxac, 2 = abxac, 3 = bxac, 4 = xac, 5 = ac, 6 = c.

  3. Suffix tree. S = xabxa. Suffixes by starting position: 1 = xabxa, 2 = abxa, 3 = bxa, 4 = xa, 5 = a. [Figure: suffix tree of xabxa.]

  4. Suffix tree (Example). Let s = abab. A suffix tree of s contains all the suffixes of s$ = abab$: { $, b$, ab$, bab$, abab$ }. [Figure: suffix tree of abab$.]
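To make the suffix set on this slide concrete, here is a minimal Python sketch that lists the suffixes of s$ with their 1-based starting positions. The function name `suffixes` and the `terminal` parameter are illustrative assumptions, not part of the slides.

```python
def suffixes(s, terminal="$"):
    """Return (start, suffix) pairs for s + terminal, longest suffix first.

    Start positions are 1-based, matching the slides.
    """
    t = s + terminal
    return [(i + 1, t[i:]) for i in range(len(t))]


for start, suf in suffixes("abab"):
    print(start, suf)
```

The terminal symbol guarantees that no suffix is a prefix of another, which is why every suffix can get its own leaf.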

  5. Trivial algorithm to build a suffix tree for s = abab$. Put the longest suffix, abab$, in. Then put the suffix bab$ in. [Figure: the tree after each insertion.]

  6. Suffixes inserted so far: { abab$, bab$ }. Put the suffix ab$ in. [Figure: the tree before and after the insertion.]

  7. Suffixes inserted so far: { abab$, bab$, ab$ }. Put the suffix b$ in. [Figure: the tree before and after the insertion.]

  8. Suffixes inserted so far: { abab$, bab$, ab$, b$ }. Put the suffix $ in. [Figure: the tree before and after the insertion.]

  9. Suffixes inserted: { abab$, bab$, ab$, b$, $ }. We will also label each leaf with the starting point of the corresponding suffix. [Figure: the completed suffix tree of abab$ with leaves labeled 1 to 5.]
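The trivial algorithm of slides 5 to 9 can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Node` class, `build_naive`, and `leaves` are hypothetical names, edges are stored as explicit label strings, and the input is assumed to end in a terminal symbol that occurs nowhere else.

```python
class Node:
    def __init__(self):
        self.children = {}  # first char of edge label -> (edge label, child Node)
        self.leaf = None    # 1-based start of the suffix; set only on leaves


def build_naive(s):
    """Insert the suffixes of s one at a time, longest first.

    Assumes s ends in a unique terminal symbol such as '$'.
    """
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            if rest[0] not in node.children:
                # No edge starts with the next character: hang a new leaf here.
                leaf = Node()
                leaf.leaf = start + 1
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = node.children[rest[0]]
            # Length of the common prefix of the edge label and the rest.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it at offset k.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[label[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaves(node, path=""):
    """Map each root-to-leaf path label to the leaf's suffix start."""
    if node.leaf is not None:
        return {path: node.leaf}
    found = {}
    for label, child in node.children.values():
        found.update(leaves(child, path + label))
    return found


tree = build_naive("abab$")
print(leaves(tree))
```

Each insertion may scan a whole suffix character by character, which is where the quadratic cost of this approach comes from.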

  10. Naive Construction – More Example. S = abbcbab#. [Figure: suffix tree of abbcbab# with each leaf labeled by the starting position of its suffix.]

  11. Analysis. The naive construction takes O(n^2) time. We will see how to do it in O(n) time.

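  12. Ukkonen’s linear-time Suffix Tree Algorithm • Implicit suffix tree: start from the suffix tree for S$, remove every copy of the terminal symbol $ from the edge labels, then remove any edge that has no label, and then remove any node that does not have at least two children.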

  13. Implicit Suffix Tree – More Example. Suffixes of abab$: { abab$, bab$, ab$, b$, $ }. [Figure: suffix tree of abab$ with leaves labeled 1 to 5.] Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all the suffixes of S. Let I_i denote the implicit suffix tree of the string S[1…i].
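A suffix keeps its own leaf in the implicit suffix tree exactly when it is not a prefix of another suffix. The brute-force Python sketch below illustrates that observation; the function name `implicit_leaf_suffixes` is an assumption made for this example, and the check is quadratic on purpose.

```python
def implicit_leaf_suffixes(s):
    """Suffixes of s (no terminal symbol) that still end at a leaf.

    A suffix loses its leaf when it is a proper prefix of another suffix,
    because then its path ends inside the tree instead of at a leaf.
    """
    sufs = [s[i:] for i in range(len(s))]
    return [x for x in sufs
            if not any(y != x and y.startswith(x) for y in sufs)]


print(implicit_leaf_suffixes("abab"))    # only abab and bab keep a leaf
print(implicit_leaf_suffixes("xabxa"))   # xa and a are absorbed into longer paths
print(implicit_leaf_suffixes("xabxac"))  # the last character is unique, so all six remain
```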

  14. Ukkonen’s Algorithm at a High Level • Construct an implicit suffix tree I_i for each prefix S[1…i] of S, starting from I_1 and incrementing i by one until I_m is built, where m is the length of the string S. • The true suffix tree for S is constructed from I_m, and the time for the entire algorithm is O(m).

  15. High-level Description of Ukkonen’s Algorithm • Ukkonen’s algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i. • Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1].

  16. Naïve Algorithm of Suffix Tree. Suffixes of abab$: { abab$, bab$, ab$, b$, $ }. [Figure: suffix tree of abab$ with leaves labeled 1 to 5.]

  17. High-level of Ukkonen’s Algorithm • Ukkonen’s algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i. • Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1]. Example for S = aba: I_1: S[1…1] {a}; I_2: S[1…2] {ab, b}; I_3: S[1…3] {aba, ba, a}. [Figure: the trees I_1, I_2, I_3, with phases and extensions marked.]
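The phase and extension structure can be enumerated with a short Python sketch. It only lists which suffix each extension handles and builds no tree; the generator name `phases` is an illustrative assumption.

```python
def phases(s):
    """Yield (i, suffixes of S[1..i]) for each phase i = 1..m.

    The suffixes are listed in extension order, longest first.
    """
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        yield i, [prefix[j:] for j in range(i)]


for i, exts in phases("aba"):
    print(f"phase {i}: {exts}")
```

For a string of length m there are m(m+1)/2 extensions in total, so each extension has to be cheap on average for the whole algorithm to run in O(m) time.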

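  18. Implemented naively, this high-level algorithm takes O(m^3) time. Example for S = aba: I_1: S[1…1] {a}; I_2: S[1…2] {ab, b}; I_3: S[1…3] {aba, ba, a}. [Figure: the trees I_1, I_2, I_3, with extensions marked.]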

  19. Suffix Extension Rules. Suppose I_i is already built and we want to extend it to I_{i+1}. Let β = S[j…i] be a suffix of S[1…i]. Rule 1: If path β ends at a leaf, character S(i+1) is added to the end of the label of that leaf edge. Rule 2: Some path from the end of string β starts with character S(i+1). In this case the string βS(i+1) is already in the tree, so do nothing. Example for S = abab: I_1: S[1…1] {a}; I_2: S[1…2] {ab, b}; I_3: S[1…3] {aba, ba, a}; I_4: S[1…4] {abab, bab, ab, b}. [Figure: the trees I_1 to I_4, with β and S(i+1) marked on a leaf edge.]

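  20. Suffix Extension Rules. Suppose I_i is already built and we want to extend it to I_{i+1}. Rule 3: No path from the end of string β starts with character S(i+1), but at least one labeled path continues from the end of β. Add a new leaf edge labeled S(i+1), creating a new internal node if β ends inside an edge. Example: let I_5 be drawn for S = axabxb (positions 123456); now extend it to I_6. The extensions axabxb, xabxb, abxb, and bxb use Rule 1; xb uses Rule 3; b uses Rule 2. Applied naively, the rules give an O(m^3) algorithm.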
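The choice of rule in each extension can be checked by brute force on the string itself, without building a tree. The Python sketch below does this for one phase. The function name `classify_extensions` is an assumption for illustration, and the three cases are named by what they do: extending a leaf edge is Rule 1, doing nothing is Rule 2 on these slides, and adding a new leaf is Rule 3 on these slides.

```python
def classify_extensions(s, i):
    """Classify the extensions of phase i+1, which appends c = s[i] (0-based).

    For each suffix beta of s[:i], longest first and ending with the empty
    string, report which case applies when beta + c is added.
    Assumes 1 <= i < len(s).
    """
    prefix, c = s[:i], s[i]
    result = []
    for j in range(i + 1):
        beta = prefix[j:]
        if beta and prefix.find(beta) == j:
            # beta occurs only as a suffix of the prefix, so its path
            # ends at a leaf in the implicit tree.
            case = "extend leaf edge"
        elif beta + c in prefix:
            # beta followed by c already occurs, so it is already in the tree.
            case = "do nothing"
        else:
            case = "add new leaf"
        result.append((beta + c, case))
    return result


for ext, case in classify_extensions("axabxb", 5):
    print(ext, "->", case)
```

On axabxb this reproduces the slide's example: four leaf-edge extensions, one new leaf for xb, and nothing to do for b.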

  21. Implementation and Speedup, Suffix Links. Definition: Let xα denote an arbitrary string, where x is a single character and α a substring (possibly empty). For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link. Does the root have a suffix link? No, because it is not an internal node. Every internal node has a suffix link.

  22. Suffix Links – More Example. S = abbcbab#. [Figure: suffix tree of abbcbab# with a suffix link drawn from node v to node s(v).] Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension. Corollary 6.1.2: In any implicit suffix tree I_i, if internal node v has path-label xα, then there is a node s(v) of I_i with path-label α.
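The suffix-link definition can be illustrated on abbcbab# with a brute-force Python sketch. It finds the path-labels of internal nodes as the substrings followed by at least two distinct characters, then drops the first character of each to get the link target. The function names are assumptions for this example, the input is assumed to end in a unique terminal symbol, and the empty string stands for the root.

```python
def internal_path_labels(s):
    """Path-labels of the non-root internal nodes of the suffix tree of s.

    Brute force: a substring labels an internal node exactly when it is
    followed by at least two distinct characters somewhere in s.
    Assumes s ends in a unique terminal symbol.
    """
    followers = {}
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            followers.setdefault(s[i:j], set()).add(s[j])
    return {sub for sub, chars in followers.items() if len(chars) >= 2}


def suffix_links(s):
    """Map each internal path-label x + alpha to alpha ('' is the root)."""
    return {v: v[1:] for v in internal_path_labels(s)}


links = suffix_links("abbcbab#")
print(links)
# Every link target is itself an internal node or the root.
print(all(t == "" or t in links for t in links.values()))
```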

  23. Example: S = MISSISSIPI (positions 1234567890). Phases: 1: M; 2: MI; 3: MIS; 4: MISS; 5: MISSI; 6: MISSIS; 7: MISSISS; 8: MISSISSI; 9: MISSISSIP; 10: MISSISSIPI. [Figure: the implicit suffix tree of MISSISSIPI with numbered leaves.] Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.

  24. How do suffix links help? Example: S = MISSISSIPI (positions 1234567890). Phases: 1: M; 2: MI; 3: MIS; 4: MISS; 5: MISSI; 6: MISSIS; 7: MISSISS; 8: MISSISSI; 9: MISSISSIP; 10: MISSISSIPI. [Figure: the implicit suffix tree of MISSISSIPI with numbered leaves.] Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.

  25. What is achieved so far? Not so much. Worst-case running time is O(m^2) for a phase.

  26. Trick 1: Skip/Count Trick. There must be a γ path from s(v).

  27. Trick 1: Skip/Count Trick. There must be a γ path from s(v). Walking down along γ character by character takes time proportional to |γ|. The skip/count trick reduces the traversal time to something proportional to the number of nodes on the path. [Figure: a path spelling zabcdefghy through nodes, with edge lengths 2, 2, 3, 3.] But what does it buy in terms of worst-case bounds?
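The skip/count walk can be sketched in Python as follows. Because γ is known to be in the tree, each step looks only at the first character of the remaining string to pick an edge, then jumps over the whole edge by comparing lengths. The `Node` class, `add_edge`, `skip_count`, and the chain-shaped tree with edge labels za, bc, def, ghy are hypothetical and serve only as an illustration.

```python
class Node:
    def __init__(self):
        self.children = {}  # first char of edge label -> (edge label, child Node)


def add_edge(parent, label):
    """Attach a new child below parent via an edge with the given label."""
    child = Node()
    parent.children[label[0]] = (label, child)
    return child


def skip_count(node, gamma):
    """Walk gamma down from node, assuming gamma is known to be in the tree.

    Returns (last node reached, label of the partially used edge or None,
    number of characters used on that edge, number of edges examined).
    """
    h = 0      # characters of gamma consumed so far
    hops = 0   # edges examined, one per node visited on the way down
    while h < len(gamma):
        label, child = node.children[gamma[h]]
        hops += 1
        remaining = len(gamma) - h
        if len(label) <= remaining:
            node = child            # skip the whole edge without comparing it
            h += len(label)
        else:
            return node, label, remaining, hops
    return node, None, 0, hops


root = Node()
n1 = add_edge(root, "za")
n2 = add_edge(n1, "bc")
n3 = add_edge(n2, "def")
n4 = add_edge(n3, "ghy")

node, edge, offset, hops = skip_count(root, "zabcdefgh")
print(edge, offset, hops)
```

Here a 9-character string is located after examining 4 edges, so the cost depends on the number of nodes on the path instead of on |γ|.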

  28. Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v). [Figure: example suffix links with node-depths labeled s(v)=1, v=2; s(v)=3, v=3; s(v)=5, v=4.]

  29. Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v). Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time. In a single extension: • The algorithm walks up at most one edge • Finds a suffix link and traverses it • Walks down some number of nodes • Applies the suffix extension rules • And may add a suffix link. All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.

  30. Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v). Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time. In a single extension: • The algorithm walks up at most one edge, which decreases the current node-depth by at most one • Finds a suffix link and traverses it, which decreases the node-depth by at most another one • Walks down some number of nodes, and each down-walk moves to a greater node-depth • Applies the suffix extension rules • And may add a suffix link. All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed. • Over the entire phase, the current node-depth is decremented at most 2m times • Since no node can have depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase • The total number of edge traversals is therefore bounded by 3m • Since each edge traversal takes constant time with the skip/count trick, all the down-walking in a phase is O(m).

  31. Complexity • There are m phases • Each phase takes O(m) • So the running time is O(m^2). Two more tricks and we are done.

  32. Reference • Dan Gusfield, Algorithms on Strings, Trees, and Sequences, Chapter 6
