Linear Time Construction of Suffix Tree. Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU
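Suffix tree: S = xabxac
Suffixes by starting position: 1 = xabxac, 2 = abxac, 3 = bxac, 4 = xac, 5 = ac, 6 = c
[Figure: suffix tree of xabxac]
Suffix tree: S = xabxa
Suffixes by starting position: 1 = xabxa, 2 = abxa, 3 = bxa, 4 = xa, 5 = a
[Figure: tree for xabxa]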
Suffix tree (Example): Let s = abab. A suffix tree of s contains all the suffixes of s$ = abab$: { $, b$, ab$, bab$, abab$ }
[Figure: suffix tree of abab$]
Trivial algorithm to build a suffix tree, s = abab$
• Put the longest suffix, abab$, in. The tree now holds { abab$ }
• Put the suffix bab$ in. The tree now holds { abab$, bab$ }
• Put the suffix ab$ in. The tree now holds { abab$, bab$, ab$ }
• Put the suffix b$ in. The tree now holds { abab$, bab$, ab$, b$ }
• Put the suffix $ in. The tree now holds { abab$, bab$, ab$, b$, $ }
We will also label each leaf with the starting point of the corresponding suffix.
[Figure: suffix tree of abab$ with leaves labeled 1 to 5]
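The insertion steps above can be sketched in Python. This is only an illustrative sketch of the trivial algorithm, not code from the slides: the names `Node`, `naive_suffix_tree` and `leaves` are made up here, edge labels are stored as plain strings for readability, and the input is assumed to end in a terminator that occurs nowhere else.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}   # first character of edge label -> (edge label, child Node)
        self.leaf = leaf     # 1-based start position of the suffix (leaves only)


def naive_suffix_tree(s):
    """Insert the suffixes of s one by one, longest first."""
    assert s and s.count(s[-1]) == 1, "s must end in a unique terminator"
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this character: hang a new leaf here.
                node.children[first] = (rest, Node(leaf=start + 1))
                break
            label, child = node.children[first]
            # Length of the common prefix of the edge label and the rest of the suffix.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # The whole edge matches: continue from the child node.
                node, rest = child, rest[k:]
            else:
                # Mismatch inside the edge: split the edge and add a new leaf.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                mid.children[rest[k]] = (rest[k:], Node(leaf=start + 1))
                node.children[first] = (label[:k], mid)
                break
    return root


def leaves(node, prefix=""):
    """Return {path label: leaf number} for every leaf below node."""
    if node.leaf is not None:
        return {prefix: node.leaf}
    found = {}
    for label, child in node.children.values():
        found.update(leaves(child, prefix + label))
    return found


tree = naive_suffix_tree("abab$")
print(sorted(leaves(tree).items(), key=lambda item: item[1]))
```

Each insertion may compare up to the whole suffix against existing edge labels, which is where the quadratic cost of this approach comes from.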
Naive Construction – More Example: S = abbcbab#
[Figure: suffix tree of abbcbab# with each leaf labeled by the starting position of its suffix]
Analysis: the naive construction takes O(n^2) time. We will see how to do it in O(n) time.
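Ukkonen’s linear-time Suffix Tree Algorithm
• Implicit Suffix Tree: obtained from the suffix tree for S$ as follows
• Remove every copy of the terminal symbol $ from the edge labels of the tree
• Then remove any edge that has no label
• Then remove any node that does not have at least two children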
Implicit Suffix Tree – More Example: { abab$, bab$, ab$, b$, $ }
[Figure: suffix tree of abab$ with leaves labeled 1 to 5, and the implicit suffix tree derived from it]
Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all the suffixes of S.
Let I_i denote the implicit suffix tree of the string S[1…i].
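To see which suffixes lose their leaf in an implicit suffix tree, note that a suffix ends at a leaf exactly when it occurs only once in the string; otherwise it is a prefix of a longer suffix and its path continues. The sketch below checks this by brute force on the string itself rather than on a tree. The function names are hypothetical and positions are 1-based as on the slides.

```python
def count_overlapping(text, pattern):
    """Number of (possibly overlapping) occurrences of a non-empty pattern in text."""
    return sum(text.startswith(pattern, k)
               for k in range(len(text) - len(pattern) + 1))


def suffixes_with_leaves(s):
    """1-based start positions of the suffixes of s that end at a leaf
    in the implicit suffix tree of s (no terminator is appended)."""
    return [j + 1 for j in range(len(s)) if count_overlapping(s, s[j:]) == 1]


print(suffixes_with_leaves("abab"))    # suffixes ab and b have no leaf of their own
print(suffixes_with_leaves("abab$"))   # with the terminator every suffix gets a leaf
```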
Ukkonen’s Algorithm at a High Level
• Construct an implicit suffix tree I_i for each prefix S[1..i] of S, starting from I_1 and incrementing i by one until I_m is built, where m is the length of the string S.
• The true suffix tree for S is constructed from I_m, and the time for the entire algorithm is O(m).
High-level Description of Ukkonen’s Algorithm
• Ukkonen’s algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i.
• Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1].
Naïve Algorithm of Suffix Tree: { abab$, bab$, ab$, b$, $ }
[Figure: suffix tree of abab$ with leaves labeled 1 to 5]
High-level of Ukkonen’s Algorithm – Example
• I_1: S[1…1], suffixes {a}
• I_2: S[1…2], suffixes {ab, b}
• I_3: S[1…3], suffixes {aba, ba, a}
[Figure: the implicit suffix trees I_1, I_2 and I_3, with phases and extensions marked]
Implemented directly, this high-level algorithm takes O(m^3) time: there are m phases, up to m extensions per phase, and finding the end of each suffix from the root can take O(m) time.
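The phase and extension structure can be illustrated with a deliberately simple sketch. It assumes an uncompressed trie (nested dictionaries, one character per edge) instead of a real implicit suffix tree, so it only shows the loop structure and the cubic amount of walking, not Ukkonen’s data structure. The names `build_by_phases` and `contains` are invented for this example.

```python
def build_by_phases(s):
    """Run the phase/extension loops on an uncompressed trie (dict of dicts).

    Returns the trie and the number of characters walked from the root
    while locating the ends of the suffixes.
    """
    root = {}
    steps = 0
    for i in range(len(s)):            # one phase per character s[i] (0-based)
        for j in range(i + 1):         # one extension per suffix s[j:i+1]
            node = root
            for ch in s[j:i]:          # walk from the root to the end of s[j:i]
                node = node[ch]
                steps += 1
            node.setdefault(s[i], {})  # make sure s[j:i] followed by s[i] is present
    return root, steps


def contains(root, word):
    """True if word labels a path from the root of the trie."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie, steps = build_by_phases("aba")
print(contains(trie, "ba"), contains(trie, "bb"), steps)
```

The three nested loops are the source of the O(m^3) bound; the speedups in the rest of the deck aim at removing the innermost walk from the root.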
Suffix Extension Rules
Suppose I_i is already built and we want to extend it to I_{i+1}. Let β = S[j…i] be a suffix of S[1…i].
Rule 1: If path β ends at a leaf, character S(i+1) is added to the end of the label of that leaf edge.
Rule 2: Some path from the end of string β starts with character S(i+1). In this case the string βS(i+1) is already in the tree, so do nothing.
Example for S = abab:
• I_1: S[1…1], suffixes {a}
• I_2: S[1…2], suffixes {ab, b}
• I_3: S[1…3], suffixes {aba, ba, a}
• I_4: S[1…4], suffixes {abab, bab, ab, b}
[Figure: the trees I_1 to I_4, with β and S(i+1) marked on a path]
Suffix Extension Rules (continued)
Rule 3: No path from the end of string β starts with character S(i+1), but at least one labeled path continues from the end of β. Add a new leaf edge labeled S(i+1) at the end of β, creating a new internal node there if β ends inside an edge.
Example: S = axabxb (positions 123456). Suppose I_5 is drawn for axabx; now extend to I_6:
• axabxb, xabxb, abxb, bxb: Rule 1
• xb: Rule 3
• b: Rule 2
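Which rule fires in each extension can be checked without building a tree, because I_i contains exactly the substrings of S[1…i]. The following sketch makes that assumption explicit and classifies every extension by brute-force substring tests. Rule numbers follow the numbering used on these slides, and the function names are hypothetical.

```python
def count_overlapping(text, pattern):
    """Number of (possibly overlapping) occurrences of a non-empty pattern in text."""
    return sum(text.startswith(pattern, k)
               for k in range(len(text) - len(pattern) + 1))


def extension_rules(s, i):
    """Rule applied by each extension j = 1..i+1 when going from I_i to I_(i+1).

    i is the 1-based length of the prefix already processed, 0 <= i < len(s).
    """
    prefix, c = s[:i], s[i]            # S[1..i] and the new character S(i+1)
    rules = []
    for j in range(1, i + 2):
        beta = prefix[j - 1:]          # S[j..i]; empty when j = i+1
        if beta and count_overlapping(prefix, beta) == 1:
            rules.append(1)            # beta ends at a leaf: extend the leaf edge
        elif beta + c in prefix:
            rules.append(2)            # beta followed by S(i+1) already present
        else:
            rules.append(3)            # no continuation with S(i+1): new leaf
    return rules


print(extension_rules("axabxb", 5))
```

For the axabxb example this reports Rule 1 for the first four extensions, then Rule 3 for xb and Rule 2 for b, in line with the slide.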
Implementation and Speedup, Suffix Links
Definition: Let xα denote an arbitrary string, where x denotes a single character and α denotes a (possibly empty) substring. For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.
• Does the root have a suffix link? No, because it is not an internal node.
• Every internal node has a suffix link.
Suffix Links – More Example: S = abbcbab#
[Figure: suffix tree of abbcbab# with a suffix link drawn from node v to node s(v)]
Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.
Corollary 6.1.2: In any implicit suffix tree I_i, if internal node v has path-label xα, then there is a node s(v) of I_i with path-label α.
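The suffix links of a small example can be listed by brute force. The sketch below relies on the fact that an internal (non-root) node of the suffix tree corresponds to a substring that is followed by at least two different characters; it is a check on small inputs, not the way Ukkonen’s algorithm sets the links. The function names are invented, the input is assumed to end in a unique terminator, and an empty target stands for the root.

```python
from collections import defaultdict


def internal_path_labels(s):
    """Path-labels of the internal (branching, non-root) nodes of the suffix tree of s."""
    followers = defaultdict(set)
    for a in range(len(s)):
        for b in range(a + 1, len(s)):
            followers[s[a:b]].add(s[b])   # substring s[a:b] is followed by s[b]
    return {w for w, chars in followers.items() if len(chars) >= 2}


def suffix_links(s):
    """Map each internal node's path-label x+alpha to alpha ("" means the root)."""
    labels = internal_path_labels(s)
    links = {}
    for w in labels:
        alpha = w[1:]
        # The target of every link must itself be a node: internal or the root.
        assert alpha == "" or alpha in labels
        links[w] = alpha
    return links


print(suffix_links("abbcbab#"))
```

For abbcbab# the only internal nodes are ab and b, so the link from ab goes to b, and the link from b goes to the root.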
Example: S = MISSISSIPI (positions 1234567890)
Phases: 1: M, 2: MI, 3: MIS, 4: MISS, 5: MISSI, 6: MISSIS, 7: MISSISS, 8: MISSISSI, 9: MISSISSIP, 10: MISSISSIPI
[Figure: the implicit suffix trees built for MISSISSIPI phase by phase, with leaves labeled by suffix starting position]
How do suffix links help?
[Figure: the same MISSISSIPI trees, with suffix links used to move from one extension to the next]
What is achieved so far? Not so much. Worst-case running time is O(m^2) for a phase.
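Trick 1: Skip/Count Trick
Let v be the node reached by walking up from the end of the previous suffix, and let γ be the string between v and that end. There must be a γ path from s(v).
• Walking down along γ character by character takes time proportional to |γ|.
• The skip/count trick reduces the traversal time to something proportional to the number of nodes on the path: because the γ path is known to exist, only the first character of each edge is checked, and an edge no longer than the rest of γ is skipped in one step.
[Figure: a path labeled zabcdefghy made of edges of length 2, 2, 3, 3; the walk jumps from node to node]
But what does it buy in terms of worst-case bounds?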
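A minimal sketch of the skip/count walk follows. It assumes edges are stored as index pairs into the text, keyed by their first character, and that the γ path is guaranteed to exist below the starting node. The `Node` class and the `skip_count` function are illustrative names, not part of the slides.

```python
class Node:
    def __init__(self):
        # first character of edge label -> (start, end, child); label is text[start:end]
        self.children = {}


def skip_count(text, node, start, length):
    """Walk down from node along gamma = text[start:start + length].

    Only the first character of each edge is inspected. Returns
    (node, offset, hops): the last node reached, how many characters of
    gamma extend past it into the next edge, and the number of edges jumped.
    """
    hops = 0
    while length > 0:
        e_start, e_end, child = node.children[text[start]]
        edge_len = e_end - e_start
        if length < edge_len:
            break                  # gamma ends inside this edge
        start += edge_len          # skip the whole edge in one step
        length -= edge_len
        node = child
        hops += 1
    return node, length, hops


# A single path labeled zabcdefghy, cut into edges of length 2, 2, 3, 3.
text = "zabcdefghy"
nodes = [Node() for _ in range(5)]
for k, (a, b) in enumerate([(0, 2), (2, 4), (4, 7), (7, 10)]):
    nodes[k].children[text[a]] = (a, b, nodes[k + 1])

end_node, offset, hops = skip_count(text, nodes[0], 0, 9)   # gamma = zabcdefgh
print(offset, hops)
```

Here a γ of nine characters is located with three edge jumps plus one partial edge, rather than nine character comparisons.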
Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).
[Figure: node-depths of v and s(v) for several suffix links]
Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time.
In a single extension, the algorithm:
• Walks up at most one edge (decreases the current node-depth by at most one)
• Finds a suffix link and traverses it (decreases the node-depth by at most another one, by Lemma 6.1.2)
• Walks down some number of nodes (each down-walk step moves to a greater node-depth)
• Applies the suffix extension rules
• And may add a suffix link
All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.
• Over the entire phase, the current node-depth is decremented at most 2m times
• Since no node can have depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase
• So the total number of edge traversals is bounded by 3m
• Since each edge traversal takes constant time, all the down-walking in a phase is O(m)
Complexity
• There are m phases
• Each phase takes O(m)
• So the running time is O(m^2)
Two more tricks and we are done.
Reference
• Dan Gusfield, Algorithms on Strings, Trees and Sequences, Chapter 6