Explore Dr. Shazzad Hosain's presentation on the efficient construction of suffix trees using Ukkonen’s algorithm. Learn about the phases, extensions, suffix links, and more. Uncover key insights and tricks for optimal running time.
Linear Time Construction of Suffix Tree
Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU
High-level View of Ukkonen’s Algorithm
• Ukkonen’s algorithm is divided into m phases. In phase i+1, tree i+1 is constructed from tree i.
• Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1].
Example (S begins with aba):
Phase 1: S[1…1], extensions {a}
Phase 2: S[1…2], extensions {ab, b}
Phase 3: S[1…3], extensions {aba, ba, a}
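The phase/extension schedule can be enumerated directly. The sketch below only lists which suffix each extension handles; it does not build a tree. The function name `extension_schedule` is a hypothetical helper, not something from the slides.

```python
def extension_schedule(s):
    """Return one list per phase: the suffixes of s[:i], in extension order."""
    schedule = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]  # S[1..i] in the slides' 1-based notation
        # Extension j of phase i handles the suffix starting at position j.
        schedule.append([prefix[j:] for j in range(i)])
    return schedule


if __name__ == "__main__":
    for phase, suffixes in enumerate(extension_schedule("aba"), start=1):
        print(phase, suffixes)
```

Done naively, the total number of extensions is 1 + 2 + … + m = m(m+1)/2, which is why the later tricks are needed to reach linear time.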
How do suffix links help?
Example string: MISSISSIPI (positions 1234567890)
Phases: 1: M, 2: MI, 3: MIS, 4: MISS, 5: MISSI, 6: MISSIS, 7: MISSISS, 8: MISSISSI, 9: MISSISSIP, 10: MISSISSIPI
[Figure: suffix tree for the example string, with leaves numbered 1 to 9]
Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.
What is achieved so far?
Not so much. Worst-case running time is O(m^2) for a phase.
Trick 1: Skip/Count Trick
There must be a path labeled γ from s(v).
Trick 1: Skip/Count Trick
• There must be a path labeled γ from s(v).
• Walking down along γ character by character takes time proportional to |γ|.
• The skip/count trick reduces the traversal time to something proportional to the number of nodes on the path.
[Figure: path γ = zabcdefghy below s(v); edge lengths along the path: 2, 2, 3, 3]
But what does it buy in terms of worst-case bounds?
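A minimal sketch of the skip/count walk, under stated assumptions: the path for γ is modeled as a single chain of edges with lengths 2, 2, 3, 3 (as the figure labels suggest), and `Node`, `build_chain`, and `skip_count_walk` are hypothetical names. Because γ is known to be in the tree, each step looks only at the first character of an edge and its length, then jumps to the child.

```python
class Node:
    def __init__(self):
        # first character of the edge label -> (start, end, child),
        # where the label is text[start:end]
        self.edges = {}


def build_chain(text, edge_lengths):
    """Build a single root-to-leaf path spelling `text`, split into the given edge lengths."""
    root = Node()
    node, pos = root, 0
    for length in edge_lengths:
        child = Node()
        node.edges[text[pos]] = (pos, pos + length, child)
        node, pos = child, pos + length
    return root


def skip_count_walk(text, node, lo, hi):
    """Walk down from `node` along text[lo:hi], which is assumed to be present.

    Returns (node, offset, hops): the last node reached, how many characters
    of the path extend into the edge below it, and the number of edges skipped.
    """
    hops = 0
    while lo < hi:
        start, end, child = node.edges[text[lo]]  # one lookup per edge
        length = end - start
        if hi - lo < length:
            return node, hi - lo, hops  # the path ends inside this edge
        lo += length  # skip the whole edge without comparing characters
        node = child
        hops += 1
    return node, 0, hops
```

For γ = zabcdefghy the full walk makes 4 hops rather than 10 character comparisons; stopping after 6 characters makes 2 hops and ends 2 characters into the third edge.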
Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).
[Figure: example node-depths: v = 2, s(v) = 1; v = 3, s(v) = 3; v = 4, s(v) = 5]
Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time.
In a single extension, the algorithm:
• Walks up at most one edge (decreases the current node-depth by at most one)
• Finds a suffix link and traverses it (decreases the node-depth by at most another one, by Lemma 6.1.2)
• Walks down some number of nodes (each down-walk step moves to a greater node-depth)
• Applies the suffix extension rules
• May add a suffix link
All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.
• Over the entire phase, the current node-depth is decremented at most 2m times.
• Since no node can have depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase.
• The total number of edge traversals is therefore bounded by 3m.
• Since each edge traversal takes constant time, all the down-walking in a phase is O(m).
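The counting argument behind the 3m bound can be written out as a short derivation. This is a restatement of the bullets above, assuming at most m extensions in a phase; no new result is claimed.

```latex
\begin{align*}
\text{decrements per extension} &\le 1 \;(\text{up-walk}) + 1 \;(\text{suffix link, Lemma 6.1.2}) = 2 \\
\text{decrements per phase}     &\le 2m \\
\text{increments per phase}     &\le \underbrace{m}_{\text{maximum node-depth}} + \underbrace{2m}_{\text{decrements to undo}} = 3m
\end{align*}
```

Each down-walk step is one increment of the current node-depth, so a phase performs at most 3m edge traversals.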
Complexity
• There are m phases
• Each phase takes O(m) time
• So the running time is O(m^2)
Two more tricks and we are done.
Simple Implementation Detail
• A suffix tree may require O(m^2) space if edge labels are written out explicitly.
• Consider a string of 26 distinct characters.
• Every suffix begins with a distinct character, so there are 26 edges out of the root.
• The edge labels require 26 x 27 / 2 = 351 characters in all.
• So O(m) space is impossible to achieve in this representation.
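The 26 x 27 / 2 figure can be checked with a few lines. The lowercase alphabet is used here purely as an assumed stand-in for a string of 26 distinct characters, and `explicit_label_chars` is a hypothetical helper.

```python
import string


def explicit_label_chars(s):
    """Characters needed to write out every edge label when all characters of s are distinct.

    With distinct characters, each suffix s[i:] is a single leaf edge out of
    the root, so the total is the sum of the suffix lengths.
    """
    if len(set(s)) != len(s):
        raise ValueError("this count only holds when all characters are distinct")
    return sum(len(s) - i for i in range(len(s)))


if __name__ == "__main__":
    s = string.ascii_lowercase  # assumption: any 26 distinct characters would do
    print(len(s), explicit_label_chars(s))
```

For m = 26 this gives 351 label characters, against only 26 characters in the string itself.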
Alternative Representation of Suffix Tree: Edge Label Compression
[Figure: a fragment of the suffix tree, and the same fragment with edge labels compressed to index pairs; one label is annotated "could be 8,9"]
• The number of edges is at most 2m – 1, and two numbers are written on each edge, so the space is O(m).
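Edge-label compression amounts to storing a pair of positions instead of a substring. The sketch below uses 1-based inclusive indices as on the slides; the `Edge` class, the helper `max_label_numbers`, and the use of MISSISSIPI as the text are assumptions for illustration, not the slide's own example.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    start: int  # 1-based position of the label's first character
    end: int    # 1-based position of the label's last character (inclusive)

    def label(self, text):
        """Recover the edge label from the text on demand."""
        return text[self.start - 1:self.end]


def max_label_numbers(m):
    """Upper bound on stored integers: at most 2m - 1 edges, two numbers each."""
    return 2 * (2 * m - 1)


if __name__ == "__main__":
    text = "MISSISSIPI"
    print(Edge(8, 9).label(text), max_label_numbers(len(text)))
```

Each edge costs two integers regardless of how long its label is, which is what brings the total space down to O(m).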
Example string: MISSISSIPI (positions 1234567890)
[Figure: implicit suffix tree through phase 8 (MISSISSI), leaves numbered 1 to 4; each extension is marked as explicit or implicit]
Observation 1: Rule 3 (the suffix is already in the tree, so the extension is implicit) is a show stopper. Once it applies, we stop further extensions in the phase.
Example string: MISSISSIPI (positions 1234567890)
[Figure: implicit suffix tree in phase 8 (MISSISSI); leaf edges carry the index pairs 1,7; 2,7; 3,7; 4,7, and e = 8; leaves numbered 1 to 4]
Explicit extensions are the major cost.
Observation 2: Once a leaf, always a leaf.
[Figure: the same tree, phase 8, e = 8]
Once a leaf, always a leaf, so at any phase the cost is only for explicit extensions.
Example string: MISSISSIPI (positions 1234567890)
[Figure: implicit suffix tree after phase 9 (MISSISSIP); edge labels are index pairs such as 1,9; 2,9; 3,9; 4,9; 6,9; 9,9 and 2,5; e = 9]
Once a leaf, always a leaf. At any phase the cost is only for explicit extensions.
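One common way to realize "once a leaf, always a leaf" is to let every leaf edge share a single end marker e, so that advancing e extends all leaves at once. The sketch below assumes that design; `GlobalEnd` and `LeafEdge` are hypothetical names, and indices are 1-based inclusive as on the slides.

```python
class GlobalEnd:
    """A shared, mutable end position (the slides' e)."""

    def __init__(self, value):
        self.value = value


class LeafEdge:
    def __init__(self, start, end):
        self.start = start  # 1-based start position of the label
        self.end = end      # shared GlobalEnd, not a private copy

    def label(self, text):
        return text[self.start - 1:self.end.value]


if __name__ == "__main__":
    text = "MISSISSIPI"
    e = GlobalEnd(7)
    leaves = [LeafEdge(start, e) for start in (1, 2, 3, 4)]  # edges 1,7 2,7 3,7 4,7
    print([leaf.label(text) for leaf in leaves])
    e.value = 8  # a single constant-time update extends every leaf edge
    print([leaf.label(text) for leaf in leaves])
```

Because all leaf edges are extended by one assignment, only the explicit extensions of a phase need individual work.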
Example string: MISSISSIPI (positions 1234567890)
[Figure: extensions in phases 8 and 9]
• Since there are only m phases, the total number of explicit extensions is bounded by 2m.
• So the total amount of down-walking is bounded by O(m).
• In other words, the time to construct the suffix tree is bounded by O(m).
Reference
• Dan Gusfield, Algorithms on Strings, Trees, and Sequences, Chapter 6