Linear Time Construction of Suffix Tree
Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU
High-level Description of Ukkonen’s Algorithm
• Ukkonen’s algorithm is divided into m phases. In phase i+1, the implicit suffix tree for S[1…i+1] is constructed from the implicit suffix tree for S[1…i].
• Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1].
Example for S = aba (one phase per line, one extension per listed suffix):
• Phase 1: S[1…1], suffixes {a}
• Phase 2: S[1…2], suffixes {ab, b}
• Phase 3: S[1…3], suffixes {aba, ba, a}
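The phase/extension schedule can be sketched in a few lines. This is only an illustration of which suffix each extension handles, not the tree construction itself; the function name `phases_and_extensions` is made up for this sketch, and Python indices are 0-based whereas the slides use 1-based positions.

```python
def phases_and_extensions(s):
    """For each phase i (1-based), list the suffixes of s[:i] in extension order."""
    schedule = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]                      # S[1..i] in the slides' notation
        # Extension j (1-based) handles the suffix S[j..i] of the prefix.
        schedule.append([prefix[j:] for j in range(i)])
    return schedule


for phase, suffixes in enumerate(phases_and_extensions("aba"), start=1):
    print(phase, suffixes)
# 1 ['a']
# 2 ['ab', 'b']
# 3 ['aba', 'ba', 'a']
```

Doing every extension naively gives 1 + 2 + … + m extensions in total, which is why the later tricks are needed.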
How do suffix links help?
Example string: MISSISSIPI (positions 1234567890, where 0 stands for position 10)
• Phases: 1: M, 2: MI, 3: MIS, 4: MISS, 5: MISSI, 6: MISSIS, 7: MISSISS, 8: MISSISSI, 9: MISSISSIP, 10: MISSISSIPI
[Figure: suffix tree for MISSISSIPI showing suffix links]
Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.
What is achieved so far? Not so much: the worst-case running time is still O(m²) for a phase.
Trick 1: Skip/Count Trick
• There must be a path labelled γ from s(v).
• Walking down along γ character by character takes time proportional to |γ|.
• The skip/count trick reduces the traversal time to something proportional to the number of nodes on the path.
[Figure: skip/count example on the string zabcdefghy; the nodes on the path are separated by edges of length 2, 2, 3, 3]
• But what does it buy in terms of worst-case bounds?
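A minimal sketch of the skip/count down-walk, assuming edges are stored as (start, end, child) index triples keyed by their first character. The `Node` class, the `skip_count` function, and the hand-built chain are all hypothetical names for illustration. Because the path for γ is known to exist, only the first character of each edge is inspected, and whole edges are skipped by comparing lengths.

```python
class Node:
    def __init__(self):
        # first character of edge label -> (start, end, child), end exclusive
        self.children = {}


def skip_count(text, node, start, length):
    """Walk down text[start:start+length] from `node`, assuming the path exists.

    Returns (node, edge_char, offset, hops): the last node reached, the first
    character of the edge the walk ends inside (None if it ends at a node),
    the number of characters consumed on that edge, and the nodes hopped over.
    """
    hops = 0
    while length > 0:
        e_start, e_end, child = node.children[text[start]]
        edge_len = e_end - e_start
        if length < edge_len:
            return node, text[start], length, hops   # ends inside this edge
        start += edge_len                            # skip the whole edge
        length -= edge_len
        node = child
        hops += 1
    return node, None, 0, hops


# A single chain whose edges have lengths 2, 2, 3, 3, as in the figure.
text = "zabcdefghy"
nodes = [Node() for _ in range(5)]
for k, (a, b) in enumerate([(0, 2), (2, 4), (4, 7), (7, 10)]):
    nodes[k].children[text[a]] = (a, b, nodes[k + 1])
root = nodes[0]

print(skip_count(text, root, 0, 10)[3])  # 4 node hops for 10 characters
```

The loop runs once per edge rather than once per character, which is the saving the slide describes.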
Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).
[Figure: node-depth labels s(v) = 1, v = 2; s(v) = 3, v = 3; s(v) = 5, v = 4]
Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time.
In a single extension, the algorithm:
• Walks up at most one edge (decreases the current node-depth by at most one)
• Finds a suffix link and traverses it (decreases the node-depth by at most another one, by Lemma 6.1.2)
• Walks down some number of nodes (each down-walk step moves to a greater node-depth)
• Applies the suffix extension rules
• And may add a suffix link
All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.
• Over the entire phase, the current node-depth is decremented at most 2m times.
• Since no node can have depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase.
• So the total number of edge traversals is bounded by 3m.
• Since each edge traversal takes constant time with the skip/count trick, all the down-walking in a phase is O(m).
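The counting argument can be written as a short worked bound. The symbols are introduced only for this sketch: d_0 and d_end are the current node-depth at the start and end of the phase, U is the total decrease in node-depth from up-walks and suffix-link traversals, and D is the number of down-walk edge traversals.

```latex
\begin{align*}
d_{\mathrm{end}} &\ge d_0 - U + D
  && \text{(a suffix link can only add depth beyond this)} \\
U &\le 2m, \qquad d_0 \ge 0, \qquad d_{\mathrm{end}} \le m \\
D &\le d_{\mathrm{end}} - d_0 + U \;\le\; m + 2m \;=\; 3m
\end{align*}
```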
Complexity
• There are m phases.
• Each phase takes O(m) time.
• So the running time is O(m²).
Two more tricks and we are done.
Simple Implementation Detail
• A suffix tree may require O(m²) space if edge labels are written out explicitly.
• Consider the string abcdefghijklmnopqrstuvwxyz.
• Every suffix begins with a distinct character, so there are 26 edges out of the root.
• The edge labels require 26 × 27 / 2 = 351 characters in all.
• So an O(m) bound is impossible to achieve in this representation.
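The arithmetic for this example can be checked directly. The sketch below assumes, as in the slide, that every suffix starts with a distinct character, so each suffix labels its own edge out of the root; `explicit_label_chars` is a hypothetical helper name.

```python
import string


def explicit_label_chars(s):
    """Total characters written if each suffix of s labels its own root edge.

    Only valid when all characters of s are distinct, so that no two
    suffixes share a prefix and the tree is a star of len(s) leaf edges.
    """
    assert len(set(s)) == len(s), "sketch assumes all characters are distinct"
    m = len(s)
    return sum(m - j for j in range(m))   # suffix lengths m, m-1, ..., 1


alphabet = string.ascii_lowercase
print(len(alphabet), explicit_label_chars(alphabet))  # 26 351
```

In general the sum is m(m+1)/2, which grows quadratically in m.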
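Alternative Representation of Suffix Tree: Edge Label Compression
• Each edge label is replaced by a pair of indices giving the start and end positions of that substring in S.
[Figure: a fragment of the suffix tree, and the same fragment with compressed edge labels; one compressed label is annotated "Could be 8,9"]
• The number of edges is at most 2m - 1, and two numbers are written on each edge, so the space is O(m).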
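As a small sketch of edge-label compression, an edge can store two integers and recover its label from S on demand. The `Edge` class and `label` helper are hypothetical names; positions are 1-based and inclusive to match the slides, and the index pairs used below are chosen only to illustrate the lookup.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    start: int   # 1-based position of the first label character in S
    end: int     # 1-based position of the last label character (inclusive)


def label(s, edge):
    """Recover the edge label from the string using the stored index pair."""
    return s[edge.start - 1:edge.end]


S = "MISSISSIPI"
print(label(S, Edge(3, 7)))   # SSISS
print(label(S, Edge(2, 5)))   # ISSI
```

Each edge costs two integers regardless of how long its label is, which is what brings the total space down to O(m).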
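Example string: MISSISSIPI (positions 1234567890)
• Phases: 1: M, 2: MI, 3: MIS, 4: MISS, 5: MISSI, 6: MISSIS, 7: MISSISS (positions 1234567), 8: MISSISSI (positions 12345678)
[Figure: implicit suffix tree with leaves 1 to 4, distinguishing explicit extensions from implicit extensions]
Observation 1: Rule 3 is a show stopper. Once Rule 3 applies in an extension (the suffix is already in the tree), we stop further extensions in that phase.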
[Figure: implicit suffix tree for MISSISSIPI through phase 8; leaf edges carry index pairs such as 1,7 2,7 3,7 4,7, and the global end index is updated to e = 8]
Observation 2: Once a leaf, always a leaf.
• Explicit extensions are the major cost.
• In any phase, the cost is only for the explicit extensions.
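One common way to exploit "once a leaf, always a leaf" is to let every leaf edge share a single global end index e, so that all leaves are extended by one assignment. The sketch below uses hypothetical class names (`GlobalEnd`, `LeafEdge`) and illustrative leaf start positions 1 to 4; it is not meant to reproduce the exact edges of the figure.

```python
class GlobalEnd:
    """Shared, mutable end index e used by every leaf edge."""

    def __init__(self, value):
        self.value = value


class LeafEdge:
    def __init__(self, start, end_ref):
        self.start = start        # 1-based start position in S
        self.end_ref = end_ref    # reference to the shared global end

    def label(self, s):
        return s[self.start - 1:self.end_ref.value]


S = "MISSISSIPI"
e = GlobalEnd(7)                                   # state after phase 7
leaves = [LeafEdge(k, e) for k in (1, 2, 3, 4)]

e.value = 8                                        # phase 8: one O(1) update
print([leaf.label(S) for leaf in leaves])
# ['MISSISSI', 'ISSISSI', 'SSISSI', 'SISSI']
```

All Rule 1 extensions of a phase are thus done implicitly, leaving only the explicit extensions to pay for.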
[Figure: implicit suffix tree after phase 9 (MISSISSIP), with global end index e = 9 and edge labels shown as index pairs such as 2,5 and 1,9]
• Since there are only m phases, and consecutive phases share at most one explicit extension, the total number of explicit extensions is bounded by 2m.
• So the total time spent in down-walks is bounded by O(m).
• Hence the time to construct the suffix tree is bounded by O(m).
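To see the pieces working together, here is a compact sketch of Ukkonen's algorithm in the widely used "active point" formulation. This is an assumption about one possible implementation, not the slides' own code: all names are hypothetical, indices are 0-based, leaf edges use `end = None` as the shared global end, and the input is assumed to end with a unique terminator such as `$`. The counter tallies extensions that do explicit work (Rule 2 or the Rule 3 stop).

```python
class Node:
    def __init__(self, start, end):
        self.start, self.end = start, end   # edge label text[start:end]; end None = leaf
        self.children = {}
        self.link = None                    # suffix link


def build_suffix_tree(text):
    root = Node(-1, -1)
    node, edge, length = root, 0, 0         # active point
    remaining = explicit = 0
    for i, c in enumerate(text):            # one phase per character
        remaining += 1
        pending = None                      # internal node awaiting a suffix link
        while remaining > 0:
            if length == 0:
                edge = i
            ch = text[edge]
            if ch not in node.children:     # Rule 2: new leaf below a node
                node.children[ch] = Node(i, None)
                explicit += 1
                if pending is not None:
                    pending.link, pending = node, None
            else:
                nxt = node.children[ch]
                edge_len = (i + 1 if nxt.end is None else nxt.end) - nxt.start
                if length >= edge_len:      # skip/count over a whole edge
                    node, edge, length = nxt, edge + edge_len, length - edge_len
                    continue
                if text[nxt.start + length] == c:   # Rule 3: show stopper
                    length += 1
                    explicit += 1
                    if pending is not None:
                        pending.link = node
                    break
                split = Node(nxt.start, nxt.start + length)  # Rule 2: split edge
                node.children[ch] = split
                split.children[c] = Node(i, None)
                nxt.start += length
                split.children[text[nxt.start]] = nxt
                explicit += 1
                if pending is not None:
                    pending.link = split
                pending = split
            remaining -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remaining + 1
            elif node is not root:
                node = node.link or root    # follow the suffix link
    return root, explicit


def leaf_labels(root, text):
    """Collect the root-to-leaf path labels (these should be the suffixes)."""
    out = []

    def walk(n, prefix):
        for child in n.children.values():
            end = len(text) if child.end is None else child.end
            path = prefix + text[child.start:end]
            if child.children:
                walk(child, path)
            else:
                out.append(path)

    walk(root, "")
    return sorted(out)


text = "MISSISSIPI$"
tree, explicit = build_suffix_tree(text)
print(leaf_labels(tree, text) == sorted(text[j:] for j in range(len(text))))
print(explicit <= 2 * len(text))
```

The two printed checks compare the tree's leaf paths against the suffixes of the input and confirm that the explicit-extension count stays within 2m for this string.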
Reference
• Dan Gusfield, Algorithms on Strings, Trees, and Sequences, Chapter 6