Selected Applications of Suffix Trees

Reminder: suffix tree
Suffix tree for a string S of length m:
• Rooted directed tree with m leaves, numbered 1,...,m.
• Each internal node, except the root, has at least 2 children.
• Each edge is labeled with a nonempty substring of S.
• Edges out of a node begin with different characters.
• The path from the root to leaf i spells out the suffix S[i...m].

Reminder: suffix tree (continued)
• Each substring a of S appears on some unique path from the root.
• If a ends at point p, the leaves below p mark all its occurrences.
• a occurs in S starting at position j ⇔ a is a prefix of S[j...m] ⇔ a labels an initial part of the path from the root to leaf j.

Example: S = xabxa$, positions 1 to 6.
[Figure: suffix tree for S with leaves numbered 1 to 6.]

Exact string matching
Find all occurrences of pattern P (length n) in text T (length m).
• Build a suffix tree for T ⇒ O(m) (Ukkonen).
• Match P along a path from the root ⇒ O(1) per character (finite alphabet) ⇒ O(n) total.
• If P fully matches a path, the leaves below its end point mark all starting positions of P in T ⇒ O(k), where k is the number of occurrences.
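The matching steps above can be sketched in Python. This is only an illustration under simplifying assumptions: it uses a naive, uncompressed suffix trie built in O(m^2) time instead of Ukkonen's linear-time suffix tree, so collecting the leaves is not O(k) here. The names `Node`, `build_suffix_trie`, `find_occurrences` and the terminator character are hypothetical choices, not part of the slides.

```python
TERMINATOR = "\0"  # assumption: this character does not occur in the text


class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.leaf = None    # 1-based suffix start, set only at leaves


def build_suffix_trie(text):
    """Insert every suffix of text + TERMINATOR, character by character."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:] + TERMINATOR:
            node = node.children.setdefault(ch, Node())
        node.leaf = start + 1
    return root


def find_occurrences(root, pattern):
    """Return the sorted 1-based starting positions of pattern in the text."""
    node = root
    for ch in pattern:  # one dictionary lookup per pattern character
        node = node.children.get(ch)
        if node is None:
            return []  # the pattern fell off the trie: no occurrence
    leaves, stack = [], [node]
    while stack:  # every leaf below the end point is an occurrence
        current = stack.pop()
        if current.leaf is not None:
            leaves.append(current.leaf)
        stack.extend(current.children.values())
    return sorted(leaves)
```

For the example string xabxa, `find_occurrences(build_suffix_trie("xabxa"), "xa")` follows the path x, a and reports the leaves below it.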

Matching Statistics
• ms(i): the length of the longest substring of T starting at position i that matches a substring somewhere in P.
• Example: T = abcxabcdex, P = wyabcwzqabcdw ⇒ ms(1) = 3, ms(5) = 4.
• There is an occurrence of P starting at position i of T iff ms(i) = |P|.
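To make the definition concrete, here is a brute-force Python sketch that computes ms(i) directly from the definition. It is deliberately naive (it enumerates all substrings of P) and is not the linear-time method; the function name `matching_statistics_naive` is a hypothetical choice.

```python
def matching_statistics_naive(text, pattern):
    """Return [ms(1), ..., ms(m)] straight from the definition."""
    # All nonempty substrings of the pattern (quadratically many).
    substrings = {
        pattern[a:b]
        for a in range(len(pattern))
        for b in range(a + 1, len(pattern) + 1)
    }
    ms = []
    for i in range(len(text)):
        length = 0
        # Extend the match one character at a time while it stays a substring of P.
        while i + length < len(text) and text[i:i + length + 1] in substrings:
            length += 1
        ms.append(length)
    return ms
```

The returned list is 0-based, so ms(1) is element 0 and ms(5) is element 4.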

Goal: compute ms(i) for each position i in T, in O(m) total time, using only a suffix tree for P.
• Naive way: match T[i...m] from the root for every i ⇒ more than O(m) total.
Using suffix links:
• Build a suffix tree for P (Ukkonen) and keep the suffix links.
• Suffix link: a pointer from an internal node v with path-label xa to the node s(v) with path-label a (x is a character, a is a substring).

Compute ms(i) in order
Base case: for ms(1), match T[1...m] from the root.
General case: suppose the matching path for ms(i) ended at point b. Then for ms(i+1):
• Let v be the first internal node at or above b.
• If there is no such v, search from the root.
• Otherwise, follow the suffix link from v to s(v) and search from s(v).
• This is valid because path_label(v) = xa is a prefix of T[i...m] ⇒ path_label(s(v)) = a is a prefix of T[i+1...m].

Skip / count
• Let β denote the string between node v and point b.
• The substring xaβ of P matches a prefix of T[i...m].
• Hence the substring aβ of P matches a prefix of T[i+1...m].
• Traverse the path labeled β out of s(v) using the skip/count trick (time proportional to the number of nodes on the path).
• From the end of β, match single characters, starting with the first character that didn't match for ms(i).

Time analysis
In the search for ms(i+1):
• Back up at most one edge from b to v ⇒ O(1).
• Traverse the suffix link from v to s(v) ⇒ O(1).
• Traverse the path labeled β (the string between v and b) from s(v) in time proportional to the number of nodes on it ⇒ O(m) total.
• Perform additional comparisons, starting with the first character that didn't match for ms(i) ⇒ O(m) total.
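The suffix-link idea can be illustrated with a simplified Python sketch. As an assumption made for brevity, it uses an uncompressed suffix trie of P (built naively in quadratic time) in which every node carries a suffix link, so the skip/count step disappears: following one link replaces re-matching from the root. The names `Node`, `build_trie_with_links` and `matching_statistics` are hypothetical.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.link = None    # suffix link: node for path xa points to node for a


def build_trie_with_links(pattern):
    """Naive suffix trie of the pattern, with suffix links set breadth-first."""
    root = Node()
    for start in range(len(pattern)):
        node = root
        for ch in pattern[start:]:
            node = node.children.setdefault(ch, Node())
    root.link = root
    queue = deque()
    for child in root.children.values():
        child.link = root  # single characters link to the empty string
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            # If xac is a substring of P then so is ac, so this child exists.
            child.link = node.link.children[ch]
            queue.append(child)
    return root


def matching_statistics(text, pattern):
    """Return [ms(1), ..., ms(m)] using suffix links between positions."""
    node, depth = build_trie_with_links(pattern), 0
    ms = []
    for i in range(len(text)):
        # Continue with the first character that did not match for ms(i-1).
        while i + depth < len(text) and text[i + depth] in node.children:
            node = node.children[text[i + depth]]
            depth += 1
        ms.append(depth)
        if depth > 0:  # drop the first character: xa becomes a
            node, depth = node.link, depth - 1
    return ms
```

The scan over T never moves its comparison pointer backward, which mirrors the O(m) bound on the total number of character comparisons.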

Ziv-Lempel data compression

Definitions
For any position i in a string S of length m:
• Prior_i: the longest prefix of S[i...m] that occurs as a substring of S[1...i-1].
• l_i: the length of Prior_i.
• s_i: the starting position of the leftmost copy of Prior_i (defined when l_i > 0).
• Example: S = abaxcabaxabz ⇒ Prior_7 = bax, l_7 = 3, s_7 = 2.
• The copy of Prior_i starting at s_i is totally contained in S[1...i-1].

Basic idea
• Suppose the text S[1...i-1] has been represented (perhaps in compressed form) and l_i > 0.
• Then Prior_i need not be explicitly represented.
• The pair (s_i,l_i) points to an earlier occurrence of Prior_i.
• Example: in S = abaxcabaxabz, the substring bax at position 7 can be replaced by the pair (2,3).

Compression algorithm (outline)
i := 1
Repeat
    compute l_i and s_i
    if l_i > 0 then
        output (s_i,l_i)
        i := i + l_i
    else
        output S(i)
        i := i + 1
Until i > m

Examples
• S1 = abacabaxabz ⇒ a b (1,1) c (1,3) x (1,2) z
• S2 = ab ababababababababababababababab ⇒ a b (1,2) (1,4) (1,8) (1,16)
• S = (ab)^k ⇒ the compressed representation has length O(log k).
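The outline and the examples above can be reproduced with a short Python sketch. It computes each pair by repeated substring search, which is far from the O(m) bound discussed later; it is only meant to show what the output looks like. The function names `prior` and `compress` and the token format (a character, or a tuple of two integers) are assumptions made for this illustration.

```python
def prior(s, i):
    """Return (s_i, l_i) for 1-based position i; s_i is None when l_i == 0."""
    k = i - 1  # 0-based index of position i
    start, length = None, 0
    while k + length < len(s):
        # str.find with an end bound only reports copies lying entirely in s[:k].
        pos = s.find(s[k:k + length + 1], 0, k)
        if pos < 0:
            break
        start, length = pos + 1, length + 1  # leftmost copy, 1-based
    return start, length


def compress(s):
    """Return the list of output tokens: single characters or (s_i, l_i) pairs."""
    out, i = [], 1
    while i <= len(s):
        start, length = prior(s, i)
        if length > 0:
            out.append((start, length))
            i += length
        else:
            out.append(s[i - 1])
            i += 1
    return out
```

Calling `compress("ab" * 16)` shows the doubling behavior behind the O(log k) claim: each pair covers twice as many characters as the previous one.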

Decompress
• Process the compressed string left to right.
• Any pair (s_i,l_i) in the representation points to a substring that has already been fully decompressed.
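A minimal Python sketch of this left-to-right decompression follows. It assumes, purely for illustration, that the compressed string is given as a list whose items are either single characters or (s_i, l_i) tuples with 1-based positions; the name `decompress` is hypothetical.

```python
def decompress(tokens):
    """Rebuild the original string from characters and (s_i, l_i) pairs."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            start, length = token
            # The referenced copy lies entirely in the text decoded so far,
            # so the slice below is always complete.
            out.extend(out[start - 1:start - 1 + length])
        else:
            out.append(token)
    return "".join(out)
```

Because every pair points into already decoded text, no look-ahead or second pass is needed.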

Computing (s_i,l_i)
• The algorithm does not request (s_i,l_i) for any position i already in the compressed part of S.
• For total O(m) time, find each requested pair (s_i,l_i) in O(l_i) time, since each such request advances i by l_i:
    compute l_i and s_i
    if l_i > 0 then
        output (s_i,l_i)
        i := i + l_i

Implementation using suffix trees (1)
Before compression:
• Build a suffix tree T for S.
• For each node v, compute c_v:
  • the smallest leaf index in v's subtree, which is also
  • the starting position of the leftmost copy of the substring that labels the path from the root to v.
• All c_v values can be computed in O(m) time.

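Implementation using suffix trees (2)
[Figure: computing (s_i,l_i). The path from the root toward leaf i spells S[i...m]; p is a point on it with path-label a, v is the first node at or below p, and the traversal continues while |a| + c_v ≤ i.]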

Implementation using suffix trees (3)
• To compute (s_i,l_i), traverse the unique path in T that matches a prefix of S[i...m].
• Let p be the current point and v the first node at or below p.
• Traverse as long as string_depth(p) + c_v ≤ i.
• At the last point p of the traversal: l_i = string_depth(p), s_i = c_v.
• O(l_i) time.

Example
S = abababab (positions 1 to 8)
• i=1: l_i = 0 ⇒ output a
• i=2: l_i = 0 ⇒ output b
• i=3: l_i = 2, c_v = 1 ⇒ output (1,2)
• i=5: l_i = 4, c_v = 1 ⇒ output (1,4)
[Figure: suffix tree for S annotated with the c_v value of each node.]
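The traversal rule string_depth(p) + c_v ≤ i can be tried out with the following Python sketch. As a simplifying assumption it uses an uncompressed suffix trie (quadratic size), where every point p is itself a node, so "the first node at or below p" is p. The names `Node`, `build_trie_with_c` and `prior_from_trie` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.c = None       # smallest 1-based suffix start passing through this node


def build_trie_with_c(s):
    """Naive suffix trie; suffixes are inserted in increasing start order,
    so the first suffix to reach a node gives its c value."""
    root = Node()
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.children.setdefault(ch, Node())
            if node.c is None:
                node.c = start + 1
    return root


def prior_from_trie(root, s, i):
    """Return (s_i, l_i) for 1-based position i; s_i is None when l_i == 0."""
    node, depth, start = root, 0, None
    while i - 1 + depth < len(s):
        child = node.children.get(s[i - 1 + depth])
        # Stop as soon as the leftmost copy would no longer fit in S[1...i-1].
        if child is None or (depth + 1) + child.c > i:
            break
        node, depth, start = child, depth + 1, child.c
    return start, depth
```

On S = abababab this reproduces the values in the example: the walk for i = 3 stops at depth 2 because depth 3 would give 3 + 1 > 3.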

Online version
• Compress S as it is being input, one character at a time.
• Possible since S[1...i-1] is known before computing (s_i,l_i).
• Implementation: build the suffix tree online with Ukkonen's algorithm.
• In phase i, Ukkonen's algorithm builds the implicit suffix tree T_i for the prefix S[1...i].

Claim 1
Assume:
• The compression has been done for S[1...i-1].
• The implicit suffix tree T_{i-1} for S[1...i-1] has been built.
• c_v values are given for each node v in T_{i-1}.
Then (s_i,l_i) can be obtained in O(l_i) time.

Suppose we had a suffix tree for S[1...i-1] with c_v values ⇒ we could find (s_i,l_i) in O(l_i) time: l_i = string_depth(p), s_i = c_v.
[Figure: the path S(i) S(i+1) ... S(k-1) from the root ends at point p, above node v; the next character on the edge is c ≠ S(k).]

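The missing leaves in the implicit suffix tree are not needed.
[Figure: the same matching path in the suffix tree and in the implicit suffix tree. A leaf j that is missing from the implicit tree lies below a point that also has a leaf h with a smaller index than j below it, so the c_v values are unchanged.]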

Claim 2
c_v values for all implicit suffix trees can be computed in total O(m) time.
In Ukkonen's algorithm:
• Only extension rule 2 updates c_v values.
• Whenever a new internal node v is created by splitting an edge (u,w): c_v ← c_w.
• Whenever a new leaf j is created: c_j ← j.
⇒ Constant update time per new node.

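Updating c_v values
[Figure: the two cases of extension rule 2. New leaf and new node: edge (u,w) is split by a new node v, which receives the new leaf j as a child. New leaf only: the new leaf j is attached to an existing node v.]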

Online algorithm
• Base case: output S(1) and build T_1.
• General case: suppose S[1...i-1] has been compressed and T_{i-1} with c_v values has been constructed.
  • Match S(i),S(i+1),... along a path from the root in T_{i-1}.
  • Let S(k) be the first character that doesn't match.
  • Find (s_i,l_i).
  • If l_i = 0, output S(i) and build T_i with c_v values.
  • If l_i > 0, output (s_i,l_i) and build T_i,...,T_{k-1} with c_v values.
• Total time: O(m).

Maximal Repetitive Structures

Maximal Pair
• A maximal pair in string S: a pair of identical substrings a and b in S such that the character to the immediate left of a differs from the character to the immediate left of b, and likewise for the characters to their immediate right.
• Extending a and b in either direction would destroy the equality of the two strings.
• Example: S = xabcyiiizabcqabcyrxar (for instance, the first two occurrences of abc form a maximal pair).

Maximal Pair (continued)
• Overlap is allowed: S = cxxaxxaxxbcxxaxxaaxxaxxb
• To allow a prefix or suffix of S to be part of a maximal pair, replace S by #S$, where # and $ don't appear in S. Example: #abcxabc$
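The definition, including the #S$ padding, can be checked mechanically with a small Python sketch. The function name `is_maximal_pair` and its argument convention (1-based starting positions i and j plus a common length) are assumptions made for this illustration; the code also assumes that # and $ do not appear in S.

```python
def is_maximal_pair(s, i, j, length):
    """True if the substrings of the given length starting at 1-based
    positions i and j (i != j) form a maximal pair in s."""
    if i == j or length < 1 or max(i, j) + length - 1 > len(s):
        return False
    t = "#" + s + "$"  # with this padding, t[k] is S(k) for 1-based k
    if t[i:i + length] != t[j:j + length]:
        return False
    left_differs = t[i - 1] != t[j - 1]
    right_differs = t[i + length] != t[j + length]
    return left_differs and right_differs
```

In #abcxabc$ the two copies of abc start at positions 1 and 5, so `is_maximal_pair("abcxabc", 1, 5, 3)` exercises both sentinels at once.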

Maximal Repeat
• A maximal repeat in string S: a substring of S that occurs in a maximal pair in S.
• Example: S = xabcyiiizabcqabcyrxar has maximal repeats abc, abcy, ...

Finding All Maximal Repeats in Linear Time
• Given: a string S of length n.
• Goal: find all maximal repeats in O(n) time.
• Lemma: let T be a suffix tree for S. If a string a is a maximal repeat in S, then a is the path-label of an internal node v in T.

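Proof: by the definition of a maximal repeat. Two occurrences of a are followed by different characters, so the path labeled a must end at a branching node.
[Figure: in the suffix tree for S = xabcyiiizabcqabcyrxar, the path abc ends at a node v with outgoing edges starting with y and q.]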

Conclusion
• There can be at most n maximal repeats in any string of length n.
• Proof: by the lemma, since T has at most n internal nodes.

Which internal nodes correspond to maximal repeats?
• The left character of leaf i in T is S(i-1).
• Node v of T is left diverse if at least 2 leaves in v's subtree have different left characters.
• A leaf can't be left diverse.
• Left diversity propagates upward: if a child of v is left diverse, so is v.

Example: S = #xabxa$, with positions 1 to 6 numbering xabxa$.
[Figure: suffix tree for S with the left character of each leaf; the node with path-label xa is left diverse, so xa is a maximal repeat.]

Theorem
The string a labeling the path to an internal node v of T is a maximal repeat ⇔ v is left diverse.

Proof of ⇒
• Suppose a is a maximal repeat.
• ⇒ It participates in a maximal pair.
• ⇒ It has at least two occurrences with distinct left characters: xa and ya, x ≠ y.
• Let i and j be the starting positions of these two occurrences. Then leaves i and j are in v's subtree and have different left characters x, y.
• ⇒ v is left diverse.

Proof of ⇐
• Suppose v is left diverse ⇒ there are substrings xap and yaq in S with x ≠ y.
• If p ≠ q ⇒ a's occurrences in xap and yaq form a maximal pair ⇒ a is a maximal repeat.
• If p = q ⇒ since v is a branching node, there is a substring zar in S with r ≠ p.
  • If z ≠ x, it forms a maximal pair with xap.
  • If z ≠ y, it forms a maximal pair with yap.
  • Since x ≠ y, at least one of these holds, so in either case a is a maximal repeat.

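Proof of ⇐ (continued)
[Figure: the two cases at node v with path-label a. When p ≠ q, the subtrees starting with p and q contain leaves with left characters x and y. When p = q, the subtree starting with p contains leaves with left characters x and y, and the subtree starting with r contains a leaf with left character z.]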

Compact Representation
• Node v in T is a frontier node if:
  • v is left diverse.
  • none of v's children are left diverse.
• Each node at or above the frontier is left diverse.
• The subtree of T from the root down to the frontier nodes is a compact representation of the set of all maximal repeats of S.
• The representation takes O(n) space, though the total length of the maximal repeats may be larger.

Linear time algorithm
• Build the suffix tree T.
• Find all left diverse nodes in linear time.
• Delete all nodes that aren't left diverse, to obtain the compact representation.

Finding all left diverse nodes in linear time
• Traverse T bottom-up, recording for each node:
  • either that it is left diverse,
  • or the left character common to all leaves in its subtree.
• For each leaf: record its left character.
• For each internal node v:
  • If any child is left diverse ⇒ v is left diverse.
  • Else if all children have a common character x ⇒ record x for v.
  • Else record that v is left diverse.
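The bottom-up rule above can be sketched in Python as follows. Several simplifications are assumed: the tree is a naive uncompressed suffix trie rather than a linear-size suffix tree, so the sketch is not linear time; the internal nodes of the suffix tree are recognized as trie nodes with at least 2 children; an empty string stands in for the # sentinel; and the traversal is recursive, which only suits short inputs. The names `Node`, `DIVERSE` and `maximal_repeats` are hypothetical.

```python
DIVERSE = object()  # marker recorded for left diverse nodes


class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.left = None    # left character, set only at leaves


def maximal_repeats(s, terminator="\0"):
    """Return the sorted maximal repeats of s (terminator must not occur in s)."""
    root = Node()
    for start in range(len(s)):
        node = root
        for ch in s[start:] + terminator:
            node = node.children.setdefault(ch, Node())
        node.left = s[start - 1] if start > 0 else ""  # "" plays the role of '#'
    repeats = []

    def visit(node, label):
        if not node.children:  # leaf: record its left character
            return node.left
        marks = [visit(child, label + ch) for ch, child in node.children.items()]
        if any(m is DIVERSE for m in marks) or len(set(marks)) > 1:
            mark = DIVERSE     # a child is diverse, or two characters differ
        else:
            mark = marks[0]    # common left character of all leaves below
        if mark is DIVERSE and label and len(node.children) >= 2:
            repeats.append(label)  # branching and left diverse
        return mark

    visit(root, "")
    return sorted(repeats)
```

For S = xabxa the only branching, left diverse node is the one labeled xa; the node labeled a is branching, but both of its leaves have left character x.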

Finding All Maximal Pairs in Linear Time
• Not every two occurrences of a maximal repeat form a maximal pair. Example: in S = xabcyiiizabcqabcyrxar, the first and third occurrences of abc are both followed by y.
• There can be more than O(n) maximal pairs.
• The algorithm runs in O(n+k) time, where k is the number of maximal pairs.

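General Idea
• For each node u and character x: keep a list of all leaf numbers below u whose left character is x.
• To find all maximal pairs of a, where a is the path-label of node v with children v1 and v2: for each character x, form the cartesian product of the list for x at v1 with every list for a character ≠ x at v2.
[Figure: node v with path-label a and children v1 and v2, whose edges start with p and q; leaf i below v1 has left character x, leaf j below v2 has left character y.]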

The Algorithm
• Build the suffix tree T for S.
• Record the left character of each leaf.
• Traverse T bottom-up.
• At each node v with path-label a:
  • Output all maximal pairs of a: the cartesian product of lists (u,x) and (u',x') for each pair of children u ≠ u' and each pair of characters x ≠ x'.
  • Create the lists for node v by linking the lists of v's children.
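A rough Python sketch of these steps follows. It is an illustration under stated assumptions, not the O(n+k) algorithm itself: it runs on a naive uncompressed suffix trie, it merges Python lists by copying instead of linking them in constant time, and it recurses, so it only suits short strings. The names `Node` and `maximal_pairs`, and the output format (i, j, length) with 1-based i less than j, are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.leaf = None    # 1-based suffix start, set only at leaves


def maximal_pairs(s, terminator="\0"):
    """Return sorted (i, j, length) triples; terminator must not occur in s."""
    root = Node()
    for start in range(len(s)):
        node = root
        for ch in s[start:] + terminator:
            node = node.children.setdefault(ch, Node())
        node.leaf = start + 1
    pairs = []

    def left_char(pos):
        return s[pos - 2] if pos > 1 else ""  # "" plays the role of '#'

    def visit(node, depth):
        """Return {left character: [leaf numbers]} for the subtree of node."""
        if node.leaf is not None:
            return {left_char(node.leaf): [node.leaf]}
        merged = {}
        for child in node.children.values():
            lists = visit(child, depth + 1)
            if depth > 0:
                # Pair this child's lists with lists from earlier children
                # whenever the left characters differ.
                for x, leaves in lists.items():
                    for y, others in merged.items():
                        if x != y:
                            pairs.extend(
                                (min(i, j), max(i, j), depth)
                                for i in leaves
                                for j in others
                            )
            for x, leaves in lists.items():
                merged.setdefault(x, []).extend(leaves)
        return merged

    visit(root, 0)
    return sorted(pairs)
```

Leaves taken from different children are followed by different characters, and the x ≠ y test guarantees different left characters, so each reported triple is a maximal pair.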

Time Analysis
• Suffix tree construction ⇒ O(n).
• Bottom-up traversal, including all list linking ⇒ O(n).
• All cartesian product operations ⇒ O(k), where k is the number of maximal pairs.
• Total: O(n+k).

Finding All Supermaximal Repeats in Linear Time
• Supermaximal repeat: a maximal repeat that isn't a substring of any other maximal repeat.
• Example: in S = xabcyiiizabcqabcyrxar, abcy is supermaximal, abc isn't.
• Theorem: a left diverse internal node v in the suffix tree for S represents a supermaximal repeat iff
  • all of v's children are leaves, and
  • each has a distinct left character.
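The definition can be checked on small inputs with a brute-force Python sketch. It applies the definitions directly (enumerate substrings, test for a maximal pair, then filter) rather than the suffix-tree theorem, so it is polynomial but far from linear. The function names are hypothetical, and the code assumes that # and $ do not appear in S.

```python
def occurrences(s, w):
    """1-based starting positions of w in s (overlaps allowed)."""
    return [k + 1 for k in range(len(s) - len(w) + 1) if s.startswith(w, k)]


def is_maximal_repeat(s, w):
    """True if some two occurrences of w differ on both sides."""
    t = "#" + s + "$"  # with this padding, t[k] is S(k) for 1-based k
    occ = occurrences(s, w)
    return any(
        t[i - 1] != t[j - 1] and t[i + len(w)] != t[j + len(w)]
        for i in occ
        for j in occ
        if i < j
    )


def supermaximal_repeats(s):
    """Maximal repeats that are not substrings of another maximal repeat."""
    substrings = {s[a:b] for a in range(len(s)) for b in range(a + 1, len(s) + 1)}
    maximal = {w for w in substrings if is_maximal_repeat(s, w)}
    return sorted(
        w for w in maximal if not any(w != u and w in u for u in maximal)
    )
```

On the running example, abc is dropped because it is a substring of the maximal repeat abcy.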

Longest Common Extension
