Selected Applications of Suffix Trees

Suffix tree for string S of length m:

- rooted directed tree with m leaves numbered 1,...,m.
- each internal node, except the root, has at least 2 children.
- each edge labeled with a nonempty substring of S.
- edges out of a node begin with different characters.
- path from the root to leaf i spells out suffix S[i...m].

- Each substring a of S appears on some unique path from the root.
- If a ends at point p, the leaves below p mark all its occurrences.
a occurs in S starting at position j
⇔ a is a prefix of S[j...m]
⇔ a labels an initial part of the path from the root to leaf j.

[Figure: example suffix tree with six leaves, numbered 1 to 6, over the characters a, b, x and the terminator $.]

Find all occurrences of pattern P (length n) in text T (length m).

- Build suffix tree for T: O(m) (Ukkonen).
- Match P along a path from the root: O(1) per character (finite alphabet), O(n) total.
- If P fully matches a path, then the leaves below mark all starting positions of P in T: O(k), where k = number of occurrences.
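The matching step can be tried out on a small input with the sketch below. It is only an illustration: it builds a naive suffix trie in quadratic time instead of using Ukkonen's linear-time construction, and the names `build_suffix_trie` and `find_occurrences` are made up for this example.

```python
def build_suffix_trie(text):
    """Naive suffix trie of text (O(m^2) time and space, not Ukkonen).

    Each node records the 1-based start positions of all suffixes passing
    through it, which stands in for "the leaves below" in a real suffix tree.
    """
    root = {"children": {}, "starts": []}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(i + 1)
    return root


def find_occurrences(root, pattern):
    """Match pattern from the root; return all 1-based start positions."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []          # pattern falls off the tree: no occurrence
    return sorted(node["starts"])


trie = build_suffix_trie("abentbananaend")
print(find_occurrences(trie, "an"))   # start positions of "an" in the text
```

Storing the start positions at every node trades space for simplicity; a real suffix tree would collect them from the leaves below the matched point in O(k) time.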

- ms(i) – the length of the longest substring of T starting at position i that matches a substring somewhere in P.
- Example: T = abcxabcdex, P = wyabcwzqabcdw; ms(1) = 3, ms(5) = 4.
- There is an occurrence of P starting at position i of T iff ms(i)=|P|.
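The definition of ms(i) can be checked directly with a brute-force sketch. This is not the suffix-link method, only a reference implementation for small inputs; the function name `matching_statistics` is an assumption of this example, and the returned list is 0-based, so ms(i) is stored at index i-1.

```python
def matching_statistics(t, p):
    """Return [ms(1), ms(2), ...] for text t against pattern p.

    Brute force: for each start, grow the candidate substring while it
    still occurs somewhere in p.
    """
    ms = []
    for i in range(len(t)):
        length = 0
        while i + length < len(t) and t[i:i + length + 1] in p:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("abcxabcdex", "wyabcwzqabcdw")
print(ms[0], ms[4])   # ms(1) and ms(5) from the example above
```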

- Naive way: for each i, match T[i...m] starting from the root. This takes more than O(m) time in total.
Using suffix links:

- Build suffix tree for P (Ukkonen) and keep suffix links.
- suffix link: pointer from internal node v with path-label xα to node s(v) with path-label α (x a character, α a substring).

base case: For ms(1), match T[1...m] from the root.

general case: Suppose the matching path for ms(i) ended at point b. Then for ms(i+1):

- Let v be the first internal node at or above b.
- If there is no such v, search from the root.
- Otherwise, follow the suffix link from v to s(v) and search from s(v): path_label(v) = xα is a prefix of T[i...m], so path_label(s(v)) = α is a prefix of T[i+1...m].

- Let β denote the string between node v and point b.
- Substring xαβ of P matches a prefix of T[i...m].
- Substring αβ of P matches a prefix of T[i+1...m].
- Traverse the path labeled β out of s(v) using the skip/count trick (time proportional to the number of nodes on the path).
- From the end of β, match single characters (starting with the first character that didn’t match for ms(i)).

In the search for ms(i+1):

- back up at most one edge from b to v: O(1).
- traverse the suffix link from v to s(v): O(1).
- traverse a β-path from s(v) in time proportional to the number of nodes on it: O(m) total.
- perform additional comparisons starting with the first character that didn’t match for ms(i): O(m) total.

Ziv-Lempel data compression

For any position i in string S of length m:

- Prior_i: the longest prefix of S[i...m] that occurs as a substring of S[1...i-1].
- l_i: the length of Prior_i.
- s_i: the starting position of the left-most copy of Prior_i (defined when l_i > 0).

Example: S = abaxcabaxabz, Prior_7 = bax, l_7 = 3, s_7 = 2.

- The copy of Prior_i starting at s_i is totally contained in S[1...i-1].

- Suppose the text S[1...i-1] has been represented (perhaps in compressed form) and l_i > 0.
- Then Prior_i need not be explicitly represented.
- The pair (s_i, l_i) points to an earlier occurrence of Prior_i.
- Example: in S = abaxcabaxabz, the pair (2,3) at position 7 stands for bax.

i := 1
repeat
    compute l_i and s_i
    if l_i > 0 then
        output (s_i, l_i)
        i := i + l_i
    else
        output S(i)
        i := i + 1
until i > m

S1 = a b a c a b a x a b z
compressed: a b (1,1) c (1,3) x (1,2) z

S2 = ab ababababababababababababababab = (ab)^16
compressed: a b (1,2) (1,4) (1,8) (1,16)

For S = (ab)^k, the compressed representation has size O(log k).

- Process the compressed string left to right.
- Any pair (s_i, l_i) in the representation points to a substring that has already been fully decompressed.
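Compression and decompression can be sketched together as follows. The sketch finds each pair by a plain substring search rather than with a suffix tree, so it does not achieve the linear time bound; `compress` and `decompress` are names chosen for this example, and positions in the output pairs are 1-based as on the slides.

```python
def compress(s):
    """Ziv-Lempel style compression: literals and (start, length) pairs."""
    out = []
    i = 0                                  # 0-based index into s
    while i < len(s):
        best_start, best_len = 0, 0
        length = 1
        # Longest prefix of s[i:] that lies entirely inside s[:i].
        while i + length <= len(s):
            pos = s.find(s[i:i + length], 0, i)   # leftmost copy, if any
            if pos == -1:
                break
            best_start, best_len = pos + 1, length
            length += 1
        if best_len > 0:
            out.append((best_start, best_len))
            i += best_len
        else:
            out.append(s[i])
            i += 1
    return out


def decompress(tokens):
    """Rebuild the string left to right; every pair refers to decoded text."""
    chars = []
    for tok in tokens:
        if isinstance(tok, tuple):
            start, length = tok
            chars.extend(chars[start - 1:start - 1 + length])
        else:
            chars.append(tok)
    return "".join(chars)


tokens = compress("abacabaxabz")
print(tokens)
print(decompress(tokens))
```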

- The algorithm does not request (s_i, l_i) for any position i already in the compressed part of S.
- For total O(m) time, find each requested pair (s_i, l_i) in O(l_i) time, since the step

compute l_i and s_i; if l_i > 0 then output (s_i, l_i) and set i := i + l_i

advances i by l_i.

Before compression:

- Build a suffix tree T for S.
- For each node v, compute c_v: the smallest leaf index in v’s subtree, which is the starting position of the leftmost copy of the substring that labels the path from the root to v.

- O(m) time.

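Computing (s_i, l_i):

[Figure: the path from the root toward leaf i spells S[i...m]; point p on it has path-label a, v is the first node at or below p, and the traversal continues while |a| + c_v ≤ i.]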

- To compute (s_i, l_i), traverse the unique path in T that matches a prefix of S[i...m]:
- Let p be the current point and v the first node at or below p.
- Traverse as long as string_depth(p) + c_v ≤ i.
- At the last point p of the traversal: l_i = string_depth(p), s_i = c_v.

- O(l_i) time.

Example: S = abababab (positions 1 to 8)

i = 1: l_i = 0, output a
i = 2: l_i = 0, output b
i = 3: l_i = 2, c_v = 1, output (1,2)
i = 5: l_i = 4, c_v = 1, output (1,4)

[Figure: suffix tree for S = abababab$ with leaves 1 to 8; internal nodes are annotated with c_v = 1 or c_v = 2.]

- Compress S as it is being input one character at a time.
- Possible since S[1...i-1] is known before computing s_i, l_i.
- Implementation: build the suffix tree online.

Ukkonen’s algorithm:

- In phase i, build the implicit suffix tree T_i for the prefix S[1...i].

Assume:

- The compaction has been done for S[1...i-1].
- The implicit suffix tree T_{i-1} for S[1...i-1] has been built.
- c_v values are given for each node v in T_{i-1}.

Then (s_i, l_i) can be obtained in O(l_i) time: l_i = string_depth(p), s_i = c_v.

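[Figure: in T_{i-1}, the path from the root spelling S(i), S(i+1), ..., S(k-1) ends at point p, where the edge continues with a character c different from S(k); v is the first node at or below p. Leaves h and j with h < j, reached by S(h)...S(i-1) and S(j)...S(i-1), illustrate the choice of the leftmost copy.]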

c_v values for all implicit suffix trees can be computed in total O(m) time.

- In Ukkonen’s algorithm, only extension rule 2 updates c_v values.
- Whenever a new internal node v is created by splitting an edge (u,w): c_v ← c_w.
- Whenever a new leaf j is created: c_j ← j.
- Constant update time per new node.

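[Figure: the two cases of extension rule 2. "New leaf": leaf j is attached below an existing node v. "New leaf and new node": the edge (u,w) is split by a new node v, which receives the new leaf j.]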

- Base case: output S(1) and build T_1.
- General case: suppose S[1...i-1] has been compressed and T_{i-1} with c_v values has been constructed.
- Match S(i), S(i+1), ... along a path from the root in T_{i-1}.
- Let S(k) be the first character that doesn’t match.
- Find (s_i, l_i).
- If l_i = 0, output S(i) and build T_i with c_v values.
- If l_i > 0, output (s_i, l_i) and build T_i, ..., T_{k-1} with c_v values.

- Total time: O(m).

Maximal Repetitive Structures

- A maximal pair in string S: a pair of identical substrings a and b in S such that the character to the immediate left (right) of a is different from the character to the immediate left (right) of b.
- Extending a and b in either direction would destroy the equality of the two strings.
- Example: S = xabcyiiizabcqabcyrxar

- Overlap is allowed: S = cxxaxxaxxbcxxaxxaaxxaxxb
- To allow a prefix or suffix of S to be part of a maximal pair, replace S by #S$ (# and $ don’t appear in S). Example: #abcxabc$

- A maximal repeat in string S: a substring of S that occurs in a maximal pair in S.

- Example: S = xabcyiiizabcqabcyrxar; maximal repeats: abc, abcy, ...

- Given: String S of length n.
- Goal: Find all maximal repeats in O(n) time.
- Lemma: Let T be a suffix tree for S. If string a is a maximal repeat in S, then a is the path-label of an internal node v in T.

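[Figure: in the suffix tree for S = xabcyiiizabcqabcyrxar, the path a = abc from the root ends at an internal node v whose outgoing edges begin with y and q.]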

- There can be at most n maximal repeats in any string of length n.
- Proof: by the lemma, since T has at most n internal nodes.

- The left character of leaf i in T is S(i-1).
- Node v of T is left diverse if at least 2 leaves in v’s subtree have different left characters.
- A leaf can’t be left diverse.
- Left diversity propagates upward.

[Figure: example suffix tree with leaves numbered 1 to 6 for a string preceded by #, marking which nodes are left diverse and which path-labels are maximal repeats.]

The string a labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse.

⇒:

- Suppose a is a maximal repeat.
- Then it participates in a maximal pair.
- So it has at least two occurrences with distinct left characters: xa and ya with x ≠ y.
- Let i and j be the two starting positions of a. Then leaves i and j are in v’s subtree and have different left characters x, y.
- Hence v is left diverse.

⇐:

- Suppose v is left diverse. Then there are substrings xap and yaq in S with x ≠ y.
- If p ≠ q, then a’s occurrences in xap and yaq form a maximal pair, so a is a maximal repeat.
- If p = q, then since v is a branching node, there is a substring zar in S with r ≠ p. If z ≠ x, it forms a maximal pair with xap. If z ≠ y, it forms a maximal pair with yap. Since x ≠ y, at least one of these holds, so in either case a is a maximal repeat.

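[Figure: the two cases of the proof. In one, v has branches starting with p and q that lead to leaves with left characters x and y. In the other, branches starting with p and r lead to leaves with left characters x, y and z.]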

- Node v in T is a frontier node if:
- v is left diverse.
- none of v’s children are left diverse.

- Each node at or above the frontier is left diverse.
- The subtree of T from the root down to the frontier nodes is a compact representation of the set of all maximal repeats of S.
- The representation takes O(n) space, though the total length of the maximal repeats may be larger.

- Build suffix tree T.
- Find all left diverse nodes in linear time.
- Delete all nodes that aren’t left diverse, to achieve compact representation:

- Traverse T bottom-up, recording for each node either that it is left diverse or the left character common to all leaves in its subtree.

- For each leaf: record its left character.
- For each internal node v:
- If any child is left diverse, then v is left diverse.
- Else, if all children have a common character x, record x for v.
- Else, record that v is left diverse.
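The bottom-up marking can be sketched as below. As an assumption made for brevity, the sketch uses a naive suffix trie (quadratic size) whose branching nodes stand in for the internal nodes of the suffix tree, and it uses recursion, so it is only suitable for short strings. The names `Node`, `DIVERSE` and `maximal_repeats` are invented for this example.

```python
DIVERSE = object()  # marker meaning "this subtree is left diverse"


class Node:
    def __init__(self):
        self.children = {}
        self.start = None   # 0-based suffix start, set on leaves only


def maximal_repeats(s):
    """Maximal repeats of s, found via left diversity on a naive suffix trie.

    Assumes '#' and '$' do not occur in s; the string is treated as #s$.
    """
    text = s + "$"
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i

    found = []

    def visit(node, label):
        # Returns DIVERSE, or the left character shared by all leaves below.
        if not node.children:
            return "#" if node.start == 0 else text[node.start - 1]
        marks = {visit(child, label + ch) for ch, child in node.children.items()}
        if DIVERSE in marks or len(marks) > 1:
            # Only branching nodes correspond to suffix tree nodes.
            if label and len(node.children) >= 2:
                found.append(label)
            return DIVERSE
        return marks.pop()

    visit(root, "")
    return sorted(found)


print(maximal_repeats("xabcyiiizabcqabcyrxar"))
```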

- Not every two occurrences of a maximal repeat form a maximal pair.
Example: S = xabcyiiizabcqabcyrxar

- There can be more than O(n) maximal pairs.
- The algorithm is O(n+k) where k is the number of maximal pairs.

For each node u and character x: keep all leaf numbers below u whose left character is x.

To find all maximal pairs of a:

For each character x, form the cartesian product of the list for x at v1 with every list for a character x’ ≠ x at v2.

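[Figure: node v with path-label a has children v1 and v2 on branches starting with p and q; leaf i below v1 has left character x and leaf j below v2 has left character y.]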

- Build suffix tree T for S.
- Record the left character of each leaf.
- Traverse T bottom-up.
- At each node v with path-label a:
- Output all maximal pairs of a: the cartesian product of lists (u,x) and (u’,x’) for each pair of children u ≠ u’ and pair of characters x ≠ x’.
- Create the lists for node v by linking the lists of v’s children.

- Suffix tree construction: O(n).
- Bottom-up traversal including all list-linking: O(n).
- All cartesian product operations: O(k), where k is the number of maximal pairs.
- Total: O(n+k).
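For checking results on small strings, maximal pairs can also be listed straight from the definition. The sketch below is a brute-force cross-check, not the O(n+k) suffix tree algorithm: for every two start positions with different left characters it extends to the right until the characters differ. The function name `maximal_pairs` and the (i, j, length) output format are assumptions of this example.

```python
def maximal_pairs(s):
    """All maximal pairs of s as (i, j, length) with 1-based i < j.

    Assumes '#' and '$' do not occur in s; they act as sentinels so that
    prefixes and suffixes of s can take part in maximal pairs.
    """
    t = "#" + s + "$"          # t[i] is the character at 1-based position i of s
    n = len(s)
    pairs = []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if t[i - 1] == t[j - 1]:
                continue       # same left character: extendable to the left
            length = 0
            # The unique '$' guarantees this loop stops inside t.
            while t[i + length] == t[j + length]:
                length += 1
            if length > 0:
                pairs.append((i, j, length))
    return pairs


print(maximal_pairs("abcxabc"))
```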

- supermaximal repeat: a maximal repeat that isn’t a substring of any other maximal repeat.
- Example: S = xabcyiiizabcqabcyrxar; abcy is supermaximal, abc isn’t.
- Theorem: A left diverse internal node v in the suffix tree for S represents a supermaximal repeat iff all of v’s children are leaves and each has a distinct left character.

Longest Common Extension

Preprocess strings S1 and S2 such that the following queries can be answered in O(1) time each:

- Given an index pair (i,j), find the length of the longest substring of S1 starting at position i that matches a substring of S2 starting at position j.

Example: if S1 = ...abcdzzz... with abcd starting at position i, and S2 = ...abcdefg... with abcd starting at position j, the answer for (i,j) is 4.

Preprocess: O(|S1|+|S2|)

- Build generalized suffix tree T for S1 and S2.
- Preprocess T for constant-time LCA queries.
- Compute string-depth of every node.
To answer query (i,j): O(1)

- Find LCA node v of leaves corresponding to suffix i of S1 and suffix j of S2.
- Return string-depth(v).
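The query interface can be illustrated without a suffix tree. The sketch below swaps in a different technique: a dynamic-programming table filled in O(|S1|·|S2|) time, which still answers each query in O(1) but does not meet the linear preprocessing bound of the generalized suffix tree with LCA. The names `preprocess_lce` and `lce` are assumptions of this example; positions are 1-based as on the slides.

```python
def preprocess_lce(s1, s2):
    """table[i][j] = longest common extension of s1[i:] and s2[j:] (0-based)."""
    n1, n2 = len(s1), len(s2)
    table = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 - 1, -1, -1):
        for j in range(n2 - 1, -1, -1):
            if s1[i] == s2[j]:
                table[i][j] = table[i + 1][j + 1] + 1
    return table


def lce(table, i, j):
    """Answer a query for 1-based positions i in S1 and j in S2 in O(1)."""
    return table[i - 1][j - 1]


table = preprocess_lce("xxabcdzzz", "yabcdefg")
print(lce(table, 3, 2))   # extension of "abcdzzz" against "abcdefg"
```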

Tandem Repeats

tandem repeat: a string a that can be written as a = bb, where b is a substring.

Example: s = xababababy contains the tandem repeats ab|ab (three occurrences), ba|ba (two occurrences) and abab|abab, with b = ab, b = ba and b = abab respectively.

Note: b is not required to be of maximal length.

For each feasible pair of start position i and middle position j:

- Perform a longest common extension query from i and j.
- If the extension from i reaches j or beyond, (i,j) defines a tandem repeat.

Layout: the first copy occupies positions i ... j-1 and the second copy positions j ... 2j-i-1, each of length j-i.

- Preprocess for longest common extension: O(n).
- O(n^2) feasible pairs, O(1) time to check each one.
- O(n^2) total.
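A minimal sketch of this quadratic method follows. As an assumption for self-containment, the longest common extension values come from a quadratic-size table rather than from a suffix tree with LCA preprocessing, which keeps the overall O(n^2) bound. The function name `tandem_repeats` is made up here; pairs are reported as 1-based (start, middle).

```python
def tandem_repeats(s):
    """All tandem repeats of s as 1-based (start i, middle j) pairs."""
    n = len(s)
    # ext[a][b]: longest common extension of s from 0-based a and b.
    ext = [[0] * (n + 1) for _ in range(n + 1)]
    for a in range(n - 1, -1, -1):
        for b in range(n - 1, -1, -1):
            if s[a] == s[b]:
                ext[a][b] = ext[a + 1][b + 1] + 1
    result = []
    for i in range(n):
        for j in range(i + 1, n):
            if 2 * j - i > n:
                break                      # second copy would run past the end
            if ext[i][j] >= j - i:         # extension from i reaches j
                result.append((i + 1, j + 1))
    return result


print(tandem_repeats("xababababy"))
```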

- Due to Landau & Schmidt.
- O(n log n + z) time.
- z = total number of tandem repeats in S.
- z can be as large as Θ(n^2).
- Example: all n characters are the same.
- In practice, z is expected to be smaller.

Let h = n/2.

1. Find all tandem repeats contained entirely in the first half of S (up to h).
2. Find all tandem repeats contained entirely in the second half of S (after h).
3. Find all tandem repeats where the first copy contains h.
4. Find all tandem repeats where the second copy contains h.

- Problems 1 and 2 are solved recursively.
- Problems 3 and 4 are symmetric to each other.
- Remains to show: a solution for problem 3.

- For each l = 1,...,h, find all tandem repeats of length exactly 2l whose first copy contains h.

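[Figure: the two copies of a candidate repeat of length 2l around positions h and q; l1 is the forward extension from h and q, l2 is the backward extension from h-1 and q-1.]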

1. Let q = h + l.
2. Compute the longest common extension from h and q. Let l1 denote its length.
3. Compute the longest common extension from h-1 and q-1 in the reverse direction. Let l2 denote its length.
4. There is a tandem repeat of length 2l whose first copy contains h iff l1 ≥ 1 and l1 + l2 ≥ l.
5. If the condition holds, output the starting positions max(h-l2, h-l+1), ..., min(h+l1-l, h).
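The steps for a fixed l, repeated over all l, can be sketched as follows. The conditions are taken as l1 ≥ 1 and l1 + l2 ≥ l. The extensions are computed by direct character comparison instead of constant-time queries, so the sketch shows the logic but not the time bound. All function names are assumptions of this example; positions are 1-based and results are (start, l) pairs.

```python
def forward_extension(s, i, j):
    """Longest common extension to the right from 1-based i < j."""
    k = 0
    while j + k <= len(s) and s[i + k - 1] == s[j + k - 1]:
        k += 1
    return k


def backward_extension(s, i, j):
    """Longest common extension to the left from 1-based i < j."""
    k = 0
    while i - k >= 1 and s[i - k - 1] == s[j - k - 1]:
        k += 1
    return k


def repeats_first_copy_contains(s, h):
    """Tandem repeats (start, l) of length 2l whose first copy contains h."""
    out = []
    for l in range(1, len(s) - h + 1):     # q = h + l must be a valid position
        q = h + l
        l1 = forward_extension(s, h, q)
        l2 = backward_extension(s, h - 1, q - 1)
        if l1 >= 1 and l1 + l2 >= l:
            lo = max(h - l2, h - l + 1)
            hi = min(h + l1 - l, h)
            out.extend((start, l) for start in range(lo, hi + 1))
    return out


print(repeats_first_copy_contains("xababababy", 5))
```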

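[Figure: positions h-l2, h+l1-l, h, h+l-l2, q and h+l1 marked on S, showing the extensions l1 and l2 and the range of valid starting positions.]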

- For fixed l: O(1) longest common extension queries.
- For subproblem 3 on a string of length n: O(n) longest common extension queries.
- Entire algorithm on a string of length n: let T(n) denote the number of longest common extension queries for a string of length n. Then T(n) = 2T(n/2) + 2n, so T(n) = O(n log n).
- Including output: O(n log n + z) total.
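The bound on T(n) follows from unrolling the recurrence. The expansion below is the standard one, under the simplifying assumptions that n is a power of two and T(1) is a constant.

```latex
\begin{align*}
T(n) &= 2\,T(n/2) + 2n \\
     &= 4\,T(n/4) + 2n + 2n \\
     &= 2^{j}\,T(n/2^{j}) + 2jn \\
     &= n\,T(1) + 2n\log_2 n \qquad (j = \log_2 n) \\
     &= O(n \log n).
\end{align*}
```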

Inexact Matching

- Given: pattern P, text T, fixed number k.
- k-mismatch of P: a |P|-length substring of T that matches at least |P|-k characters of P (i.e. it matches P with at most k mismatches).
- The k-mismatch problem: find all k-mismatches of P in T.

P = bend

T = abentbananaend

k = 2

- T contains three 2-mismatches of P: bent at position 2 (1 mismatch), bana at position 6 (2 mismatches) and aend at position 11 (1 mismatch).

- Notation: |P|=n, |T|=m, k independent of n and m (k<<n).
- General idea:
- For each position i in T, determine whether a k-mismatch of P begins at position i.
- To do this efficiently: successively execute up to k+1 longest common extension queries.
- A k-mismatch of P begins at position i iff these extensions reach the end of P.

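[Figure: P (positions 1 to n) aligned with T at position i; queries 1, 2 and 3 are successive longest common extensions, each resuming just after the previous mismatch.]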

1. j ← 1, i’ ← i, count ← 0.
2. Compute the length l of the longest common extension starting at positions j of P and i’ of T.
3. If j + l = n + 1, then a k-mismatch of P occurs in T starting at i; stop.
4. If count < k, then count ← count + 1, j ← j + l + 1, i’ ← i’ + l + 1, and go to step 2. Else, a k-mismatch of P does not occur in T starting at i; stop.

- Preprocessing of T and P for longest common extension queries: O(m).
- For each index i = 1,...,m-n+1 of T, up to k+1 longest common extension queries: O(k) per index, O(km) total.
- Total: O(km) time.
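The four steps translate almost line by line into the sketch below. The extension query is answered by direct comparison here, which is an assumption made to keep the example short; with constant-time queries the same loop gives the O(km) bound. The names `extension` and `k_mismatch_positions` are invented for this example, and each hit is reported as (start position, number of mismatches).

```python
def extension(p, t, j, i):
    """Longest common extension of P from 1-based j and T from 1-based i."""
    l = 0
    while j + l <= len(p) and i + l <= len(t) and p[j + l - 1] == t[i + l - 1]:
        l += 1
    return l


def k_mismatch_positions(p, t, k):
    """All (i, mismatches) such that P matches T at i with at most k mismatches."""
    n, m = len(p), len(t)
    hits = []
    for i in range(1, m - n + 2):
        j, i2, count = 1, i, 0             # step 1
        while True:
            l = extension(p, t, j, i2)     # step 2
            if j + l == n + 1:             # step 3: reached the end of P
                hits.append((i, count))
                break
            if count < k:                  # step 4: skip the mismatch
                count += 1
                j += l + 1
                i2 += l + 1
            else:
                break
    return hits


print(k_mismatch_positions("bend", "abentbananaend", 2))
```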

- definition (k-mismatch tandem repeat): a tandem repeat in which the two copies differ by at most k mismatches.
- example: axab|aybb (a 2-mismatch tandem repeat).
- goal: find all k-mismatch tandem repeats.
- simple solution: for each feasible pair of starting position i and middle position j, check if S[i...j-1] and S[j...2j-i-1] match with at most k mismatches. O(kn^2) total.

- a faster solution exists: O(kn log(n/k) + z).
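The simple solution can be sketched as below: every (start, middle) pair is checked with at most k+1 extension jumps. As an assumption of this example, the extension lengths come from a quadratic-size table rather than a suffix tree, which is enough for the O(kn^2) bound. The function name `k_mismatch_tandem_repeats` is made up; pairs are 1-based.

```python
def k_mismatch_tandem_repeats(s, k):
    """All 1-based (start, middle) pairs of k-mismatch tandem repeats in s."""
    n = len(s)
    # ext[a][b]: longest common extension of s from 0-based a and b.
    ext = [[0] * (n + 1) for _ in range(n + 1)]
    for a in range(n - 1, -1, -1):
        for b in range(n - 1, -1, -1):
            if s[a] == s[b]:
                ext[a][b] = ext[a + 1][b + 1] + 1
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            length = j - i
            if j + length > n:
                break                      # second copy would run past the end
            offset, mismatches = 0, 0
            while True:
                offset += ext[i + offset][j + offset]
                if offset >= length:
                    break                  # both copies fully compared
                mismatches += 1
                if mismatches > k:
                    break
                offset += 1                # jump over the mismatching position
            if mismatches <= k:
                found.append((i + 1, j + 1))
    return found


print(k_mismatch_tandem_repeats("axabaybb", 2))
```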

- O(kn log(n/k) + z) solution: use the same divide and conquer algorithm.
- subproblem 3 for fixed l: find all k-mismatch tandem repeats of length 2l whose first copy contains h.
- let q = h+l.
- run k successive longest common extension queries forward from h and q. mark every mismatch with ->.
- run k successive longest common extension queries backward from h-1 and q-1. mark every mismatch with <-
- position t in [h+1,q] is a middle position of a k-mismatch tandem repeat iff the number of -> mismatches in positions h,...,t-1 plus the number of <- mismatches in positions t,...,q-1 is ≤ k.

- a c c e e | ab c d e | a c c d d
- calculate sums only for positions with arrows, to get intervals of legal midpoints.
- O(k) time per fixed l, and any l ≤ k need not be checked.

[Figure: positions h, t and q marked on the example string.]