
# Selected Applications of Suffix Trees

### Reminder: suffix tree

Suffix tree for string S of length m:

• rooted directed tree with m leaves numbered 1,...,m.

• each internal node, except the root, has at least 2 children.

• each edge labeled with a nonempty substring of S.

• edges out of a node begin with different characters.

• path from the root to leaf i spells out suffix S[i...m].

• Each substring a of S appears on some unique path from the root.

• If a ends at point p, the leaves below p mark all its occurrences.

a occurs in S starting at position j ⇔ a is a prefix of S[j...m] ⇔ a labels an initial part of the path from the root to leaf j.

Example: S = xabxa\$ (positions 1 2 3 4 5 6).

[Figure: suffix tree for xabxa\$ with leaves numbered 1 to 6.]

Find all occurrences of pattern P in text T.

• Build suffix tree for T ⇒ O(m) (Ukkonen), where m = |T|.

• Match P along a path from the root ⇒ O(1) per character (finite alphabet) ⇒ O(n) total, where n = |P|.

• If P fully matches a path, then the leaves below mark all starting positions of P in T ⇒ O(k), where k = number of occurrences.
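The three steps above can be sketched in a few lines. This is only an illustration: it swaps the linear-time suffix tree for a naive suffix trie built in O(m^2) time and space, and it stores the start positions at every node instead of collecting them from the leaves. The names `build_suffix_trie` and `find_occurrences` are hypothetical.

```python
def build_suffix_trie(text):
    """Naive suffix trie (not Ukkonen): O(m^2) time and space.

    Each node keeps the 1-based start positions of all suffixes passing
    through it, which plays the role of "the leaves below the node".
    """
    text += "$"  # terminator, assumed not to occur in text
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def find_occurrences(root, pattern):
    """Match pattern from the root; return all 1-based start positions."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []  # pattern falls off the trie: no occurrence
    return sorted(node["starts"])


trie = build_suffix_trie("xabxa")
print(find_occurrences(trie, "xa"))
```

For T = xabxa the pattern xa is found at positions 1 and 4, the two leaves below the point where the match ends.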

• ms(i) – the length of the longest substring of T starting at position i that matches a substring somewhere in P.

• example: T = abcxabcdex, P = wyabcwzqabcdw: ms(1)=3, ms(5)=4.

• There is an occurrence of P starting at position i of T iff ms(i)=|P|.
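As a reference point for the definition, ms(i) can be computed by brute force. The sketch below is not the linear-time method; it enumerates all substrings of P (quadratically many) and is only meant to check small examples. The function name `matching_statistics` is hypothetical.

```python
def matching_statistics(t, p):
    """Brute-force ms(i) for every position of t (returned list is 0-based)."""
    # All non-empty substrings of p; fine for small inputs only.
    substrings = {p[a:b] for a in range(len(p)) for b in range(a + 1, len(p) + 1)}
    ms = []
    for i in range(len(t)):
        length = 0
        # Extend while the next longer prefix of t[i:] still occurs in p.
        while i + length < len(t) and t[i:i + length + 1] in substrings:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("abcxabcdex", "wyabcwzqabcdw")
print(ms[0], ms[4])  # ms(1) and ms(5) in the 1-based notation of the text
```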

Goal: Compute ms(i) for each position i in T, in O(m) total time, using only a suffix tree for P.

• Naive way: match T[i...m] starting from the root for every i ⇒ more than O(m) total.

• Build suffix tree for P (Ukkonen) and keep suffix links.

• suffix link: pointer from internal node v with path-label xα to node s(v) with path-label α (x is a character, α a substring).

base case: For ms(1), match T[1...m] from the root.

general case: Suppose the matching path for ms(i) ended at point b. Then for ms(i+1):

• Let v be the first internal node at or above b.

• If there is no such v – search from the root.

• Otherwise – follow the suffix link from v to s(v) and search from s(v): path_label(v) = xα is a prefix of T[i...m] ⇒ path_label(s(v)) = α is a prefix of T[i+1...m].

• Let β denote the string between node v and point b.

• substring xαβ in P matches a prefix of T[i...m].

• substring αβ in P matches a prefix of T[i+1...m].

• Traverse the path labeled β out of s(v) using the skip/count trick (time proportional to the number of nodes on the path).

• From the end of β, match single characters (starting with the first character that didn’t match for ms(i)).

In the search for ms(i+1):

• back up at most one edge from b to v ⇒ O(1).

• traverse suffix link from v to s(v) ⇒ O(1).

• traverse a β-path from s(v) in time proportional to the number of nodes on it ⇒ O(m) total.

• perform additional comparisons starting with the first character that didn’t match for ms(i) ⇒ O(m) total.

### Ziv-Lempel data compression

For any position i in string S of length m:

• Prior_i - longest prefix of S[i...m] that occurs as a substring of S[1...i-1].

• l_i - length of Prior_i.

• s_i - starting position of the left-most copy of Prior_i (when l_i > 0).

Example: S = abaxcabaxabz, Prior_7 = bax, l_7 = 3, s_7 = 2.

• Copy of Prior_i starting at s_i is totally contained in S[1...i-1].

• Suppose the text S[1...i-1] has been represented (perhaps in compressed form) and l_i > 0.

• Then Prior_i need not be explicitly represented.

• The pair (s_i,l_i) points to an earlier occurrence of Prior_i.

• Example: S = abaxcabaxabz: the occurrence of bax at position 7 is represented by (2,3).

    i := 1
    repeat
        compute l_i and s_i
        if l_i > 0 then
            output (s_i, l_i)
            i := i + l_i
        else
            output S(i)
            i := i + 1
    until i > m

S1 = a b a c a b a x a b z

⇒ a b (1,1) c (1,3) x (1,2) z

S2 = ab ababababababababababababababab

⇒ ab (1,2) (1,4) (1,8) (1,16)

S = (ab)^k ⇒ compressed representation is O(log k)

• Process the compressed string left to right.

• Any pair (s_i,l_i) in the representation points to a substring that has already been fully decompressed.
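The compression loop and the left-to-right decompression can be tried out with a brute-force search for the earlier copy. This sketch does not use a suffix tree; `prior`, `compress` and `decompress` are hypothetical names, and `str.find` stands in for the tree lookup (it returns the left-most match, which is what the definition of the starting position asks for).

```python
def prior(s, i):
    """Return (s_i, l_i) for 1-based position i by brute force; (0, 0) if l_i = 0."""
    prefix = s[:i - 1]                      # S[1...i-1]
    best_start, best_len = 0, 0
    length = 1
    while i - 1 + length <= len(s):
        pos = prefix.find(s[i - 1:i - 1 + length])  # left-most copy inside the prefix
        if pos == -1:
            break
        best_start, best_len = pos + 1, length
        length += 1
    return best_start, best_len


def compress(s):
    out, i = [], 1
    while i <= len(s):
        start, length = prior(s, i)
        if length > 0:
            out.append((start, length))     # pointer to an earlier copy
            i += length
        else:
            out.append(s[i - 1])            # literal character
            i += 1
    return out


def decompress(items):
    text = ""
    for item in items:
        if isinstance(item, tuple):
            start, length = item
            # The referenced copy lies entirely in the text produced so far.
            text += text[start - 1:start - 1 + length]
        else:
            text += item
    return text


print(compress("abacabaxabz"))
print(compress("ab" * 16))
```

On the two example strings this reproduces the pair sequences shown above, and `decompress(compress(s))` returns `s`.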

Computing (s_i,l_i)

• The algorithm does not request (s_i,l_i) for any position i already in the compressed part of S, because after outputting (s_i,l_i) it jumps ahead with i := i + l_i.

• For total O(m) time, find each requested pair (s_i,l_i) in O(l_i) time.

Before compression:

• Build a suffix tree T for S.

• For each node v, compute c_v:

• the smallest leaf index in v’s subtree.

• equivalently, the starting position of the leftmost copy of the substring that labels the path from the root to v.

• O(m) time.

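computing (s_i,l_i):

[Figure: the path from the root that matches S[i...m] on its way to leaf i. Point p, with path-label a, lies on the edge above node v; the leftmost copy of a starts at c_v, and the traversal continues while |a| + c_v ≤ i.]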

• To compute (s_i,l_i), traverse the unique path in T that matches a prefix of S[i...m]:

• Let: p - current point, v - first node at or below p.

• Traverse as long as: string_depth(p) + c_v ≤ i.

• At the last point p of traversal: l_i = string_depth(p), s_i = c_v.

• O(l_i) time.

Example: S = abababab (positions 1 2 3 4 5 6 7 8)

i=1: l_i=0 ⇒ output a

i=2: l_i=0 ⇒ output b

i=3: l_i=2, c_v=1 ⇒ output (1,2)

i=5: l_i=4, c_v=1 ⇒ output (1,4)
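The traversal rule can be checked on this example with a small sketch. It assumes a naive suffix trie (one node per character, quadratic size) in place of the suffix tree, so every point p is itself a node and string depth equals node depth. `build_trie_with_cv` and `prior_from_trie` are hypothetical names.

```python
def build_trie_with_cv(s):
    """Naive suffix trie of s + '$'; node['cv'] is the smallest 1-based suffix
    start passing through the node (leftmost occurrence of its path label)."""
    s += "$"  # terminator, assumed not to occur in s
    root = {"children": {}, "cv": 1}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            child = node["children"].get(ch)
            if child is None:
                # Suffixes are inserted left to right, so the first suffix
                # that creates a node gives the minimum start position.
                child = {"children": {}, "cv": start + 1}
                node["children"][ch] = child
            node = child
    return root


def prior_from_trie(root, s, i):
    """Walk along S[i...m] while string_depth + cv <= i; return (s_i, l_i)."""
    node, depth = root, 0
    while i - 1 + depth < len(s):
        child = node["children"].get(s[i - 1 + depth])
        # Stop when the leftmost copy would no longer end before position i.
        if child is None or (depth + 1) + child["cv"] > i:
            break
        node, depth = child, depth + 1
    return (node["cv"], depth) if depth > 0 else (0, 0)


root = build_trie_with_cv("abababab")
for i in (1, 2, 3, 5):
    print(i, prior_from_trie(root, "abababab", i))
```

A returned pair (0, 0) is this sketch's convention for l_i = 0, where the algorithm outputs the literal character instead.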

[Figure: suffix tree for abababab\$ with leaves numbered 1 to 8 and the c_v value (1 or 2) shown at each internal node; nodes v1 and v2 lie on the path labeled abab..., and the point at string depth 1 is marked.]

• Compress S as it is being input one character at a time.

• Possible since S[1...i-1] is known before computing s_i,l_i.

• Implementation: build suffix tree online ⇒ Ukkonen’s algorithm:

• In phase i, build implicit suffix tree T_i for prefix S[1...i].

Assume:

• The compression has been done for S[1...i-1].

• Implicit suffix tree T_{i-1} for S[1...i-1] has been built.

• c_v values are given for each node v in T_{i-1}.

Then (s_i,l_i) can be obtained in O(l_i) time.

Suppose we had a suffix tree for S[1...i-1] with c_v values ⇒ we could find (s_i,l_i) in O(l_i) time: l_i = string_depth(p), s_i = c_v.

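[Figure: the path S(i), S(i+1), ..., S(k-1) from the root ends at point p above node v; the next character c on the edge differs from S(k).]

The missing leaves in the implicit suffix tree are not needed.

[Figure: the same path in the suffix tree and in the implicit suffix tree for S[1...i-1]. In the suffix tree, leaf j (edge label ending in \$) hangs off the path to leaf h, whose edge is labeled S(h) ... S(i-1) and contains S(j) ... S(i-1) as a prefix; since h < j, leaf j is missing from the implicit tree but leaf h is still present.]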

c_v values for all implicit suffix trees can be computed in total O(m) time.

• In Ukkonen’s algorithm:

• Only extension rule 2 updates c_v values.

• Whenever a new internal node v is created by splitting an edge (u,w): c_v ← c_w.

• Whenever a new leaf j is created: c_j ← j.

⇒ constant update time per new node.

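Updating c_v values

[Figure: the two cases of extension rule 2 for suffix S(j)...S(i+1). New leaf: leaf j is attached below an existing node v by an edge starting with S(i+1). New leaf and new node: edge (u,w) is split after S(i) by a new internal node v, and leaf j is attached below v by an edge starting with S(i+1).]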

• Base case: output S(1) and build T_1.

• General case: Suppose S[1...i-1] has been compressed and T_{i-1} with c_v values has been constructed.

• Match S(i),S(i+1),... along a path from the root in T_{i-1}.

• Let S(k) be the first character that doesn’t match.

• Find (s_i,l_i).

• If l_i = 0, output S(i) and build T_i with c_v values.

• If l_i > 0, output (s_i,l_i) and build T_i,...,T_{k-1} with c_v values.

• Total time: O(m).

### Maximal Repetitive Structures

• A maximal pair in string S: a pair of identical substrings a and b in S s.t. the character to the immediate left (right) of a is different from the character to the immediate left (right) of b.

• Extending a and b in either direction would destroy the equality of the two strings.

• Example: S = xabcyiiizabcqabcyrxar (e.g. the occurrences of abc at positions 2 and 10).

• Overlap is allowed: S = cxxaxxaxxbcxxaxxaaxxaxxb

• To allow a prefix or suffix of S to be part of a maximal pair: S → #S\$ (#,\$ don’t appear in S). Example: #abcxabc\$

• A maximal repeat in string S:

A substring of S that occurs in a maximal pair in S.

• Example: S = xabcyiiizabcqabcyrxar

maximal repeats: abc, abcy, ...

Finding All Maximal Repeats In Linear Time

• Given: String S of length n.

• Goal: Find all maximal repeats in O(n) time.

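• Lemma: Let T be a suffix tree for S. If string a is a maximal repeat in S, then a is the path-label of an internal node v in T.

[Figure: for S = xabcyiiizabcqabcyrxar, the path labeled abc from the root ends at an internal node v whose outgoing edges begin with y and q.]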

• There can be at most n maximal repeats in any string of length n.

• Proof:

by the lemma, since T has at most n internal nodes.

Which internal nodes correspond to maximal repeats?

• The left character of leaf i in T is S(i-1).

• Node v of T is left diverse if at least 2 leaves in v’s subtree have different left characters.

• A leaf can’t be left diverse.

• Left diversity propagates upward.

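Example: S = #xabxa\$ (positions 1 2 3 4 5 6 refer to xabxa\$).

[Figure: suffix tree for xabxa\$ with each leaf annotated by its left character (leaf 1: #, leaf 2: x, leaf 3: a, leaf 4: b, leaf 5: x, leaf 6: a). The internal node with path-label xa is left diverse, so xa is a maximal repeat; the node with path-label a is not left diverse.]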

Theorem: The string a labeling the path to an internal node v of T is a maximal repeat ⇔ v is left diverse.

Proof of ⇒:

• Suppose a is a maximal repeat ⇒

• It participates in a maximal pair ⇒

• It has at least two occurrences with distinct left characters: xa, ya, x≠y ⇒

• Let i and j be the two starting positions of a. Then leaves i and j are in v’s subtree and have different left characters x,y ⇒

• v is left diverse.

Proof of ⇐:

• Suppose v is left diverse ⇒ there are substrings xap and yaq in S, x≠y (p and q are the characters following a).

• If p≠q ⇒ a’s occurrences in xap and yaq form a maximal pair ⇒ a is a maximal repeat.

• If p=q ⇒ since v is a branching node, there is a substring zar in S, r≠p. If z≠x ⇒ it forms a maximal pair with xap. If z≠y ⇒ it forms a maximal pair with yap. In either case, a is a maximal repeat.

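[Figure: the two cases of the ⇐ proof. Case 1 (p≠q): below v, the leaf with left char x lies under the edge starting with p and the leaf with left char y under the edge starting with q. Case 2 (p=q): the leaves with left chars x and y both lie under the edge starting with p, and a leaf with left char z lies under another edge starting with r.]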

• Node v in T is a frontier node if:

• v is left diverse.

• none of v’s children are left diverse.

• Each node at or above the frontier is left diverse.

• The subtree of T from the root down to the frontier nodes is a compact representation of the set of all maximal repeats of S.

• The representation takes O(n) space, though the total length of the maximal repeats may be larger.

• Build suffix tree T.

• Find all left diverse nodes in linear time.

• Delete all nodes that aren’t left diverse, to achieve the compact representation.

Finding all left diverse nodes in linear time:

• Traverse T bottom-up, recording for each node:

• either that it is left diverse

• or the left character common to all leaves in its subtree.

• For each leaf: record its left character.

• For each internal node v:

• If any child is left diverse ⇒ v is left diverse.

• Else if all children have a common character x ⇒ record x for v.

• Else record that v is left diverse.
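The bottom-up marking can be sketched on a naive suffix trie, assuming that a trie node with at least two children plays the role of an internal suffix-tree node. This is quadratic in time and space and recursive (so only suitable for short strings); the names `DIVERSE` and `maximal_repeats` are hypothetical.

```python
DIVERSE = object()  # marker: the subtree holds at least two different left characters


def maximal_repeats(s):
    t = "#" + s + "$"  # sentinels, assumed not to occur in s
    root = {"children": {}, "left": None}
    for start in range(1, len(t)):
        node = root
        for ch in t[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "left": None})
        node["left"] = t[start - 1]  # leaf: remember its left character

    found = []

    def visit(node, label):
        if not node["children"]:
            return node["left"]  # a leaf is never left diverse
        marks = {visit(child, label + ch) for ch, child in node["children"].items()}
        if DIVERSE in marks or len(marks) > 1:
            mark = DIVERSE  # a diverse child, or children disagree
        else:
            mark = next(iter(marks))  # one left character common to all leaves
        # Branching, non-root, left diverse node <=> maximal repeat.
        if mark is DIVERSE and len(node["children"]) >= 2 and label:
            found.append(label)
        return mark

    visit(root, "")
    return sorted(found)


print(maximal_repeats("xabcyiiizabcqabcyrxar"))
```

For the running example the output contains abc and abcy, and also the shorter repeats a, i, ii, r and xa.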

Finding All Maximal Pairs In Linear Time

• Not every two occurrences of a maximal repeat form a maximal pair.

Example: S = xabcyiiizabcqabcyrxar

• There can be more than O(n) maximal pairs.

• The algorithm is O(n+k) where k is the number of maximal pairs.

For each node u and character x: keep a list of all leaf numbers below u whose left character is x.

To find all maximal pairs of a (the path-label of node v, with children v1 and v2):

For each character x, form the cartesian product of the list for x at v1 with every list for a character ≠ x at v2.

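[Figure: node v with path-label a has children v1 (edge starting with p) and v2 (edge starting with q); leaf i with left char x lies below v1 and leaf j with left char y lies below v2.]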

• Build suffix tree T for S.

• Record the left character of each leaf.

• Traverse T bottom-up.

• At each node v with path-label a:

• Output all maximal pairs of a: cartesian product of lists (u,x) and (u’,x’) for each pair of children u ≠ u’ and pair of characters x ≠ x’.

• Create the lists for node v by linking the lists of v’s children.

• Suffix tree construction ⇒ O(n).

• Bottom-up traversal including all list-linking ⇒ O(n).

• All cartesian product operations ⇒ O(k), where k is the number of maximal pairs.

• Total O(n+k).
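For checking small inputs, maximal pairs can also be listed directly from the definition. The sketch below is a brute-force O(n^2)-pairs reference, not the O(n+k) list-linking algorithm: two start positions with different left characters are extended to the right as far as they agree, which forces the right characters to differ. `maximal_pairs` is a hypothetical name.

```python
def maximal_pairs(s):
    """Return all maximal pairs as (i, j, length) with 1-based starts i < j."""
    t = "#" + s + "$"  # sentinels, assumed not to occur in s; t[i] is S(i)
    n = len(s)
    pairs = []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if t[i - 1] == t[j - 1]:
                continue  # same left character: extendable to the left
            length = 0
            # Extend to the right; the '$' sentinel guarantees termination.
            while t[i + length] == t[j + length]:
                length += 1
            if length > 0:
                pairs.append((i, j, length))
    return pairs


print(maximal_pairs("abcxabc"))
```

On xabcyiiizabcqabcyrxar the occurrences of abc at positions 2 and 14 show up only as the pair for abcy (length 4), which illustrates that not every two occurrences of a maximal repeat form a maximal pair.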

Finding All Supermaximal Repeats In Linear Time

• supermaximal repeat: a maximal repeat that isn’t a substring of any other maximal repeat.

• Example: S = xabcyiiizabcqabcyrxar: abcy is supermaximal, abc isn’t.

• Theorem: A left diverse internal node v in the suffix tree for S represents a supermaximal repeat iff

• all of v’s children are leaves

• and each has a distinct left character
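The theorem can be restated in terms of occurrences, which gives a simple brute-force check: "all children are leaves" means the characters following the occurrences are pairwise distinct, and the second condition means the left characters are pairwise distinct. The sketch below uses that restatement (an assumption of this example, and cubic in the worst case); `supermaximal_repeats` is a hypothetical name.

```python
from collections import defaultdict


def supermaximal_repeats(s):
    t = "#" + s + "$"  # sentinels, assumed not to occur in s; t[i] is S(i)
    n = len(s)
    contexts = defaultdict(list)  # substring -> [(left char, right char), ...]
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            contexts[t[i:j + 1]].append((t[i - 1], t[j + 1]))
    result = []
    for sub, ctx in contexts.items():
        lefts = {left for left, _ in ctx}
        rights = {right for _, right in ctx}
        # At least two occurrences, all right characters distinct (children are
        # leaves) and all left characters distinct.
        if len(ctx) >= 2 and len(lefts) == len(ctx) and len(rights) == len(ctx):
            result.append(sub)
    return sorted(result)


print(supermaximal_repeats("xabcyiiizabcqabcyrxar"))
```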

### Longest Common Extension

Preprocess strings S1 and S2 s.t. the following queries can be computed in O(1) time each:

• Given index pair (i,j), find the length of the longest substring of S1 starting at position i that matches a substring of S2 starting at position j.

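Example: if S1 continues with abcdzzz from position i and S2 continues with abcdefg from position j, the longest common extension for (i,j) has length 4 (abcd).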

Preprocess: O(|S1|+|S2|)

• Build generalized suffix tree T for S1 and S2.

• Preprocess T for constant-time LCA queries.

• Compute string-depth of every node.

Query for (i,j):

• Find LCA node v of the leaves corresponding to suffix i of S1 and suffix j of S2.

• Return string-depth(v).
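The query itself is easy to state in code. The sketch below answers it by direct character comparison, so it takes time proportional to the answer rather than O(1); it is only a stand-in for the generalized suffix tree with LCA preprocessing. `lce` is a hypothetical name and positions are 1-based as in the text.

```python
def lce(s1, i, s2, j):
    """Length of the longest common extension of s1 from i and s2 from j."""
    length = 0
    while (i - 1 + length < len(s1) and j - 1 + length < len(s2)
           and s1[i - 1 + length] == s2[j - 1 + length]):
        length += 1
    return length


print(lce("xxabcdzzz", 3, "yabcdefg", 2))
```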

### Tandem Repeats

tandem repeat: a string a that can be written as a = bb, where b is a substring.

Example: s = xababababy contains the tandem repeats ab|ab (three occurrences, b = ab), ba|ba (two occurrences, b = ba) and abab|abab (b = abab).

note: b is not required to be of maximal length.

For each feasible pair of start position i and middle position j:

• Perform a longest common extension query from i and j.

• If the extension from i reaches j or beyond, (i,j) defines a tandem repeat: its two copies are S[i...j-1] and S[j...2j-i-1], each of length j-i.

• Preprocess for longest common extension: O(n).

• O(n^2) feasible pairs, O(1) time to check each one.

• O(n^2) total.
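A minimal sketch of this quadratic method follows. It assumes a naive character-by-character extension in place of the O(1) query, so its running time is worse than O(n^2) in general, but the enumeration of feasible pairs (i,j) is the same. `tandem_repeats` is a hypothetical name.

```python
def tandem_repeats(s):
    """Return all (i, j) with 1-based start i and middle j of a tandem repeat."""
    n = len(s)

    def lce(i, j):
        # Naive longest common extension inside s (stand-in for the O(1) query).
        length = 0
        while j - 1 + length < n and s[i - 1 + length] == s[j - 1 + length]:
            length += 1
        return length

    found = []
    for i in range(1, n + 1):
        j = i + 1
        while 2 * j - i - 1 <= n:          # second copy must fit inside s
            if lce(i, j) >= j - i:         # extension from i reaches j
                found.append((i, j))
            j += 1
    return found


print(tandem_repeats("xababababy"))
```

For xababababy this lists six pairs, matching the six tandem repeats of the example; for a string of n equal characters the count grows quadratically.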

• Due to Landau & Schmidt.

• O(n log n + z) time.

• z = total number of tandem repeats in S.

• z can be as large as Θ(n^2).

• Example: all n characters are the same.

• In practice, z is expected to be smaller.

Let h = n/2.

• Find all tandem repeats contained entirely in the first half of S (up to h).

• Find all tandem repeats contained entirely in the second half of S (after h).

• Find all tandem repeats where the first copy contains h.

• Find all tandem repeats where the second copy contains h.

• 1 and 2  solved recursively.

• 3 and 4  symmetric to each other.

• Remains to show: solution for 3.

• For each l = 1,...,h, find all tandem repeats of length exactly 2l whose first copy contains h.

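[Figure: S around positions h-1, h and q-1, q, which are l apart. The forward extensions X1 and X2 of length l1 start at h and q; the backward extensions Y1 and Y2 of length l2 end at h-1 and q-1.]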

1. Let q = h+l.

2. Compute longest common extension from h and q. Let l1 denote its length.

3. Compute longest common extension from h-1 and q-1 in reverse direction. Let l2 denote its length.

4. There is a tandem repeat of length 2l whose first copy contains h iff l1 ≥ 1 and l1 + l2 ≥ l.

5. If the condition holds, output starting positions: Max(h-l2, h-l+1), ..., Min(h+l1-l, h).
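Steps 1 to 5 can be sketched for a single midpoint h as follows. Both extensions are computed naively here (an assumption of this example; the algorithm uses constant-time queries), and `repeats_with_first_copy_at` is a hypothetical name. Results are pairs (start, l) with 1-based start and copy length l.

```python
def repeats_with_first_copy_at(s, h):
    """All tandem repeats (start, l) of s whose first copy contains position h."""
    n = len(s)

    def forward(i, j):
        # Longest common extension to the right from positions i and j.
        k = 0
        while i + k <= n and j + k <= n and s[i + k - 1] == s[j + k - 1]:
            k += 1
        return k

    def backward(i, j):
        # Longest common extension to the left from positions i and j.
        k = 0
        while i - k >= 1 and j - k >= 1 and s[i - k - 1] == s[j - k - 1]:
            k += 1
        return k

    found = []
    l = 1
    while h + l <= n:
        q = h + l
        l1 = forward(h, q)
        l2 = backward(h - 1, q - 1)
        if l1 >= 1 and l1 + l2 >= l:
            first = max(h - l2, h - l + 1)
            last = min(h + l1 - l, h)
            found.extend((start, l) for start in range(first, last + 1))
        l += 1
    return found


print(repeats_with_first_copy_at("xababababy", 5))
```

With h = 5 on xababababy this reports ab|ab starting at 4, ba|ba starting at 5 and abab|abab starting at 2, the three repeats whose first copy covers position 5.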

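[Figure: positions h-l2, h+l1-l, h, h+l-l2, q and h+l1 along S, with the backward extension of length l2 ending before h and q and the forward extension of length l1 starting at h and q.]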

• For fixed l:

• O(1) longest common extension queries.

• For subproblem 3 on a string of length n:

• O(n) longest common extension queries.

• Entire algorithm on a string of length n:

• Let T(n) denote the number of longest common extension queries for a string of length n. T(n) = 2T(n/2) + 2n ⇒ T(n) = O(n log n).

• Including output: O(nlogn + z) total.

### Inexact Matching

• Given: pattern P, text T, fixed number k.

• k-mismatch of P: a |P|-length substring of T that matches at least |P|-k characters of P (i.e. it matches P with at most k mismatches).

• The k-mismatch problem: Find all k-mismatches of P in T.

Example: P = bend, T = abentbananaend, k = 2.

• T contains three 2-mismatches of P: bent at position 2 (1 mismatch), bana at position 6 (2 mismatches) and aend at position 11 (1 mismatch).

• Notation: |P|=n, |T|=m, k independent of n and m (k<<n).

• General idea:

• For each position i in T, determine whether a k-mismatch of P begins at position i.

• To do this efficiently: successively execute up to k+1 longest common extension queries.

• A k-mismatch of P begins at position i iff these extensions reach the end of P.

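[Figure: P (positions 1 ... n) aligned with T at position i. Query 1 extends from the start of P, and queries 2 and 3 each resume just past the previous mismatch, for example at position 4 of P and i+3 of T.]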

1. j ← 1, i’ ← i, count ← 0.

2. Compute the length l of the longest common extension starting at positions j of P and i’ of T.

3. If j+l = n+1 then a k-mismatch of P occurs in T starting at i; stop.

4. If count < k then count ← count+1, j ← j+l+1, i’ ← i’+l+1, go to step 2. Else, a k-mismatch of P does not occur in T starting at i; stop.

• Preprocessing of T and P for longest common extension queries ⇒ O(m).

• For each index i=1,...,m-n+1 of T, up to k+1 longest common extension queries ⇒ O(k) per index ⇒ O(km) total.

• Total O(km) time.
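The four steps translate almost line by line into code. In this sketch the longest common extension is computed by naive comparison (an assumption; with constant-time queries the loop does O(k) work per position), and `lce` and `k_mismatch_positions` are hypothetical names.

```python
def lce(p, j, t, i):
    """Naive longest common extension of p from j and t from i (1-based)."""
    length = 0
    while (j - 1 + length < len(p) and i - 1 + length < len(t)
           and p[j - 1 + length] == t[i - 1 + length]):
        length += 1
    return length


def k_mismatch_positions(p, t, k):
    """1-based positions of t where p matches with at most k mismatches."""
    n, m = len(p), len(t)
    hits = []
    for i in range(1, m - n + 2):
        j, i2, count = 1, i, 0              # step 1
        while True:
            l = lce(p, j, t, i2)            # step 2
            if j + l == n + 1:              # step 3: reached the end of P
                hits.append(i)
                break
            if count < k:                   # step 4: skip over one mismatch
                count += 1
                j += l + 1
                i2 += l + 1
            else:
                break
    return hits


print(k_mismatch_positions("bend", "abentbananaend", 2))
```

For P = bend, T = abentbananaend and k = 2 this reports positions 2, 6 and 11.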

• definition: tandem repeat in which the two copies differ by at most k mismatches.

• example: axab|aybb

• goal: find all k-mismatch tandem repeats.

• simple solution:

For each feasible pair of starting position i and middle position j, check if S[i...j-1] and S[j...2j-i-1] match with at most k mismatches.

⇒ O(kn^2)

• faster solution exists: O(kn log(n/k) + z).
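The simple solution can be sketched as below. The mismatch count is taken by direct comparison of the two halves, which costs O(j-i) per pair instead of the O(k) achievable with longest common extension queries; this substitution and the name `k_mismatch_tandem_repeats` are assumptions of the example.

```python
def k_mismatch_tandem_repeats(s, k):
    """All (i, j), 1-based start and middle, whose halves differ in <= k places."""
    n = len(s)
    found = []
    for i in range(1, n + 1):
        j = i + 1
        while 2 * j - i - 1 <= n:           # second half must fit inside s
            half = j - i
            first = s[i - 1:j - 1]
            second = s[j - 1:j - 1 + half]
            mismatches = sum(a != b for a, b in zip(first, second))
            if mismatches <= k:
                found.append((i, j))
            j += 1
    return found


print((1, 5) in k_mismatch_tandem_repeats("axabaybb", 2))
```

The example axab|aybb has two mismatching positions, so it is found for k = 2 but not for k = 1; with k = 0 the function reduces to exact tandem repeats.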


• same divide and conquer algorithm.

• subproblem 3 for fixed l: find all k-mismatch tandem repeats of length 2l whose first copy contains h.

• let q = h+l.

• run k successive longest common extension queries forward from h and q. mark every mismatch with ->.

• run k successive longest common extension queries backward from h-1 and q-1. mark every mismatch with <-

• position t in [h+1,q] is a middle position of a k-mismatch tandem repeat iff the number of -> mismatches in positions h,...,t-1 plus the number of <- mismatches in positions t,...,q-1 is ≤ k.

• a c c e e | ab c d e | a c c d d

• calculate sums only for positions with arrows, to get intervals of legal midpoints.

• O(k) time per fixed l, and any l ≤ k need not be checked.
