Selected Applications of Suffix Trees


Reminder – suffix tree

Suffix tree for string S of length m:

  • rooted directed tree with m leaves numbered 1,...,m.

  • each internal node, except the root, has at least 2 children.

  • each edge labeled with a nonempty substring of S.

  • edges out of a node begin with different characters.

  • path from the root to leaf i spells out suffix S[i...m].


Reminder – suffix tree (continued)

  • Each substring a of S appears on some unique path from the root.

  • If a ends at point p, the leaves below p mark all its occurrences.

    a occurs in S starting at position j

    ⇔ a is a prefix of S[j...m]

    ⇔ a labels an initial part of the path from the root to leaf j.


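Example: S = xabxa$ (positions 1 2 3 4 5 6)

[Figure: suffix tree for xabxa$ with leaves numbered 1 to 6. The root has four outgoing edges, labeled xa, a, bxa$ and $; the nodes reached by xa and by a each branch into bxa$ and $.]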


Exact string matching

Find all occurrences of pattern P in text T.

  • Build suffix tree for T → O(m), where m = |T| (Ukkonen).

  • Match P along a path from the root → O(1) per character (finite alphabet) → O(n) total, where n = |P|.

  • If P fully matches a path, then the leaves below mark all starting positions of P in T → O(k), where k = number of occurrences.
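
The matching step above can be sketched in Python. This is a minimal illustration only: it builds a naive suffix trie (one character per edge, quadratic size) rather than Ukkonen's linear-time suffix tree, and the names `build_suffix_trie` and `find_occurrences` are made up for this example.

```python
def build_suffix_trie(text):
    """Naive suffix trie of text + '$' (assumes '$' is not in text).

    Each node is a dict from a character to a child node; the node that
    ends suffix i stores i (1-based) under the key "leaf".
    """
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def find_occurrences(root, pattern):
    """Match pattern from the root, then collect all leaf numbers below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []              # pattern falls off the trie: no occurrence
        node = node[ch]
    found, stack = [], [node]
    while stack:                   # iterative walk over the subtree
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


root = build_suffix_trie("xabxa")
print(find_occurrences(root, "xa"))
```

The walk over the subtree is what the O(k) term refers to in a real suffix tree; in this uncompressed trie the walk can be longer, which is why it is only a sketch.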


Matching Statistics

  • ms(i) – the length of the longest substring of T starting at position i that matches a substring somewhere in P.

  • Example: T = abcxabcdex, P = wyabcwzqabcdw → ms(1) = 3, ms(5) = 4.

  • There is an occurrence of P starting at position i of T iff ms(i)=|P|.
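
To make the definition concrete, here is a brute-force Python sketch that computes ms(i) directly from the definition. It is not the linear-time suffix-link method; the function name `matching_statistics` is an assumption of this example.

```python
def matching_statistics(T, P):
    """Return a list ms where ms[i-1] = ms(i) for i = 1..|T|.

    ms(i) is the length of the longest prefix of T[i...m] that occurs
    somewhere in P. Brute force: grow the prefix while it is still a
    substring of P.
    """
    ms = []
    for i in range(len(T)):
        length = 0
        while i + length < len(T) and T[i:i + length + 1] in P:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("abcxabcdex", "wyabcwzqabcdw")
print(ms[0], ms[4])  # ms(1) and ms(5)
```

An occurrence of P starts at position i exactly when the entry for i equals len(P).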


Goal: Compute ms(i) for each position i in T, in O(m) total time, using only a suffix tree for P.

  • Naive way: for each i, match T[i...m] starting from the root → more than O(m) total.

    Using suffix links:

  • Build suffix tree for P (Ukkonen) and keep suffix links.

  • Suffix link: a pointer from an internal node v with path-label xa to the node s(v) with path-label a (x is a character, a is a substring).


Compute ms(i) in order

base case: For ms(1), match T[1...m] from the root.

general case: Suppose the matching path for ms(i) ended at point b, then for ms(i+1):

  • Let v be the first internal node at or above b.

  • If there is no such v – search from the root.

  • Otherwise – follow the suffix link from v to s(v) and search from s(v): path_label(v) = xa is a prefix of T[i...m] ⇒ path_label(s(v)) = a is a prefix of T[i+1...m].


skip / count

  • Let β denote the string between node v and point b.

  • The substring xaβ in P matches a prefix of T[i...m].

  • The substring aβ in P matches a prefix of T[i+1...m].

  • Traverse the path labeled β out of s(v) using the skip/count trick (time proportional to the number of nodes on the path).

  • From the end of β, match single characters (starting with the first character that didn’t match for ms(i)).


Time analysis

In the search for ms(i+1):

  • back up at most one edge from b to v → O(1).

  • traverse the suffix link from v to s(v) → O(1).

  • traverse the path out of s(v) labeled by the string between v and b, in time proportional to the number of nodes on it → O(m) total.

  • perform additional comparisons starting with the first character that didn’t match for ms(i) → O(m) total.



Definitions

For any position i in string S of length m:

  • Prior_i - the longest prefix of S[i...m] that also occurs as a substring of S[1...i-1].

  • l_i - the length of Prior_i.

  • s_i - the starting position of the left-most copy of Prior_i (defined when l_i > 0).

    Example: S = abaxcabaxabz, Prior_7 = bax, l_7 = 3, s_7 = 2.

  • The copy of Prior_i starting at s_i is totally contained in S[1...i-1].


Basic idea

  • Suppose the text S[1...i-1] has been represented (perhaps in compressed form) and l_i > 0.

  • Then Prior_i need not be explicitly represented.

  • The pair (s_i,l_i) points to an earlier occurrence of Prior_i.

  • Example: S = abaxcabaxabz, where the pair (2,3) stands for Prior_7 = bax.


Compression algorithm (outline)

    i := 1
    repeat
        compute l_i and s_i
        if l_i > 0 then
            output (s_i,l_i)
            i := i + l_i
        else
            output S(i)
            i := i + 1
    until i > m
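
The outline can be turned into a short Python sketch. Here (s_i, l_i) is found by brute force with substring search, not with a suffix tree, so the running time is not linear; `prior` and `compress` are names chosen for this illustration.

```python
def prior(S, i):
    """Brute-force (s_i, l_i) for 1-based position i; (0, 0) if l_i = 0."""
    seen = S[:i - 1]                      # S[1...i-1]
    length = 0
    # Grow the prefix of S[i...m] while it still occurs inside S[1...i-1].
    while i - 1 + length < len(S) and S[i - 1:i + length] in seen:
        length += 1
    if length == 0:
        return (0, 0)
    # str.find returns the leftmost occurrence (0-based), hence the + 1.
    return (seen.find(S[i - 1:i - 1 + length]) + 1, length)


def compress(S):
    """Follow the outline: emit pairs (s_i, l_i) or single characters."""
    out, i = [], 1
    while i <= len(S):
        s_i, l_i = prior(S, i)
        if l_i > 0:
            out.append((s_i, l_i))
            i += l_i
        else:
            out.append(S[i - 1])
            i += 1
    return out


print(compress("abacabaxabz"))
```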


Examples

S1 = a b a c a b a x a b z

compressed: a b (1,1) c (1,3) x (1,2) z

S2 = ab ababababababababababababababab

compressed: a b (1,2) (1,4) (1,8) (1,16)

S = (ab)^k → the compressed representation has O(log k) items.


Decompress

  • Process the compressed string left to right.

  • Any pair (s_i,l_i) in the representation points to a substring that has already been fully decompressed.
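
A left-to-right decompressor is only a few lines. This sketch assumes the compressed form is a Python list whose items are either single characters or 1-based (s, l) tuples; that encoding and the name `decompress` are assumptions of the example, not something the slides fix.

```python
def decompress(items):
    """Rebuild the string from characters and (s, l) pairs, left to right."""
    chars = []
    for item in items:
        if isinstance(item, tuple):
            s, l = item
            # The referenced copy lies entirely in the part already rebuilt.
            chars.extend(chars[s - 1:s - 1 + l])
        else:
            chars.append(item)
    return "".join(chars)


print(decompress(["a", "b", (1, 1), "c", (1, 3), "x", (1, 2), "z"]))
```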


Computing (s_i,l_i)

  • The algorithm does not request (s_i,l_i) for any position i already in the compressed part of S.

  • For total O(m) time, find each requested pair (s_i,l_i) in O(l_i) time.

    compute l_i and s_i
    if l_i > 0 then
        output (s_i,l_i)
        i := i + l_i


Implementation using suffix tree (1)

Before compression:

  • Build a suffix tree T for S.

  • For each node v, compute c_v:

    • the smallest leaf index in v’s subtree,

    • which is also the starting position of the leftmost copy of the substring that labels the path from the root to v.

  • O(m) time.


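Implementation using suffix trees (2)

[Figure: computing (s_i,l_i). The path from the root towards leaf i spells S[i...m]; p is a point on this path with path-label a, and v is the first node at or below p. The traversal may continue while |a| + c_v ≤ i.]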


Implementation using suffix trees (3)

  • To compute (s_i,l_i), traverse the unique path in T that matches a prefix of S[i...m]:

    • Let: p - current point, v - first node at or below p.

    • Traverse as long as: string_depth(p) + c_v ≤ i.

    • At the last point p of traversal: l_i = string_depth(p), s_i = c_v.

  • O(l_i) time.
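
The traversal rule can be tried out on a naive suffix trie in Python. This is a sketch under simplifying assumptions: the trie has one character per edge and takes quadratic time to build (it is not a true suffix tree), and each trie node carries the c value of the subtree below it. The names `build_trie_with_cv` and `prior_pair` are hypothetical.

```python
def build_trie_with_cv(S):
    """Suffix trie where every node stores cv, the smallest suffix start
    (1-based) among the suffixes that pass through it."""
    root = {"cv": 1, "children": {}}
    for start in range(1, len(S) + 1):      # increasing order: first visitor is the minimum
        node = root
        for ch in S[start - 1:]:
            if ch not in node["children"]:
                node["children"][ch] = {"cv": start, "children": {}}
            node = node["children"][ch]
    return root


def prior_pair(root, S, i):
    """Walk S[i...m] from the root while string_depth + cv <= i."""
    node, depth = root, 0
    while i + depth <= len(S):
        child = node["children"].get(S[i + depth - 1])
        if child is None or depth + 1 + child["cv"] > i:
            break                            # the copy would run past position i-1
        node, depth = child, depth + 1
    return (node["cv"], depth) if depth > 0 else (0, 0)


S = "abaxcabaxabz"
print(prior_pair(build_trie_with_cv(S), S, 7))
```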


Example

S = abababab (positions 1 2 3 4 5 6 7 8)

i=1: l_1 = 0 → output a

i=2: l_2 = 0 → output b

i=3: l_3 = 2, c_v = 1 → output (1,2)

i=5: l_5 = 4, c_v = 1 → output (1,4)

[Figure: suffix tree for abababab$ annotated with the c_v value and string depth of each node.]


Online version

  • Compress S as it is being input one character at a time.

  • Possible since S[1...i-1] is known before computing (s_i,l_i).

  • Implementation: build suffix tree online.

     Ukkonen’s algorithm:

    • In phase i, build implicit suffix tree T_i for prefix S[1...i].


Claim 1

Assume:

  • The compaction has been done for S[1...i-1].

  • Implicit suffix tree T_{i-1} for S[1...i-1] has been built.

  • c_v values are given for each node v in T_{i-1}.

    Then (s_i,l_i) can be obtained in O(l_i) time.


Suppose we had a suffix tree for S[1...i-1] with c_v values → we could find (s_i,l_i) in O(l_i) time.

l_i = string_depth(p)

s_i = c_v

[Figure: the path from the root spelling S(i), S(i+1), ..., S(k-1) ends at point p; the next character c on the edge towards node v differs from S(k).]


The missing leaves in the implicit suffix tree are not needed.

[Figure: the implicit suffix tree and the true suffix tree below point p, side by side. A suffix j that has no leaf of its own in the implicit tree ends inside the path to an earlier leaf h (h < j), so adding leaf j would not change the smallest leaf index.]


Claim 2

c_v values for all implicit suffix trees can be computed in total O(m) time.

  • In Ukkonen’s algorithm:

    • Only extension rule 2 updates c_v values.

    • Whenever a new internal node v is created by splitting an edge (u,w): c_v ← c_w.

    • Whenever a new leaf j is created: c_j ← j.

      ⇒ constant update time per new node.


Updating c_v values

[Figure: the two update cases. New leaf: leaf j is attached below an existing node. New leaf and new node: the edge (u,w) is split by a new node v, which takes over c_w, and leaf j is attached below v.]


Online algorithm

  • Base case: output S(1) and build T_1.

  • General case: Suppose S[1...i-1] has been compressed and T_{i-1} with c_v values has been constructed.

    • Match S(i),S(i+1),... along a path from the root in T_{i-1}.

    • Let S(k) be the first character that doesn’t match.

    • Find (s_i,l_i).

    • If l_i = 0, output S(i) and build T_i with c_v values.

    • If l_i > 0, output (s_i,l_i) and build T_i,...,T_{k-1} with c_v values.

  • Total time: O(m).



Maximal Pair

  • A maximal pair in string S: a pair of identical substrings a and b in S such that the character to the immediate left (right) of a is different from the character to the immediate left (right) of b.

  • Extending a and b in either direction would destroy the equality of the two strings.

  • Example: S = xabcyiiizabcqabcyrxar; the occurrences of abc at positions 2 and 10 form a maximal pair.


Maximal Pair (continued)

  • Overlap is allowed: in S = cxxaxxaxxb, the two copies of xxaxx (starting at positions 2 and 5) overlap.

  • To allow a prefix or suffix of S to be part of a maximal pair: S → #S$ (#,$ don’t appear in S). Example: #abcxabc$
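
The definition can be checked directly with a brute-force Python sketch. For two start positions whose left characters differ, the only candidate length is the full common extension, since the right characters differ exactly where the extension stops. This is quadratic and is not the O(n+k) suffix-tree algorithm; `maximal_pairs` is a name invented for the example.

```python
def maximal_pairs(S):
    """Return all maximal pairs as (i, j, length), 1-based, with i < j.

    Assumes '#' and '$' do not occur in S; they serve as sentinels.
    """
    padded = "#" + S + "$"          # padded[k] is S(k) for k = 1..n
    n = len(S)
    pairs = []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if padded[i - 1] == padded[j - 1]:
                continue            # same left character: extendable to the left
            length = 0
            # The '$' sentinel guarantees a mismatch before running off the end.
            while padded[i + length] == padded[j + length]:
                length += 1
            if length > 0:
                pairs.append((i, j, length))
    return pairs


print(maximal_pairs("abcxabc"))
```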


Maximal Repeat

  • A maximal repeat in string S:

    A substring of S that occurs in a maximal pair in S.

  • Example: S = xabcyiiizabcqabcyrxar

    maximal repeats: abc, abcy, ...


Finding All Maximal Repeats In Linear Time

  • Given: String S of length n.

  • Goal: Find all maximal repeats in O(n) time.

  • Lemma: Let T be a suffix tree for S. If string a is a maximal repeat in S, then a is the path-label of an internal node v in T.


Proof – by def. of maximal repeat

S = xabcyiiizabcqabcyrxar

[Figure: the path spelling abc from the root ends at an internal node v, whose outgoing edges begin with y and q.]


Conclusion

  • There can be at most n maximal repeats in any string of length n.

  • Proof:

    by the lemma, since T has at most n internal nodes.


Which internal nodes correspond to maximal repeats?

  • The left character of leaf i in T is S(i-1).

  • Node v of T is left diverse if at least 2 leaves in v’s subtree have different left characters.

  • A leaf can’t be left diverse.

  • Left diversity propagates upward.


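Example: S = #xabxa$ (positions 1 2 3 4 5 6 refer to xabxa$)

[Figure: suffix tree for xabxa$ with the left character written under each leaf: leaf 1 has #, leaf 4 has b, leaves 2 and 5 have x, leaves 3 and 6 have a. The node with path-label xa is left diverse and is a maximal repeat; the node with path-label a is not left diverse.]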


Theorem

The string a labeling the path to an internal node v of T is a maximal repeat

⇔ v is left diverse.


Proof of ⇒

  • Suppose a is a maximal repeat ⇒

  • It participates in a maximal pair ⇒

  • It has at least two occurrences with distinct left characters: xa, ya, x ≠ y ⇒

  • Let i and j be the two starting positions of a. Then leaves i and j are in v’s subtree and have different left characters x,y ⇒

  • v is left diverse.


Proof of ⇐

  • Suppose v is left diverse ⇒ there are substrings xap and yaq in S, x ≠ y.

  • If p ≠ q ⇒ a’s occurrences in xap and yaq form a maximal pair ⇒ a is a maximal repeat.

  • If p = q ⇒ since v is a branching node, there is a substring zar in S, r ≠ p. If z ≠ x ⇒ it forms a maximal pair with xap. If z ≠ y ⇒ it forms a maximal pair with yap. In either case, a is a maximal repeat.


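Proof of ⇐ (continued)

[Figure: Case 1 (p ≠ q): below v, the subtree on the edge starting with p contains a leaf with left character x, and the subtree on the edge starting with q contains a leaf with left character y. Case 2 (p = q): the leaves with left characters x and y are both below the edge starting with p, and a leaf with left character z is below another edge starting with r.]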


Compact Representation

  • Node v in T is a frontier node if:

    • v is left diverse.

    • none of v’s children are left diverse.

  • Each node at or above the frontier is left diverse.

  • The subtree of T from the root down to the frontier nodes is a compact representation of the set of all maximal repeats of S.

  • The representation takes O(n) space, though the total length of all maximal repeats may be larger.


Linear time algorithm

  • Build suffix tree T.

  • Find all left diverse nodes in linear time.

  • Delete all nodes that aren’t left diverse, to achieve the compact representation.


finding all left diverse nodes in linear time

  • Traverse T bottom-up, recording for each node:

    • either that it is left diverse

    • or the left character common to all leaves in its subtree.

  • For each leaf: record its left character.

  • For each internal node v:

    • If any child is left diverse → v is left diverse.

    • Else if all children have a common character x → record x for v.

    • Else record that v is left diverse.
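
The bottom-up rule above can be sketched in Python on a naive suffix trie. Assumptions of this example: the trie is uncompressed and quadratic in size (so the sketch is not linear time), branching trie nodes stand in for the internal nodes of the suffix tree, '#' and '$' do not occur in S, and the names `LEFT_DIVERSE` and `maximal_repeats` are invented here. The recursion depth grows with the string length, so it is meant for short inputs.

```python
LEFT_DIVERSE = object()   # marker recorded instead of a common left character


def maximal_repeats(S):
    """Path-labels of branching, left-diverse nodes in the trie of S + '$'."""
    text = S + "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # Each leaf records its left character ('#' before the first position).
        node["leaf"] = "#" if start == 0 else text[start - 1]

    repeats = []

    def visit(node, label):
        if "leaf" in node:
            return node["leaf"]
        results = [visit(child, label + ch) for ch, child in node.items()]
        if any(r is LEFT_DIVERSE for r in results) or len(set(results)) > 1:
            summary = LEFT_DIVERSE
        else:
            summary = results[0]          # common left character of all leaves below
        if summary is LEFT_DIVERSE and len(node) > 1 and label:
            repeats.append(label)         # branching, left diverse, not the root
        return summary

    visit(root, "")
    return sorted(repeats)


print(maximal_repeats("xabxa"))
```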


Finding All Maximal Pairs In Linear Time

  • Not every two occurrences of a maximal repeat form a maximal pair.

    Example: S = xabcyiiizabcqabcyrxar; the occurrences of abc at positions 2 and 14 are both followed by y, so they don’t form a maximal pair.

  • There can be more than O(n) maximal pairs.

  • The algorithm is O(n+k) where k is the number of maximal pairs.


General Idea

For each node u and character x: keep all leaf numbers below u whose left character is x.

To find all maximal pairs of a:

For each character x, form the cartesian product of the list for x at v1 with every list for a character ≠ x at v2.

[Figure: node v with path-label a has children v1 (edge starting with p) and v2 (edge starting with q); leaf i below v1 has left character x, leaf j below v2 has left character y.]


The Algorithm

  • Build suffix tree T for S.

  • Record the left character of each leaf.

  • Traverse T bottom-up.

  • At each node v with path-label a:

    • Output all maximal pairs of a: cartesian product of lists (u,x) and (u’,x’) for each pair of children u ≠ u’ and pair of characters x ≠ x’.

    • Create the lists for node v by linking the lists of v’s children.


Time Analysis

  • Suffix tree construction → O(n).

  • Bottom-up traversal including all list-linking → O(n).

  • All cartesian product operations → O(k), where k is the number of maximal pairs.

  • Total O(n+k).


Finding All Supermaximal Repeats In Linear Time

  • supermaximal repeat: a maximal repeat that isn’t a substring of any other maximal repeat.

  • Example: S = xabcyiiizabcqabcyrxar: abcy is supermaximal, abc isn’t.

  • Theorem: A left diverse internal node v in the suffix tree for S represents a supermaximal repeat iff

    • all of v’s children are leaves

    • and each has a distinct left character



Longest common extension problem

Preprocess strings S1 and S2 s.t. the following queries can be computed in O(1) time each:

  • Given index pair (i,j), find the length of the longest substring of S1 starting at position i that matches a substring of S2 starting at position j.

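    S1: ... abcdzzz ... (abcdzzz starts at position i)

    S2: ... abcdefg ... (abcdefg starts at position j)

    Here the longest common extension for (i,j) is abcd, of length 4.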


Solution

Preprocess: O(|S1|+|S2|)

  • Build generalized suffix tree T for S1 and S2.

  • Preprocess T for constant-time LCA queries.

  • Compute string-depth of every node.

    To answer query (i,j): O(1)

  • Find LCA node v of leaves corresponding to suffix i of S1 and suffix j of S2.

  • Return string-depth(v).



Definition

tandem repeat: a string a that can be written as a = bb, where b is a substring.

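Example: s = x a b a b a b a b y contains six tandem repeats:

  • b = ab: ab|ab starting at positions 2, 4 and 6.

  • b = ba: ba|ba starting at positions 3 and 5.

  • b = abab: abab|abab starting at position 2.

Note: b is not required to be of maximal length.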


Finding all tandem repeats – simple solution

For each feasible pair of start position i and middle position j:

  • Perform a longest common extension query from i and j.

  • If the extension from i reaches j or beyond, (i,j) defines a tandem repeat.

    [Figure: positions 1 ... i ... j-1 | j ... 2j-i-1 | 2j-i ... n; the two copies S[i...j-1] and S[j...2j-i-1] each have length j-i.]
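
The simple solution can be sketched as follows. The `lce` helper here compares characters directly, as a stand-in for the constant-time query obtained from a generalized suffix tree with LCA preprocessing, so this sketch does not reach the O(n^2) bound in the worst case. Function names are assumptions of the example.

```python
def lce(S, i, j):
    """Naive longest common extension of S from 1-based positions i < j."""
    length = 0
    while j + length <= len(S) and S[i - 1 + length] == S[j - 1 + length]:
        length += 1
    return length


def tandem_repeats(S):
    """All (i, j) with S[i...j-1] == S[j...2j-i-1], 1-based."""
    n = len(S)
    found = []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if 2 * j - i - 1 > n:
                break                      # second copy would not fit in S
            if lce(S, i, j) >= j - i:      # extension from i reaches j
                found.append((i, j))
    return found


print(tandem_repeats("xababababy"))
```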


Time analysis

  • Preprocess for longest common extension: O(n).

  • O(n^2) feasible pairs, O(1) time to check each one.

  • O(n^2) total.


Finding all tandem repeats – faster solution

  • Due to Landau & Schmidt.

  • O(n log n + z) time.

  • z = total number of tandem repeats in S.

  • z can be as large as Θ(n^2).

  • Example: all n characters are the same.

  • In practice, z is expected to be smaller.


Divide and conquer

Let h = n/2.

  1. Find all tandem repeats contained entirely in the first half of S (up to h).

  2. Find all tandem repeats contained entirely in the second half of S (after h).

  3. Find all tandem repeats where the first copy contains h.

  4. Find all tandem repeats where the second copy contains h.


Solution of subproblems

  • 1 and 2 → solved recursively.

  • 3 and 4 → symmetric to each other.

  • Remains to show: solution for 3.


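Solution for subproblem 3

  • For each l = 1,...,h, find all tandem repeats of length exactly 2l whose first copy contains h.

[Figure: positions h-1, h, q-1 and q in S, with q at distance l from h; l1 is the length of the forward extension from h and q, and l2 is the length of the backward extension from h-1 and q-1.]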


Algorithm for fixed l

  1. Let q = h+l.

  2. Compute longest common extension from h and q. Let l1 denote its length.

  3. Compute longest common extension from h-1 and q-1 in reverse direction. Let l2 denote its length.

  4. There is a tandem repeat of length 2l whose first copy contains h iff l1 ≥ 1 and l1 + l2 ≥ l.


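Output for fixed l

  5. If the condition holds, output starting positions: max(h-l2, h-l+1), ..., min(h+l1-l, h).

[Figure: the forward extension of length l1 covers h ... h+l1-1, the backward extension of length l2 covers h-l2 ... h-1, and q-l2 = h+l-l2. The valid starting positions lie between h-l2 and h+l1-l.]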
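
Subproblem 3 can be sketched in Python for an arbitrary position h. The two extension helpers compare characters directly instead of using constant-time queries, and the loop bound on l is chosen here so that q = h + l stays inside S; both are assumptions of this illustration, as are the function names.

```python
def lce_forward(S, i, j):
    """Length of the common extension of S going right from i and j (1-based)."""
    length = 0
    while max(i, j) + length <= len(S) and S[i - 1 + length] == S[j - 1 + length]:
        length += 1
    return length


def lce_backward(S, i, j):
    """Length of the common extension of S going left from i and j (1-based)."""
    length = 0
    while min(i, j) - length >= 1 and S[i - 1 - length] == S[j - 1 - length]:
        length += 1
    return length


def repeats_with_first_copy_at(S, h):
    """Tandem repeats (start, l) whose first copy contains position h."""
    found = []
    for l in range(1, len(S) - h + 1):        # keeps q = h + l within S
        q = h + l
        l1 = lce_forward(S, h, q)
        l2 = lce_backward(S, h - 1, q - 1)
        if l1 >= 1 and l1 + l2 >= l:
            first = max(h - l2, h - l + 1)
            last = min(h + l1 - l, h)
            found.extend((start, l) for start in range(first, last + 1))
    return found


print(sorted(repeats_with_first_copy_at("xababababy", 5)))
```

Each returned pair is a starting position together with the length l of one copy, so the repeat itself has length 2l.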


Time analysis

  • For fixed l:

    • O(1) longest common extension queries.

  • For subproblem 3 on a string of length n:

    • O(n) longest common extension queries.

  • Entire algorithm on a string of length n:

    • Let T(n) denote the number of longest common extension queries for a string of length n. Then T(n) = 2T(n/2) + 2n → T(n) = O(n log n).

    • Including output: O(n log n + z) total.



The k-mismatch problem

  • Given: pattern P, text T, fixed number k.

  • k-mismatch of P: a |P|-length substring of T that matches at least |P|-k characters of P (i.e. it matches P with at most k mismatches).

  • The k-mismatch problem: Find all k-mismatches of P in T.


Example

P = bend

T = abentbananaend

k = 2

  • T contains three 2-mismatches of P:

    • bent at position 2 (1 mismatch)

    • bana at position 6 (2 mismatches)

    • aend at position 11 (1 mismatch)


Solution

  • Notation: |P|=n, |T|=m, k independent of n and m (k<<n).

  • General idea:

    • For each position i in T, determine whether a k-mismatch of P begins at position i.

    • To do this efficiently: successively execute up to k+1 longest common extension queries.

    • A k-mismatch of P begins at position i iff these extensions reach the end of P.


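solution (continued)

[Figure: P (positions 1 ... n) aligned with T at position i. Query 1, query 2 and query 3 are successive longest common extension queries, each restarting just past the previous mismatch.]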


Algorithm for index i

  1. j ← 1; i’ ← i; count ← 0.

  2. Compute the length l of the longest common extension starting at positions j of P and i’ of T.

  3. If j+l = n+1, then a k-mismatch of P occurs in T starting at i; stop.

  4. If count < k, then count ← count+1; j ← j+l+1; i’ ← i’+l+1; go to step 2. Else, a k-mismatch of P does not occur in T starting at i; stop.
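
The steps above translate almost line by line into Python. In this sketch the longest common extension is computed by direct comparison rather than by a constant-time query, so it shows the control flow but not the O(km) bound; `lce` and `k_mismatch_positions` are names assumed for the example.

```python
def lce(P, j, T, i):
    """Naive common extension of P from position j and T from position i (1-based)."""
    length = 0
    while (j + length <= len(P) and i + length <= len(T)
           and P[j - 1 + length] == T[i - 1 + length]):
        length += 1
    return length


def k_mismatch_positions(P, T, k):
    """1-based positions i of T where P matches with at most k mismatches."""
    n, m = len(P), len(T)
    hits = []
    for i in range(1, m - n + 2):
        j, i2, count = 1, i, 0             # step 1 (i2 plays the role of i')
        while True:
            l = lce(P, j, T, i2)           # step 2
            if j + l == n + 1:             # step 3: reached the end of P
                hits.append(i)
                break
            if count < k:                  # step 4: skip over one mismatch
                count += 1
                j += l + 1
                i2 += l + 1
            else:
                break
    return hits


print(k_mismatch_positions("bend", "abentbananaend", 2))
```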


Time Analysis

  • Preprocessing of T and P for longest common extension queries → O(m).

  • For each index i=1,...,m-n+1 of T, up to k+1 longest common extension queries → O(k) per index → O(km) total.

  • Total O(km) time.


k-mismatch tandem repeat

  • definition: tandem repeat in which the two copies differ by at most k mismatches.

  • example: axab|aybb

  • goal: find all k-mismatch tandem repeats.

  • simple solution:

    For each feasible pair of starting position i and middle position j, check if S[i...j-1] and S[j...2j-i-1] match with at most k mismatches.

    → O(kn^2)

  • faster solution exists: O(kn log(n/k) + z).


faster k-mismatch tandem repeats

  • O(kn log(n/k) + z)

  • same divide and conquer algorithm.

  • subproblem 3 for fixed l: find all k-mismatch tandem repeats of length 2l whose first copy contains h.

  • let q = h+l.

  • run k successive longest common extension queries forward from h and q. mark every mismatch with ->.

  • run k successive longest common extension queries backward from h-1 and q-1. mark every mismatch with <-

  • position t in [h+1,q] is a middle position of a k-mismatch tandem repeat iff the number of -> mismatches in positions h,...,t-1 plus the number of <- mismatches in positions t,...,q-1 is ≤ k.


faster k-mismatch tandem (continued)

  • a c c e e | ab c d e | a c c d d

  • calculate sums only for positions with arrows, to get intervals of legal midpoints.

  • O(k) time per fixed l, and any l ≤ k need not be checked.
