Selected Applications of Suffix Trees

PowerPoint Slideshow about 'Selected Applications of Suffix Trees' - abra


Presentation Transcript
Reminder – suffix tree

Suffix tree for string S of length m:

  • rooted directed tree with m leaves numbered 1,...,m.
  • each internal node, except the root, has at least 2 children.
  • each edge labeled with a nonempty substring of S.
  • edges out of a node begin with different characters.
  • path from the root to leaf i spells out suffix S[i...m].
Reminder – suffix tree (continued)
  • Each substring α of S appears on some unique path from the root.
  • If α ends at point p, the leaves below p mark all its occurrences.

α occurs in S starting at position j ⇔ α is a prefix of S[j...m] ⇔ α labels an initial part of the path from the root to leaf j.

Example: S = xabxa$ (positions 1 2 3 4 5 6)

[Figure: suffix tree for S = xabxa$, with leaves numbered 1 to 6.]

Exact string matching

Find all occurrences of pattern P (length n) in text T (length m).

  • Build a suffix tree for T → O(m) (Ukkonen).
  • Match P along a path from the root → O(1) per character (finite alphabet) → O(n) total.
  • If P fully matches a path, then the leaves below mark all starting positions of P in T → O(k), where k = number of occurrences.
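The three steps above can be sketched in a few lines. This is only an illustration under simplifying assumptions: it builds an uncompressed suffix trie naively in O(m^2) time and space instead of running Ukkonen's algorithm, and it stores the start positions at every node rather than collecting leaves. The names `build_suffix_trie` and `find_occurrences` are hypothetical.

```python
def build_suffix_trie(text):
    """Naive suffix trie (assumption: O(m^2) build, not Ukkonen's O(m)).

    Every node keeps the 1-based start positions of the suffixes passing
    through it, which plays the role of "the leaves below this point".
    """
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def find_occurrences(root, pattern):
    """Match the pattern from the root; return all 1-based start positions."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:          # mismatch: the pattern does not occur
            return []
    return sorted(node["starts"])


trie = build_suffix_trie("xabxa$")
print(find_occurrences(trie, "xa"))
```

Matching still costs one dictionary lookup per pattern character, as in the slide; only the construction cost differs from the real suffix tree.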
Matching Statistics
  • ms(i) – the length of the longest substring of T starting at position i that matches a substring somewhere in P.
  • Example: T = abcxabcdex, P = wyabcwzqabcdw ⇒ ms(1)=3, ms(5)=4.
  • There is an occurrence of P starting at position i of T iff ms(i)=|P|.
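As a check on the definition, the following brute-force sketch computes every ms(i) directly. It is not the linear-time suffix-link method; it simply grows each match while it still occurs in P. The function name `matching_statistics` is hypothetical.

```python
def matching_statistics(text, pattern):
    """Return [ms(1), ..., ms(m)] by brute force (list index 0 holds ms(1))."""
    ms = []
    for i in range(len(text)):
        length = 0
        # Extend while the next, longer prefix of text[i:] still occurs in pattern.
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("abcxabcdex", "wyabcwzqabcdw")
print(ms[0], ms[4])  # ms(1) and ms(5) from the slide's example
```

An occurrence of P starts at position i exactly when the entry for i equals len(pattern).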
Goal: Compute ms(i) for each position i in T, in O(m) total time, using only a suffix tree for P.
  • Naive way: match T[i...m] starting from the root for each i → more than O(m) total (up to O(mn), where n = |P|).

Using suffix links:

  • Build a suffix tree for P (Ukkonen) and keep the suffix links.
  • Suffix link: a pointer from an internal node v with path-label xα to the node s(v) with path-label α (x is a character, α a substring).
Compute ms(i) in order

Base case: for ms(1), match T[1...m] from the root.

General case: suppose the matching path for ms(i) ended at point b. Then, for ms(i+1):

  • Let v be the first internal node at or above b.
  • If there is no such v, search from the root.
  • Otherwise, follow the suffix link from v to s(v) and search from s(v).
  • Reason: path_label(v) = xα is a prefix of T[i...m] ⇒ path_label(s(v)) = α is a prefix of T[i+1...m].
skip / count
  • Let β denote the string between node v and point b.
  • The substring xαβ in P matches a prefix of T[i...m].
  • ⇒ The substring αβ in P matches a prefix of T[i+1...m].
  • Traverse the path labeled β out of s(v) using the skip/count trick (time proportional to the number of nodes on the path).
  • From the end of β, match single characters (starting with the first character that didn’t match for ms(i)).
Time analysis

In the search for ms(i+1):

  • back up at most one edge from b to v → O(1).
  • traverse the suffix link from v to s(v) → O(1).
  • traverse a β-path from s(v) in time proportional to the number of nodes on it → O(m) total.
  • perform additional comparisons starting with the first character that didn’t match for ms(i) → O(m) total.
Definitions

For any position i in string S of length m:

  • Prior_i: the longest prefix of S[i...m] that occurs as a substring of S[1...i-1].
  • l_i: the length of Prior_i.
  • s_i: the starting position of the left-most copy of Prior_i (defined when l_i > 0).

Example: S = abaxcabaxabz, Prior_7 = bax, l_7 = 3, s_7 = 2.

  • The copy of Prior_i starting at s_i is totally contained in S[1...i-1].
Basic idea
  • Suppose the text S[1...i-1] has been represented (perhaps in compressed form) and l_i > 0.
  • Then Prior_i need not be explicitly represented.
  • The pair (s_i, l_i) points to an earlier occurrence of Prior_i.
  • Example: S = abaxcabaxabz; the substring bax at position 7 can be represented by the pair (2,3).
Compression algorithm (outline)

i := 1
repeat
    compute l_i and s_i
    if l_i > 0 then
        output (s_i, l_i)
        i := i + l_i
    else
        output S(i)
        i := i + 1
until i > m
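The outline above can be made runnable as follows. This is a minimal sketch, assuming a brute-force search for (s_i, l_i) in place of the suffix-tree traversal described later, so it does not run in O(m) time. The names `prior` and `compress` are hypothetical; positions are 1-based to match the slides.

```python
def prior(s, i):
    """Return (s_i, l_i) for 1-based position i; s_i is None when l_i == 0."""
    best_start, best_len = None, 0
    for start in range(1, i):                  # candidate copies, left to right
        length = 0
        # The copy must stay inside S[1...i-1] and position i+length must exist.
        while (start + length < i and i + length <= len(s)
               and s[start - 1 + length] == s[i - 1 + length]):
            length += 1
        if length > best_len:                  # strict '>' keeps the left-most copy
            best_start, best_len = start, length
    return best_start, best_len


def compress(s):
    """Emit single characters and (s_i, l_i) pairs, following the outline."""
    out, i = [], 1
    while i <= len(s):
        start, length = prior(s, i)
        if length > 0:
            out.append((start, length))
            i += length
        else:
            out.append(s[i - 1])
            i += 1
    return out


print(compress("abacabaxabz"))
```

Run on the two example strings of the next slide, the sketch reproduces the compressed forms shown there.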

Examples

S1 = a b a c a b a x a b z

Compressed: a b (1,1) c (1,3) x (1,2) z

S2 = ab ababababababababababababababab

Compressed: ab (1,2) (1,4) (1,8) (1,16)

S = (ab)^k ⇒ the compressed representation has size O(log k).

Decompress
  • Process the compressed string left to right.
  • Any pair (s_i, l_i) in the representation points to a substring that has already been fully decompressed.
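A minimal sketch of the left-to-right decompression, assuming the compressed form is a Python list that mixes single characters and 1-based (s_i, l_i) tuples (that representation and the name `decompress` are assumptions made for the example):

```python
def decompress(items):
    """Rebuild S from a list of characters and 1-based (start, length) pairs."""
    out = []
    for item in items:
        if isinstance(item, tuple):
            start, length = item
            # The referenced copy lies entirely in the part already rebuilt.
            out.extend(out[start - 1:start - 1 + length])
        else:
            out.append(item)
    return "".join(out)


print(decompress(["a", "b", (1, 1), "c", (1, 3), "x", (1, 2), "z"]))
```

Because every pair points strictly to the left, a single pass with no look-ahead is enough.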
Computing (s_i, l_i)
  • The algorithm does not request (s_i, l_i) for any position i already in the compressed part of S.
  • For total O(m) time, find each requested pair (s_i, l_i) in O(l_i) time; each such step advances i by l_i:

compute l_i and s_i
if l_i > 0 then
    output (s_i, l_i)
    i := i + l_i

Implementation using suffix tree (1)

Before compression:

  • Build a suffix tree T for S.
  • For each node v, compute c_v:
    • the smallest leaf index in v’s subtree.
    • equivalently, the starting position of the leftmost copy of the substring that labels the path from the root to v.
  • O(m) time.
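Implementation using suffix trees (2)

Computing (s_i, l_i):

[Figure: the path from the root toward leaf i spells S[i...m]; p is a point on this path with path-label α, and v is the first node at or below p. The traversal condition shown is |α| + c_v ≤ i.]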

Implementation using suffix trees (3)
  • To compute (s_i, l_i), traverse the unique path in T that matches a prefix of S[i...m]:
    • Let p be the current point and v the first node at or below p.
    • Traverse as long as string_depth(p) + c_v ≤ i.
    • At the last point p of the traversal: l_i = string_depth(p), s_i = c_v.
  • O(l_i) time.
Example

S = abababab (positions 1 2 3 4 5 6 7 8)

i=1: l_i=0 → output a

i=2: l_i=0 → output b

i=3: l_i=2, c_v=1 → output (1,2)

i=5: l_i=4, c_v=1 → output (1,4)

[Figure: suffix tree for S with leaves numbered 1 to 8 and a c_v value (1 or 2) at each internal node.]

Online version
  • Compress S as it is being input one character at a time.
  • Possible since S[1...i-1] is known before computing (s_i, l_i).
  • Implementation: build the suffix tree online, using Ukkonen’s algorithm:
    • In phase i, build the implicit suffix tree T_i for the prefix S[1...i].
Claim 1

Assume:

  • The compaction has been done for S[1...i-1].
  • The implicit suffix tree T_{i-1} for S[1...i-1] has been built.
  • c_v values are given for each node v in T_{i-1}.

Then (s_i, l_i) can be obtained in O(l_i) time.

Suppose we had a suffix tree for S[1...i-1] with c_v values ⇒ we could find (s_i, l_i) in O(l_i) time.

l_i = string_depth(p)

s_i = c_v

[Figure: the path from the root spells S(i), S(i+1), ..., S(k-1) and ends at point p, above node v; the next character c on the edge satisfies c ≠ S(k).]

The missing leaves in the implicit suffix tree are not needed.

[Figure: the suffix tree (with $) next to the implicit suffix tree. Both contain the path S(i), ..., S(k-1) ending at point p, with next character c ≠ S(k). Leaf j, with edge label S(j) ... S(i-1), appears only in the suffix tree; leaf h, with h < j and edge label S(h) ... S(i-1), appears in both.]

Claim 2

c_v values for all implicit suffix trees can be computed in total O(m) time.

  • In Ukkonen’s algorithm:
    • Only extension rule 2 updates c_v values.
    • Whenever a new internal node v is created by splitting an edge (u,w): c_v ← c_w.
    • Whenever a new leaf j is created: c_j ← j.

⇒ constant update time per new node.

Updating c_v values

[Figure: the two cases of extension rule 2. "New leaf": a leaf j is added below an existing node v. "New leaf and new node": an edge (u,w) is split by a new internal node v, and leaf j is added below v.]

Online algorithm
  • Base case: output S(1) and build T_1.
  • General case: suppose S[1...i-1] has been compressed and T_{i-1} with c_v values has been constructed.
    • Match S(i), S(i+1), ... along a path from the root in T_{i-1}.
    • Let S(k) be the first character that doesn’t match.
    • Find (s_i, l_i).
    • If l_i = 0, output S(i) and build T_i with c_v values.
    • If l_i > 0, output (s_i, l_i) and build T_i, ..., T_{k-1} with c_v values.
  • Total time: O(m).
Maximal Pair
  • A maximal pair in string S: a pair of identical substrings α and β in S such that the character to the immediate left (right) of α is different from the character to the immediate left (right) of β.
  • Extending α and β in either direction would destroy the equality of the two strings.
  • Example: S = xabcyiiizabcqabcyrxar. The occurrences of abc at positions 2 and 10 form a maximal pair; so do the two occurrences of abcy (positions 2 and 14).
Maximal Pair (continued)
  • Overlap is allowed: S = cxxaxxaxxbcxxaxxaaxxaxxb
  • To allow a prefix or suffix of S to be part of a maximal pair: S → #S$ (# and $ don’t appear in S). Example: #abcxabc$
Maximal Repeat
  • A maximal repeat in string S: a substring of S that occurs in a maximal pair in S.
  • Example: S = xabcyiiizabcqabcyrxar; maximal repeats: abc, abcy, ...

Finding All Maximal Repeats In Linear Time
  • Given: string S of length n.
  • Goal: find all maximal repeats in O(n) time.
  • Lemma: Let T be a suffix tree for S. If string α is a maximal repeat in S, then α is the path-label of an internal node v in T.
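Proof – by def. of maximal repeat

The two occurrences of α in a maximal pair are followed by different characters, so the path labeled α must branch, i.e. it ends at an internal node.

S = xabcyiiizabcqabcyrxar

[Figure: the path from the root labeled abc ends at an internal node v, whose outgoing edges begin with y and q.]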

Conclusion
  • There can be at most n maximal repeats in any string of length n.
  • Proof: by the lemma, since T has at most n internal nodes.

Which internal nodes correspond to maximal repeats?
  • The left character of leaf i in T is S(i-1).
  • Node v of T is left diverse if at least 2 leaves in v’s subtree have different left characters.
  • A leaf can’t be left diverse.
  • Left diversity propagates upward.
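Example: S = #xabxa$ (positions 1 2 3 4 5 6 index xabxa$)

[Figure: suffix tree for xabxa$ with leaves 1 to 6, each annotated with its left character (#, x, a, b, x, a for leaves 1 to 6). The node with path-label xa (leaves 1 and 4, left characters # and b) is left diverse and is a maximal repeat; the node with path-label a (leaves 2 and 5, both with left character x) is not left diverse.]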

Theorem

The string α labeling the path to an internal node v of T is a maximal repeat ⇔ v is left diverse.

Proof of ⇒
  • Suppose α is a maximal repeat ⇒
  • It participates in a maximal pair ⇒
  • It has at least two occurrences with distinct left characters: xα, yα, x≠y ⇒
  • Let i and j be the two starting positions of α. Then leaves i and j are in v’s subtree and have different left characters x, y ⇒
  • v is left diverse.
Proof of ⇐
  • Suppose v is left diverse ⇒ there are substrings xαp and yαq in S, x≠y.
  • If p≠q ⇒ α’s occurrences in xαp and yαq form a maximal pair ⇒ α is a maximal repeat.
  • If p=q ⇒ since v is a branching node, there is a substring zαr in S, r≠p. If z≠x ⇒ it forms a maximal pair with xαp. If z≠y ⇒ it forms a maximal pair with yαp. In either case, α is a maximal repeat.
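Proof of ⇐ (continued)

[Figure: the two cases of the proof, each showing node v with path-label α below the root. When p≠q, the leaves with left characters x and y lie below different edges out of v (starting with p and q). When p=q, both lie below the edge starting with p, and a leaf with left character z lies below another edge starting with r.]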

Compact Representation
  • Node v in T is a frontier node if:
    • v is left diverse.
    • none of v’s children are left diverse.
  • Each node at or above the frontier is left diverse.
  • The subtree of T from the root down to the frontier nodes is a compact representation of the set of all maximal repeats of S.
  • The representation takes O(n) space, although the total length of all maximal repeats may be larger.
Linear time algorithm
  • Build suffix tree T.
  • Find all left diverse nodes in linear time.
  • Delete all nodes that aren’t left diverse, to achieve the compact representation.
Finding all left diverse nodes in linear time
  • Traverse T bottom-up, recording for each node:
    • either that it is left diverse
    • or the left character common to all leaves in its subtree.
  • For each leaf: record its left character.
  • For each internal node v:
    • If any child is left diverse ⇒ v is left diverse.
    • Else if all children have a common character x ⇒ record x for v.
    • Else record that v is left diverse.
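The bottom-up marking above can be sketched on a naive suffix trie. This is an illustration under stated assumptions, not the linear-time construction: the trie is built in O(n^2), the string is wrapped as #S$ (so # and $ must not occur in S), and recursion limits it to short strings. Branching trie nodes correspond to the internal nodes of the suffix tree. The name `maximal_repeats` is hypothetical.

```python
def maximal_repeats(s):
    """Maximal repeats of s via left-diverse marking on a naive suffix trie."""
    t = "#" + s + "$"                  # assumption: '#' and '$' do not occur in s
    root = {}
    for i in range(1, len(t)):
        node = root
        for ch in t[i:]:
            node = node.setdefault(ch, {})
        node[None] = t[i - 1]          # leaf marker: the leaf's left character

    DIVERSE = object()                 # sentinel meaning "left diverse"
    found = []

    def visit(node, label):
        """Return DIVERSE, or the left character shared by all leaves below."""
        if None in node:               # leaf: every suffix ends with the unique '$'
            return node[None]
        marks = [visit(child, label + ch) for ch, child in node.items()]
        if any(m is DIVERSE for m in marks) or len(set(marks)) > 1:
            if len(node) >= 2 and label:   # branching node below the root
                found.append(label)
            return DIVERSE
        return marks[0]

    visit(root, "")
    return sorted(found)


print(maximal_repeats("xabxa"))
```

On the slide's example string xabcyiiizabcqabcyrxar the result includes abc and abcy but not ab, whose occurrences are always followed by c.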
Finding All Maximal Pairs In Linear Time
  • Not every two occurrences of a maximal repeat form a maximal pair.

Example: S = xabcyiiizabcqabcyrxar. The occurrences of abc at positions 2 and 14 are both followed by y, so they do not form a maximal pair.

  • There can be more than O(n) maximal pairs.
  • The algorithm is O(n+k) where k is the number of maximal pairs.
General Idea

For each node u and character x: keep all leaf numbers below u whose left character is x.

To find all maximal pairs of α:

For each character x, form the cartesian product of the list for x at v1 with every list for a character ≠ x at v2.

[Figure: node v with path-label α has children v1 (edge starting with p) and v2 (edge starting with q); leaf i below v1 has left character x and leaf j below v2 has left character y.]

The Algorithm
  • Build suffix tree T for S.
  • Record the left character of each leaf.
  • Traverse T bottom-up.
  • At each node v with path-label α:
    • Output all maximal pairs of α: cartesian product of lists (u,x) and (u’,x’) for each pair of children u ≠ u’ and pair of characters x ≠ x’.
    • Create the lists for node v by linking the lists of v’s children.
Time Analysis
  • Suffix tree construction → O(n).
  • Bottom-up traversal including all list-linking → O(n).
  • All cartesian product operations → O(k), where k is the number of maximal pairs.
  • Total O(n+k).
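For checking small cases, the definition of a maximal pair can be applied directly. The sketch below is a brute-force stand-in, not the O(n+k) list-linking algorithm: for every pair of positions with different left characters it extends the match to the right until it breaks. It assumes # and $ do not occur in S; the name `maximal_pairs` is hypothetical.

```python
def maximal_pairs(s):
    """Return all maximal pairs as (i, j, length) with 1-based i < j."""
    t = "#" + s + "$"        # index in t equals the 1-based position in s
    pairs = []
    for i in range(1, len(s) + 1):
        for j in range(i + 1, len(s) + 1):
            if t[i - 1] == t[j - 1]:
                continue     # equal left characters: cannot be maximal
            length = 0
            # The unique '$' guarantees a mismatch before running off the end.
            while t[i + length] == t[j + length]:
                length += 1
            if length > 0:
                pairs.append((i, j, length))
    return pairs


print(maximal_pairs("abcxabc"))
```

Each pair of start positions yields at most one maximal pair, because the right extension stops at the first mismatch.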
Finding All Supermaximal Repeats In Linear Time
  • Supermaximal repeat: a maximal repeat that isn’t a substring of any other maximal repeat.
  • Example: S = xabcyiiizabcqabcyrxar; abcy is supermaximal, abc isn’t.
  • Theorem: A left diverse internal node v in the suffix tree for S represents a supermaximal repeat iff
    • all of v’s children are leaves
    • and each has a distinct left character
Longest common extension problem

Preprocess strings S1 and S2 such that the following queries can be computed in O(1) time each:

  • Given an index pair (i,j), find the length of the longest substring of S1 starting at position i that matches a substring of S2 starting at position j.

Example: if S1 = ... abcdzzz ... with abcd starting at position i, and S2 = ... abcdefg ... with abcd starting at position j, the answer is 4.

Solution

Preprocess: O(|S1|+|S2|)

  • Build generalized suffix tree T for S1 and S2.
  • Preprocess T for constant-time LCA queries.
  • Compute string-depth of every node.

To answer query (i,j): O(1)

  • Find LCA node v of leaves corresponding to suffix i of S1 and suffix j of S2.
  • Return string-depth(v).
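To make the query itself concrete, here is a direct character-by-character version. It is a stand-in that costs O(answer) per query, not the O(1) generalized-suffix-tree and LCA solution described above. The name `lce` and the 1-based argument convention are assumptions for the example.

```python
def lce(s1, i, s2, j):
    """Length of the longest common extension of s1 from i and s2 from j (1-based)."""
    length = 0
    while (i - 1 + length < len(s1) and j - 1 + length < len(s2)
           and s1[i - 1 + length] == s2[j - 1 + length]):
        length += 1
    return length


print(lce("xxabcdzzz", 3, "yabcdefg", 2))
```

Any algorithm in the following slides that issues longest common extension queries can be tested against this function before swapping in a constant-time implementation.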
Definition

Tandem repeat: a string α that can be written as α = ββ, where β is a substring.

Example: s = x a b a b a b a b y contains the tandem repeats

  • ab|ab (three occurrences), with β = ab
  • ba|ba (two occurrences), with β = ba
  • abab|abab (one occurrence), with β = abab

Note: β is not required to be of maximal length.

Finding all tandem repeats – simple solution

For each feasible pair of start position i and middle position j:

  • Perform a longest common extension query from i and j.
  • If the extension from i reaches j or beyond, (i,j) defines a tandem repeat.

[Figure: positions 1 ... n of S; the first copy occupies i ... j-1 and the second copy occupies j ... 2j-i-1, each of length j-i.]

Time analysis
  • Preprocess for longest common extension: O(n).
  • O(n^2) feasible pairs, O(1) time to check each one.
  • O(n^2) total.
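A short sketch of the simple solution, assuming a naive longest common extension in place of the O(1) query (so the running time is worse than O(n^2) in general). The function names are hypothetical; results are reported as 1-based (start, middle) pairs as in the slide.

```python
def lce(s, i, j):
    """Naive longest common extension of s from 0-based positions i < j."""
    k = 0
    while j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def tandem_repeats_simple(s):
    """Return 1-based (start, middle) pairs of all tandem repeats in s."""
    n = len(s)
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            if 2 * j - i > n:          # second copy would run past the end
                break
            if lce(s, i, j) >= j - i:  # extension from i reaches j or beyond
                found.append((i + 1, j + 1))
    return found


print(len(tandem_repeats_simple("xababababy")))
```

For the string x a b a b a b a b y this counts the six tandem repeats listed in the definition slide.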
Finding all tandem repeats – faster solution
  • Due to Landau & Schmidt.
  • O(n log n + z) time.
  • z = total number of tandem repeats in S.
  • z can be as large as Θ(n^2).
  • Example: all n characters are the same.
  • In practice, z is expected to be smaller.
Divide and conquer

Let h = n/2.

  1. Find all tandem repeats contained entirely in the first half of S (up to h).
  2. Find all tandem repeats contained entirely in the second half of S (after h).
  3. Find all tandem repeats where the first copy contains h.
  4. Find all tandem repeats where the second copy contains h.
Solution of subproblems
  • 1 and 2 → solved recursively.
  • 3 and 4 → symmetric to each other.
  • Remains to show: solution for 3.
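Solution for subproblem 3
  • For each l = 1,...,h, find all tandem repeats of length exactly 2l whose first copy contains h.

[Figure: positions h and q in S, a distance l apart. The forward extensions starting at h and q have length l1; the backward extensions starting at h-1 and q-1 have length l2.]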

Algorithm for fixed l
  1. Let q = h+l.
  2. Compute the longest common extension from h and q. Let l1 denote its length.
  3. Compute the longest common extension from h-1 and q-1 in the reverse direction. Let l2 denote its length.
  4. There is a tandem repeat of length 2l whose first copy contains h iff l1 ≥ 1 and l1 + l2 ≥ l.
Output for fixed l
  5. If the condition holds, output the starting positions max(h-l2, h-l+1), ..., min(h+l1-l, h).

[Figure: the matched regions run from h-l2 to h+l1 around h and from h+l-l2 onward around q; the valid starting positions lie between h-l2 and h+l1-l.]

Time analysis
  • For fixed l:
    • O(1) longest common extension queries.
  • For subproblem 3 on a string of length n:
    • O(n) longest common extension queries.
  • Entire algorithm on a string of length n:
    • Let T(n) denote the number of longest common extension queries for a string of length n. T(n) = 2T(n/2) + 2n ⇒ T(n) = O(n log n).
    • Including output: O(n log n + z) total.
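The divide-and-conquer scheme can be sketched as below. Several choices here are assumptions made for the illustration: positions are 0-based; the two halves are taken to exclude position h itself, so that no repeat is reported twice; subproblem 4 is handled with the same formulas anchored at the pair (h-l, h); and the longest common extensions are computed naively, so only the number of queries is O(n log n), not the running time. All function names are hypothetical.

```python
def lce_forward(s, i, j):
    """Naive forward longest common extension from positions i and j."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def lce_backward(s, i, j):
    """Naive backward longest common extension ending at positions i and j."""
    k = 0
    while i - k >= 0 and j - k >= 0 and s[i - k] == s[j - k]:
        k += 1
    return k


def tandem_repeats(s, offset=0):
    """Return (start, l) for every tandem repeat of length 2l; start is 0-based."""
    n = len(s)
    if n < 2:
        return []
    h = n // 2
    # Subproblems 1 and 2: repeats strictly left or strictly right of h.
    found = tandem_repeats(s[:h], offset) + tandem_repeats(s[h + 1:], offset + h + 1)
    for l in range(1, n):
        # a = h: first copy contains h (subproblem 3).
        # a = h - l: second copy contains h (subproblem 4, by symmetry).
        for a in (h, h - l):
            b = a + l
            if a < 0 or b >= n:
                continue
            l1 = lce_forward(s, a, b)
            l2 = lce_backward(s, a - 1, b - 1)
            if l1 >= 1 and l1 + l2 >= l:
                for start in range(max(a - l2, a - l + 1), min(a + l1 - l, a) + 1):
                    found.append((offset + start, l))
    return found


print(sorted(tandem_repeats("xababababy")))
```

Recursing on slices keeps every extension inside the current subproblem, which is what prevents a repeat crossing an outer midpoint from being found again at a deeper level.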
The k-mismatch problem
  • Given: pattern P, text T, fixed number k.
  • k-mismatch of P: a |P|-length substring of T that matches at least |P|-k characters of P (i.e. it matches P with at most k mismatches).
  • The k-mismatch problem: find all k-mismatches of P in T.
Example

P = bend

T = abentbananaend

k = 2

  • T contains three 2-mismatches of P:
    • bent at position 2 (1 mismatch)
    • bana at position 6 (2 mismatches)
    • aend at position 11 (1 mismatch)

Solution
  • Notation: |P|=n, |T|=m, k independent of n and m (k<<n).
  • General idea:
    • For each position i in T, determine whether a k-mismatch of P begins at position i.
    • To do this efficiently: successively execute up to k+1 longest common extension queries.
    • A k-mismatch of P begins at position i iff these extensions reach the end of P.
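Solution (continued)

[Figure: P (positions 1 ... n) aligned with T starting at position i; successive longest common extension queries (query 1, query 2, query 3) each run up to the next mismatch.]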

Algorithm for index i
  1. j ← 1, i’ ← i, count ← 0.
  2. Compute the length l of the longest common extension starting at positions j of P and i’ of T.
  3. If j+l = n+1, then a k-mismatch of P occurs in T starting at i; stop.
  4. If count < k, then count ← count+1, j ← j+l+1, i’ ← i’+l+1, and go to step 2. Otherwise, a k-mismatch of P does not occur in T starting at i; stop.
Time Analysis
  • Preprocessing of T and P for longest common extension queries → O(m).
  • For each index i = 1,...,m-n+1 of T, up to k+1 longest common extension queries → O(k) per index → O(km) total.
  • Total O(km) time.
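The per-index algorithm can be sketched as follows, assuming a naive longest common extension in place of the O(1) query, so the O(km) bound does not hold for this version. Internally positions are 0-based; the reported start positions are 1-based. The names `lce` and `k_mismatch_positions` are hypothetical.

```python
def lce(p, j, t, i):
    """Naive longest common extension of p from j and t from i (0-based)."""
    k = 0
    while j + k < len(p) and i + k < len(t) and p[j + k] == t[i + k]:
        k += 1
    return k


def k_mismatch_positions(p, t, k):
    """Return the 1-based start positions of all k-mismatches of p in t."""
    n, m = len(p), len(t)
    hits = []
    for i in range(m - n + 1):
        j, i2, count = 0, i, 0
        while True:
            l = lce(p, j, t, i2)
            if j + l == n:             # the extensions reached the end of P
                hits.append(i + 1)
                break
            if count < k:              # skip over the mismatch and continue
                count += 1
                j += l + 1
                i2 += l + 1
            else:                      # more than k mismatches at this index
                break
    return hits


print(k_mismatch_positions("bend", "abentbananaend", 2))
```

Each index issues at most k+1 queries, since every query except the last ends at a mismatch that is then skipped.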
k-mismatch tandem repeat
  • Definition: a tandem repeat in which the two copies differ by at most k mismatches.
  • Example: axab|aybb (the two copies differ in 2 positions).
  • Goal: find all k-mismatch tandem repeats.
  • Simple solution: for each feasible pair of starting position i and middle position j, check if S[i...j-1] and S[j...2j-i-1] match with at most k mismatches ⇒ O(kn^2).
  • A faster solution exists: O(kn log(n/k) + z).
Faster k-mismatch tandem repeats
  • O(kn log(n/k) + z)
  • Same divide and conquer algorithm.
  • Subproblem 3 for fixed l: find all k-mismatch tandem repeats of length 2l whose first copy contains h.
  • Let q = h+l.
  • Run k successive longest common extension queries forward from h and q. Mark every mismatch with ->.
  • Run k successive longest common extension queries backward from h-1 and q-1. Mark every mismatch with <-.
  • Position t in [h+1,q] is a middle position of a k-mismatch tandem repeat iff the number of -> mismatches in positions h,...,t-1 plus the number of <- mismatches in positions t,...,q-1 is ≤ k.
Faster k-mismatch tandem (continued)
  • Example: a c c e e | a b c d e | a c c d d
  • Calculate sums only for positions with arrows, to get intervals of legal midpoints.
  • O(k) time per fixed l, and any l ≤ k need not be checked.

[Figure: the example string with positions h, t and q marked.]