
Selected Applications of Suffix Trees




Reminder – suffix tree

Suffix tree for string S of length m:

  • rooted directed tree with m leaves numbered 1,...,m.

  • each internal node, except the root, has at least 2 children.

  • each edge labeled with a nonempty substring of S.

  • edges out of a node begin with different characters.

  • path from the root to leaf i spells out suffix S[i...m].


Reminder – suffix tree (continued)

  • Each substring α of S appears on some unique path from the root.

  • If α ends at point p, the leaves below p mark all its occurrences.

    α occurs in S starting at position j

    ⇔ α is a prefix of S[j...m]

    ⇔ α labels an initial part of the path from the root to leaf j.


Example: S = xabxa$ (positions 1 2 3 4 5 6)

[Figure: suffix tree for S = xabxa$ with leaves numbered 1 to 6.]


Exact string matching

Find all occurrences of pattern P in text T.

  • Build suffix tree for T → O(m) (Ukkonen).

  • Match P along a path from the root → O(1) per character (finite alphabet) → O(n) total.

  • If P fully matches a path, then the leaves below mark all starting positions of P in T → O(k), where k = number of occurrences.
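
The matching steps above can be sketched in Python. This is only an illustration under simplifying assumptions: it builds an uncompressed suffix trie in quadratic time instead of a suffix tree with Ukkonen's algorithm, so it does not achieve the bounds on the slide. The names `Node`, `build_suffix_trie` and `find_occurrences` are hypothetical, and `$` is assumed not to occur in the text.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.leaf = None     # 1-based start of the suffix ending here, if any


def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a trie (quadratic time and space)."""
    text += "$"
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
        node.leaf = i + 1
    return root


def find_occurrences(root, pattern):
    """Match pattern from the root, then collect the leaf numbers below."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []        # pattern does not occur in the text
    starts, stack = [], [node]
    while stack:
        cur = stack.pop()
        if cur.leaf is not None:
            starts.append(cur.leaf)
        stack.extend(cur.children.values())
    return sorted(starts)
```

For the earlier example text xabxa, `find_occurrences(build_suffix_trie("xabxa"), "xa")` reports the start positions 1 and 4.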


Matching Statistics

  • ms(i) – the length of the longest substring of T starting at position i that matches a substring somewhere in P.

  • Example: T = abcxabcdex, P = wyabcwzqabcdw: ms(1) = 3, ms(5) = 4.

  • There is an occurrence of P starting at position i of T iff ms(i)=|P|.
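
As a reference for the definition, the following brute-force sketch computes ms(i) directly with substring tests. It is far slower than a suffix-tree method and is meant only to check small examples; the function name `matching_statistics` is hypothetical, and the returned list is 0-based, so `ms[i - 1]` holds ms(i).

```python
def matching_statistics(t, p):
    """Return [ms(1), ..., ms(|T|)] by brute force."""
    ms = []
    for i in range(len(t)):
        length = 0
        # Extend while the next longer prefix of t[i:] still occurs in p.
        while i + length < len(t) and t[i:i + length + 1] in p:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("abcxabcdex", "wyabcwzqabcdw")
print(ms[0], ms[4])   # ms(1) and ms(5)
```

Positions i with ms(i) = |P| are exactly the occurrences of P in T, which gives a quick sanity check on short strings.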


Goal: Compute ms(i) for each position i in T, in O(m) total time, using only a suffix tree for P.

  • Naive way: match T[i...m] starting from the root → more than O(m) total.

    Using suffix links:

  • Build suffix tree for P (Ukkonen) and keep suffix links.

  • suffix link: pointer from internal node v with path-label xα to node s(v) with path-label α (x is a character, α a substring).


Compute ms(i) in order

base case: For ms(1), match T[1...m] from the root.

general case: Suppose the matching path for ms(i) ended at point b, then for ms(i+1):

  • Let v be the first internal node at or above b.

  • If there is no such v – search from the root.

  • Otherwise – follow the suffix link from v to s(v) and search from s(v). path_label(v) = xα is a prefix of T[i...m] ⇒ path_label(s(v)) = α is a prefix of T[i+1...m].


skip / count

  • Let β denote the string between node v and point b.

  • substring xαβ in P matches a prefix of T[i...m].

  • substring αβ in P matches a prefix of T[i+1...m].

  • Traverse the path labeled β out of s(v) using the skip/count trick (time proportional to the number of nodes on the path).

  • From the end of β, match single characters (starting with the first character that didn’t match for ms(i)).


Time analysis

In the search for ms(i+1):

  • back up at most one edge from b to v → O(1).

  • traverse suffix link from v to s(v) → O(1).

  • traverse a β-path from s(v) in time proportional to the number of nodes on it → O(m) total.

  • perform additional comparisons starting with the first character that didn’t match for ms(i) → O(m) total.


Ziv-Lempel data compression


Definitions

For any position i in string S of length m:

  • Prior_i - longest prefix of S[i...m] that occurs as a substring of S[1...i-1].

  • l_i - length of Prior_i.

  • s_i - starting position of the left-most copy of Prior_i (defined when l_i > 0).

    Example: S = abaxcabaxabz, Prior_7 = bax, l_7 = 3, s_7 = 2.

  • The copy of Prior_i starting at s_i is totally contained in S[1...i-1].


Basic idea

  • Suppose the text S[1...i-1] has been represented (perhaps in compressed form) and l_i > 0.

  • Then Prior_i need not be explicitly represented.

  • The pair (s_i,l_i) points to an earlier occurrence of Prior_i.

  • Example: S = abaxcabaxabz: Prior_7 = bax can be represented by the pair (2,3).


Compression algorithm (outline)

i := 1

Repeat

    compute l_i and s_i

    if l_i > 0 then

        output (s_i,l_i)

        i := i + l_i

    else

        output S(i)

        i := i + 1

Until i > m
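
The outline can be turned into a short Python sketch. As an assumption for brevity, the pair for position i is found here by plain substring search rather than with a suffix tree, so the running time is much worse than linear. The names `prior` and `compress` are hypothetical; positions are 1-based as on the slides.

```python
def prior(s, i):
    """Return (s_i, l_i) for 1-based position i; (0, 0) when l_i = 0."""
    seen = s[:i - 1]                       # S[1...i-1]
    best_pos, best_len = 0, 0
    length = 1
    while i - 1 + length <= len(s):
        pos = seen.find(s[i - 1:i - 1 + length])   # leftmost copy, if any
        if pos < 0:
            break
        best_pos, best_len = pos + 1, length
        length += 1
    return best_pos, best_len


def compress(s):
    """Emit single characters and (s_i, l_i) pairs, following the outline."""
    out, i = [], 1
    while i <= len(s):
        s_i, l_i = prior(s, i)
        if l_i > 0:
            out.append((s_i, l_i))
            i += l_i
        else:
            out.append(s[i - 1])
            i += 1
    return out


print(compress("abacabaxabz"))
```

Because `find` searches only within S[1...i-1], every copy it reports is fully contained in the already processed text, as the definition of Prior_i requires.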


Examples

S1 = a b a c a b a x a b z

→ a b (1,1) c (1,3) x (1,2) z

S2 = ab ababababababababababababababab

→ ab (1,2) (1,4) (1,8) (1,16)

S = (ab)^k → compressed representation is O(log k)


Decompress

  • Process the compressed string left to right.

  • Any pair (s_i,l_i) in the representation points to a substring that has already been fully decompressed.
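
A minimal decompression sketch in Python, assuming the compressed form is a list whose items are either single characters or (s_i, l_i) tuples with 1-based start positions (this token format and the name `decompress` are assumptions made for illustration):

```python
def decompress(tokens):
    """Rebuild the string left to right from characters and (s_i, l_i) pairs."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            s_i, l_i = tok
            # The referenced copy lies entirely in the part already rebuilt.
            out.extend(out[s_i - 1:s_i - 1 + l_i])
        else:
            out.append(tok)
    return "".join(out)


print(decompress(["a", "b", (1, 1), "c", (1, 3), "x", (1, 2), "z"]))
```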


Computing (s_i,l_i)

  • The algorithm does not request (s_i,l_i) for any position i already in the compressed part of S.

  • For total O(m) time, find each requested pair (s_i,l_i) in O(l_i) time.

compute l_i and s_i

if l_i > 0 then

    output (s_i,l_i)

    i := i + l_i


Implementation using suffix tree (1)

Before compression:

  • Build a suffix tree T for S.

  • For each node v, compute c_v:

    • the smallest leaf index in v’s subtree,

    • which is also the starting position of the leftmost copy of the substring that labels the path from the root to v.

  • O(m) time.


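Implementation using suffix trees (2)

computing (s_i,l_i): a point p with path-label α on the path of S[i...m] is usable as long as |α| + c_v ≤ i, where v is the first node at or below p.

[Figure: path from the root through point p and node v down to leaf i; the copy of α that starts at c_v must end before position i.]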


Implementation using suffix trees (3)

  • To compute (s_i,l_i), traverse the unique path in T that matches a prefix of S[i...m]:

    • Let: p - current point, v - first node at or below p.

    • Traverse as long as: string_depth(p) + c_v ≤ i.

    • At the last point p of traversal: l_i = string_depth(p), s_i = c_v.

  • O(l_i) time.
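
The traversal rule can be illustrated with a small Python sketch. As a simplifying assumption it uses an uncompressed suffix trie (quadratic size) in place of the suffix tree, so every point on a path is a node and carries its own c value. The names `Node`, `build_trie_with_cv` and `prior_from_trie` are hypothetical, and `$` is assumed not to occur in S.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.c = None    # c_v: smallest 1-based start of a suffix through this node


def build_trie_with_cv(s):
    text = s + "$"
    root = Node()
    for start in range(1, len(text) + 1):   # increasing start positions
        node = root
        for ch in text[start - 1:]:
            node = node.children.setdefault(ch, Node())
            if node.c is None:               # the first visitor is the leftmost copy
                node.c = start
    return root


def prior_from_trie(root, s, i):
    """Walk S[i...m] from the root while string_depth + c_v <= i."""
    node, depth, s_i = root, 0, 0
    while i - 1 + depth < len(s):
        child = node.children.get(s[i - 1 + depth])
        if child is None or depth + 1 + child.c > i:
            break
        node, depth, s_i = child, depth + 1, child.c
    return s_i, depth                        # (s_i, l_i); (0, 0) when l_i = 0


root = build_trie_with_cv("abababab")
print(prior_from_trie(root, "abababab", 3), prior_from_trie(root, "abababab", 5))
```

Stopping at the first violation is safe because going deeper can only increase both the string depth and the c value.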


Example

S = a b a b a b a b

    1 2 3 4 5 6 7 8

i=1  l_i=0  → a

i=2  l_i=0  → b

i=3  l_i=2  c_v=1  → (1,2)

i=5  l_i=4  c_v=1  → (1,4)

[Figure: suffix tree for S = abababab$ with leaves 1 to 8, annotated with c_v values (1 or 2), a string depth, and nodes v1 and v2.]


Online version

  • Compress S as it is being input one character at a time.

  • Possible since S[1...i-1] is known before computing (s_i,l_i).

  • Implementation: build suffix tree online.

    Ukkonen’s algorithm:

    • In phase i, build implicit suffix tree T_i for prefix S[1...i].


Claim 1

Assume:

  • The compression has been done for S[1...i-1].

  • Implicit suffix tree T_{i-1} for S[1...i-1] has been built.

  • c_v values are given for each node v in T_{i-1}.

    Then (s_i,l_i) can be obtained in O(l_i) time.


Suppose we had a suffix tree for S[1...i-1] with c_v values → we could find (s_i,l_i) in O(l_i) time.

l_i = string_depth(p)

s_i = c_v

[Figure: path from the root spelling S(i), S(i+1), ..., S(k-1) down to point p; the next character c on the way to node v satisfies c ≠ S(k).]


The missing leaves in the implicit suffix tree are not needed.

[Figure: the suffix tree and the implicit suffix tree for S[1...i-1], side by side. Both contain the path S(i) ... S(k-1) from the root to point p, followed by a character c ≠ S(k). The implicit tree lacks leaf j but still has leaf h, where h < j.]


Claim 2

c_v values for all implicit suffix trees can be computed in total O(m) time.

  • In Ukkonen’s algorithm:

    • Only extension rule 2 updates c_v values.

    • Whenever a new internal node v is created by splitting an edge (u,w): c_v ← c_w.

    • Whenever a new leaf j is created: c_j ← j.

    ⇒ constant update time per new node.


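Updating c_v values

[Figure: the two cases of extension rule 2 on the path S(j) ... S(i) from the root. New leaf: leaf j is attached below an existing node v by an edge starting with S(i+1). New leaf and new node: edge (u,w) is split by a new internal node v, which receives the new leaf j.]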


Online algorithm

  • Base case: output S(1) and build T_1.

  • General case: Suppose S[1...i-1] has been compressed and T_{i-1} with c_v values has been constructed.

    • Match S(i),S(i+1),... along a path from the root in T_{i-1}.

    • Let S(k) be the first that doesn’t match.

    • Find (s_i,l_i).

    • If l_i = 0, output S(i) and build T_i with c_v.

    • If l_i > 0, output (s_i,l_i) and build T_i,...,T_{k-1} with c_v.

  • Total time: O(m).


Maximal Repetitive Structures


Maximal Pair

  • A maximal pair in string S: a pair of identical substrings α and β in S s.t. the character to the immediate left (right) of α is different from the character to the immediate left (right) of β.

  • Extending α and β in either direction would destroy the equality of the two strings.

  • Example: S = xabcyiiizabcqabcyrxar


Maximal Pair (continued)

  • Overlap is allowed: S = cxxaxxaxxbcxxaxxaaxxaxxb

  • To allow a prefix or suffix of S to be part of a maximal pair: S → #S$ (#,$ don’t appear in S). Example: #abcxabc$


Maximal Repeat

  • A maximal repeat in string S:

    A substring of S that occurs in a maximal pair in S.

  • Example: S = xabcyiiizabcqabcyrxar

    maximal repeats: abc, abcy, ...


Finding All Maximal Repeats In Linear Time

  • Given: String S of length n.

  • Goal: Find all maximal repeats in O(n) time.

  • Lemma: Let T be a suffix tree for S. If string α is a maximal repeat in S, then α is the path-label of an internal node v in T.


Proof – by def. of maximal repeat

S = xabcyiiizabcqabcyrxar

[Figure: path from the root labeled abc, ending at internal node v, which has outgoing edges beginning with y and q.]


Conclusion

  • There can be at most n maximal repeats in any string of length n.

  • Proof:

    by the lemma, since T has at most n internal nodes.


Which internal nodes correspond to maximal repeats?

  • The left character of leaf i in T is S(i-1).

  • Node v of T is left diverse if at least 2 leaves in v’s subtree have different left characters.

  • A leaf can’t be left diverse.

  • Left diversity propagates upward.


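Example: S = #xabxa$ (positions 1 2 3 4 5 6 refer to xabxa$)

[Figure: the suffix tree for xabxa$ with leaves 1 to 6, each leaf annotated with its left character (# for leaf 1, x for leaf 2, a for leaf 3, b for leaf 4, x for leaf 5, a for leaf 6). The node with path-label xa is left diverse, so xa is a maximal repeat; the node with path-label a is not left diverse.]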


Theorem

The string α labeling the path to an internal node v of T is a maximal repeat

⇔ v is left diverse.


Proof of ⇒

  • Suppose α is a maximal repeat ⇒

  • It participates in a maximal pair ⇒

  • It has at least two occurrences with distinct left characters: xα, yα, x ≠ y ⇒

  • Let i and j be the two starting positions of α. Then leaves i and j are in v’s subtree and have different left characters x, y ⇒

  • v is left diverse.


Proof of ⇐

  • Suppose v is left diverse ⇒ there are substrings xαp and yαq in S, x ≠ y.

  • If p ≠ q ⇒ α’s occurrences in xαp and yαq form a maximal pair ⇒ α is a maximal repeat.

  • If p = q ⇒ since v is a branching node, there is a substring zαr in S, r ≠ p. If z ≠ x ⇒ it forms a maximal pair with xαp. If z ≠ y ⇒ it forms a maximal pair with yαp. In either case, α is a maximal repeat.


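Proof of ⇐ (continued)

[Figure: two trees, each with a path labeled α from the root to node v. Case 1: v has one child edge beginning p... below which a leaf has left char x, and another beginning q... below which a leaf has left char y. Case 2: v has a child edge beginning p... with leaves of left chars x and y, and another child edge beginning r... with a leaf of left char z.]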


Compact Representation

  • Node v in T is a frontier node if:

    • v is left diverse.

    • none of v’s children are left diverse.

  • Each node at or above the frontier is left diverse.

  • The subtree of T from the root down to the frontier nodes is a compact representation of the set of all maximal repeats of S.

  • The representation takes O(n) space, though the total length of the maximal repeats may be larger.


Linear time algorithm

  • Build suffix tree T.

  • Find all left diverse nodes in linear time.

  • Delete all nodes that aren’t left diverse, to achieve the compact representation.


finding all left diverse nodes in linear time

  • Traverse T bottom-up, recording for each node:

    • either that it is left diverse

    • or the left character common to all leaves in its subtree.

  • For each leaf: record its left character.

  • For each internal node v:

    • If any child is left diverse → v is left diverse.

    • Else if all children have a common character x → record x for v.

    • Else record that v is left diverse.
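
The bottom-up marking can be sketched in Python. This is an illustration under simplifying assumptions: it uses an uncompressed suffix trie (quadratic size) in which the nodes with at least two children play the role of the suffix tree's internal nodes, it recurses to a depth of |S|+1, and it assumes `#` and `$` do not occur in S. All names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.start = None    # 1-based suffix start, set on leaves only


def build_suffix_trie(text):
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i + 1
    return root


DIVERSE = object()           # marker meaning "this node is left diverse"


def maximal_repeats(s):
    text = s + "$"
    found = []

    def visit(node, label):
        if not node.children:                      # leaf: record its left character
            i = node.start
            return text[i - 2] if i > 1 else "#"
        marks = {visit(child, label + ch) for ch, child in node.children.items()}
        # A single common character is passed up; anything else means diverse.
        mark = marks.pop() if len(marks) == 1 else DIVERSE
        if mark is DIVERSE and len(node.children) >= 2 and label:
            found.append(label)                    # branching and left diverse
        return mark

    visit(build_suffix_trie(text), "")
    return found


print(maximal_repeats("xabxa"))
```

For S = xabcyiiizabcqabcyrxar the result contains both abc and abcy, in line with the maximal repeat example earlier in the slides.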


Finding All Maximal Pairs In Linear Time

  • Not every two occurrences of a maximal repeat form a maximal pair.

    Example: S = xabcyiiizabcqabcyrxar

  • There can be more than O(n) maximal pairs.

  • The algorithm is O(n+k) where k is the number of maximal pairs.


General Idea

For each node u and character x: keep all leaf numbers below u whose left character is x.

To find all maximal pairs of α:

For each character x, form the cartesian product of the list for x at v1 with every list for a character ≠ x at v2.

[Figure: node v with path-label α has children v1 (edge beginning p...) and v2 (edge beginning q...); leaf i below v1 has left char x and leaf j below v2 has left char y.]


The Algorithm

  • Build suffix tree T for S.

  • Record the left character of each leaf.

  • Traverse T bottom-up.

  • At each node v with path-label α:

    • Output all maximal pairs of α: cartesian product of lists (u,x) and (u’,x’) for each pair of children u ≠ u’ and pair of characters x ≠ x’.

    • Create the lists for node v by linking the lists of v’s children.


Time Analysis

  • Suffix tree construction → O(n).

  • Bottom-up traversal including all list-linking → O(n).

  • All cartesian product operations → O(k), where k is the number of maximal pairs.

  • Total O(n+k).


Finding All Supermaximal Repeats In Linear Time

  • supermaximal repeat: a maximal repeat that isn’t a substring of any other maximal repeat.

  • Example: S = xabcyiiizabcqabcyrxar: abcy is supermaximal, abc isn’t.

  • Theorem: A left diverse internal node v in the suffix tree for S represents a supermaximal repeat iff

    • all of v’s children are leaves

    • and each has a distinct left character


Longest Common Extension


Longest common extension problem

Preprocess strings S1 and S2 s.t. the following queries can be computed in O(1) time each:

  • Given index pair (i,j), find the length of the longest substring of S1 starting at position i that matches a substring of S2 starting at position j.

    S1: ... abcdzzz ...   (abcd begins at position i)

    S2: ... abcdefg ...   (abcd begins at position j)


Solution

Preprocess: O(|S1|+|S2|)

  • Build generalized suffix tree T for S1 and S2.

  • Preprocess T for constant-time LCA queries.

  • Compute string-depth of every node.

    To answer query (i,j): O(1)

  • Find LCA node v of leaves corresponding to suffix i of S1 and suffix j of S2.

  • Return string-depth(v).
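
To make the query semantics concrete, here is a direct-comparison sketch in Python. It is an assumption-laden stand-in: it costs time proportional to the answer, not O(1), and does not build the generalized suffix tree or the LCA structure described above. The name `lce` is hypothetical; positions are 1-based.

```python
def lce(s1, i, s2, j):
    """Length of the longest common extension of S1 from i and S2 from j."""
    k = 0
    while (i - 1 + k < len(s1) and j - 1 + k < len(s2)
           and s1[i - 1 + k] == s2[j - 1 + k]):
        k += 1
    return k


print(lce("abcdzzz", 1, "abcdefg", 1))
```

An implementation backed by the suffix tree would return the same values as string-depth(LCA) of the two leaves, so this function can serve as a test oracle.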


Tandem Repeats


Definition

tandem repeat: a string α that can be written as α = ββ, where β is a substring.

s = x a b a b a b a b y

Tandem repeats in s (the two copies are separated by |):

  • ab|ab (β = ab), at three positions

  • ba|ba (β = ba), at two positions

  • abab|abab (β = abab)

note: β is not required to be of maximal length.


Finding all tandem repeats – simple solution

For each feasible pair of start position i and middle position j:

  • Perform a longest common extension query from i and j.

  • If the extension from i reaches j or beyond, (i,j) defines a tandem repeat.

    1 ... i ... j-1 | j ... 2j-i-1 | 2j-i ... n

    (the two copies S[i...j-1] and S[j...2j-i-1] each have length j-i)


Time analysis

  • Preprocess for longest common extension: O(n).

  • O(n^2) feasible pairs, O(1) time to check each one.

  • O(n^2) total.
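
A sketch of the simple solution in Python. As an assumption, the longest common extension is computed by direct comparison, so each check is not O(1) as it would be with the preprocessed suffix tree; the names `lce` and `tandem_repeats` are hypothetical and positions are 1-based.

```python
def lce(s, i, j):
    """Longest common extension of s from positions i and j (1-based)."""
    k = 0
    while i - 1 + k < len(s) and j - 1 + k < len(s) and s[i - 1 + k] == s[j - 1 + k]:
        k += 1
    return k


def tandem_repeats(s):
    """Return all (i, j) with S[i...j-1] = S[j...2j-i-1]."""
    n, found = len(s), []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if 2 * j - i - 1 > n:        # second copy would run past the end
                break
            if lce(s, i, j) >= j - i:    # extension from i reaches j or beyond
                found.append((i, j))
    return found


print(tandem_repeats("xababababy"))
```

On the string xababababy this finds six (start, middle) pairs.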


Finding all tandem repeats – faster solution

  • Due to Landau & Schmidt.

  • O(n log n + z) time.

  • z = total number of tandem repeats in S.

  • z can be as large as Θ(n^2).

  • Example: all n characters are the same.

  • In practice, z is expected to be smaller.


Divide and conquer

Let h = n/2.

  1. Find all tandem repeats contained entirely in the first half of S (up to h).

  2. Find all tandem repeats contained entirely in the second half of S (after h).

  3. Find all tandem repeats where the first copy contains h.

  4. Find all tandem repeats where the second copy contains h.


Solution of subproblems

  • 1 and 2 → solved recursively.

  • 3 and 4 → symmetric to each other.

  • Remains to show: solution for 3.


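Solution for subproblem 3

  • For each l = 1,...,h, find all tandem repeats of length exactly 2l whose first copy contains h.

[Figure: the string with positions h-1, h and q-1, q marked, where q lies l positions after h; a forward match of length l1 starts at h and at q, and a backward match of length l2 ends at h-1 and at q-1.]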


Algorithm for fixed l

  1. Let q = h+l.

  2. Compute longest common extension from h and q. Let l1 denote its length.

  3. Compute longest common extension from h-1 and q-1 in reverse direction. Let l2 denote its length.

  4. There is a tandem repeat of length 2l whose first copy contains h iff l1 ≥ 1 and l1 + l2 ≥ l.


Output for fixed l

  5. If the condition holds, output starting positions: Max(h-l2, h-l+1), ..., Min(h+l1-l, h).

[Figure: the string with positions h-l2, h+l1-l, h, h+l-l2, q and h+l1 marked, showing the backward match of length l2 and the forward match of length l1.]
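
The steps for a fixed l can be sketched in Python for subproblem 3 alone. Two assumptions apply: the extensions are computed by direct comparison instead of O(1) queries, and only the tandem repeats whose first copy contains a given position h are reported, without the surrounding recursion. All function names are hypothetical; positions are 1-based.

```python
def lce_forward(s, i, j):
    k = 0
    while i + k <= len(s) and j + k <= len(s) and s[i + k - 1] == s[j + k - 1]:
        k += 1
    return k


def lce_backward(s, i, j):
    k = 0
    while i - k >= 1 and j - k >= 1 and s[i - k - 1] == s[j - k - 1]:
        k += 1
    return k


def repeats_with_first_copy_at(s, h):
    """Return (start, l) for every tandem repeat of length 2l whose first copy contains h."""
    found = []
    for l in range(1, h + 1):
        q = h + l
        if q > len(s):
            break
        l1 = lce_forward(s, h, q)              # forward from h and q
        l2 = lce_backward(s, h - 1, q - 1)     # backward from h-1 and q-1
        if l1 >= 1 and l1 + l2 >= l:
            first = max(h - l2, h - l + 1)
            last = min(h + l1 - l, h)
            found.extend((start, l) for start in range(first, last + 1))
    return found


print(repeats_with_first_copy_at("xababababy", 5))
```

With s = xababababy and h = 5 this reports abab at position 4, baba at position 5, and abababab at position 2.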


Time analysis

  • For fixed l:

    • O(1) longest common extension queries.

  • For subproblem 3 on a string of length n:

    • O(n) longest common extension queries.

  • Entire algorithm on a string of length n:

    • Let T(n) denote the number of longest common extension queries for a string of length n. T(n) = 2T(n/2) + 2n ⇒ T(n) = O(n log n).

    • Including output: O(n log n + z) total.


Inexact Matching


The k-mismatch problem

  • Given: pattern P, text T, fixed number k.

  • k-mismatch of P: a |P|-length substring of T that matches at least |P|-k characters of P (i.e. it matches P with at most k mismatches).

  • The k-mismatch problem: Find all k-mismatches of P in T.


Example

P = bend

T = abentbananaend

k = 2

  • T contains three 2-mismatches of P:

    • bent at position 2 (1 mismatch)

    • bana at position 6 (2 mismatches)

    • aend at position 11 (1 mismatch)


Solution

  • Notation: |P|=n, |T|=m, k independent of n and m (k<<n).

  • General idea:

    • For each position i in T, determine whether a k-mismatch of P begins at position i.

    • To do this efficiently: successively execute up to k+1 longest common extension queries.

    • A k-mismatch of P begins at position i iff these extensions reach the end of P.


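solution (continued)

[Figure: P (positions 1 to n) aligned with T starting at position i; query 1, query 2 and query 3 are successive longest common extension queries, each resuming just past the previous mismatch.]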


Algorithm for index i

  1. j ← 1, i’ ← i, count ← 0.

  2. Compute the length l of the longest common extension starting at positions j of P and i’ of T.

  3. If j+l = n+1, then a k-mismatch of P occurs in T starting at i; stop.

  4. If count < k, then count ← count+1, j ← j+l+1, i’ ← i’+l+1, go to step 2. Else, a k-mismatch of P does not occur in T starting at i; stop.
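
The steps above translate almost line by line into Python. As an assumption, the longest common extension is computed by direct comparison rather than by O(1) queries on a preprocessed suffix tree, so this sketch does not reach the O(km) bound. The names `lce` and `k_mismatch_positions` are hypothetical; positions are 1-based.

```python
def lce(p, j, t, i):
    """Longest common extension of P from j and T from i (1-based)."""
    l = 0
    while j - 1 + l < len(p) and i - 1 + l < len(t) and p[j - 1 + l] == t[i - 1 + l]:
        l += 1
    return l


def k_mismatch_positions(p, t, k):
    """Start positions in T of all k-mismatches of P."""
    n, m = len(p), len(t)
    hits = []
    for i in range(1, m - n + 2):
        j, i2, count = 1, i, 0
        while True:
            l = lce(p, j, t, i2)
            if j + l == n + 1:        # the extensions reached the end of P
                hits.append(i)
                break
            if count < k:             # skip one mismatching position
                count += 1
                j += l + 1
                i2 += l + 1
            else:
                break                 # more than k mismatches at this i
    return hits


print(k_mismatch_positions("bend", "abentbananaend", 2))
```

For P = bend, T = abentbananaend and k = 2 the reported positions are 2, 6 and 11.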


Time Analysis

  • Preprocessing of T and P for longest common extension queries → O(m).

  • For each index i=1,...,m-n+1 of T, up to k+1 longest common extension queries → O(k) per index → O(km) total.

  • Total O(km) time.


k-mismatch tandem repeat

  • definition: tandem repeat in which the two copies differ by at most k mismatches.

  • example: axab|aybb

  • goal: find all k-mismatch tandem repeats.

  • simple solution:

    For each feasible pair of starting position i and middle position j, check if S[i...j-1] and S[j...2j-i-1] match with at most k mismatches.

    → O(kn^2)

  • faster solution exists: O(kn log(n/k) + z).


faster k-mismatch tandem repeats

  • O(kn log(n/k) + z)

  • same divide and conquer algorithm.

  • subproblem 3 for fixed l: find all k-mismatch tandem repeats of length 2l whose first copy contains h.

  • let q = h+l.

  • run k successive longest common extension queries forward from h and q. mark every mismatch with ->.

  • run k successive longest common extension queries backward from h-1 and q-1. mark every mismatch with <-

  • position t in [h+1,q] is a middle position of a k-mismatch tandem repeat iff the number of -> mismatches in positions h,...,t-1 plus the number of <- mismatches in positions t,...,q-1 is ≤ k.


faster k-mismatch tandem (continued)

  • a c c e e | ab c d e | a c c d d

  • calculate sums only for positions with arrows, to get intervals of legal midpoints.

  • O(k) time per fixed l, and any l ≤ k need not be checked.

[Figure: positions h, t and q marked on the example string.]

