suffix tree and suffix array techniques for pattern analysis in strings l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Suffix tree and suffix array techniques for pattern analysis in strings PowerPoint Presentation
Download Presentation
Suffix tree and suffix array techniques for pattern analysis in strings

Loading in 2 Seconds...

play fullscreen
1 / 25

Suffix tree and suffix array techniques for pattern analysis in strings - PowerPoint PPT Presentation


  • 171 Views
  • Uploaded on

Suffix tree and suffix array techniques for pattern analysis in strings. Esko Ukkonen Univ Helsinki Erice School 30 Oct 2005 Modified Alon Itai 2006. Pattern finding & synthesis problems. T = t 1 t 2 … t n , P = p 1 p 2 … p n , strings of symbols in finite alphabet

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Suffix tree and suffix array techniques for pattern analysis in strings' - jens


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
suffix tree and suffix array techniques for pattern analysis in strings

Suffix tree and suffix array techniques for pattern analysis in strings

Esko Ukkonen

Univ Helsinki

Erice School 30 Oct 2005

Modified Alon Itai 2006

pattern finding synthesis problems
Pattern finding & synthesis problems
  • T = t1t2 … tn, P = p1 p 2 … pn, strings of symbols in finite alphabet
  • Indexing problem: Preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast
    • static text, any given pattern P
  • Pattern synthesis problem: Learn from T new patterns that occur surprisingly often
  • What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, …
slide3

Suffix tree

  • Suffix array
  • Some applications
  • Finding motifs
suffix array example
Suffix array: example

ε

atti

attivatti

hattivatti

i

ivatti

ti

tivatti

tti

ttivatti

vatti

11

7

2

1

10

5

9

4

8

3

6

hattivatti

attivatti

ttivatti

tivatti

ivatti

vatti

atti

tti

ti

i

ε

  • suffix array = lexicographic order of the suffixes
suffix array
Suffix array
  • suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
  • space requirement: 5|T|למה 5?
  • practitioners like suffix arrays (simplicity, space efficiency)
  • theoreticians like suffix trees (explicit structure)
pattern search from suffix array
Pattern search from suffix array

ε

atti

attivatti

hattivatti

i

ivatti

ti

tivatti

tti

ttivatti

vatti

11

7

2

1

10

5

9

4

8

3

6

hattivatti

attivatti

ttivatti

tivatti

ivatti

vatti

atti

tti

ti

i

ε

att

binary search

slide7

m

m

l

l

l

m

u

u

u

pat

pat

pat

U= m

l = m

  • The search time is O(m log n), where

m = length of search string,

n = length of text (and size of suffix array).

With LCA = longest common ancestor

time = O(m + log n).

recent suffix array constructions
Recent suffix array constructions
  • Manber&Myers (1990): O(|T|log|T|)
  • linear time via suffix tree
  • January / June 2003: direct linear time construction of suffix array - Kim, Sim, Park, Park (CPM03) - Kärkkäinen & Sanders (ICALP03) - Ko & Aluru (CPM03)
k rkk inen sanders algorithm
Kärkkäinen-Sanders algorithm
  • Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.
  • Construct the suffix array of the remaining suffixes using the result of the first step.
  • Merge the two suffix arrays into one.
notation
Notation
  • string T = T[0,n) = t0t1 … tn-1
  • suffix Si = T[i,0) = titi+1 … tn-1
  • for C  [0,n]: SC = {Si | i in C}
  • suffix array SA[0,n] of T is a permutation of [0,n] satisfying SSA[0] < SSA[1] < … < SSA[n]

T[SA[0],n)

running example
Running example

0 1 2 3 4 5 6 7 8 9 10 11

  • T[0,n) = y a b b a d a b b a d o 0 0
  • SA = (12,1,6,4,9,3,8,2,7,5,10,11,0)
step 0 construct a sample
Step 0: Construct a sample
  • for k = 0,1,2 Bk = {i є [0,n] | i mod 3 = k}
  • C = B1 U B2 sample positions
  • SC sample suffixes
  • Example: B1 = {1,4,7,10}, B2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}
step 1 sort sample suffixes
Step 1: Sort sample suffixes
  • for k = 1,2, construct

Rk = [tktk+1tk+2] [tk+3tk+4tk+5]… [tmaxBktmaxBk+1tmaxBk+2]

R = R1 º R2 (concatenation of R1 and R2)

Suffixes of R correspond to SC:

suffix [titi+1ti+2]… corresponds to Si ;

The correspondence is order preserving:

Let Ri’Si andRj’Sj.Then Ri’<Rj’ iff Si < Sj

sort the suffixes of r
Sort the suffixes of R

Radix sort the characters and rename with ranks to obtain R´.

Example:R1 R2

R = [abb][ada][bba][do0] [bba][dab][bad][o00]

1 2 3 4 5 6 7

[abb][ada][bad][bba] [dab] [do0] [o00]

R´ = (1,2,4,6,4,5,3,7)

If all characters are different, their order directly gives the order of suffixes.

Otherwise, sort the suffixes of R´ using Kärkkäinen-Sanders.

Note: |R´| = 2n/3.

step 1 cont
Step 1 (cont.)
  • Once the sample suffixes are sorted, assign a rank to each: rank(Si) = the rank of Si in SC; rank(Sn+1) = rank(Sn+2) = 0
  • Example: R´ = (1,2,4,6,4,5,3,7) 0: ε3: 37 6: 537

1:12464537 4: 4537 7: 64537

2:24645,7 5: 464537 8: 7

SAR´ = (8,0,1,6,4,2,5,3,7) (The suffix array for R’)

SAR´-1 = (1 2 5 74 6 3 8)

rank(Si) (– 1 4– 2 6– 5 3– 7 8–0 0 )

step 2 sort nonsample suffixes
Step 2: Sort nonsample suffixes
  • for each non-sample Siє SB0 (note that rank(Si+1) is always defined for i є B0): Si ≤ Sj ↔ (ti,rank(Si+1)) ≤ (tj,rank(Sj+1))
  • radix sort the pairs (ti,rank(Si+1)).
  • Example: S12 < S6 < S9 < S3 < S0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
slide17
יש לפרט יותר

Example: S12 < S6 < S9 < S3 < S0 because

S0 = yabbadabbado= yS1=(y,S3 = badabbado= bS4=(b,

S6 = abbado= aS7=(a

S9 =ado= aS10=(a

S12=0 = 0eps = (0,0)

(0,0) < (a,5) < (a,7) < (b,2) < (y,1)

step 3 merge
Step 3: Merge
  • merge the two sorted sets of suffixes using a standard comparison-based merging:
  • to compare Siє SC with Sjє SB0, distinguish two cases:
  • i є B1: Si ≤ Sj ↔ (ti,rank(Si+1)) ≤ (tj,rank(Sj+1))
  • i є B2: Si ≤ Sj ↔ (ti,ti+1,rank(Si+2)) ≤ (tj,tj+1,rank(Sj+2))
  • note that the ranks are defined in all cases!
  • S1 < S6 as (a,4) < (a,5) and S3 < S8 as (b,a,6) < (b,a,7)

 B1

 B2

running time o n
Running time O(n)
  • excluding the recursive call, everything can be done in linear time
  • the recursion is on a string of length 2n/3
  • thus the time is given by recurrence T(n) = T(2n/3) + O(n)
  • hence T(n) = O(n)
implementation
Implementation
  • about 50 lines of C++
  • code available e.g. via Juha Kärkkäinen’s home page
lcp table
LCP table
  • Longest Common Prefix of successive elements of suffix array:
  • LCP[i] = length of the longest common prefix of suffixes SSA[i] and SSA[i+1]
  • Algorithm:
  • Enter the suffixes in a trie
  • Find the lca.
  • Complexity = O(n2)
kasai et al cpm2001
Kasai et al, CPM2001

Key observation:

Let LCP[q]=h>1, i.e.,

S SA[q] = titi+1…ai+h-1ti+h

S SA[q+1]= tktk+1…tk+h-1tk+h

= titi+1…ti+h-1ti+h (tk+h≠ti+h)

  • Then ti+1…ti+h-1=tk+1…tk+h-1,.
  • Define p SSA[p] =ti+1…ti+h-1…therefore SSA[p+1]=ti+1…ti+h-1 …
  • i.e., LCP[p] ≥ h-1
  • When computing LCP[p] we can start the comparisons at position p+h-1.
the algorithm
The algorithm

for(i=0; i<n; i++) /* compute SA-1 */

SA-1[SA[i]] = i;

h = 0;

for(p=0; p<n; p++) {

if(SA-1[p] > 0){

r = SA[SA-1 [p]+1] ; while(T[r+h] = T[p+h])

h++;

LCP[SA-1 [p]] = h;

if(h > 0)

h--;

}

}

innermost statement

Complexity:

Since h is decreased at most n times,

and h ≤ n,

h can be increased at most 2n times;

i.e., the innermost statement isexecuted ≤ 2n times.

Total time = O(n).

suffix tree vs suffix array

SSA[0]

Suffix tree vs suffix array
  • suffix tree  suffix array + LCP table

First step

slide25

SSA[0]

S SA[i]

SSA[i-1]

  • Step i

Which edge to split?

Complexity:

The final trie has 2n vertices.

Each edge is traversed ≤ twice.

Time = O(n).