Suffix tree and suffix array techniques for pattern analysis in strings

1 / 25

# Suffix tree and suffix array techniques for pattern analysis in strings - PowerPoint PPT Presentation

Suffix tree and suffix array techniques for pattern analysis in strings. Esko Ukkonen Univ Helsinki Erice School 30 Oct 2005 Modified Alon Itai 2006. Pattern finding & synthesis problems. T = t 1 t 2 … t n , P = p 1 p 2 … p n , strings of symbols in finite alphabet

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## Suffix tree and suffix array techniques for pattern analysis in strings

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Suffix tree and suffix array techniques for pattern analysis in strings

Esko Ukkonen

Univ Helsinki

Erice School 30 Oct 2005

Modified Alon Itai 2006

Pattern finding & synthesis problems
• T = t1t2 … tn, P = p1 p 2 … pn, strings of symbols in finite alphabet
• Indexing problem: Preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast
• static text, any given pattern P
• Pattern synthesis problem: Learn from T new patterns that occur surprisingly often
• What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, …

Suffix tree

• Suffix array
• Some applications
• Finding motifs
Suffix array: example

ε

atti

attivatti

hattivatti

i

ivatti

ti

tivatti

tti

ttivatti

vatti

11

7

2

1

10

5

9

4

8

3

6

hattivatti

attivatti

ttivatti

tivatti

ivatti

vatti

atti

tti

ti

i

ε

• suffix array = lexicographic order of the suffixes
Suffix array
• suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
• space requirement: 5|T|למה 5?
• practitioners like suffix arrays (simplicity, space efficiency)
• theoreticians like suffix trees (explicit structure)
Pattern search from suffix array

ε

atti

attivatti

hattivatti

i

ivatti

ti

tivatti

tti

ttivatti

vatti

11

7

2

1

10

5

9

4

8

3

6

hattivatti

attivatti

ttivatti

tivatti

ivatti

vatti

atti

tti

ti

i

ε

att

binary search

m

m

l

l

l

m

u

u

u

pat

pat

pat

U= m

l = m

• The search time is O(m log n), where

m = length of search string,

n = length of text (and size of suffix array).

With LCA = longest common ancestor

time = O(m + log n).

Recent suffix array constructions
• Manber&Myers (1990): O(|T|log|T|)
• linear time via suffix tree
• January / June 2003: direct linear time construction of suffix array - Kim, Sim, Park, Park (CPM03) - Kärkkäinen & Sanders (ICALP03) - Ko & Aluru (CPM03)
Kärkkäinen-Sanders algorithm
• Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.
• Construct the suffix array of the remaining suffixes using the result of the first step.
• Merge the two suffix arrays into one.
Notation
• string T = T[0,n) = t0t1 … tn-1
• suffix Si = T[i,0) = titi+1 … tn-1
• for C  [0,n]: SC = {Si | i in C}
• suffix array SA[0,n] of T is a permutation of [0,n] satisfying SSA[0] < SSA[1] < … < SSA[n]

T[SA[0],n)

Running example

0 1 2 3 4 5 6 7 8 9 10 11

• T[0,n) = y a b b a d a b b a d o 0 0
• SA = (12,1,6,4,9,3,8,2,7,5,10,11,0)
Step 0: Construct a sample
• for k = 0,1,2 Bk = {i є [0,n] | i mod 3 = k}
• C = B1 U B2 sample positions
• SC sample suffixes
• Example: B1 = {1,4,7,10}, B2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}
Step 1: Sort sample suffixes
• for k = 1,2, construct

Rk = [tktk+1tk+2] [tk+3tk+4tk+5]… [tmaxBktmaxBk+1tmaxBk+2]

R = R1 º R2 (concatenation of R1 and R2)

Suffixes of R correspond to SC:

suffix [titi+1ti+2]… corresponds to Si ;

The correspondence is order preserving:

Let Ri’Si andRj’Sj.Then Ri’<Rj’ iff Si < Sj

Sort the suffixes of R

Radix sort the characters and rename with ranks to obtain R´.

Example:R1 R2

1 2 3 4 5 6 7

R´ = (1,2,4,6,4,5,3,7)

If all characters are different, their order directly gives the order of suffixes.

Otherwise, sort the suffixes of R´ using Kärkkäinen-Sanders.

Note: |R´| = 2n/3.

Step 1 (cont.)
• Once the sample suffixes are sorted, assign a rank to each: rank(Si) = the rank of Si in SC; rank(Sn+1) = rank(Sn+2) = 0
• Example: R´ = (1,2,4,6,4,5,3,7) 0: ε3: 37 6: 537

1:12464537 4: 4537 7: 64537

2:24645,7 5: 464537 8: 7

SAR´ = (8,0,1,6,4,2,5,3,7) (The suffix array for R’)

SAR´-1 = (1 2 5 74 6 3 8)

rank(Si) (– 1 4– 2 6– 5 3– 7 8–0 0 )

Step 2: Sort nonsample suffixes
• for each non-sample Siє SB0 (note that rank(Si+1) is always defined for i є B0): Si ≤ Sj ↔ (ti,rank(Si+1)) ≤ (tj,rank(Sj+1))
• radix sort the pairs (ti,rank(Si+1)).
• Example: S12 < S6 < S9 < S3 < S0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
יש לפרט יותר

Example: S12 < S6 < S9 < S3 < S0 because

S12=0 = 0eps = (0,0)

(0,0) < (a,5) < (a,7) < (b,2) < (y,1)

Step 3: Merge
• merge the two sorted sets of suffixes using a standard comparison-based merging:
• to compare Siє SC with Sjє SB0, distinguish two cases:
• i є B1: Si ≤ Sj ↔ (ti,rank(Si+1)) ≤ (tj,rank(Sj+1))
• i є B2: Si ≤ Sj ↔ (ti,ti+1,rank(Si+2)) ≤ (tj,tj+1,rank(Sj+2))
• note that the ranks are defined in all cases!
• S1 < S6 as (a,4) < (a,5) and S3 < S8 as (b,a,6) < (b,a,7)

 B1

 B2

Running time O(n)
• excluding the recursive call, everything can be done in linear time
• the recursion is on a string of length 2n/3
• thus the time is given by recurrence T(n) = T(2n/3) + O(n)
• hence T(n) = O(n)
Implementation
• about 50 lines of C++
LCP table
• Longest Common Prefix of successive elements of suffix array:
• LCP[i] = length of the longest common prefix of suffixes SSA[i] and SSA[i+1]
• Algorithm:
• Enter the suffixes in a trie
• Find the lca.
• Complexity = O(n2)
Kasai et al, CPM2001

Key observation:

Let LCP[q]=h>1, i.e.,

S SA[q] = titi+1…ai+h-1ti+h

S SA[q+1]= tktk+1…tk+h-1tk+h

= titi+1…ti+h-1ti+h (tk+h≠ti+h)

• Then ti+1…ti+h-1=tk+1…tk+h-1,.
• Define p SSA[p] =ti+1…ti+h-1…therefore SSA[p+1]=ti+1…ti+h-1 …
• i.e., LCP[p] ≥ h-1
• When computing LCP[p] we can start the comparisons at position p+h-1.
The algorithm

for(i=0; i<n; i++) /* compute SA-1 */

SA-1[SA[i]] = i;

h = 0;

for(p=0; p<n; p++) {

if(SA-1[p] > 0){

r = SA[SA-1 [p]+1] ; while(T[r+h] = T[p+h])

h++;

LCP[SA-1 [p]] = h;

if(h > 0)

h--;

}

}

innermost statement

Complexity:

Since h is decreased at most n times,

and h ≤ n,

h can be increased at most 2n times;

i.e., the innermost statement isexecuted ≤ 2n times.

Total time = O(n).

SSA[0]

Suffix tree vs suffix array
• suffix tree  suffix array + LCP table

First step

SSA[0]

S SA[i]

SSA[i-1]

• Step i

Which edge to split?

Complexity:

The final trie has 2n vertices.

Each edge is traversed ≤ twice.

Time = O(n).