1 / 20

# Presentation For VLSI Application of Suffix Trees - PowerPoint PPT Presentation

Presentation For VLSI Application of Suffix Trees. Chunkai Yin Apr 2002. Topics. Approximate palindromes and repeats Multiple common substring problem. Approximate palindromes and repeats. Definitions Find all tandem repeats (naive method) Find all tandem repeats (Landau-Schmidt)

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about ' Presentation For VLSI Application of Suffix Trees' - salene

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

### Presentation For VLSIApplication of Suffix Trees

Chunkai Yin

Apr 2002

• Approximate palindromes and repeats

• Multiple common substring problem

• Definitions

• Find all tandem repeats (naive method)

• Find all tandem repeats (Landau-Schmidt)

• Find all k-mismatch tandem repeats (L-S)

• Tandem repeat

string written as AA, where A is a substring

abcabc

Specified by starting position and length of A

• K-mismatch palindrome/tandem

a substring that becomes a palindrome/tandem after k or fewer characters are changed

zzcbbcca 2-mismatch palindrome pzcb + bcca

axabaybb 2-mismatch tamdem axab + aybb

Find all tandem repeats(naive method)

• Guess a start position i , middle position j for the tandem

• Do a longest common extension query from i and j. If the extension from i reaches j, the tandem exists starting at position i.

• Time complexity O(n2)

• EX1

fababd

The Landau-Schmidt method divide the problem

into four subproblems

(Recursive, divide-and-conquer)

• Find all tandem repeats contained entirely in the first half of S (up to h = n/2)

• Find all tandem repeats contained entirely in the second half of S (after h )

• Find all tandem repeats where the first copy contains position h of S.

• Find all tandem repeats where the second copy contains position h of S.

• Subproblem A, B will be solved by recursively applying the method

• Subproblem C and D are symmetric, just consider subproblem C.

Therefore, the core problem is subproblem C.

• Initialization:

h= n/2; q=h+L; L is a fixed number each time

• Compute the longest common extension (from left to right) from h and q.

L1 is the length of extension.

• Compute the longest common extension (from right to left) from h-1 and q-1. L2

• If and only if L1+L2>=L, a tandem repeat exists whose first copy contains position h with the length 2L, and it can begin at any position between h-L2 and h+L1-L inclusive. The second copy begins L places to the right. Output the starting point.

• EX2

L1, L2 and range for starting point

L

q

A

B

h

Δ1=L-L2, L= Δ1 +L2

A=h-L2

Δ2=L-L1, L= Δ2 +L1

B=h- Δ2=h+L1-L

Δ2

L1

Δ2

L1

L2

Δ1

L2

Δ1

….b d c a b b d c a b b t t t….

h q

L=5 L1=3 L2=3

A:1 B:2

Two forms available:

bdcab + bdcab

dcabb + dcabb

• Time used for output O(Z)

Z is the total number of tandem repeats in S

• Time used for the extension queries

which is proportional to the number of extension queries ext.

T(n)=T(n/2) + T(n/2)+ n + n n=n/2 (->) +n/2(<-)

A B C D

T(n)= O(nlogn), total=O(nlogn)+Z

• Immediate extension of the O(nlogn+Z) method for finding exact tandem repeats.

• Run k successive longest common extension queries forward from h and q and run k successive longest common extension queries backward from h-1 and q-1.

• Then try to find t, a midpoint of the tandem repeat.

In the interval between h and q, the number of mismatches from h to t (found during the forward extension) plus the the number of mismatches from t+1 to q-1(found during the backward ext) is <=k.

O(knlogn +Z)

• Problem:

What substring are common to a large number of distinct strings?

• Definitions

T: a generalized suffix tree for K strings

**The identical suffixes appearing in more

than one string end at distinct leaves. Each

leaf has one of K unique string identifier.

• C(v): the number of distinct leaf string identifiers in the subtree of v

• S(v): the total number of leaves in the subtree of v

• U(v): how many duplicate suffixes from the same string occur in v’s subtree.

• C(v)=S(v)-U(v)

• L(k): the length of the longest substring common to at least k of the strings.

• U(v)=i:ni(v)>0(ni(v)-1)

• ni(v): the number of leaves with identifier i

in the subtree rooted at v

The key work we need do subtree of v

• Our object: get C(v)

• What is the algorithm?

• Define Li: the list of leaves with identifier i, in increasing order of their DFS numbers

Algorithm for the problem subtree of v

• Build a generalized suffix tree T for the K strings.

• Number the leaves of T as they are encountered in a depth-first traversal of T

• For each string identifier i, extract the ordered list Li, of leaves with identifier i.

• For each identifier i, compute the lca of each consecutive pair of leaves in Li, and increment by one each time that w is the computed lca.

Algorithm Cont. subtree of v

Lemma:

If we compute the lca for each consecutive pair of leaves in Li, then for any node v, exactly ni(v)-1 of the computed lcas will lie in the substring of v.

• With a bottom-up traversal of T, compute, for each node v, S(v), and U(v)=i:ni(v)>0(ni(v)-1)= [h(w):w is in the subtree of v]

• Set C(v)=S(v)-U(v) for each v

• Accumulate the table of L(k) values.

• Time complexity: O(n)

End of the presentation subtree of v

Thank you

Thank you

Thank you