Presentation For VLSI Application of Suffix Trees

Chunkai Yin

Apr 2002

- Approximate palindromes and repeats
- Multiple common substring problem

- Definitions
- Find all tandem repeats (naive method)
- Find all tandem repeats (Landau-Schmidt)
- Find all k-mismatch tandem repeats (L-S)

- Tandem repeat
a string that can be written as AA, where A is a non-empty substring

abcabc (A = abc)

Specified by its starting position and the length of A

- k-mismatch palindrome/tandem repeat
a substring that becomes a palindrome/tandem repeat after k or fewer characters are changed

zzcbbcca is a 2-mismatch palindrome: zzcb + bcca

axabaybb is a 2-mismatch tandem repeat: axab + aybb
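
The definitions above can be checked mechanically. The following is a minimal sketch; the function names are illustrative and not taken from the slides. It counts the character pairs that disagree, which is the smallest number of changes needed to obtain an exact tandem repeat or palindrome.

```python
def is_tandem_repeat(w):
    """True if w can be written as AA for some non-empty A."""
    half, odd = divmod(len(w), 2)
    return len(w) > 0 and not odd and w[:half] == w[half:]


def tandem_mismatches(w):
    """Number of positions where the two halves of an even-length w differ."""
    half = len(w) // 2
    return sum(a != b for a, b in zip(w[:half], w[half:]))


def palindrome_mismatches(w):
    """Number of mirrored character pairs of w that differ."""
    return sum(w[i] != w[-1 - i] for i in range(len(w) // 2))


print(is_tandem_repeat("abcabc"))         # True
print(tandem_mismatches("axabaybb"))      # 2
print(palindrome_mismatches("zzcbbcca"))  # 2
```

A string w is a k-mismatch tandem repeat (or palindrome) exactly when the corresponding count is at most k.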

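- Guess a start position i and a middle position j for the tandem repeat.
- Do a longest common extension query from i and j. If the extension from i reaches j, a tandem repeat of length 2(j-i) starts at position i.
- Time complexity: O(n^2), one constant-time extension query per pair (i, j)
- EX1
fababd (the tandem repeat abab starts at position 2)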
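
The naive method can be sketched as follows. This is an illustration under stated assumptions: positions are 0-based here, and the extension query is done by direct character comparison, so this version does not reach the O(n^2) bound, which relies on constant-time extension queries.

```python
def lce(s, i, j):
    """Longest common extension: length of the longest common prefix of s[i:] and s[j:]."""
    n = len(s)
    length = 0
    while i + length < n and j + length < n and s[i + length] == s[j + length]:
        length += 1
    return length


def naive_tandem_repeats(s):
    """Return all (start, length of A) pairs, 0-based, such that s[start:start + 2*len(A)] is AA."""
    found = []
    for i in range(len(s)):              # guessed start position
        for j in range(i + 1, len(s)):   # guessed middle position
            if lce(s, i, j) >= j - i:    # the extension from i reaches j
                found.append((i, j - i))
    return found


print(naive_tandem_repeats("fababd"))  # [(1, 2)], i.e. "abab"
```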

The Landau-Schmidt method divides the problem into four subproblems (recursive, divide-and-conquer)

- A: Find all tandem repeats contained entirely in the first half of S (up to h = n/2)
- B: Find all tandem repeats contained entirely in the second half of S (after h)
- C: Find all tandem repeats where the first copy contains position h of S.
- D: Find all tandem repeats where the second copy contains position h of S.

- Subproblems A and B are solved by applying the method recursively.
- Subproblems C and D are symmetric, so consider only subproblem C.
Therefore, the core problem is subproblem C.

- Initialization:
h = n/2; q = h + L, where L (the length of one copy) is fixed for each pass; the method tries each copy length L in turn

- Compute the longest common extension (from left to right) from h and q.
L1 is the length of this extension.

- Compute the longest common extension (from right to left) from h-1 and q-1.
L2 is the length of this extension.

- If and only if L1+L2>=L and L1>0, a tandem repeat of length 2L exists whose first copy contains position h. It can begin at any position between h-L2 and h+L1-L inclusive, restricted to starts from h-L+1 to h so that the first copy still contains h. The second copy begins L places to the right. Output the starting points.
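- EX2

Figure (diagram of S with the positions A, B, h and q marked and the segments L, L1, L2, Δ1 and Δ2 drawn above the string):

Δ1 = L - L2, so L = Δ1 + L2 and A = h - L2

Δ2 = L - L1, so L = Δ2 + L1 and B = h - Δ2 = h + L1 - L

….b d c a b b d c a b b t t t….

h marks the first a, q marks the second a

L=5 L1=3 L2=3

A:1 B:2 (counting from the first b shown, so h = 4 and q = 9)

Two forms available:

bdcab + bdcab

dcabb + dcabb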
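
One pass of subproblem C, for a fixed h and L, can be sketched as below and checked against EX2. Assumptions in this sketch: positions are 0-based (so h = 3 rather than 4), the extension queries use direct character comparison instead of constant-time queries, and the function names are illustrative.

```python
def lce_forward(s, i, j):
    """Length of the longest match reading left to right from positions i < j."""
    length = 0
    while j + length < len(s) and s[i + length] == s[j + length]:
        length += 1
    return length


def lce_backward(s, i, j):
    """Length of the longest match reading right to left from positions i < j."""
    length = 0
    while i - length >= 0 and s[i - length] == s[j - length]:
        length += 1
    return length


def starts_for_length(s, h, L):
    """0-based starts of tandem repeats of length 2L whose first copy contains h."""
    q = h + L
    if q >= len(s):
        return []
    L1 = lce_forward(s, h, q)
    L2 = lce_backward(s, h - 1, q - 1)
    # Window from the slide, clamped so that the first copy still contains h.
    first = max(h - L2, h - L + 1)
    last = min(h + L1 - L, h)
    return list(range(first, last + 1))


s = "bdcabbdcabbttt"
print(starts_for_length(s, 3, 5))  # [0, 1]: bdcab+bdcab and dcabb+dcabb
```

The two starts 0 and 1 correspond to A:1 and B:2 in the 1-based numbering of the example.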

- Time used for output: O(Z)
Z is the total number of tandem repeats in S

- Time used for the extension queries:
proportional to the number of extension queries, since each query takes constant time.

T(n) = T(n/2) + T(n/2) + n + n, with one term for each of subproblems A, B, C and D; each n counts n/2 forward and n/2 backward extension queries.

T(n) = O(n log n), total = O(n log n + Z)
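
Putting the four subproblems together, the whole recursion might look like the sketch below. Several choices here are assumptions made for brevity rather than the slides' own formulation: positions are 0-based, the split position h itself is excluded from both halves so that every repeat is reported exactly once, subproblem D is solved by running subproblem C on the reversed string, and extension queries use direct comparison, so the O(n log n + Z) bound is not reached.

```python
def lce_forward(s, i, j):
    """Length of the longest match reading left to right from positions i < j."""
    length = 0
    while j + length < len(s) and s[i + length] == s[j + length]:
        length += 1
    return length


def lce_backward(s, i, j):
    """Length of the longest match reading right to left from positions i < j."""
    length = 0
    while i - length >= 0 and s[i - length] == s[j - length]:
        length += 1
    return length


def first_copy_contains(s, h):
    """Subproblem C: all (start, L) whose first copy contains position h."""
    out = []
    for L in range(1, len(s) - h):       # q = h + L must lie inside s
        q = h + L
        L1 = lce_forward(s, h, q)
        L2 = lce_backward(s, h - 1, q - 1)
        for start in range(max(h - L2, h - L + 1), min(h + L1 - L, h) + 1):
            out.append((start, L))
    return out


def tandem_repeats(s, offset=0):
    """All (start, L) pairs such that s[start:start + 2L] is a tandem repeat."""
    n = len(s)
    if n < 2:
        return []
    h = n // 2
    out = tandem_repeats(s[:h], offset)                # A: left of h
    out += tandem_repeats(s[h + 1:], offset + h + 1)   # B: right of h
    out += [(offset + st, L) for st, L in first_copy_contains(s, h)]  # C
    rev = s[::-1]                                      # D: mirror image of C
    out += [(offset + n - st - 2 * L, L)
            for st, L in first_copy_contains(rev, n - 1 - h)]
    return out


print(sorted(tandem_repeats("fababd")))  # [(1, 2)]
```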

- Immediate extension of the O(n log n + Z) method for finding exact tandem repeats.
- Run k successive longest common extension queries forward from h and q, and k successive longest common extension queries backward from h-1 and q-1.
- Then find every t that is a midpoint of a tandem repeat:
in the interval between h and q, the number of mismatches from h to t (found during the forward extension) plus the number of mismatches from t+1 to q-1 (found during the backward extension) must be <= k.

Time complexity: O(kn log n + Z)
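
The midpoint test can be sketched for one pair (h, L) as follows. This is an illustration under assumptions: positions are 0-based, the function name is hypothetical, and the k successive extension queries are replaced by a plain scan that records mismatch positions and stops at mismatch k+1. Here t is the last position of the first copy, so the repeat starts at t-L+1.

```python
def k_mismatch_starts(s, h, L, k):
    """0-based starts of k-mismatch tandem repeats of length 2L whose first copy contains h."""
    n, q = len(s), h + L
    if q >= n:
        return []
    # Forward extension: compare s[p] with s[p+L] for p = h, h+1, ...
    fwd, p = [], h
    while p + L < n:
        if s[p] != s[p + L]:
            if len(fwd) == k:        # this would be mismatch k+1: stop
                break
            fwd.append(p)
        p += 1
    fwd_end = p                      # pairs h .. fwd_end-1 hold at most k mismatches
    # Backward extension: compare s[p] with s[p+L] for p = h-1, h-2, ...
    bwd, p = [], h - 1
    while p >= 0:
        if s[p] != s[p + L]:
            if len(bwd) == k:
                break
            bwd.append(p)
        p -= 1
    bwd_end = p                      # pairs bwd_end+1 .. h-1 hold at most k mismatches
    starts = []
    for t in range(h, q):            # candidate midpoints
        start = t - L + 1
        if t >= fwd_end or start <= bwd_end:
            continue                 # the copy would leave one of the extensions
        mismatches = sum(x <= t for x in fwd) + sum(x >= start for x in bwd)
        if mismatches <= k:
            starts.append(start)
    return starts


print(k_mismatch_starts("axabaybb", 2, 4, 2))  # [0]: axab + aybb
print(k_mismatch_starts("axabaybb", 2, 4, 1))  # []
```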

- Problem:
Which substrings are common to a large number of distinct strings?

- Definitions
T: a generalized suffix tree for the K strings

Identical suffixes appearing in more than one string end at distinct leaves. Each leaf carries one of the K unique string identifiers.

- C(v): the number of distinct leaf string identifiers in the subtree of v
- S(v): the total number of leaves in the subtree of v
- U(v): how many duplicate suffixes from the same string occur in v’s subtree.
- C(v)=S(v)-U(v)
- L(k): the length of the longest substring common to at least k of the strings.
- $U(v) = \sum_{i : n_i(v) > 0} (n_i(v) - 1)$
- ni(v): the number of leaves with identifier i in the subtree rooted at v

- Our objective: compute C(v) for every node v
- What is the algorithm?
- Define Li: the list of leaves with identifier i, in increasing order of their DFS numbers

- Build a generalized suffix tree T for the K strings.
- Number the leaves of T as they are encountered in a depth-first traversal of T
- For each string identifier i, extract the ordered list Li of leaves with identifier i.
- For each identifier i, compute the lca of each consecutive pair of leaves in Li, and increment h(w) by one each time that w is the computed lca.

Lemma:

If we compute the lca for each consecutive pair of leaves in Li, then for any node v, exactly ni(v)-1 of the computed lcas will lie in the subtree of v.

- With a bottom-up traversal of T, compute, for each node v, S(v) and $U(v) = \sum_{i : n_i(v) > 0} (n_i(v) - 1) = \sum_{w \text{ in the subtree of } v} h(w)$
- Set C(v)=S(v)-U(v) for each v
- Accumulate the table of L(k) values.
- Time complexity: O(n)
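
The steps above can be sketched end to end on small inputs. To stay short, this sketch swaps in two simpler pieces, both assumptions rather than the slides' method: an uncompacted generalized suffix trie (quadratic size) instead of a suffix tree, and an lca found by walking parent pointers instead of a constant-time lca query. It therefore illustrates the counting argument, not the O(n) bound. The function name is illustrative.

```python
def common_substring_table(strings):
    """Return [L(1), ..., L(K)] via C(v) = S(v) - U(v) on a generalized suffix trie."""
    K = len(strings)
    children, parent, depth, ident = [{}], [-1], [0], [None]
    # Build the trie; ("$", i) is a unique terminator, so suffixes shared by
    # several strings still end at distinct leaves.
    for i, s in enumerate(strings):
        text = list(s) + [("$", i)]
        for start in range(len(text)):
            v = 0
            for ch in text[start:]:
                if ch not in children[v]:
                    children[v][ch] = len(children)
                    children.append({})
                    parent.append(v)
                    depth.append(depth[v] + 1)
                    ident.append(None)
                v = children[v][ch]
            ident[v] = i                       # leaf labelled with its string
    # Depth-first traversal: collect each list Li in DFS order.
    lists = [[] for _ in range(K)]
    stack = [0]
    while stack:
        v = stack.pop()
        if ident[v] is not None:
            lists[ident[v]].append(v)
        stack.extend(reversed(list(children[v].values())))

    def lca(u, v):
        while u != v:
            if depth[u] >= depth[v]:
                u = parent[u]
            else:
                v = parent[v]
        return u

    # h(w): number of times w is the lca of a consecutive pair in some Li.
    h = [0] * len(children)
    for leaves in lists:
        for a, b in zip(leaves, leaves[1:]):
            h[lca(a, b)] += 1
    # Bottom-up sums; a child always has a larger index than its parent.
    S = [int(x is not None) for x in ident]
    U = h[:]
    for v in range(len(children) - 1, 0, -1):
        S[parent[v]] += S[v]
        U[parent[v]] += U[v]
    C = [s_v - u_v for s_v, u_v in zip(S, U)]
    # L(k): deepest internal node whose subtree covers at least k strings.
    best = [0] * (K + 1)
    for v in range(len(children)):
        if ident[v] is None:
            best[C[v]] = max(best[C[v]], depth[v])
    for k in range(K - 1, 0, -1):
        best[k] = max(best[k], best[k + 1])
    return best[1:]


words = ["sandollar", "sandlot", "handler", "grand", "pantry"]
print(common_substring_table(words))  # [9, 4, 3, 3, 2]
```

In this example L(2) = 4 comes from "sand", L(3) = L(4) = 3 from "and", and L(5) = 2 from "an".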

Thank you