Presentation For VLSI
Application of Suffix Trees
Chunkai Yin, Apr 2002

Topics
Approximate palindromes and repeats
Multiple common substring problem

Approximate palindromes and repeats
Definitions
Find all tandem repeats (naive method)
Find all tandem repeats (Landau-Schmidt)
Tandem repeat: a string that can be written as AA, where A is a substring.
It is specified by its starting position and the length of A.
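The naive method can be sketched as follows: try every starting position and every length of A, and compare the two halves directly. This is a minimal illustration, not the presenter's code; the function name `naive_tandem_repeats` and the 0-based positions are assumptions made here.

```python
def naive_tandem_repeats(s):
    """Return (start, len_A) for every tandem repeat AA in s (0-based start).

    Brute force: O(n^2) candidate pairs, each compared in O(n) time.
    """
    n = len(s)
    found = []
    for start in range(n):
        # The repeat AA must fit in s, so len(A) <= (n - start) // 2.
        for half in range(1, (n - start) // 2 + 1):
            if s[start:start + half] == s[start + half:start + 2 * half]:
                found.append((start, half))
    return found


if __name__ == "__main__":
    print(naive_tandem_repeats("abab"))
    print(naive_tandem_repeats("aaa"))
```

Each reported pair matches the specification above: a starting position plus the length of A.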
k-mismatch palindrome/tandem repeat: a substring that becomes a palindrome/tandem repeat after k or fewer characters are changed
zzcbbcca: 2-mismatch palindrome, zzcb + bcca
axabaybb: 2-mismatch tandem repeat, axab + aybb
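The two examples can be checked by counting mismatches directly. The sketch below is an illustration only; the function names are hypothetical, and for palindromes it assumes that one changed character repairs one mismatched mirror pair.

```python
def tandem_mismatches(s):
    """Count positions where the two halves of an even-length string differ."""
    if len(s) % 2:
        raise ValueError("a tandem repeat AA has even length")
    half = len(s) // 2
    return sum(a != b for a, b in zip(s[:half], s[half:]))


def palindrome_mismatches(s):
    """Count mirror pairs (s[i], s[-1-i]) that differ."""
    return sum(s[i] != s[-1 - i] for i in range(len(s) // 2))


def is_k_mismatch_tandem(s, k):
    return tandem_mismatches(s) <= k


def is_k_mismatch_palindrome(s, k):
    return palindrome_mismatches(s) <= k


if __name__ == "__main__":
    print(tandem_mismatches("axabaybb"))       # halves axab / aybb
    print(palindrome_mismatches("zzcbbcca"))   # halves zzcb / bcca
```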
The Landau-Schmidt method divides the problem into four subproblems.
Therefore, the core problem is subproblem C.
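h = n/2; q = h + L, where L is fixed in each iteration.
L1 is the length of the forward extension from h and q; L2 is the length of the backward extension from h-1 and q-1.
Δ1 = L - L2, so L = Δ1 + L2
Δ2 = L - L1, so L = Δ2 + L1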
….b d c a b b d c a b b t t t….
L=5 L1=3 L2=3
Two forms available:
bdcab + bdcab
dcabb + dcabb
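The example can be reproduced with a small sketch. It assumes 0-based positions (so h = 3 and q = 8 in the string bdcabbdcabbttt) and that the first copy of the repeat must contain position h. The extensions are computed by direct character comparison here, standing in for the constant-time suffix-tree queries; all function names are hypothetical.

```python
def extend_forward(s, i, j):
    """Length of the longest common extension to the right from i and j."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def extend_backward(s, i, j):
    """Length of the longest common extension to the left from i and j."""
    k = 0
    while i - k >= 0 and j - k >= 0 and s[i - k] == s[j - k]:
        k += 1
    return k


def repeats_containing(s, h, L):
    """Start positions of tandem repeats of length 2L whose first copy contains h."""
    q = h + L
    if q >= len(s):
        return []
    L1 = extend_forward(s, h, q)
    L2 = extend_backward(s, h - 1, q - 1)
    # Every position of the first copy must lie in the matching run
    # [h - L2, h + L1 - 1], and the copy [p, p + L - 1] must contain h.
    lo = max(h - L2, h - L + 1)
    hi = min(h, h + L1 - L)
    return list(range(lo, hi + 1))


if __name__ == "__main__":
    s = "bdcabbdcabbttt"
    for p in repeats_containing(s, 3, 5):
        print(s[p:p + 5], "+", s[p + 5:p + 10])
```

With L = 5 the two extensions both have length 3, and the two reported starts give the two forms listed above.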
Z is the total number of tandem repeats in S
which is proportional to the number of extension queries ext.
T(n) = T(n/2) + T(n/2) + n + n
The four terms correspond to subproblems A, B, C, and D, in that order.
Each n = n/2 (->) + n/2 (<-)
T(n) = O(n log n), total = O(n log n + Z)
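The bound follows by unrolling the recurrence. The derivation below is a sketch that assumes n is a power of two and that T(1) is a constant:

```latex
\begin{align*}
T(n) &= 2\,T(n/2) + 2n \\
     &= 4\,T(n/4) + 2n + 2n \\
     &= 2^{k}\,T(n/2^{k}) + 2kn \\
     &= n\,T(1) + 2n\log_2 n \qquad (k = \log_2 n) \\
     &= O(n \log n).
\end{align*}
```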
In the interval between h and q, the number of mismatches from h to t (found during the forward extension) plus the number of mismatches from t+1 to q-1 (found during the backward extension) is <= k.
What substrings are common to a large number of distinct strings?
T: a generalized suffix tree for K strings
Identical suffixes appearing in more
than one string end at distinct leaves. Each
leaf has one of K unique string identifiers.
in the subtree rooted at v
If we compute the lca for each consecutive pair of leaves in Li, then for any node v, exactly ni(v)-1 of the computed lcas will lie in the subtree of v.
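The counting claim can be checked on a toy tree. The sketch below assumes that Li is the list of leaves carrying identifier i, taken in depth-first order, and it only checks nodes v that have at least one such leaf below them. The tree, the leaf labels, and the function names are made up for illustration, and the lca is computed naively by walking parent pointers rather than with a constant-time lca structure.

```python
# Hypothetical toy tree: internal nodes map to their children, in order.
CHILDREN = {
    "root": ["a", "b"],
    "a": ["l1", "l2", "c"],
    "c": ["l3", "l4"],
    "b": ["l5", "l6"],
}
# String identifier stored at each leaf (K = 2 here).
LEAF_STRING = {"l1": 1, "l2": 2, "l3": 1, "l4": 1, "l5": 2, "l6": 1}
PARENT = {c: p for p, cs in CHILDREN.items() for c in cs}


def path_to_root(v):
    """Nodes from v up to the root, inclusive."""
    path = [v]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca(u, v):
    """Naive lowest common ancestor via parent pointers."""
    seen = set(path_to_root(u))
    return next(x for x in path_to_root(v) if x in seen)


def leaves_dfs(v):
    """Leaves below v in depth-first order."""
    if v not in CHILDREN:
        return [v]
    return [leaf for c in CHILDREN[v] for leaf in leaves_dfs(c)]


def count_check(v, i):
    """Return (n_i(v), number of consecutive-pair lcas of L_i inside v's subtree)."""
    L_i = [x for x in leaves_dfs("root") if LEAF_STRING[x] == i]
    pair_lcas = [lca(x, y) for x, y in zip(L_i, L_i[1:])]
    # w lies in the subtree of v exactly when v is on the path from w to the root.
    inside = sum(1 for w in pair_lcas if v in path_to_root(w))
    n_i = sum(1 for x in leaves_dfs(v) if LEAF_STRING[x] == i)
    return n_i, inside


if __name__ == "__main__":
    for v in ["root", "a", "c", "b"]:
        print(v, count_check(v, 1))
```

For each node printed, the second number is one less than the first, which is the relationship the statement above describes.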