Presentation For VLSI Application of Suffix Trees

Presentation For VLSIApplication of Suffix Trees Chunkai Yin Apr 2002

Topics • Approximate palindromes and repeats • Multiple common substring problem

Approximate palindromes and repeats • Definitions • Find all tandem repeats (naive method) • Find all tandem repeats (Landau-Schmidt) • Find all k-mismatch tandem repeats (L-S)

Definitions • Tandem repeat string written as AA, where A is a substring abcabc Specified by starting position and length of A • K-mismatch palindrome/tandem a substring that becomes a palindrome/tandem after k or fewer characters are changed zzcbbcca 2-mismatch palindrome pzcb + bcca axabaybb 2-mismatch tamdem axab + aybb

Find all tandem repeats(naive method) • Guess a start position i , middle position j for the tandem • Do a longest common extension query from i and j. If the extension from i reaches j, the tandem exists starting at position i. • Time complexity O(n2) • EX1 fababd

Problem: Find all tandem repeats The Landau-Schmidt method divide the problem into four subproblems (Recursive, divide-and-conquer) • Find all tandem repeats contained entirely in the first half of S (up to h = n/2) • Find all tandem repeats contained entirely in the second half of S (after h ) • Find all tandem repeats where the first copy contains position h of S. • Find all tandem repeats where the second copy contains position h of S.

How to handle this ? • Subproblem A, B will be solved by recursively applying the method • Subproblem C and D are symmetric, just consider subproblem C. Therefore, the core problem is subproblem C.

Solve the subproblem 3 • Initialization: h= n/2; q=h+L; L is a fixed number each time • Compute the longest common extension (from left to right) from h and q. L1 is the length of extension. • Compute the longest common extension (from right to left) from h-1 and q-1. L2

Solve the subproblem 3 • If and only if L1+L2>=L, a tandem repeat exists whose first copy contains position h with the length 2L, and it can begin at any position between h-L2 and h+L1-L inclusive. The second copy begins L places to the right. Output the starting point. • EX2

L1, L2 and range for starting point L q A B h Δ1=L-L2, L= Δ1 +L2 A=h-L2 Δ2=L-L1, L= Δ2 +L1 B=h- Δ2=h+L1-L Δ2 L1 Δ2 L1 L2 Δ1 L2 Δ1

An Example ….b d c a b b d c a b b t t t…. h q L=5 L1=3 L2=3 A:1 B:2 Two forms available: bdcab + bdcab dcabb + dcabb

Time Complexity • Time used for output O(Z) Z is the total number of tandem repeats in S • Time used for the extension queries which is proportional to the number of extension queries ext. T(n)=T(n/2) + T(n/2)+ n + n n=n/2 (->) +n/2(<-) A B C D T(n)= O(nlogn), total=O(nlogn)+Z

Find all k-mismatch tandem repeats (L-S) • Immediate extension of the O(nlogn+Z) method for finding exact tandem repeats. • Run k successive longest common extension queries forward from h and q and run k successive longest common extension queries backward from h-1 and q-1. • Then try to find t, a midpoint of the tandem repeat. In the interval between h and q, the number of mismatches from h to t (found during the forward extension) plus the the number of mismatches from t+1 to q-1(found during the backward ext) is <=k.

Time complexity O(knlogn +Z)

Multiple common substring prob. • Problem: What substring are common to a large number of distinct strings? • Definitions T: a generalized suffix tree for K strings **The identical suffixes appearing in more than one string end at distinct leaves. Each leaf has one of K unique string identifier.

C(v): the number of distinct leaf string identifiers in the subtree of v • S(v): the total number of leaves in the subtree of v • U(v): how many duplicate suffixes from the same string occur in v’s subtree. • C(v)=S(v)-U(v) • L(k): the length of the longest substring common to at least k of the strings. • U(v)=i:ni(v)>0(ni(v)-1) • ni(v): the number of leaves with identifier i in the subtree rooted at v

The key work we need do • Our object: get C(v) • What is the algorithm? • Define Li: the list of leaves with identifier i, in increasing order of their DFS numbers

Algorithm for the problem • Build a generalized suffix tree T for the K strings. • Number the leaves of T as they are encountered in a depth-first traversal of T • For each string identifier i, extract the ordered list Li, of leaves with identifier i. • For each identifier i, compute the lca of each consecutive pair of leaves in Li, and increment by one each time that w is the computed lca.

Algorithm Cont. Lemma: If we compute the lca for each consecutive pair of leaves in Li, then for any node v, exactly ni(v)-1 of the computed lcas will lie in the substring of v. • With a bottom-up traversal of T, compute, for each node v, S(v), and U(v)=i:ni(v)>0(ni(v)-1)= [h(w):w is in the subtree of v] • Set C(v)=S(v)-U(v) for each v • Accumulate the table of L(k) values. • Time complexity: O(n)

End of the presentation Thank you Thank you Thank you

Presentation For VLSI Application of Suffix Trees

Presentation For VLSI Application of Suffix Trees

Presentation Transcript

Suffix trees and suffix arrays

Selected Applications of Suffix Trees

Suffix Trees

Applications of Suffix Trees

Applications of Suffix Trees

Suffix Trees

Suffix Trees and Suffix Arrays

Suffix Trees, Suffix Arrays and Suffix Trays

Suffix trees

Augmenting Suffix Trees, with Applications

Suffix Trees and Suffix Arrays

Suffix Trees

Suffix Trees

Suffix Trees

Compressed Suffix Arrays and Suffix Trees

SUFFIX TREES

Suffix Trees

Suffix Trees and Suffix Arrays

Probabilistic Suffix Trees

Suffix Trees and Derived Applications

Suffix Trees

Suffix Trees