Presentation for vlsi application of suffix trees
Download
1 / 20

Presentation For VLSI Application of Suffix Trees - PowerPoint PPT Presentation


  • 167 Views
  • Uploaded on

Presentation For VLSI Application of Suffix Trees. Chunkai Yin Apr 2002. Topics. Approximate palindromes and repeats Multiple common substring problem. Approximate palindromes and repeats. Definitions Find all tandem repeats (naive method) Find all tandem repeats (Landau-Schmidt)

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Presentation For VLSI Application of Suffix Trees' - salene


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Presentation for vlsi application of suffix trees

Presentation For VLSIApplication of Suffix Trees

Chunkai Yin

Apr 2002


Topics
Topics

  • Approximate palindromes and repeats

  • Multiple common substring problem


Approximate palindromes and repeats
Approximate palindromes and repeats

  • Definitions

  • Find all tandem repeats (naive method)

  • Find all tandem repeats (Landau-Schmidt)

  • Find all k-mismatch tandem repeats (L-S)


Definitions
Definitions

  • Tandem repeat

    string written as AA, where A is a substring

    abcabc

    Specified by starting position and length of A

  • K-mismatch palindrome/tandem

    a substring that becomes a palindrome/tandem after k or fewer characters are changed

    zzcbbcca 2-mismatch palindrome pzcb + bcca

    axabaybb 2-mismatch tamdem axab + aybb


Find all tandem repeats naive method
Find all tandem repeats(naive method)

  • Guess a start position i , middle position j for the tandem

  • Do a longest common extension query from i and j. If the extension from i reaches j, the tandem exists starting at position i.

  • Time complexity O(n2)

  • EX1

    fababd


Problem find all tandem repeats
Problem: Find all tandem repeats

The Landau-Schmidt method divide the problem

into four subproblems

(Recursive, divide-and-conquer)

  • Find all tandem repeats contained entirely in the first half of S (up to h = n/2)

  • Find all tandem repeats contained entirely in the second half of S (after h )

  • Find all tandem repeats where the first copy contains position h of S.

  • Find all tandem repeats where the second copy contains position h of S.


How to handle this
How to handle this ?

  • Subproblem A, B will be solved by recursively applying the method

  • Subproblem C and D are symmetric, just consider subproblem C.

    Therefore, the core problem is subproblem C.


Solve the subproblem 3
Solve the subproblem 3

  • Initialization:

    h= n/2; q=h+L; L is a fixed number each time

  • Compute the longest common extension (from left to right) from h and q.

    L1 is the length of extension.

  • Compute the longest common extension (from right to left) from h-1 and q-1. L2


Solve the subproblem 31
Solve the subproblem 3

  • If and only if L1+L2>=L, a tandem repeat exists whose first copy contains position h with the length 2L, and it can begin at any position between h-L2 and h+L1-L inclusive. The second copy begins L places to the right. Output the starting point.

  • EX2


L 1 l 2 and range for starting point
L1, L2 and range for starting point

L

q

A

B

h

Δ1=L-L2, L= Δ1 +L2

A=h-L2

Δ2=L-L1, L= Δ2 +L1

B=h- Δ2=h+L1-L

Δ2

L1

Δ2

L1

L2

Δ1

L2

Δ1


An example
An Example

….b d c a b b d c a b b t t t….

h q

L=5 L1=3 L2=3

A:1 B:2

Two forms available:

bdcab + bdcab

dcabb + dcabb


Time complexity
Time Complexity

  • Time used for output O(Z)

    Z is the total number of tandem repeats in S

  • Time used for the extension queries

    which is proportional to the number of extension queries ext.

    T(n)=T(n/2) + T(n/2)+ n + n n=n/2 (->) +n/2(<-)

    A B C D

    T(n)= O(nlogn), total=O(nlogn)+Z


Find all k mismatch tandem repeats l s
Find all k-mismatch tandem repeats (L-S)

  • Immediate extension of the O(nlogn+Z) method for finding exact tandem repeats.

  • Run k successive longest common extension queries forward from h and q and run k successive longest common extension queries backward from h-1 and q-1.

  • Then try to find t, a midpoint of the tandem repeat.

    In the interval between h and q, the number of mismatches from h to t (found during the forward extension) plus the the number of mismatches from t+1 to q-1(found during the backward ext) is <=k.


Time complexity1
Time complexity

O(knlogn +Z)


Multiple common substring prob
Multiple common substring prob.

  • Problem:

    What substring are common to a large number of distinct strings?

  • Definitions

    T: a generalized suffix tree for K strings

    **The identical suffixes appearing in more

    than one string end at distinct leaves. Each

    leaf has one of K unique string identifier.


  • C(v): the number of distinct leaf string identifiers in the subtree of v

  • S(v): the total number of leaves in the subtree of v

  • U(v): how many duplicate suffixes from the same string occur in v’s subtree.

  • C(v)=S(v)-U(v)

  • L(k): the length of the longest substring common to at least k of the strings.

  • U(v)=i:ni(v)>0(ni(v)-1)

  • ni(v): the number of leaves with identifier i

    in the subtree rooted at v


The key work we need do
The key work we need do subtree of v

  • Our object: get C(v)

  • What is the algorithm?

  • Define Li: the list of leaves with identifier i, in increasing order of their DFS numbers


Algorithm for the problem
Algorithm for the problem subtree of v

  • Build a generalized suffix tree T for the K strings.

  • Number the leaves of T as they are encountered in a depth-first traversal of T

  • For each string identifier i, extract the ordered list Li, of leaves with identifier i.

  • For each identifier i, compute the lca of each consecutive pair of leaves in Li, and increment by one each time that w is the computed lca.


Algorithm cont
Algorithm Cont. subtree of v

Lemma:

If we compute the lca for each consecutive pair of leaves in Li, then for any node v, exactly ni(v)-1 of the computed lcas will lie in the substring of v.

  • With a bottom-up traversal of T, compute, for each node v, S(v), and U(v)=i:ni(v)>0(ni(v)-1)= [h(w):w is in the subtree of v]

  • Set C(v)=S(v)-U(v) for each v

  • Accumulate the table of L(k) values.

  • Time complexity: O(n)


End of the presentation
End of the presentation subtree of v

Thank you

Thank you

Thank you


ad