Presentation For VLSI Application of Suffix Trees

Chunkai Yin

Apr 2002


Topics

  • Approximate palindromes and repeats

  • Multiple common substring problem


Approximate palindromes and repeats

  • Definitions

  • Find all tandem repeats (naive method)

  • Find all tandem repeats (Landau-Schmidt)

  • Find all k-mismatch tandem repeats (L-S)


Definitions

  • Tandem repeat

    A string that can be written as AA, where A is a nonempty substring

    Example: abcabc (A = abc)

    Specified by its starting position and the length of A

  • k-mismatch palindrome/tandem repeat

    A substring that becomes a palindrome/tandem repeat after k or fewer characters are changed

    zzcbbcca: 2-mismatch palindrome (zzcb + bcca)

    axabaybb: 2-mismatch tandem repeat (axab + aybb)
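
As a quick check of these definitions, the sketch below counts how many character changes a string needs to become an exact tandem repeat or an exact palindrome. The function names are illustrative and do not come from the slides, and the tandem check assumes an even-length input.

```python
def tandem_mismatches(s):
    """Number of positions where the two halves of s differ (len(s) assumed even)."""
    half = len(s) // 2
    return sum(a != b for a, b in zip(s[:half], s[half:]))


def palindrome_mismatches(s):
    """Number of mirrored pairs (s[i], s[-1-i]) that differ."""
    return sum(s[i] != s[-1 - i] for i in range(len(s) // 2))


if __name__ == "__main__":
    print(tandem_mismatches("abcabc"))        # exact tandem repeat
    print(tandem_mismatches("axabaybb"))      # axab vs aybb
    print(palindrome_mismatches("zzcbbcca"))  # zzcb vs reversed bcca
```

A string is a k-mismatch tandem repeat (or palindrome) when the corresponding count is at most k.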


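Find all tandem repeats (naive method)

  • Guess a start position i and a middle position j (the start of the second copy) for the tandem repeat

  • Do a longest common extension query from i and j. If the extension from i reaches j, i.e. its length is at least j - i, a tandem repeat of length 2(j - i) starts at position i.

  • Time complexity: O(n^2), since there are O(n^2) pairs (i, j) and each extension query takes constant time after suffix-tree preprocessing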

  • EX1

    fababd
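
The naive method can be sketched as follows. This is a minimal illustration, not the presenter's code: it uses 0-based positions, and the helper `lce` compares characters directly, so each query is linear rather than the constant time a suffix tree with lca preprocessing would give.

```python
def lce(s, i, j):
    """Longest common extension: length of the longest common prefix of s[i:] and s[j:] (i < j)."""
    k = 0
    while j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def naive_tandem_repeats(s):
    """Return (start, half_length) for every tandem repeat in s, using 0-based starts."""
    found = []
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            # The extension from i must reach j, i.e. cover the whole first copy.
            if lce(s, i, j) >= j - i:
                found.append((i, j - i))
    return found


if __name__ == "__main__":
    print(naive_tandem_repeats("fababd"))  # [(1, 2)], i.e. "ab" + "ab"
```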


Problem: Find all tandem repeats

The Landau-Schmidt method divides the problem into four subproblems (recursive, divide-and-conquer):

  • A: Find all tandem repeats contained entirely in the first half of S (up to h = n/2)

  • B: Find all tandem repeats contained entirely in the second half of S (after h)

  • C: Find all tandem repeats where the first copy contains position h of S.

  • D: Find all tandem repeats where the second copy contains position h of S.


How to handle this ?

  • Subproblems A and B are solved by recursively applying the method.

  • Subproblems C and D are symmetric, so we consider only subproblem C.

    Therefore, the core problem is subproblem C.


Solving subproblem C (the third subproblem)

  • Initialization:

    h = n/2; q = h + L, where L (the length of one copy) is fixed in each iteration and takes each value from 1 to h in turn

  • Compute the longest common extension (from left to right) from h and q.

    L1 is the length of this extension.

  • Compute the longest common extension (from right to left) from h-1 and q-1.

    L2 is the length of this extension.


Solving subproblem C (cont.)

  • If L1+L2 >= L, tandem repeats of length 2L begin at every position from h-L2 to h+L1-L inclusive, and in each of them the second copy begins L places to the right. The ones whose first copy contains position h are those that start between max(h-L2, h-L+1) and min(h+L1-L, h); such a repeat exists if and only if L1+L2 >= L and L1 > 0. Output the starting points.

  • EX2


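L1, L2 and range for starting point

[Figure: positions A, B, h and q marked on S, with the extension lengths L1 and L2 and the gaps Δ1 and Δ2; L is the distance from h to q.]

  • Δ1 = L-L2, so L = Δ1+L2

  • Leftmost starting point: A = h-L2

  • Δ2 = L-L1, so L = Δ2+L1

  • Rightmost starting point: B = h-Δ2 = h+L1-L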

An Example

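…. b d c a b b d c a b b t t t ….

Counting from the first b shown, h is the first a (position 4) and q is the second a (position 9)

L=5, L1=3, L2=3

A = h-L2 = 1, B = h+L1-L = 2

Two tandem repeats are found:

bdcab + bdcab (starting at position 1)

dcabb + dcabb (starting at position 2)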
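
The steps for subproblem C can be sketched for a single value of L as below. This is an illustration under stated assumptions rather than the presenter's code: the function name is hypothetical, positions are 1-based as on the slides, the extensions are computed by direct comparison instead of suffix-tree queries, and the range of starts is clamped to [h-L+1, h] so that the first copy really contains position h.

```python
def subproblem_c_starts(s, h, L):
    """Return (L1, L2, starts) for 1-based position h and copy length L.

    starts lists the 1-based starting points of tandem repeats of length 2L
    whose first copy contains position h.
    """
    n = len(s)
    q = h + L
    # Forward extension from h and q.
    L1 = 0
    while q + L1 <= n and s[h + L1 - 1] == s[q + L1 - 1]:
        L1 += 1
    # Backward extension from h-1 and q-1.
    L2 = 0
    while h - 1 - L2 >= 1 and s[h - 2 - L2] == s[q - 2 - L2]:
        L2 += 1
    if L1 + L2 < L:
        return L1, L2, []
    first = max(h - L2, h - L + 1)
    last = min(h + L1 - L, h)
    return L1, L2, list(range(first, last + 1))


if __name__ == "__main__":
    # The example string, taken on its own (nothing before the first b).
    print(subproblem_c_starts("bdcabbdcabbttt", h=4, L=5))  # (3, 3, [1, 2])
```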


Time Complexity

  • Time used for output: O(Z)

    Z is the total number of tandem repeats in S

  • Time used for the extension queries

    is proportional to the number of extension queries.

    T(n) = T(n/2) + T(n/2) + n + n, with one term for each of subproblems A, B, C and D

    Each n counts n/2 forward (->) plus n/2 backward (<-) extension queries

    T(n) = O(n log n), total = O(n log n + Z)
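
The whole divide-and-conquer scheme can be sketched as below. This is a minimal illustration, not the presenter's code, and it makes several assumptions: indices are 0-based over half-open ranges [lo, hi), the split point h is the first index of the right half, and the helper names (`lce_fwd`, `lce_bwd`, `crossing`, `tandem_repeats`) are invented here. The extensions are computed by direct comparison, so the sketch shows the structure of the recursion but does not achieve the O(n log n + Z) bound, which requires constant-time extension queries.

```python
def lce_fwd(s, i, j, hi):
    """Matches s[i+k] == s[j+k] going right while j+k stays below hi (i < j)."""
    k = 0
    while j + k < hi and s[i + k] == s[j + k]:
        k += 1
    return k


def lce_bwd(s, i, j, lo):
    """Matches s[i-k] == s[j-k] going left while i-k stays at or above lo (i < j)."""
    k = 0
    while i - k >= lo and s[i - k] == s[j - k]:
        k += 1
    return k


def crossing(s, lo, h, hi, out):
    """Subproblems C and D: tandem repeats inside [lo, hi) that contain both h-1 and h."""
    for L in range(1, (hi - lo) // 2 + 1):
        # (anchor, smallest start, largest start): index h lies in the first
        # copy (C) for the first tuple and in the second copy (D) for the second.
        for p0, s_min, s_max in ((h, h - L + 1, h - 1), (h - L, h - 2 * L + 1, h - L)):
            if p0 < lo or p0 + L >= hi:
                continue
            L1 = lce_fwd(s, p0, p0 + L, hi)
            L2 = lce_bwd(s, p0 - 1, p0 + L - 1, lo)
            first = max(p0 - L2, s_min)
            last = min(p0 + L1 - L, s_max)
            for start in range(first, last + 1):
                out.append((start, L))


def tandem_repeats(s, lo=0, hi=None, out=None):
    """Return (start, half_length) for every tandem repeat in s[lo:hi]."""
    if hi is None:
        hi = len(s)
    if out is None:
        out = []
    if hi - lo < 2:
        return out
    h = (lo + hi) // 2
    tandem_repeats(s, lo, h, out)   # subproblem A
    tandem_repeats(s, h, hi, out)   # subproblem B
    crossing(s, lo, h, hi, out)     # subproblems C and D
    return out


if __name__ == "__main__":
    print(sorted(tandem_repeats("fababd")))
```

Handling C and D with the same loop body reflects the symmetry noted on the slides: only the anchor position and the allowed range of starts differ.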


Find all k-mismatch tandem repeats (L-S)

  • Immediate extension of the O(n log n + Z) method for finding exact tandem repeats.

  • Run k successive longest common extension queries forward from h and q, and run k successive longest common extension queries backward from h-1 and q-1. Each query after the first resumes just past the mismatch that stopped the previous one.

  • Then try to find t, a midpoint of the tandem repeat.

    A position t in the interval between h and q qualifies if the number of mismatches from h to t (found during the forward extension) plus the number of mismatches from t+1 to q-1 (found during the backward extension) is <= k.
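
The "k successive longest common extension queries" step can be sketched as follows. This is a hedged illustration with hypothetical names (`lce`, `forward_mismatches`), 0-based indices, and a direct-comparison `lce` standing in for a constant-time suffix-tree query; it covers only the forward direction, and the backward direction is symmetric.

```python
def lce(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:] (i < j)."""
    k = 0
    while j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def forward_mismatches(s, i, j, k):
    """Offsets of the first (at most k) mismatches between s[i:] and s[j:].

    Each loop iteration is one extension query; after a mismatch is recorded,
    the next query resumes one character past it.
    """
    offsets = []
    off = 0
    for _ in range(k):
        off += lce(s, i + off, j + off)
        if j + off >= len(s):
            break  # ran off the end of s without another mismatch
        offsets.append(off)
        off += 1
    return offsets


if __name__ == "__main__":
    # axab vs aybb: mismatches at offsets 1 (x/y) and 2 (a/b).
    print(forward_mismatches("axabaybb", 0, 4, 2))  # [1, 2]
```

With the mismatch offsets from both directions in hand, counting how many fall on each side of a candidate midpoint t gives the test described above.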


Time complexity

O(kn log n + Z)


Multiple common substring problem

  • Problem:

    Which substrings are common to a large number of distinct strings?

  • Definitions

    T: a generalized suffix tree for the K strings

    Identical suffixes appearing in more than one string end at distinct leaves. Each leaf carries one of the K unique string identifiers.


  • C(v): the number of distinct leaf string identifiers in the subtree of v

  • S(v): the total number of leaves in the subtree of v

  • U(v): the number of duplicate suffixes from the same string in v’s subtree, i.e. for each identifier, the leaves beyond the first one.

  • C(v)=S(v)-U(v)

  • L(k): the length of the longest substring common to at least k of the strings.

  • U(v)=i:ni(v)>0(ni(v)-1)

  • ni(v): the number of leaves with identifier i

    in the subtree rooted at v


The key task

  • Our objective: compute C(v) for every node v

  • What is the algorithm?

  • Define Li: the list of leaves with identifier i, in increasing order of their DFS numbers


Algorithm for the problem

  • Build a generalized suffix tree T for the K strings.

  • Number the leaves of T as they are encountered in a depth-first traversal of T

  • For each string identifier i, extract the ordered list Li, of leaves with identifier i.

  • For each identifier i, compute the lca of each consecutive pair of leaves in Li, and increment h(w) by one each time that w is the computed lca (h(w) starts at zero for every node w).


Algorithm (cont.)

Lemma:

If we compute the lca for each consecutive pair of leaves in Li, then for any node v with ni(v) > 0, exactly ni(v)-1 of the computed lcas will lie in the subtree of v.

  • With a bottom-up traversal of T, compute, for each node v, S(v) and U(v) = Σ_{i : ni(v) > 0} (ni(v) - 1) = Σ [h(w) : w is in the subtree of v]

  • Set C(v) = S(v) - U(v) for each v

  • Accumulate the table of L(k) values.

  • Time complexity: O(n), where n is the total length of the K strings
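
The counting scheme above can be sketched end to end as follows. This is an illustration under loudly stated assumptions, not the presenter's implementation: instead of a linear-time generalized suffix tree it builds an uncompacted suffix trie (quadratic in the input size), the lca is found by walking parent pointers rather than by a constant-time lca structure, and the function name `common_substring_table` is invented here. Only the h(w), S(v), U(v), C(v) and L(k) steps mirror the slides.

```python
from collections import defaultdict


def common_substring_table(strings):
    """Return {k: L(k)}, the length of the longest substring common to at least k strings."""
    children, parent, depth, leaf_id = [{}], [-1], [0], [None]

    def add_node(par, key, d, sid):
        children.append({}); parent.append(par); depth.append(d); leaf_id.append(sid)
        children[par][key] = len(parent) - 1
        return len(parent) - 1

    # Generalized suffix trie: every suffix of every string ends at its own leaf.
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            v = 0
            for ch in s[start:]:
                v = children[v][ch] if ch in children[v] else add_node(v, ch, depth[v] + 1, None)
            add_node(v, (sid, start), depth[v], sid)  # tuple key acts as a unique terminator

    # Li: leaves with identifier i, in the order a depth-first traversal meets them.
    leaves_of = defaultdict(list)
    stack = [0]
    while stack:
        v = stack.pop()
        if leaf_id[v] is not None:
            leaves_of[leaf_id[v]].append(v)
        stack.extend(children[v].values())

    def lca(a, b):
        # Nodes are created after their parents, so the larger index is never
        # the ancestor; lifting it cannot overshoot the lca.
        while a != b:
            if a > b:
                a = parent[a]
            else:
                b = parent[b]
        return a

    h = [0] * len(parent)
    for leaves in leaves_of.values():
        for a, b in zip(leaves, leaves[1:]):
            h[lca(a, b)] += 1

    # Bottom-up accumulation: children always have larger indices than parents.
    S = [0 if sid is None else 1 for sid in leaf_id]
    U = h[:]
    for v in range(len(parent) - 1, 0, -1):
        S[parent[v]] += S[v]
        U[parent[v]] += U[v]

    K = len(strings)
    best = [0] * (K + 2)
    for v in range(len(parent)):
        if leaf_id[v] is None:
            c = S[v] - U[v]  # C(v)
            best[c] = max(best[c], depth[v])
    for k in range(K - 1, 0, -1):
        best[k] = max(best[k], best[k + 1])
    return {k: best[k] for k in range(1, K + 1)}


if __name__ == "__main__":
    print(common_substring_table(["sandollar", "sandlot", "handler", "grand", "pantry"]))
```

For these five words the table should report 4 for k = 2 ("sand"), 3 for k = 3 and k = 4 ("and"), and 2 for k = 5 ("an").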


End of the presentation

Thank you
