Designing Multiple Simultaneous Seeds for DNA Similarity Search

Yanni Sun, Jeremy Buhler

Washington University in Saint Louis


Outline

  • Problem of multi-seed design

  • Methods

    • Greedy covering algorithm

      • Compute conditional match probabilities

  • Experiments and results

  • Conclusion and future work

WashU. Laboratory for Computational Genomics


Sequence Alignment

  • Functional regions conserved despite DNA mutations over time

  • Conserved region can be aligned with high score

  • Exact solution: dynamic programming (DP); time complexity O(MN) for sequences of lengths M and N

  • Fast but heuristic solution: seeded alignment algorithm
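The O(MN) cost of the exact solution comes from filling one table cell per pair of positions. The following is a minimal sketch of local alignment scoring by dynamic programming (Smith-Waterman style). The function name and the scoring values (match = 1, mismatch = -1, gap = -2) are illustrative assumptions, not taken from the slides.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, in O(len(a) * len(b)) time.

    Scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows are kept in memory because the sketch returns the score alone; recovering the alignment itself would need the full table.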



Seeded Alignment Algorithm

  • BLAST is the most popular tool.

    Step 1: word match

    Step 2: extend the match to find the high-similarity pair

    TAGGACCTAACC

    GACCACCTTTT
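The two steps can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: the word length `w = 4`, the scores, and the `x_drop` cutoff are assumptions chosen for the example, and the function names are hypothetical.

```python
from collections import defaultdict

def word_matches(query, target, w=4):
    """Step 1: yield (i, j) such that query[i:i+w] == target[j:j+w]."""
    index = defaultdict(list)
    for j in range(len(target) - w + 1):
        index[target[j:j + w]].append(j)
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            yield i, j

def extend_ungapped(query, target, i, j, w=4, match=1, mismatch=-1, x_drop=2):
    """Step 2: extend a word match left and right without gaps.

    In each direction, extension stops once the running score falls more
    than x_drop below the best score seen in that direction. Returns the
    score of the best extension found.
    """
    def best_gain(qi, tj, step):
        score = best = 0
        while 0 <= qi < len(query) and 0 <= tj < len(target):
            score += match if query[qi] == target[tj] else mismatch
            best = max(best, score)
            if best - score > x_drop:
                break
            qi += step
            tj += step
        return best

    return w * match + best_gain(i + w, j + w, 1) + best_gain(i - 1, j - 1, -1)

hits = list(word_matches("TAGGACCTAACC", "GACCACCTTTT"))
print(hits)  # [(3, 0), (4, 4)]
```

With the two sequences from the slide and `w = 4`, the shared words are GACC and ACCT.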



Seed and Similarity

  • Example of a similarity and a single seed

    tgcagaaatgcagaggca

    | || |    | ||||

    tacacaggcaccgaggag

    Similarity: 101101000010111100 (1 = match, 0 = mismatch)

    Seed: 11*1, weight = 3, span = 4

    The seed detects (matches) this similarity.
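The example above can be reproduced with a short sketch. The helper names are hypothetical; a seed is written as a string of `1` (position must match) and `*` (don't care), as on the slide.

```python
def similarity_string(a, b):
    """Encode an ungapped alignment as 1 (match) / 0 (mismatch) per column."""
    return "".join("1" if x == y else "0" for x, y in zip(a, b))

def seed_detects(seed, similarity):
    """True if, at some offset, every '1' of the seed lines up with a '1'."""
    span = len(seed)
    return any(
        all(s == "*" or similarity[i + k] == "1" for k, s in enumerate(seed))
        for i in range(len(similarity) - span + 1)
    )

sim = similarity_string("tgcagaaatgcagaggca", "tacacaggcaccgaggag")
print(sim)                        # 101101000010111100
print(seed_detects("11*1", sim))  # True
```

Here the weight is the number of `1` positions in the seed and the span is its total length.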



Seed Choice is Important

[Figure: two runs of 1s (11 and 16 positions), with the labels "Significant alignment" and "Seed match"]



Seed Design: Previous Work

  • Traditional seed: word (e.g. 11111111111)

  • Discontiguous patterns of matching bases: [CR1993]; [MTL’02] {111010010100110111}

  • Our work on single discontiguous seed: [BKS’03]



Multiple Simultaneous Seeds

  • Multiple simultaneous seeds are defined as a set of seeds.

    • ∏ = {seed_1, seed_2, …, seed_i, …, seed_n}

    • ∏ detects a similarity if at least one of the component seeds detects the similarity

    • Example

      • Simultaneous seeds {11*1, 1*11} detect similarities 100110100001, 1000010110001, 1101001011001
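A minimal sketch of set detection, assuming seeds are written with `1` and `*` as above. Each seed is turned into a regular expression in which `*` accepts either symbol; the function names are hypothetical.

```python
import re

def detecting_seeds(seeds, similarity):
    """Return the seeds in the set that detect the 0/1 similarity string."""
    hits = []
    for seed in seeds:
        pattern = seed.replace("*", "[01]")  # '*' is a don't-care position
        if re.search(pattern, similarity):
            hits.append(seed)
    return hits

def set_detects(seeds, similarity):
    """A seed set detects a similarity if at least one of its seeds does."""
    return bool(detecting_seeds(seeds, similarity))

for s in ["100110100001", "1000010110001", "1101001011001"]:
    print(s, detecting_seeds(["11*1", "1*11"], s))
```

For the three similarities in the example, the first is detected only by 11*1, the second only by 1*11, and the third by both.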



Multi-seed Design – Balance Sensitivity with Specificity

  • Sensitivity = A / (biologically meaningful alignments)

  • Specificity = A / (seed matches)

    where A is the number of biologically meaningful alignments that are also seed matches

  • Increase sensitivity:

    • Decrease weight of single seed

    • Use multiple seeds

    • Both methods hurt specificity

  • Hypothesis: a set of multiple seeds has a better tradeoff of sensitivity vs. specificity compared to a single seed
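The two ratios can be computed directly once both collections are represented as sets. This is a toy sketch: the alignment identifiers below are made up for illustration, and the function name is hypothetical.

```python
def sensitivity_specificity(meaningful, seed_matches):
    """Return (A / |meaningful|, A / |seed_matches|), with A the overlap size.

    Both arguments are non-empty sets of alignment identifiers.
    """
    a = len(meaningful & seed_matches)
    return a / len(meaningful), a / len(seed_matches)

# Hypothetical identifiers, for illustration only.
meaningful = {1, 2, 3, 4}
seed_matches = {3, 4, 5, 6, 7, 8}
print(sensitivity_specificity(meaningful, seed_matches))
```

In this toy case two of the four meaningful alignments are found, at the cost of six seed matches.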

[Figure: diagram of "biologically meaningful alignments" and "seed matches", with the region labeled A]


Our Work – Design Multiple Simultaneous Seeds Efficiently

  • Use a new local search method to optimize seed set

  • Design an efficient algorithm to calculate conditional match probability

  • Empirical verification that multiple simultaneous seeds have a better tradeoff of sensitivity vs. specificity



Multi-seed Design Problem

  • Input:

    • Ungapped alignments sampled from two genomic DNA sequences

    • Resource constraints of seeds: weight, span, number

  • Goal: find a set of seeds ∏ to maximize the detection probability Pr[∏ detects S].

    • Pr[∏ detects S] = Pr[(seed_1 detects S) or (seed_2 detects S) or … or (seed_n detects S)]


Outline

  • Problem of multi-seed design

  • Methods

    • Greedy covering algorithm

      • Compute conditional match probabilities

  • Experiments and results

  • Conclusion and future work



Computing Match Probability for Specified Seeds [BKS ’03]

  • Learn a kth-order Markov model M from similarities.

  • Build a DFA that accepts exactly the strings containing a match to at least one of the given seeds.

  • Compute, by DP, the probability that the DFA accepts a string chosen randomly from model M.
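The DP over automaton states can be sketched as below, under two simplifying assumptions that are not from the slides: the model is 0th-order (each position is a match independently with probability `p_match`) rather than a learned kth-order Markov model, and the DFA is kept implicit, with each state being the last `max_span - 1` symbols of a prefix that no seed has matched yet. The function names are hypothetical.

```python
from collections import defaultdict

def seed_hits_suffix(seed, window):
    """True if the seed matches the last len(seed) symbols of window."""
    if len(window) < len(seed):
        return False
    tail = window[-len(seed):]
    return all(s == "*" or t == "1" for s, t in zip(seed, tail))

def detection_probability(seeds, length, p_match):
    """Pr[at least one seed detects a random 0/1 string of the given length].

    Assumes i.i.d. positions with Pr[1] = p_match (a simplification).
    """
    max_span = max(len(s) for s in seeds)
    alive = {"": 1.0}   # suffix of an undetected prefix -> probability mass
    detected = 0.0      # mass absorbed in the accepting state
    for _ in range(length):
        nxt = defaultdict(float)
        for suffix, prob in alive.items():
            for bit, p in (("1", p_match), ("0", 1.0 - p_match)):
                window = suffix + bit
                if any(seed_hits_suffix(s, window) for s in seeds):
                    detected += prob * p
                else:
                    state = window[-(max_span - 1):] if max_span > 1 else ""
                    nxt[state] += prob * p
        alive = nxt
    return detected

print(detection_probability(["11*1", "1*11"], 4, 0.5))  # 0.1875
```

The number of states is at most 2^(max_span - 1), so the running time grows linearly with the string length but exponentially with the largest span, which is consistent with exact computation being reserved for seeds with small span.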



Seek the Locally Optimal Set of Seeds

  • Original local search

  • Greedy covering algorithm – a faster local search strategy

    • Efficient computation of conditional match probability



Find Optimal Set of Seeds by Original Local Search

Seed space with span <= 8, weight = 3

[Figure: seed sets in the search space, each labeled with its detection probability]

  • {1***1*1, 1*****11}: Pr = 0.75

  • {1**1**1, 1*****11}: Pr = 0.67

  • {1****11, 1*****11}: Pr = 0.71

  • {1*1***1, 1*****11}: Pr = 0.70



Greedy Covering Algorithm

[Figure: similarity space, with regions labeled "Similarities detected by s1", "Similarities detected by s2", and "Similarities detected by s3"]

Design 3 simultaneous seeds: {s1, s2, s3}

s1 = argmax_x Pr(x)

s2 = argmax_x Pr(x | ~s1)

s3 = argmax_x Pr(x | ~{s1, s2})
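A minimal sketch of the greedy covering idea. It departs from the slides in two labeled ways: each argmax is taken by exhaustive scan over a candidate list rather than by local search, and the conditional probability is replaced by an empirical count over a fixed sample of similarities. All function names are hypothetical, and `candidate_seeds` assumes weight >= 2.

```python
import re
from itertools import combinations

def seed_detects(seed, similarity):
    """True if the seed ('1' = must match, '*' = don't care) hits the string."""
    return re.search(seed.replace("*", "[01]"), similarity) is not None

def candidate_seeds(weight, max_span):
    """All seeds of the given weight (>= 2) that start and end with '1'."""
    seeds = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            ones = {0, span - 1, *inner}
            seeds.append("".join("1" if k in ones else "*" for k in range(span)))
    return seeds

def greedy_cover(similarities, candidates, n):
    """Pick up to n seeds, each maximizing hits on still-undetected samples."""
    chosen = []
    uncovered = list(similarities)
    for _ in range(n):
        if not uncovered:
            break
        # Count of undetected samples that x detects: an empirical
        # stand-in for Pr(x | ~chosen), up to a constant factor.
        gains = {x: sum(seed_detects(x, s) for s in uncovered) for x in candidates}
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            break
        chosen.append(best)
        uncovered = [s for s in uncovered if not seed_detects(best, s)]
    return chosen

sample = ["100110100001", "1000010110001", "1101001011001"]
print(greedy_cover(sample, ["11*1", "1*11", "111"], 2))  # ['11*1', '1*11']
```

Each round only scores seeds against similarities that the seeds chosen so far miss, which is what conditioning on ~{s1, ..., sk} expresses.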



Calculate Conditional Match Probabilities

  • Challenge: how to calculate the conditional probability efficiently?

    • Seeds with small span: exact computation via DFAs

    • Seeds with large span: Monte Carlo
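For the Monte Carlo case, a rough sketch of the estimator: sample random similarities, discard those already detected by the chosen seeds, and report the fraction of the remainder that the new seed detects. The sampling model here is an assumption (independent positions with match probability `p_match`), standing in for the learned Markov model; function names and default parameters are hypothetical.

```python
import random
import re

def detects(seeds, similarity):
    """True if any seed ('1' = must match, '*' = don't care) hits the string."""
    return any(re.search(s.replace("*", "[01]"), similarity) for s in seeds)

def monte_carlo_conditional(x, chosen, length, p_match, trials=20000, rng=None):
    """Estimate Pr(x detects S | no seed in chosen detects S) by sampling."""
    rng = rng or random.Random(0)   # fixed seed for reproducibility
    undetected = hits = 0
    for _ in range(trials):
        s = "".join("1" if rng.random() < p_match else "0" for _ in range(length))
        if detects(chosen, s):
            continue                # condition on ~chosen: skip covered samples
        undetected += 1
        hits += detects([x], s)
    return hits / undetected if undetected else float("nan")

print(monte_carlo_conditional("1*11", ["11*1"], 20, 0.7))
```

The estimate gets noisier as the chosen seeds cover more of the sample space, since fewer samples survive the conditioning step.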



Calculate Conditional Match Probability via DFA

  • Pr(x | ~∏) = Pr(x and ~∏) / Pr(~∏), where ~∏ is the event that no seed already in ∏ detects the similarity

  • Build the DFA corresponding to (x and ~∏) by using cross product and complementation of DFAs

  • Efficiency: in the process of local search to find the optimal single seed x, Pr(~∏) can be precomputed
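As a worked identity (offered as a consistency check, not as the authors' derivation), write Pr(Π) for the probability that the seed set Π detects a random similarity, and ¬Π for the event that none of its seeds does. A similarity is detected by Π ∪ {x} either because Π detects it, or because x detects it while Π does not, and these two cases are disjoint. Hence:

```latex
\Pr(x \mid \lnot\Pi)
  = \frac{\Pr(x \land \lnot\Pi)}{\Pr(\lnot\Pi)}
  = \frac{\Pr(\Pi \cup \{x\}) - \Pr(\Pi)}{1 - \Pr(\Pi)}
```

The denominator does not depend on x, so it needs to be computed only once while candidate seeds x are compared.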



Outline

  • Problem of multi-seed design

  • Methods

    • Greedy covering algorithm

      • Compute conditional match probabilities

  • Experiments and results

  • Conclusion and future work



Greedy Covering vs. Original Local Search

[Figure: detection probability for greedy covering and for the original local search]


Greedy Covering is Much Faster

  • When n = 5 seeds, on the same hardware platform (P4)

    • Greedy covering needs 20 minutes

    • The original local search needs 2.4 hours



Experimental Setup

  • The ungapped alignments are sampled uniformly from human and mouse syntenies

  • For a specified seed set

    • Sensitivity: the number of significant gapped alignments found by our BLAST-like alignment tool

    • False positive rate: approximated by the number of seed matches



Results: Verify the Hypothesis on Noncoding Sequences



Summary of Contributions

  • Efficient algorithms to design multiple simultaneous seeds at reasonable cost

  • Empirical verification: multiple simultaneous seeds have a better tradeoff between sensitivity and specificity



Future Work

  • Design a better evaluation platform for different seeds

  • Investigate utility of seeds in multiple sequence alignment



Acknowledgements

  • Dr. Jeremy Buhler (advisor), Ben Westover, Rachel Nordgren, Joseph Lancaster and Christopher Swope

  • Laboratory for Computational Genomics at Washington University in Saint Louis

    http://www.cse.wustl.edu/~jbuhler/mandala
