
Bioinformatics and Computational Molecular Biology (Fall 2005): Representation

PatternHunter II: Highly Sensitive and Fast Homology Search

Ming Li, Bin Ma

Derek Kisman, John Tromp

R94922059 林語君


Overview

  • Homology search

    • Local alignment algorithms

  • PH I

  • PH II

    • Multiple Spaced Seeds

    • Computing hit probability

    • Finding a good seed set

    • PH II Design

    • Performance


Local alignment

  • Smith-Waterman

    • Smith and Waterman, 1981; Waterman and Eggert, 1987

    • SSearch

  • FastA

    • Wilbur and Lipman, 1983; Lipman and Pearson, 1985

  • BLAST

    • Altschul et al., 1990; Altschul et al., 1997

    • Blast Family: BLASTN, BLASTP, etc.

    • MEGABLAST


PatternHunter

  • Seed

    • Tradeoff: sensitivity <-> computation

  • Consecutive k letters

    • k=11 in Blastn, k=28 in MegaBlast

  • Nonconsecutive k letters

    • Spaced seed

    • Described by a 0-1 model; the number of 1s, k, is its weight
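
To make the difference between a consecutive and a spaced seed concrete, here is a minimal Python sketch. The function name `seed_matches` and the toy sequences are illustrative assumptions; the spaced seed is the first one in the seed lists later in these slides.

```python
def seed_matches(seed: str, s: str, t: str, i: int, j: int) -> bool:
    """True if s[i:] and t[j:] agree at every '1' position of the 0-1 seed."""
    if i + len(seed) > len(s) or j + len(seed) > len(t):
        return False  # the seed does not fit at these offsets
    return all(s[i + k] == t[j + k] for k, c in enumerate(seed) if c == "1")


SPACED_SEED = "111010010100110111"  # weight 11, length 18
CONSECUTIVE_SEED = "1" * 11         # 11 consecutive letters, as in Blastn

s = "ACGTACGTACGTACGTAC"
t = s[:3] + "A" + s[4:]  # one mismatch at offset 3, a '0' position of SPACED_SEED

print(seed_matches(SPACED_SEED, s, t, 0, 0))       # True: the mismatch is ignored
print(seed_matches(CONSECUTIVE_SEED, s, t, 0, 0))  # False: offset 3 must match
```

Both seeds have weight 11, so they require the same number of matching letters; only the positions that must match differ.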


PatternHunter II

  • Genome Informatics 14 (2003)

  • Extends the single optimized spaced seed of PH to multiple seeds

  • Speed: comparable to BLASTN (MEGABLAST)

  • Sensitivity: approaches Smith-Waterman (SSearch)


Definition

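  • A homologous region, R

  • A seed hits R if some window of R has identities at every 1 position of the seed

  • A seed set A = {a1, …, ak} hits R if at least one of its seeds hits R

  • Similarity

    • R has p = x% identities

  • Sensitivity

    • Hit probability: the probability that the seed (or seed set) hits R

    • Optimal alignment by dynamic programming (DP) has sensitivity 1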
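
The definitions above can be sketched in Python by writing a region as a 0-1 string (1 = identity, 0 = mismatch). All function names and the toy sequences are illustrative assumptions, not part of the slides.

```python
def identity_string(s: str, t: str) -> str:
    """0-1 view of an ungapped region: '1' where the two sequences agree."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return "".join("1" if x == y else "0" for x, y in zip(s, t))


def similarity(region: str) -> float:
    """Fraction of identities in the region (the p of the slides)."""
    return region.count("1") / len(region)


def seed_hits(seed: str, region: str) -> bool:
    """A seed hits R if some window of R is '1' at every '1' of the seed."""
    return any(
        all(region[i + j] == "1" for j, c in enumerate(seed) if c == "1")
        for i in range(len(region) - len(seed) + 1)
    )


def seed_set_hits(seeds, region: str) -> bool:
    """A seed set hits R if at least one of its seeds hits R."""
    return any(seed_hits(a, region) for a in seeds)


R = identity_string("ACGTTGCA", "ACCTTCCA")
print(R)                                  # 11011011
print(similarity(R))                      # 0.75
print(seed_set_hits(["111"], R))          # False
print(seed_set_hits(["111", "1101"], R))  # True
```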


Computing Hit Probability

  • NP-hard for multiple seeds

  • Dynamic programming (DP) handles a single seed

  • The DP can be extended to multiple seeds


Computing Hit Probability of Multiple Seeds

  • Let A = {a1, …, ak} be a set of k seeds and R a random region of length L with similarity level p.

  • f(i, b): the probability that A hits R[0:i], given that the binary string b is a suffix of R[0:i]

  • Answer: f(L, ε), where ε is the empty string
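
The recurrence itself is not preserved in this transcript, so the following is only a sketch of one way to evaluate f(i, b) by memoized recursion, assuming each position of R is an identity independently with probability p. The case split (a seed already hits, no seed can end at position i, otherwise extend b by one bit to the left) is an assumption made for illustration and may differ from the paper's formulation; the running time can grow exponentially with seed length.

```python
from functools import lru_cache


def hit_probability(seeds, L: int, p: float) -> float:
    """Probability that some seed hits a random 0-1 region of length L."""

    def compatible(seed: str, b: str) -> bool:
        # Align seed and b at their right ends; every '1' of the seed that
        # overlaps b must sit on a '1' of b.
        n = min(len(seed), len(b))
        return all(b[-1 - j] == "1" for j in range(n) if seed[-1 - j] == "1")

    @lru_cache(maxsize=None)
    def f(i: int, b: str) -> float:
        # Seeds that fit in R[0:i] and could still end exactly at position i.
        live = [a for a in seeds if len(a) <= i and compatible(a, b)]
        if any(len(a) <= len(b) for a in live):
            return 1.0  # b already determines a hit ending at position i
        if not live:
            if i == 0:
                return 0.0
            # No seed can end at position i, so drop the last bit.
            return f(i - 1, b[:-1])
        # Otherwise condition on the next bit to the left of b.
        return (1 - p) * f(i, "0" + b) + p * f(i, "1" + b)

    return f(L, "")


print(hit_probability(("11", "101"), 3, 0.5))  # 0.5
```

For the printed example, the regions of length 3 hit by {11, 101} are 011, 101, 110 and 111, which is 4 of 8 equally likely strings.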


Finding a Good Seed Set

  • Finding an optimal single seed and finding an optimal set of multiple seeds are both NP-hard

  • Use a greedy heuristic instead


Finding a Good Seed Set

  • Compute the 1st seed a1 which maximizes the hit probability of {a1}

  • Compute the 2nd seed a2 which maximizes the hit probability of {a1, a2}

  • Repeat until either

    • the desired number of seeds is reached, or

    • the desired hit probability is reached
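
The greedy steps above can be sketched as follows. To keep the example short and self-contained, the hit probability is computed by brute-force enumeration of all 2^L regions, which is only feasible for toy sizes and is a stand-in for the DP. The candidate generator, the function names and the toy parameters (weight 3, L = 10) are assumptions for illustration.

```python
from itertools import combinations, product


def candidates(weight: int, max_len: int):
    """All 0-1 seeds of the given weight (>= 2) that start and end with 1."""
    out = []
    for length in range(weight, max_len + 1):
        for inner in combinations(range(1, length - 1), weight - 2):
            seed = ["0"] * length
            seed[0] = seed[-1] = "1"
            for j in inner:
                seed[j] = "1"
            out.append("".join(seed))
    return out


def hits(seed, region) -> bool:
    return any(
        all(region[i + j] == "1" for j, c in enumerate(seed) if c == "1")
        for i in range(len(region) - len(seed) + 1)
    )


def hit_prob(seeds, L: int, p: float) -> float:
    """Exact hit probability by enumerating every 0-1 region of length L."""
    total = 0.0
    for bits in product("01", repeat=L):
        if any(hits(a, bits) for a in seeds):
            ones = bits.count("1")
            total += p ** ones * (1 - p) ** (L - ones)
    return total


def greedy_seed_set(cands, L, p, max_seeds, target=1.0):
    """Add, one at a time, the seed that maximizes the combined hit probability."""
    chosen, prob = [], 0.0
    while len(chosen) < max_seeds and prob < target:
        remaining = [c for c in cands if c not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda c: hit_prob(chosen + [c], L, p))
        chosen.append(best)
        prob = hit_prob(chosen, L, p)
    return chosen, prob


seeds, prob = greedy_seed_set(candidates(3, 5), L=10, p=0.7, max_seeds=2)
print(seeds, prob)
```

Each round fixes the seeds already chosen, so the result is not guaranteed to be the best possible set of that size.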


Finding a Good Seed Set

  • The greedy choice may not maximize the combined hit probability

  • But it is good enough

    • Optimal

      • 16 weight-11 seeds, L=64, similarity=70%, first four seeds: {111010010100110111, 111100110010100001011, 110100001100010101111, 1110111010001111}

    • Greedy

      • 16 weight-12 seeds, L=64, similarity=70%, first four seeds: {111010010100110111, 1111000100010011010111, 1100110100101000110111, 1110100011110010001101}


Performance of the seeds

  • Figure legend (curves from low to high)

    • Solid: k = 1, 2, 4, 8, 16 seeds of weight 11

    • Dashed: a single seed of weight 10, 9, 8, 7


Performance of the seeds

  • Reducing the weight by 1

    • Increases the expected number of hits by a factor of 4

  • Doubling the number of seeds

    • Increases the expected number of hits by a factor of 2

  • Better trade-off: multiple seeds
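
The two factors can be checked with a short calculation. This sketch assumes uniformly random DNA over a 4-letter alphabet, so that one seed of weight w matches at a given pair of positions with probability 4^(-w); the function name and the number of position pairs are illustrative assumptions.

```python
def expected_random_hits(num_seeds: int, weight: int,
                         position_pairs: int, alphabet: int = 4) -> float:
    """Expected number of chance hits for seeds of equal weight."""
    return num_seeds * position_pairs * alphabet ** (-weight)


pairs = 10 ** 6  # hypothetical number of (query, target) position pairs
base = expected_random_hits(1, 11, pairs)

print(expected_random_hits(1, 10, pairs) / base)  # 4.0: weight reduced by 1
print(expected_random_hits(2, 11, pairs) / base)  # 2.0: number of seeds doubled
```

Under this model, lowering the weight costs twice as many extra hits to process as doubling the number of seeds does.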


PH II Performance

  • Compared with BLAST (Blastn) and Smith-Waterman (SSearch)

  • Sensitivity of SSearch = 1

  • Alignment score

    • BLAST methods (hash, DP)

    • match = 1, mismatch = -1, gap open = -5, gap extension = -1
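
As an illustration of the scoring scheme, the sketch below scores an already-aligned pair of sequences. The slide does not say how gap open and gap extension combine, so this assumes a gap of length g costs gap open + g × gap extension; the function name and example alignments are hypothetical.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -5, gap_ext: int = -1) -> int:
    """Score two aligned rows of equal length; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap = None  # which row carried a gap in the previous column
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if row != prev_gap:
                score += gap_open  # a new gap starts in this row
            score += gap_ext       # assumed: every gap column pays the extension
            prev_gap = row
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score


print(alignment_score("ACGT", "ACGT"))      # 4
print(alignment_score("ACGT", "AGGT"))      # 2
print(alignment_score("AC--GT", "ACTTGT"))  # -3
```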


PH II Performance

  • Figure legend (curves from low to high)

    • Solid: PH II with 1, 2, 4, 8 seeds of weight 11

    • Dashed: Blastn, seed weight 11


Complexity Proof

  • Finding optimal spaced seeds

    • NP-hard

  • Finding one optimal seed

    • NP-hard

  • Computing the hit probability of multiple seeds

    • NP-hard

