practical algorithms in sequence alignment
Download
Skip this Video
Download Presentation
Practical algorithms in Sequence Alignment

Loading in 2 Seconds...

play fullscreen
1 / 28

Practical algorithms in Sequence Alignment - PowerPoint PPT Presentation


  • 133 Views
  • Uploaded on

Practical algorithms in Sequence Alignment. Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ [email protected] Sep 16 th , 2014. Key concepts. Assessing the significance of an alignment Extreme value distribution gives an analytical form to compute the significance of a score

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Practical algorithms in Sequence Alignment' - brooklyn


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
practical algorithms in sequence alignment

Practical algorithms in Sequence Alignment

Sushmita Roy

BMI/CS 576

www.biostat.wisc.edu/bmi576/

[email protected]

Sep 16th, 2014

key concepts
Key concepts
  • Assessing the significance of an alignment
    • Extreme value distribution gives an analytical form to compute the significance of a score
  • Heuristic algorithms
    • BLAST algorithm
readings from book
Readings from book
  • Chapter 2
    • Section 2.5
    • Section 2.7
recap four issues in sequence alignment
RECAP: Four issues in sequence alignment
  • Type
  • Algorithm
  • Score
  • Significance
classical approach to assessing the significance of score
Classical approach to assessing the significance of score
  • Develop a “background” distribution of alignment score
  • Assess the probability of observing our score S from this background distribution
thought experiment for generating th e background distribution
Thought experiment for generating the background distribution
  • Suppose we assume
    • Sequence lengths m and n
    • A particular substitution matrix and amino-acid frequencies
  • And we consider generating random sequences of lengths m and n and finding the best alignment of these sequences
  • This will give us a distribution over alignment scores for random pairs of sequences
the extreme v alue d istribution
The extreme value distribution
  • Because we are picking thebest alignments, we want to know the distribution of max scores for alignments against a random set of sequences looks like
  • The background distribution is given by an extreme value distribution (EVD)
  • We need to assess the probability of observing a score of S or higher by random chance, that is, we need the form of the CDF: P(x>S)
assessing significance of sequence score alignments
Assessing significance of sequence score alignments
  • It can be shown that the mean of optimal scores is
    • K, λestimated from the substitution matrix
  • Probability of observing a score greater than S
  • SubstitutingU into the equation
need to speed up sequence alignment
Need to speed up sequence alignment
  • Consider the task of searching the RefSeq collection of sequences against a query sequence:
    • most recent release of DB contains 32,504,738 proteins
    • Entails 33,000,000*(300*300) matrix operations (assuming query sequence is of length 300 and avg. sequence length is 300)
  • O(mn) too slow for large databases with high query traffic
  • We will look at a heuristic algorithm to speed up the search process
heuristic alignment
Heuristic alignment
  • Heuristic algorithm: a problem-solving method which isn’t guaranteed to find the optimal solution, but which is efficient and finds good solutions
  • Heuristic methods do fast approximation to dynamic programming
    • FASTA [Pearson & Lipman, 1988]
    • BLAST [Altschulet al., 1990; Altschul et al., Nucleic Acids Research 1997]
blast basic local alignment search tool
BLAST: Basic Local Alignment Search Tool
  • Altshulet al 1990
    • Cited >48,000 times!
  • Key heuristics in BLAST
    • A good alignment is made up short stretches of matches: seeds
    • Extend seeds to make longer alignments
  • Key tradeoff made: sensitivity vs. speed
  • Used EVD theory for random sequence score
  • Works for both protein sequence and DNA sequence
    • Only scores differ
blast continued
BLAST continued
  • Two parameters control how BLAST searches the database
    • w: This specifies the length of words to seed the alignment
    • T: The smallest threshold of word pair match to be considered in the alignment
key steps of the blast algorithm
Key steps of the BLAST algorithm
  • For each query sequence
    • Compile a list of high-scoring words of score at least T
      • First generate words in the query sequence
      • Then find words that match query sequence words with score at least T
      • Thus allows for inexact matches
    • Scan the database for hits of these words
      • Relies on indexing performed as pre-processing
    • Extend hits
determining query words
Determining query words

Given:

query sequence: QLNFSAGW

word length w = 2 (default for protein usually w = 3)

word score threshold T = 9

Step 1: determine all words of length w in query sequence

QL LN NF FS SA AG GW

determining query words1
Determining query words

Step 2: Determine all words that score at least T when compared to a word in the query sequence

QLQL=9

LNLN=10

NFNF=12, NY=9

SA none

...

words from

query sequence

words with T≥9

Aminoacid substitution matrix

Additional words

potentially in database

scanning the database
Scanning the database
  • Search database for all occurrences of query words
  • Approach:
    • index database sequences into table of words (pre-compute this)
    • index query words into table (at query time)

NP

NS

NT

NW

NY

QLNFSAGW

query sequence

MFNYT, STNYD…

NPGAT, TSQRPNP…

database sequences

extending a hit
Extending a hit
  • Extending a word hit to a larger alignment is straightforward
  • Terminate extension when the score of the current alignment falls a certain distance below the best score found for shorter extensions

Query sequence: Q L N F S A

DB sequence: R L N Y S W

Score: 1 4 5 3 4 -3

Total: 1+4+5+3+4=17

how to choose w and t
How to choose w and T?
  • Tradeoff between running time and sensitivity
  • Sensitivity
  • T
    • small T: greater sensitivity, more hits to expand
    • large T: lower sensitivity, fewer hits to expand
  • w
    • Larger w: fewer query word seeds, lower time for extending, but more possible words (20w for AAs)
updates to blast
Updates to BLAST
  • Two hit method
    • Lower the threshold but require two words to be on the same diagonal and be no more than A characters apart
  • Ability to handle gaps
  • Ability to handle position-specific score matrix created from alignments generated from iteration ifor iteration i+1

Altshul et al 1997

two hit method
Two-hit method

+ Hits wit T>=13 (15 hits)

.Hits with T>=11 (22 hits)

Only these two are considered as they satisfy the two-hit method

Figure from Altshul et al 1997

summary of blast
Summary of BLAST
  • T: Don’t consider seeds with score < T
  • Don’t extend hits when score falls below a specified threshold
  • Pre-processing of database or query helps to improve the running time
fasta
FASTA
  • Starts with exact seed matches instead of inexact matches that satisfy a threshold
  • Extends seeds (similar to BLAST)
  • Join high scoring seeds allowing for gaps
  • Re-align high scoring matches using dynamic programming
sequence databases
Sequence databases
  • Web portals/Knowledge bases
    • NCBI: http://www.ncbi.nih.gov
    • EBI: http://www.ebi.ac.uk
    • Sanger: http://www.sanger.ac.uk
    • Each of these centers link to hundreds of databases
  • Nucleotide sequences
    • Genbank
    • EMBL-EBI Nucleotide Sequence Database
    • Comprise ~8% of the total database (Nucleic Acid Research 2006 Database edition)
  • Protein sequences
    • UniProtKB
using blast
Using BLAST
  • http://blast.ncbi.nlm.nih.gov/Blast.cgi
  • Will blast a DNA sequence against NCBI nucleotide database
  • We will select
    • http://www.ncbi.nlm.nih.gov/nuccore/NG_000007.3?from=70545&to=72150&report=fasta
using blast1
Using BLAST

Enter the query sequence

Choose the database

using blast2
Using BLAST

The sequence corresponds to the human HBB (hemoglobin) gene.

But we will select the mouse DB

Use Megablast (large word size)

interpreting results
Interpreting results

Assesses significance of a score.

Related to P-value, but gives the

expected number of alignments of

this score value or higher.

ad