pairwise sequence alignment
Download
Skip this Video
Download Presentation
Pairwise Sequence Alignment

Loading in 2 Seconds...

play fullscreen
1 / 66

Pairwise Sequence Alignment - PowerPoint PPT Presentation


  • 112 Views
  • Uploaded on

Pairwise Sequence Alignment. Misha Kapushesky Slides: Stuart M. Brown, Fourie Joubert, NYU St. Petersburg Russia 2010. Protein Evolution. “For many protein sequences, evolutionary history can be traced back 1-2 billion years” -William Pearson

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Pairwise Sequence Alignment' - clive


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
pairwise sequence alignment

Pairwise Sequence Alignment

Misha Kapushesky

Slides: Stuart M. Brown, Fourie Joubert, NYU

St. Petersburg Russia 2010

protein evolution
Protein Evolution

“For many protein sequences, evolutionary history can be traced back 1-2 billion years”

-William Pearson

  • When we align sequences, we assume that they share a common ancestor
    • They are then homologous
  • Protein fold is much more conserved than protein sequence
  • DNA sequences tend to be less informative than protein sequences
definition
Definition
  • Homology: related by descent
  • Homologous sequence positions

 ATTGCGC

ATTGCGC

ATTGCGC

AT-CCGC

ATTGCGC

 ATCCGC

C

orthologous and paralogous
Orthologous and paralogous
  • Orthologous sequences differ because they are found in different species (a speciation event)
  • Paralogous sequences differ due to a gene duplication event
  • Sequences may be both orthologous and paralogous
pairwise alignment
Pairwise Alignment
  • The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
    • There are lots of possible alignments.
  • Two sequences can always be aligned.
  • Sequence alignments have to be scored.
  • Often there is more than one solution with the same score.
methods of alignment
Methods of Alignment
  • By hand - slide sequences on two lines of a word processor
  • Dot plot
    • with windows
  • Rigorous mathematical approach
    • Dynamic programming (slow, optimal)
  • Heuristic methods (fast, approximate)
    • BLAST and FASTA
      • Word matching and hash tables0
align by hand
Align by Hand

GATCGCCTA_TTACGTCCTGGAC <--

--> AGGCATACGTA_GCCCTTTCGC

You still need some kind of scoring system to find the best alignment

percent sequence identity
Percent Sequence Identity
  • The extent to which two nucleotide or amino acid sequences are invariant

A C C T G A G – A G

A C G T G – G C A G

mismatch

indel

70% identical

slide9

Dotplot:

A dotplot gives an overview of all possible alignments

A   T   T   C  A   C  A   T   A    T A C A T T A C G T A C

Sequence 2

Sequence 1

slide10

Dotplot:

In a dotplot each diagonal corresponds to a possible (ungapped) alignment

A   T   T   C  A   C  A   T   A    T A C A T T A C G T A C

Sequence 2

Sequence 1

T A C A T T A C G T A C

A T A C A C T T A

One possible alignment:

slide11

Insertions / Deletions in a Dotplot

T

A

C

T

G

T

C

A

T

T A C T G T T C A T

Sequence 2

Sequence 1

T A C T G-T C A T

| | | | | | | | |

T A C T G T T C A T

slide12

Dotplot(Window = 130 / Stringency = 9)

Hemoglobin-chain

Hemoglobin -chain

slide13

Word Size Algorithm

T A C G G T A T G

A C A G T A T C

Word Size = 3

C T A T G A C A T A C G G T A T G

T A C G G T A T G

A C A G T A T C

T A C G G T A T G

A C A G T A T C

T A C G G T A T G

A C A G T A T C

slide14

Window / Stringency

Score = 11

PTHPLASKTQILPEDLASEDLTI

PTHPLAGERAIGLARLAEEDFGM

Scoring Matrix Filtering

Score = 11

Matrix: PAM250

Window = 12 Stringency = 9

PTHPLASKTQILPEDLASEDLTI

PTHPLAGERAIGLARLAEEDFGM

Score = 7

PTHPLASKTQILPEDLASEDLTI

PTHPLAGERAIGLARLAEEDFGM

slide15

Dotplot(Window = 18 / Stringency = 10)

Hemoglobin-chain

Hemoglobin -chain

slide16

Considerations

  • The window/stringency method is more sensitive than the wordsize
  • method (ambiguities are permitted).
  • The smaller the window, the larger the weight of statistical
  • (unspecific) matches.
  • With large windows the sensitivity for short sequences is reduced.
  • Insertions/deletions are not treated explicitly.
alignment methods
Alignment methods
  • Rigorous algorithms = Dynamic Programming
    • Needleman-Wunsch (global)
    • Smith-Waterman (local)
  • Heuristic algorithms (faster but approximate)
      • BLAST
      • FASTA
the rocks game
The Rocks game
  • N rocks, 2 piles, 2 players
  • Player can
    • Remove 1 rock from either pile
    • Remove 1 rock from each pile
  • Last to remove a rock wins
  • Assume 10 rocks in each pile – winning strategy?
slide20

Basic principles of dynamic programming

  • - Creation of an alignment path matrix
  • - Stepwise calculation of score values
  • - Backtracking (evaluation of the optimal path)
dynamic programming1
Dynamic Programming
  • Dynamic Programming is a very general programming technique.
  • It is applicable when a large search space can be structured into a succession of stages, such that:
    • the initial stage contains trivial solutions to sub-problems
    • each partial solution in a later stage can be calculated by recurring a fixed number of partial solutions in an earlier stage
    • the final stage contains the overall solution
slide22

Creation of an alignment path matrix

Idea:Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences

  • Construct matrix F indexed by i and j (one index for each sequence)
  • F(i,j) is the score of the best alignment between the initial segment x1...iof x up to xiand the initial segment y1...jof y up to yj
  • Build F(i,j) recursively beginning with F(0,0) = 0
slide24

Creation of an alignment path matrix

  • If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)
  • Three possibilities:
    • xiand yj are aligned, F(i,j) = F(i-1,j-1) + s(xi ,yj)
    • xi is aligned to a gap, F(i,j) = F(i-1,j) - d
    • yjis aligned to a gap, F(i,j) = F(i,j-1) - d
  • The best score up to (i,j) will be the largest of the three options
slide25

E

E

Backtracking

H E A G A W G H E E

0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80

P -8 -2 -9 -17 -25 -33 -42 -49 -57 -65 -73

A -16 -10 -3 -4 -12 -20 -28 -36 -44 -52 -60

W -24 -18 -11 -6 -7 -15 -5 -13 -21 -29 -37

H -32 -14 -18 -13 -8 -9 -13 -7 -3 -11 -19

E -40 -22 -8 -16 -16 -9 -12 -15 -7 3 -5

A -48 -30 -16 -3 -11 -11 -12 -12 -15 -5 2

E -56 -38 -24 -11 -6 -12 -14 -15 -12 -9 1

0

-8

-16

-17

-25

-20

-5

-13

-3

3

-5

1

H

-

E

-

A

P

G

-

G

-

H

H

E

E

-

A

A

A

W

W

Optimal global alignment:

global vs local alignments
Global vs. Local Alignments
  • Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
  • Local alignment algorithms finds the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
slide28

Global Alignment

Two closely related sequences:

needle (Needleman & Wunsch)creates an end-to-end alignment.

slide29

Global Alignment

Two sequences sharing several regions of local similarity:

1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67

|||||||||||||| | | | |||| || | | | ||

1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70

global alignment needleman wunsch
Global Alignment (Needleman-Wunsch)
  • The the Needleman-Wunsch algorithm creates a global alignment over the length of both sequences (needle)
  • Global algorithms are often not effective for highly diverged sequences - do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.
    • Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.
  • Global methods are useful when you want to force two sequences to align over their entire length
local alignment smith waterman
Local Alignment (Smith-Waterman)
  • Local alignment
    • Identify the most similar sub-region shared between two sequences
    • Smith-Waterman
    • EMBOSS: water
slide32

Parameters of Sequence Alignment

  • Scoring Systems:
  • Each symbol pairing is assigned a numerical value, based on a symbol comparison table.
  • Gap Penalties:
  • Opening: The cost to introduce a gap
  • Extension: The cost to elongate a gap
slide33

actaccagttcatttgatacttctcaaa

taccattaccgtgttaactgaaaggacttaaagact

DNA Scoring Systems

-very simple

Sequence 1

Sequence 2

A G C T

A1 0 0 0

G 0 1 0 0

C 0 0 1 0

T 0 0 0 1

Match: 1

Mismatch: 0

Score = 5

slide34

Protein Scoring Systems

Sequence 1

Sequence 2

PTHPLASKTQILPEDLASEDLTI

PTHPLAGERAIGLARLAEEDFGM

C S T P A G N D. .

C 9

S -1 4

T -1 1 5

P -3 -1 -1 7

A 0 1 0 -1 4

G -3 0 -2 -2 0 6

N -3 1 0 -2 -2 0 5

D -3 0 -1 -1 -2 -1 1 6

.

.

C S T P A G N D. .

C 9

S -1 4

T -1 1 5

P -3 -1 -1 7

A 0 1 0 -1 4

G -3 0 -2 -2 0 6

N -3 1 0 -2 -2 0 5

D -3 0 -1 -1 -2 -1 1 6

.

.

Scoring

matrix

T:G = -2

T:T = 5

Score = 48

slide35

Protein Scoring Systems

  • Amino acids have different biochemical and physical properties
  • that influence their relative replaceability in evolution.

tiny

P

aliphatic

C

small

S+S

G

G

I

A

S

V

C

N

SH

L

D

T

hydrophobic

Y

M

K

E

Q

F

W

H

R

positive

aromatic

polar

charged

slide36

Protein Scoring Systems

  • Scoring matrices reflect:
    • # of mutations to convert one to another
    • chemical similarity
    • – observed mutation frequencies
    • – the probability of occurrence of each amino acid
  • Widely used scoring matrices:
    • PAM
    • BLOSUM
pam matrices
PAM matrices
  • Family of matrices PAM 80, PAM 120, PAM 250
  • The number with a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based
  • Greater numbers denote greater distances
slide38

PAM (Percent Accepted Mutations) matrices

  • The numbers of replacements were used to compute a so-called
  • PAM-1 matrix.
  • The PAM-1 matrix reflects an average change of 1% of all amino
  • acid positions. PAM matrices for larger evolutionary distances can
  • be extrapolated from the PAM-1 matrix.
  • PAM250 = 250 mutations per 100 residues.
  • Greater numbers mean bigger evolutionary distance
slide39

PAM (Percent Accepted Mutations) matrices

  • Derived from global alignments of protein families . Family members
  • share at least 85% identity (Dayhoff et al., 1978).
  • Construction of phylogenetic tree and ancestral sequences of
  • each protein family
  • Computation of number of replacements for each pair of amino acids
slide40

C

W

W

-8

17

PAM 250

A R N D C Q E G H I L K M F P S T W Y V B Z

A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 2 1

R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 1 2

N 0 0 2 2 -4 1 1 0 2 -2 -3 1 -2 -3 0 1 0 -4 -2 -2 4 3

D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 5 4

C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -3 -4

Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 3 5

E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 4 5

G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 2 1

H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 3 3

I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -1 -1

L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -2 -1

K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 2 2

M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -1 0

F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -3 -4

P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 1 1

S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 2 1

T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 2 1

W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -4 -4

Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -2 -3

V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 0 0

B 2 1 4 5 -3 3 4 2 3 -1 -2 2 -1 -3 1 2 2 -4 -2 0 6 5

Z 1 2 3 4 -4 5 5 1 3 -1 -1 2 0 -4 1 1 1 -4 -3 0 5 6

pam limitations
PAM - limitations
  • Based on only one original dataset
  • Examines proteins with few differences (85% identity)
  • Based mainly on small globular proteins so the matrix is biased
blosum matrices
BLOSUM matrices
  • Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped local alignments)
  • BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity
  • BLOSUM62 represents closer sequences than BLOSUM45
slide43

BLOSUM (Blocks Substitution Matrix)

  • Derived from alignments of domains of distantly related
  • proteins (Henikoff & Henikoff,1992).
  • Occurrences of each amino acid pair
  • in each column of each block alignment
  • is counted.
  • The numbers derived from all blocks were
  • used to compute the BLOSUM matrices.

A

A

C

E

C

A

A

C

E

C

A - C = 4

A - E = 2

C - E = 2

A - A = 1

C - C = 1

slide45

BLOSUM (Blocks Substitution Matrix)

  • Sequences within blocks are clustered according to their level of identity.
  • Clusters are counted as a single sequence.
  • Different BLOSUM matrices differ in the percentage of sequence identity
  • used in clustering.
  • The number in the matrix name (e.g. 62 in BLOSUM62) refers to the
  • percentage of sequence identity used to build the matrix.
  • Greater numbers mean smaller evolutionary distance.
pam vs blosum
PAM Vs. BLOSUM

PAM100 = BLOSUM90

PAM120 = BLOSUM80

PAM160 = BLOSUM60

PAM200 = BLOSUM52

PAM250 = BLOSUM45

More distant sequences

  • BLOSUM62 for general use
  • BLOSUM80 for close relations
  • BLOSUM45 for distant relations
  • PAM120 for general use
  • PAM60 for close relations
  • PAM250 for distant relations
slide47

TIPS on choosing a scoring matrix

  • Generally, BLOSUM matrices perform better than PAM matrices
  • for local similarity searches (Henikoff & Henikoff, 1993).
  • When comparing closelyrelatedproteins one should use lower
  • PAMor higher BLOSUM matrices, for distantlyrelatedproteins
  • higher PAM or lower BLOSUM matrices.
  • For database searching the commonly used matrix is BLOSUM62.
slide48

Scoring Insertions and Deletions

A T G T A A T G C A

T A T G T G G A A T G A

A T G T - - A A T G C A

T A T G T G G A A T G A

insertion / deletion

The creation of a gap is penalized with a negative score value.

slide49

Why Gap Penalties?

Gaps not permitted Score: 0

1 GTGATAGACACAGACCGGTGGCATTGTGG 29

||| | | ||| | || || |

1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29

Match = 5

Mismatch = -4

Gaps allowed but not penalized Score: 88

1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29

||| || | | | ||| || | | || || |

1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29

slide50

Why Gap Penalties?

  • The optimal alignment of two similar sequences is usually that which
      • maximizes the number of matches and
      • minimizes the number of gaps.
      • There is a tradeoff between these two
        • - adding gaps reduces mismatches
  • Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences.
  • Penalizing gaps forces alignments to have relatively few gaps.
slide51

Gap Penalties

  • How to balance gaps with mismatches?
  • Gaps must get a steep penalty, or else you’ll end up with nonsense alignments.
  • In real sequences, multi-base (or amino acid) gaps are quite common
      • genetic insertion/deletion events
  • “Affine” gap penalties give a big penalty for each new gap, but a much smaller “gap extension” penalty.
slide52

Scoring Insertions and Deletions

match = 1

mismatch = 0

Total Score: 4

A T G T T A T A C

T A T G T G C G T A T A

Total Score: 8 - 3.2 = 4.8

A T G T - - - T A T A C

Gap parameters:

d = 3 (gap opening)

e = 0.1 (gap extension)

g = 3 (gap lenght)

(g) = -3 - (3 -1) 0.1 = -3.2

T A T G T G C G T A T A

insertion / deletion

slide53

Modification of Gap Penalties

Score Matrix: BLOSUM62

1 ...VLSPADKFLTNV 12

||||

1 VFTELSPAKTV.... 11

gap opening penalty = 3

gap extension penalty = 0.1

score = 6.3

1 V...LSPADKFLTNV 12

| |||| | | |

1 VFTELSPA.K..T.V 11

gap opening penalty = 0

gap extension penalty = 0.1

score = 11.3

blast algorithm basic local alignment search tool
BLAST AlgorithmBasic Local Alignment Search Tool
  • Fast alignment technique(s)
    • Similar to FASTA algorithms (not used much now)
    • There are more accurate ones, but they’re slower
    • BLAST makes a big use of lookup tables
  • Idea: statistically significant alignments (hits)
    • Will have regions of at least 3 letters same
      • Or at least high scoring with respect to BLOSUM matrix
  • Based on small local alignments

more likely than

CCNDHRKMTCSPNDNNRK

TTNDHRMTACSPDNNNKH

CCNDHRKMTCSPNDNNRK

YTNHHMMTTYSLDNNNKK

blast overview
BLAST Overview
  • Given a query sequence Q
  • Seven main stages
    • Remove (filter) low complexity regions from Q
    • Harvest k-tuples (triples) from Q
    • Expand each triple into ~50 high scoring words
    • Seed a set of possible alignments
    • Generate high scoring pairs (HSPs) from the seeds
    • Test significance of matches from HSPs
    • Report the alignments found from the HSPs
blast algorithm part 1 removing low complexity segments
BLAST Algorithm Part 1 Removing Low-complexity Segments
  • Imagine matching
    • HHHHHHHHKMAY and HHHHHHHHURHD
    • The KMAY and URHD are the interesting parts
    • But this pair score highly using BLOSUM
  • It’s a good idea to remove the HHHHHHHs
    • From the query sequence (low complexity)
  • SEG program does this kind of thing
    • Comes with most BLAST implementations
    • Often doesn’t do much, and it can be turned off
removing low complexity segments
Removing Low-complexity Segments
  • Given a segment of length L
    • With each amino acid occurring n1 n2 … n20 times
  • Use the following measure for “compositional complexity”:
  • To use this measure
    • Slide a “window” of ~12 residues along Query Sequence Q
    • Use a threshold to determine low complexity windows
    • Use a minimise routine to replace the segment
      • With an optimal minimised segment (or just an X)
blast algorithm part 2 harvesting k tuples
BLAST Algorithm Part 2Harvesting k-tuples
  • Collect all the k-tuples of elements in Q
    • k set to 3 for residues and 11 for DNA (can vary)
    • Triples are called ‘words’. Call this set W

S T S L S T S D K L M R

STS

TSL

SLS

LST

blast algorithm part 3 finding high scoring triples
BLAST Algorithm Part 3Finding High Scoring Triples
  • Given a word w from W
    • Find all other words w’ of same length (3), which:
      • Appear in some database sequence
      • Blosum(w,w’) > a threshold T
  • Choose T to limit number to around 50
    • Call these the high scoring triples (words) for w
  • Example: letting w=PQG, set T to be 13
    • Suppose that PQG, PEG, PSG, PQA are found in database
    • Blosum(PQG,PQG) = 18, Blosum(PQG,PEG) = 15
    • Blosum(PQG,PSG) = 13, Blosum(PQG,PQA) = 12
    • Hence, PQG and PEG only are kept
finding high scoring triples
Finding High Scoring Triples
  • For each w in W, find all the high scoring words
    • Organise these sets of words
      • Remembering all the places where w was found in Q
  • Each high scoring triple is going to be a seed
    • In order to generate possible alignment(s)
      • One seed can generate more than one alignment
  • End of the first half of the algorithm
    • Going to find alignments now
blast algorithm part 4 seeding possible alignments
BLAST Algorithm Part 4Seeding Possible Alignments
  • Look at first triple V in query sequence Q
    • Actually from Q (not from W - which has omissions)
    • Retrieve the set of ~50 high scoring words
      • Call this set HV
    • Retrieve the list of places in Q where V occurs
      • Call this set PV
  • For every pair (word, pos)
    • Where word is from HV and pos is from PV
      • Find all the database sequences D
        • Which have an exact match with word at position pos’
      • Store an alignment between Q and D
        • With Vmatched at pos in Q and pos’ in D
  • Repeat this for the second triple in Q, and so on
seeding possible alignments example
Seeding Possible AlignmentsExample
  • Suppose Q = QQGPHUIQEGQQG
  • Suppose V = QQG, HV = {QQG, QEG}
    • Then PV = {1, 11}
  • Suppose we are looking in the database at:
    • D = PKLMMQQGKQEG
  • Then the alignments seeded are:

QQGPHUIQEGQQG word=QQG QQGPHUIQEGQQG word=QQG

PKLMMQQGKQEG pos=1 PKLMMQQGKQEG pos=11

QQGPHUIQEGQQG word=QEG QQGPHUIQEGQQG word=QEG

PKLMMQQGKQEG pos=1 PKLMMQQGKQEG pos=11

blast algorithm part 5 generating high scoring pairs hsps
BLAST Algorithm Part 5Generating High Scoring Pairs (HSPs)
  • For each alignment A
    • Where sequences Q and D are matched
    • Original region matching was M
  • Extend M to the left
    • Until the Blosum score begins to decrease
  • Extend M to the right
    • Until the Blosum score begins to decrease
  • Larger stretch of sequence now matches
    • May have higher score than the original triple
    • Call these high scoring pairs
  • Throw away any alignments for which the score S of the extended region M is lower than some cutoff score
extending alignment regions example
Extending Alignment RegionsExample

QQGPHUIQEGQQGKEEDPP Blosum(QQG,QQG) = 16

PKLMMQQGKQEGM

QQGPHUIQEGQQGKEEDPP Blosum(QQGK,QQGK) = 21

PKLMMQQGKQEGM

QQGPHUIQEGQQGKEEDPP Blosum(QQGKE,QQGKQ) = 23

PKLMMQQGKQEGM

QQGPHUIQEGQQGKEEDPP Blosum(QQGKEE,QQGKQE) = 28

PKLMMQQGKQEGM

QQGPHUIQEGQQGKEEDPP Blosum(QQGKEED,QQGKQEG) = 27

PKLMMQQGKQEGM

So, the extension to the right stops here

HSP (before left extension) is QQGKEE, scoring 28

blast algorithm part 6 checking statistical significance
BLAST Algorithm Part 6Checking Statistical Significance
  • Reason we extended alignment regions
    • Give a more accurate picture of the probability of that BLOSUM score occurring by chance
  • Question: is a HSP significant?
  • Suppose we have a HSP such that
    • It scores S for a region of length L in sequences Q & D
  • Then the probability of two random sequences Q’ and D’ scoring S in a region of length L is calculated
    • Where Q’ is same length as Q and D’ is same length as D
  • This probability needs to be low for significance
blast algorithm part 7 reporting the alignments
BLAST Algorithm Part 7Reporting the Alignments
  • For each statistically significant HSP
    • The alignment is reported
  • If a sequence D has two HSPs with Query Q
    • Two different alignments are reported
  • Later versions of BLAST
    • Try and unify the two alignments
ad