Pairwise sequence alignment

Pairwise Sequence Alignment. Misha Kapushesky Slides: Stuart M. Brown, Fourie Joubert, NYU St. Petersburg Russia 2010. Protein Evolution. “For many protein sequences, evolutionary history can be traced back 1-2 billion years” -William Pearson

Pairwise Sequence Alignment

Misha Kapushesky

Slides: Stuart M. Brown, Fourie Joubert, NYU

St. Petersburg Russia 2010


Protein Evolution

“For many protein sequences, evolutionary history can be traced back 1-2 billion years”

-William Pearson

  • When we align sequences, we assume that they share a common ancestor

    • They are then homologous

  • Protein fold is much more conserved than protein sequence

  • DNA sequences tend to be less informative than protein sequences


Definition

  • Homology: related by descent

  • Homologous sequence positions

[Figure: the sequences ATTGCGC and ATCCGC, shown unaligned and then aligned so that homologous positions share a column]

ATTGCGC
AT-CCGC


Orthologous and paralogous

  • Orthologous sequences differ because they are found in different species (a speciation event)

  • Paralogous sequences differ due to a gene duplication event

  • Sequences may be both orthologous and paralogous


Pairwise Alignment

  • The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.

    • There are lots of possible alignments.

  • Two sequences can always be aligned.

  • Sequence alignments have to be scored.

  • Often there is more than one solution with the same score.


Methods of Alignment

  • By hand - slide sequences on two lines of a word processor

  • Dot plot

    • with windows

  • Rigorous mathematical approach

    • Dynamic programming (slow, optimal)

  • Heuristic methods (fast, approximate)

    • BLAST and FASTA

      • Word matching and hash tables


Align by Hand

GATCGCCTA_TTACGTCCTGGAC <--

--> AGGCATACGTA_GCCCTTTCGC

You still need some kind of scoring system to find the best alignment


Percent Sequence Identity

  • The extent to which two nucleotide or amino acid sequences are invariant

A C C T G A G - A G
A C G T G - G C A G

Column 3 is a mismatch and columns 6 and 8 are indels, so 7 of 10 columns match: 70% identical
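The 70% figure can be checked with a minimal sketch. The function name `percent_identity` is illustrative, and it assumes identity is counted over all alignment columns; other conventions divide by the length of the shorter sequence instead.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns where both rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # A column counts as a match only if the residues agree and neither is a gap.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * matches / len(row_a)


# The two rows from the slide, with '-' marking the indels.
print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))  # 70.0
```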


Dotplot:

A dotplot gives an overview of all possible alignments

[Figure: dot plot of the sequences ATTCACATA and TACATTACGTAC, labelled Sequence 1 and Sequence 2 on the axes]


Dotplot:

In a dotplot each diagonal corresponds to a possible (ungapped) alignment

[Figure: dot plot of ATTCACATA against TACATTACGTAC with one diagonal highlighted and, beside it, the one possible alignment that this diagonal represents]
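A dot plot with identity matching is small enough to sketch directly. The function names `dot_plot` and `diagonal_matches` are illustrative; the second counts matches along one diagonal, which is the same as scoring one ungapped alignment at a given offset.

```python
def dot_plot(seq1: str, seq2: str) -> list[str]:
    """One text row per residue of seq1; '*' marks identical residues."""
    return ["".join("*" if a == b else "." for b in seq2) for a in seq1]


def diagonal_matches(seq1: str, seq2: str, offset: int) -> int:
    """Matches on the diagonal that pairs seq1[i] with seq2[i + offset]."""
    return sum(
        1
        for i, a in enumerate(seq1)
        if 0 <= i + offset < len(seq2) and a == seq2[i + offset]
    )


for row in dot_plot("ATTCACATA", "TACATTACGTAC"):
    print(row)
```

Long runs of `*` along a diagonal correspond to well-matching ungapped alignments; a run that jumps to a neighbouring diagonal suggests an insertion or deletion.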


Insertions / Deletions in a Dotplot

[Figure: dot plot of TACTGTCAT against TACTGTTCAT (Sequence 1 and Sequence 2); the diagonal shifts by one position where the extra T occurs]

T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T


Dotplot (Window = 130 / Stringency = 9)

[Figure: dot plot comparing two hemoglobin chains, one per axis]


Word Size Algorithm

Word Size = 3

[Figure: the sequences TACGGTATG and ACAGTATC are compared word by word; a dot is placed wherever a word of 3 consecutive residues matches exactly]


Window / Stringency

Scoring Matrix Filtering

Matrix: PAM250

Window = 12, Stringency = 9

PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM

[Figure: three window positions along the pair above; two windows score 11 and pass the stringency of 9, one window scores 7 and does not]


Dotplot (Window = 18 / Stringency = 10)

[Figure: dot plot comparing two hemoglobin chains, one per axis]


Considerations

  • The window/stringency method is more sensitive than the wordsize method (ambiguities are permitted).

  • The smaller the window, the larger the weight of statistical (unspecific) matches.

  • With large windows the sensitivity for short sequences is reduced.

  • Insertions/deletions are not treated explicitly.
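The window/stringency filter can be sketched as follows. This is a sketch under stated assumptions: the name `window_dots` is illustrative, a window passes when its summed score is at least the stringency, and the default scoring is plain identity (1 for a match, 0 otherwise) rather than PAM250. A substitution matrix can be supplied through the `score` argument.

```python
def window_dots(seq1, seq2, window=12, stringency=9, score=None):
    """Return (i, j) window start pairs whose summed score reaches the stringency."""
    if score is None:
        # Assumption: identity scoring stands in for a real substitution matrix.
        score = lambda a, b: 1 if a == b else 0
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            # Sum the pairwise scores along the diagonal inside this window.
            total = sum(score(seq1[i + k], seq2[j + k]) for k in range(window))
            if total >= stringency:
                dots.append((i, j))
    return dots


# With a 6-residue window that must match exactly, only the shared PTHPLA start passes.
print(window_dots("PTHPLASKTQILPEDLASEDLTI", "PTHPLAGERAIGLARLAEEDFGM",
                  window=6, stringency=6))
```

Because each window is scored along a single diagonal, gaps are never introduced, which is the last consideration listed above.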


Alignment methods

  • Rigorous algorithms = Dynamic Programming

    • Needleman-Wunsch (global)

    • Smith-Waterman (local)

  • Heuristic algorithms (faster but approximate)

    • BLAST

    • FASTA


The Rocks game

  • N rocks, 2 piles, 2 players

  • Player can

    • Remove 1 rock from either pile

    • Remove 1 rock from each pile

  • Last to remove a rock wins

  • Assume 10 rocks in each pile – winning strategy?
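One way to explore the question is to tabulate, for every pair of pile sizes, whether the player about to move can force a win. The sketch below does this bottom-up, which is the same table-filling idea used for alignment later on. The function name `rocks_table` is illustrative.

```python
def rocks_table(n: int, m: int) -> list[list[bool]]:
    """win[i][j] is True if the player to move wins with piles of i and j rocks."""
    win = [[False] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            outcomes = []  # win/lose status of every position reachable in one move
            if i > 0:
                outcomes.append(win[i - 1][j])      # take one rock from pile 1
            if j > 0:
                outcomes.append(win[i][j - 1])      # take one rock from pile 2
            if i > 0 and j > 0:
                outcomes.append(win[i - 1][j - 1])  # take one rock from each pile
            # A position is winning if some move leaves the opponent in a losing one.
            # With no rocks left there are no moves, so (0, 0) is losing.
            win[i][j] = any(not w for w in outcomes)
    return win


table = rocks_table(10, 10)
print(table[10][10])  # False: the player who moves first at (10, 10) loses
```

In this table a position is losing exactly when both piles hold an even number of rocks, so the winning strategy is to always leave the opponent with two even piles.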



Basic principles of dynamic programming

  • Creation of an alignment path matrix

  • Stepwise calculation of score values

  • Backtracking (evaluation of the optimal path)


Dynamic Programming

  • Dynamic Programming is a very general programming technique.

  • It is applicable when a large search space can be structured into a succession of stages, such that:

    • the initial stage contains trivial solutions to sub-problems

    • each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions from an earlier stage

    • the final stage contains the overall solution


Creation of an alignment path matrix

Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences

  • Construct matrix F indexed by i and j (one index for each sequence)

  • F(i,j) is the score of the best alignment between the initial segment x1...xi of x and the initial segment y1...yj of y

  • Build F(i,j) recursively beginning with F(0,0) = 0


Creation of an alignment path matrix

  • If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)

  • Three possibilities:

    • xi and yj are aligned, F(i,j) = F(i-1,j-1) + s(xi,yj)

    • xi is aligned to a gap, F(i,j) = F(i-1,j) - d

    • yj is aligned to a gap, F(i,j) = F(i,j-1) - d

  • The best score up to (i,j) will be the largest of the three options
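The recurrence can be written almost literally as a memoised recursive function. This is a sketch for short sequences only (deep recursion is impractical for long ones). The names `global_score` and `match_mismatch` are illustrative, and the +1/-1 scoring with d = 1 is an assumed toy scheme rather than a real substitution matrix.

```python
from functools import lru_cache


def global_score(x: str, y: str, s, d: int) -> int:
    """Best global alignment score F(len(x), len(y)) with linear gap penalty d."""

    @lru_cache(maxsize=None)
    def F(i: int, j: int) -> int:
        if i == 0 and j == 0:
            return 0
        options = []
        if i > 0 and j > 0:
            options.append(F(i - 1, j - 1) + s(x[i - 1], y[j - 1]))  # xi aligned to yj
        if i > 0:
            options.append(F(i - 1, j) - d)                          # xi aligned to a gap
        if j > 0:
            options.append(F(i, j - 1) - d)                          # yj aligned to a gap
        return max(options)

    return F(len(x), len(y))


def match_mismatch(a: str, b: str) -> int:
    """Assumed toy scoring: +1 for identical residues, -1 otherwise."""
    return 1 if a == b else -1


print(global_score("ACGT", "AGT", match_mismatch, d=1))  # 2
```

The cache plays the role of the matrix F: each F(i,j) is computed once and then reused by its three neighbours.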


Backtracking

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -42  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Scores along the traceback path: 0, -8, -16, -17, -25, -20, -5, -13, -3, 3, -5, 1

Optimal global alignment:

H E A G A W G H E - E
- - P - A W - H E A E
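The fill and traceback steps can be sketched together as below. Assumptions: the function name `needleman_wunsch` is illustrative, scores are integers, the gap penalty d is linear, and the toy +1/-1 scoring replaces the protein substitution matrix behind the slide's table, so the numbers differ from those shown above. When several paths tie, this sketch prefers the diagonal move and returns only one optimal alignment.

```python
def needleman_wunsch(x: str, y: str, s, d: int):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i            # x prefix aligned entirely to gaps
    for j in range(1, m + 1):
        F[0][j] = -d * j            # y prefix aligned entirely to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # Backtracking: walk from the bottom-right cell to the top-left cell,
    # at each step following a move that explains the stored score.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


toy = lambda a, b: 1 if a == b else -1  # assumed toy scoring
print(needleman_wunsch("ACGT", "AGT", toy, d=1))  # (2, 'ACGT', 'A-GT')
```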


Global vs. Local Alignments

  • Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.

  • Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.


Global Alignment

Two closely related sequences:

needle (Needleman & Wunsch) creates an end-to-end alignment.


Global Alignment

Two sequences sharing several regions of local similarity:

1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67

|||||||||||||| | | | |||| || | | | ||

1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70


Global Alignment (Needleman-Wunsch)

  • The Needleman-Wunsch algorithm creates a global alignment over the length of both sequences (needle)

  • Global algorithms are often not effective for highly diverged sequences: they do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.

    • Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.

  • Global methods are useful when you want to force two sequences to align over their entire length


Local Alignment (Smith-Waterman)

  • Local alignment

    • Identify the most similar sub-region shared between two sequences

    • Smith-Waterman

    • EMBOSS: water
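A minimal sketch of the local variant follows. It differs from the global recurrence in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The function name `smith_waterman`, the integer toy scoring (+2/-1) and the linear gap penalty are assumptions for illustration; this is not the implementation used by EMBOSS water.

```python
def smith_waterman(x: str, y: str, s, d: int):
    """Return (score, local_x, local_y) for one best-scoring local alignment."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,                                        # start a fresh local alignment
                H[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                H[i - 1][j] - d,
                H[i][j - 1] - d,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero-scoring cell is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


toy = lambda a, b: 2 if a == b else -1  # assumed toy scoring
print(smith_waterman("TTTACGTTT", "GGGACGGGG", toy, d=2))  # (6, 'ACG', 'ACG')
```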


Parameters of Sequence Alignment

  • Scoring Systems:

    • Each symbol pairing is assigned a numerical value, based on a symbol comparison table.

  • Gap Penalties:

    • Opening: The cost to introduce a gap

    • Extension: The cost to elongate a gap


DNA Scoring Systems

  • Very simple

Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact

     A   G   C   T
A    1   0   0   0
G    0   1   0   0
C    0   0   1   0
T    0   0   0   1

Match: 1

Mismatch: 0

Score = 5


Protein Scoring Systems

Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM

Scoring matrix (excerpt):

     C   S   T   P   A   G   N   D  ...
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6
N   -3   1   0  -2  -2   0   5
D   -3   0  -1  -1  -2  -1   1   6
...

T:G = -2

T:T = 5

Score = 48
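Looking up pair scores in a lower-triangular table can be sketched as below. The values are copied from the excerpt on the slide and cover only the eight residues C, S, T, P, A, G, N, D, so the short test sequences are made up from those letters. The names `ORDER`, `LOWER`, `pair_score` and `ungapped_score` are illustrative.

```python
ORDER = "CSTPAGND"
# Lower triangle of the excerpt shown on the slide, one row per residue in ORDER.
LOWER = [
    [9],
    [-1, 4],
    [-1, 1, 5],
    [-3, -1, -1, 7],
    [0, 1, 0, -1, 4],
    [-3, 0, -2, -2, 0, 6],
    [-3, 1, 0, -2, -2, 0, 5],
    [-3, 0, -1, -1, -2, -1, 1, 6],
]


def pair_score(a: str, b: str) -> int:
    """Symmetric lookup: the table stores each pair once, with row index >= column index."""
    i, j = ORDER.index(a), ORDER.index(b)
    if i < j:
        i, j = j, i
    return LOWER[i][j]


def ungapped_score(seq1: str, seq2: str) -> int:
    """Sum of pair scores over an ungapped alignment of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(pair_score("T", "G"), pair_score("T", "T"))  # -2 5
print(ungapped_score("CAT", "CGT"))                # 9 + 0 + 5 = 14
```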


Protein Scoring Systems

  • Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.

[Figure: Venn diagram grouping the amino acids by property: tiny, small, aliphatic, hydrophobic, aromatic, polar, charged, positive. Cysteine appears twice, as C (S+S) and C (SH).]


Protein Scoring Systems

  • Scoring matrices reflect:

    • number of mutations needed to convert one amino acid to another

    • chemical similarity

    • observed mutation frequencies

    • the probability of occurrence of each amino acid

  • Widely used scoring matrices:

    • PAM

    • BLOSUM


PAM matrices

  • Family of matrices PAM 80, PAM 120, PAM 250

  • The number with a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based

  • Greater numbers denote greater distances


PAM (Point Accepted Mutation) matrices

  • The numbers of replacements were used to compute a so-called PAM-1 matrix.

  • The PAM-1 matrix reflects an average change of 1% of all amino acid positions. PAM matrices for larger evolutionary distances can be extrapolated from the PAM-1 matrix.

  • PAM250 = 250 mutations per 100 residues.

  • Greater numbers mean bigger evolutionary distance
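The extrapolation step amounts to raising the PAM-1 mutation probability matrix to a power. The sketch below uses a made-up three-letter alphabet purely for illustration: the matrix `TOY_PAM1` is hypothetical (each row changes with probability 1%), not Dayhoff's data, and the published PAM scoring matrices are log-odds values derived from such probabilities rather than the probabilities themselves.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical PAM-1-like matrix: row i gives the probabilities that residue i
# stays the same or is replaced after one PAM unit (1% total change per row).
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal entries are much smaller than 0.99, which reflects the greater evolutionary distance, while each row still sums to 1.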


PAM (Point Accepted Mutation) matrices

  • Derived from global alignments of protein families. Family members share at least 85% identity (Dayhoff et al., 1978).

  • Construction of phylogenetic tree and ancestral sequences of each protein family

  • Computation of number of replacements for each pair of amino acids


PAM 250

Highlighted entries: C:W = -8, W:W = 17

A R N D C Q E G H I L K M F P S T W Y V B Z

A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 2 1

R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 1 2

N 0 0 2 2 -4 1 1 0 2 -2 -3 1 -2 -3 0 1 0 -4 -2 -2 4 3

D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 5 4

C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -3 -4

Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 3 5

E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 4 5

G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 2 1

H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 3 3

I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -1 -1

L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -2 -1

K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 2 2

M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -1 0

F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -3 -4

P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 1 1

S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 2 1

T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 2 1

W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -4 -4

Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -2 -3

V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 0 0

B 2 1 4 5 -3 3 4 2 3 -1 -2 2 -1 -3 1 2 2 -4 -2 0 6 5

Z 1 2 3 4 -4 5 5 1 3 -1 -1 2 0 -4 1 1 1 -4 -3 0 5 6


PAM - limitations

  • Based on only one original dataset

  • Examines proteins with few differences (85% identity)

  • Based mainly on small globular proteins so the matrix is biased


BLOSUM matrices

  • Different BLOSUMn matrices are calculated independently from BLOCKS (ungapped local alignments)

  • BLOSUMn is based on a cluster of BLOCKS of sequences that share at least n percent identity

  • BLOSUM62 represents closer sequences than BLOSUM45


BLOSUM (Blocks Substitution Matrix)

  • Derived from alignments of domains of distantly related proteins (Henikoff & Henikoff, 1992).

  • Occurrences of each amino acid pair in each column of each block alignment are counted.

  • The numbers derived from all blocks were used to compute the BLOSUM matrices.

Example: a block column containing the residues A, A, C, E, C gives these pair counts:

A - C = 4

A - E = 2

C - E = 2

A - A = 1

C - C = 1
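The pair counts for a single block column can be reproduced with a short sketch. The function name `column_pair_counts` is illustrative; it counts every unordered pair of residues in the column once, which covers only the counting step and not the later conversion to log-odds scores.

```python
from collections import Counter
from itertools import combinations


def column_pair_counts(column: str) -> Counter:
    """Count unordered residue pairs among all sequences in one alignment column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts[tuple(sorted((a, b)))] += 1  # (A, C) and (C, A) are the same pair
    return counts


for pair, n in column_pair_counts("AACEC").most_common():
    print(pair, n)
```

A column of 5 residues yields 10 pairs in total, matching 4 + 2 + 2 + 1 + 1 above.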



BLOSUM (Blocks Substitution Matrix)

  • Sequences within blocks are clustered according to their level of identity.

  • Clusters are counted as a single sequence.

  • Different BLOSUM matrices differ in the percentage of sequence identity used in clustering.

  • The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used to build the matrix.

  • Greater numbers mean smaller evolutionary distance.


PAM vs. BLOSUM

Roughly equivalent matrices, ordered from closer to more distant sequences:

PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45

  • BLOSUM62 for general use

  • BLOSUM80 for close relations

  • BLOSUM45 for distant relations

  • PAM120 for general use

  • PAM60 for close relations

  • PAM250 for distant relations


TIPS on choosing a scoring matrix

  • Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).

  • When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices, for distantly related proteins higher PAM or lower BLOSUM matrices.

  • For database searching the commonly used matrix is BLOSUM62.


Scoring Insertions and Deletions

A T G T A A T G C A

T A T G T G G A A T G A

A T G T - - A A T G C A

T A T G T G G A A T G A

insertion / deletion

The creation of a gap is penalized with a negative score value.


Why Gap Penalties?

Gaps not permitted Score: 0

1 GTGATAGACACAGACCGGTGGCATTGTGG 29

||| | | ||| | || || |

1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29

Match = 5

Mismatch = -4

Gaps allowed but not penalized Score: 88

1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29

||| || | | | ||| || | | || || |

1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29


Why Gap Penalties?

  • The optimal alignment of two similar sequences is usually that which

    • maximizes the number of matches and

    • minimizes the number of gaps.

    • There is a tradeoff between these two

      • - adding gaps reduces mismatches

  • Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences.

  • Penalizing gaps forces alignments to have relatively few gaps.


    Gap Penalties

    • How to balance gaps with mismatches?

    • Gaps must get a steep penalty, or else you’ll end up with nonsense alignments.

    • In real sequences, multi-base (or amino acid) gaps are quite common

      • genetic insertion/deletion events

  • “Affine” gap penalties give a big penalty for each new gap, but a much smaller “gap extension” penalty.


    Scoring Insertions and Deletions

    match = 1

    mismatch = 0

    Without gaps, Total Score: 4

      A T G T T A T A C
    T A T G T G C G T A T A

    With one gap of length 3 (insertion / deletion), Total Score: 8 - 3.2 = 4.8

      A T G T - - - T A T A C
    T A T G T G C G T A T A

    Gap parameters:

    d = 3 (gap opening)

    e = 0.1 (gap extension)

    g = 3 (gap length)

    γ(g) = -d - (g - 1)·e = -3 - (3 - 1)·0.1 = -3.2
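The affine penalty and the resulting total can be sketched as follows. Assumptions: the names `affine_gap_penalty` and `score_alignment` are illustrative, only the overlapping columns of the slide's alignment are scored (the unpaired residues at the two ends are left out), and consecutive gap columns are treated as a single gap regardless of which row they fall in.

```python
def affine_gap_penalty(g: int, d: float = 3.0, e: float = 0.1) -> float:
    """gamma(g) = -d - (g - 1) * e for a gap of length g >= 1."""
    return -d - (g - 1) * e


def score_alignment(row_a, row_b, match=1.0, mismatch=0.0, d=3.0, e=0.1) -> float:
    """Score two aligned rows: match/mismatch per column plus one affine penalty per gap run."""
    total, run = 0.0, 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            run += 1                 # still inside the current gap
            continue
        if run:                      # a gap just ended: charge it once
            total += affine_gap_penalty(run, d, e)
            run = 0
        total += match if a == b else mismatch
    if run:                          # gap running to the end of the alignment
        total += affine_gap_penalty(run, d, e)
    return total


print(round(affine_gap_penalty(3), 2))                             # -3.2
print(round(score_alignment("ATGT---TATA", "ATGTGCGTATA"), 2))     # 4.8
```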


    Modification of Gap Penalties

    Score Matrix: BLOSUM62

    1 ...VLSPADKFLTNV 12

    ||||

    1 VFTELSPAKTV.... 11

    gap opening penalty = 3

    gap extension penalty = 0.1

    score = 6.3

    1 V...LSPADKFLTNV 12

    | |||| | | |

    1 VFTELSPA.K..T.V 11

    gap opening penalty = 0

    gap extension penalty = 0.1

    score = 11.3


    BLAST Algorithm: Basic Local Alignment Search Tool

    • Fast alignment technique(s)

      • Similar to FASTA algorithms (not used much now)

      • There are more accurate ones, but they’re slower

      • BLAST makes heavy use of lookup tables

    • Idea: statistically significant alignments (hits)

      • Will have regions of at least 3 identical letters

        • Or at least regions that score highly with respect to a BLOSUM matrix

    • Based on small local alignments

    CCNDHRKMTCSPNDNNRK
    TTNDHRMTACSPDNNNKH

    more likely than

    CCNDHRKMTCSPNDNNRK
    YTNHHMMTTYSLDNNNKK


    BLAST Overview

    • Given a query sequence Q

    • Seven main stages

      • Remove (filter) low complexity regions from Q

      • Harvest k-tuples (triples) from Q

      • Expand each triple into ~50 high scoring words

      • Seed a set of possible alignments

      • Generate high scoring pairs (HSPs) from the seeds

      • Test significance of matches from HSPs

      • Report the alignments found from the HSPs


    BLAST Algorithm Part 1: Removing Low-complexity Segments

    • Imagine matching

      • HHHHHHHHKMAY and HHHHHHHHURHD

      • The KMAY and URHD are the interesting parts

      • But this pair scores highly using BLOSUM

    • It’s a good idea to remove the HHHHHHHs

      • From the query sequence (low complexity)

    • SEG program does this kind of thing

      • Comes with most BLAST implementations

      • Often doesn’t do much, and it can be turned off


    Removing Low-complexity Segments

    • Given a segment of length L

      • With each amino acid occurring n1 n2 … n20 times

    • Use the following measure for “compositional complexity”:

    • To use this measure

      • Slide a “window” of ~12 residues along Query Sequence Q

      • Use a threshold to determine low complexity windows

      • Use a minimise routine to replace the segment

        • With an optimal minimised segment (or just an X)
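The formula for the complexity measure is not present in this transcript. The sketch below assumes the compositional complexity commonly associated with the SEG program, K = (1/L) · log_N( L! / (n1! · n2! · ... · nN!) ) with N = 20 for proteins; treat that choice, the function names and the threshold value as assumptions. Only the window scan is shown, not the segment-minimising step.

```python
import math
from collections import Counter


def complexity(segment: str, alphabet_size: int = 20) -> float:
    """Assumed measure: (1/L) * log_N(L! / prod(n_i!)), computed via log-gamma."""
    L = len(segment)
    log_ways = math.lgamma(L + 1) - sum(
        math.lgamma(n + 1) for n in Counter(segment).values()
    )
    return log_ways / (L * math.log(alphabet_size))


def low_complexity_starts(query: str, window: int = 12, threshold: float = 0.2):
    """0-based starts of windows whose complexity falls below a hypothetical threshold."""
    return [
        i
        for i in range(len(query) - window + 1)
        if complexity(query[i:i + window]) < threshold
    ]


print(complexity("HHHHHHHHHHHH"))            # 0.0 for a homopolymer window
print(round(complexity("HHHHHHHHKMAY"), 3))  # low, dominated by the run of H
print(round(complexity("ACDEFGHIKLMN"), 3))  # higher, all residues distinct
```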


    BLAST Algorithm Part 2: Harvesting k-tuples

    • Collect all the k-tuples of elements in Q

      • k set to 3 for residues and 11 for DNA (can vary)

      • Triples are called ‘words’. Call this set W

    S T S L S T S D K L M R

    STS

    TSL

    SLS

    LST
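Harvesting the words while remembering where each one occurs in Q can be sketched with a dictionary. The function name `harvest_words` is illustrative, and positions are reported 1-based to match the later slides.

```python
def harvest_words(query: str, k: int = 3) -> dict[str, list[int]]:
    """Map each k-tuple in the query to the 1-based positions where it starts."""
    positions: dict[str, list[int]] = {}
    for i in range(len(query) - k + 1):
        positions.setdefault(query[i:i + k], []).append(i + 1)
    return positions


words = harvest_words("STSLSTSDKLMR")
print(list(words)[:4])  # ['STS', 'TSL', 'SLS', 'LST']
print(words["STS"])     # [1, 5]
```

Keeping the positions alongside each word is what later lets a database hit be placed back onto the query.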


    BLAST Algorithm Part 3: Finding High Scoring Triples

    • Given a word w from W

      • Find all other words w’ of same length (3), which:

        • Appear in some database sequence

        • Blosum(w,w’) > a threshold T

    • Choose T to limit number to around 50

      • Call these the high scoring triples (words) for w

    • Example: letting w=PQG, set T to be 13

      • Suppose that PQG, PEG, PSG, PQA are found in database

      • Blosum(PQG,PQG) = 18, Blosum(PQG,PEG) = 15

      • Blosum(PQG,PSG) = 13, Blosum(PQG,PQA) = 12

      • Hence, PQG and PEG only are kept
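The PQG example can be reproduced with a sketch that sums pair scores position by position and keeps words scoring strictly above T, as the slide does. The table `PAIR_SCORES` contains only the BLOSUM62 entries this example needs, and the function names are illustrative.

```python
# Only the BLOSUM62 entries needed for this example, keyed by sorted residue pair.
PAIR_SCORES = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("E", "Q"): 2, ("Q", "S"): 0, ("A", "G"): 0,
}


def word_score(w: str, v: str) -> int:
    """Sum of substitution scores between two words of equal length."""
    return sum(PAIR_SCORES[tuple(sorted((a, b)))] for a, b in zip(w, v))


def high_scoring_words(w: str, candidates: list[str], threshold: int) -> list[str]:
    """Keep candidate words whose score against w is strictly above the threshold."""
    return [v for v in candidates if word_score(w, v) > threshold]


found = ["PQG", "PEG", "PSG", "PQA"]
print([word_score("PQG", v) for v in found])      # [18, 15, 13, 12]
print(high_scoring_words("PQG", found, 13))       # ['PQG', 'PEG']
```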


    Finding High Scoring Triples

    • For each w in W, find all the high scoring words

      • Organise these sets of words

        • Remembering all the places where w was found in Q

    • Each high scoring triple is going to be a seed

      • In order to generate possible alignment(s)

        • One seed can generate more than one alignment

    • End of the first half of the algorithm

      • Going to find alignments now


    BLAST Algorithm Part 4: Seeding Possible Alignments

    • Look at first triple V in query sequence Q

      • Actually from Q (not from W - which has omissions)

      • Retrieve the set of ~50 high scoring words

        • Call this set HV

      • Retrieve the list of places in Q where V occurs

        • Call this set PV

    • For every pair (word, pos)

      • Where word is from HV and pos is from PV

        • Find all the database sequences D

          • Which have an exact match with word at position pos’

        • Store an alignment between Q and D

          • With V matched at pos in Q and pos’ in D

    • Repeat this for the second triple in Q, and so on


    Seeding Possible Alignments: Example

    • Suppose Q = QQGPHUIQEGQQG

    • Suppose V = QQG, HV = {QQG, QEG}

      • Then PV = {1, 11}

    • Suppose we are looking in the database at:

      • D = PKLMMQQGKQEG

    • Then the alignments seeded are:

      QQGPHUIQEGQQG    word=QQG, pos=1
      PKLMMQQGKQEG

      QQGPHUIQEGQQG    word=QQG, pos=11
      PKLMMQQGKQEG

      QQGPHUIQEGQQG    word=QEG, pos=1
      PKLMMQQGKQEG

      QQGPHUIQEGQQG    word=QEG, pos=11
      PKLMMQQGKQEG
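The four seeds in this example can be enumerated with a short sketch. The function names `find_all` and `seed_alignments` are illustrative; positions are 1-based, and each seed records the word, its position pos in Q and the matching position pos' in D.

```python
def find_all(text: str, word: str) -> list[int]:
    """1-based start positions of every (possibly overlapping) occurrence of word."""
    return [
        i + 1
        for i in range(len(text) - len(word) + 1)
        if text.startswith(word, i)
    ]


def seed_alignments(query: str, v: str, high_scoring: list[str], db_seq: str):
    """Return (word, pos_in_Q, pos_in_D) for every seed of triple v against db_seq."""
    seeds = []
    for pos in find_all(query, v):            # PV: places in Q where V occurs
        for word in high_scoring:             # HV: high scoring words for V
            for pos_d in find_all(db_seq, word):
                seeds.append((word, pos, pos_d))
    return seeds


Q = "QQGPHUIQEGQQG"
D = "PKLMMQQGKQEG"
print(find_all(Q, "QQG"))  # [1, 11]
for seed in seed_alignments(Q, "QQG", ["QQG", "QEG"], D):
    print(seed)
```

A real implementation would look words up in a precomputed index of the database rather than scanning each sequence, which is where the lookup tables mentioned earlier come in.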


    BLAST Algorithm Part 5: Generating High Scoring Pairs (HSPs)

    • For each alignment A

      • Where sequences Q and D are matched

      • Original region matching was M

    • Extend M to the left

      • Until the Blosum score begins to decrease

    • Extend M to the right

      • Until the Blosum score begins to decrease

    • Larger stretch of sequence now matches

      • May have higher score than the original triple

      • Call these high scoring pairs

    • Throw away any alignments for which the score S of the extended region M is lower than some cutoff score


    Extending Alignment Regions: Example

    QQGPHUIQEGQQGKEEDPP Blosum(QQG,QQG) = 16

    PKLMMQQGKQEGM

    QQGPHUIQEGQQGKEEDPP Blosum(QQGK,QQGK) = 21

    PKLMMQQGKQEGM

    QQGPHUIQEGQQGKEEDPP Blosum(QQGKE,QQGKQ) = 23

    PKLMMQQGKQEGM

    QQGPHUIQEGQQGKEEDPP Blosum(QQGKEE,QQGKQE) = 28

    PKLMMQQGKQEGM

    QQGPHUIQEGQQGKEEDPP Blosum(QQGKEED,QQGKQEG) = 27

    PKLMMQQGKQEGM

    So, the extension to the right stops here

    HSP (before left extension) is QQGKEE, scoring 28
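The right extension in this example can be sketched as an ungapped walk that stops at the first step that would lower the score. This follows the simplified rule on the slides; practical BLAST implementations use a more tolerant drop-off criterion. The table `SCORES` holds only the BLOSUM62 entries this example touches, and the function name `extend_right` is illustrative.

```python
# Only the BLOSUM62 entries needed for this example, keyed by sorted residue pair.
SCORES = {
    ("Q", "Q"): 5, ("G", "G"): 6, ("K", "K"): 5,
    ("E", "Q"): 2, ("E", "E"): 5, ("D", "G"): -1,
}


def s(a: str, b: str) -> int:
    return SCORES[tuple(sorted((a, b)))]


def extend_right(q: str, d: str, q_start: int, d_start: int, length: int):
    """Extend a seed (0-based starts) to the right until the score would decrease."""
    score = sum(s(q[q_start + k], d[d_start + k]) for k in range(length))
    while q_start + length < len(q) and d_start + length < len(d):
        step = s(q[q_start + length], d[d_start + length])
        if step < 0:          # the running score would begin to decrease
            break
        score += step
        length += 1
    return q[q_start:q_start + length], d[d_start:d_start + length], score


Q = "QQGPHUIQEGQQGKEEDPP"
D = "PKLMMQQGKQEGM"
# Seed: QQG at position 11 in Q and position 6 in D (0-based 10 and 5).
print(extend_right(Q, D, 10, 5, 3))  # ('QQGKEE', 'QQGKQE', 28)
```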


    BLAST Algorithm Part 6: Checking Statistical Significance

    • Reason we extended alignment regions

      • Give a more accurate picture of the probability of that BLOSUM score occurring by chance

    • Question: is a HSP significant?

    • Suppose we have a HSP such that

      • It scores S for a region of length L in sequences Q & D

    • Then the probability of two random sequences Q’ and D’ scoring S in a region of length L is calculated

      • Where Q’ is same length as Q and D’ is same length as D

    • This probability needs to be low for significance
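The slides do not give a formula for this probability. A common choice for ungapped local alignments is the Karlin-Altschul approximation, in which the expected number of chance HSPs scoring at least S is E = K·m·n·exp(-λS) and the probability of seeing at least one is 1 - exp(-E). The sketch below assumes that model; the default values of `lam` and `K` are illustrative figures of the kind quoted for ungapped BLOSUM62 scoring and should be treated as assumptions.

```python
import math


def expected_hits(score: float, m: int, n: int,
                  lam: float = 0.318, K: float = 0.134) -> float:
    """Expected number of chance HSPs with at least this score (assumed model)."""
    return K * m * n * math.exp(-lam * score)


def p_value(score: float, m: int, n: int,
            lam: float = 0.318, K: float = 0.134) -> float:
    """Probability of at least one chance HSP with at least this score."""
    return -math.expm1(-expected_hits(score, m, n, lam, K))


# m and n are the lengths of Q and D from the extension example (19 and 13).
for S in (16, 28, 40):
    print(S, expected_hits(S, 19, 13), p_value(S, 19, 13))
```

Higher scores give smaller expected counts and probabilities, and longer sequences give more opportunities for a chance hit, which is why the lengths of Q and D enter the calculation.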


    BLAST Algorithm Part 7: Reporting the Alignments

    • For each statistically significant HSP

      • The alignment is reported

    • If a sequence D has two HSPs with Query Q

      • Two different alignments are reported

    • Later versions of BLAST

      • Try and unify the two alignments