
Scoring the Alignment of Amino Acid Sequences

Constructing PAM and Blosum Matrices


Quotes from page 11 of our Lab Manual:

Proteins are huge molecules made up of large numbers of amino acids.

The proteins are usually 100 to 500 amino acids long.

There are 20 different amino acids that make up the proteins.


Name            Abbr.     Linear structure formula
======================================================
Alanine         Ala  A    CH3-CH(NH2)-COOH
Arginine        Arg  R    HN=C(NH2)-NH-(CH2)3-CH(NH2)-COOH
Asparagine      Asn  N    H2N-CO-CH2-CH(NH2)-COOH
Aspartic acid   Asp  D    HOOC-CH2-CH(NH2)-COOH
Cysteine        Cys  C    HS-CH2-CH(NH2)-COOH
Glutamine       Gln  Q    H2N-CO-(CH2)2-CH(NH2)-COOH
Glutamic acid   Glu  E    HOOC-(CH2)2-CH(NH2)-COOH
Glycine         Gly  G    NH2-CH2-COOH
Histidine       His  H    NH-CH=N-CH=C-CH2-CH(NH2)-COOH
Isoleucine      Ile  I    CH3-CH2-CH(CH3)-CH(NH2)-COOH
Leucine         Leu  L    (CH3)2-CH-CH2-CH(NH2)-COOH
Lysine          Lys  K    H2N-(CH2)4-CH(NH2)-COOH
Methionine      Met  M    CH3-S-(CH2)2-CH(NH2)-COOH
Phenylalanine   Phe  F    Ph-CH2-CH(NH2)-COOH
Proline         Pro  P    NH-(CH2)3-CH-COOH
Serine          Ser  S    HO-CH2-CH(NH2)-COOH
Threonine       Thr  T    CH3-CH(OH)-CH(NH2)-COOH
Tryptophan      Trp  W    Ph-NH-CH=C-CH2-CH(NH2)-COOH
Tyrosine        Tyr  Y    HO-p-Ph-CH2-CH(NH2)-COOH
Valine          Val  V    (CH3)2-CH-CH(NH2)-COOH


The number, variety, and chemical properties of the amino acids make scoring a pair of amino acids much more complicated than scoring a pair of nucleotides.

In the late 1970s, Dayhoff, Schwartz, and Orcutt examined a database of similar proteins with common ancestors to obtain substitution frequency data.

They looked at 71 groups of protein sequences whose members differed at no more than 15% of their residues, i.e. were at least 85% similar. They then built phylogenetic trees in which the inferred ancestral sequences require as few changes as possible, given the data, along each branch.

From these trees, a value is determined for each entry A_ab of a matrix that records how often residues a and b were substituted for one another.



[Figure: Constructing a Parsimonious Phylogenetic Tree (taken from page 40 of Krane & Raymer). Branch labels in the figure include A -> G, I -> L, A -> L, C -> S, and G -> A.]

Dayhoff and her team used sequences that were at least 85% similar and calculated the frequency with which each amino acid was substituted for each of the other amino acids.
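The counting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is represented as a list of (ancestor, descendant) pairs of equal-length, already aligned sequences, and the function name `count_substitutions` and the toy sequences are hypothetical, not Dayhoff's data.

```python
from collections import Counter

def count_substitutions(edges):
    """Count substitutions over (ancestor, descendant) sequence pairs.

    Each edge holds two equal-length, already aligned sequences.
    Counts are symmetric: a change between a and b is stored once,
    under the sorted pair, whichever direction it occurred in.
    """
    counts = Counter()
    for ancestor, descendant in edges:
        if len(ancestor) != len(descendant):
            raise ValueError("sequences on an edge must have equal length")
        for a, b in zip(ancestor, descendant):
            if a != b:  # only changes are recorded, so the diagonal stays empty
                counts[tuple(sorted((a, b)))] += 1
    return counts

# Hypothetical toy tree: one inferred ancestor with two descendants.
edges = [
    ("ACGI", "GCGI"),  # one change: A -> G
    ("ACGI", "ASGL"),  # two changes: C -> S and I -> L
]
counts = count_substitutions(edges)
```

Storing each change under a sorted pair reflects the symmetry assumption: a change from a to b is counted together with a change from b to a.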


Dayhoff’s Data

NOTE: The diagonal is blank since only the changes are recorded. Also, the upper triangular half of the matrix is not shown since the changes a -> b and b -> a are assumed to be symmetrical.


Calculating the Entry in the Substitution Matrix


P(b|a,t) = probability that a is replaced by b in t time units, adjusted for divergence time (Dayhoff time units)

q_a q_b = probability that a and b would be paired by chance

= (frequency of a)(frequency of b)

s(a, b | t) = the entry at position (a, b) or (b, a) in the scoring matrix



$$ s(a, b \mid t) = \log \frac{q_a \, P(b \mid a, t)}{q_a \, q_b} = \log \frac{P(b \mid a, t)}{q_b} $$


The Probabilities Found By Dayhoff

The entry in cell M_ab is the probability that a would be replaced by b in one Dayhoff time unit, multiplied by 100. Thus, for example, Alanine would be replaced by Proline 0.22% of the time.


Note: The previous matrix is NOT the scoring matrix. It is used to derive the scoring matrix. Recall:

$$ s(a, b \mid t) = \log \frac{P(b \mid a, t)}{q_b} $$

However, the probability matrix is the main tool for deriving a sensible scoring matrix. To find the probability that amino acid a will be replaced by amino acid b t time units later, we need to calculate the (a, b)-th entry of the matrix M^t, the t-th power of M. After calculating this entry, we apply the “log-odds” formula given above.
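A minimal sketch of this two-step calculation, assuming a toy three-letter alphabet rather than the 20 amino acids. The matrix `M`, the background frequencies `q`, the helper names, and the base-10 logarithm are all illustrative assumptions; the real calculation uses Dayhoff's 20 x 20 probability matrix and her choice of scaling.

```python
import math

def mat_mul(x, y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, t):
    """Return the t-th power of m by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = mat_mul(result, m)
    return result

def log_odds(m, q, t):
    """Score matrix with entries log10(P(b|a,t) / q_b), where P(b|a,t) = (m^t)[a][b]."""
    mt = mat_pow(m, t)
    n = len(m)
    return [[math.log10(mt[a][b] / q[b]) for b in range(n)] for a in range(n)]

# Hypothetical one-time-unit probabilities: row a, column b = P(a replaced by b).
M = [
    [0.980, 0.015, 0.005],
    [0.015, 0.980, 0.005],
    [0.005, 0.005, 0.990],
]
q = [1 / 3, 1 / 3, 1 / 3]  # assumed background frequencies

s1 = log_odds(M, q, 1)      # scores for one time unit
s250 = log_odds(M, q, 250)  # scores for 250 time units
```

In this toy example the one-unit scores are strongly positive on the diagonal and strongly negative off it, while after 250 units every entry of M^t has moved toward the background frequency, so the scores shrink toward zero.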

The logarithm is used in the scoring formula because, among other things, it lets us add the scores of the aligned residues when we compute the score for an overall alignment of two sequences.
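Because the logarithm of a product is the sum of the logarithms, multiplying the odds ratios of independent columns corresponds to adding their scores. The sketch below shows this for an ungapped alignment. The function name `alignment_score` and the score values are illustrative assumptions, not entries read from the matrices in this presentation.

```python
def alignment_score(seq1, seq2, score):
    """Sum the per-column scores of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(score[(a, b)] for a, b in zip(seq1, seq2))

# Illustrative scores only; each unordered pair is entered once.
half_matrix = {
    ("A", "A"): 2, ("A", "G"): 1, ("G", "G"): 5,
    ("I", "I"): 5, ("I", "L"): 2, ("L", "L"): 6,
}
# Mirror the entries so that score[(a, b)] == score[(b, a)].
score = dict(half_matrix)
score.update({(b, a): s for (a, b), s in half_matrix.items()})

total = alignment_score("AGL", "GGI", score)  # 1 + 5 + 2
```

Gap penalties are outside the scope of this sketch; it only covers columns in which two residues are aligned.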


The matrix of scores derived from the original probability matrix is called a 1 PAM matrix.

PAM stands for Point Accepted Mutation or Percent Accepted Mutation. Dayhoff’s term was Accepted Point Mutation, but PAM rolls off the tongue more easily than APM.

The 1 means that, given the degree of similarity between the sequences used to build the matrix, its entries correspond to one evolutionary time unit (an average of one accepted point mutation per 100 residues).

Matrices representing longer times are called PAMt matrices and are derived from M^t, the t-th power of M.

The most widely used matrix is PAM250, the log-odds matrix based on M^250, the 250th power of M.

This matrix shows the probability of change over a long period of time.

However, for closely related sequences, say mouse and rat MSH2, a PAM10 matrix may be more appropriate.


The PAM250 Matrix

We only show the top half because the bottom half is a reflection of the top half, i.e. S_{a,b} = S_{b,a}.


Discussion of PAM

  • The 1 PAM matrix was derived by constructing hypothetical phylogenetic trees relating sequences in 71 families.
  • The higher the power of the matrix, the more evolutionary time units the matrix represents.
  • Criticism: raising M to high powers does not capture the true difference between short-time substitutions and long-time substitutions.
  • Note: short-time substitutions are dominated by amino acid substitutions that come from a single base change in the codon triplet of an amino acid, whereas long-time substitutions show all kinds of codon changes.
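The point about single base changes can be checked directly against the genetic code. The sketch below is an illustration only: it uses a partial table of standard DNA codons for six residues, and the name `min_base_changes` is hypothetical. It reports the smallest number of base changes needed to turn a codon for one amino acid into a codon for another.

```python
from itertools import product

# Partial standard genetic code (DNA codons); only the residues used below.
CODONS = {
    "L": ["CTT", "CTC", "CTA", "CTG", "TTA", "TTG"],
    "I": ["ATT", "ATC", "ATA"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
    "F": ["TTT", "TTC"],
    "Y": ["TAT", "TAC"],
    "W": ["TGG"],
}

def min_base_changes(a, b):
    """Smallest number of differing bases between any codon of a and any codon of b."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1, c2 in product(CODONS[a], CODONS[b])
    )

single_step = {pair: min_base_changes(*pair) for pair in [("L", "I"), ("L", "V"), ("Y", "F")]}
three_steps = min_base_changes("W", "I")
```

Pairs such as L and I, L and V, or Y and F are one base change apart, whereas W and I differ at all three codon positions, so a direct W to I substitution is the kind of change expected mainly over longer times.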


BLOSUM (BLOcks SUbstitution Matrix)



The criticism given at the end of the last discussion is that the large PAM matrices tend to minimize the effects of short-time substitutions such as L <-> I, L <-> V, and Y <-> F.

In 1991-1992, Henikoff and Henikoff used the BLOCKS database at the Fred Hutchinson Cancer Research Center.

This database contains blocks of multiple alignments of more distantly related sequences.

Such a database can be used to derive scores more directly.



  • Sequences from each block were clustered.
  • Two sequences were placed in the same cluster if their percent identity was above some level, say α%.
  • The frequency A_ab is calculated by observing residue a in one cluster aligned against residue b in another cluster.
  • Corrections are made for clusters of differing sizes.
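The steps above can be sketched as follows. Several choices here are assumptions made for illustration: clustering is single-linkage on percent identity with a "greater than or equal to" cutoff, the size correction weights every sequence pair by 1 / (size of cluster 1 x size of cluster 2) so that each cluster counts as one sequence, and the block of four-residue sequences is made up.

```python
from collections import defaultdict
from itertools import combinations

def percent_identity(s1, s2):
    """Percentage of aligned positions at which two equal-length sequences agree."""
    return 100.0 * sum(a == b for a, b in zip(s1, s2)) / len(s1)

def cluster_sequences(block, threshold):
    """Single-linkage clustering: a sequence joins every cluster it links to."""
    clusters = []
    for seq in block:
        linked, rest = [], []
        for c in clusters:
            hit = any(percent_identity(seq, other) >= threshold for other in c)
            (linked if hit else rest).append(c)
        merged = [seq] + [s for c in linked for s in c]
        clusters = rest + [merged]
    return clusters

def weighted_pair_counts(clusters):
    """Count residue pairs between clusters; each cluster has total weight 1."""
    counts = defaultdict(float)
    for c1, c2 in combinations(clusters, 2):
        weight = 1.0 / (len(c1) * len(c2))
        for s1, s2 in ((x, y) for x in c1 for y in c2):
            for a, b in zip(s1, s2):
                counts[tuple(sorted((a, b)))] += weight
    return counts

# Hypothetical block of three aligned sequences, clustered at 62% identity.
block = ["AVLK", "AVLR", "GILK"]
clusters = cluster_sequences(block, 62)
pair_counts = weighted_pair_counts(clusters)
```

Here the first two sequences are 75% identical and share a cluster, so their pairings with the third sequence each contribute a weight of 0.5 instead of 1.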

Calculating the Matrix Entries

Let the following be determined from the observed data:

q_a = the fraction of pairings that include an a

p_ab = the fraction of pairings of a and b



The score is calculated as

$$ s(a, b) = \log \frac{p_{ab}}{q_a \, q_b} $$

These values are then scaled and rounded to make calculations easier.
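A minimal sketch of this calculation, under assumptions that should be read as illustrative: p_ab is treated as a symmetric joint probability over ordered pairs (each count for an unordered pair of different residues is split between (a, b) and (b, a)), q_a is the resulting marginal frequency, and the scaling is to half-bit units, i.e. 2 x log base 2, rounded to the nearest integer. The counts are made up.

```python
import math
from collections import defaultdict

def blosum_scores(pair_counts):
    """Turn unordered pair counts into rounded half-bit log-odds scores."""
    total = sum(pair_counts.values())
    p = defaultdict(float)  # joint probability over ordered pairs
    for (a, b), n in pair_counts.items():
        if a == b:
            p[(a, a)] += n / total
        else:
            p[(a, b)] += n / (2 * total)
            p[(b, a)] += n / (2 * total)
    q = defaultdict(float)  # marginal frequency of each residue
    for (a, _b), prob in p.items():
        q[a] += prob
    return {
        (a, b): round(2 * math.log2(prob / (q[a] * q[b])))
        for (a, b), prob in p.items()
    }

# Hypothetical counts for a two-letter alphabet.
toy_counts = {("A", "A"): 6, ("A", "G"): 2, ("G", "G"): 2}
scores = blosum_scores(toy_counts)
```

Pairs that were never observed are simply absent from the result, since the logarithm of zero is undefined; a real matrix needs some treatment for such pairs.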


The most popular BLOSUM matrices are BLOSUM62 and BLOSUM50.

  • BLOSUM62 is used mainly for ungapped matching.
  • BLOSUM50 is used for alignments with gaps.
  • Note: the lower the number, the longer the time span in evolutionary units.

Differences Between PAM and BLOSUM

PAM assumes that substitution probabilities for highly related proteins can be extrapolated to the probabilities for distantly related proteins.

BLOSUM matrices are based on the observation of more distantly related protein alignments.

NOTE: Both types of matrices use log-odds values in their scoring systems.