Scoring Matrices

Different Scoring Rules Lead to Different Alignments • Example: Score = 5 x (# matches) + (-4) x (# mismatches) + (-7) x (total length of all gaps) • Example: Score = 5 x (# matches) + (-4) x (# mismatches) + (-5) x (# gap openings) + (-2) x (total length of all gaps)
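
The two example rules can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names and the test sequences are hypothetical, the alignment is assumed to be given as two equal-length strings with "-" marking a gap, and no column is assumed to contain a gap in both rows.

```python
def score_linear(a, b, match=5, mismatch=-4, gap=-7):
    """Rule 1: every gap position costs the same (linear gap penalty)."""
    assert len(a) == len(b), "aligned strings must have equal length"
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def score_affine(a, b, match=5, mismatch=-4, gap_open=-5, gap_extend=-2):
    """Rule 2: each gap pays an opening cost plus a per-position cost."""
    assert len(a) == len(b), "aligned strings must have equal length"
    total = 0
    prev_gap_row = None  # row (0 or 1) holding a gap in the previous column
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            row = 0 if x == "-" else 1
            if prev_gap_row != row:
                total += gap_open  # a new gap starts here
            total += gap_extend    # charged for every gap position
            prev_gap_row = row
        else:
            total += match if x == y else mismatch
            prev_gap_row = None
    return total


# Two alignments of the same (hypothetical) pair of sequences.
one_long_gap = ("GATTACA", "GA--ACA")
two_short_gaps = ("GATTACA", "GA-A-CA")
for aligned in (one_long_gap, two_short_gaps):
    print(aligned, score_linear(*aligned), score_affine(*aligned))
```

Under the second rule a single long gap is penalized less than under the first, so the same pair of sequences can end up with a different best alignment.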

Scoring Rules/Matrices • Why are they important? • The choice of a scoring rule can strongly influence the outcome of sequence analysis • What do they mean? • Scoring matrices implicitly represent a particular theory of evolution • Elements of the matrices specify the similarity of one residue to another

The Sij in a Scoring Matrix (as log likelihood ratio)

The alignment score of aligning two sequences is the log likelihood ratio of the alignment under two models • Common ancestry • By chance

Likelihood Ratio for Aligning a Single Pair of Residues • Numerator: the probability that the two residues are aligned by evolutionary descent • Denominator: the probability that they are aligned by chance • Pi, Pj are the frequencies of residues i and j in all sequences (abundance)
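
Written out, the ratio described on this slide takes the form below. The symbol q_ij for the probability of seeing i aligned with j by descent is an assumption introduced here for illustration; Pi and Pj are the abundances named on the slide.

```latex
\[
  \text{likelihood ratio}(i, j) \;=\; \frac{q_{ij}}{P_i \, P_j},
  \qquad
  S_{ij} \;=\; \log \frac{q_{ij}}{P_i \, P_j}
\]
```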

Likelihood Ratio of Aligning Two Sequences
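
A sketch of how the per-residue ratios combine, assuming an ungapped alignment of two sequences x and y of length n whose positions are treated as independent. The symbol q for the probability of a pair being aligned by descent is an assumption introduced here; P denotes residue abundance.

```latex
\[
  \frac{P(x, y \mid \text{common ancestry})}{P(x, y \mid \text{chance})}
  \;=\; \prod_{k=1}^{n} \frac{q_{x_k y_k}}{P_{x_k} \, P_{y_k}},
  \qquad
  S(x, y) \;=\; \sum_{k=1}^{n} S_{x_k y_k}
  \quad\text{with}\quad
  S_{ij} = \log \frac{q_{ij}}{P_i \, P_j}
\]
```

Taking the logarithm turns the product into a sum, which is why an alignment score is computed by adding matrix entries.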

Two classes of widely used protein scoring matrices • PAM = % Accepted Mutations: 1500 changes in 71 groups w/ > 85% similarity • BLOSUM = Blocks Substitution Matrix: 2000 “blocks” from 500 families

PAM and BLOSUM matrices are all log likelihood matrices • More specifically, with scores in half-bit units: • An alignment that scores 6 means that alignment by common ancestry is 2^(6/2) = 8 times as likely as alignment by chance.
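
The conversion in this example can be sketched as below, assuming the score is expressed in half-bit units (two score units per bit), which is what the 2^(6/2) arithmetic implies. The function names and the `units_per_bit` parameter are hypothetical.

```python
import math


def likelihood_ratio(score, units_per_bit=2):
    """Convert a log-odds score back to a likelihood ratio."""
    return 2 ** (score / units_per_bit)


def score_from_ratio(ratio, units_per_bit=2):
    """Inverse conversion: likelihood ratio to (unrounded) score."""
    return units_per_bit * math.log2(ratio)


print(likelihood_ratio(6))   # the slide's example score
print(score_from_ratio(8))
```

A matrix published in a different unit (for example thirds of a bit) would need a different `units_per_bit`.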

Constructing BLOSUM Matrices (Blocks Substitution Matrices)

BLOSUM Matrices of Specific Similarities • Sequences whose similarity exceeds a threshold are clustered. • If the clustering threshold is 62%, the final matrix is BLOSUM62

A toy example of constructing a BLOSUM matrix from 4 training sequences

Constructing a BLOSUM matrix: 1. Counting mutations

2. Tallying mutation frequencies

3. Matrix of mutation probs.

4. Calculate abundance of each residue (Marginal prob)

5. Obtaining a BLOSUM matrix
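
Steps 1 to 5 can be sketched in Python as follows. The four training sequences below are hypothetical (the slide's own toy sequences are not reproduced here), the clustering step is skipped, and the half-bit scaling with rounding to integers is an assumption borrowed from the way BLOSUM matrices are usually published.

```python
from collections import Counter
from itertools import combinations
import math


def blosum_from_block(block):
    """Build a toy log-odds matrix from one ungapped block of sequences."""
    # Steps 1-2: count every residue pair in every column (unordered pairs).
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    # Step 3: turn counts into pair probabilities q[(a, b)].
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Step 4: marginal probability (abundance) of each residue.
    p = Counter()
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:
            p[a] += q_ab / 2
            p[b] += q_ab / 2

    # Step 5: log-odds score, observed vs. expected-by-chance pair frequency.
    scores = {}
    for (a, b), q_ab in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(q_ab / expected))
    return q, dict(p), scores


# Hypothetical block of 4 aligned sequences over a two-letter alphabet.
toy_block = ["ACA", "ACA", "ACC", "CCC"]
q, p, scores = blosum_from_block(toy_block)
print(scores)
```

Pairs that never occur in the block simply get no entry here; a real construction needs enough data (or pseudocounts) to score every pair.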

Constructing the real BLOSUM62 Matrix

Steps 1-3: Mutation Frequency Table

4. Calculate Amino Acid Abundance

5. Obtaining BLOSUM62 Matrix

BLOSUM matrices reference • S. Henikoff and J. Henikoff (1992). “Amino acid substitution matrices from protein blocks”. PNAS 89: 10915-10919 • Training Data: ~2000 conserved blocks from BLOCKS database. Ungapped, aligned protein segments. Each block represents a conserved region of a protein family

PAM Matrices (Point Accepted Mutations) • Mutations accepted by natural selection

Constructing PAM Matrix: Training Data

PAM: Phylogenetic Tree

PAM: Accepted Point Mutation

Mutability of Residue j

Total Mutation Rate • The mutation rates of all amino acids combined, each weighted by the abundance of that amino acid

Normalize Total Mutation Rate to 1% • This defines an evolutionary period: the period during which 1% of all residues are mutated (counting accepted mutations only)

Mutation Probability Matrix Normalized Such that the Total Mutation Rate is 1%
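
The mutability and normalization steps can be sketched as follows. This is a simplified illustration under stated assumptions: the count table and abundances are hypothetical three-residue data, mutability is taken as (number of accepted changes involving a residue) divided by its abundance, and `M[a][b]` is read as the probability that residue a is replaced by b, so each row sums to 1. The slide's own matrix may use the transposed layout.

```python
def pam1(counts, freqs, target_rate=0.01):
    """Build a mutation probability matrix whose total mutation rate is 1%."""
    residues = list(freqs)

    # Accepted point mutations observed for each residue.
    changes = {a: sum(counts[a][b] for b in residues if b != a)
               for a in residues}

    # Relative mutability of each residue.
    mutability = {a: changes[a] / freqs[a] for a in residues}

    # Total mutation rate, then the scale factor that brings it to 1%.
    total_rate = sum(freqs[a] * mutability[a] for a in residues)
    lam = target_rate / total_rate

    M = {}
    for a in residues:
        M[a] = {}
        for b in residues:
            if a == b:
                M[a][b] = 1 - lam * mutability[a]  # residue stays unchanged
            else:
                M[a][b] = lam * mutability[a] * counts[a][b] / changes[a]
    return M


# Hypothetical symmetric table of accepted point mutations and abundances.
counts = {
    "A": {"A": 0, "R": 30, "N": 20},
    "R": {"A": 30, "R": 0, "N": 10},
    "N": {"A": 20, "R": 10, "N": 0},
}
freqs = {"A": 0.5, "R": 0.3, "N": 0.2}

M = pam1(counts, freqs)
for a in M:
    print(a, {b: round(10000 * v) for b, v in M[a].items()})  # M * 10000
```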

Mutation Probability Matrix (transposed) M*10000

PAM1 mutation prob. matrix • PAM2 Mutation Probability Matrix? • Mutations that happen over twice the evolutionary period of PAM1

PAM Matrix: Assumptions

In two PAM1 periods: • {AR} = {AA and AR} or {AR and RR} or {AN and NR} or {AD and DR} or … or {AV and VR}

Entries in a PAM-2 Mut. Prob. Matr.
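
Summing over every possible intermediate residue gives each PAM-2 entry. The formula below assumes that M_ab denotes the PAM1 probability that residue a is replaced by b and that the two periods are independent; with the transposed layout the order of the factors is swapped.

```latex
\[
  M^{(2)}_{AR} \;=\; \sum_{k \in \{A, R, N, D, \ldots, V\}} M_{Ak} \, M_{kR},
  \qquad\text{i.e.}\qquad
  M^{(2)} = M \cdot M
\]
```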

PAM-k Mutation Prob. Matrix

PAM-k log-likelihood matrix
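
A minimal sketch of the last two steps, assuming PAM-k is obtained by multiplying the PAM1 mutation probability matrix by itself k times, and that scores are 10 x log10 of the odds ratio. The two-residue matrix and abundances are hypothetical, and rows are taken to be the original residue.

```python
import math


def mat_mul(X, Y):
    """Multiply two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_k(M1, k):
    """PAM-k mutation probability matrix: M1 applied k times."""
    n = len(M1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, M1)
    return result


def log_odds(Mk, freqs, scale=10.0):
    """S[i][j] = scale * log10(Mk[i][j] / freqs[j]), unrounded."""
    n = len(Mk)
    return [[scale * math.log10(Mk[i][j] / freqs[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical 2-residue PAM1 matrix; M1[i][j] = P(residue i becomes j).
M1 = [[0.99, 0.01],
      [0.02, 0.98]]
freqs = [2 / 3, 1 / 3]  # abundances consistent with M1

M250 = pam_k(M1, 250)
S250 = log_odds(M250, freqs)
for row in S250:
    print([round(s) for s in row])
```

Dividing by the abundance of the resulting residue makes the score matrix symmetric whenever the abundances are consistent with the mutation matrix, as they are in this toy case.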

PAM-250

PAM60: 60% similarity • PAM80: 50% similarity • PAM120: 40% similarity • The PAM-250 matrix gives better-scoring alignments than lower-numbered PAM matrices for proteins of 14-27% similarity

PAM Matrices: Reference • Atlas of Protein Sequence and Structure, Suppl 3, 1978, M.O. Dayhoff. ed. National Biomedical Research Foundation, 1

Choice of Scoring Matrix

Comparing Scoring Matrices • PAM: based on extrapolation from a small evolutionary period; built from homologous sequences during evolution; used to track evolutionary origins • BLOSUM: based on a range of evolutionary periods; built from conserved blocks; used to find conserved domains

Sources of Error in PAM
