PowerPoint Slideshow about 'scoring matrices' - Jimmy

Scoring Matrices

Different Scoring Rules Lead to Different Alignments

• Example (linear gap penalty): Score = 5 x (# matches) + (-4) x (# mismatches) + (-7) x (total length of all gaps)

• Example (affine gap penalty): Score = 5 x (# matches) + (-4) x (# mismatches) + (-5) x (# gap openings) + (-2) x (total length of all gaps)
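To see how the two rules can rank alignments differently, here is a minimal Python sketch that applies both to an already-aligned pair of sequences. The function names and the use of '-' as the gap character are assumptions made for illustration; the weights (5, -4, -7, -5, -2) are the ones from the examples above.

```python
def alignment_counts(a, b):
    """Count matches, mismatches, gap openings and total gap length
    for two aligned strings of equal length ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gap_openings = gap_length = 0
    prev_gap = None  # which sequence held the gap in the previous column
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            cur_gap = "a" if x == "-" else "b"
            gap_length += 1
            if cur_gap != prev_gap:  # a new gap starts here
                gap_openings += 1
            prev_gap = cur_gap
        else:
            prev_gap = None
            if x == y:
                matches += 1
            else:
                mismatches += 1
    return matches, mismatches, gap_openings, gap_length


def linear_gap_score(a, b):
    """First rule: every gap position costs -7."""
    m, mm, _, g = alignment_counts(a, b)
    return 5 * m - 4 * mm - 7 * g


def affine_gap_score(a, b):
    """Second rule: -5 per gap opening plus -2 per gap position."""
    m, mm, o, g = alignment_counts(a, b)
    return 5 * m - 4 * mm - 5 * o - 2 * g


one_long_gap = ("ACGT--AC", "ACCTGGAC")
two_short_gaps = ("AC-GT-AC", "ACCGTGAC")
print(linear_gap_score(*one_long_gap), affine_gap_score(*one_long_gap))      # 7 12
print(linear_gap_score(*two_short_gaps), affine_gap_score(*two_short_gaps))  # 16 16
```

Under the first rule a gap of length 2 costs 14 however it is split; under the second rule one gap of length 2 costs 9, while two gaps of length 1 cost 14.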

• Why are they important?

• The choice of a scoring rule can strongly influence the outcome of sequence analysis

• What do they mean?

• Scoring matrices implicitly represent a particular theory of evolution

• Elements of the matrices specify the similarity of one residue to another

The Sij in a Scoring Matrix (as log likelihood ratio)

Sij = log( qij / (Pi x Pj) ), where the symbol qij is used here for the numerator

• Numerator (qij): the probability that residues i and j are aligned by evolutionary descent

• Denominator (Pi x Pj): the probability that they are aligned by chance

• Pi, Pj are the frequencies of residues i and j in all sequences (abundance)
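As a small numeric illustration of the log likelihood ratio, the sketch below divides the probability of seeing i aligned with j by descent by the probability of seeing them aligned by chance, then takes the logarithm. The name `q_ij` for the numerator and the choice of base 2 are assumptions; the slide does not state the base.

```python
import math


def log_odds(q_ij, p_i, p_j):
    """Log likelihood ratio of 'aligned by descent' vs 'aligned by chance'.

    q_ij: probability that i and j are aligned by evolutionary descent
          (hypothetical name, not from the slides)
    p_i, p_j: background frequencies of residues i and j
    """
    return math.log2(q_ij / (p_i * p_j))


# Pair seen twice as often as chance predicts: positive score.
print(log_odds(0.125, 0.25, 0.25))    # 1.0
# Pair seen exactly as often as chance predicts: zero.
print(log_odds(0.0625, 0.25, 0.25))   # 0.0
# Pair seen half as often as chance predicts: negative score.
print(log_odds(0.03125, 0.25, 0.25))  # -1.0
```

Published matrices usually multiply this value by a constant and round to an integer, so the entries depend on the scale that was chosen.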

• PAM = % Accepted Mutations: 1500 changes in 71 groups with > 85% similarity

• BLOSUM = Blocks Substitution Matrix: 2000 “blocks” from 500 families

Constructing BLOSUM Matrices

Blocks Substitution Matrices

• Sequences whose similarity is above a threshold are clustered.

• If clustering threshold is 62%, final matrix is BLOSUM62
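The clustering step can be sketched as follows. This is only an illustration under stated assumptions: similarity is taken to be percent identity over ungapped, equal-length block rows, any pair at or above the threshold is linked, and linked sequences are merged transitively. The function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent of positions at which two equal-length sequences agree."""
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


def cluster_block(seqs, threshold=62.0):
    """Group sequences so that any pair at or above the threshold
    ends up in the same cluster (merging is transitive)."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if percent_identity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


# The first two rows are 80% identical and merge; the third stays alone.
print(cluster_block(["AAAAA", "AAAAC", "CCCCC"], threshold=62.0))
```

A lower threshold merges more sequences into each cluster, so close relatives contribute less to the final counts.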

Constructing a BLOSUM matrix

1. Counting mutations

3. Matrix of mutation probabilities

5. Obtaining a BLOSUM matrix

1, 2, 3. Mutation Frequency Table

5. Obtaining BLOSUM62 Matrix
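The named steps (counting mutations, turning counts into probabilities, obtaining the matrix) can be sketched in a few lines of Python. This is a simplified illustration, not the exact published procedure: it skips cluster weighting, and it assumes scores are scaled to half-bits (2 x log2) and rounded. The function name is hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def blosum_like_scores(block):
    """Log-odds scores from one ungapped block (list of equal-length rows)."""
    # Step: count every unordered residue pair in every column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1

    # Step: observed pair frequencies q_ij.
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    p = defaultdict(float)
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2

    # Step: log-odds of observed vs expected pair frequency.
    scores = {}
    for (x, y), freq in q.items():
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * math.log2(freq / expected))
    return scores


# Perfectly conserved columns: identical pairs score positive.
print(blosum_like_scores(["AS", "AS", "AS", "AS"]))
# Columns mixing A and S freely: the A/S substitution is not penalized.
print(blosum_like_scores(["AS", "AS", "SA", "SA"]))
```

Pairs that never occur in the block get no entry in this sketch; a real matrix is built from many blocks so that every pair is observed.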

BLOSUM matrices reference

• S. Henikoff and J. Henikoff (1992). “Amino acid substitution matrices from protein blocks”. PNAS 89: 10915-10919

• Training Data: ~2000 conserved blocks from BLOCKS database. Ungapped, aligned protein segments. Each block represents a conserved region of a protein family

Break

• Homework

PAM Matrices (Point Accepted Mutations)

Mutations accepted by natural selection

PAM: Phylogenetic Tree

PAM: Accepted Point Mutation

Mutability of Residue j

Total Mutation Rate

is the total mutation rate of all amino acids

Normalize Total Mutation Rate to 1%

This defines an evolutionary period: the period during which 1% of all residues are mutated (counting accepted mutations only, of course)

Mutation Probability Matrix, normalized such that the total mutation rate is 1%
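The slide formulas did not survive, so the following is a hedged sketch of the normalization in the usual Dayhoff-style formulation, with a toy two-residue alphabet. All names (`counts`, `freqs`, `mutability`, `lam`) are assumptions for illustration. `M[i][j]` is taken to be the probability that residue j is replaced by residue i within one period.

```python
def pam1_matrix(counts, freqs, mutability):
    """Mutation probability matrix scaled so that 1% of residues change.

    counts[i][j]: accepted point mutations between residues i and j
                  (symmetric; the diagonal is ignored)
    freqs[j]:     frequency of residue j
    mutability[j]: relative mutability of residue j
    """
    n = len(freqs)
    # Scale factor chosen so that sum_j freqs[j] * (1 - M[j][j]) == 0.01.
    lam = 0.01 / sum(f * m for f, m in zip(freqs, mutability))
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_total = sum(counts[i][j] for i in range(n) if i != j)
        for i in range(n):
            if i != j:
                # Share of j's mutations that go to i.
                M[i][j] = lam * mutability[j] * counts[i][j] / col_total
        # Probability that residue j stays unchanged.
        M[j][j] = 1.0 - lam * mutability[j]
    return M


# Toy data: two residues, the rarer one three times as mutable.
counts = [[0, 10], [10, 0]]
freqs = [0.75, 0.25]
mutability = [1.0, 3.0]
M = pam1_matrix(counts, freqs, mutability)
total_rate = sum(freqs[j] * (1.0 - M[j][j]) for j in range(2))
print(M)
print(total_rate)  # approximately 0.01
```

Each column of `M` sums to 1, and the frequency-weighted chance of any change is 1%, which is what makes this the PAM1 matrix.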

-- PAM1 mutation probability matrix

PAM2 Mutation Probability Matrix?

-- Mutations that happen in twice the evolutionary period of a PAM1

PAM Matrix: Assumptions

In two PAM1 periods:

• {AR} = {AA and AR} or {AN and NR} or {AD and DR} or … or {AV and VR}

PAM-k Mutation Probability Matrix

PAM-k log-likelihood matrix
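A sketch of both steps, assuming the usual construction: the PAM-k mutation probability matrix is PAM1 raised to the k-th power, and each log-likelihood entry compares the probability of ending in residue i with the background frequency of i. The 10 x log10 scaling with rounding follows the common Dayhoff convention and is an assumption here, as are the toy numbers and function names.

```python
import math


def mat_mul(X, Y):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_k(pam1, k):
    """PAM-k mutation probability matrix: PAM1 multiplied by itself k times."""
    n = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, pam1)
    return result


def pam_log_odds(pam1, freqs, k):
    """Rounded 10*log10 of (probability j becomes i) / (frequency of i)."""
    Mk = pam_k(pam1, k)
    n = len(freqs)
    return [[round(10 * math.log10(Mk[i][j] / freqs[i])) for j in range(n)]
            for i in range(n)]


# Toy two-residue PAM1 with background frequencies 0.75 and 0.25.
a = 1.0 / 150.0
pam1 = [[1.0 - a, 0.02],
        [a,       0.98]]
freqs = [0.75, 0.25]

print(pam_log_odds(pam1, freqs, 1))    # [[1, -16], [-16, 6]]
print(pam_log_odds(pam1, freqs, 250))  # [[0, 0], [0, 0]]
```

With only two residues the toy matrix loses all signal by k = 250; with twenty amino acids and real data the decay is slower, which is why PAM-250 remains informative for distant relatives.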

PAM-250

• PAM60: 60% similarity, PAM80: 50%, PAM120: 40%

• The PAM-250 matrix gives better-scoring alignments than lower-numbered PAM matrices for proteins of 14-27% similarity

PAM Matrices: Reference

• M.O. Dayhoff, ed. (1978). Atlas of Protein Sequence and Structure, Suppl. 3. National Biomedical Research Foundation, 1

Choice of Scoring Matrix

PAM

• Based on extrapolation of a small evolutionary period

• Track evolutionary origins

• Homologous sequences during evolution

BLOSUM

• Based on a range of evolutionary periods

• Conserved blocks

• Find conserved domains

Comparing Scoring Matrices

Sources of Error in PAM