
Scoring Matrices for Sequence Alignment

Scoring Matrices for Sequence Alignment. Anne Haake, Rhys Price Jones. Sequence comparisons require scoring matrices. To use alignment algorithms for database searches, we need scoring schemes that are based on biological knowledge.


Presentation Transcript


  1. Scoring Matrices for Sequence Alignment. Anne Haake, Rhys Price Jones

  2. Scoring Matrices • Sequence comparisons require some scoring matrices • To use the alignment algorithms to do database searches, we need some scoring schemes that are based on biological knowledge. • Scoring matrices represent evolutionary theory • The choice of matrix can influence the outcome of the analysis • Understanding the theory can help in making an appropriate choice

  3. Nucleotide Scoring 1. Identity matrix (similarity)

          A   T   C   G
      A   1   0   0   0
      T   0   1   0   0
      C   0   0   1   0
      G   0   0   0   1

  4. Nucleotide Scoring 2. BLAST matrix

          A   T   C   G
      A   5  -4  -4  -4
      T  -4   5  -4  -4
      C  -4  -4   5  -4
      G  -4  -4  -4   5

  5. Nucleotide Scoring 3. Transition/Transversion Matrix

          A   T   C   G
      A   0   5   5   1
      T   5   0   1   5
      C   5   1   0   5
      G   1   5   5   0
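
The three nucleotide tables can be written down and applied in a few lines. The following is a minimal sketch, not part of the original slides: the helper names (`make_matrix`, `ts_tv_cost`, `score_ungapped`) and the example sequences are made up for illustration. It reads the transition/transversion table as a cost table (0 for identity, 1 for a purine-purine or pyrimidine-pyrimidine change, 5 otherwise), which is what the values on the slide suggest.

```python
BASES = "ATCG"
PURINES = {"A", "G"}  # C and T are pyrimidines


def make_matrix(match, mismatch):
    """Uniform match/mismatch table keyed by (base, base)."""
    return {(a, b): (match if a == b else mismatch) for a in BASES for b in BASES}


def ts_tv_cost(a, b, transition=1, transversion=5):
    """Cost of aligning a with b: 0 if identical, low for a transition
    (same chemical class), high for a transversion (different class)."""
    if a == b:
        return 0
    same_class = (a in PURINES) == (b in PURINES)
    return transition if same_class else transversion


IDENTITY = make_matrix(1, 0)    # slide 3
BLAST = make_matrix(5, -4)      # slide 4
TS_TV = {(a, b): ts_tv_cost(a, b) for a in BASES for b in BASES}  # slide 5


def score_ungapped(seq1, seq2, matrix):
    """Sum the table entries over the columns of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


# Hypothetical example: 5 identical columns and 2 transitions (T/C and C/T).
s1, s2 = "GATTACA", "GACTATA"
for name, m in [("identity", IDENTITY), ("BLAST", BLAST), ("ts/tv cost", TS_TV)]:
    print(name, score_ungapped(s1, s2, m))
```

Note that the first two tables are similarity scores (higher is better), while the third is a distance (lower is better), so the numbers are not directly comparable.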

  6. Protein Scoring 1. Identity Matrix • Score 1 if equal • Score 0 if not equal • Easy, but weak 2. Genetic Code Matrix • Determine the minimum number of base changes required to convert one amino acid into another • Edit distance: will be 0, 1, 2, or 3 • Is a distance matrix • Not very discriminating; consider CAU [Figure: genetic code table]
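
The genetic code distance can be computed directly from the standard codon table. This is a minimal sketch under stated assumptions: it uses the standard genetic code in RNA letters (to match the CAU example), and the function names are illustrative rather than taken from the slides.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' marks stops.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
AMINO_ACIDS = sorted(set(AMINO) - {"*"})


def codons_for(aa):
    """All codons that encode the amino acid aa."""
    return [codon for codon, a in CODON_TABLE.items() if a == aa]


def min_base_changes(aa1, aa2):
    """Minimum number of single-base changes that turn some codon
    for aa1 into some codon for aa2 (0, 1, 2, or 3)."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )


# Histidine is encoded by CAU and CAC; list everything one base change away.
one_step_from_his = [aa for aa in AMINO_ACIDS if min_base_changes("H", aa) == 1]
print(CODON_TABLE["CAU"], one_step_from_his)
```

Because many chemically unrelated amino acids end up at the same distance from histidine, the matrix has only four possible values to separate 20 residues, which is one way to read the slide's remark that it is not very discriminating.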

  7. Protein Scoring 3. Hydrophobicity Matrix • Based on physical/chemical properties of the amino acids [Figure: hydrophobicity matrix] 4. Log Odds Matrices • Which amino acids are most likely to be seen in close relatives? In distant relatives? • Examples: PAM and BLOSUM matrices

  8. PAM and BLOSUM Substitution Matrices for Amino Acids • Based on actual substitution rates among the various amino acids in nature • Empirically derived; huge amount of work! • General Strategy: • Select a collection of related proteins and align them • Observe the frequencies with which one amino acid is replaced by another = A • Figure out how often, given the frequencies of the amino acids in your set, the replacement would occur by chance alone = B • The ratio A/B (odds) tells us how often the replacement has occurred in evolution (as compared to a random process)
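
The A/B ratio in the general strategy can be illustrated on a toy data set. This is a sketch only: the aligned pairs are invented, the function name `odds_table` is hypothetical, and real matrices are built from far larger curated alignments. Pairs are treated as unordered, so the chance expectation for two different residues is 2·p(a)·p(b).

```python
from collections import Counter


def odds_table(aligned_pairs):
    """For each unordered residue pair return (A, B, A/B), where
    A = observed frequency of the pair in the alignment columns and
    B = frequency expected by chance from the residue composition."""
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[tuple(sorted((a, b)))] += 1
        residue_counts[a] += 1
        residue_counts[b] += 1
    n_pairs = sum(pair_counts.values())
    n_residues = sum(residue_counts.values())

    table = {}
    for (a, b), n in pair_counts.items():
        observed = n / n_pairs
        p_a = residue_counts[a] / n_residues
        p_b = residue_counts[b] / n_residues
        expected = p_a * p_b if a == b else 2 * p_a * p_b
        table[(a, b)] = (observed, expected, observed / expected)
    return table


# Invented alignment columns: mostly conserved L and I, a few L/I swaps.
toy_pairs = [("L", "L")] * 6 + [("L", "I")] * 2 + [("I", "I")] * 2
for pair, (obs, exp, odds) in sorted(odds_table(toy_pairs).items()):
    print(pair, round(obs, 3), round(exp, 3), round(odds, 3))
```

An odds ratio above 1 means the pair is seen more often than chance predicts; below 1 means it is seen less often.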

  9. PAM and BLOSUM Substitution Matrices for Amino Acids • Matrices are 20 X 20 tables of values that describe the probability of a residue pair occurring in an alignment • The scoring matrix values are logarithms of ratios of the probability of a meaningful occurrence to the probability of random occurrence.
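
Written as a formula, the log-odds score described above takes the following general form. The symbols are notation introduced here, not taken from the slides: q_ij is the probability of seeing residues i and j aligned in related sequences, p_i and p_j are the background frequencies of the two residues, and λ is a scaling constant chosen so that the scores can be rounded to convenient integers.

```latex
s_{ij} \;=\; \frac{1}{\lambda}\,\log\frac{q_{ij}}{p_i\,p_j}
```

A positive score means the pair occurs more often than expected by chance, a negative score less often, and a score of zero means the pair is as frequent as chance predicts.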

  10. PAM Matrices • PAM stands for Point Accepted Mutation or Percent Accepted Mutation • Developed by Dayhoff et al. (1978) • Model based on empirically derived data • Groups of closely related proteins were aligned (global alignments), so that the probability of more than one replacement at a single site was negligible • 1,572 changes observed in 71 groups of closely related proteins were used to build the 1 PAM matrix

  11. PAM Unit • Matrix represents substitution probabilities over a fixed unit of evolutionary change • e.g. PAM1 is 1 substitution per 100 residues or one PAM unit (an amount of evolution) • 1% divergence • Start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all a.a. residues have undergone changes at time t+n. New sequence M’ • What is the probability that a.a. i in M will be replaced by a.a. j in M’? • To get your answer, look it up in the PAM-1 table (entry Rij)

  12. PAM Matrix • Matrix values are based on the model that one sequence is derived from the other by a series of independent mutations, each changing one amino acid in the first sequence to another amino acid in the second • The model is an approximation • Many assumptions • Not all of the assumptions necessarily hold

  13. PAM Matrix • PAM-1 is used to derive other PAM matrices • Why? • PAM-1 is 1% accepted mutations • PAM-N is N% accepted mutations • To derive PAM-N, the PAM-1 matrix is raised to the Nth power (N copies of PAM-1 multiplied together) • e.g. PAM-100; PAM-250 • What does this mean for errors?
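
The extrapolation from PAM-1 to PAM-N is repeated matrix multiplication. The sketch below uses an invented 3-letter alphabet instead of the real 20 x 20 Dayhoff matrix, and assumes the convention that entry [i][j] is the probability that residue i is replaced by residue j, so each row sums to 1. The names `PAM1_TOY`, `mat_mul`, and `mat_pow` are illustrative.

```python
def mat_mul(X, Y):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """M multiplied together n times (n = 0 gives the identity matrix)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Invented "PAM-1": each residue stays unchanged with probability 0.99.
PAM1_TOY = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]

pam250_toy = mat_pow(PAM1_TOY, 250)
for row in pam250_toy:
    print([round(p, 3) for p in row])
```

Two things are visible even in the toy case. The diagonal of the 250-step matrix stays well above zero, because later substitutions can hit sites that have already changed, so 250 accepted mutations per 100 residues does not mean 250% difference. And every entry of the result depends on the PAM-1 values, so any error in PAM-1 is carried through each multiplication.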

  14. Which PAM matrix do I use? • Depends on how closely the sequences are believed to be related • Low PAM numbers (e.g. PAM-1) for more closely related sequences • High PAM numbers (e.g. PAM-1000) for more distant relationships • In practice, PAM-250 is often used in alignment and database searching software.

  15. BLOSUM Matrix • BLOSUM is from BLOcks SUbstitution Matrix • Originates with a paper by Henikoff and Henikoff (1992; PNAS 89:10915-10919)

  16. BLOSUM Matrix • Derived from the BLOCKS database • Derived by observing substitution rates among similar protein sequences • Uses families of distantly related protein sequences, because a multiple alignment is needed • The interest is in substitutions rather than indels, which tend to occur more in distantly related sequences • Ungapped multiple alignments are used to identify conserved blocks of amino acids

  17. BLOSUM Matrix

  18. BLOSUM Matrix • A clustering approach is used to sort the sequences into closely related groups, in which the sequences are similar at some threshold value of percentage identity • e.g. BLOSUM62 is the standard matrix for ungapped alignment; 62 is the cutoff value for clustering (sequences are put into the same cluster if more than 62% identical) • Substitution frequencies for all pairs of amino acids are then calculated between the groups, and these are used to calculate a log odds BLOSUM matrix
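
The clustering and between-group counting steps can be sketched as follows. This is an illustration under assumptions, not the published procedure in full: the block is a made-up set of three ungapped segments, clustering is simple single-linkage using the "more than 62% identical" rule from the slide, and each segment is weighted by 1 / (cluster size) so that a cluster counts as a single sequence. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def percent_identity(a, b):
    """Percentage of positions at which two equal-length segments agree."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(segments, threshold=62.0):
    """Single-linkage clustering: a segment joins every cluster containing a
    member more than `threshold` percent identical to it; such clusters merge."""
    clusters = []
    for seg in segments:
        linked, rest = [], []
        for c in clusters:
            hit = any(percent_identity(seg, m) > threshold for m in c)
            (linked if hit else rest).append(c)
        merged = [seg] + [m for c in linked for m in c]
        clusters = rest + [merged]
    return clusters


def weighted_pair_counts(clusters):
    """Count residue pairs column by column, only between different clusters,
    weighting each segment pair by 1 / (size of cluster 1 * size of cluster 2)."""
    counts = Counter()
    for c1, c2 in combinations(clusters, 2):
        weight = 1.0 / (len(c1) * len(c2))
        for s1 in c1:
            for s2 in c2:
                for x, y in zip(s1, s2):
                    counts[tuple(sorted((x, y)))] += weight
    return counts


# Invented block of three aligned segments, four columns wide.
block = ["AVLK", "AVLR", "GIMK"]
groups = cluster(block)
pair_counts = weighted_pair_counts(groups)
print(groups)
print(dict(pair_counts))
```

The resulting pair counts play the role of the observed frequencies in the log-odds calculation. Pairs within a cluster are never counted, which is why raising the threshold (e.g. BLOSUM80) keeps more close relatives in play and lowering it emphasises more divergent substitutions.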

  19. BLOSUM • BLOSUM-62 matrix: appropriate for comparing sequences of approximately 62% sequence similarity • BLOSUM-80 matrix: 80% similarity

  20. PAM vs BLOSUM • Lower PAM numbers used for more closely related sequences • Lower BLOSUM numbers used for more distantly related sequences • Dayhoff-like matrices (PAM) derive their initial substitution frequencies from global alignments of very similar sequences. • The BLOSUM matrix is derived from local multiple alignments of more distantly related sequences

  21. Constructing a BLOSUM matrix • In class • In lab

  22. Constructing PAM Matrices • A multiple alignment is constructed between sequences with high identity (>85%) • A phylogenetic tree is constructed from the aligned sequences • Substitutions are identified between each pair of sequences in the tree • The substitution matrix is constructed by calculating the frequency of substitution for each amino acid, the relative mutability for each, and the mutation probability for each pair of amino acids (see example)
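
The last step, turning substitution counts into mutation probabilities, can be sketched with a toy alphabet. The sketch assumes the substitution counts have already been read off the tree (the alignment and tree-building steps are not shown), uses invented numbers for three residues, and takes relative mutability as changes divided by occurrences. The function name `pam1_from_counts` and the row convention (entry [i][j] is the probability that i is replaced by j) are choices made here, not taken from the slides.

```python
def pam1_from_counts(A, occurrences, target=0.01):
    """A[i][j]: symmetric count of accepted changes between residues i and j.
    occurrences[i]: how often residue i appears in the aligned data.
    Returns (R, freq) where R[i][j] is the probability that i is replaced by j
    over one PAM unit, scaled so that `target` of all residues change."""
    residues = sorted(occurrences)
    total = sum(occurrences.values())
    freq = {i: occurrences[i] / total for i in residues}
    changes = {i: sum(A[i].values()) for i in residues}
    mutability = {i: changes[i] / occurrences[i] for i in residues}

    # Scaling constant: makes the expected fraction of changed residues = target.
    lam = target / sum(freq[i] * mutability[i] for i in residues)

    R = {}
    for i in residues:
        # Off-diagonal: share of i's changes that go to j, times i's mutability.
        R[i] = {j: lam * mutability[i] * A[i][j] / changes[i]
                for j in residues if j != i}
        # Diagonal: probability that i is left unchanged.
        R[i][i] = 1.0 - lam * mutability[i]
    return R, freq


# Invented counts for three residues.
A_TOY = {
    "A": {"S": 30, "T": 10},
    "S": {"A": 30, "T": 20},
    "T": {"A": 10, "S": 20},
}
OCC_TOY = {"A": 1000, "S": 800, "T": 700}

R, freq = pam1_from_counts(A_TOY, OCC_TOY)
for i in sorted(R):
    print(i, {j: round(R[i][j], 5) for j in sorted(R[i])})
```

Each row sums to 1, and the frequency-weighted probability of change across all residues comes out at 1%, which is what makes the result a 1 PAM matrix in this toy setting.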

  23. Constructing PAM Matrix • Example
