MODELS OF PROTEIN EVOLUTION: AN INTRODUCTION TO AMINO ACID EXCHANGE MATRICES

MODELS OF PROTEIN EVOLUTION: AN INTRODUCTION TO AMINO ACID EXCHANGE MATRICES Robert Hirt Institute for Cell and Molecular Biosciences, Newcastle University, UK

Inferring trees is difficult!!! 1. The method problem A Method 1 Dataset 1 B ? C A Method 2 C Dataset 1 B

Inferring trees is difficult!!! 2. The dataset problem A Method 1 B Dataset 1 ? C A Method 1 C Dataset 2 B

From DNA/protein sequences to trees * 1 Sequence data * 2 Align Sequences Phylogenetic signal? Patterns—>evolutionary processes? * 3 Distances methods Characters based methods * Distance calculation (which model?) 4 Choose a method MB ML MP Wheighting? (sites, changes)? Model? Model? Optimality criterion Single tree LS ME NJ Calculate or estimate best fit tree 5 Test phylogenetic reliability Modified from Hillis et al., (1993). Methods in Enzymology 224, 456-487

Agenda • Some general considerations • Why protein phylogenetics? • What are we comparing? Protein sequences - some basic features • Protein structure/function and its impact on patterns of mutations • Amino acid exchange matrices: where do they come from and when do we use them? • Database searches (e.g. Blast, FASTA) • Sequence alignment (e.g. ClustalX) • Phylogenetics (model based methods: distance, ML & Bayesian)

Why protein phylogenies? • For historical reasons - the first sequences • Most genes encode proteins • To study protein structure, function and evolution • Comparing DNA and protein based phylogenies can be useful • Different genes - e.g. 18S rRNA versus EF-2 protein • Protein encoding gene - codons versus amino acids

Proteins were the first molecular sequences to be used for phylogenetic inference • Fitch and Margoliash (1967). Construction of phylogenetic trees. Science 155, 279-284.

Phylogenies from proteins • Parsimony • Distance matrices • Maximum likelihood • Bayesian methods

Evolutionary models for amino acid changes • All methods have explicit or implicit evolutionary models • Can be in the form of simple formula • Kimura or poisson corrections to estimate distances • Most models for amino acid changes typically include • A 20x20 rate matrix (or reduced version of it, 6x6 rate matrix) • Correction for rate heterogeneity among sites (G [a]+ pinv) • Assume stationarity and neutrality - what if there are biases in composition, or non neutral changes such as selection? • Some more recent models include correction for compositions heterogeneity across sites (CAT model) or across the tree (NDCH - node discrete compositional heterogeneity --- see Peter’s lectures)

Character states in DNA and protein alignments • DNA sequences have four states (five): A, C, G, T, (and ± indels) • Proteins have 20 states (21): A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y (and ± indels) • —> more information in DNA or protein alignments?

DNA->Protein: the code • 3 nucleotides (a codon) code for one amino acid (61 codons! 61x61 rate matrices?) • Degeneracy of the code: most amino acids are coded by several codons —> more data/information in DNA?

DNA—>Protein • The code is degenerate: 20 amino acids are encoded by 61 possible codons (3 stop codons) • Complex patterns of changes among codons: • Synonymous/non synonymous changes • Synonymous changes correspond to codon changes not affecting the coded amino acid

Codon degeneracy: protein alignments as a guide for DNA alignments Glu-Gly-Ser-Ser-Trp-Leu-Leu-Leu-Gly-Ser Glu-Gly-Ser-Ser-Tyr-Leu-Leu-Ile-Gly-Ser Asp-Gly-Ser-Ala-Trp-Leu-Leu-Leu-Gly-Ser Asp-Gly-Ser-Ala-Tyr-Leu-Leu-Ala-Gly-Ser GAA-GGA-AGC-TCC-TGG-TTA-CTC-CTG-GGA-TCC GAG-GGT-TCC-AGC-TAT-CTA-TTA-ATT-GGT-AGC GAC-GGC-AGT-GCA-TGG-TTG-CTT-TTG-GGC-AGT GAT-GGG-TCA-GCT-TAC-CTC-CTG-GCC-GGG-TCA

DNA->Protein: code usage • Difference in codon usage can lead to large base composition bias - in which case one often needs to remove the 3rd codon, the more bias prone site… and possibly the 1st • Comparing protein sequences can reduce the compositional bias problem —> more information in DNA or protein?

Models for DNA and Protein evolution • DNA: 4 x 4 rate matrices • “Easy” to estimate - can be combined with tree search in some cases • Protein: 20 x 20 matrices • More complex: time and estimation problems (rare changes?) -> • Empirical models from large datasets are typically used • One can correct for amino acid frequencies for a given dataset • Reduced alphabet (e.g. 6 x 6 matrix) allows to optimize the model for a given dataset

Proteins and their amino acids • Proteins determine shape and structure of cells and carry most catalytic processes - 3D structure • Proteins are polymers of 20 different amino acids • Amino acids sequence composition determines the structure (2ndary, 3ary…) and function of the protein • Amino acids can be categorized by their side chain physicochemical properties • Size (small versus large) • Polarity (hydrophobic versus hydrophilic, +/- charges)

Amino acid physico-chemical properties • Major factor in protein folding • Key to protein functions • —> Major influence in pattern • of amino acid mutations • As for Ts versus Tv in DNA sequences, some amino acid changes are more common than others: fundamental for sequence comparisons(alignments and phylogenetics!) • Small <—> small > small <—> big

Estimation of relative rates of residue replacement (models of evolution) • Differences/changes in protein alignments can be pooled and patterns of changes investigate. • Patterns of changes give insights into the evolutionary processes underlying protein diversification -> estimation of evolutionary models • Choice of protein evolutionary models can be important for the sequence analysis we perform (database searching, sequence alignment, phylogenetics)

Amino acid substitution matrices based on observed substitutions: “empirical models” • Summarise the substitution pattern from large amount of existing data (‘average’ models) • Based on a selection of proteins • Globular proteins, membrane proteins? • Mitochondrial proteins? • Uses a given counting method and set of recorded changes • tree dependent/independent • restriction on the sequence divergence

Amino acid physico-chemical properties • Size • Polarity • Charges (acidic/basic) • Hydrophilic (polar) • Hydrophobic (non polar)

Taylor’s Venn diagram of amino acids properties Tiny Small P A Aliphatic CS-S G N Polar S CS-H Q V D - T I E L Charged K M + Y F H R W Hydrophobic Aromatic Taylor (1986). J Theor. Biol. 119: 205-218

Hydrophylic Small Large Hydrophobic Kosiol et al. (2004). J. Theor. Biol. 228: 97-106

Amino acids categories 1:Doolittle (1985). Sci. Am. 253, 74-85. • Small polar: S, G, D, N • Small non-polar: T, A, P, C • Large polar: E, Q, K, R • Large non-polar: V, I, L, M, F • Intermediate polarity: W, Y, H

Amino acids categories 2(PAM matrix) • Sulfhydryl: C • Small hydrophilic: S, T, A, P, G • Acid, amide: D, E, N, Q • Basic: H, R, K • Small hydrophobic : M, I, L, V • Aromatic: F, Y, W

Amino acids categories 3(implemented in SEAVIEW colour coding) • Tiny 1, non-polar: C • Tiny 2, non-polar: G • Imino acid: P • Non-polar: M, V, L, I, A, F, W • Acid: D, E • Basic: R, K • Aromatic: Y, H • Uncharged polar: S, T, Q, N

Amino acids categories Changes within a category are more common then between them • Colour coding of alignments to help visualise their quality (ClustalX, SEAVIEW) • Differential weighting of cost matrices in parsimony analyses • Mutational data matrices in model based methods (e.g. ML and Bayesian framework) • Recoding of the 20 amino acids into bins to focus on changes between bins (categories) (e.g. 6x6 matrix)

—> Colour coding of different categories is useful for protein alignment visual inspection

Phylogenetic trees from protein alignments • Parsimony based methods- unweighted/weighted • Distance methods- model for distance estimation • probability of amino acid changes, site rate heterogeneity • Maximum likelihoodandBayesian methods- model for ML calculations • probability of amino acid changes, site rate heterogeneity

Trees from protein alignment:Parsimony methods - cost matrices • All changes weighted equally • Differential weighting of changes: anattempt to correct for homoplasy!: • Based on the minimal number of amino acid substitutions, the genetic code matrix (PHYLIP-PROTPARS) • Weights based on physico-chemical properties of amino acids • Weights based on observed frequency of amino acid substitutions in alignments

Parsimony: unweighted matrix for amino acid changes • Ile-> Leu cost = 1 • Trp -> Asp cost = 1 • Ser -> Arg cost = 1 • Lys -> Asp cost = 1

Parsimony: weighted matrix for amino acid changes, the genetic code matrix • Ile-> Leu cost = 1 • Trp -> Asn cost = 3 • Ser -> Arg cost = 2 • Lys -> Asp cost = 2

Weighting matrix based on minimal amino acid changes PROTPARS inPHYLIP – also in SEAVIEW A C D E F G H I K L M N P Q R 1 2 T V W Y [A] 0 2 1 1 2 1 2 2 2 2 2 2 1 2 2 1 2 1 1 2 2 [C] 2 0 2 2 1 1 2 2 2 2 2 2 2 2 1 1 1 2 2 1 1 [D] 1 2 0 1 2 1 1 2 2 2 2 1 2 2 2 2 2 2 1 2 1 [E] 1 2 1 0 2 1 2 2 1 2 2 2 2 1 2 2 2 2 1 2 2 [F] 2 1 2 2 0 2 2 1 2 1 2 2 2 2 2 1 2 2 1 2 1 [G] 1 1 1 1 2 0 2 2 2 2 2 2 2 2 1 2 1 2 1 1 2 [H] 2 2 1 2 2 2 0 2 2 1 2 1 1 1 1 2 2 2 2 2 1 [I] 2 2 2 2 1 2 2 0 1 1 1 1 2 2 1 2 1 1 1 2 2 [K] 2 2 2 1 2 2 2 1 0 2 1 1 2 1 1 2 2 1 2 2 2 [L] 2 2 2 2 1 2 1 1 2 0 1 2 1 1 1 1 2 2 1 1 2 [M] 2 2 2 2 2 2 2 1 1 1 0 2 2 2 1 2 2 1 1 2 3 [N] 2 2 1 2 2 2 1 1 1 2 2 0 2 2 2 2 1 1 2 3 1 [P] 1 2 2 2 2 2 1 2 2 1 2 2 0 1 1 1 2 1 2 2 2 [Q] 2 2 2 1 2 2 1 2 1 1 2 2 1 0 1 2 2 2 2 2 2 [R] 2 1 2 2 2 1 1 1 1 1 1 2 1 1 0 2 1 1 2 1 2 [1] 1 1 2 2 1 2 2 2 2 1 2 2 1 2 2 0 2 1 2 1 1 [2] 2 1 2 2 2 1 2 1 2 2 2 1 2 2 1 2 0 1 2 2 2 [T] 1 2 2 2 2 2 2 1 1 2 1 1 1 2 1 1 1 0 2 2 2 [V] 1 2 1 1 1 1 2 1 2 1 1 2 2 2 2 2 2 2 0 2 2 [W] 2 1 2 2 2 1 2 2 2 1 2 3 2 2 1 1 2 2 2 0 2 [Y] 2 1 1 2 1 2 1 2 2 2 3 1 2 2 2 1 2 2 2 2 0 W: TGG ||| N: AAC AAT A minimum of 3 changes are needed at the DNA level for W<->N

Phylogenetic trees from protein alignments • Parsimony based methods- unweighted/weighted • Distance methods- model for distance estimation • probability of amino acid changes, site rate heterogeneity • Maximum likelihoodandBayesian methods- model for ML calculations • probability of amino acid changes, site rate heterogeneity

Distance methods A two step approach - two choices! 1) Estimate all pairwise distances Choose a method (100s) - has an explicit model for sequence evolution 2) Estimate a tree from the distance matrix Choose a method: with or without an optimality criterion?

Estimation of protein pairwise distances • Simple formula • More complex models • 20 x 20 matrices (evolutionary model): • Identity matrix • Genetic code matrix • Mutational data matrices (MDMs) • Correction for rate heterogeneity between sites (G [a]+ pInv)

The Kimura and poisson corrections dij = -Ln (1 - Dij - (Dij2/5)) - Kimura dij = -Ln (1 –Dij) - poisson Dij is the observed proportion of differences between sequences i and j (1≥Dij≥0). - Can give good estimates for small Dij It can approximates the PAM matrix well However if Dij =1 (poisson) or Dij ≥ 0.8541 (Kimura), dij = infinite Does not take into account which amino acid are changing and assumes all sites change with the same rate Both are implemented in SEAVIEW

Amino acid substitution matrices (MDMs) • Sequence alignments based matrices PAM, JTT, BLOSUM, WAG, LG... • Structure alignments based matrices STR (for highly divergent sequences)

Protein distance measurements with MDM 20 x 20 matrices: • PAM, BLOSUM, WAG…matrices • Maximum likelihood calculation which takes into account: • All sites in the alignment • All pairwise rates in the matrix • Branch length dij = ML [P(n), Xij, (G, pinv)] (dodgy notation!) dij = -Ln (1 - Dij - (Dij2/5))= F(Dij)

How is an MDM inferred? • Observed raw changes are corrected for: • The amino acid relative mutability • The amino acid normalised frequency • Differences between MDM come from: • Choice of proteins used (membrane, globular) • Range of sequence similarities used • Counting methods • On a tree [MP, ML] • Pairwise comparison from an alignment -> empirical models from large datasets are typically used

How is an MDM inferred? The raw data: observed changes in pairwise comparisons in an alignment or on a tree seq.1 AIDESLIIASIATATI |*||*||*||*||*|| seq.2 AGDEALILASAATSTI

seq.1 AIDESLIIASIATATI |*||*||*||*||*|| seq.2 AGEEALILASAATSTI A S T G I L E D A 3 S 2 1 T 0 0 1 G 0 0 0 0 I 1 0 0 1 2 L 0 0 0 0 1 1 E 0 0 0 0 0 0 1 D 0 0 0 0 0 0 1 0 Raw matrix Symmetrical! -> The larger the dataset the better the estimates!

Amino Acid exchange matrices - s1,2 s1,3 … s1,20 s1,2 - s2,3 … s2,20 s1,3 s2,3 - … s3,20 … … … … … s1,20 s2,20 s3,20 … - X diag(π1, …, π20) = Q matrix Q Rate matrix Qij Instantaneous rates of change of amino acids sij Exchangeabilities of amino acid pairs ij sij = sij Time reversibility πi Stationarity of amino acid frequencies (typically the observed proportion of residues in the dataset)

Amino Acid exchange matrices R Relative rate matrix (no composition, no branch length) Q Rate matrix (with composition, not branch length) F P R Raw matrix Observed changes (counted on a MP tree or in pairwise comparisons) Probability matrix (composition + branch length) Can be estimated using ML on a tree Relatedness odd matrix Used for scoring alignments (BlastP, ClustalX) Modified from Peter Foster

The PAM and JTT matrices • PAM - Dayhoff et al. 1968 • Nuclear encoded genes, ~100 proteins • JTT - Jones et al. 1992 • 59,190 accepted point mutations for 16,300 proteins Jones, Taylor & Thornton (1992). CABIOS 8, 275-282

The BLOSUM matrices Henikoff & Henikoff (1992). Proc Natl Acad Sci USA 89, 10915-9 • BLOcks SUbstitution Matrices • The matrix values are based on 2000 conserved amino acid patterns (blocks) - pairwise comparisons —> more efficient for distantly related proteins —> more agreement with 3D structure data BLOSUM62 - 62% minimum sequence identity (BlastP default) BLOSUM50 - 50% minimum sequence identity BLOSUM42 - 42% minimum sequence identity (BlastP)

The WAG matrix Whelan and Goldman (2001) Mol. Biol. Evol. 18, 691-699 • Globular protein sequences • 3,905 sequences from 182 protein families • Produced a phylogenetic trees for every family and used maximum likelihood to estimate the relative rate values in the rate matrix (overall lnL over 182 different trees) • Better fit of the model with most data (significant improvement of the tree lnL when compared to PAM or JTT matrices) • Might not be the best option in some cases such as for mitochondria encoded proteins or other membrane proteins…

Further improvements: the LG matrix Le and Gascuel (2008) Mol. Biol. Evol. 25, 1307-20 • Used the same phylogenetic approach then for WAG • Further refine the method by adding the variability of evolutionary rates across sites when estimating the matrix and increase the number of sequences used • Better fit of the model with most data (significant improvement of the tree lnL when compared to WAG and other matrices) • Might not be the best option in some cases such as for mitochondria encoded proteins or other membrane proteins… • Incorporated in ProTest that compares fit between various matrices and a given alignment and is available in PhyML – in SEAVIEW – for phylogenetic inferences

MODELS OF PROTEIN EVOLUTION: AN INTRODUCTION TO AMINO ACID EXCHANGE MATRICES