
Molecular Evolution: Plan for week




  1. Molecular Evolution: Plan for week
     Monday 3.11: Basics of Molecular Evolution
     • Lecture 1: 9-10.30 Molecular Basis and Models I (JH)
     • Computer: 11-12.30 PAUP: Distance/Parsimony/Compatibility (JH/IH)
     • Lecture 2: 13.30-15 Molecular Basis and Models II (JH)
     • Lecture 3: 15.30-17 The Origin of Life (JH/Miklos)
     Tuesday 4.11: Tree of Life
     • Lecture 1: 9-10.30 Molecular Evolution of Eukaryote Pathogens (Day/Barry)
     • Lecture 2: 11-12.30 Molecular Evolution of Prokaryote Pathogens (Maiden)
     • Computer: 13.30-15 Analysis of Viral Data (Taylor)
     • Lecture 3: 15.30-17 Molecular Evolution of Virus (E. Holmes)
     Wednesday 5.11: Stochastic Models of Evolution & Phylogenies
     • Computer: 9-10.30 PAUP/MrBayes: Likelihood (JH/IH)
     • Lecture 1: 11-12.30 The Evolution of Protein Structures (Deane)
     • Computer: 13.30-15 PAML: Testing Evolutionary Models (JH/Lyngsoe)
     • Lecture 2: 15.30-17 Molecular Evolution & Function/Structure/Selection (Meyer)
     Thursday 6.11: More Phylogenies
     • Computer: 9-10.30 Molecular Evolution on the web (JH/Lyngsoe)
     • Lecture 2: 11-12.30 Beyond Phylogenies: Networks & Recombination (Song/JH)
     • Computer: 13.30-15 Beyond Phylogenies (Song)
     • Lecture 3: 15.30-17 Molecular Evolution and the Genomes (JH/Lunter)
     Friday 7.11: Results, Advanced Topics and article discussion
     • Computer: 9-10.30 Statistical Alignment (JH/IM)
     • Lecture: 11-12.30 Article Discussion/Presentation by students
     • The Last Lunch

  2. Two Discussion Articles
     1. Timing the ancestor of the HIV-1 pandemic strains. Korber B, Muldoon M, Theiler J, Gao F, Gupta R, Lapedes A, Hahn BH, Wolinsky S, Bhattacharya T. Science. 2000 Jun 9;288(5472):1789-96.
     2. Sequencing and comparison of yeast species to identify genes and regulatory elements. Kellis M, Patterson N, Endrizzi M & Lander E. Nature. 2003 May 15; vol. 423: 241-

  3. The Data & its growth.
     • 1976/79 The first viral genomes: MS2/φX174
     • 1995 The first prokaryotic genome: H. influenzae
     • 1996 The first unicellular eukaryotic genome: Yeast
     • 1997 The first multicellular eukaryotic genome: C. elegans
     • The human genome
     • The mouse genome
     Known as of 1.5.03: >1000 viral genomes, 96 prokaryotic genomes, 16 archaebacterial genomes. A series of multicellular genomes is coming.
     There is a general increase in data involving higher structures and dynamics of biological systems.

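  4. The Nucleotides. Purines (A, G) and pyrimidines (C, T). Transitions are substitutions within a class (purine to purine, pyrimidine to pyrimidine); transversions are substitutions between the classes. [Figure: nucleotide structures.] http://www.accessexcellence.org/AB/GG/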

  5. The Amino Acids/Codons/Genes. $\{\text{nucleotides}\}^3 \to \{\text{amino acids}, \text{stop}\}$. http://www.accessexcellence.org/AB/GG/

  6. Major Application Areas of Molecular Evolution
     • Phylogenies and Classification
     • Rates of Evolution & The Molecular Clock
     • Dating
     • Functional Constraint: Negative Selection
     • Positive/Diversifying Selection
     • Structure
     • RNA Structure
     • Gene Finding
     • Homing in on Important Genes
     • Homology Searches
     • Disease Gene Mapping

  7. The Tree (?) of Life. [Figure: from the Origin of Life via LUCA (??) to Prokaryotes, Archaea and Eukaryotes (Plants, Fungi, Animals), with Viruses shown separately.]

  8. Tree of Life. Science vol.300 June 2003

  9. The Origin of Life
     When did life originate? Is the present structure a necessity or is it a random accident? How frequent is life in the Universe?
     "+":
     • Self replication easy
     • Self assembly easy
     • Many extrasolar planets
     "-":
     • Hard to make proper polymerisation
     • No convincing scenario
     • No testability
     Increased Origin Research:
     • In preparation for future NASA expeditions
     • The rise of nano biology
     • The ability to simulate larger molecular systems

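  10. Central Principles of Phylogeny Reconstruction. Three approaches applied to four sequences s1-s4 (TTCAGT, TCCAGT, GCCAAT, GCCAAT): Parsimony (total weight: 4), Distance, and Likelihood (L = 3.1*10^-7, with parameter estimates). [Figure: the four-leaf tree annotated under each method; the individual branch labels are not recoverable from the transcript.]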

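  11. From Distance to Phylogenies. What is the relationship of a, b, c, d & e? The upper triangle of the table holds the distances under a molecular clock, the lower triangle the distances with no molecular clock.

     |   | a  | b  | c  | d  | e  |
     |---|----|----|----|----|----|
     | a | -  | 22 | 10 | 22 | 22 |
     | b | 6  | -  | 22 | 16 | 14 |
     | c | 7  | 3  | -  | 22 | 22 |
     | d | 13 | 9  | 8  | -  | 16 |
     | e | 6  | 8  | 9  | 15 | -  |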
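The difference between the two halves of the table can be checked mechanically. The sketch below is an illustration, not part of the slides: it assumes the upper triangle is the clock-like matrix and the lower triangle the clock-free one, and tests the three-point (ultrametric) and four-point (additive) conditions. The function and variable names are hypothetical.

```python
from itertools import combinations

TAXA = "abcde"

# Upper triangle of the slide's table (read here as "molecular clock").
CLOCK = {"ab": 22, "ac": 10, "ad": 22, "ae": 22, "bc": 22,
         "bd": 16, "be": 14, "cd": 22, "ce": 22, "de": 16}
# Lower triangle of the slide's table (read here as "no molecular clock").
NO_CLOCK = {"ab": 6, "ac": 7, "ad": 13, "ae": 6, "bc": 3,
            "bd": 9, "be": 8, "cd": 8, "ce": 9, "de": 15}


def dist(table, x, y):
    """Look up the distance between taxa x and y regardless of order."""
    return table["".join(sorted((x, y)))]


def is_ultrametric(table):
    """Three-point condition: in every triple the two largest distances are equal."""
    for x, y, z in combinations(TAXA, 3):
        d = sorted([dist(table, x, y), dist(table, x, z), dist(table, y, z)])
        if d[1] != d[2]:
            return False
    return True


def is_additive(table):
    """Four-point condition: of the three pairings of any quadruple,
    the two largest sums of distances are equal."""
    for w, x, y, z in combinations(TAXA, 4):
        sums = sorted([dist(table, w, x) + dist(table, y, z),
                       dist(table, w, y) + dist(table, x, z),
                       dist(table, w, z) + dist(table, x, y)])
        if sums[1] != sums[2]:
            return False
    return True


print("clock:    ultrametric =", is_ultrametric(CLOCK), " additive =", is_additive(CLOCK))
print("no clock: ultrametric =", is_ultrametric(NO_CLOCK), " additive =", is_additive(NO_CLOCK))
```

Under this reading, the clock-like distances fit a rooted tree with all leaves equally far from the root, while the clock-free distances still fit a tree but only with unequal branch lengths.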

  12. Enumerating Trees: Unrooted & valency 3. Recursion: $T_n = (2n-5)\,T_{n-1}$. Initialisation: $T_1 = T_2 = T_3 = 1$. [Figure: the unrooted trees on 3, 4 and 5 labelled leaves.]
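The recursion is easy to evaluate directly. A minimal sketch (the function name is hypothetical) that shows how fast the number of unrooted trees grows with the number of leaves:

```python
def count_unrooted_trees(n: int) -> int:
    """Number of unrooted trees with n labelled leaves and internal valency 3,
    using T_n = (2n - 5) * T_{n-1} with T_1 = T_2 = T_3 = 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1  # covers n = 1, 2, 3
    for k in range(4, n + 1):
        count *= 2 * k - 5
    return count


for n in (3, 4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

The growth is super-exponential, which is why the heuristic searches on the next slide are needed instead of exhaustive enumeration.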

  13. Heuristic Searches in Tree Space. Three rearrangement moves: Nearest Neighbour Interchange, Subtree regrafting, and Subtree rerooting and regrafting. [Figure: each move illustrated on subtrees T1-T4 and leaves s1-s6.]

  14. Assignment to internal nodes: The simple way. What is the cheapest assignment of nucleotides to internal nodes, given some (symmetric) distance function d(N1,N2)? If there are k leaves, there are k-2 internal nodes and $4^{k-2}$ possible assignments of nucleotides. For k=22, this is more than $10^{12}$. [Figure: a tree with leaves labelled A, G, T, C, C, C, C, A and unknown (?) internal nodes.]

  15. 5S RNA Alignment & Phylogeny. Hein, 1990. Transitions 2, transversions 5. Total weight 843. [Figure: phylogeny of the 18 sequences; only the leaf labels survive in the transcript.] Alignment (sequence label, then sequence):
     10 tatt-ctggtgtcccaggcgtagaggaaccacaccgatccatctcgaacttggtggtgaaactctgccgcggt--aaccaatact-cg-gg-gggggccct-gcggaaaaatagctcgatgccagga--ta
     17 t--t-ctggtgtcccaggcgtagaggaaccacaccaatccatcccgaacttggtggtgaaactctgctgcggt--ga-cgatact-tg-gg-gggagcccg-atggaaaaatagctcgatgccagga--t-
     9 t--t-ctggtgtctcaggcgtggaggaaccacaccaatccatcccgaacttggtggtgaaactctattgcggt--ga-cgatactgta-gg-ggaagcccg-atggaaaaatagctcgacgccagga--t-
     14 t----ctggtggccatggcgtagaggaaacaccccatcccataccgaactcggcagttaagctctgctgcgcc--ga-tggtact-tg-gg-gggagcccg-ctgggaaaataggacgctgccag-a--t-
     3 t----ctggtgatgatggcggaggggacacacccgttcccataccgaacacggccgttaagccctccagcgcc--aa-tggtact-tgctc-cgcagggag-ccgggagagtaggacgtcgccag-g--c-
     11 t----ctggtggcgatggcgaagaggacacacccgttcccataccgaacacggcagttaagctctccagcgcc--ga-tggtact-tg-gg-ggcagtccg-ctgggagagtaggacgctgccag-g--c-
     4 t----ctggtggcgatagcgagaaggtcacacccgttcccataccgaacacggaagttaagcttctcagcgcc--ga-tggtagt-ta-gg-ggctgtccc-ctgtgagagtaggacgctgccag-g--c-
     15 g----cctgcggccatagcaccgtgaaagcaccccatcccat-ccgaactcggcagttaagcacggttgcgcccaga-tagtact-tg-ggtgggagaccgcctgggaaacctggatgctgcaag-c--t-
     8 g----cctacggccatcccaccctggtaacgcccgatctcgt-ctgatctcggaagctaagcagggtcgggcctggt-tagtact-tg-gatgggagacctcctgggaataccgggtgctgtagg-ct-t-
     12 g----cctacggccataccaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgagcccagt-tagtact-tg-gatgggagaccgcctgggaatcctgggtgctgtagg-c--t-
     7 g----cttacgaccatatcacgttgaatgcacgccatcccgt-ccgatctggcaagttaagcaacgttgagtccagt-tagtact-tg-gatcggagacggcctgggaatcctggatgttgtaag-c--t-
     16 g----cctacggccatagcaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgcgcccagt-tagtact-tg-ggtgggagaccgcctgggaatcctgggtgctgtagg-c--t-
     1 a----tccacggccataggactctgaaagcactgcatcccgt-ccgatctgcaaagttaaccagagtaccgcccagt-tagtacc-ac-ggtgggggaccacgcgggaatcctgggtgctgt-gg-t--t-
     18 a----tccacggccataggactctgaaagcaccgcatcccgt-ccgatctgcgaagttaaacagagtaccgcccagt-tagtacc-ac-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
     2 a----tccacggccataggactgtgaaagcaccgcatcccgt-ctgatctgcgcagttaaacacagtgccgcctagt-tagtacc-at-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
     5 g---tggtgcggtcataccagcgctaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagaa-cagtact-gg-gatgggtgacctcccgggaagtcctggtgccgcacc-c--c-
     13 g----ggtgcggtcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagcc-tagtact-ag-gatgggtgacctcctgggaagtcctgatgctgcacc-c--t-
     6 g----ggtgcgatcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggttggag-tagtact-ag-gatgggtgacctcctgggaagtcctaatattgcacc-c-tt-

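  16. Cost of a history: minimizing over internal states. For a given state at a node, the cost contributed by a subtree is minimised over the states (A, C, G, T) at the top of that subtree, e.g. $d(C,G) + w_C(\text{left subtree})$. [Figure: a node with left and right subtrees, each carrying a cost for A, C, G and T.]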

  17. Cost of a history: leaves (initialisation). At a leaf, Cost(N) = 0 if N is the nucleotide at the leaf, otherwise infinity. [Figure: leaves G and A, each with empty subtrees of cost 0.]

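  18. Fitch-Hartigan-Sankoff Algorithm. Each node carries a cost vector over (A,C,G,T); the entry for C, for example, is the cost of the cheapest tree hanging from this node given there is a "C" at this node. In the figure the leaves have vectors (*,0,*,*), (*,*,*,0) and (*,*,0,*), where * marks an infinite cost, one internal node has (10,2,10,2) and the root has (9,7,7,7). [Figure: the tree with these vectors and the substitution costs 5 and 2.]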
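The cost vectors on this slide can be reproduced with a short dynamic program. The sketch below is an illustration under two assumptions that are not stated on the slide: that the tree is ((C,T),G), and that the costs are those quoted on the 5S RNA slide (transitions 2, transversions 5). All names are hypothetical.

```python
NUCLEOTIDES = "ACGT"
PURINES = {"A", "G"}
INF = float("inf")


def d(x, y):
    """Assumed substitution cost: 0 if equal, 2 for a transition, 5 for a transversion."""
    if x == y:
        return 0
    is_transition = (x in PURINES) == (y in PURINES)
    return 2 if is_transition else 5


def sankoff(tree):
    """Return {nucleotide: cost of the cheapest subtree given that nucleotide at its top}.

    A tree is either a leaf (a one-letter string) or a tuple of subtrees.
    """
    if isinstance(tree, str):
        # Initialisation: cost 0 for the observed nucleotide, infinity otherwise.
        return {n: (0 if n == tree else INF) for n in NUCLEOTIDES}
    child_costs = [sankoff(child) for child in tree]
    # Recursion: for each state here, add the best choice in every child subtree.
    return {
        n: sum(min(d(n, m) + w[m] for m in NUCLEOTIDES) for w in child_costs)
        for n in NUCLEOTIDES
    }


inner = sankoff(("C", "T"))
root = sankoff((("C", "T"), "G"))
print("inner:", [inner[n] for n in NUCLEOTIDES])
print("root: ", [root[n] for n in NUCLEOTIDES])
print("minimum cost:", min(root.values()))
```

The work is proportional to the number of nodes times 16, in contrast to the $4^{k-2}$ assignments of the brute-force approach.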

  19. The Felsenstein Zone. Felsenstein-Cavender (1979). [Figure: the true tree and the reconstructed tree on s1-s4, which differ in how the four leaves are paired, together with 8 of the 16 possible binary site patterns.]

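  20. Bootstrapping. Felsenstein (1985). The columns of the original alignment (e.g. ATCTGTAGTCT) are resampled with replacement; in the example the 11 columns are drawn 1, 0, 2, 3, 0, 1, 0, 1, 2, 0 and 1 times. This is repeated to give replicate data sets 1, 2, ..., 500, and a tree is reconstructed from each. [Figure: the original alignment, the resampled alignments and the resulting four-leaf trees.]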
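The resampling step can be sketched as follows. This is an illustration only: the four-sequence alignment is borrowed from the earlier slide on phylogeny reconstruction, the seed and the function name are arbitrary, and the tree-building step applied to each replicate is left out.

```python
import random
from collections import Counter


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    Returns the replicate alignment and, for each original column,
    how many times it was drawn (like the count row on the slide).
    """
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    replicate = ["".join(seq[c] for c in picks) for seq in alignment]
    tally = Counter(picks)
    return replicate, [tally[c] for c in range(n_cols)]


ALIGNMENT = ["TTCAGT", "TCCAGT", "GCCAAT", "GCCAAT"]  # illustrative data
rng = random.Random(2003)  # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(ALIGNMENT, rng) for _ in range(500)]

first_alignment, first_counts = replicates[0]
print(first_alignment, first_counts)
```

Each replicate has the same number of sequences and columns as the original; the support for a group is then the fraction of the 500 replicate trees that contain it.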

  21. The Molecular Clock. First noted by Zuckerkandl & Pauling (1964) as an empirical fact. How can one detect it? Two situations: a known ancestor, a, at time t, and unknown ancestors. [Figure: trees on s1, s2 and s3 illustrating the two cases.]

  22. Rootings
     Purpose:
     1) To give time direction in the phylogeny & the most ancient point.
     2) To be able to define concepts such as monophyletic group.
     Methods:
     1) Outgroup: Enhance the data set with a sequence from a species definitely distant to all of them. It will be joined at the root of the original data.
     2) Midpoint: Find the midpoint of the longest path in the tree.
     3) Assume Molecular Clock.

  23. Rooting the 3 kingdoms. 3 billion years ago: no reliable clock, no outgroup. Given 2 sets of homologous proteins, e.g. MDH & LDH, can the archaea, prokaria and eukaria be rooted? [Figure: an unrooted tree of A, P and E with "Root??"; the MDH and LDH subtrees each containing A, P and E; and the combined LDH/MDH tree.]

  24. Non-contemporaneous leaves. (A.Rambaut (2000): Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics 16.4.395-399) time Contemporary sample no time structure Serial sample with time structure 1980 1990 2000 RNA viruses like HIV evolve fast enough that you can’t ignore the time structure From Drummond

  25. HIV-1 (env) evolution in nine infected individuals. Shankarappa et al (1999). From Drummond. [Figure: viral divergence (0 to 10%) against years post seroconversion (0 to 10) for patients Pt.1 to Pt.9; Patient #6 is from Wolinsky et al.; labelled sequences include HIV1U35926, HIVU95460, HIV1U36148, HIV1U36073, HIV1U36015 and HIV1U35980.]

  26. A tree sampled from the posterior distribution of a Shankarappa patient. The tree has a 'ladder-like' appearance, with Lineage A and Lineage B marked.
     • 210 sequences collected over a period of 9.5 years
     • 660 nucleotides from env: C2-V5 region
     • Only the first 285 (no alignment ambiguities) were used in this analysis
     • Effective population size and mutation rate were co-estimated using Bayesian MCMC: Ne = [4000, 6300], Mu = [0.8% - 1%] per site per year
     From Drummond

  27. Models of Amino Acid, Nucleotide & Codon Evolution
     • Amino Acids, Nucleotides & Codons
     • Continuous Time Markov Processes
     • Specific Models
     • Special Issues: Context Dependence, Rate Variation

  28. The Purpose of Stochastic Models.
     1. Molecular Evolution is Stochastic.
     2. To estimate evolutionary parameters, not observable directly:
        i. Real number of events in evolutionary history.
        ii. Rates of different kinds of events in evolutionary history.
        iii. Strength of selection against amino acid changing nucleotide substitutions.
        iv. Estimate importance of different biological factors.
     3. Survive a goodness of fit test.
     4. Serve these purposes as simply as possible.

  29. Central Problems: History cannot be observed, only end products. [Figure: a history passing through sequences such as ACGTC, ACGCC, AGGCC, AGGCT and AGGTT, of which only the end products, e.g. AGGGC, AGGTT and AGTGC, are seen.] Comment: Even if History could be observed, the underlying process couldn't.

  30. Principle of Inference: Likelihood
     • Likelihood function $L(\cdot)$: the probability of the data as a function of the parameters, $L(\Theta, D)$.
     • LogLikelihood function $l(\cdot)$: $\ln L(\Theta, D)$.
     • If the data is a series of independent experiments, $L(\cdot)$ becomes a product of the likelihoods of each experiment and $l(\cdot)$ becomes the sum of the logLikelihoods of each experiment.
     • In Likelihood analysis the parameter is not viewed as a random variable.

  31. Likelihood and logLikelihood of Coin Tossing. From Edwards (1991), Likelihood. [Figure: likelihood and logLikelihood curves.]
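A coin-tossing likelihood can be computed in a few lines. The data below (7 heads, 3 tails) are hypothetical and are not the numbers behind the Edwards figure; the sketch only illustrates that independent tosses turn the likelihood into a product and the logLikelihood into a sum.

```python
import math

TOSSES = "HHTHHHTHHT"  # hypothetical data: 7 heads, 3 tails
HEADS = TOSSES.count("H")
TAILS = TOSSES.count("T")


def log_likelihood(p, heads, tails):
    """logLikelihood of head-probability p for independent tosses."""
    return heads * math.log(p) + tails * math.log(1.0 - p)


def likelihood(p, heads, tails):
    """Likelihood: the product of the per-toss probabilities."""
    return p ** heads * (1.0 - p) ** tails


# The sum of per-toss log-probabilities equals the logLikelihood.
p = 0.6
per_toss_sum = sum(math.log(p if toss == "H" else 1.0 - p) for toss in TOSSES)
print(per_toss_sum, log_likelihood(p, HEADS, TAILS))

# A grid search finds the maximum at the observed frequency of heads.
grid = [i / 100 for i in range(1, 100)]
best_p = max(grid, key=lambda q: log_likelihood(q, HEADS, TAILS))
print("maximum likelihood estimate on the grid:", best_p)
```

Here the parameter p is treated as a fixed unknown, which is the contrast with the Bayesian view on the next slide.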

  32. Principle of Inference: Bayesian Analysis. In Bayesian Analysis the parameters are viewed as stochastic variables that have a prior distribution before the data are observed. The data depend on the parameters, and after observing the data the parameters have a posterior distribution.

  33. Simplifying Assumptions I. Data: s1=TCGGTA, s2=TGGTT. Biological setup: s1 and s2 descend from an unknown ancestor a. The quantity of interest is the probability of the data.
     1) Only substitutions. [Figure: s1 TCGGTA / s2 TGGT-T compared with s1 TCGGA / s2 TGGTT.]
     2) Processes in different positions of the molecule are independent, so the probability for the whole alignment will be the product of the probabilities of the individual patterns. [Figure: one unknown ancestral nucleotide a1, ..., a5 above each column of the two sequences.]

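  34. Simplifying Assumptions II
     3) The evolutionary process is the same in all positions.
     4) Time reversibility: Virtually all models of sequence evolution are time reversible, i.e. $\pi_i P_{i,j}(t) = \pi_j P_{j,i}(t)$, where $\pi_i$ is the stationary probability of state i and $P_{i,j}(t)$ is the probability that state i has changed into state j after time t. Summed over the unknown ancestor a, this implies that $\sum_a \pi_a P_{a,N_1}(l_1) P_{a,N_2}(l_2) = \pi_{N_1} P_{N_1,N_2}(l_1+l_2)$, so only the total distance $l_1+l_2$ between N1 and N2 matters. [Figure: ancestor a with branches l1 to N1 and l2 to N2, equivalent to a single branch of length l1+l2 between N1 and N2.]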
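Both statements can be checked numerically for a concrete reversible model. The sketch below is an assumption-laden illustration: it uses the Felsenstein 81 model from a later slide with C = 1, for which $P_{i,j}(t) = e^{-t}\delta_{ij} + (1-e^{-t})\pi_j$, and borrows the base frequencies from the HIV2 slide. The function names are hypothetical.

```python
import math

# Base frequencies borrowed from the HIV2 slide; any frequencies summing to 1 work.
PI = {"A": 0.361, "C": 0.181, "G": 0.236, "T": 0.222}


def p_f81(i, j, t):
    """Felsenstein 81 transition probability with rate constant C = 1 (assumed)."""
    decay = math.exp(-t)
    return (decay if i == j else 0.0) + (1.0 - decay) * PI[j]


def via_ancestor(n1, n2, l1, l2):
    """Probability of observing n1 and n2, summed over the unknown ancestor a."""
    return sum(PI[a] * p_f81(a, n1, l1) * p_f81(a, n2, l2) for a in PI)


def direct(n1, n2, l1, l2):
    """Same probability computed along a single branch of length l1 + l2."""
    return PI[n1] * p_f81(n1, n2, l1 + l2)


l1, l2 = 0.3, 0.8
for n1 in PI:
    for n2 in PI:
        # Detailed balance: pi_i P_ij(t) = pi_j P_ji(t).
        assert math.isclose(PI[n1] * p_f81(n1, n2, l1), PI[n2] * p_f81(n2, n1, l1))
        # The position of the ancestor along the path does not matter.
        assert math.isclose(via_ancestor(n1, n2, l1, l2), direct(n1, n2, l1, l2))
print("reversibility checks passed")
```

The practical consequence is that the root of a two-sequence comparison cannot be located from the data under a reversible model, only the total distance.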

  35. Simplifying assumptions III
     5) The nucleotide at any position evolves following a continuous time Markov chain: $P_{i,j}(t)$ is a continuous time Markov chain on the state space {A,C,G,T}. [Figure: a sample path visiting A and C, with jump times t1 and t2.]
     Q, the rate matrix, with rows (FROM) and columns (TO) in the order A, C, G, T:
     $$Q = \begin{pmatrix}
     -(q_{A,C}+q_{A,G}+q_{A,T}) & q_{A,C} & q_{A,G} & q_{A,T} \\
     q_{C,A} & -(q_{C,A}+q_{C,G}+q_{C,T}) & q_{C,G} & q_{C,T} \\
     q_{G,A} & q_{G,C} & -(q_{G,A}+q_{G,C}+q_{G,T}) & q_{G,T} \\
     q_{T,A} & q_{T,C} & q_{T,G} & -(q_{T,A}+q_{T,C}+q_{T,G})
     \end{pmatrix}$$
     6) The rate matrix, Q, for the continuous time Markov chain is the same at all times (and often at all positions). However, it is possible to let the rate of events, $r_i$, vary from site to site; the elapsed time t is then replaced by $r_i t$.

  36. Q and P(t). What is the probability of going from i (C?) to j (G?) in time t with rate matrix Q?
     i. $P(0) = I$.
     ii. $P(\varepsilon)$ is close to $I + \varepsilon Q$ for small $\varepsilon$.
     iii. $P'(0) = Q$.
     iv. $\lim_{t \to \infty} P(t)$ has the equilibrium frequencies of the 4 nucleotides in each row.
     v. The waiting time in state j, $T_j$, satisfies $P(T_j > t) = e^{q_{jj} t}$ (recall that $q_{jj} < 0$).
     vi. $QE = 0$, where $E_{ij} = 1$ for all i, j.
     vii. $PE = E$.
     viii. If $AB = BA$, then $e^{A+B} = e^{A} e^{B}$.
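These properties follow from $P(t) = e^{Qt}$, and several of them can be checked numerically. The sketch below is an illustration in plain Python: the rate values are hypothetical, and the matrix exponential is approximated by a Taylor series with scaling and squaring rather than by any particular library routine.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_exp(q, t, squarings=10, terms=20):
    """Approximate exp(Q t): Taylor series for exp(Q t / 2^s), then square s times."""
    n = len(q)
    scaled = [[x * t / 2 ** squarings for x in row] for row in q]
    result = identity(n)
    term = identity(n)
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def rate_matrix(ts, tv):
    """Hypothetical rate matrix on (A, C, G, T): ts for transitions, tv for transversions."""
    q = [[0.0, tv, ts, tv],
         [tv, 0.0, tv, ts],
         [ts, tv, 0.0, tv],
         [tv, ts, tv, 0.0]]
    for i in range(4):
        q[i][i] = -sum(q[i])  # each row of Q sums to zero (property vi)
    return q


Q = rate_matrix(ts=0.3, tv=0.1)
P_zero = mat_exp(Q, 0.0)     # property i: the identity matrix
P_short = mat_exp(Q, 1e-6)   # property ii: close to I + eps * Q
P_long = mat_exp(Q, 200.0)   # property iv: every row near the equilibrium frequencies
print([round(x, 6) for x in P_long[0]])
print([round(sum(row), 12) for row in mat_exp(Q, 1.5)])  # property vii: rows sum to 1
```

Because this Q is symmetric, the equilibrium frequencies are all 1/4, which is what the rows of P_long approach.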

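  37. Jukes-Cantor 69: Total Symmetry. Rate matrix, R, with rows (FROM) and columns (TO) in the order A, C, G, T:
     $$R = \begin{pmatrix}
     -3\alpha & \alpha & \alpha & \alpha \\
     \alpha & -3\alpha & \alpha & \alpha \\
     \alpha & \alpha & -3\alpha & \alpha \\
     \alpha & \alpha & \alpha & -3\alpha
     \end{pmatrix}$$
     Transition probabilities after time t, with $a = \alpha t$:
     $P(\text{equal}) = \tfrac14 (1 + 3e^{-4a}) \approx 1 - 3a$
     $P(\text{diff.}) = \tfrac14 (1 - e^{-4a}) \approx a$ for each specific different nucleotide, i.e. $\tfrac34 (1 - e^{-4a}) \approx 3a$ for any difference.
     Stationary distribution: (1,1,1,1)/4.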
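The Jukes-Cantor probabilities are simple enough to tabulate. A minimal sketch with hypothetical function names, using $P(\text{equal}) = \tfrac14(1 + 3e^{-4a})$ and, for each specific different nucleotide, $P(\text{diff.}) = \tfrac14(1 - e^{-4a})$ with $a = \alpha t$:

```python
import math


def jc_equal(a):
    """Probability that the nucleotide is unchanged after a = alpha * t."""
    return 0.25 * (1.0 + 3.0 * math.exp(-4.0 * a))


def jc_diff(a):
    """Probability of one specific different nucleotide after a = alpha * t."""
    return 0.25 * (1.0 - math.exp(-4.0 * a))


for a in (0.001, 0.01, 0.1, 1.0, 10.0):
    total = jc_equal(a) + 3.0 * jc_diff(a)  # the row of P(t) sums to 1
    print(f"a={a:<6} equal={jc_equal(a):.6f} diff={jc_diff(a):.6f} "
          f"linear approx: {1 - 3 * a:.6f}, {a:.6f}  row sum={total:.6f}")
```

For small a the linear approximations are accurate; for large a all four probabilities approach the stationary value 1/4, so the sequences carry no remaining signal about the elapsed time.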

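  38. Geometric/Exponential Distributions
     The Geometric Distribution on {0,1,..}, Geo(p): $P(Z = j) = p^j (1-p)$, $P(Z \ge j) = p^j$, $E(Z) = p/(1-p)$.
     The Exponential Distribution on $\mathbb{R}^+$, Exp(a): density $f(t) = a e^{-at}$, $P(X > t) = e^{-at}$.
     Properties, for X~Exp(a) and Y~Exp(b) independent:
     i. $P(X > t_2 \mid X > t_1) = P(X > t_2 - t_1)$ for $t_2 > t_1$: a Markov (memoryless) process.
     ii. $E(X) = 1/a$.
     iii. $P(Z \ge t) = P(X > t)$ at integer t when $p = e^{-a}$, so the geometric distribution approximates the exponential for small a.
     iv. $P(X < Y) = a/(a + b)$.
     v. $\min(X,Y) \sim \mathrm{Exp}(a + b)$.
     [Figure: an exponential density with mean 2.5.]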

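  39. Comparison of Pairs of Nucleotides/Sequences. Three ways to compare: the shortest path; all evolutionary paths; sample paths according to their probability. [Figure: paths between nucleotides C and G and between sequences such as CTACGT and GTATAT; ATTGTGTATATAT….CAG compared with ATTGCGTATCTAT….CCG; a scale of divergence running over Chimp, Mouse, Fish, E.coli and Higher Cells.]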

  40. From Q to P for Jukes-Cantor
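The content of this slide is not in the transcript. One way to get from Q to P(t) for Jukes-Cantor, offered as a sketch rather than as the slide's own derivation, uses the all-ones matrix J (a symbol introduced here) together with the fact that I and J commute:

```latex
\begin{align*}
Q &= \alpha\,(J - 4I), \qquad J^2 = 4J \;\Rightarrow\; J^k = 4^{k-1} J \quad (k \ge 1),\\
e^{\alpha t J} &= I + \sum_{k \ge 1} \frac{(\alpha t)^k\, 4^{k-1}}{k!}\, J
               = I + \frac{e^{4\alpha t} - 1}{4}\, J,\\
P(t) = e^{Qt} &= e^{-4\alpha t}\, e^{\alpha t J}
               = e^{-4\alpha t}\, I + \frac{1 - e^{-4\alpha t}}{4}\, J .
\end{align*}
```

Reading off the entries gives $\tfrac14(1 + 3e^{-4\alpha t})$ on the diagonal and $\tfrac14(1 - e^{-4\alpha t})$ off the diagonal.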

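  41. Kimura 2-parameter model. Q, with rows (FROM) and columns (TO) in the order A, C, G, T:
     $$Q = \begin{pmatrix}
     -2\beta-\alpha & \beta & \alpha & \beta \\
     \beta & -2\beta-\alpha & \beta & \alpha \\
     \alpha & \beta & -2\beta-\alpha & \beta \\
     \beta & \alpha & \beta & -2\beta-\alpha
     \end{pmatrix}$$
     with $a = \alpha t$ and $b = \beta t$. [Figure: the corresponding expressions for P(t), not preserved in the transcript.]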
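The expressions for P(t) are missing from the transcript. The sketch below uses the standard closed form for this rate matrix, with $\alpha$ the transition rate and $\beta$ the rate of each of the two transversions; whether the slide used exactly this form is an assumption. The function names are hypothetical and the example values are the estimates quoted on the alpha-globin slide.

```python
import math


def k2p(a, b):
    """Kimura 2-parameter probabilities for a = alpha*t, b = beta*t.

    Returns (identity, transition, one specific transversion).
    """
    e1 = math.exp(-4.0 * b)
    e2 = math.exp(-2.0 * (a + b))
    same = 0.25 + 0.25 * e1 + 0.5 * e2
    transition = 0.25 + 0.25 * e1 - 0.5 * e2
    transversion = 0.25 - 0.25 * e1
    return same, transition, transversion


def k2p_matrix(a, b):
    """Full P(t) on (A, C, G, T), matching the layout of the rate matrix Q."""
    same, ts, tv = k2p(a, b)
    return [[same, tv, ts, tv],
            [tv, same, tv, ts],
            [ts, tv, same, tv],
            [tv, ts, tv, same]]


same, ts, tv = k2p(0.3003, 0.1871)
print(f"identity={same:.4f} transition={ts:.4f} transversion={tv:.4f} "
      f"row sum={same + ts + 2 * tv:.4f}")
```

Setting $\alpha = \beta$ recovers the Jukes-Cantor probabilities, which is a quick sanity check on the formulas.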

  42. Felsenstein81 & Hasegawa, Kishino & Yano 85
     Unequal base composition (Felsenstein, 1981): $Q_{i,j} = C\,\pi_j$ for $i \ne j$.
     Transition/transversion & composition bias (Hasegawa, Kishino & Yano, 1985):
     $$Q_{i,j} = \begin{cases} (\alpha/\beta)\, C\, \pi_j & i \to j \text{ a transition} \\ C\, \pi_j & i \to j \text{ a transversion} \end{cases}$$

  43. Dayhoff's empirical approach (1970). Take a set of closely related proteins, count all differences and make a symmetric difference matrix, since time direction cannot be observed. If $q_{ij} = q_{ji}$, then the equilibrium frequencies, $\pi_i$, are all the same. After the transformation $q_{ij} \to \pi_i q_{ij}/\pi_j$, the equilibrium frequencies will be $\pi_i$.

  44. Measuring Selection. Certain events have functional consequences and will be selected out. The strength and localization of this selection is of great interest. The selection criteria could in principle be anything, but the selection against amino acid changes is without comparison the most important. [Figure: a history of two-codon sequences with their amino acids: ThrSer ACGTCA, ThrPro ACGCCA, ThrSer ACGCCG, ArgSer AGGCCG, ThrSer ACTCTG, AlaSer GCTCTG, AlaSer GCACTG.]

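  45. The Genetic Code. 3 classes of sites: 4, 2-2 and 1-1-1-1. [Figure: the codon table with examples marked: a 4 site (3rd position), a 1-1-1-1 site (3rd position) and a site at the 2nd position (TA).]
     Problems:
     i. Not all sites fit into those categories.
     ii. A change in one site can change the status of another.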

  46. Possible events in the genetic code (remade from Li, 1997). Possible number of substitutions: 61 (codons) * 3 (positions) * 3 (alternative nucleotides).

     | Substitutions       | Number | Percent |
     |---------------------|--------|---------|
     | Total in all codons | 549    | 100     |
     | Synonymous          | 134    | 25      |
     | Nonsynonymous       | 415    | 75      |
     | - Missense          | 392    | 71      |
     | - Nonsense          | 23     | 4       |
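The counts in this table can be recomputed from the standard genetic code by enumerating every single-nucleotide change in every sense codon. The sketch below does this; the compact 64-letter encoding of the code table (bases in the order T, C, A, G) and the function name are choices made for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(codon): aa
        for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_substitutions():
    """Count single-nucleotide changes out of the 61 sense codons by type."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in CODE.items():
        if aa == "*":
            continue  # only the 61 sense codons are starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = CODE[mutant]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


counts = classify_substitutions()
total = sum(counts.values())
print("total:", total)
for kind, number in counts.items():
    print(f"{kind}: {number} ({100 * number / total:.1f}%)")
```

Roughly a quarter of all possible point substitutions are synonymous, so a synonymous/nonsynonymous ratio far from 1:3 in real data points to selection or mutational bias.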

  47. Synonymous (silent) & Non-synonymous (replacement) substitutions
     Ser Thr Glu Met Cys Leu Met Gly Thr
     TCA ACT GAG ATG TGT TTA ATG GGG ACG
     GGG ACA GGG ATA TAT CTA ATG GGT AGC
     Ser Thr Gly Ile Tyr Leu Met Gly Ser
     (Differing positions are starred on the slide.)
     Ks: Number of Silent Events in Common History. Ka: Number of Replacement Events in Common History. Ns: Silent positions. Na: Replacement positions.
     Rates per position: (Ks/Ns)/(2T).
     Example: Ks = 100, Ns = 300, T = 10^8 years. Silent rate: (100/300)/(2*10^8) = 1.67 * 10^-9 /year/pos.
     Miyata: use the most silent path for calculations. For the last codon, Thr ACG to Ser AGC, the two single-step paths are ACG (Thr) → ACC (Thr) → AGC (Ser) and ACG (Thr) → AGG (Arg) → AGC (Ser).
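The rate calculation in the example is a one-liner; the small sketch below (the function name is hypothetical) makes the role of the factor 2 explicit.

```python
def rate_per_position(events, positions, divergence_time_years):
    """Substitution rate per position per year.

    The factor 2 accounts for the two lineages that have each evolved
    for divergence_time_years since the common ancestor.
    """
    return (events / positions) / (2 * divergence_time_years)


silent_rate = rate_per_position(events=100, positions=300, divergence_time_years=1e8)
print(f"silent rate: {silent_rate:.3g} per year per position")
```

The same function applies to replacement events by passing Ka and Na instead of Ks and Ns.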

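  48. Kimura's 2 parameter model & Li's Model. [Figure: the rates a (transitions) and b (transversions) between the four nucleotides, and the resulting probabilities from a given start nucleotide.] Selection on the 3 kinds of sites, as (transition rate, transversion rate):

     | Sites   | Rates      |
     |---------|------------|
     | 1-1-1-1 | (f*a, f*b) |
     | 2-2     | (a, f*b)   |
     | 4       | (a, b)     |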

  49. alpha-globin from rabbit and mouse.
     Ser Thr Glu Met Cys Leu Met Gly Gly
     TCA ACT GAG ATG TGT TTA ATG GGG GGA
     TCG ACA GGG ATA TAT CTA ATG GGT ATA
     Ser Thr Gly Ile Tyr Leu Met Gly Ile
     (Differing positions are starred on the slide.)

     | Sites   | Total | Conserved   | Transitions | Transversions |
     |---------|-------|-------------|-------------|---------------|
     | 1-1-1-1 | 274   | 246 (.8978) | 12 (.0438)  | 16 (.0584)    |
     | 2-2     | 77    | 51 (.6623)  | 21 (.2727)  | 5 (.0649)     |
     | 4       | 78    | 47 (.6026)  | 16 (.2051)  | 15 (.1923)    |

     • Z(αt,βt) = .50[1+exp(-2αt) - 2exp(-t(α+β))] (transition)
     • Y(αt,βt) = .25[1-exp(-2βt)] (transversion)
     • X(αt,βt) = .25[1+exp(-2αt) + 2exp(-t(α+β))] (identity)
     • L(observations,a,b,f) = C(429,274,77,78) * {X(a*f,b*f)^246 * Y(a*f,b*f)^12 * Z(a*f,b*f)^16} * {X(a,b*f)^51 * Y(a,b*f)^21 * Z(a,b*f)^5} * {X(a,b)^47 * Y(a,b)^16 * Z(a,b)^15}, where a = αt and b = βt.
     • Estimated Parameters: a = 0.3003, b = 0.1871, 2*b = 0.3742, (a + 2*b) = 0.6745, f = 0.1663

     | Sites   | Transitions  | Transversions  |
     |---------|--------------|----------------|
     | 1-1-1-1 | a*f = 0.0500 | 2*b*f = 0.0622 |
     | 2-2     | a = 0.3004   | 2*b*f = 0.0622 |
     | 4       | a = 0.3004   | 2*b = 0.3741   |

     • Expected number of: replacement substitutions 35.49, synonymous 75.93
     • Replacement sites: 246 + (0.3742/0.6744)*77 = 314.72
     • Silent sites: 429 - 314.72 = 114.28
     • Ks = .6644, Ka = .1127

  50. HIV2 Analysis. Hasegawa, Kishino & Yano Substitution Model.
     Parameters:

     | α*t   | β*t   | πA    | πC    | πG    | πT    |
     |-------|-------|-------|-------|-------|-------|
     | 0.350 | 0.105 | 0.361 | 0.181 | 0.236 | 0.222 |

     A second row of five values follows on the slide: 0.015, 0.005, 0.004, 0.003, 0.003.

     Selection Factors:

     | Gene | Factor | s.d.  |
     |------|--------|-------|
     | GAG  | 0.385  | 0.030 |
     | POL  | 0.220  | 0.017 |
     | VIF  | 0.407  | 0.035 |
     | VPR  | 0.494  | 0.044 |
     | TAT  | 1.229  | 0.104 |
     | REV  | 0.596  | 0.052 |
     | VPU  | 0.902  | 0.079 |
     | ENV  | 0.889  | 0.051 |
     | NEF  | 0.928  | 0.073 |

     Estimated Distance per Site: 0.194
