Molecular Clocks and HIV-1
Polly R. Walker, D.Phil Student
Dept of Zoology, University of Oxford
Summary of Talk
• Molecular clocks
• Measurably Evolving Populations (MEPs)
• Methods for measuring evolution
• Coalescent theory
• Application of the molecular clock
• Estimating divergence times
• Population dynamics using coalescent theory
• Demonstration: HIV-1 in South Africa.
The Molecular Clock
• Gene sequences accumulate substitutions at an approximately constant rate, so gene sequences can be used to time divergences. This is referred to as a ‘Molecular Clock’.
• The idea of a molecular clock was initially suggested by Zuckerkandl and Pauling in 1962. They noted that rates of amino acid replacement in animal haemoglobins were roughly proportional to real time, as judged against the fossil record.
• The “constancy” of the molecular clock is particularly striking when compared to the obvious variation in the rates of morphological evolution (e.g. the existence of “living fossils”).
There is No “Universal” Molecular Clock
Sources of variation in the clock:
• Mutation rates are variable through time
  - different generation times of organisms
  - different metabolic rates
  - different genomic systems, e.g. repair mechanisms
  - different regions, genes or sites in a molecule
  (together referred to as lineage effects: a neutralist explanation)
• The existence of “nearly” neutral mutations and fluctuations in population size (the nearly neutral theory).
• Natural selection: species adapt to variable environments.
• The molecular clock can vary over time
  - how constant is the environment?
  - how neutral is evolution?
Average Rates of Nucleotide Substitution in Different Organisms

Organism/Genome                             Substitution rate (per site, per year)
Plant chloroplast DNA                       ~1 x 10^-9
Mammalian nuclear DNA                       3.5 x 10^-9
Plant nuclear DNA                           ~5 x 10^-9
E. coli and Salmonella enterica bacteria    ~5 x 10^-9
Drosophila nuclear DNA                      1.5 x 10^-8
Mammalian mitochondrial DNA                 5.7 x 10^-8
HIV-1                                       6.6 x 10^-3
Constant Molecular Clocks are Difficult to Obtain Under Natural Selection
• The rate of substitution (k) of mutations with a selective advantage depends on:
  i. effective population size (Ne)
  ii. degree of selective advantage (s)
  iii. mutation rate (μ)
  k = 4 Ne s μ
• For natural selection to produce a molecular clock, population sizes, selection pressures and mutation rates must all be constant over evolutionary time. How true is THAT for HIV?
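As a rough numerical illustration of the formula on this slide, the sketch below evaluates k = 4 x Ne x s x mutation rate for a few effective population sizes. The function name and every parameter value are hypothetical, chosen only to show how k scales; they are not estimates for HIV or any other organism.

```python
def advantageous_substitution_rate(ne, s, mu):
    """Substitution rate of advantageous mutations, k = 4 * Ne * s * mu."""
    return 4.0 * ne * s * mu

# Hypothetical values, for illustration only.
s = 0.01    # degree of selective advantage
mu = 1e-5   # mutation rate
for ne in (1_000, 10_000, 100_000):
    k = advantageous_substitution_rate(ne, s, mu)
    print(f"Ne = {ne:>7}: k = {k:.2e}")
```

Because k is proportional to each of the three quantities, a tenfold change in effective population size alone changes the substitution rate tenfold, which is why a constant clock is hard to obtain under selection.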
Testing the Molecular Clock
• So, is there a good molecular clock?
• There are a variety of ways to test the molecular clock:
  i. The dispersion index, R(t)
  ii. The relative rate test
  iii. The Likelihood Ratio test using ML statistics.
Maximum Likelihood Tests of the Molecular Clock
[Figure: two maximum likelihood trees of Human, Chimp, Gorilla, Orang-utan and Gibbon, with axes labelled "substitutions" and "time"; log likelihoods -2660.61 and -2659.18]
• Likelihood Ratio Test: the difference in log likelihood between the two trees can be tested directly. Compare 2 x ABS(difference in lnL) with a chi-squared distribution (e.g. using Chidist) with n - 2 degrees of freedom, where n is the number of sequences.
• The two trees are not significantly different in this case (primate mitochondrial DNA).
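The likelihood ratio test on this slide can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the procedure of any particular phylogenetics package: the helper names `chi2_sf` and `clock_lrt` are hypothetical, and the chi-squared tail probability is computed with a standard recurrence that is valid for whole-number degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Start from df = 1 (odd) or df = 0 (even), then step up two degrees of freedom at a time.
    if df % 2 == 1:
        k, q = 1, math.erfc(math.sqrt(half))
    else:
        k, q = 0, 0.0
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def clock_lrt(lnl_a, lnl_b, n_taxa):
    """Return (test statistic, degrees of freedom, p-value) for the molecular clock LRT."""
    stat = 2.0 * abs(lnl_a - lnl_b)
    df = n_taxa - 2
    return stat, df, chi2_sf(stat, df)

# Log likelihoods from the slide; five primate sequences give df = 3.
stat, df, p = clock_lrt(-2660.61, -2659.18, n_taxa=5)
print(f"2 x |delta lnL| = {stat:.2f}, df = {df}, p = {p:.3f}")
```

With the two log likelihoods from the slide the statistic is 2.86 on 3 degrees of freedom, well below the 5% critical value of about 7.81, which agrees with the slide's conclusion that the clock is not rejected.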
Measurably Evolving Populations
A population that is heterochronously sampled (sampled at different times), with samples spanning hundreds or thousands of generations, and that contains a significant amount of genetic variation. This typically includes either:
1. Organisms with rapid evolution and short generation times, e.g. RNA viruses
2. Organisms with a wide range of sampling dates, e.g. ancient DNA samples
• RNA viruses often have different sampling times. Small differences can have big effects.
[Figure: time axis marked 1970, 1980, 1990, 2000]
Maximum Likelihood Estimation of Viral Substitution Rates
Programme “Tip-Date” or “Rhino”
• Construct a rooted maximum likelihood tree
• Optimise branch lengths under a single rate, with relative tip positions consistent with isolation dates
• Test the molecular clock using a likelihood ratio test
• Estimate confidence intervals for the substitution rate
The Coalescent
[Figure: diagram linking Phylogeny, Coalescent Theory and Demographic History]
• Coalescent theory tells us how phylogenies of samples from a population are affected by changes in population size and structure (demography).
• The descent of lineages is traced backwards in time, to the point when they share common ancestral alleles. The number of lineages is reduced at each coalescent event (creating nodes on the tree).
• The probability that two sequences share a common ancestor in the previous generation (i.e. that a coalescent event occurs) is 1/2N. Therefore the probability that any two sequences shared a common ancestor G generations ago is approximately:
  f(G) = (1/2N) e^{-(G-1)/2N}
• Therefore the probability that sequences sampled randomly from a population share a common ancestor depends on population size.
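To make the coalescence probability concrete, the sketch below compares the exponential form given on the slide with the exact geometric probability it approximates (no coalescence for G - 1 generations, then coalescence in generation G). The function names and the population sizes are hypothetical choices for illustration.

```python
import math

def p_coalesce_exact(g, n):
    """Exact probability that two lineages first share an ancestor g generations ago."""
    p = 1.0 / (2 * n)
    return p * (1.0 - p) ** (g - 1)

def p_coalesce_approx(g, n):
    """Exponential form from the slide: (1/2N) * exp(-(G-1)/2N)."""
    return math.exp(-(g - 1) / (2.0 * n)) / (2.0 * n)

# Hypothetical population sizes, for illustration only.
for n in (100, 10_000):
    for g in (1, 100, 1000):
        exact = p_coalesce_exact(g, n)
        approx = p_coalesce_approx(g, n)
        print(f"N = {n:>6}, G = {g:>4}: exact = {exact:.3e}, approx = {approx:.3e}")
```

The two forms agree closely once N is large, and in both the chance of a recent common ancestor falls as the population size rises.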
The Coalescent
• Changes in population size affect the distribution of coalescent times (i.e. when in time branching events occur).
• In a constant-sized population more coalescent events occur near the tips than near the root. In a growing population coalescent events occur more towards the root, because the population was smaller in the past, which makes coalescent events more likely (i.e. drift is more powerful in small populations).
[Figure: example genealogies labelled Big N and Small N, Constant (“endemic”) and Growing (“epidemic”)]
• It is therefore possible to distinguish continually large populations from those that have only recently grown in size.
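The constant-size case can be illustrated with the standard coalescent expectation: while k lineages remain, each of the k(k-1)/2 pairs coalesces at rate 1/2N, so the expected waiting time is 2N divided by the number of pairs. The sketch below is a hedged illustration with hypothetical sample and population sizes; it does not cover the growing-population case.

```python
from math import comb

def expected_intervals(n_samples, pop_size):
    """Expected generations spent with k lineages (k = n_samples down to 2), constant size."""
    return {k: 2.0 * pop_size / comb(k, 2) for k in range(n_samples, 1, -1)}

# Hypothetical sample size and population size, for illustration only.
intervals = expected_intervals(n_samples=10, pop_size=1000)
total = sum(intervals.values())
for k, t in intervals.items():
    print(f"{k:>2} lineages: {t:7.1f} generations ({100 * t / total:4.1f}% of tree height)")
```

Most coalescent events are packed into a short period near the tips, while the final two lineages account for more than half of the expected tree height. That is the constant-size pattern the slide contrasts with a growing population.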
[Figure: panels on a TIME axis labelled rapid growth, slow growth, large population, small population]
Models of Demographic History
• Constant size (endemic) population
  - 1 parameter: population size (N)
• Exponentially growing (epidemic) population
  - 2 parameters: current population size (N0) and rate of growth (r)
• More complex models:
  - logistic (growth slows down toward the present)
  - expansion (sudden change in growth rate)
• Estimate all parameters (e.g. N0, r) from the tree structure
• These nested models can be compared using the likelihood ratio test
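The two simplest models can be written as functions of time before the present. The sketch below is an illustration under stated assumptions: the exponential model is parameterised as N(t) = N0 x exp(-r t) with t measured backwards from the present and r in matching time units, and the per-pair coalescent rate is taken as 1/2N. All names and parameter values are hypothetical, and the logistic and expansion models are left out.

```python
import math
from math import comb

def constant_size(t, n):
    """Constant (endemic) model: size n at every time t before the present."""
    return float(n)

def exponential_size(t, n0, r):
    """Exponential (epidemic) model: present size n0, growth rate r, t before the present."""
    return n0 * math.exp(-r * t)

def coalescent_rate(k, size):
    """Rate at which k lineages coalesce in a population of the given size."""
    return comb(k, 2) / (2.0 * size)

# Hypothetical parameter values, for illustration only.
n0, r, k = 10_000, 0.3, 5
for t in (0, 5, 10, 20):
    size = exponential_size(t, n0, r)
    print(f"t = {t:>2}: N(t) = {size:9.1f}, coalescent rate = {coalescent_rate(k, size):.5f}")
```

Setting r = 0 recovers the constant model, which is what makes the two models nested and therefore comparable with a likelihood ratio test.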
Assumptions of the Model
A) Lineages coalesce independently
B) No more than one coalescent event can occur in a single generation
C) The time-scale is so large that it can be represented as continuous
• Works best for neutral mutations subject to genetic drift in non-recombining populations, i.e. in this case any change in the structure of the genealogy must be due to demographic processes rather than fitness differences (where fit alleles produce more branches).
Estimating Demographic History of HIV-1 Subtype C
Step 1 Sequence selection
Source: Los Alamos Sequence Database (http://hiv-web.lanl.gov)
• Large range of dates, e.g. 1989-2001
• Monophyletic (relative to a comparison group, e.g. subtype B)
• Length of sequences available: optimise length against sample size.
Example: CgagSR - ntax = 29, nchar = 1659
1986: C.ET.86.ETH2220
1993: C.IN.93.N904 C.IN.93.IN905 C.IN.93.IN101 C.IN.93.IN99
1995: C.IN.95.IN21068 C.IN.95.IN21301
1996: C.BW.96.BW17B05 C.BW.96.BWM032 C.BW.96.BW0504 C.BW.96.BW1626 C.ZM.96.ZM651 C.ZM.96.ZM751
1997: C.ZA.97.ZA012
1998: C.TZ.98.TZ013 C.TZ.98.TZ017 C.ZA.98.TV001 C.ZA.TV002
1999: C.ZA.99.DU151 C.ZA.99.DU179 C.BW.99.BW47547 C.BW.99.BWMC168
2000: C.BW.00.BW18595 C.BW.00.BW18802 C.BW.00.BW192113 C.BW.00.BW20361 C.BW.00.BW20636
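Tip-dating needs a sampling year for every sequence, and the names on this slide appear to follow a subtype.country.year.isolate pattern. The sketch below pulls the year out of such a name. The field layout is an assumption inferred from the listed names, and the `pivot` used to expand two-digit years is a hypothetical choice; names that do not match the pattern return None so they can be checked by hand.

```python
def sampling_year(name, pivot=30):
    """Return the four-digit sampling year from a name like 'C.ET.86.ETH2220', else None."""
    fields = name.split(".")
    # Assumed layout: subtype, country code, two-digit year, isolate name.
    if len(fields) != 4 or len(fields[2]) != 2 or not fields[2].isdigit():
        return None
    yy = int(fields[2])
    # Assumption: two-digit years below the pivot are 20xx, the rest 19xx.
    return (2000 if yy < pivot else 1900) + yy

names = ["C.ET.86.ETH2220", "C.IN.95.IN21068", "C.BW.00.BW18595", "C.ZA.TV002"]
for name in names:
    print(name, sampling_year(name))
```

The last name in the example lacks a year field, so it is flagged with None rather than guessed.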
Step1. Sequence Alignment
• Using Clustal AND manual alignment, e.g. Se-Al
• Remove all incomplete codons and any codons shown as * or ?, and check that the alignment is in the correct reading frame.
[Figure annotation: Sequences are out-of-frame]
Step 2. ML tree construction
• Make a Neighbour Joining tree, check this tree and remove identical or almost identical sequences.
• Estimate all parameters under a realistic evolutionary model, e.g. GTR + gamma, and derive the best ML tree.
• Rooting the tree, e.g. outgroup rooting:
  - Add in a distantly related sequence, like another subtype. Subtype B is the most distantly related sequence.
  - The sequence(s) closest to the root of the tree are defined as the outgroup.
  - Return to your original tree and use this sequence to root the tree (under rooting options).
Step 3 Tip-dating the Tree
Rhino Version 1.2 http//evolve/ox.ac.uk
  - Macintosh version: runs on MacOS9 and MacOSX
  - UNIX/Linux version: could be compiled for Windows
• Prepare the correct input format: a sequence file in nexus format, a rooted tree file, and tip dating information.
• Use the same evolutionary model here as you used to generate the tree (get commands from the manual).
• Estimate the rate of evolution (absrate) and confidence intervals (interval tree) using bootstrapping.

    Begin RHINO;
    NUCMODEL TYPE=GTR;
    TREEMODEL TYPE=TIPDATES;
    SITEMODEL TYPE=GAMMA;
    OPTIMIZE;
    STATUS param;
    interval tree:absrate;
    End;

• Carry out the likelihood ratio test: is it significant?
• The likelihood ratio test tells us whether we are justified in assuming a molecular clock. If a clock exists then the difference is not significant.
• LRT = Chidist(2 x ABS(lnL(VR) - lnL(clock)), df)
• df = n - 2
• This is a very strict measure of a molecular clock. Also look at root-to-tip regression lines.
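A root-to-tip regression is a quick, informal check to sit alongside the likelihood ratio test: plot each tip's distance from the root against its sampling year and fit a straight line. The sketch below uses ordinary least squares. The function name and the data points are hypothetical, and this is not the maximum likelihood estimate produced by a tip-dating programme.

```python
def root_to_tip_regression(years, distances):
    """Least-squares fit of root-to-tip distance on sampling year.

    Returns (slope, intercept, x_intercept); the slope is a crude rate estimate and
    the x-intercept is the year at which the fitted distance is zero.
    """
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, -intercept / slope

# Hypothetical sampling years and root-to-tip distances (substitutions per site).
years = [1986, 1993, 1995, 1996, 1998, 2000]
distances = [0.040, 0.061, 0.066, 0.070, 0.075, 0.082]
slope, intercept, root_year = root_to_tip_regression(years, distances)
print(f"rate ~ {slope:.4f} subs/site/year, fitted root year ~ {root_year:.0f}")
```

Points scattered tightly around the line suggest clock-like behaviour; a wide scatter suggests rate variation among branches, even though the tips are not independent data points in a strict statistical sense.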
Using The Clock: 1. Timing the origin of the epidemic
TMRCA (years since MRCA) = tree node height / substitution rate
• No significant difference between the timing of the two subtypes.
• Subtype C has a slightly lower point estimate for rate, but broader CIs.
• The rates can be applied to other data sets, provided they cover the same gene region.
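The TMRCA calculation on this slide is a single division, shown here as a worked sketch. The node height, substitution rate and most recent sampling year are hypothetical numbers, and the conversion to a calendar year assumes the node height is measured back from the most recently sampled tip.

```python
def tmrca_years(node_height, substitution_rate):
    """Years since the MRCA: node height (subs/site) divided by rate (subs/site/year)."""
    return node_height / substitution_rate

# Hypothetical values, for illustration only.
node_height = 0.12          # substitutions per site, root to most recent tip
substitution_rate = 0.003   # substitutions per site per year
most_recent_sample = 2000   # calendar year of the most recent tip

age = tmrca_years(node_height, substitution_rate)
print(f"TMRCA ~ {age:.0f} years, i.e. origin around {most_recent_sample - age:.0f}")
```

Repeating the division with the lower and upper confidence limits of the rate gives a rough interval for the date of origin.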
Population Dynamics
1. Determine the maximum likelihood population growth model.
2. Estimate the parameter rho under the best-fit growth model.
3. Scale skyline plots according to the substitution rate.
4. Estimate parameter R, which is the growth rate in units of year^-1, or rho/
5. Estimate the doubling time: Doubling time (years) = LN(2) / R
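The doubling-time step can be checked numerically. The sketch below assumes exponential growth at rate R per year; the rate used is a hypothetical value, not an estimate from these data.

```python
import math

def doubling_time(growth_rate):
    """Doubling time in years for exponential growth at the given rate per year."""
    return math.log(2) / growth_rate

# Hypothetical growth rate, for illustration only.
r = 0.35
t_double = doubling_time(r)
print(f"R = {r} per year gives a doubling time of {t_double:.2f} years")

# Check: growing for one doubling time multiplies the population by two.
growth_factor = math.exp(r * t_double)
```

A higher growth rate gives a shorter doubling time, so comparing doubling times is an intuitive way to compare how fast different epidemics spread.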
Results
• Within subtype C, confidence intervals overlap. Subtypes B and C show different demographic histories.
• Subtype C has a slower exponential phase than subtype B.
• Subtype C on a global level is showing a logistic trend, not yet significant, but in Africa it is still growing exponentially.
Potential Applications:
• Comparing growth rates within different groups, e.g. risk group, HLA type, or the spread of different clades.
• Detecting decreases in epidemic growth rate.
Conclusions
• Molecular clocks can be used to:
  a) time the origin of an epidemic
  b) determine population dynamics
• Your estimates are only as good as your clock.
• HIV is subject to variable rates of evolution among branches: new models that allow for this (relaxed clocks) are needed.