Lecture 1: Overview of Phylogenetic methods and applications

Allan Wilson


Charles Darwin and Alfred Russel Wallace

Evolution as descent with modification, implying relationships between organisms by unbroken genetic lines

Phylogenetics seeks to determine these genetic relationships

Alfred Russel Wallace

Darwin’s sketch: the first phylogenetic tree?

Charles Darwin


Interpretation of morphological characters is often subjective, so open to personal biases

Cynodonts (0), Morganucodonts (1), Eutriconodonts (1), Spalacotheriids (2), Eupantotheres (2), Archaic therians (2), Modern therians (2)

[Figure: alternative placements of monotremes on this tree according to Ji et al. and Hu et al.]

Opalized lower jaw of the monotreme Steropodon

e.g. Jaw rotation: weak (0), moderate (1), strong (2) as indicated by vertical wear facets on molars. Hu et al. (Nature, 1997) and Ji et al. (Nature, 1999) coded Steropodon (1) and (2) respectively, helping to account for their alternative placements of monotremes


Deoxyribonucleic acid (DNA): Watson, Crick, Wilkins and Franklin


  • Early molecular phylogenetics

  • Immunological distances

  • DNA-DNA hybridization

  • Because these methods do not give access to the actual sequences, it is difficult to apply corrections or statistical significance tests to their results


Phylogenetics is now dominated by the clearly defined 4 nucleotides and 20 amino acids

Purines: A, G

Pyrimidines: C, T

Transitions: substitutions within a class (A↔G or C↔T)

Transversions: substitutions between a purine and a pyrimidine
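The purine/pyrimidine distinction above can be turned into a small classifier. This is a minimal sketch only: the function names are hypothetical and it assumes two aligned, unambiguous DNA sequences of equal length.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Return 'transition', 'transversion', or None if the bases are identical."""
    a, b = a.upper(), b.upper()
    if a == b:
        return None
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ti_tv(seq1, seq2):
    """Count (transitions, transversions) between two aligned sequences."""
    ti = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        kind = classify_substitution(a, b)
        if kind == "transition":
            ti += 1
        elif kind == "transversion":
            tv += 1
    return ti, tv
```

Counts like these are the raw material for the ti/tv ratio used by substitution models; they are observed differences and are not corrected for multiple substitutions at the same site.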

Millions of years

Hominid phylogeny from DNA


Tree terminology

[Figure: a rooted tree and an unrooted tree of Taxon 1 to Taxon 8, labelled with the terms below]

node, internode, internal edge/branch, external edge/branch

outgroup, ingroup, sister taxa

polyphyly, paraphyly

polytomy, bifurcating


Overview of phylogenetic procedure - by example

Biological problem (the question)

Which data to obtain (data sampling)

Finding the best tree (search strategy)

Defining the best tree (optimality criterion)


1. Biological problem (the question)

What is the relationship of the extinct American Cheetah (Miracinonyx trumani) to other cats?

Two main sister group hypotheses

A. Cheetahs (Acinonyx jubatus): limb, skull and vertebrae morphology

B. Pumas (Felis concolor): Geography, early fossils less cheetah-like

See Barnett et al. (Curr. Biol., 2005)


2. Which data to obtain (data sampling)

  • Mitochondrial (mt) DNA

  • High mtDNA copy number is important because Ancient DNA is degraded

  • Inferring relatively recent (2-10 million year) divergences, so substantial sequence variation is required

mt control region: best < 2 million years

mt protein/RNA coding: best 2-25 million years

Nuclear protein-coding: best > 25 million years

[Figure: observed divergence plotted against time for the three marker types]


Mitochondrial partial NADH1 alignment for birds

#Nexus

Begin DATA;

Dimensions ntax=29 nchar=10692;

Format datatype=dna gap=-;

Matrix

Tinamou AACTATCTATTCATATCCTTATCATACATCATTCCTATTCTTATTGCA..

Emu AACCATCTCACTATATCACTCTCCTATGCAATCCCCATTCTAATCGCA..

Cassowary AACCACCTCACCATATCCCTGTCCTATGCAATCCCAATTCTAATCGCA..

Kiwi AACTACCTCACTATATCACTATCATATGTCATCCCAATTCTGATTGCA..

Rhea AACTACCTAATTATGTCCCTGTCATATGCTATCCCAATTCTAATCGCA..

Ostrich ACACACCTGACTATAGCACTCTCATACGCTGTTCCAATCCTAATTGCA..

Chicken AACCTTCTAATCATAACCTTATCCTATATTCTCCCCATCCTAATCGCC..

BrushTurkey AAACACCTCATCATATCCCTATCCTATGTTCTCCCAATTTTAATCGCC..

MagpieGoose AATCACCTCATTATAACCCTATCGTATGCCATCCCAATCCTAATCGCC..

Duck AGCTACCTCATTATATCCCTCCTATACGCCATCCCCATTCTAATCGCC..

Broadbill ACTAACCTTACCATATCCCTATCCTACGCCATCCCCGTCCTAGTTGCC..

Flycatcher ACCCACCTCATTATATCACTATCCTATGCCGTACCCATCCTAATTGCT..

ZebraFinch ATTAACCTCATCATAGCCCTCTCCTATGCCCTCCCAATCCTGATCGCA..

Rook GTCAACCTCATTATAGCACTTTCTTATGCTATCCCTATTCTAATCGCC..

Oystercatcher ACCTATCTCATTATATCCCTATCCTATGCCATCCCAATCCTGATCGCA..

Turnstone ACCTACTTCATCATATCCCTATCCTATGCAATCCCAATTCTAATTGCA..

Penguin GCTCACTTAGCCATATCCCTATCCTATGCCATCCCAATCCTCATTGCA..

Albatross ACCTATCTTGTCATGTCCCTATCATATGCCATCCCAATCCTAATCGCC..

;

End;
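To show how the matrix block above is laid out (one taxon name, then its sequence, until the closing semicolon), here is a minimal reader. It is a sketch under stated assumptions: `parse_nexus_matrix` is a hypothetical helper, it handles only simple DATA blocks like the one on the slide, and it does not implement the full NEXUS format. Truncation markers such as the trailing `..` on the slide would be kept as ordinary characters.

```python
def parse_nexus_matrix(text):
    """Return {taxon: sequence} from the Matrix section of a simple NEXUS block."""
    sequences = {}
    in_matrix = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # the transcript has blank lines between rows
        if not in_matrix:
            if line.lower() == "matrix":
                in_matrix = True
            continue
        if line.startswith(";"):
            break  # a lone semicolon closes the matrix
        parts = line.split()
        if len(parts) < 2:
            continue  # a name without sequence data
        name, chunk = parts[0], "".join(parts[1:])
        # Concatenate so that repeated names (interleaved files) also work.
        sequences[name] = sequences.get(name, "") + chunk
    return sequences
```

For real analyses an established parser is the safer choice; this only illustrates the structure of the file.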


Tree reconstruction

Type of data: distances or discrete characters (e.g. nucleotides). Reducing sequences to distances loses information, and often statistical power.

Tree-building method:

Clustering algorithm (faster), using distances: unweighted pair group method with arithmetic means (UPGMA), neighbour-joining (NJ)

Optimality criterion (slower): minimum evolution (ME) using distances; maximum parsimony (MP) and maximum likelihood (ML) using discrete characters


3. Finding the best tree (search strategy)

Number of possible trees (where n is the number of taxa)

Unrooted trees: (2n-5) × (2n-7) × … × 3 × 1

Rooted trees: (2n-3) × (2n-5) × … × 3 × 1

For the 11-taxon cat phylogeny

Unrooted = 17 × 15 × 13 × 11 × 9 × 7 × 5 × 3 × 1 = 34,459,425

Rooted = Unrooted × (2n-3) = 654,729,075

An exhaustive search will examine all trees, but is not practical for n > 12
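The counts above follow directly from the two product formulas. A minimal sketch (the function names are illustrative) that reproduces the 11-taxon figures and shows how quickly tree space grows:

```python
def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n taxa: (2n-5) x (2n-7) x ... x 3 x 1."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 5 + 1, 2):  # odd numbers 3, 5, ..., 2n-5
        count *= k
    return count


def num_rooted_trees(n):
    """Number of rooted bifurcating trees for n taxa: unrooted count x (2n-3)."""
    return num_unrooted_trees(n) * (2 * n - 3)


if __name__ == "__main__":
    for n in (4, 8, 11, 12, 20):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Each added taxon multiplies the number of unrooted trees by the next odd number, which is why exhaustive search stops being practical at around a dozen taxa.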


Reducing the time for searching “tree space”

Heuristic search

Find an initial tree, and move within near-by tree-space, discarding worse alternatives

Only a small amount of tree-space is searched and there is no guarantee of finding the optimal tree - can be trapped in local maxima

[Figure: search landscape showing the starting point, local optima and the global optimum]


Branch and Bound search

As trees are built and branches added, if the addition of a taxon to a particular branch results in a tree-length greater than a previously determined upper bound for the tree, then this topology and all those derived from it are ignored and the search continues with a new placement for that taxon

Branch and bound guarantees finding globally optimal trees

[Figure: search landscape showing the starting point, local optima and the global optimum]


4. Defining the best tree (optimality criteria)

Distance methods

Absolute distance matrix

                  1    2    3    4    5    6    7    8    9   10   11
 1 Mongoose       -
 2 Hyena        156    -
 3 Sabretooth   207  147    -
 4 Am.Cheetah   192  140  159    -
 5 Lion         186  134  148  131    -
 6 Tiger        160  143  132  111   64    -
 7 Puma         194  139  162   70  124  100    -
 8 House.Cat    206  133  163  124  118  100  117    -
 9 Cheetah      192  139  162  108  127  109   96  110    -
10 Ocelot       206  123  165  116  116   98  111   98  113    -
11 Jaguarundi   204  147  177  123  143  121  101  119  128  131    -
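As one illustration of how a distance method uses such a matrix, the sketch below expands the lower triangle into a full symmetric matrix and applies the first pair-selection step of neighbour-joining, which minimises Q(i,j) = (n-2)·d(i,j) - Σd(i,·) - Σd(j,·). The helper names are hypothetical, and only the first join is shown, not a complete NJ implementation.

```python
TAXA = ["Mongoose", "Hyena", "Sabretooth", "Am.Cheetah", "Lion", "Tiger",
        "Puma", "House.Cat", "Cheetah", "Ocelot", "Jaguarundi"]

# Lower triangle of the absolute distance matrix from the slide (row i has i values).
LOWER = [
    [],
    [156],
    [207, 147],
    [192, 140, 159],
    [186, 134, 148, 131],
    [160, 143, 132, 111, 64],
    [194, 139, 162, 70, 124, 100],
    [206, 133, 163, 124, 118, 100, 117],
    [192, 139, 162, 108, 127, 109, 96, 110],
    [206, 123, 165, 116, 116, 98, 111, 98, 113],
    [204, 147, 177, 123, 143, 121, 101, 119, 128, 131],
]


def full_matrix(lower_rows):
    """Expand lower-triangle rows into a symmetric matrix with a zero diagonal."""
    n = len(lower_rows)
    d = [[0] * n for _ in range(n)]
    for i, row in enumerate(lower_rows):
        for j, value in enumerate(row):
            d[i][j] = d[j][i] = value
    return d


def nj_first_pair(d):
    """Return (Q, i, j) for the pair that minimises the neighbour-joining Q criterion."""
    n = len(d)
    totals = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best


if __name__ == "__main__":
    q, i, j = nj_first_pair(full_matrix(LOWER))
    print("First pair joined:", TAXA[i], "+", TAXA[j], "with Q =", q)
```

Because Q subtracts each taxon's total distance to all others, the pair chosen is not necessarily the pair with the smallest raw distance; that correction is what separates NJ from simple clustering such as UPGMA.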


Early phenetics (distance/similarity) studies would note that taxon X and taxon Z are the most similar

Taxon Y TCAGCTA

Taxon X ACATGTG

Taxon Z ACGTCAG

XZ = 3 differences, YZ = 5 differences, XY = 4 differences

[Figure: phenetic tree grouping Taxon X with Taxon Z, with Taxon Y outside]
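The pairwise counts for Taxon X, Y and Z can be checked with a few lines. This is a minimal sketch of a raw difference count (function name illustrative), assuming aligned sequences of equal length:

```python
from itertools import combinations


def count_differences(seq1, seq2):
    """Number of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


SEQS = {"Taxon X": "ACATGTG", "Taxon Y": "TCAGCTA", "Taxon Z": "ACGTCAG"}

# One entry per unordered pair of taxa.
PAIRWISE = {
    (a, b): count_differences(SEQS[a], SEQS[b])
    for a, b in combinations(SEQS, 2)
}

if __name__ == "__main__":
    for pair, diff in sorted(PAIRWISE.items(), key=lambda item: item[1]):
        print(pair, diff)
```

A phenetic grouping simply takes the pair with the fewest differences, without asking whether the shared states are ancestral or derived.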


Cladistic methods, rather than being concerned with similarity, are concerned with the nature of changes (apomorphies)

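Taxon Y    TC A GCTA

Taxon X    AC A TGTG

Taxon Z    AC G TCAG

Outgroup   AA G TCTG

Labels on the alignment: synapomorphy (shared derived state), autapomorphy (derived state unique to one taxon), symplesiomorphy (shared ancestral state)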

Synapomorphies are shared derived characters and so are considered to define clades (relationship groupings)


Maximum Parsimony: chooses the tree topology that minimises the number of changes required

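* Character 3 changes G to A

[Figure: two trees of Taxon X, Taxon Y, Taxon Z and the Outgroup. The phenetic tree grouping X with Z needs two G to A changes at character 3 (homoplasy) and takes 8 steps, so it is sub-optimal. The tree grouping X with Y needs a single G to A change at character 3 (synapomorphy) and takes 7 steps: the MP tree.]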
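The 7-step and 8-step tree lengths can be reproduced by counting changes site by site. The sketch below uses Fitch's small-parsimony algorithm on trees written as nested tuples; the tree encoding and function names are choices made for this illustration, not taken from the slides.

```python
ALIGNMENT = {
    "Y": "TCAGCTA",
    "X": "ACATGTG",
    "Z": "ACGTCAG",
    "Outgroup": "AAGTCTG",
}

# Trees rooted on the outgroup, written as nested pairs.
MP_TREE = ("Outgroup", ("Z", ("X", "Y")))
PHENETIC_TREE = ("Outgroup", ("Y", ("X", "Z")))


def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):  # a tip: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:  # the children can agree, so no extra change is needed
        return shared, left_cost + right_cost
    # Otherwise one change is required somewhere below this node.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Total number of changes required by a tree over all sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


if __name__ == "__main__":
    print("X+Y tree:", tree_length(MP_TREE, ALIGNMENT))
    print("X+Z tree:", tree_length(PHENETIC_TREE, ALIGNMENT))
```

The only site that distinguishes the two trees is character 3: it costs one change when X and Y are sisters and two when X and Z are sisters.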


Maximum Likelihood: The explanation that makes the observed outcome the most likely

L = Pr(D|H)

Probability of the data, given an hypothesis

The hypothesis is a tree topology, its branch-lengths and a model under which the data evolved

First use in phylogenetics: Cavalli-Sforza and Edwards (1967) for gene frequency data; Felsenstein (1981) for DNA sequences


Model of rate change, e.g. Hasegawa-Kishino-Yano (HKY, 1985): 4 base frequencies, transition/transversion (ti/tv) ratio

[Figure: a tree with two internal nodes and branch lengths of 0.4 to 0.6 substitutions per site, redrawn once for each assignment of A, G, C or T to the internal nodes]

Sum the probabilities for each of the 16 internal node combinations to get the likelihood for this single nucleotide site
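The sum over the 16 internal node combinations can be written out explicitly. This is a sketch under loud assumptions: it swaps in the simpler Jukes-Cantor (JC69) model, with equal base frequencies and no ti/tv parameter, in place of the model named on the slide; it assumes an unrooted four-taxon tree in which tips a and b join one internal node and tips c and d join the other; and the branch lengths in the example are illustrative values, not read off the figure.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(a, b, t):
    """JC69 probability of base a becoming base b over t substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def site_likelihood(tips, branch_lengths):
    """Likelihood of one site: sum over all 4 x 4 internal node state pairs.

    tips           -- observed bases (a, b, c, d)
    branch_lengths -- (t_a, t_b, t_c, t_d, t_internal), assumed ordering
    """
    a, b, c, d = tips
    t_a, t_b, t_c, t_d, t_int = branch_lengths
    total = 0.0
    for x, y in product(BASES, repeat=2):  # the 16 combinations
        total += (
            0.25  # JC69 equilibrium frequency of state x
            * jc69_prob(x, a, t_a)
            * jc69_prob(x, b, t_b)
            * jc69_prob(x, y, t_int)
            * jc69_prob(y, c, t_c)
            * jc69_prob(y, d, t_d)
        )
    return total


if __name__ == "__main__":
    # Illustrative branch lengths (an assumption, not the slide's layout).
    print(site_likelihood("AAGG", (0.5, 0.5, 0.4, 0.4, 0.6)))
```

Practical programs avoid enumerating every combination by using Felsenstein's pruning algorithm, which gives the same value with far less work on larger trees.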


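The likelihood of a tree is the product of the site likelihoods. Taken as natural logs, the site likelihoods can be summed to give the log likelihood:

$$\ln L = \sum_{i=1}^{N} \ln L_i$$

where L_i is the likelihood of site i and N is the number of sites.

The tree with the highest lnL (equivalently, the lowest -lnL) is the ML tree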

  • ML is computationally intensive (slow)

  • If branch-lengths are long, such that substitutions occur multiple times along the same branch for the same site, ML will be more consistent than MP – if the evolutionary process is sufficiently well modelled.


Bayesian Inference: The explanation with the highest posterior probability

Bayes' Theorem

$$\Pr(H \mid D) = \frac{\Pr(H)\,\Pr(D \mid H)}{\Pr(D)}$$

Pr(H|D): posterior probability, the probability of the hypothesis given the data

Pr(H): prior probability, the probability of the hypothesis on previous knowledge

Pr(D|H): likelihood function, probability of the data given the hypothesis

Pr(D): unconditional probability of the data, a normalizing constant ensuring the posterior probabilities sum to 1.00

First use in phylogenetics: Li (1996, PhD thesis), Rannala and Yang (1996)
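For a small, discrete set of hypotheses, Bayes' theorem can be applied directly. The sketch below uses three hypothetical trees with a flat prior and made-up likelihood values, chosen only to show how Pr(D) normalises the result; none of the numbers come from a real analysis.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem over a discrete set of hypotheses.

    priors      -- {hypothesis: Pr(H)}
    likelihoods -- {hypothesis: Pr(D|H)}
    """
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # Pr(D), the normalising constant
    return {h: value / evidence for h, value in joint.items()}


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    priors = {"tree 1": 1 / 3, "tree 2": 1 / 3, "tree 3": 1 / 3}
    likelihoods = {"tree 1": 4e-5, "tree 2": 1e-5, "tree 3": 1e-5}
    for tree, prob in posterior(priors, likelihoods).items():
        print(tree, round(prob, 3))
```

With real data the set of trees is far too large to enumerate, so Pr(D) cannot be computed this way, which is why sampling methods are used instead.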


  • Bayesian inference in phylogenetics is essentially a likelihood method, but may more closely reflect the way humans think.

  • It is informed by prior knowledge (e.g. fossil data)

  • Emphasis is placed on Pr(H|D) instead of Pr(D|H)

Markov chain Monte Carlo (MCMC) is used to approximate Bayesian posterior probabilities (BPP) over 1,000s to 1,000,000s of generations

[Figure: a chain followed over generations 1 to 6 moves among Tree 1, Tree 2 and Tree 3 as proposed new states are accepted or rejected; Tree 1 is sampled in 4 of the 6 generations, so BPP(tree 1) = 4/6]
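The accept/reject idea can be sketched with a toy Metropolis sampler over three trees. Everything here is an assumption made for illustration: the weights stand in for prior × likelihood, the proposal picks one of the other trees uniformly, and real phylogenetic MCMC also samples branch lengths and model parameters.

```python
import random


def metropolis_trees(weights, generations, seed=0):
    """Estimate posterior probabilities as the fraction of generations spent in each tree.

    weights -- {tree: unnormalised posterior weight}, hypothetical values
    """
    rng = random.Random(seed)
    trees = list(weights)
    current = trees[0]
    visits = {tree: 0 for tree in trees}
    for _ in range(generations):
        # Symmetric proposal: any other tree with equal probability.
        proposal = rng.choice([t for t in trees if t != current])
        ratio = weights[proposal] / weights[current]
        if rng.random() < ratio:  # always accept a better state, sometimes a worse one
            current = proposal
        visits[current] += 1  # a rejected move counts the current tree again
    return {tree: visits[tree] / generations for tree in trees}


if __name__ == "__main__":
    bpp = metropolis_trees({"tree 1": 4.0, "tree 2": 1.0, "tree 3": 1.0}, 60000)
    print(bpp)
```

The chain never needs the normalising constant Pr(D); only ratios of weights enter the acceptance step, and the visit frequencies approach the posterior probabilities as the number of generations grows.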


Posterior probabilities are integrated over all trees in the posterior distribution – providing density distributions rather than the optimization of likelihood

[Figure: a flat prior for a parameter value (e.g. the proportion of invariant sites) over the range 0 to 1.0, and the resulting posterior for the proportion of invariant sites over the same range]


The American cheetah is related to the puma - morphological similarity to the cheetah is convergence

[Figure: two trees of Mongoose, Hyena, Sabretooth, Am.Cheetah, Puma, Jaguarundi, Cheetah, Cat, Ocelot, Lion and Tiger, with the American felids labelled. One is the maximum parsimony and neighbour-joining (distance) cladogram; the other is the maximum likelihood and Bayesian inference phylogram (scale bar: 0.05 substitutions/site).]


Applications:

The tree of life and inferring our origins


146 gene phylogeny: Delsuc et al. (Nature, 2006)

Little evidence from fossils


Identifying selection

ACA GAG CGC Threonine - Glutamic acid - Arginine

ACG GAG AGC Threonine - Glutamic acid - Serine

Synonymous (S) and non-synonymous (N) substitutions

The dN/dS ratio can be estimated along branches of phylogenetic trees (e.g. Guindon et al. PNAS, 2004)

Here dN/dS is indicated by branch width

Decreased dN/dS suggests purifying selection

Increased dN/dS suggests positive selection
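The two codon strings above can be classified by translating each codon with the standard genetic code. This is a minimal sketch with illustrative names; it only counts synonymous and non-synonymous codon differences and does not compute dN/dS itself, which also needs the numbers of synonymous and non-synonymous sites and a correction for multiple substitutions.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_changes(seq1, seq2):
    """Return (synonymous, non_synonymous) counts of differing codons."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned, in-frame coding DNA")
    syn = nonsyn = 0
    for pos in range(0, len(seq1), 3):
        codon1, codon2 = seq1[pos:pos + 3], seq2[pos:pos + 3]
        if codon1 == codon2:
            continue
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            syn += 1  # same amino acid: synonymous
        else:
            nonsyn += 1  # different amino acid: non-synonymous
    return syn, nonsyn


if __name__ == "__main__":
    print(classify_codon_changes("ACAGAGCGC", "ACGGAGAGC"))
```

In the slide's example, ACA to ACG leaves threonine unchanged (synonymous), while CGC to AGC replaces arginine with serine (non-synonymous).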


Cohen (Molec. Biol. Evol., 2002) found increased positive selection at binding sites in the MHC proteins of estuarine fish Fundulus heteroclitus populations subject to severe chemical pollution.

Non-synonymous/synonymous ratios for peptide binding regions and non-peptide binding regions

MHC (Major histocompatibility complex) binds antigens and presents them to T-cells as part of the immune response.

Positive selection at binding sites provides high MHC variability with which to confront new pathogenic threats.


Fish from the Hot spot and Gloucester populations are genetically adapted to severe chemical pollution and show novel patterns of DNA substitution for Mhc class II B locus including strong signals of positive selection at inferred antigen-binding sites

Mhc class II B with inferred locations of population-specific amino acid changes for Gloucester and Hot Spot.


Stanhope et al. (Infect. Genet. Evol., 2004)

Severe Acute Respiratory Syndrome coronavirus (SARS-CoV) has a recombinant history with lineages of types I and III coronavirus


Using more sophisticated models of sequence evolution, Holmes and Rambaut (Phil. Trans. Roy. Soc. B, 2004) could not reject a single history across the SARS genome

I

II

SARS-TOR2

III

Understanding sequence evolution, and the biases that may result from models (which are necessarily simplifications), is of vital importance in phylogenetic inference


  • Host-Parasite coevolution/co-speciation

  • Etherington et al. (J. Gen Virol, 2006)

Carnivoran strains

Artiodactyl strains

Caliciviruses infect diverse mammalian hosts and include Norovirus, the major cause of food-borne viral gastroenteritis in humans.

Host switching by caliciviruses is rare, although pigs have strains from co-speciation (artiodactyl strain) and host switching (carnivoran strain).


Fig (Ficus) and fig wasp mutualism is reflected by co-speciation patterns: Machado et al. (PNAS, 2006)


Biogeography: vicariance and dispersal


Most frequent area cladograms: mapping taxa onto landmasses

Many plants; follows wind dispersal patterns

Many land animals: follows continental break-up

Africa

S. South America

Australia

midges

New Zealand

Southern beech

Cushion herb

Marsupial mammals

From: SanMartin and Ronquist (Syst. Biol. 2004)


Conservation genetics: Amur leopard (Panthera pardus orientalis)

Relict population of 25-40 individuals in the Russian Far East.

  • Nuclear microsatellites and mtDNA: Uphyrkina et al. (J. Hered., 2002)

  • validates subspecies distinctiveness

  • extreme reduction in genetic diversity in the wild

  • captive population genetically mixed with the Chinese subspecies


Macroevolutionary inference

Cretaceous

Tertiary

65 Ma

Present

Does the 65 Ma meteor impact (Alvarez et al. Science, 1980) fully explain the “great reptile extinction” and the rise of modern birds and mammals?


Molecular clock: DNA/protein divergence between organisms is a function of time

[Figure: dated tree with time slices 144-83 Ma, 83-71 Ma, 71-68 Ma and 68-65 Ma before the K/T boundary; time axis from 95 Ma to 65 Ma]


Megafaunal extinctions (human induced or climate change)

Macrauchenia

Bison (Lascaux, France)


Arrival of humans in North America

The distribution of coalescence events over time on the tree allows inference of relative population size

Last glacial maximum

