Fredj Tekaia Institut Pasteur [email protected]


Molecular Phylogeny. Fredj Tekaia Institut Pasteur [email protected] Examples of phylogenetic trees. Pace (2001) described a tree of life based on small subunit rRNA sequences. Pace, N. R. (1997) Science 276 , 734-740 This tree shows the main three branches described

Presentation Transcript
slide1

Molecular Phylogeny

Fredj Tekaia

Institut Pasteur

[email protected]

slide3

Pace (1997) described a tree of life based on small subunit rRNA sequences.

Pace, N. R. (1997) Science 276, 734-740

This tree shows the three main branches described by Woese and colleagues.

slide4

Chlamydiae

Fig. 1. Phylogeny of chlamydiae. 16S rRNA-based neighbor-joining tree showing the affiliation of environmental and pathogenic chlamydiae with major bacterial phyla. Arrow, to outgroup. Scale bar, 10% estimated evolutionary distance.

Science 304:728-730 (2004).

slide5

Eukaryotes

(Baldauf et al., 2000)

slide6

Evolutionary processes include:

• Phylogeny*

• Expansion*: genesis, duplication, HGT

• Exchange*: HGT

• Deletion*: loss

[Figure labels: ancestor; species genome]

slide7

Original version

Actual version

Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.

slide8

Homolog - Paralog - Ortholog

Homologs: A1, B1, A2, B2

Paralogs: A1 vs B1 and A2 vs B2

Orthologs: A1 vs A2 and B1 vs B2

[Figure labels: ancestral gene O; genes A and B; copies A1, B1, A2, B2; Species-1, Species-2; S1, S2; panels a, b; Sequence analysis]

slide9

GACGACCATAGACCAGCATAG

GACTACCATAGA-CTGCAAAG

*** ******** * *** **

Two possible positions for the indel

GACGACCATAGACCAGCATAG

GACTACCATAGACT-GCAAAG

*** *********  *** **

Molecular evolution

slide10

Molecular Phylogenetic Analysis

Study of evolutionary relationships between genes and species

• The actual pattern of evolutionary history is the phylogeny or evolutionary tree which we try to estimate.

• A tree is a mathematical structure which is used to model the actual evolutionary history of a group of sequences or organisms.

slide11

Molecular Phylogeny Analysis

• Specifying the history of gene evolution is one of the most important aims of the current study of molecular evolution;

• From a given set of aligned sequences, molecular phylogeny methods propose phylogenetic trees (inferred trees) that aim to reconstruct the history of successive divergences that took place during evolution between the sequences considered and their common ancestor. These trees may not be the same as the true tree.

• Reconstruction of phylogenetic trees is a statistical problem, and a reconstructed tree is an estimate of a true tree with a given topology and given branch length;

• The accuracy of this estimation should be statistically established;

• In practice, phylogenetic analyses usually generate phylogenetic trees with accurate parts and imprecise parts.

slide12

Gly Ala Ile Leu Asp Arg

-GGAGCCATATTAGATAGA-

-GGAGCAATTTTTGATAGA-

Gly Ala Ile Phe Asp Arg

Nucleotide, amino-acid sequences

• 3 different DNA positions but only one different amino acid position: 2 of the nucleotide substitutions are therefore synonymous and one is non-synonymous.

DNA yields more phylogenetic information than proteins. The nucleotide sequences of a pair of homologous genes have a higher information content than the amino acid sequences of the corresponding proteins, because mutations that result in synonymous changes alter the DNA sequence but do not affect the amino acid sequence. (However, amino-acid sequences are aligned more efficiently.)
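The count in the example above can be checked codon by codon. The following is a minimal Python sketch, not part of the original slides: the codon table is deliberately restricted to the codons that occur in the two example sequences (standard genetic code), and the function names are illustrative.

```python
# Partial codon table: only the codons that appear in the slide's example
# (standard genetic code). Other sequences would need the full table.
CODON_TABLE = {
    "GGA": "Gly", "GCC": "Ala", "GCA": "Ala",
    "ATA": "Ile", "ATT": "Ile",
    "TTA": "Leu", "TTT": "Phe",
    "GAT": "Asp", "AGA": "Arg",
}


def codons(seq):
    """Split an in-frame DNA sequence into consecutive codons."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def classify_differences(seq_a, seq_b):
    """Return (synonymous, non_synonymous) nucleotide differences.

    Assumption: each codon pair differs at no more than one position,
    which holds for the example; otherwise the counts are only a rough guide.
    """
    synonymous = 0
    non_synonymous = 0
    for codon_a, codon_b in zip(codons(seq_a), codons(seq_b)):
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff == 0:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += n_diff
        else:
            non_synonymous += n_diff
    return synonymous, non_synonymous


seq1 = "GGAGCCATATTAGATAGA"
seq2 = "GGAGCAATTTTTGATAGA"
print(classify_differences(seq1, seq2))  # (2, 1)
```

The two synonymous changes (GCC/GCA and ATA/ATT) are invisible at the protein level, which is the extra information the DNA alignment carries.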

slide13

Phenetics and Cladistics

Phenetics (Michener and Sokal, 1957): Pheneticists argued that classifications should encompass as many variable characters as possible, these characters being analysed by rigorous mathematical methods.

Such methods (e.g. distance-based methods) place greater emphasis on the relationships among data sets than on the paths they have taken to arrive at their current states.

Cladistics (Hennig 1966): emphasizes the need for large datasets but differs from phenetics in that it does not give equal weight to all characters.

Cladists are generally more interested in evolutionary pathways than in relationships (e.g. maximum parsimony).

slide14

Key features of DNA-based phylogenetic trees

• An unrooted tree

• Rooted trees

[Figure labels: sequences A, B, C, D; branches; external nodes; internal nodes; hypothetical ancestor; numbers 1 to 5]

slide15


Rooted and Unrooted trees

• An important distinction in phylogenetics is between trees that make an inference about a common ancestor and the direction of evolution and those that do not.

• In rooted trees a single node is designated as a common ancestor, and a unique path leads from it through evolutionary time to any other node.

• Unrooted trees only specify the relationships between nodes and say nothing about the direction in which evolution occurred.

•Roots can usually be assigned to unrooted trees through the use of an outgroup.

slide16

Key features of DNA-based phylogenetic trees

The numbers of possible rooted (N_R) and unrooted (N_U) trees for n sequences are given by:

$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$

$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

n     N_R         N_U
2     1           1
3     3           1
4     15          3
5     105         15
10    34459425    2027025

• Note that only one of all possible trees can represent the true tree that represents phylogenetic relationships among the sequences.
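The two formulas above are easy to evaluate directly. The short Python sketch below is an illustration rather than part of the original slides; the function names are made up, and the unrooted count for n = 2 is treated as a special case because the formula involves (n-3)!.

```python
from math import factorial


def n_rooted(n):
    """Number of rooted bifurcating trees for n >= 2 sequences."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def n_unrooted(n):
    """Number of unrooted bifurcating trees for n >= 2 sequences."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    if n == 2:
        return 1  # a single branch joining the two sequences
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (2, 3, 4, 5, 10):
    print(n, n_rooted(n), n_unrooted(n))
```

The counts grow so quickly that exhaustive evaluation of every topology is only feasible for a small number of sequences.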

slide17

Gene tree - Species tree

[Figure: a gene tree for genes A to E and the species tree for species A to E, with mutation events and speciation events marked.]

The two kinds of events, mutation and speciation, are not expected to occur at the same time, so a gene tree does not necessarily represent the species tree.

slide18

Gene tree - Species tree

[Figure: species tree and gene tree for A, B and C along a time axis, with duplication and speciation events marked.]

slide19

Tree construction: how to proceed?

1. Consider the set of sequences to analyse;

2. Align these sequences "properly";

3. Apply tree-construction methods;

4. Evaluate the resulting phylogenetic tree statistically.

Methodology:

1. Multiple alignment;

2. Bootstrapping;

3. Consensus tree construction and evaluation.

slide20

GACGACCATAGACCAGCATAG

GACTACCATAGA-CTGCAAAG

*** ******** * *** **

Two possible positions for the indel

GACGACCATAGACCAGCATAG

GACTACCATAGACT-GCAAAG

*** *********  *** **

Alignment is an essential preliminary to tree construction

• If errors in indel placement are made in a multiple alignment then the tree reconstructed by phylogenetic analysis is unlikely to be correct.
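Why the indel position is ambiguous can be seen by scoring both placements. The sketch below is an illustration added here, not taken from the slides; it uses the simplest possible score, the number of identical columns, and the variable names are arbitrary.

```python
def count_identical_columns(seq_a, seq_b):
    """Count aligned columns where both sequences carry the same nucleotide."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(x == y and x != "-" for x, y in zip(seq_a, seq_b))


reference = "GACGACCATAGACCAGCATAG"
placement_1 = "GACTACCATAGA-CTGCAAAG"  # gap at column 13
placement_2 = "GACTACCATAGACT-GCAAAG"  # gap at column 15

print(count_identical_columns(reference, placement_1))  # 17
print(count_identical_columns(reference, placement_2))  # 17
```

Both placements give the same number of matches, so this score alone cannot decide between them, yet each placement implies different substitutions at the neighbouring columns.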

slide21

Steps in Multiple Sequence Alignments

A common strategy of several popular multiple sequence alignment algorithms is to:

1- generate a pairwise distance matrix based on all possible pairwise alignments between the sequences being considered;

2- use a statistically based approach to construct an initial tree;

3- realign the sequences progressively in order of their relatedness according to the inferred tree;

4- construct a new tree from the pairwise distances obtained in the new multiple alignment;

5- repeat the process if the new tree is not the same as the previous one.

slide23

Procedure

1. Align the protein sequences of a family using ClustalW.

2. Align the corresponding DNA sequences, using the amino-acid alignment obtained in step 1 as a template.

• An efficient procedure consists of aligning the amino-acid sequences and using the resulting alignment as a template for the corresponding nucleotide sequences.

Alignment is then guaranteed at the codon level.

Note: remove from the multiple alignment the gaps common to the majority of the sequences considered.
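Step 2 of the procedure above amounts to copying each gap of the protein alignment into the DNA as a three-nucleotide gap. The Python sketch below illustrates the idea under simple assumptions: one-letter amino-acid codes, "-" as the gap character, and coding sequences that are in frame with no stop codon. The function name and the two short sequences are made up for the example.

```python
def thread_codons(aligned_protein, dna):
    """Project the gaps of an aligned protein onto its coding DNA."""
    n_residues = len(aligned_protein.replace("-", ""))
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be 3 times the number of residues")
    pieces = []
    position = 0
    for residue in aligned_protein:
        if residue == "-":
            pieces.append("---")  # one protein gap becomes one codon-sized gap
        else:
            pieces.append(dna[position:position + 3])
            position += 3
    return "".join(pieces)


# Hypothetical two-sequence protein alignment and the matching coding DNA.
protein_alignment = ["GAILDR", "GA-FDR"]
coding_dna = ["GGAGCCATATTAGATAGA", "GGAGCATTTGATAGA"]

for protein, dna in zip(protein_alignment, coding_dna):
    print(thread_codons(protein, dna))
```

Because gaps are only ever inserted in multiples of three, the reading frame is preserved in every sequence of the resulting nucleotide alignment.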

slide24

Phylogenetic tree construction methods

• A phylogenetic tree is characterised by its topology (form) and its length (sum of its branch lengths) ;

• Each node of a tree is an estimation of the ancestor of the elements included in this node;

• There are 3 main classes of phylogenetic methods for constructing phylogenies from sequence data :

Methods directly based on sequences :

•Maximum Parsimony : find a phylogenetic tree that explains the data, with as few evolutionary changes as possible.

•Maximum likelihood : find a tree that maximizes the probability of the genetic data given the tree.

Methods indirectly based on sequences :

• Distance based methods (Neighbour Joining (NJ)): find a tree such that branch lengths of paths between sequences (species) fit a matrix of pairwise distances between sequences.

slide25

Parsimony

The concept of parsimony is at the heart of all character-based methods of phylogenetic reconstruction.

The 2 fundamental ideas of biological parsimony are:

1- Mutations are exceedingly rare events (?) ;

2- the more unlikely events a model invokes, the less likely the model is to be correct.

As a result, the relationship that requires the fewest mutations to explain the current state of the sequences being considered is the relationship most likely to be correct.

slide26

Parsimony

Informative and Uninformative Sites:

For a parsimony approach, the positions of a multiple sequence alignment fall into two categories in terms of their information content: those that carry information (informative) and those that do not (uninformative).

Example:

seq  1 2 3 4 5 6
1    G G G G G G
2    G G G A G T
3    G G A T A G
4    G A T C A T

Position 1 is said to be invariant and is therefore uninformative, because all trees invoke the same number of mutations (0);

Position 2 is uninformative because 1 mutation occurs in all three possible trees;

Position 3 is likewise uninformative, because 2 mutations occur in all three trees; position 4 requires 3 mutations in all possible trees.

Positions 5 and 6 are informative, because one of the trees invokes only one mutation and the other 2 alternative trees both require 2 mutations.

In general, for a position to be informative regardless of how many sequences are aligned, it has to have at least 2 different nucleotides, and each of these nucleotides has to be present at least twice.

Krane & Raymer 2002
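The general rule stated above (at least two different nucleotides, each present at least twice) can be applied mechanically to the example alignment. This is a small illustrative Python sketch; the function name is arbitrary.

```python
from collections import Counter


def is_informative(column):
    """True if at least two different states each occur at least twice."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


# The four sequences of the example, one string per sequence.
alignment = ["GGGGGG", "GGGAGT", "GGATAG", "GATCAT"]

# zip(*alignment) walks through the alignment column by column.
columns = ["".join(col) for col in zip(*alignment)]
informative = [i + 1 for i, col in enumerate(columns) if is_informative(col)]
print(informative)  # [5, 6]
```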

slide27

[Figure: the three possible unrooted trees for sequences 1 to 4, drawn for each of the six alignment positions, with the nucleotide shown at each node.]

slide28

Maximum Parsimony (Fitch, 1977)

The parsimony criterion consists of determining the minimum number of changes (substitutions) required to transform a sequence into its nearest neighbor.

The maximum parsimony algorithm searches for the minimum number of genetic events (nucleotide substitutions or amino-acid changes) to infer the most parsimonious tree from a set of sequences.

The best tree is the one which needs the fewest changes.

Problems :

1. within practical computational limits, this often leads to the generation of tens or more "equally most parsimonious trees" which makes it difficult to justify the choice of a particular tree ;

2. long computation time is needed to construct a tree.
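Problem 1 already shows up on the four-sequence example used for informative sites. The sketch below counts the minimum number of changes per site with Fitch's set-intersection procedure; this is one standard way to score a fixed tree, chosen here for illustration and not necessarily the exact procedure the slides have in mind. Trees are written as nested tuples, and each of the three unrooted four-sequence trees is rooted on its internal branch, which does not change the count.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if not isinstance(tree, tuple):  # a leaf: its observed state, no change
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    common = left_set & right_set
    if common:  # children can agree: no extra change needed here
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the minimum number of changes over all alignment positions."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        states = {name: seq[site] for name, seq in alignment.items()}
        total += fitch(tree, states)[1]
    return total


alignment = {1: "GGGGGG", 2: "GGGAGT", 3: "GGATAG", 4: "GATCAT"}
trees = {
    "(1,2),(3,4)": ((1, 2), (3, 4)),
    "(1,3),(2,4)": ((1, 3), (2, 4)),
    "(1,4),(2,3)": ((1, 4), (2, 3)),
}
for label, tree in trees.items():
    print(label, tree_length(tree, alignment))
```

Two of the three trees need 9 changes and the third needs 10, so even this tiny data set yields two equally most parsimonious trees.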

slide29

Maximum Parsimony (Fitch, 1977),...

The Maximum parsimony method takes account of information pertaining to character variation in each position of the sequence multiple alignment, to recreate the series of nucleotide changes. The assumption, possibly erroneous, is that evolution follows the shortest possible route and that the correct phylogenetic tree is therefore the one that requires the minimum number of nucleotide changes to produce the observed differences between the sequences.

Trees are therefore constructed at random and the nucleotide changes that they involve calculated until all possible topologies have been examined and the one requiring the smallest number of steps identified.

This is presented as the most likely inferred tree.

slide30

Maximum likelihood

This approach is purely statistical. Probabilities are considered for every individual nucleotide substitution in a set of aligned sequences.

Example: consider three sequences that differ at a single site:

..C..

..T..

..G..

Since transitions (exchanging a purine for a purine or a pyrimidine for a pyrimidine) are observed roughly 3 times as often as transversions (exchanging a purine for a pyrimidine or vice versa), it can reasonably be argued that the sequences with C and T are more likely to be closely related to each other than either is to the sequence with G.

• Calculation of probabilities is complicated by the fact that the sequence of the common ancestor of the sequences considered is unknown.

• Furthermore, multiple substitutions may have occurred at one or more sites, and all sites are not necessarily independent or equivalent.

Still, objective criteria can be applied to calculate the probability for every site and for every possible tree that describes the relationships of the sequences in a multiple alignment.
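To make the C/T/G argument concrete, one can compute substitution probabilities under an explicit model. The slides do not name a model; the sketch below assumes the Kimura two-parameter model, in which transitions occur at rate alpha and each of the two possible transversions at rate beta. The rate values and branch length are arbitrary illustrative choices, with alpha = 6 * beta so that transitions are about three times as frequent as all transversions combined over short times.

```python
from math import exp


def k2p_probabilities(alpha, beta, t):
    """Kimura two-parameter probabilities after time t.

    Returns (no change, transition, one specific transversion).
    """
    e1 = exp(-4.0 * beta * t)
    e2 = exp(-2.0 * (alpha + beta) * t)
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2
    p_transition = 0.25 + 0.25 * e1 - 0.5 * e2
    p_transversion = 0.25 - 0.25 * e1
    return p_same, p_transition, p_transversion


# Assumed, purely illustrative parameter values.
p_same, p_ts, p_tv = k2p_probabilities(alpha=0.06, beta=0.01, t=1.0)

print("P(C -> C) =", p_same)
print("P(C -> T) =", p_ts)  # transition
print("P(C -> G) =", p_tv)  # transversion
```

With these values a C/T difference is several times more probable than a C/G difference over the same branch length, which is the intuition behind grouping the C and T sequences. A full likelihood calculation would combine such probabilities over all sites and all possible ancestral states.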

slide31

Distance matrix methods (NJ,...)

Convert sequence data into a set of discrete pairwise distance values, arranged into a matrix. Distance methods fit a tree to this matrix.

$D_{i,j}$ = the distance between sequences i and j;

$d_{i,j}$ = the sum of branch lengths on the tree path from i to j;

The phylogeny makes an estimate of the distance for each pair as the sum of branch lengths in the path from one sequence to another through the tree.

A measure of how close the tree is to D is given by the least-squares criterion:

$\sum_{i,j} \frac{(D_{i,j} - d_{i,j})^2}{D_{i,j}^2}$

The tree topology is constructed using a cluster analysis method (such as the NJ method).

Advantages:

1. easy to perform; 2. fast calculation; 3. suited to sequences with high similarity scores;

Drawbacks:

1. all sites are generally treated equally (differences in substitution rates are not taken into account); 2. not applicable to distantly related sequences; 3. some of the information is lost, particularly that pertaining to the identities of the ancestral and derived nucleotides at each position in the multiple alignment.
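The least-squares criterion given earlier on this slide can be evaluated for a candidate tree once its branch lengths are fixed. The Python sketch below does this for a four-sequence unrooted tree; the observed distances and branch lengths are invented for illustration, and the helper names are not from any particular package.

```python
from itertools import combinations


def quartet_path_distances(branch, internal, pair_left, pair_right):
    """Tree path lengths d(i,j) for the unrooted tree (pair_left | pair_right).

    branch maps each sequence to its terminal branch length; internal is the
    length of the single internal branch separating the two pairs.
    """
    distances = {}
    for x, y in combinations(pair_left + pair_right, 2):
        same_side = (x in pair_left) == (y in pair_left)
        length = branch[x] + branch[y] + (0.0 if same_side else internal)
        distances[tuple(sorted((x, y)))] = length
    return distances


def least_squares_score(observed, fitted):
    """Sum over all pairs of (D - d)^2 / D^2."""
    return sum((observed[p] - fitted[p]) ** 2 / observed[p] ** 2
               for p in observed)


# Hypothetical observed pairwise distances D(i,j).
observed = {("A", "B"): 0.3, ("A", "C"): 0.5, ("A", "D"): 0.6,
            ("B", "C"): 0.6, ("B", "D"): 0.7, ("C", "D"): 0.3}
branch = {"A": 0.1, "B": 0.2, "C": 0.1, "D": 0.2}

fit_ab_cd = quartet_path_distances(branch, 0.3, ("A", "B"), ("C", "D"))
fit_ac_bd = quartet_path_distances(branch, 0.3, ("A", "C"), ("B", "D"))

print(least_squares_score(observed, fit_ab_cd))
print(least_squares_score(observed, fit_ac_bd))
```

The first topology reproduces the observed distances almost exactly and scores close to zero, while the second scores much worse. A distance method would in addition optimise the branch lengths for each candidate topology rather than reuse fixed ones.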

slide32

The choice of the outgroup

• Most phylogenetic methods construct unrooted trees.

• It is best to root such trees on biological grounds.

• The most widely used technique consists of including in the data set to be analysed a sequence that is related to the sequences considered without belonging to the same family.

• The aim is to normalize the branches of the unrooted tree relative to the length of the branch leading to the outgroup.

slide33

Evaluation of different methods

• None of the previous methods of phylogenetic reconstruction guarantees that it yields the one true tree that describes the evolutionary history of a set of aligned sequences.

• There is at present no statistical method allowing comparisons of trees obtained from different phylogenetic methods; nevertheless many attempts have been made to compare the relative consistency of the existing methods.

• The consistency depends on many factors, including the topology and branch lengths of the real tree, the transition/transversion rate and the variability of the substitution rates.

• In practice, one infers phylogenies from sequences that generally do not meet the assumptions of the methods.

• One expects that if sequences have strong phylogenetic relationships, different methods will result in the same phylogenetic tree.

slide34

Statistical evaluation of the obtained phylogenetic tree

• The accuracy is dependent on the considered multiple sequence alignments ;

• ML estimates branch lengths, their degree of significance and their confidence limits ;

• At present, only sampling techniques allow the topology of a phylogenetic tree to be tested:

Bootstrapping

It consists of drawing columns from a sample of aligned sequences, with replacement, until one gets a data set of the same size as the original one (usually some columns are sampled several times and others left out).

slide35

Bootstrapping

• Constructs a new multiple alignment at random from the real alignment, with the same size. Note that the same column can be sampled more than once, and consequently some columns are not sampled.

ATAGCCATA

ATACCCATG

ATACCCATA

ATAGCCATA

ATCCCCCAT

TCAAATGCA

TCGAATCCA

TCAAATCCA

TCAAATGCA

TCAACACCC
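Column resampling is straightforward to write down. The following Python sketch draws one bootstrap replicate from the first alignment shown on this slide; the function name and the fixed seed are illustrative choices, and a real analysis would draw many replicates and build a tree from each.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same width."""
    n_columns = len(alignment[0])
    picks = [rng.randrange(n_columns) for _ in range(n_columns)]
    resampled = ["".join(seq[i] for i in picks) for seq in alignment]
    return resampled, picks


alignment = ["ATAGCCATA", "ATACCCATG", "ATACCCATA", "ATAGCCATA", "ATCCCCCAT"]

rng = random.Random(0)  # fixed seed so that the replicate is reproducible
sample, picks = bootstrap_alignment(alignment, rng)

print(picks)  # indices of the sampled columns; repeats are expected
for row in sample:
    print(row)
```

The same column indices are applied to every sequence, so each resampled column is still a genuine column of the original alignment.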

slide36

Methodology

1. Consider the set of sequences to analyse;

2. Align these sequences "properly";

3. Apply tree-construction methods;

4. Evaluate the resulting phylogenetic tree statistically.

1. Multiple alignment;

2. Bootstrapping (100 samples);

3. Apply tree-construction methods;

4. Consensus tree construction and evaluation.
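The last step can be illustrated by counting how often each group of sequences appears among the trees built from the bootstrap samples. The sketch below is a simplified illustration, assuming rooted trees written as nested tuples and only four hypothetical replicate trees instead of 100; the function names are made up.

```python
from collections import Counter


def clades(tree):
    """Return (all leaves, list of clades) for a nested-tuple rooted tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def clade_support(replicate_trees):
    """Fraction of replicate trees in which each non-trivial clade appears."""
    counts = Counter()
    for tree in replicate_trees:
        all_leaves, found = clades(tree)
        for clade in found:
            if clade != all_leaves:  # the full set of leaves is uninformative
                counts[clade] += 1
    return {clade: n / len(replicate_trees) for clade, n in counts.items()}


# Hypothetical trees obtained from four bootstrap samples.
replicates = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    ((("A", "B"), "C"), "D"),
]

support = clade_support(replicates)
majority = {clade for clade, value in support.items() if value > 0.5}
print(support[frozenset("AB")])  # 0.75
print(majority)
```

Only groups found in more than half of the replicates are kept in a majority-rule consensus tree; here that is the group of A and B alone.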

slide37

Example: The tree of life

Pace (1997) described a tree of life based on small subunit rRNA sequences.

Pace, N. R. (1997) Science 276, 734-740

This tree shows the three main branches described by Woese and colleagues.

slide38

References

• Phylogeny programs: http://evolution.genetics.washington.edu/phylip/sftware.html

• MEGA: http://www.megasoftware.net/

• PAML: http://abacus.gene.ucl.ac.uk/software/paml.html

Books:

• Krane, D. E. and Raymer, M. L. Fundamental Concepts of Bioinformatics.

• Brown, T. A. Genomes, 2nd edition.

• Page, R. D. M. and Holmes, E. C. Molecular Evolution: A Phylogenetic Approach. Blackwell Science.
