Combining genes in phylogeny And How to test phylogeny methods … - PowerPoint PPT Presentation


Tal Pupko, Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University (talp@post.tau.ac.il).


Presentation Transcript


Slide1 l.jpg

Combining genes in phylogeny and how to test phylogeny methods…

Tal Pupko

Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University

talp@post.tau.ac.il


Slide2 l.jpg

Multiple sequence alignment (vWF)

Rat      QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK
Rabbit   QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN
Gorilla  QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR
Cat      REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR


Slide3 l.jpg

From sequences to a phylogenetic tree

VWF

Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
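The slide does not say how the tree is built from the alignment. As one possible first step, a distance-based method (such as the NJ and UPGMA methods mentioned later in these slides) starts from pairwise distances. The sketch below computes the simple proportion of differing sites (p-distance) for the four vWF fragments shown above; the function name `p_distance` is an illustrative choice, not something taken from the presentation.

```python
from itertools import combinations

# Aligned vWF fragments copied from the slide (13 residues each).
ALIGNMENT = {
    "Rat":     "QEPGGLVVPPTDA",
    "Rabbit":  "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat":     "REPGGLVVPPTEG",
}


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


if __name__ == "__main__":
    # Print the distance for every pair of species.
    for name1, name2 in combinations(ALIGNMENT, 2):
        d = p_distance(ALIGNMENT[name1], ALIGNMENT[name2])
        print(f"{name1:8s} {name2:8s} {d:.3f}")
```

The p-distance ignores multiple replacements at the same site, which is one reason model-based methods such as maximum likelihood are used later in the talk.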


Slide4 l.jpg

Multiple multiple sequence alignments

Rat      QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK
Rabbit   QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN
Gorilla  QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR
Cat      REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR

[Figure: the slide repeats this alignment block six times to represent several alignments.]


Slide5 l.jpg

Phylogenetic studies are now based on the analysis of multiple genes

Murphy et al. (2001b): 19 nuclear genes + 3 mitochondrial genes (16,400 bp)


Slide6 l.jpg

Consensus trees


Slide7 l.jpg

Consensus tree

[Figure: three tree diagrams on taxa a, b, c, d and e.]

A consensus tree summarizes information common to two or more trees.


Slide8 l.jpg

Strict consensus

[Figure: three input trees on taxa a, b, c, d and e, and their strict consensus tree.]

Strict consensus includes only those groups that occur in all the trees being considered.
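A minimal sketch of this definition, assuming each rooted tree is represented simply as the set of its groups (clades). The three trees below are hypothetical, since the trees drawn on the slide are not recoverable from the transcript.

```python
# Hypothetical rooted trees on taxa a-e, each written as its set of clades
# (the trivial clade containing all taxa is left out).
TREES = [
    {frozenset("ab"), frozenset("abc"), frozenset("de")},    # (((a,b),c),(d,e))
    {frozenset("ab"), frozenset("abc"), frozenset("abcd")},  # ((((a,b),c),d),e)
    {frozenset("bc"), frozenset("abc"), frozenset("de")},    # ((a,(b,c)),(d,e))
]


def strict_consensus(trees):
    """Keep only the clades that occur in every input tree."""
    return set.intersection(*trees)


if __name__ == "__main__":
    for clade in strict_consensus(TREES):
        print(sorted(clade))
```

With these inputs only the group {a, b, c} survives: {a, b} and {d, e} each occur in two of the three trees and are therefore dropped.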


Slide9 l.jpg

Strict consensus

[Figure: three input trees on taxa a, b, c, d and e, and their strict consensus tree.]

Problem: the split {ab} is found 2 out of 3 times, and this is not shown in the strict consensus.


Slide10 l.jpg

Majority-rule consensus

[Figure: three input trees on taxa a, b, c, d and e, and their majority-rule consensus tree.]

Majority-rule consensus: splits that are found in the majority of the trees are shown.


Slide11 l.jpg

Majority-rule consensus

[Figure: three input trees on taxa a, b, c, d and e, and their majority-rule consensus tree, with support values 67, 100 and 67 on its splits.]

The percentage of trees supporting each split is indicated.
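A minimal sketch of majority-rule consensus with support percentages, assuming each rooted tree is given as a set of clades. The three trees are hypothetical; they were chosen only so that the supports come out as 67, 100 and 67, the values shown on the slide.

```python
from collections import Counter

# Hypothetical rooted trees on taxa a-e, each written as its set of clades.
TREES = [
    {frozenset("ab"), frozenset("abc"), frozenset("de")},    # (((a,b),c),(d,e))
    {frozenset("ab"), frozenset("abc"), frozenset("abcd")},  # ((((a,b),c),d),e)
    {frozenset("bc"), frozenset("abc"), frozenset("de")},    # ((a,(b,c)),(d,e))
]


def majority_rule(trees):
    """Map each clade found in more than half of the trees to its support (%)."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {
        clade: round(100 * count / len(trees))
        for clade, count in counts.items()
        if count > len(trees) / 2
    }


if __name__ == "__main__":
    for clade, support in majority_rule(TREES).items():
        print("".join(sorted(clade)), support)
```

Groups found in only one tree, here {b, c} and {a, b, c, d}, fall below the 50% threshold and are left out of the consensus.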


Slide12 l.jpg

Problem with Majority-rule consensus

[Figure: two input trees with leaf orders (a, b, c, d, e) and (e, b, c, d, a). Majority-rule consensus = Strict consensus = a single tree on taxa a, b, c, d, e.]

However, if we consider only {b, c, d}, then in both trees b is closer to c than either b or c is to d.
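The point can be checked with a small sketch. The two trees below are hypothetical (the shapes on the slide are not recoverable), chosen to match the leaf orders a, b, c, d, e and e, b, c, d, a. Trees are nested tuples and leaves are strings; `prune` is an illustrative helper that restricts a tree to a subset of taxa.

```python
def leaf_set(tree):
    """Set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Leaf sets of all internal nodes, root included."""
    if isinstance(tree, str):
        return set()
    found = {leaf_set(tree)}
    for child in tree:
        found |= clades(child)
    return found


def prune(tree, keep):
    """Restrict a tree to the taxa in `keep`, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [sub for sub in (prune(child, keep) for child in tree) if sub is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


# Hypothetical trees in which taxa a and e swap places.
TREE_1 = ((("a", "b"), "c"), ("d", "e"))
TREE_2 = ((("e", "b"), "c"), ("d", "a"))

# Clades shared by both full trees, ignoring the root: none.
shared_groups = (clades(TREE_1) & clades(TREE_2)) - {leaf_set(TREE_1)}

# Restricted to {b, c, d}, both trees give the same subtree.
subtree_1 = prune(TREE_1, {"b", "c", "d"})
subtree_2 = prune(TREE_2, {"b", "c", "d"})

if __name__ == "__main__":
    print(shared_groups)          # empty: the consensus is unresolved
    print(subtree_1, subtree_2)   # identical subtrees on b, c, d
```

Because no group is shared, both strict and majority-rule consensus are unresolved, even though the two trees agree completely on the relationships among b, c and d.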


Slide13 l.jpg

Adams consensus

[Figure: two input trees with leaf orders (a, b, c, d, e) and (e, b, c, d, a), and their Adams consensus with leaf order (b, c, d, a, e).]

Adams consensus gives the subtrees that are common to all trees. It is useful when one or more sequences have unclear positions but a subset of the sequences has relationships that are common to all trees.


Slide14 l.jpg

Networks

[Figure: network on taxa a, b, c, d and e.]

A network is sometimes used to represent a tree in which recombination has occurred.


Slide15 l.jpg

Maximum Likelihood

[Figure: tree with nodes labeled A, C, S and X and branch lengths t1, t2 and t3.]


Slide16 l.jpg

Multiple genes analysis: concatenate analysis

e.g., Murphy et al. (2001)

Gene 1 + Gene 2 + Gene 3

Sp1: TCTGT…AACTCTTT…GAATCGTT…GCC
Sp2: TCTGC…GACTCGCT…GGAACGCT…CCC
Sp3: CTTAT…GATCTATT…GGAATATT…CGA
Sp4: CCTAT…GATCCATT…GGACCATT…CCA

[Figure: one evolutionary model and one tree on Sp1, Sp2, Sp3 and Sp4 for the concatenated alignment.]


Slide17 l.jpg

Multiple genes analysis: concatenate analysis

e.g., Murphy et al. (2001)

Gene 1            Gene 2            Gene 3
Sp1: TCTGT…AAC    Sp1: TCTTT…GAA    Sp1: TCGTT…GCC
Sp2: TCTGC…GAC    Sp2: TCGCT…GGA    Sp2: ACGCT…CCC
Sp3: CTTAT…GAT    Sp3: CTATT…GGA    Sp3: ATATT…CGA
Sp4: CCTAT…GAT    Sp4: CCATT…GGA    Sp4: CCATT…CCA

[Figure: each gene is shown with the label "Evolutionary model" and a tree on Sp1, Sp2, Sp3 and Sp4.]


Slide18 l.jpg

What are branch lengths?

Branch lengths correspond to evolutionary distance:

d = AA replacements/site = [AA replacements/(site × year)] × years = evolutionary rate × years
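As a small worked example of this relation, the sketch below converts between rate, time and branch length. The rate and the divergence time are made-up numbers for illustration only.

```python
def branch_length(rate_per_site_per_year, years):
    """Expected AA replacements per site: evolutionary rate times elapsed time."""
    return rate_per_site_per_year * years


def elapsed_years(distance, rate_per_site_per_year):
    """Invert the relation: time = distance / rate."""
    return distance / rate_per_site_per_year


# Hypothetical values: 1e-9 replacements/site/year over 100 million years.
d = branch_length(1e-9, 100e6)

if __name__ == "__main__":
    print(f"d = {d:.2f} replacements per site")
    print(f"t = {elapsed_years(d, 1e-9):.3g} years")
```

The same branch length can therefore come from a fast rate over a short time or a slow rate over a long time; the tree alone does not separate the two.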


Slide19 l.jpg

Multiple genes analysis: separate analysis

e.g., Nikaido et al. (2001)

Gene 1            Gene 2            Gene 3
Sp1: TCTGT…AAC    Sp1: TCTTT…GAA    Sp1: TCGTT…GCC
Sp2: TCTGC…GAC    Sp2: TCGCT…GGA    Sp2: ACGCT…CCC
Sp3: CTTAT…GAT    Sp3: CTATT…GGA    Sp3: ATATT…CGA
Sp4: CCTAT…GAT    Sp4: CCATT…GGA    Sp4: CCATT…CCA

[Figure: each gene has its own evolutionary model (models 1, 2 and 3) and its own tree on Sp1, Sp2, Sp3 and Sp4.]


Slide20 l.jpg

Multiple genes analysis: number of parameters

Number of species = n
Number of genes = g
Number of parameters in the model = m

                             Concatenate analysis   Separate analysis
Number of parameters         m+(2n-3)               g*(m+(2n-3))
Example (n=44; g=22; m=0)    85                     1870


Slide21 l.jpg

Multiple genes analysis

Number of parameters

Both an oversimplified model and over-parameterization may lead to wrong phylogenetic conclusions.


Slide22 l.jpg

Multiple genes analysis: proportional analysis

Gene 1            Gene 2            Gene 3
Sp1: TCTGT…AAC    Sp1: TCTTT…GAA    Sp1: TCGTT…GCC
Sp2: TCTGC…GAC    Sp2: TCGCT…GGA    Sp2: ACGCT…CCC
Sp3: CTTAT…GAT    Sp3: CTATT…GGA    Sp3: ATATT…CGA
Sp4: CCTAT…GAT    Sp4: CCATT…GGA    Sp4: CCATT…CCA

[Figure: each gene has its own evolutionary model (models 1, 2 and 3) and a tree on Sp1, Sp2, Sp3 and Sp4; the three trees are labeled Rate=1, Rate=0.5 and Rate=1.5.]
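A minimal sketch of the idea behind the proportional analysis, assuming that all genes share one set of branch lengths and that each gene only rescales them by its own rate. The shared branch lengths are made up, and the assignment of the rates 1, 0.5 and 1.5 to particular genes is an assumption, since the transcript does not preserve it.

```python
# Shared branch lengths for an unrooted 4-species tree (2n-3 = 5 branches).
# These values are hypothetical.
SHARED_BRANCHES = {
    "Sp1": 0.10,
    "Sp2": 0.12,
    "Sp3": 0.08,
    "Sp4": 0.09,
    "internal": 0.05,
}

# Relative rates from the slide; which gene gets which rate is assumed here.
GENE_RATES = {"Gene 1": 1.0, "Gene 2": 0.5, "Gene 3": 1.5}


def gene_branch_lengths(shared, rate):
    """Branch lengths of one gene: the shared lengths scaled by the gene rate."""
    return {branch: rate * length for branch, length in shared.items()}


GENE_TREES = {
    gene: gene_branch_lengths(SHARED_BRANCHES, rate)
    for gene, rate in GENE_RATES.items()
}

if __name__ == "__main__":
    for gene, lengths in GENE_TREES.items():
        print(gene, {branch: round(length, 3) for branch, length in lengths.items()})
```

Each extra gene then costs a single rate parameter rather than a full set of 2n-3 branch lengths, which is what keeps the parameter count far below that of the separate analysis.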


Slide23 l.jpg

Multiple genes analysis: number of parameters

Number of species = n
Number of genes = g
Number of parameters in the model = m

                             Concatenate analysis   Separate analysis   Proportional analysis
Number of parameters         m+(2n-3)               g*(m+(2n-3))        g-1+gm+(2n-3)
Example (n=44; g=22; m=0)    85                     1870                106
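The three parameter-count formulas can be checked directly against the example on the slide (n = 44, g = 22, m = 0). The function names below are illustrative.

```python
def params_concatenate(n, g, m):
    """One model and one set of 2n-3 branch lengths for all genes."""
    return m + (2 * n - 3)


def params_separate(n, g, m):
    """A model and a full set of branch lengths for every gene."""
    return g * (m + (2 * n - 3))


def params_proportional(n, g, m):
    """One shared set of branch lengths, g-1 relative rates, a model per gene."""
    return (g - 1) + g * m + (2 * n - 3)


if __name__ == "__main__":
    n, g, m = 44, 22, 0
    print("concatenate: ", params_concatenate(n, g, m))
    print("separate:    ", params_separate(n, g, m))
    print("proportional:", params_proportional(n, g, m))
```

The proportional analysis adds only g-1 parameters to the concatenate analysis when m = 0, while the separate analysis multiplies the count by g.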


Slide24 l.jpg

Aims of our study

To compare 3 types of multiple-genes analysis:

Concatenate analysis

Separate analysis

Proportional analysis

3 protein datasets:

Mitochondrial data set [56 species, 12 genes]

Nuclear dataset (“short genes”) [46 species, 6 genes]

Nuclear dataset (“long genes”) [28 species, 4 genes]

(Short genes: based on the Murphy dataset)


Slide25 l.jpg

Comparing topologies

Morphological topology (based on McKenna and Bell, 1997)

[Figure: morphological tree of 56 species. Labeled groups: Archonta, Glires, Ungulata, Carnivora, Insectivora, Xenarthra.]

Taxa, in the order listed on the slide: Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Tree shrew, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Horseshoe bat, Little red flying fox, Ryukyu flying fox, Mouse, Rat, Vole, Cane-rat, Guinea pig, Squirrel, Dormouse, Rabbit, Pika, Pig, Hippopotamus, Sheep, Cow, Alpaca, Blue whale, Fin whale, Sperm whale, Donkey, Horse, Indian rhino, White rhino, Elephant, Aardvark, Grey seal, Harbor seal, Dog, Cat, Asiatic shrew, Long-clawed shrew, Small Madagascar hedgehog, Hedgehog, Gymnure, Mole, Armadillo, Bandicoot, Wallaroo, Opossum, Platypus.


Slide26 l.jpg

Aims of our study

Mitochondrial topology

[Figure: mitochondrial tree of 56 species. Labeled groups: Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2, Hedgehogs.]

Taxa, in the order listed on the slide: Donkey, Horse, Indian rhino, White rhino, Grey seal, Harbor seal, Dog, Cat, Blue whale, Fin whale, Sperm whale, Hippopotamus, Sheep, Cow, Alpaca, Pig, Little red flying fox, Ryukyu flying fox, Horseshoe bat, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Asiatic shrew, Long-clawed shrew, Mole, Small Madagascar hedgehog, Aardvark, Elephant, Armadillo, Rabbit, Pika, Tree shrew, Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Squirrel, Dormouse, Cane-rat, Guinea pig, Mouse, Rat, Vole, Hedgehog, Gymnure, Bandicoot, Wallaroo, Opossum, Platypus.


Slide27 l.jpg

Aims of our study

Nuclear topology (Madsen tree)

[Figure: nuclear tree of 28 species, with numbered markers 1 to 4. Labeled groups: Chiroptera, Eulipotyphla, Pholidota, Cetartiodactyla, Carnivora, Perissodactyla, Glires, Scandentia + Dermoptera, Primate, Xenarthra, Afrotheria.]

Taxa, in the order listed on the slide: Round Eared Bat, Flying Fox, Hedgehog, Mole, Pangolin, Whale, Hippo, Cow, Pig, Cat, Dog, Horse, Rhino, Rat, Capybara, Rabbit, Flying Lemur, Tree Shrew, Human, Galago, Sloth, Hyrax, Dugong, Elephant, Aardvark, Elephant Shrew, Opossum, Kangaroo.


Slide28 l.jpg

Comparing different models using the Akaike Information Criterion (AIC)

A model which minimizes the AIC is considered to be the most appropriate model.
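The slide does not spell out the formula; the standard definition is AIC = -2 ln(L) + 2k, where k is the number of free parameters (reported as df in the result tables). The sketch below applies it to the mitochondrial results reported later in these slides. The pairing of each df and ln(L) value with an analysis is inferred from the parameter-count formulas, because the transcript lists the table cells out of order.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 ln(L) + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_params


# (ln L, df) for the mitochondrial dataset, N-Gamma rate model.
# The assignment of values to analyses is an inference, not stated verbatim.
MITOCHONDRIAL = {
    "concatenate":  (-91188.71, 121),
    "proportional": (-90999.30, 132),
    "separate":     (-89921.78, 1320),
}

SCORES = {name: aic(lnl, k) for name, (lnl, k) in MITOCHONDRIAL.items()}
BEST = min(SCORES, key=SCORES.get)

if __name__ == "__main__":
    for name, score in SCORES.items():
        print(f"{name:13s} AIC = {score:.2f}")
    print("lowest AIC:", BEST)
```

The separate analysis has the highest likelihood, but its 1320 parameters are penalized heavily enough that the proportional analysis ends up with the lowest AIC.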


Slide29 l.jpg

Results: the best multiple gene analysis

The proportional analysis is the best for the mitochondrial dataset

(Mitochondrial tree, N-Gamma rate model)

         Separate analysis   Concatenate analysis   Proportional analysis
df       1320                121                    132
Ln(L)    -89921.78           -91188.71              -90999.30
AIC      182483.55           182619.42              182262.60


Slide30 l.jpg

Results: the best multiple gene method

The proportional analysis is the best for the nuclear dataset (“short genes”)

(Murphy dataset, Madsen tree, N-Gamma rate model)

         Separate analysis   Concatenate analysis   Proportional analysis
df       540                 95                     100
Ln(L)    -11192.12           -11618.67              -11543.87
AIC      23464.23            23427.33               23287.74


Slide31 l.jpg

Results: the best multiple gene method

The separate analysis is the best for the nuclear dataset (“long genes”)

(Madsen dataset, Murphy tree, N-Gamma rate model)

         Separate analysis   Concatenate analysis   Proportional analysis
df       216                 57                     60
Ln(L)    -31153.28           -31519.10              -31406.81
AIC      62738.56            63152.21               62933.63


Slide32 l.jpg

Conclusion: the best multiple gene method

1- The concatenate analysis is never the best way to analyze multiple genes.

2- Selecting between the separate analysis and the proportional analysis depends on the data considered:

The proportional model is better suited to short genes, the separate model to longer sequences.


Slide33 l.jpg

Results: mammalian phylogeny

  • The morphological tree is always rejected

  • P(K-H test) < 0.05

  • whatever the model used

  • whatever the dataset


Slide34 l.jpg

Results: mammalian phylogeny

  • The mitochondrial tree is the best tree for the mitochondrial dataset. But we cannot reject the nuclear tree.

  • The nuclear tree is the best for the nuclear datasets, and we can reject the mitochondrial tree.

Conclusion (Topology): It seems that the nuclear tree is the best tree among the 3 alternative trees.


Slide35 l.jpg

Modeling site rate variation

The gamma distribution:

Homogeneous model: $F(t+x) = F(t)\,P(x)$

Gamma model: $F(t+x) = \sum_{n=1}^{c} \frac{1}{c}\, F(t)\, P(x R_n)$

[Figure: gamma distribution, site proportions f(r) plotted against substitution rates (R).]
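A minimal sketch of the difference between the two models, assuming c equally weighted rate categories. To keep it self-contained it uses the Jukes-Cantor (JC69) nucleotide model for P, which is a stand-in and not the amino-acid model used in the study; the four category rates are made up and chosen to have mean 1.

```python
import math


def p_same_jc69(distance):
    """JC69 probability that a site is in the same state after `distance`."""
    return 0.25 + 0.75 * math.exp(-4.0 * distance / 3.0)


def p_same_homogeneous(x):
    """Homogeneous model: every site evolves at the same rate."""
    return p_same_jc69(x)


def p_same_rate_categories(x, rates):
    """Rate-variation model: average P(x * R_n) over c equally likely categories."""
    c = len(rates)
    return sum(p_same_jc69(x * rate) for rate in rates) / c


# Hypothetical category rates with mean 1.0 (not estimated from any data).
RATES = [0.1, 0.5, 1.0, 2.4]

if __name__ == "__main__":
    x = 0.5
    print("homogeneous:    ", round(p_same_homogeneous(x), 4))
    print("rate categories:", round(p_same_rate_categories(x, RATES), 4))
```

With the same mean rate, the model with rate variation predicts more unchanged sites than the homogeneous model, because the slow categories contribute many sites that hardly change.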


Slide36 l.jpg

Likelihoods with rate variation

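[Figure: the same tree, with nodes labeled A, G and C and branch lengths d1, d2 and d3, shown twice: once labeled "Continuous" and once labeled "Discrete".]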


Slide37 l.jpg

Results: the best site-rate variation model

Mitochondrial dataset

(Mitochondrial tree, proportional analysis)

         Homogeneous model   1-Gamma model   N-Gamma model
df       120                 121             132
Ln(L)    -98998.68           -91094.30       -90999.30
AIC      198237.37           182430.61       182262.60


Slide38 l.jpg

Conclusion: the best site-rate variation model

The N-Gamma model is always the best site-rate variation model.


Slide39 l.jpg

Combining Multiple Genes

Collaborations

Dorothee Huchon (Florida State University)

Masami Hasegawa (Institute of Statistical Mathematics)

Norihiro Okada (Tokyo Institute of Technology)

Ying Cao (ISM).


Slide40 l.jpg

Known phylogenies


Slide41 l.jpg

Known phylogenies

The best way to test different methods of phylogenetic reconstruction is on trees that are known to be true from other sources…

Problem: known phylogenies are very rare.

Known phylogenies: laboratory animals, crop plants (and even those are often suspect). Also, their evolutionary rate is very low…


Slide42 l.jpg

Known phylogenies

David Hillis and colleagues have created “experimental” phylogenies in the lab.


Slide43 l.jpg

Known phylogenies

They used bacteriophage T7 and subdivided cultures of it in the presence of a mutagen. They then sequenced a marker gene from the final cultures and gave the sequences as input to a few phylogenetic methods. The output of the tree-building methods was compared to the true tree.
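The slides do not say how the inferred trees were compared to the true tree. One common way to make such a comparison is to count the groups that appear in one tree but not in the other (a Robinson-Foulds style distance on rooted trees). The sketch below does this for two hypothetical trees written as nested tuples; it is an illustration, not the procedure used by Hillis and colleagues.

```python
def leaf_set(tree):
    """Set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Leaf sets of all internal nodes, root included."""
    if isinstance(tree, str):
        return set()
    found = {leaf_set(tree)}
    for child in tree:
        found |= clades(child)
    return found


def clade_distance(tree1, tree2):
    """Number of clades found in exactly one of the two rooted trees."""
    return len(clades(tree1) ^ clades(tree2))


# Hypothetical "true" tree and a reconstruction that misplaces one taxon.
TRUE_TREE = ((("a", "b"), "c"), ("d", "e"))
INFERRED_TREE = (("a", ("b", "c")), ("d", "e"))

if __name__ == "__main__":
    print("distance to itself:  ", clade_distance(TRUE_TREE, TRUE_TREE))
    print("distance to inferred:", clade_distance(TRUE_TREE, INFERRED_TREE))
```

A distance of zero means the method recovered exactly the known topology; larger values count how many groups differ between the two trees.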


Slide44 l.jpg

Known phylogenies

In fact, they used restriction-site data to infer the phylogeny, using MP, NJ, UPGMA and others.

All methods reconstructed the true tree.


Slide45 l.jpg

Known phylogenies

They also compared outputs of ancestral sequence reconstruction, using MP.

97.3% of the ancestral states were correctly reconstructed.

Encouraging!


Slide46 l.jpg

Known phylogenies

Criticism: (1) The true tree was very easy to infer, because it was well balanced and all nodes were accompanied by numerous changes.

(2) The mutations by a single mutagen do not reflect reality.


Thank you l.jpg

Thank You…

