Combining genes in phylogeny And How to test phylogeny methods …

slide1
Combining genes in phylogeny

And

How to test phylogeny methods…

Tal Pupko

Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel-Aviv University


slide2
Multiple sequence alignment (vWF)

Rat QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK

Rabbit QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN

Gorilla QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR

Cat REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR

slide3
From sequences to a phylogenetic tree

VWF

Rat QEPGGLVVPPTDA

Rabbit QEPGGMVVPPTDA

Gorilla QEPGGLVVPPTDA

Cat REPGGLVVPPTEG

slide4
Multiple multiple sequence alignments

Rat QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK

Rabbit QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN

Gorilla QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR

Cat REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR

[Figure: the alignment above is repeated five more times on the slide, showing several alignments side by side]

slide5
Phylogenetic studies are now based on the analysis of multiple genes

Murphy et al. (2001b): 19 nuclear genes + 3 mitochondrial genes (16,400 bp)

slide7
Consensus tree

[Figure: trees on the taxa a, b, c, d, e]

A consensus tree summarizes information common to two or more trees.

slide8
Strict consensus

[Figure: trees on the taxa a, b, c, d, e, together with their strict consensus tree]

Strict consensus includes only those groups that occur in all the trees being considered.
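The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each rooted tree is represented by the set of its groups (clusters of taxa), and the three trees are hypothetical examples, not the trees drawn on the slide.

```python
# Hypothetical rooted trees on taxa a..e, each written as the set of its
# non-trivial groups (clusters). These are NOT the trees from the slide.
tree1 = {frozenset("ab"), frozenset("abc"), frozenset("de")}
tree2 = {frozenset("ab"), frozenset("abc"), frozenset("de")}
tree3 = {frozenset("bc"), frozenset("abc"), frozenset("abcd")}


def strict_consensus(trees):
    """Return only the groups that occur in every tree."""
    common = set(trees[0])
    for clusters in trees[1:]:
        common &= clusters
    return common


consensus = strict_consensus([tree1, tree2, tree3])
print(sorted("".join(sorted(group)) for group in consensus))  # ['abc']
```

In this toy input the group {a, b} occurs in two of the three trees, so it is dropped from the strict consensus.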

slide9
Strict consensus

[Figure: trees on the taxa a, b, c, d, e, together with their strict consensus tree]

Problem: the split {ab} is found 2 out of 3 times, and this is not shown in the strict consensus.

slide10
Majority-rule consensus

[Figure: trees on the taxa a, b, c, d, e, together with their majority-rule consensus tree]

Majority-rule consensus: splits that are found in the majority of the trees are shown.
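A minimal sketch of the majority rule, assuming each rooted tree is given as a set of groups of taxa. The three input trees are hypothetical and chosen only for illustration; the function name `majority_rule` is likewise an assumption.

```python
from collections import Counter

# Hypothetical rooted trees on taxa a..e, written as sets of groups.
trees = [
    {frozenset("ab"), frozenset("abc"), frozenset("de")},
    {frozenset("ab"), frozenset("abc"), frozenset("de")},
    {frozenset("bc"), frozenset("abc"), frozenset("abcd")},
]


def majority_rule(trees):
    """Map each group found in more than half of the trees to its support (%)."""
    counts = Counter(group for clusters in trees for group in clusters)
    n = len(trees)
    return {group: 100 * k / n for group, k in counts.items() if k > n / 2}


support = majority_rule(trees)
for group, pct in sorted(support.items(), key=lambda kv: "".join(sorted(kv[0]))):
    print("".join(sorted(group)), round(pct))
# ab 67
# abc 100
# de 67
```

Groups seen in only one of the three trees (here {b, c} and {a, b, c, d}) fall below the 50% threshold and are left out.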

slide11
Majority-rule consensus

[Figure: trees on the taxa a, b, c, d, e, together with their majority-rule consensus tree; the consensus splits are labeled 67, 100 and 67]

The percentage of trees supporting each split is indicated.

slide12
Problem with Majority-rule consensus

[Figure: two input trees with leaves ordered a, b, c, d, e and e, b, c, d, a. The majority-rule consensus and the strict consensus are the same tree on a, b, c, d, e]

However, if we consider only {b, c, d}, then in both trees b is closer to c than either b or c is to d.
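The point about {b, c, d} can be checked by pruning each tree down to those three taxa. The sketch below assumes both trees are ladder-shaped with the leaf orders shown on the slide; the exact topologies are an assumption, since the drawings are not recoverable, and `restrict` is a hypothetical helper.

```python
def restrict(tree, keep):
    """Prune a nested-tuple tree to the leaves in `keep`.

    Internal nodes left with a single child are collapsed.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [sub for sub in (restrict(child, keep) for child in tree)
                if sub is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


# Assumed ladder-shaped trees; only the leaf order comes from the slide.
tree1 = (((("a", "b"), "c"), "d"), "e")
tree2 = (((("e", "b"), "c"), "d"), "a")

sub1 = restrict(tree1, {"b", "c", "d"})
sub2 = restrict(tree2, {"b", "c", "d"})
print(sub1)          # (('b', 'c'), 'd')
print(sub1 == sub2)  # True
```

Under these assumed topologies the two trees share no group, yet they agree completely once a and e are removed.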

slide13
Adams consensus

[Figure: two input trees with leaves ordered a, b, c, d, e and e, b, c, d, a, and their Adams consensus with leaves ordered b, c, d, a, e]

Adams consensus gives the subtrees that are common to all trees. It is useful when one or more sequences have unclear positions but the relationships among a subset of the sequences are common to all trees.

slide14
Networks

[Figure: a network on the taxa a, b, c, d, e]

A network is sometimes used to represent a tree in which recombination occurred.

slide16
Multiple genes analysis

concatenate analysis

e.g., Murphy et al. (2001)

Gene 1 + Gene 2 + Gene 3

Sp1: TCTGT…AACTCTTT…GAATCGTT…GCC
Sp2: TCTGC…GACTCGCT…GGAACGCT…CCC
Sp3: CTTAT…GATCTATT…GGAATATT…CGA
Sp4: CCTAT…GATCCATT…GGACCATT…CCA

[Figure: the concatenated alignment is analyzed with a single evolutionary model, giving one tree on Sp1, Sp2, Sp3, Sp4]
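As a rough illustration of concatenation, the sketch below joins per-gene alignments species by species into one long alignment. The sequences are short hypothetical fragments, not the elided sequences from the slide, and `concatenate` is an illustrative helper name.

```python
def concatenate(alignments):
    """Join per-gene alignments (dicts: species -> sequence) end to end."""
    species = sorted(alignments[0])
    for aln in alignments:
        if sorted(aln) != species:
            raise ValueError("every gene must contain the same species")
    return {sp: "".join(aln[sp] for aln in alignments) for sp in species}


# Hypothetical toy alignments for three genes and four species.
gene1 = {"Sp1": "ACGT", "Sp2": "ACGC", "Sp3": "ATGT", "Sp4": "ATGC"}
gene2 = {"Sp1": "GGA", "Sp2": "GGC", "Sp3": "GTA", "Sp4": "GTC"}
gene3 = {"Sp1": "TTAC", "Sp2": "TTAG", "Sp3": "TCAC", "Sp4": "TCAG"}

supermatrix = concatenate([gene1, gene2, gene3])
print(supermatrix["Sp1"])  # ACGTGGATTAC
```

A single model and a single set of branch lengths would then be fitted to the joined alignment.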

slide17
Multiple genes analysis

concatenate analysis

e.g., Murphy et al. (2001)

Gene 1

Sp1: TCTGT…AAC
Sp2: TCTGC…GAC
Sp3: CTTAT…GAT
Sp4: CCTAT…GAT

Gene 2

Sp1: TCTTT…GAA
Sp2: TCGCT…GGA
Sp3: CTATT…GGA
Sp4: CCATT…GGA

Gene 3

Sp1: TCGTT…GCC
Sp2: ACGCT…CCC
Sp3: ATATT…CGA
Sp4: CCATT…CCA

[Figure: each of the three genes is paired with a box labeled "Evolutionary model" and a tree on Sp1, Sp2, Sp3, Sp4]

slide18
What are branch lengths?

Branch lengths correspond to evolutionary distance:

d = AA replacements / site = [AA replacements / (site × year)] × years = evolutionary rate × years
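A small worked example of the relation above. The rate and the elapsed time are hypothetical numbers chosen only to show the arithmetic.

```python
def branch_length(rate, years):
    """Expected replacements per site: rate (per site per year) times years."""
    return rate * years


def elapsed_years(distance, rate):
    """Invert the relation: time = distance / rate."""
    return distance / rate


rate = 1e-9    # hypothetical: replacements per site per year
years = 50e6   # hypothetical: 50 million years

d = branch_length(rate, years)
print(d)                       # about 0.05 replacements per site
print(elapsed_years(d, rate))  # about 50 million years
```

The same branch length can therefore reflect a fast rate over a short time or a slow rate over a long time.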

slide19
Multiple genes analysis

separate analysis

e.g., Nikaido et al. (2001)

Gene 1

Sp1: TCTGT…AAC
Sp2: TCTGC…GAC
Sp3: CTTAT…GAT
Sp4: CCTAT…GAT

Gene 2

Sp1: TCTTT…GAA
Sp2: TCGCT…GGA
Sp3: CTATT…GGA
Sp4: CCATT…GGA

Gene 3

Sp1: TCGTT…GCC
Sp2: ACGCT…CCC
Sp3: ATATT…CGA
Sp4: CCATT…CCA

[Figure: each gene has its own evolutionary model (model1, model2, model3) and its own tree on Sp1, Sp2, Sp3, Sp4]

slide20
Multiple genes analysis

Number of parameters

Number of species = n

Number of genes = g

Number of parameters in the model = m

| | Concatenate analysis | Separate analysis |
|---|---|---|
| Number of parameters | m+(2n-3) | g*(m+(2n-3)) |
| Example: n = 44; g = 22; m = 0 | 85 | 1870 |

slide21
Multiple genes analysis

Number of parameters

Both an oversimplified model and over-parameterization may lead to the wrong phylogenetic conclusions.

slide22
Multiple genes analysis

proportional analysis

Gene 1

Sp1: TCTGT…AAC
Sp2: TCTGC…GAC
Sp3: CTTAT…GAT
Sp4: CCTAT…GAT

Gene 2

Sp1: TCTTT…GAA
Sp2: TCGCT…GGA
Sp3: CTATT…GGA
Sp4: CCATT…GGA

Gene 3

Sp1: TCGTT…GCC
Sp2: ACGCT…CCC
Sp3: ATATT…CGA
Sp4: CCATT…CCA

[Figure: each gene has its own evolutionary model (model1, model2, model3) and a tree on Sp1, Sp2, Sp3, Sp4; the three trees are labeled Rate=1, Rate=0.5 and Rate=1.5]

slide23
Multiple genes analysis

Number of parameters

Number of species = n

Number of genes = g

Number of parameters in the model = m

| | Concatenate analysis | Separate analysis | Proportional analysis |
|---|---|---|---|
| Number of parameters | m+(2n-3) | g*(m+(2n-3)) | g-1+gm+(2n-3) |
| Example: n = 44; g = 22; m = 0 | 85 | 1870 | 106 |
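The three parameter counts can be reproduced with a short script. The function names are illustrative; the formulas and the example values n = 44, g = 22, m = 0 are the ones given on the slide.

```python
def branches(n):
    """Number of branches in an unrooted binary tree with n species."""
    return 2 * n - 3


def concatenate_params(n, g, m):
    """One model and one set of branch lengths for all genes."""
    return m + branches(n)


def separate_params(n, g, m):
    """Each gene has its own model and its own branch lengths."""
    return g * (m + branches(n))


def proportional_params(n, g, m):
    """Shared branch lengths, g-1 relative gene rates, one model per gene."""
    return (g - 1) + g * m + branches(n)


n, g, m = 44, 22, 0
print(concatenate_params(n, g, m))   # 85
print(separate_params(n, g, m))      # 1870
print(proportional_params(n, g, m))  # 106
```

The proportional analysis sits between the two extremes: it adds only g-1 rate parameters on top of the shared branch lengths.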

slide24
Aims of our study

To compare 3 types of multiple-gene analysis:
  • Concatenate analysis
  • Separate analysis
  • Proportional analysis

3 protein datasets:
  • Mitochondrial dataset [56 species, 12 genes]
  • Nuclear dataset ("short genes") [46 species, 6 genes]
  • Nuclear dataset ("long genes") [28 species, 4 genes]

(Short genes: based on the Murphy dataset)

slide25
Comparing topologies

Morphological topology

(Based on McKenna and Bell, 1997)

[Figure: tree with the labeled groups Archonta, Glires, Ungulata, Carnivora, Insectivora and Xenarthra. Leaves, in the order shown on the slide: Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Tree shrew, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Horseshoe bat, Little red flying fox, Ryukyu flying fox, Mouse, Rat, Vole, Cane-rat, Guinea pig, Squirrel, Dormouse, Rabbit, Pika, Pig, Hippopotamus, Sheep, Cow, Alpaca, Blue whale, Fin whale, Sperm whale, Donkey, Horse, Indian rhino, White rhino, Elephant, Aardvark, Grey seal, Harbor seal, Dog, Cat, Asiatic shrew, Long-clawed shrew, Small Madagascar hedgehog, Hedgehog, Gymnure, Mole, Armadillo, Bandicoot, Wallaroo, Opossum, Platypus]

slide26
Aims of our study

Mitochondrial topology

[Figure: tree with the labeled groups Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2 and Hedgehogs. Leaves, in the order shown on the slide: Donkey, Horse, Indian rhino, White rhino, Grey seal, Harbor seal, Dog, Cat, Blue whale, Fin whale, Sperm whale, Hippopotamus, Sheep, Cow, Alpaca, Pig, Little red flying fox, Ryukyu flying fox, Horseshoe bat, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Asiatic shrew, Long-clawed shrew, Mole, Small Madagascar hedgehog, Aardvark, Elephant, Armadillo, Rabbit, Pika, Tree shrew, Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Squirrel, Dormouse, Cane-rat, Guinea pig, Mouse, Rat, Vole, Hedgehog, Gymnure, Bandicoot, Wallaroo, Opossum, Platypus]

slide27
Aims of our study

Nuclear topology

(Madsen tree)

[Figure: tree with the labeled groups Chiroptera, Eulipotyphla, Pholidota, Cetartiodactyla, Carnivora, Perissodactyla, Glires, Scandentia + Dermoptera, Primate, Xenarthra and Afrotheria; four clades are numbered 1 to 4. Leaves, in the order shown on the slide: Round Eared Bat, Flying Fox, Hedgehog, Mole, Pangolin, Whale, Hippo, Cow, Pig, Cat, Dog, Horse, Rhino, Rat, Capybara, Rabbit, Flying Lemur, Tree Shrew, Human, Galago, Sloth, Hyrax, Dugong, Elephant, Aardvark, Elephant Shrew, Opossum, Kangaroo]

slide28
Comparing different models using the Akaike Information Criterion (AIC)

A model which minimizes the AIC is considered to be the most appropriate model.
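For reference, the AIC is defined as AIC = -2 ln(L) + 2 × df, where ln(L) is the maximized log-likelihood and df is the number of free parameters. The sketch below applies it to the figures reported for the mitochondrial dataset later in the talk. The pairing of each df with each log-likelihood is an assumption, chosen because it reproduces the reported AIC values to within rounding.

```python
def aic(log_likelihood, df):
    """Akaike Information Criterion: -2 ln(L) + 2 * (number of parameters)."""
    return -2.0 * log_likelihood + 2.0 * df


# (ln(L), df) per analysis; the pairing is an assumption, see the text above.
mitochondrial = {
    "concatenate": (-91188.71, 121),
    "proportional": (-90999.30, 132),
    "separate": (-89921.78, 1320),
}

scores = {name: aic(lnl, df) for name, (lnl, df) in mitochondrial.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")

best = min(scores, key=scores.get)
print(best)  # proportional
```

The separate analysis has the highest likelihood, but its 1320 parameters are penalized enough that the proportional analysis ends up with the lowest AIC.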

slide29
Results: the best multiple gene analysis

The proportional analysis is the best for the mitochondrial dataset

| | Concatenate analysis | Proportional analysis | Separate analysis |
|---|---|---|---|
| df | 121 | 132 | 1320 |
| Ln(L) | -91188.71 | -90999.30 | -89921.78 |
| AIC | 182619.42 | 182262.60 | 182483.55 |

(Mitochondrial tree, N-Gamma rate model)

slide30
Results: the best multiple gene method

The Proportional analysis is the best for the Nuclear dataset (“Short genes”)

| | Concatenate analysis | Proportional analysis | Separate analysis |
|---|---|---|---|
| df | 95 | 100 | 540 |
| Ln(L) | -11618.67 | -11543.87 | -11192.12 |
| AIC | 23427.33 | 23287.74 | 23464.23 |

(Murphy dataset, Madsen tree, N-Gamma rate model)

slide31
Results: the best multiple gene method

The Separate analysis is the best for the Nuclear dataset (“Long genes”)

| | Concatenate analysis | Proportional analysis | Separate analysis |
|---|---|---|---|
| df | 57 | 60 | 216 |
| Ln(L) | -31519.10 | -31406.81 | -31153.28 |
| AIC | 63152.21 | 62933.63 | 62738.56 |

(Madsen dataset, Murphy tree, N-Gamma rate model)

slide32
Conclusion: the best multiple gene method

1- The concatenate model is always the worst way to analyze multiple genes.

2- Selecting between the separate analysis or the proportional analysis depends on the data considered:

The proportional model is better suited to short genes, the separate model to longer sequences.

slide33
Results: mammalian phylogeny
  • The morphological tree is always rejected
  • P(K-H test) < 0.05
  • whatever the model used
  • whatever the dataset
slide34
Results: mammalian phylogeny
  • The mitochondrial tree is the best tree for the mitochondrial dataset. But we cannot reject the nuclear tree.
  • The nuclear tree is the best for the nuclear datasets, and we can reject the mitochondrial tree.

Conclusion (Topology): It seems that the nuclear tree is the best tree among the 3 alternative trees.

slide35
Modeling site-rate variation

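The gamma distribution:

[Figure: gamma distribution; x-axis: substitution rates (R); y-axis: site proportions f(r)]

Homogeneous model:

$F(t+x) = F(t)\,P(x)$

Gamma model, with c equally weighted rate categories $R_1, \dots, R_c$:

$F(t+x) = \sum_{n=1}^{c} \frac{1}{c}\, F(t)\, P(x R_n)$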
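The difference between the two models can be sketched numerically. The code below is only an illustration under stated assumptions: it uses the Jukes-Cantor nucleotide model as a stand-in for the protein models of the study, compares two sequences at a single site, and uses hypothetical rate categories with mean 1 and equal weights.

```python
import math


def jc_prob(same, d):
    """Jukes-Cantor probability of ending in the same or a different state
    after an evolutionary distance d."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def site_likelihood(x, y, d, rates=(1.0,)):
    """Likelihood of observing states x and y at distance d.

    With one rate this is the homogeneous model. With several rates the
    likelihood is averaged over equally weighted categories, each category
    rescaling the distance by its rate.
    """
    c = len(rates)
    return sum((1.0 / c) * 0.25 * jc_prob(x == y, d * r) for r in rates)


rates = (0.1, 0.5, 1.0, 2.4)  # hypothetical rate categories, mean 1.0

homogeneous = site_likelihood("A", "A", 0.3)
heterogeneous = site_likelihood("A", "A", 0.3, rates)
print(homogeneous, heterogeneous)
```

With rate variation, an unchanged site is more likely than under the homogeneous model at the same distance, because slowly evolving categories contribute sites that rarely change.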

slide36
Likelihoods with rate variation

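[Figure: a tree with three leaves labeled A, G and C and branch lengths d1, d2 and d3, shown twice: once for the continuous model and once for the discrete model]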

slide37
Results: the best site-rate variation model

Mitochondrial dataset

(Mitochondrial tree, proportional analysis)

| | Homogeneous model | 1-Gamma model | N-Gamma model |
|---|---|---|---|
| df | 120 | 121 | 132 |
| Ln(L) | -98998.68 | -91094.30 | -90999.30 |
| AIC | 198237.37 | 182430.61 | 182262.60 |

slide38
Conclusion:

the best site-rate variation model

The N-Gamma model is always the best site-rate variation model.

slide39
Combining Multiple Genes

Collaborations

Dorothee Huchon (Florida State University)

Masami Hasegawa (Institute of Statistical Mathematics)

Norihiro Okada (Tokyo Institute of Technology)

Ying Cao (ISM).

slide41
Known phylogenies

The best way to test different methods of phylogenetic reconstruction is on trees that are known to be true from other sources…

Problem: known phylogenies are very rare.

Known phylogenies: laboratory animals, crop plants (and even those are often suspect). Also, their evolutionary rate is very small…

slide42
Known phylogenies

David Hillis and colleagues have created “experimental” phylogenies in the lab.

slide43
Known phylogenies

They used bacteriophage T7 and subdivided cultures of it in the presence of a mutagen. They then sequenced a marker gene from the final cultures and gave the sequences as input to a few phylogenetic methods. The output of the tree-building methods was compared to the true tree.

slide44
Known phylogenies

In fact, they used a restriction-site method to infer the phylogeny, using MP, NJ, UPGMA and others.

All methods reconstructed the true tree.

slide45
Known phylogenies

They also compared outputs of ancestral sequence reconstruction, using MP.

97.3% of the ancestral states were correctly reconstructed.

Encouraging!

slide46
Known phylogenies

Criticism: (1) The true tree was very easy to infer, because it was well balanced and all nodes were accompanied by numerous changes.

(2) The mutations by a single mutagen do not reflect reality.
