Loading in 5 sec....

The bootstrap, consenus-trees, and super-treesPowerPoint Presentation

The bootstrap, consenus-trees, and super-trees

- 282 Views
- Updated On :
- Presentation posted in: Pets / Animals

Phylogenetics Workhop, 16-18 August 2006. The bootstrap, consenus-trees, and super-trees. Barbara Holland. What is the bootstrap?. Like in many other areas where statistical inference is applied, in phylogenetics it is not just of interest to get a point estimate of the phylogenetic tree.

The bootstrap, consenus-trees, and super-trees

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Phylogenetics Workhop,

16-18 August 2006

The bootstrap,consenus-trees, and super-trees

Barbara Holland

- Like in many other areas where statistical inference is applied, in phylogenetics it is not just of interest to get a point estimate of the phylogenetic tree.
- We would also like some measure of confidence in our point estimate.
- Is our tree likely to change if we got more data, or if we had used slightly different data?
- How robust is our result to sampling error?

- The bootstrap is a useful tool for answering these sorts of questions.

- In 1985 Felsenstein introduced the idea of the bootstrap to phylogenetics.
- For each boostrap sample
- Create a new alignment by resampling the columns of the observed alignment
- Construct a tree for the ‘bootstrap’ alignment

- Can be applied to any method that starts from a sequence alignment, e.g., parsimony, likelihood, clustering methods if the distances are derived from an alignment…
- The bootstrap support for each edge is the number of bootstrap trees that edge appears in.

1234567

a ATATAAA

bATTATAA

cTAAAATA

dTATAAAT

1224567

a ATTTAAA

bATTATAA

cTAAAATA

dTAAAAAT

1334567

a AAATAAA

bATTATAA

cTAAAATA

dTTTAAAT

1234567

a ATATAAA

bATTATAA

cTAAAATA

dTATAAAT

1244567

a ATTTAAA

bATAATAA

cTAAAATA

dTAAAAAT

c

a

c

a

a

c

a

b

d

b

c

d

d

b

b

d

c

a

0.75

b

d

0.01

0.2

a

b

c

d

- Simulate data on the four taxon tree below (JC model)
- Use sequence lengths of 100, 1000, and 10000

0.05

0.05

0.1

0.1

d

d

a

a

b

c

c

b

- Simulate data on the two four-taxon trees below (JC model) in the proportion 55%, 45% and concatenate the sequences
- Use total sequence lengths of 100, 1000, and 10000

55%

45%

- Consensus trees attempt to summarise the information contained in a set of trees, where each tree in the set is on the same taxa.
- Some consensus tree methods are specific to rooted trees.

- Many phylogenetic methods produce a collection of trees rather than a single best tree.
- Monte Carlo Markov Chain (MCMC)
- Bootstrapping.
- Equally parsimonious trees

- Sometimes trees for different genes produce a collection of trees.

- Each edge in an unrooted tree corresponds to a split or bipartition of the taxa set.
- Each edge in a rooted tree corresponds to a clade.

mouse

dog

turtle

cat, dog, mouse, parrot | turtle

parrot

cat

dog, cat | mouse, turtle, parrot

cat, dog, mouse | turtle, parrot

dog

cat

mouse

parrot

turtle

dog

cat

mouse

parrot

turtle

dog

cat

mouse

parrot

turtle

- The strict consensus tree contains only those splits/clades that appear in all trees

mouse

mouse

turtle

dog

dog

mouse

dog

turtle

turtle

parrot

parrot

cat

cat

cat

parrot

mouse

dog

turtle

parrot

cat

- The semi-strict consensus tree also contains those splits/clades that don’t conflict with any of the input trees

mouse

mouse

dog

dog

turtle

turtle

parrot

cat

cat

parrot

mouse

dog

turtle

cat

parrot

- The majority-rule consensus tree contains only those splits/clades that appear in more than 50% of the input trees

dog

mouse

mouse

turtle

turtle

dog

mouse

dog

turtle

parrot

cat

parrot

cat

cat

parrot

turtle

dog

mouse

parrot

cat

dog

cat

mouse

parrot

turtle

- 3-taxon statements are triples of three species that show two species to be more closely related than is the third.
- E.g. the tree below displays the 3-taxonstatements
((dog,cat),mouse)

((dog,mouse),parrot)

((mouse,parrot),turtle)

…and others…

Hierarchy of clusters

Partitions

abcd | ef

{a,b,c,d}

a | bcd | ef

{b,c,d}

a | b | cd | ef

{e,f}

{c,d}

a

b

c

d

e

f

{a}

{b}

{c}

{d}

{e}

{f}

- Given k partitions p1, p2, p3,…, pk of the same set of taxa, the product of these partitions is the partition where a and b are in the same block if and only if the are in the same block for each pi
- Example: The product of abc|de and ad|bce is a|bc|d|e

- Adams consensus method only applies to rooted trees.
- It preserves all the 3-taxon statements that are common to all of the input trees.
- Recursive algorithm that looks at the product of the maximal partitions of each of the input trees

Procedure AdamsTree(T1,…Tk)

ifT1 contains only 1 leaf then

returnT1

else

construct the product of the maximal partitions of the input trees

For each block B in the partition do

construct AdamsTree(T1|B, …Tk|B)

Attach the roots of these trees to a new node v

return this tree

end

e

b

c

d

a

f

a

b

c

d

e

f

Maximal partition

abcd | ef

Maximal partition

bcde | af

Product of maximal partitions

a|bcd|e|f

{f}

{a}

{b,c,d}

{e}

Restrict to b,c,d

e

b

c

d

a

f

a

b

c

d

e

f

Maximal partition

b | cd

Maximal partition

b | cd

Product of maximal partitions

b | cd

{f}

{a}

{b,c,d}

{e}

{b}

{c,d}

Restrict to c,d

e

b

c

d

a

f

a

b

c

d

e

f

Maximal partition

c | d

Maximal partition

c | d

{f}

{a}

{b,c,d}

{e}

Product of maximal partitions

c | d

{b}

{c,d}

{c}

{d}

- Instead of triples we would need to consider statements about quartets of taxa.
- If a quartet ((a,b),(c,d)) appeared in all the input trees it should be displayed in the output.
- Easy enough?

- Relabelling of the species at the tip of the tree should yeild the same answer relabelled in the appropriate way
- The input order of the trees should not matter
- A quartet that appears in all the input trees should appear in the output tree

- Counter example

f

a

b

a

b

c

e

f

c

d

d

e

- Super-tree methods take a set of trees on overlapping taxa sets and return a tree (or sometimes a ‘fail’ message)
- Biological relevance
- Not all genes are present in all species
- Not all genes are easy to sequence for all species

- Assembling the Tree of Life
- Computationally impossible to try and build a tree for all taxa
- Use a divide and conquer approach
- And then use supertree methods to piece the Tree of Life together

c

c

b

b

d

refines

d

a

a

e

e

The trees below are also refinements

d

e

b

b

c

d

a

a

e

c

T

c

b

d

e

a

f

h

g

The label set X = {a,b,c,d,e,f,g,h}

We can restrict T to any subset of the labels X’

E.g. The restriction to {a,c,e,g}

T

c

c

e

b

d

e

a

a

g

f

h

g

Find the subtree and then supress the degree two vertices

A tree T (on label set X) displays a tree T’ (on label set X’ subset of X) if Trestricted to the labels X’ is a refinement of T’

E.g.

d

c

c

e

d

b

a

f

displays

and

d

b

a

e

f

a

f

BUT

d

c

c

e

d

b

a

f

Does not display

or

b

d

a

e

f

a

c

- Polynomial-time algorithm due to Aho et al (1981)
- Takes a set of rooted input trees and either outputs a supertree that displays all of the input trees or returns a fail message.

- Recursive algorithm, at each step it constructs a graph associated with the triples displayed by the input trees.
- Depending on whether this associated graph is connected or disconnected the algorithm either terminates or subdivides the problem.
- What is this associated graph?

- Nodes of the graph are the complete label set, i.e. all the labels that appear in any of the input trees
- Put an edge between two nodes a and b if there is at least one input tree that displays the rooted triple ((a,b),c) for some c.
- If this graph is connected stop and report a fail message
- Otherwise call the algorithm again once for each connected component, restricting the input to the labels in that component.

d

a

b

c

e

c

b

e

a

b

f

d

b

c

d

a

{a,b,c,f}

{d,e}

e

f

Subproblem 1: Restrict input to {a,b,c,f}

a

b

c

c

b

a

b

f

b

c

a

{a,b,c,f}

{d,e}

f

{f}

{a,b}

{c}

Subproblem 2 and 3 on {d,e}, and {a,b} are trivial so the final tree is

{a,b,c,f}

{d,e}

{f}

{e}

{a,b}

{d}

{c}

{a}

{b}

a

b

c

f

d

e

- If the input trees are not compatible BUILD will return a fail message.
- It is also of interest to have methods that will return some output even if the input trees cannot all be displayed by a single supertree.
- Matrix representation with parsimony (MRP) is one such method…

- Supertree method invented independently by Baum and Ragan (1992).
- Recode the input trees as a character matrix where each edge in each input tree defines a character.
- Do a parsimony analysis of the resulting character matrix.
- Take the strict consensus of the most parsimonious trees.

c

d

e

b

f

a

d

c

e

b

g

a

g

d

e

h

4

6

2

4

2

8

4

6

2

8

3

5

7

3

1

9

3

5

7

5

1

9

1

12345678912345678912345

a101010100101010000?????

b011010100011010000?????

c000110100000001000?????

d00000110000011000001100

e00000001000000011010100

f000000001??????????????

g?????????00000010100010

h??????????????????00001

c

d

e

b

f

a

d

c

e

b

g

a

g

d

e

h

10 most parsimonious trees

Strict consensus:

e

c

b

f

g

a

d

h