## PowerPoint Slideshow about 'The bootstrap' - Jims


Presentation Transcript

What is the bootstrap?

- As in many other areas of statistical inference, in phylogenetics we want more than just a point estimate of the phylogenetic tree.
- We would also like some measure of confidence in our point estimate.
- Would our tree be likely to change if we collected more data, or if we had used slightly different data?
- How robust is our result to sampling error?
- The bootstrap is a useful tool for answering these sorts of questions.

Assessing confidence in trees

- In 1985 Felsenstein introduced the idea of the bootstrap to phylogenetics.
- For each bootstrap sample:
  - Create a new alignment by resampling, with replacement, the columns of the observed alignment, keeping the alignment length the same
  - Construct a tree for the ‘bootstrap’ alignment
- Can be applied to any method that starts from a sequence alignment, e.g., parsimony, likelihood, clustering methods if the distances are derived from an alignment…
- The bootstrap support for each edge is the number of bootstrap trees that edge appears in, usually reported as a proportion or percentage of the bootstrap trees.
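The resampling step described above can be sketched in a few lines of Python. This is a minimal illustration, not a full bootstrap analysis: the function names are hypothetical, the tree-building step is left out, and each bootstrap tree is assumed to be available as a set of hashable splits. The alignment is the four-taxon example from the slide.

```python
import random

# Observed alignment from the slide (taxon -> sequence).
alignment = {
    "a": "ATATAAA",
    "b": "ATTATAA",
    "c": "TAAAATA",
    "d": "TATAAAT",
}

def take_columns(aln, cols):
    """Build a new alignment from the given 0-based column indices."""
    return {taxon: "".join(seq[i] for i in cols) for taxon, seq in aln.items()}

def bootstrap_alignment(aln, rng):
    """Resample columns uniformly at random, with replacement, same length."""
    n_sites = len(next(iter(aln.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return take_columns(aln, cols)

def edge_support(bootstrap_trees, split):
    """Proportion of bootstrap trees (each a set of splits) that contain split."""
    return sum(split in tree for tree in bootstrap_trees) / len(bootstrap_trees)

# Columns 1,2,2,4,5,6,7 of the slide (1-based) are 0,1,1,3,4,5,6 here.
replicate = take_columns(alignment, [0, 1, 1, 3, 4, 5, 6])
```

Any tree-building method that starts from an alignment can then be run on each replicate, and `edge_support` summarises how often a given edge is recovered.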

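[Figure: the observed alignment, four bootstrap alignments, the tree built from each bootstrap alignment, and a tree on the taxa a, b, c, d with one edge labelled with bootstrap support 0.75. The digits above each bootstrap alignment give the observed column copied into each position.]

Observed alignment:

a ATATAAA
b ATTATAA
c TAAAATA
d TATAAAT

Bootstrap alignments:

  1224567     1334567     1234567     1244567
a ATTTAAA   a AAATAAA   a ATATAAA   a ATTTAAA
b ATTATAA   b ATTATAA   b ATTATAA   b ATAATAA
c TAAAATA   c TAAAATA   c TAAAATA   c TAAAATA
d TAAAAAT   d TTTAAAT   d TATAAAT   d TAAAAAT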

Example where the bootstrap is useful

- Simulate data on the four-taxon tree below (JC model)
- Use sequence lengths of 100, 1000, and 10000

[Figure: four-taxon trees on the taxa a, b, c, d, with edge labels 0.2, 0.05, 0.1, 0.1.]

Example where the bootstrap is not so useful

- Simulate data on the two four-taxon trees below (JC model) in the proportion 55%, 45% and concatenate the sequences
- Use total sequence lengths of 100, 1000, and 10000

[Figure: two four-taxon trees, labelled 55% and 45%.]
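Both simulation examples above generate sequences under the JC (Jukes-Cantor) model. The sketch below shows one way this could be done for a four-taxon tree. Under JC, a site changes along a branch of length t (expected substitutions per site) with probability 3/4 (1 - exp(-4t/3)), and each of the three other bases is equally likely. The function names and the branch lengths in the usage lines are assumptions for illustration and are not read off the slide's figures.

```python
import math
import random

BASES = "ACGT"

def jc_change_prob(t):
    """Probability that a site differs at the two ends of a branch of length t."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))

def evolve(seq, t, rng):
    """Evolve a sequence along one branch under the JC model."""
    p = jc_change_prob(t)
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_quartet(n_sites, topology, pendant, internal, rng):
    """Simulate on a tree ((t1,t2),(t3,t4)); all pendant edges share one length."""
    (t1, t2), (t3, t4) = topology
    # JC is reversible with uniform base frequencies, so the root can sit
    # at one end of the internal edge.
    left = "".join(rng.choice(BASES) for _ in range(n_sites))
    right = evolve(left, internal, rng)
    return {
        t1: evolve(left, pendant, rng),
        t2: evolve(left, pendant, rng),
        t3: evolve(right, pendant, rng),
        t4: evolve(right, pendant, rng),
    }

def simulate_mixture(n_sites, topo1, topo2, share1, pendant, internal, rng):
    """Concatenate sites simulated on two topologies in the given proportion."""
    n1 = round(n_sites * share1)
    part1 = simulate_quartet(n1, topo1, pendant, internal, rng)
    part2 = simulate_quartet(n_sites - n1, topo2, pendant, internal, rng)
    return {taxon: part1[taxon] + part2[taxon] for taxon in part1}

rng = random.Random(1)
single = simulate_quartet(100, (("a", "b"), ("c", "d")), 0.1, 0.05, rng)
mixed = simulate_mixture(100, (("a", "b"), ("c", "d")),
                         (("a", "c"), ("b", "d")), 0.55, 0.1, 0.05, rng)
```

Bootstrapping the `mixed` alignment resamples columns from both blocks, which is why support values there can look confident even though no single tree generated the data.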

Consensus trees

- Consensus trees attempt to summarise the information contained in a set of trees, where each tree in the set is on the same taxa.
- Some consensus tree methods are specific to rooted trees.

Why are consensus methods required?

- Many phylogenetic methods produce a collection of trees rather than a single best tree:
  - Markov chain Monte Carlo (MCMC)
  - Bootstrapping
  - Equally parsimonious trees
- Sometimes the trees for different genes differ, which also gives a collection of trees.

Terminology: Splits and clades

- Each edge in an unrooted tree corresponds to a split or bipartition of the taxa set.
- Each edge in a rooted tree corresponds to a clade.

Splits

[Figure: unrooted tree on the taxa cat, dog, mouse, parrot, turtle, with three of its edges labelled by their splits.]

cat, dog, mouse, parrot | turtle

dog, cat | mouse, turtle, parrot

cat, dog, mouse | turtle, parrot
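As a small illustration of the terminology, the sketch below lists the clades of a rooted tree and the non-trivial splits of the unrooted tree obtained by ignoring the root. The nested-tuple tree encoding and the function names are assumptions made for this example. The tree is chosen so that its non-trivial splits match the two shown above.

```python
def leaves(tree):
    """Leaf set of a tree written as nested tuples with string leaves."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """One clade (set of leaves below a vertex) for every vertex of the tree."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    result = {leaves(tree)}
    for child in tree:
        result |= clades(child)
    return result

def splits(tree):
    """Non-trivial splits: both sides of the edge have at least two taxa."""
    taxa = leaves(tree)
    result = set()
    for clade in clades(tree):
        other = taxa - clade
        if len(clade) > 1 and len(other) > 1:
            result.add(frozenset([clade, other]))
    return result

tree = ((("cat", "dog"), "mouse"), "turtle", "parrot")
for split in splits(tree):
    side_a, side_b = sorted(split, key=sorted)
    print(", ".join(sorted(side_a)), "|", ", ".join(sorted(side_b)))
```

Storing each split as an unordered pair of leaf sets means the same edge is recognised no matter where the tree happens to be rooted.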

Strict Consensus

- The strict consensus tree contains only those splits/clades that appear in all trees

[Figure: three input trees on the taxa mouse, dog, turtle, parrot, cat, followed by their strict consensus tree.]

Semi-strict

- The semi-strict consensus tree contains the splits/clades of the strict consensus and, in addition, those splits/clades that appear in at least one input tree and don’t conflict with any of the input trees

[Figure: two input trees on the taxa mouse, dog, turtle, parrot, cat, followed by their semi-strict consensus tree.]
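A minimal sketch of the semi-strict idea for rooted trees follows. It assumes trees are nested tuples with string leaves, and it uses the standard fact that two clades conflict exactly when they overlap without one containing the other. The function names and the example trees are hypothetical.

```python
def leaves(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clusters(tree):
    """Non-trivial clades: more than one leaf, and not the whole leaf set."""
    if isinstance(tree, str):
        return set()
    found = set()
    for child in tree:
        if not isinstance(child, str):
            found.add(leaves(child))
            found |= clusters(child)
    return found

def compatible(c1, c2):
    """Two clades can sit in one tree if they are disjoint or nested."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1

def semi_strict(trees):
    """Clades from any input tree that conflict with no clade of any input tree."""
    cluster_sets = [clusters(t) for t in trees]
    every = set().union(*cluster_sets)
    return {c for c in every
            if all(compatible(c, d) for cs in cluster_sets for d in cs)}

t1 = ((("cat", "dog"), "mouse"), "turtle", "parrot")
t2 = (("cat", "dog"), "mouse", ("turtle", "parrot"))
t3 = (("cat", "mouse"), "dog", "turtle", "parrot")
```

Here `semi_strict([t1, t2])` keeps all three clades found in either tree because none of them conflict, whereas `semi_strict([t1, t3])` drops {cat, dog} and {cat, mouse}, which contradict each other.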

Majority-Rule

- The majority-rule consensus tree contains only those splits/clades that appear in more than 50% of the input trees

[Figure: three input trees on the taxa dog, mouse, turtle, parrot, cat, followed by their majority-rule consensus tree.]
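The strict and majority-rule definitions both reduce to counting how many input trees contain each clade. The sketch below shows that counting for rooted trees written as nested tuples. The function names and the three example trees are assumptions for illustration and are not the trees from the slide's figure.

```python
from collections import Counter

def leaves(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clusters(tree):
    """Non-trivial clades: more than one leaf, and not the whole leaf set."""
    if isinstance(tree, str):
        return set()
    found = set()
    for child in tree:
        if not isinstance(child, str):
            found.add(leaves(child))
            found |= clusters(child)
    return found

def strict_consensus(trees):
    """Clades present in every input tree."""
    return set.intersection(*(clusters(t) for t in trees))

def majority_rule(trees):
    """Clades present in more than 50% of the input trees."""
    counts = Counter(c for t in trees for c in clusters(t))
    return {c for c, n in counts.items() if n > len(trees) / 2}

t1 = ((("cat", "dog"), "mouse"), ("turtle", "parrot"))
t2 = ((("cat", "dog"), ("mouse", "turtle")), "parrot")
t3 = ((("cat", "mouse"), "dog"), ("turtle", "parrot"))
```

For these trees no clade appears in all three, so the strict consensus is unresolved, while three clades each appear in two of the three trees and survive into the majority-rule consensus.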

Terminology: 3-taxon statements

- A 3-taxon statement is a statement about three species in which two of the species are more closely related to each other than either is to the third.
- E.g. the tree below displays the 3-taxon statements

((dog,cat),mouse)

((dog,mouse),parrot)

((mouse,parrot),turtle)

…and others…

[Figure: rooted tree on the taxa dog, cat, mouse, parrot, turtle.]
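The 3-taxon statements displayed by a rooted tree can be enumerated directly: the tree displays ((a,b),c) when some clade contains a and b but not c. The sketch below assumes nested-tuple trees and hypothetical function names. The example tree is one tree consistent with the three statements listed above, chosen for illustration.

```python
from itertools import combinations

def leaves(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    if isinstance(tree, str):
        return {frozenset([tree])}
    result = {leaves(tree)}
    for child in tree:
        result |= clades(child)
    return result

def triples(tree):
    """All 3-taxon statements, each stored as (frozenset({a, b}), c)."""
    all_clades = clades(tree)
    taxa = leaves(tree)
    found = set()
    for a, b in combinations(sorted(taxa), 2):
        for c in taxa - {a, b}:
            if any(a in cl and b in cl and c not in cl for cl in all_clades):
                found.add((frozenset([a, b]), c))
    return found

tree = (((("dog", "cat"), "mouse"), "parrot"), "turtle")
statements = triples(tree)
```

A fully resolved rooted tree on five taxa displays exactly one statement for each of the ten ways of choosing three taxa.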

Terminology: Rooted trees, hierarchies, clusters, and partitions

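[Figure: rooted tree on the leaves a, b, c, d, e, f, annotated with its hierarchy of clusters and the corresponding partitions.]

Hierarchy of clusters

{a,b,c,d}, {e,f}, {b,c,d}, {c,d}, {a}, {b}, {c}, {d}, {e}, {f}

Partitions

abcd | ef

a | bcd | ef

a | b | cd | ef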

Products of partitions

- Given k partitions p1, p2, p3,…, pk of the same set of taxa, the product of these partitions is the partition in which a and b are in the same block if and only if they are in the same block of every pi
- Example: The product of abc|de and ad|bce is a|bc|d|e
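The product can be computed by giving each taxon a signature that records which block it falls in for each partition; taxa with the same signature end up in the same block of the product. This is a minimal sketch with a hypothetical function name, checked against the example above.

```python
def product(partitions):
    """Product of partitions, each given as a list of sets of taxa."""
    taxa = set().union(*partitions[0])
    groups = {}
    for x in taxa:
        # Index of the block containing x in each partition.
        signature = tuple(
            next(i for i, block in enumerate(p) if x in block)
            for p in partitions
        )
        groups.setdefault(signature, set()).add(x)
    return {frozenset(g) for g in groups.values()}

result = product([
    [set("abc"), set("de")],
    [set("ad"), set("bce")],
])
print(" | ".join(sorted("".join(sorted(block)) for block in result)))
```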

Adams Consensus

- Adams consensus method only applies to rooted trees.
- It preserves all the 3-taxon statements that are common to all of the input trees.
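- Recursive algorithm that looks at the product of the maximal partitions of each of the input trees (the maximal partition of a rooted tree groups its leaves according to the subtrees hanging directly below the root)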

AdamsTree algorithm (from Bryant 2003)

Procedure AdamsTree(T1,…,Tk)

    if T1 contains only 1 leaf then

        return T1

    else

        construct the product of the maximal partitions of the input trees

        for each block B in the partition do

            construct AdamsTree(T1|B,…,Tk|B)

        attach the roots of these trees to a new node v

        return this tree

    end
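The procedure above translates fairly directly into Python. The following is a sketch under stated assumptions, not Bryant's implementation: trees are nested tuples with string leaves, all input trees share one leaf set, and the helper names are hypothetical. The two example trees are assumed for illustration and were chosen to have maximal partitions abcd | ef and bcde | af.

```python
def leaves(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def restrict(tree, keep):
    """Restrict a tree to the leaves in keep, suppressing unary vertices."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def product(partitions):
    """Taxa share a block iff they share a block in every partition."""
    groups = {}
    for x in set().union(*partitions[0]):
        key = tuple(next(i for i, b in enumerate(p) if x in b) for p in partitions)
        groups.setdefault(key, set()).add(x)
    return {frozenset(g) for g in groups.values()}

def adams_tree(trees):
    if isinstance(trees[0], str):      # T1 contains only one leaf
        return trees[0]
    # The maximal partition of a tree is the leaf sets of the root's children.
    blocks = product([[leaves(child) for child in t] for t in trees])
    # Recurse on each block and attach the results to a new root.
    return tuple(
        adams_tree([restrict(t, block) for t in trees])
        for block in sorted(blocks, key=sorted)
    )

t1 = (("a", ("b", ("c", "d"))), ("e", "f"))
t2 = ((("b", ("c", "d")), "e"), ("a", "f"))
consensus = adams_tree([t1, t2])
```

Blocks are sorted only to make the output order reproducible; the tree itself does not depend on that order.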

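Adams consensus example

[Figure: two rooted input trees on the leaves a, b, c, d, e, f.]

Maximal partitions of the two input trees: abcd | ef and bcde | af

Product of maximal partitions: a | bcd | e | f

Consensus tree so far: the root has children {a}, {b,c,d}, {e}, {f}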

Adams consensus example cont.

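Restrict to b,c,d

[Figure: the two input trees restricted to the leaves b, c, d.]

Maximal partitions of the two restricted trees: b | cd and b | cd

Product of maximal partitions: b | cd

Consensus tree so far: the root has children {a}, {b,c,d}, {e}, {f}; {b,c,d} has children {b} and {c,d}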

Adams consensus example cont.

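Restrict to c,d

[Figure: the two input trees restricted to the leaves c, d.]

Maximal partitions of the two restricted trees: c | d and c | d

Product of maximal partitions: c | d

Final consensus tree: the root has children {a}, {b,c,d}, {e}, {f}; {b,c,d} has children {b} and {c,d}; {c,d} has children {c} and {d}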

What about an “Adams” like method for unrooted trees?

- Instead of triples we would need to consider statements about quartets of taxa.
- If a quartet ((a,b),(c,d)) appeared in all the input trees it should be displayed in the output.
- Easy enough?

Three requirements (Steel, Dress and Böcker 2000)

- Relabelling the species at the tips of the trees should yield the same answer, relabelled in the corresponding way
- The input order of the trees should not matter
- A quartet that appears in all the input trees should appear in the output tree

Supertree methods

- Supertree methods take a set of trees on overlapping taxa sets and return a tree (or sometimes a ‘fail’ message)
- Biological relevance
  - Not all genes are present in all species
  - Not all genes are easy to sequence for all species
- Assembling the Tree of Life
  - It is computationally infeasible to build a single tree for all taxa at once
  - Use a divide and conquer approach
  - And then use supertree methods to piece the Tree of Life together

Concept: Restriction

[Figure: tree T with leaves a, b, c, d, e, f, g, h.]

The label set is X = {a,b,c,d,e,f,g,h}

We can restrict T to any subset X’ of the labels

Concept: Restriction

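E.g. the restriction of T to {a,c,e,g}

[Figure: tree T on the leaves a to h, next to its restriction to the leaves a, c, e, g.]

Find the smallest subtree connecting the chosen leaves and then suppress the degree-two vertices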
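For rooted trees written as nested tuples, restriction can be sketched as a short recursion: drop the unwanted leaves, then collapse any vertex left with a single child. The function name and the example tree T are hypothetical; the slide's own tree is not recoverable from the transcript.

```python
def restrict(tree, keep):
    """Restrict a nested-tuple tree to the leaves in keep."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    # Restrict every child and discard the ones that lost all their leaves.
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    # A vertex with one remaining child is the rooted analogue of a
    # degree-two vertex, so it is suppressed.
    return kids[0] if len(kids) == 1 else tuple(kids)

T = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
restricted = restrict(T, {"a", "c", "e", "g"})
print(restricted)
```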

Concept: Displaying

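A tree T (on label set X) displays a tree T’ (on label set X’, a subset of X) if T restricted to the labels X’ is a refinement of T’, i.e. it contains every split/clade of T’ and possibly more

[Figure: an example tree on the labels a, b, c, d, e, f, together with two smaller trees that it displays.]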
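The definition of displaying can be checked mechanically for rooted trees: restrict the larger tree to the labels of the smaller one and test whether every clade of the smaller tree is still present. This is a sketch with hypothetical function names and example trees, using nested tuples with string leaves.

```python
def leaves(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    if isinstance(tree, str):
        return {frozenset([tree])}
    result = {leaves(tree)}
    for child in tree:
        result |= clades(child)
    return result

def restrict(tree, keep):
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def displays(big, small):
    """True if big, restricted to the labels of small, refines small."""
    labels = leaves(small)
    if not labels <= leaves(big):
        return False
    return clades(small) <= clades(restrict(big, labels))

big = ((("a", "b"), "c"), ("d", ("e", "f")))
```

An unresolved tree such as `("a", "b", "c")` is displayed by `big` because a refinement is allowed to add clades, whereas `(("a", "c"), "b")` is not displayed because its clade {a, c} is absent from the restriction.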

The BUILD algorithm

- Polynomial-time algorithm due to Aho et al. (1981)
- Takes a set of rooted input trees and either outputs a supertree that displays all of the input trees or returns a fail message.

BUILD algorithm

- Recursive algorithm: at each step it constructs a graph associated with the triples displayed by the input trees.
- Depending on whether this associated graph is connected or disconnected the algorithm either terminates or subdivides the problem.
- What is this associated graph?

The associated graph

- Nodes of the graph are the complete label set, i.e. all the labels that appear in any of the input trees
- Put an edge between two nodes a and b if there is at least one input tree that displays the rooted triple ((a,b),c) for some c.
- If this graph has more than one node and is connected, stop and report a fail message
- Otherwise call the algorithm again once for each connected component, restricting the input to the labels in that component, and attach the resulting subtrees to a new root.
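The recursion described above can be sketched as follows. To keep the example short, the input is a list of rooted triples rather than whole trees; a tuple `(a, b, c)` stands for ((a,b),c). The function name, the use of `None` as the fail message, and the example triples are assumptions for illustration. Connected components are found with a small union-find structure.

```python
def build(labels, triples):
    """Return a nested-tuple supertree, or None if the triples are incompatible."""
    labels = set(labels)
    if len(labels) == 1:
        return next(iter(labels))
    parent = {x: x for x in labels}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Edge a-b for every triple ((a,b),c) whose three labels are all present.
    for a, b, c in triples:
        if a in labels and b in labels and c in labels:
            parent[find(a)] = find(b)
    components = {}
    for x in labels:
        components.setdefault(find(x), set()).add(x)
    if len(components) == 1:
        return None                      # connected graph: fail
    subtrees = []
    for component in sorted(components.values(), key=sorted):
        subtree = build(component, triples)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)               # join the subtrees under a new root

# Triples displayed by the hypothetical inputs (((a,b),c),d) and ((c,f),(d,e)).
triples = [
    ("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("b", "c", "d"),
    ("c", "f", "d"), ("c", "f", "e"), ("d", "e", "c"), ("d", "e", "f"),
]
supertree = build("abcdef", triples)
conflict = build("abc", [("a", "b", "c"), ("a", "c", "b")])
```

With these triples the first call splits the labels into {a,b,c,f} and {d,e}, and the subproblem on {a,b,c,f} splits into {a,b}, {c} and {f}. The conflicting pair of triples makes the graph connected, so the second call fails.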

BUILD example continued

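Subproblem 1: Restrict input to {a,b,c,f}

[Figure: the input trees restricted to {a,b,c,f}, the associated graph, and the partial supertree.]

Supertree so far: the root has children {a,b,c,f} and {d,e}; {a,b,c,f} has children {f}, {a,b} and {c}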

BUILD example continued

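Subproblems 2 and 3, on {d,e} and {a,b}, are trivial, so the final tree is

[Figure: the final supertree on the leaves a, b, c, f, d, e.]

The root has children {a,b,c,f} and {d,e}; {a,b,c,f} has children {f}, {a,b} and {c}; {a,b} has children {a} and {b}; {d,e} has children {d} and {e}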

What if the trees don’t agree?

- If the input trees are not compatible BUILD will return a fail message.
- It is also of interest to have methods that will return some output even if the input trees cannot all be displayed by a single supertree.
- Matrix representation with parsimony (MRP) is one such method…

Matrix Representation with Parsimony (MRP)

- Supertree method invented independently by Baum (1992) and Ragan (1992).
- Recode the input trees as a character matrix where each edge in each input tree defines a character.
- Do a parsimony analysis of the resulting character matrix.
- Take the strict consensus of the most parsimonious trees.
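The recoding step can be sketched as below. This version makes several assumptions: input trees are rooted and written as nested tuples, only non-trivial clades are turned into characters (some codings, including ones that number every edge, also include leaf edges), and the function name `mrp_matrix` is hypothetical. A taxon scores 1 if it is in the clade, 0 if it is in the tree but outside the clade, and ? if it is missing from that tree. The parsimony analysis itself is not shown.

```python
def leaves(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clusters(tree):
    """Non-trivial clades: more than one leaf, and not the whole leaf set."""
    if isinstance(tree, str):
        return set()
    found = set()
    for child in tree:
        if not isinstance(child, str):
            found.add(leaves(child))
            found |= clusters(child)
    return found

def mrp_matrix(trees):
    """Map each taxon to its row of the MRP character matrix."""
    all_taxa = sorted(set().union(*(leaves(t) for t in trees)))
    rows = {taxon: "" for taxon in all_taxa}
    for tree in trees:
        taxa = leaves(tree)
        characters = sorted(clusters(tree), key=sorted)
        for taxon in all_taxa:
            if taxon not in taxa:
                rows[taxon] += "?" * len(characters)
            else:
                rows[taxon] += "".join(
                    "1" if taxon in clade else "0" for clade in characters
                )
    return rows

matrix = mrp_matrix([
    ((("a", "b"), "c"), "d"),
    (("c", "d"), "e"),
])
for taxon, row in matrix.items():
    print(taxon, row)
```

The resulting rows can be handed to any parsimony program that accepts binary characters with missing data.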

MRP example

[Figure: three input trees with numbered edges: tree 1 on the taxa a, b, c, d, e, f (edges 1 to 9), tree 2 on the taxa a, b, c, d, e, g (edges 1 to 9), and tree 3 on the taxa d, e, g, h (edges 1 to 5).]

    Tree 1      Tree 2      Tree 3
    123456789   123456789   12345
a   101010100   101010000   ?????
b   011010100   011010000   ?????
c   000110100   000001000   ?????
d   000001100   000110000   01100
e   000000010   000000110   10100
f   000000001   ?????????   ?????
g   ?????????   000000101   00010
h   ?????????   ?????????   00001
