Solving Phylogenetic Trees

Benjamin Loyle

March 16, 2004

Cse 397 : Intro to MBIO

Benjamin Loyle 2004 Cse 397

Table of Contents
  • Problem & Term Definitions
  • A DCM*-NJ Solution
  • Performance Measurements
  • Possible Improvements

Phylogeny

[Figure: phylogenetic tree with leaves Orangutan, Human, Gorilla, and Chimpanzee. From the Tree of Life website, University of Arizona.]


DNA Sequence Evolution

[Figure: a tree of DNA sequences evolving over time. 3 million years ago: AAGACTT. 2 million years ago: AAGGCCT, TGGACTT. 1 million years ago: AGGGCAT, TAGCCCT, AGCACTT. Today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT.]

Problem Definition
  • The Tree of Life
    • Connecting all living organisms
    • All encompassing
    • Find evolution from simple beginnings
  • Even small sets of relationships are tough to resolve
  • Knowing the true history is impossible
    • Instead, infer a possible ancestral history.

So what?
  • Genome sequencing provides an entire map of a species, so why link species together?
  • We can understand evolution
  • Viable drug testing and design
  • Predict the function of genes
  • Influenza evolution

Why is that a problem?
  • Over 8 million organisms
  • The underlying optimization problems are NP-hard
  • Computing a tree for a few hundred species can take years
  • Error is a very large factor

What do we want?
  • Input
    • A collection of nodes such as taxa or protein strings to compare in a tree
  • Output
    • A tree topology that links those nodes and shows how they relate to each other
  • When do we want it?
    • FAST!

Preparing the input
  • Create a distance matrix
  • Collect all of the known pairwise distances into an n x n matrix
    • n is the number of nodes or taxa
  • Distances are found by sequence comparison

Distance Matrix

Take 5 separate DNA strings

A : GATCCATGA

B : GATCTATGC

C : GTCCCATTT

D : AATCCGATC

E : TCTCGATAG

The distance between A and B is 2

The distance between A and C is 4

The distance measure depends on the criteria you choose; here it is the number of positions at which two strings differ.
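The two distances quoted on this slide match a simple position-by-position mismatch count (Hamming distance). The sketch below assumes that criterion and builds the full n x n matrix for the five strings; the function names are illustrative, not from the presentation.

```python
from itertools import combinations

# The five example strings from the slide.
SEQS = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}

def hamming(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))

def distance_matrix(seqs):
    """Return a symmetric dict-of-dicts of pairwise Hamming distances."""
    names = sorted(seqs)
    d = {x: {y: 0 for y in names} for x in names}
    for x, y in combinations(names, 2):
        d[x][y] = d[y][x] = hamming(seqs[x], seqs[y])
    return d

d = distance_matrix(SEQS)
print(d["A"]["B"], d["A"]["C"])  # the two pairs quoted on the slide
```

Other criteria (for example, distances corrected for multiple substitutions at one site) would give a different matrix from the same strings.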

Distance Matrix
  • Let's start with an example matrix

[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E; cell values not recoverable.]

Let's make it simple (constrain the input)
  • Let's keep the distance between nodes within a certain limit
    • From F -> G
    • F and G have the largest distance; they are the most dissimilar of any nodes.
    • This is called the diameter of the tree
  • Let's keep the length of the input (the length of the strings) polynomial in the number of taxa.

ERROR?!?!!?
  • All trees are inferred, so how do you ever know if you’re right?
  • How accurate do we have to be?
  • We can create data sets with a known true tree, test the trees our methods build against it, and assume that the methods will then work in the real world

Data Sets
  • JC (Jukes-Cantor) Model
    • Sites evolve independently
    • Sites change with the same probability
    • Changes are single-character changes
      • e.g. A -> G or T -> C
    • The number of changes on an edge e is a Poisson random variable with expectation $\lambda(e)$

More Data Sets
  • K2P (Kimura 2-parameter) Model
    • Based on the JC Model
    • Allows transitions and transversions to occur with different probabilities
      • Transitions (A <-> G and C <-> T) are more likely than transversions (any other substitution)
      • Transitions are normally set to be twice as likely

Data Use
  • Using these models we can create our own evolution of data.
  • Start with one “ancestor” sequence and evolve it down a tree we choose
  • Plug the evolved sequences back into the method and see if you get the tree you started with
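A minimal sketch of generating such a test data set follows. It is a simplification made for illustration: each site changes independently with a fixed probability per edge instead of following the Poisson process, and `ti_weight` (a hypothetical parameter) is the weight of a transition relative to each single transversion, so 1.0 mimics JC and 2.0 mimics the usual K2P setting. The tree shape and all names are assumptions.

```python
import random

# Transition partner of each base: purine <-> purine, pyrimidine <-> pyrimidine.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}

def mutate_site(base, rng, ti_weight=1.0):
    """Replace base with a different base, favouring the transition by ti_weight."""
    ti = TRANSITION[base]
    tv = [b for b in "ACGT" if b not in (base, ti)]  # the two transversions
    return rng.choices([ti] + tv, weights=[ti_weight, 1.0, 1.0])[0]

def evolve(seq, p_change, rng, ti_weight=1.0):
    """Evolve seq along one edge; each site changes independently with p_change."""
    return "".join(
        mutate_site(b, rng, ti_weight) if rng.random() < p_change else b
        for b in seq
    )

def simulate(tree, seq, p_change, rng, ti_weight=1.0):
    """Evolve seq down a nested-tuple tree and return {leaf name: sequence}."""
    if isinstance(tree, str):  # a leaf keeps the sequence it inherited
        return {tree: seq}
    leaves = {}
    for child in tree:
        child_seq = evolve(seq, p_change, rng, ti_weight)
        leaves.update(simulate(child, child_seq, p_change, rng, ti_weight))
    return leaves

rng = random.Random(0)                        # fixed seed for repeatability
true_tree = (("A", "B"), ("C", ("D", "E")))   # hypothetical "true" tree
leaves = simulate(true_tree, "AAGACTT" * 10, 0.05, rng, ti_weight=2.0)
for name in sorted(leaves):
    print(name, leaves[name])
```

The leaf sequences can then be turned into a distance matrix and handed to a tree-building method, whose output is compared with `true_tree`.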

Aspects of Trees
  • Topology
      • The way in which nodes are connected to each other
      • “Are we really connected to apes directly, or just linked long before we could be considered mammals?”
  • Distance
      • The sum of the edge weights along the path from one node to another

What can distance tell us?
  • The distance between nodes IS the evolutionary distance between the nodes
  • The distance between an ancestor and a leaf (a present-day object) can be interpreted as an estimate of the number of evolutionary ‘steps’ that occurred.

Current Techniques
  • Maximum Parsimony
    • Minimize the total number of evolutionary events
    • Find the tree that has the minimum number of changes from ancestors
  • Maximum Likelihood
    • Probability based
    • Which tree makes the current data most probable

More Techniques
  • Neighbor Joining
    • Repeatedly joins pairs of leaves (or subtrees) by rules of numerical optimization
    • It shrinks the distance matrix by considering two ‘neighbors’ as one node
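As a rough illustration of the "numerical optimization" rule, here is a sketch of the standard Saitou-Nei form of neighbor joining, which picks the pair minimising Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from i to all other nodes. The internal node names `U0`, `U1`, ... and the example matrix are assumptions made for this sketch, not data from the presentation.

```python
def neighbor_joining(names, d):
    """Return the NJ tree as a list of edges (node, node, length).

    names: list of leaf labels; d: dict-of-dicts of pairwise distances.
    """
    d = {x: dict(d[x]) for x in names}  # work on a copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[x]: total distance from x to every other remaining node.
        r = {x: sum(d[x][y] for y in nodes if y != x) for x in nodes}
        # Pick the pair that minimises the Q criterion.
        i, j = min(
            ((x, y) for a, x in enumerate(nodes) for y in nodes[a + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        u = f"U{next_id}"  # hypothetical name for the new internal node
        next_id += 1
        # Branch lengths from u to the two joined neighbours.
        li = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        edges += [(u, i, li), (u, j, lj)]
        # Shrink the matrix: i and j are replaced by the single node u.
        d[u] = {}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # connect the last two nodes
    return edges

# Hypothetical 5-taxon matrix, chosen only to exercise the function.
NAMES = ["a", "b", "c", "d", "e"]
ROWS = [[0, 5, 9, 9, 8], [5, 0, 10, 10, 9], [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3], [8, 9, 7, 3, 0]]
EXAMPLE = {x: dict(zip(NAMES, row)) for x, row in zip(NAMES, ROWS)}
for edge in neighbor_joining(NAMES, EXAMPLE):
    print(edge)
```

Each pass of the loop removes two nodes and adds one, which is the matrix shrinking described above; n leaves therefore produce n - 2 internal nodes.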

Learning Neighbor Joining
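  • Why we need it will become apparent later on, but let's learn how to do Neighbor Joining (NJ)

[Table: the 5 x 5 example distance matrix with rows and columns A, B, C, D, E; cell values not recoverable.]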

NJ Part 1
  • First start with a “star tree”

[Figure: star tree with leaves A, B, C, D, and E.]

NJ Part 2
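  • Combine the two nodes picked from the distance matrix (roughly, the closest pair; NJ also corrects for how far each node is from all the others)
      • In our case it is node A and B at distance 3

[Figure: the tree on leaves A, B, C, D, and E after A and B are joined.]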

NJ Part 3
  • Repeat this until you have added n-2 internal nodes (3 for our 5 leaves)
      • n-2 internal nodes make it a binary tree, so we only have to include one more node.

[Figure: the resulting tree on leaves A, B, C, D, and E.]

Are we done?
  • ML and MP, even in heuristic form, take too long for large data sets
  • NJ has poor topological accuracy, especially for large-diameter trees
  • We need something that works for large diameter trees and can be run fast.

Here’s what we want
  • Our Goal
    • An “Absolute Fast Converging” Method
      • A method $\Phi$ is afc if, for all positive $f$, $g$, $\epsilon$, on the model $M$, there is a polynomial $p$ such that, for all $(T, \{\lambda(e)\}) \in M_{f,g}$, on a set $S$ of $n$ sequences of length at least $p(n)$ generated on $T$, we have $\Pr[\Phi(S) = T] > 1 - \epsilon$.
      • Simply: let's get the right tree, within a chosen degree of error, from sequences of only polynomial length.

A DCM*-NJ Solution
  • Two-phase construction of a final phylogenetic tree given a distance matrix d.
  • Phase 1 : Create a set of plausible trees for the distance matrix
  • Phase 2 : Find the best fitting tree

Phase 1
  • For each q in {dij}, compute a tree tq
  • Let T = { tq : q in {dij} }

Finding tq
  • Step 1: Compute Thresh(d,q)
  • Step 2: Triangulate Thresh(d,q)
  • Step 3: Compute an NJ tree for each maximal clique
  • Step 4: Merge the subtrees into a supertree

What does that mean?
  • Breaking the problem up
    • Use a distance threshold to break the problem into
        • A bunch of smaller-diameter subproblems (cliques)
    • Apply NJ to those cliques
    • Merge them back

Finding tq (terms)
  • Threshold Graph
    • Thresh(d,q) is the threshold graph where (i,j) is an edge if and only if dij <= q.
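The definition translates almost directly into code. The sketch below builds Thresh(d, q) as an adjacency dictionary and lists the candidate thresholds that Phase 1 iterates over. The four-node matrix is hypothetical: it reuses the W, X, Y, Z labels and the lengths 5, 10, and 15 from the triangulation example, with an assumed assignment of the two longer distances.

```python
def threshold_graph(d, q):
    """Return Thresh(d, q): i-j is an edge if and only if d[i][j] <= q."""
    nodes = sorted(d)
    adj = {i: set() for i in nodes}
    for a, i in enumerate(nodes):
        for j in nodes[a + 1:]:
            if d[i][j] <= q:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def candidate_thresholds(d):
    """The values q in {d_ij}: every distinct off-diagonal entry, sorted."""
    return sorted({d[i][j] for i in d for j in d if i != j})

# Hypothetical matrix: a square with sides 5 and diagonals 10 and 15.
D = {
    "W": {"W": 0, "X": 5, "Y": 5, "Z": 10},
    "X": {"W": 5, "X": 0, "Y": 15, "Z": 5},
    "Y": {"W": 5, "X": 15, "Y": 0, "Z": 5},
    "Z": {"W": 10, "X": 5, "Y": 5, "Z": 0},
}
print(candidate_thresholds(D))
print(threshold_graph(D, 5))
```

A small q gives a sparse graph with many small cliques; the largest q gives the complete graph, in which case the whole input is a single clique.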

Threshold
  • Let's bring back our distance matrix and create a threshold graph with q equal to d15, the distance between A and E
    • So q = 67

Distance Matrix
  • Our old example matrix

[Table: the 5 x 5 example distance matrix with rows and columns A, B, C, D, E; cell values not recoverable.]

With q = d15 = 67

[Figure: threshold graph on nodes A, B, C, D, and E; the edges shown have weights 16, 47, 63, and 67.]

Triangulating
  • A graph is triangulated if any cycle with four or more vertices has a chord
    • That is, an edge joining two nonconsecutive vertices of the cycle.
  • Our example is already triangulated, but let's look at another


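Triangulating

[Figure: a four-cycle on vertices W, X, Y, and Z whose four sides each have length 5; the two diagonals have lengths 10 and 15.]

Let's say this is the threshold graph for q = 5. The edges of length 10 and 15 would not be in the graph. To triangulate this graph, you add the edge of length 10.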
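Whether a graph is triangulated can be checked directly. The sketch below relies on a standard fact: a graph is triangulated exactly when its vertices can be deleted one at a time, each having neighbours that already form a clique at the moment it is deleted. It only tests the property; it is not the greedy triangulation heuristic itself. Which diagonal has length 10 is an assumption (W-Z here).

```python
def is_triangulated(adj):
    """True if every cycle with four or more vertices has a chord."""
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    while adj:
        # A simplicial vertex is one whose neighbours are all pairwise adjacent.
        simplicial = next(
            (v for v, nb in adj.items()
             if all(b in adj[a] for a in nb for b in nb if a != b)),
            None,
        )
        if simplicial is None:
            return False  # stuck: some chordless cycle remains
        for nb in adj[simplicial]:
            adj[nb].discard(simplicial)
        del adj[simplicial]
    return True

def add_edge(adj, u, v):
    """Return a copy of the graph with the edge u-v added."""
    new = {x: set(nb) for x, nb in adj.items()}
    new[u].add(v)
    new[v].add(u)
    return new

# The four-cycle from the slide: only the sides of length 5 are present at q = 5.
SQUARE = {"W": {"X", "Y"}, "X": {"W", "Z"}, "Y": {"W", "Z"}, "Z": {"X", "Y"}}
print(is_triangulated(SQUARE))                      # the bare four-cycle
print(is_triangulated(add_edge(SQUARE, "W", "Z")))  # with the assumed length-10 chord
```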

Maximal Cliques
  • A clique that cannot be enlarged by the addition of another vertex.
  • Recall our original threshold graph which is triangulated:

Triangulated Threshold Graph
  • Our old graph

[Figure: the same threshold graph on nodes A, B, C, D, and E, with edge weights 16, 47, 63, and 67.]

Clique

Our maximal cliques would be:

{A, B, E}

{C, D}
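These cliques can be found mechanically. The sketch below uses the basic Bron-Kerbosch enumeration, which works on any graph (on a triangulated graph faster specialised methods exist; this is the generic version chosen for brevity). The edge set A-B, A-E, B-E, C-D is an assumption inferred from the two cliques listed above.

```python
def maximal_cliques(adj):
    """Return all maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique; p: vertices that can still extend it;
        # x: vertices already tried (used to reject non-maximal cliques).
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

# Assumed adjacency of the example threshold graph.
GRAPH = {
    "A": {"B", "E"},
    "B": {"A", "E"},
    "C": {"D"},
    "D": {"C"},
    "E": {"A", "B"},
}
for clique in maximal_cliques(GRAPH):
    print(sorted(clique))
```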

Create Trees for the Cliques
  • We have two maximal cliques, so we make two trees: one for {A, B, E} and one for {C, D}
    • How do we make these trees?
    • Remember NJ?

Trees for {A, B, E} and {C, D}

[Figure: one tree on leaves A, B, and E, and one tree on leaves C and D.]

Merge your separate trees together.
  • Create one Supertree
  • This is done by creating a minimum set of edges in the trees and calling that the “backbone”
  • This is its own doctoral thesis, so let's do a little hand waving

That sounds like NP-hard!
  • Computing the threshold graph takes polynomial time
  • Minimally triangulating is NP-hard, but a triangulation can be obtained in polynomial time using a greedy heuristic without too much loss in performance.
  • Finding the maximal cliques is only polynomial if the input graph is triangulated (which it is!).
  • If all previous are done, creating a supertree can be done in polynomial time as well.

Where are we now?
  • For each threshold q we now have a complete phylogeny, created from smaller trees in our matrix joined together
  • Remember that we tried every possible threshold, so we have a whole set of candidate trees built from every possible size of smaller trees.

Phase 2
  • Which one is right?
    • Found using the SQS (Short Quartet Support) method
    • Let T be one of the candidate trees made in Phase 1
    • Break the data into sets of four taxa
      • {A, B, C, D} {A, C, D, E} {A, B, D, E}… etc
      • Restrict the larger tree to hold only “one set” at a time
      • These are called Quartets

SQS - A Guide
  • Q(T) is the set of trees induced by T on each set of four leaves.
  • Let Qw (a different Q) be the set of quartet trees, estimated from the input data, with diameter less than or equal to w
  • Find the maximum w for which every quartet tree in Qw also appears in Q(T)
  • This w is the “support” of that tree

SQS - Rephrased
  • Qw is the set of quartet trees which have a diameter <= w
  • Support of T is the max w where Qw is a subset of Q(T)
    • Support is our “quality measure”
    • What exactly are we measuring?
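A rough sketch of the support computation follows, under two assumptions that the slides do not spell out: each quartet tree is inferred from the distances with the four-point method (choose the pairing ab|cd with the smallest d(a,b) + d(c,d)), and the candidate tree T is supplied as its matrix of leaf-to-leaf path lengths, so Q(T) can be read off with the same rule. All names are illustrative.

```python
from itertools import combinations

def quartet_split(dist, quartet):
    """Four-point method: return the pairing ab|cd minimising d(a,b) + d(c,d)."""
    a, b, c, d = quartet
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    best = min(
        pairings,
        key=lambda s: dist[s[0][0]][s[0][1]] + dist[s[1][0]][s[1][1]],
    )
    return frozenset(frozenset(side) for side in best)

def support(tree_dist, d):
    """Largest w in {d_ij} for which Qw is a subset of Q(T), or None.

    tree_dist: path lengths between leaves in the candidate tree T.
    d: the input distance matrix.
    """
    taxa = sorted(d)
    quartets = list(combinations(taxa, 4))
    induced = {q: quartet_split(tree_dist, q) for q in quartets}  # Q(T)
    best = None
    for w in sorted({d[i][j] for i, j in combinations(taxa, 2)}):
        # Qw: quartets whose largest pairwise input distance is at most w.
        q_w = [q for q in quartets
               if max(d[i][j] for i, j in combinations(q, 2)) <= w]
        if all(quartet_split(d, q) == induced[q] for q in q_w):
            best = w
        else:
            break  # Qw only grows with w, so larger w fail as well
    return best
```

Because Qw only grows as w increases, the loop can stop at the first threshold where some short quartet disagrees with T. Phase 2 would call `support` once per candidate tree and keep the tree with the largest value.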


[Figure: Qw = a set of quartet trees, each on four of the leaves A, B, C, D, and E.]

SQS Method
  • Return the tree in which the support of that tree is the maximum.
    • If more than one such tree exists return the tree found first.
    • This is the tree with the smallest original diameter (remember from phase 1)

How do we know we’re right?
  • Compare it to the data set we created
  • Look at Robinson-Foulds accuracy
    • Remove one edge in the tree we’ve created.
      • We now have two trees, each with its own set of leaves
    • Is there any way to create the same two sets of leaves by removing one edge in the true tree behind our data set?
      • If no, add a ‘point’ of error.
    • Repeat this for all edges
    • When the value is not zero, the trees are not identical
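The edge-by-edge comparison can be sketched with leaf bipartitions ("splits"): removing an internal edge divides the leaves into two sets, and two trees agree on that edge when both contain the same split. Trees are written as nested tuples here, which is an assumption about representation. `split_errors` counts points as described above (edges of our tree missing from the true tree), while `robinson_foulds` is the usual symmetric version that counts mismatches in both directions.

```python
def leaf_set(tree):
    """All leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Non-trivial leaf bipartitions, one per internal edge of the tree."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits that cut off a single leaf (or nothing).
        if 2 <= len(below) <= len(all_leaves) - 2:
            found.add(frozenset([below, all_leaves - below]))
        return below

    visit(tree)
    return found

def split_errors(inferred, true):
    """Edges of the inferred tree whose split does not occur in the true tree."""
    return len(splits(inferred) - splits(true))

def robinson_foulds(t1, t2):
    """Number of splits found in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))

TRUE_TREE = (("A", "B"), ("C", ("D", "E")))  # hypothetical true tree
INFERRED = (("A", "D"), ("C", ("B", "E")))   # hypothetical wrong answer
print(split_errors(INFERRED, TRUE_TREE), robinson_foulds(INFERRED, TRUE_TREE))
```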

Performance of DCM*-NJ
  • Outperforms the NJ method at sequence lengths above 4000 and with more taxa.

[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM-NJ.]

Improvements
  • Possible improvements lie in Phase 2
  • Include test of Maximum Parsimony (MP)
    • Try to minimize the overall length of the tree (its total number of changes)
  • Test using statistical evidence
    • Maximum Likelihood (ML)

Performance gains
  • Simply changing Phase 2 has massive gains in accuracy!
  • DCM-NJ+MP and DCM-NJ+ML are VERY accurate for data sets with sequence lengths greater than 4000 and are NOT NP-hard.
  • DCM-NJ+MP finished its analysis on a 107-taxon tree in under three minutes.

Comparing Improvements

[Figure: error rate (0 to 0.8) versus number of leaves (0 to 1600) for NJ, DCM-NJ+SQS, DCM-NJ+MP, and HGT-FP.]
