Solving phylogenetic trees
Solving Phylogenetic Trees. Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO. Table of Contents. Problem & Term Definitions A DCM*-NJ Solution Performance Measurements Possible Improvements. Phylogeny. From the Tree of the Life Website, University of Arizona. Orangutan. Human.

Solving Phylogenetic Trees

Benjamin Loyle

March 16, 2004

Cse 397 : Intro to MBIO

Benjamin Loyle 2004 Cse 397


Table of Contents

  • Problem & Term Definitions

  • A DCM*-NJ Solution

  • Performance Measurements

  • Possible Improvements

Phylogeny

[Figure: phylogeny of Orangutan, Gorilla, Chimpanzee and Human, from the Tree of Life website, University of Arizona]

DNA Sequence Evolution

[Figure: one ancestral sequence diverging down a tree over time]

  • -3 mil yrs: AAGACTT

  • -2 mil yrs: AAGGCCT, TGGACTT

  • -1 mil yrs: AGGGCAT, TAGCCCT, AGCACTT

  • today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT

Problem Definition

  • The Tree of Life

    • Connecting all living organisms

    • All encompassing

    • Find evolution from simple beginnings

  • Even smaller relations are tough

  • Knowing the true history is impossible

    • So we infer a plausible ancestral history.

So what….

  • Genome sequencing provides an entire map of a species, so why link them?

  • We can understand evolution

  • Viable drug testing and design

  • Predict the function of genes

  • Influenza evolution

Why is that a problem?

  • Over 8 million organisms

  • The optimization problems behind current methods are NP-hard

  • Computing a tree for a few hundred species can take years

  • Error is a very large factor

What do we want?

  • Input

    • A collection of nodes such as taxa or protein strings to compare in a tree

  • Output

    • A tree topology that links those nodes to each other

  • When do we want it?

    • FAST!

Preparing the input

  • Create a distance matrix

  • Collect all of the known pairwise distances into a matrix sized n x n

    • n is the number of nodes or taxa

  • Found with sequence comparison

Distance Matrix

Take 5 separate DNA strings

A : GATCCATGA

B : GATCTATGC

C : GTCCCATTT

D : AATCCGATC

E : TCTCGATAG

The distance between A and B is 2

The distance between A and C is 4

The distance depends on the criterion you choose. Here it is the number of positions at which the two strings differ.
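
A minimal sketch of how the matrix could be filled in, assuming the criterion is the count of mismatched positions between equal-length strings (Hamming distance). The names `hamming` and `distance_matrix` are illustrative and not from the presentation.

```python
def hamming(s, t):
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))


def distance_matrix(seqs):
    """Return {(x, y): distance} for every ordered pair of sequence names."""
    return {(x, y): hamming(seqs[x], seqs[y]) for x in seqs for y in seqs}


# The five strings from the slide.
seqs = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}
matrix = distance_matrix(seqs)

# Print the n x n matrix, one row per sequence.
print("  " + " ".join(seqs))
for x in seqs:
    print(x, " ".join(str(matrix[(x, y)]) for y in seqs))
```

Under this criterion the A/B entry is 2 and the A/C entry is 4, matching the slide. A different criterion (for example one that corrects for repeated changes at the same site) would give different numbers.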

Distance Matrix

  • Let's start with an example matrix

[Table: 5 x 5 distance matrix with rows and columns A, B, C, D, E; the cell values are not preserved]

Let's make it simple (constrain the input)

  • Let's keep the distance between nodes within a certain limit

    • From F -> G

    • F and G have the largest distance; they are the most dissimilar of any nodes.

    • This is called the diameter of the tree

  • Let's keep the length of the input (length of the strings) polynomial in the number of nodes.

ERROR?!?!!?

  • All trees are inferred, so how do you ever know if you’re right?

  • How accurate do we have to be?

  • We can create data sets with a known true tree, test our methods on them, and assume that they will then work in the real world

Data Sets

  • JC Model

    • Sites evolve independently

    • Sites change with the same probability

    • Changes are single character changes

      • i.e. A -> G or T -> C

    • The number of changes on an edge e is a Poisson variable with expectation λ(e)

More Data Sets

  • K2P Model

    • Based on JC Model

    • Allows transitions and transversions to have different probabilities

      • Transitions (A <-> G and C <-> T) are more likely than transversions (every other substitution)

      • Normally set to twice as likely

Data Use

  • Using these models we can simulate our own evolution of data.

  • Start with one “ancestor” and create evolutions

  • Plug the evolved sequences back in and see if you recover the tree you started with
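
A rough sketch of how such test data could be generated. It is a simplification and not the presentation's own simulator: each site changes independently with a fixed probability `p_change` per generation (an assumed parameter), and `kappa` is a hypothetical weight making the transition `kappa` times as likely as each transversion. With `kappa=1.0` every replacement base is equally likely, as in the JC model; `kappa=2.0` mimics the K2P bias.

```python
import random

# Transitions swap the purines (A, G) or the pyrimidines (C, T);
# every other substitution is a transversion.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def mutate(base, kappa, rng):
    """Pick a different base; the transition is `kappa` times as likely as each transversion."""
    others = [b for b in "ACGT" if b != base]
    weights = [kappa if b == TRANSITION[base] else 1.0 for b in others]
    return rng.choices(others, weights=weights)[0]


def evolve(seq, p_change, kappa=1.0, rng=random):
    """Copy `seq`, changing each site independently with probability `p_change`."""
    return "".join(
        mutate(b, kappa, rng) if rng.random() < p_change else b for b in seq
    )


rng = random.Random(2004)  # fixed seed so a run can be repeated
ancestor = "AAGACTT"
# Two generations of a small made-up tree: JC-style first, then K2P-style.
children = [evolve(ancestor, 0.2, kappa=1.0, rng=rng) for _ in range(2)]
grandchildren = [
    evolve(child, 0.2, kappa=2.0, rng=rng) for child in children for _ in range(2)
]
print(ancestor, children, grandchildren)
```

The leaves (`grandchildren`) would then be fed to a tree-building method, and the result compared with the known branching order used to generate them.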

Aspects of Trees

  • Topology

    • The method in which nodes are connected to each other

    • “Are we really connected to apes directly, or just linked long before we could be considered mammals?”

  • Distance

    • The sum of the weighted edges to reach one node from another
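
To make the distance definition concrete, here is a small sketch that sums edge weights along the unique path between two nodes of a tree. The tree, its internal node names (`u`, `v`, `w`) and its edge weights are made up for illustration.

```python
def path_length(tree, start, goal):
    """Sum the edge weights on the unique path from `start` to `goal`.

    `tree` maps each node to a dict {neighbour: edge weight}.
    """
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, weight in tree[node].items():
            if nxt != parent:  # never walk back along the edge we came from
                stack.append((nxt, node, dist + weight))
    raise ValueError("goal is not reachable from start")


# Hypothetical weighted tree: leaves A..E, internal nodes u, v, w.
edges = [("A", "u", 2), ("B", "u", 1), ("u", "v", 3), ("C", "v", 4),
         ("v", "w", 2), ("D", "w", 1), ("E", "w", 5)]
tree = {}
for a, b, weight in edges:
    tree.setdefault(a, {})[b] = weight
    tree.setdefault(b, {})[a] = weight

print(path_length(tree, "A", "B"))  # 2 + 1 = 3
print(path_length(tree, "A", "D"))  # 2 + 3 + 2 + 1 = 8
```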

    What can distance tell us?

    • The distance between nodes IS the evolutionary distance between the nodes

    • The distance between an ancestor and a leaf (a present-day object) can be interpreted as an estimate of the number of evolutionary ‘steps’ that occurred.

    Current Techniques

    • Maximum Parsimony

      • Minimize the total number of evolutionary events

      • Find the tree that has the minimum number of changes from ancestors

    • Maximum Likelihood

      • Probability based

      • Which tree makes the current data most probable

    More Techniques

    • Neighbor Joining

      • Repeatedly joins pairs of leaves (or subtrees) by rules of numerical optimization

      • It shrinks the distance matrix by considering two ‘neighbors’ as one node

    Learning Neighbor Joining

    • Why will become apparent later on, but let's learn how to do Neighbor Joining (NJ)

    [Table: the 5 x 5 example distance matrix over A, B, C, D, E; the cell values are not preserved]

    NJ Part 1

    • First start with a “star tree”

    [Figure: star tree with the leaves A, B, C, D and E all attached to one central node]

    NJ Part 2

    • Combine the closest two nodes (from distance matrix)

      • In our case it is node A and B at distance 3

    [Figure: the tree on A, B, C, D and E after joining A and B]

    NJ Part 3

    • Repeat this until the tree has n-2 internal nodes (3 here)

      • An unrooted binary tree on n leaves has n-2 internal nodes, so we only have to add one more.

    [Figure: the finished binary tree on A, B, C, D and E]
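
The NJ steps above can be sketched as follows. One assumption to flag: the slides describe joining the closest pair, whereas this sketch uses the usual NJ selection rule, which corrects each raw distance by both nodes' total distance to everything else (the Q-criterion). The function name, the dictionary layout and the example distances (taken from a made-up tree with unit-length edges) are all illustrative, and branch lengths are not computed.

```python
from itertools import combinations


def neighbor_joining(names, d):
    """Return an unrooted topology as nested tuples.

    `d` maps frozenset({x, y}) to the distance between x and y.
    """
    d = dict(d)  # work on a copy
    nodes = list(names)
    while len(nodes) > 3:
        n = len(nodes)
        # Total distance from each node to all the others.
        total = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
                 for x in nodes}
        # Q-criterion: pick the pair minimising (n - 2) * d(i, j) - total(i) - total(j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]])
        joined = (i, j)
        # Shrink the matrix: the new node replaces its two neighbours.
        for k in nodes:
            if k != i and k != j:
                d[frozenset((joined, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d[frozenset((i, j))]
                )
        nodes = [k for k in nodes if k != i and k != j] + [joined]
    return tuple(nodes)


# Made-up distances from the tree ((A,B),C,(D,E)) with every edge of length 1.
raw = {"AB": 2, "AC": 3, "AD": 4, "AE": 4, "BC": 3,
       "BD": 4, "BE": 4, "CD": 3, "CE": 3, "DE": 2}
d = {frozenset(pair): dist for pair, dist in raw.items()}
print(neighbor_joining("ABCDE", d))
```

On these distances the sketch pairs A with B and keeps D and E together, which is the tree the distances were derived from.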

    Are we done?

    • ML and MP, even in heuristic form, take too long for large data sets

    • NJ has poor topological accuracy, especially for large diameter trees

    • We need something that works for large diameter trees and can be run fast.

    Here’s what we want

    • Our Goal

      • An “Absolute Fast Converging” Method

        • A method Φ is afc if, for all positive f, g, ε, on the model M, there is a polynomial p such that, for all (T, {λ(e)}) in the set M_{f,g}, on a set S of n sequences of length at least p(n) generated on T, we have Pr[Φ(S) = T] > 1 - ε.

        • Simply: with sequences of only polynomial length, the method recovers the true tree to within a chosen probability of error.

    A DCM* - NJ Solution

    • Two-phase construction of a final phylogenetic tree given a distance matrix d.

    • Phase 1 : Create a set of plausible trees for the distance matrix

    • Phase 2 : Find the best fitting tree

    Phase 1

    • For each q in {dij}, compute a tree tq

    • Let T = { tq : q in {dij} }

    Finding tq

    • Step 1: Compute Thresh(d,q)

    • Step 2: Triangulate Thresh(d,q)

    • Step 3: Compute an NJ tree for each maximal clique

    • Step 4: Merge the subtrees into a supertree

    What does that mean?

    • Breaking the problem up

      • Use a distance threshold to break the problem into

        • A bunch of smaller diameter subproblems (cliques)

    • Apply NJ to those cliques

    • Merge them back

    Finding tq (terms)

    • Threshold Graph

      • Thresh(d,q) is the threshold graph where (i,j) is an edge if and only if dij <= q.
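
A short sketch of the definition, keeping exactly the pairs with dij <= q. The distance values are made up: they reuse the weights 16, 47, 63 and 67 from the example, but which pair carries which weight, and all the larger values, are assumptions.

```python
def threshold_graph(d, q):
    """Return the edges {i, j} with d[i, j] <= q.

    `d` maps frozenset({i, j}) to a distance.
    """
    return {pair for pair, dist in d.items() if dist <= q}


# Hypothetical distances between five taxa.
raw = {"AB": 16, "BE": 63, "AE": 67, "CD": 47, "DE": 70,
       "CE": 75, "AC": 80, "BC": 85, "AD": 90, "BD": 95}
d = {frozenset(pair): dist for pair, dist in raw.items()}

edges = threshold_graph(d, 67)
print(sorted("".join(sorted(e)) for e in edges))  # ['AB', 'AE', 'BE', 'CD']
```

Raising q adds edges and lowering it removes them, so each distinct value in the matrix gives a (possibly) different graph.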

    Threshold

    • Let's bring back our distance matrix and create a threshold with q equal to d15, the distance between A and E

      • So q = 67

    Distance Matrix

    • Our old example matrix

    [Table: the 5 x 5 example distance matrix over A, B, C, D, E; the cell values are not preserved]

    With q = D15 = 67

    [Figure: threshold graph on A, B, C, D and E for q = 67, with edges of weight 16, 47, 63 and 67]

    Triangulating

    • A graph is triangulated if every cycle with four or more vertices has a chord

      • That is, an edge joining two nonconsecutive vertices of the cycle.

    • Our example is already triangulated, but let's look at another

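    Triangulating

    [Figure: a 4-cycle on W, X, Y and Z whose four sides have length 5; the two diagonals have lengths 10 and 15]

    Let's say this is for q = 5. The edges of length 10 and 15 would not be in the graph.

    To triangulate this graph you add the edge of length 10.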
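
One simple greedy way to add chords, sketched under stated assumptions: vertices are eliminated one at a time, always choosing the vertex whose missing neighbour-to-neighbour edges have the smallest total length, and those edges are added. The slides do not spell out which greedy heuristic DCM uses, so this is an illustration and not the method itself. Which diagonal has length 10 is also an assumption (W-Z here).

```python
from itertools import combinations


def greedy_triangulate(vertices, edges, dist):
    """Return the chords a greedy elimination adds to triangulate the graph.

    `edges` holds (u, v) pairs; `dist` maps frozenset({u, v}) to a length.
    """
    nbrs = {v: set() for v in vertices}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    def missing(v):
        # Pairs of v's remaining neighbours that are not joined to each other.
        return [(a, b) for a, b in combinations(sorted(nbrs[v]), 2)
                if b not in nbrs[a]]

    def cost(v):
        return sum(dist[frozenset(pair)] for pair in missing(v))

    added = []
    remaining = list(vertices)
    while remaining:
        v = min(remaining, key=cost)  # cheapest vertex to eliminate next
        for a, b in missing(v):
            nbrs[a].add(b)
            nbrs[b].add(a)
            added.append((a, b))
        for n in nbrs[v]:
            nbrs[n].discard(v)
        del nbrs[v]
        remaining.remove(v)
    return added


# The square from the slide: four sides of length 5, diagonals 10 and 15.
square = [("W", "X"), ("X", "Z"), ("Z", "Y"), ("Y", "W")]
lengths = {frozenset(e): 5 for e in square}
lengths[frozenset(("W", "Z"))] = 10  # assumed: which diagonal is 10 is not stated
lengths[frozenset(("X", "Y"))] = 15
chords = greedy_triangulate("WXYZ", square, lengths)
print(chords)  # [('W', 'Z')]
```

The only chord added is the diagonal of length 10, as on the slide; a graph that is already triangulated gets no new edges.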

    Maximal Cliques

    • A clique that cannot be enlarged by the addition of another vertex.

    • Recall our original threshold graph which is triangulated:

    Triangulated Threshold Graph

    • Our old graph

    [Figure: the same threshold graph on A, B, C, D and E for q = 67, with edges of weight 16, 47, 63 and 67]

    Clique

    Our maximal cliques would be:

    {A, B, E}

    {C, D}
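
A small sketch that lists maximal cliques with the general-purpose Bron-Kerbosch recursion. It is not the specialised polynomial-time routine available for triangulated graphs, and the edge list is an assumption: it is the smallest set of edges consistent with the two cliques above.

```python
def maximal_cliques(vertices, edges):
    """Return every maximal clique as a frozenset (basic Bron-Kerbosch)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    cliques = []

    def expand(clique, candidates, excluded):
        # No vertex can extend the clique and none was skipped: it is maximal.
        if not candidates and not excluded:
            cliques.append(clique)
            return
        for v in sorted(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(frozenset(), frozenset(vertices), frozenset())
    return cliques


# Assumed edges of the example threshold graph.
edges = [("A", "B"), ("A", "E"), ("B", "E"), ("C", "D")]
for clique in maximal_cliques("ABCDE", edges):
    print(sorted(clique))
```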

    Create Trees for the Cliques

    • We have two maximal cliques, so we make two trees: one for {A, B, E} and one for {C, D}

      • How do we make these trees?

      • Remember NJ?

    Tree {A, B, E} and {C,D}

    [Figure: the NJ tree built for {A, B, E} and the NJ tree built for {C, D}]

    Merge your separate trees together.

    • Create one Supertree

    • This is done by creating a minimum set of edges in the trees and calling that the “backbone”

    • This is its own doctoral thesis, so let's do a little hand waving

    That sounds like NP-hard!

    • Computing the threshold graph is polynomial

    • Finding an optimal triangulation is NP-hard, but a good one can be obtained in polynomial time using a greedy heuristic without too much loss in performance.

    • Finding maximal cliques is polynomial only if the input graph is triangulated (which it is!).

    • If all previous steps are done, creating a supertree can be done in polynomial time as well.

    Where are we now?

    • For each threshold q we now have a finalized phylogeny, created from smaller trees in our matrix joined together

    • Remember we tried every possible threshold, so we have a whole set of candidate trees.

    Phase 2

    • Which one is right?

      • Found using the SQS (Short Quartet Support) method

      • Let T be one of the candidate trees made in Phase 1

      • Break the data into sets of four taxa

        • {A, B, C, D} {A, C, D, E} {A, B, D, E}… etc

        • Restrict the larger tree so that it holds only “one set”

        • These are called Quartets

    SQS - A Guide

    • Q(T) is the set of trees induced by T on each set of four leaves.

    • Let Qw (different Q) be a set of quartets with diameter less than or equal to w

    • Find the maximum w for which every quartet in Qw agrees with the tree

    • This w is the “support” of that tree

    SQS - Rephrased

    • Qw is the set of quartet trees which have a diameter <= w

    • Support of T is the max w where Qw is a subset of Q(T)

      • Support is our “quality measure”

      • What exactly are we measuring?

    [Figure: Qw drawn as a set of quartet trees over the taxa A, B, C, D and E]

    SQS Method

    • Return the tree in which the support of that tree is the maximum.

      • If more than one such tree exists return the tree found first.

      • This is the tree with the smallest original diameter (remember from phase 1)
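
A sketch of how support could be computed, with several assumptions. The quartet trees in Qw are estimated from the distance matrix with the four-point rule (the pairing of the four taxa with the smallest sum of distances), a candidate tree is described by the leaf sets on one side of each internal edge, and w ranges over the values in the matrix. The distances and both candidate trees are made up.

```python
from itertools import combinations


def tree_quartet(splits, four):
    """Topology the tree induces on four taxa, as a set of two pairs (or None)."""
    for side in splits:
        inside = [x for x in four if x in side]
        if len(inside) == 2:
            outside = [x for x in four if x not in side]
            return frozenset([frozenset(inside), frozenset(outside)])
    return None  # the tree does not resolve this quartet


def distance_quartet(d, four):
    """Four-point rule: the pairing with the smallest sum of distances."""
    a, b, c, e = four
    pairings = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
    best = min(pairings, key=lambda p: d[frozenset(p[0])] + d[frozenset(p[1])])
    return frozenset([frozenset(best[0]), frozenset(best[1])])


def support(splits, taxa, d):
    """Largest w such that every quartet of diameter <= w agrees with the tree."""
    best = 0
    for w in sorted(set(d.values())):
        short = [four for four in combinations(taxa, 4)
                 if max(d[frozenset(p)] for p in combinations(four, 2)) <= w]
        if any(tree_quartet(splits, f) != distance_quartet(d, f) for f in short):
            break  # Qw only grows with w, so larger values fail too
        best = w
    return best


# Made-up distances from the tree ((A,B),C,(D,E)) with every edge of length 1.
raw = {"AB": 2, "AC": 3, "AD": 4, "AE": 4, "BC": 3,
       "BD": 4, "BE": 4, "CD": 3, "CE": 3, "DE": 2}
d = {frozenset(pair): dist for pair, dist in raw.items()}

good = [frozenset("AB"), frozenset("DE")]   # candidate ((A,B),C,(D,E))
bad = [frozenset("AC"), frozenset("DE")]    # candidate ((A,C),B,(D,E))
print(support(good, "ABCDE", d), support(bad, "ABCDE", d))  # 4 3
```

Phase 2 would return the candidate with the larger support, here the first one.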

    How do we know we’re right?

    • Compare it to the true tree behind the data set we created

    • Look at Robinson-Foulds accuracy

      • Remove one edge in the tree we’ve created.

        • We now have two trees

      • Is there any way to create the same two sets of leaves by removing one edge in the true tree?

        • If no, add a ‘point’ of error.

      • Repeat this for all edges

      • If the final value is not zero, the trees are not identical
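
The edge-by-edge comparison can be sketched by recording, for every internal edge, the set of leaves it cuts off. Trees are written as nested tuples of leaf names, which is an assumed representation, and both example trees are made up. The function returns the count described above (edges of our tree with no match in the true tree) together with the symmetric count that also includes true edges we missed.

```python
def splits(tree, taxa):
    """Leaf sets cut off by each internal edge of a nested-tuple tree."""
    found = set()
    ref = min(taxa)  # fixed reference leaf so each split has one canonical side

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        # Skip edges that cut off a single leaf or nothing at all.
        if 1 < len(below) < len(taxa) - 1:
            found.add(below if ref not in below else taxa - below)
        return below

    leaves(tree)
    return found


def robinson_foulds(inferred, true, taxa):
    """Return (edges of `inferred` missing from `true`, symmetric difference)."""
    taxa = frozenset(taxa)
    ours, theirs = splits(inferred, taxa), splits(true, taxa)
    return len(ours - theirs), len(ours ^ theirs)


true_tree = (("A", "B"), "C", ("D", "E"))
our_tree = (("A", "C"), "B", ("D", "E"))
print(robinson_foulds(our_tree, true_tree, "ABCDE"))   # (1, 2)
print(robinson_foulds(true_tree, true_tree, "ABCDE"))  # (0, 0)
```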

    Performance of DCM * - NJ

    • Outperforms NJ method at sequence lengths above 4000 and with more taxa.

    [Figure: error rate (0 to 0.8) against number of taxa (0 to 1600) for NJ and DCM-NJ]

    Improvements

    • Possible improvements lie in Phase 2

    • Include test of Maximum Parsimony (MP)

      • Try to minimize the overall length of the tree (the total number of changes)

    • Test using statistical evidence

      • Maximum Likelihood (ML)

    Performance gains

    • Simply changing Phase 2 has massive gains in accuracy!

    • DCM-NJ+MP and DCM-NJ+ML are VERY accurate for data sets greater than 4000 and do NOT require solving an NP-hard problem.

    • DCM - NJ + MP finished its analysis on a 107 taxon tree in under three minutes.

    Comparing Improvements

    [Figure: error rate (0 to 0.8) against number of leaves (0 to 1600) for NJ, DCM-NJ+SQS, DCM-NJ+MP and HGT-FP]
