http://creativecommons.org/licenses/by-sa/2.0/. CIS 786, Lecture 2. Usman Roshan. Phylogenetics. Study of how species relate to each other “Nothing in biology makes sense, except in the light of evolution”, Theodosius Dobzhansky, Am. Biol. Teacher (1973) Rich in computational problems

http://creativecommons.org/licenses/by-sa/2.0/


CIS 786, Lecture 2

Usman Roshan


Phylogenetics

  • Study of how species relate to each other

  • “Nothing in biology makes sense, except in the light of evolution”, Theodosius Dobzhansky, Am. Biol. Teacher (1973)

  • Rich in computational problems

  • Fundamental tool in comparative bioinformatics


Why phylogenetics?

  • Study of evolution

    • Origin and migration of humans

    • Origin and spread of disease

  • Many applications in comparative bioinformatics

    • Sequence alignment

    • Motif detection (phylogenetic motifs, evolutionary trace, phylogenetic footprinting)

    • Correlated mutation (useful for structural contact prediction)

    • Protein interaction

    • Gene networks

    • Vaccine development

    • And many more…


Phylogeny Problem

[Figure: five taxa and their sequences, U = AGGGCAT, V = TAGCCCA, W = TAGACTT, X = TGCACAA, Y = TGCGCTT, shown with a phylogenetic tree on the leaves X, U, Y, V, W]


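Bipartitions

  • Phylogenies are equivalent to bipartitions: removing an edge of the tree splits the taxa into two sets, and the set of all such bipartitions determines the tree topology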
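The correspondence between trees and bipartitions can be sketched in a few lines. This is a minimal illustration, assuming trees are written as nested tuples of leaf names; the function names are hypothetical and not from the lecture.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    result = frozenset()
    for child in tree:
        result |= leaf_set(child)
    return result


def bipartitions(tree):
    """Return the non-trivial bipartitions induced by the edges of a tree."""
    taxa = leaf_set(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        for child in node:
            side = leaf_set(child)
            other = taxa - side
            # Skip trivial splits that separate a single leaf from the rest.
            if len(side) > 1 and len(other) > 1:
                splits.add(frozenset([side, other]))
            visit(child)

    visit(tree)
    return splits


def split_difference(tree_a, tree_b):
    """Count bipartitions that occur in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Two different rootings of the same unrooted tree give the same bipartitions.
same = bipartitions((("A", "B"), ("C", "D"))) == bipartitions(("A", ("B", ("C", "D"))))
```

Because only the set of splits is compared, the rooting used to write a tree down does not matter, which is why two trees can be compared by counting the bipartitions they do not share.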


Topological differences


Phylogeny Problem

  • Two main methodologies:

    • Alignment first and phylogeny second

      • Construct alignment using one of the MANY alignment programs in the literature

      • Do manual (eye) adjustments if necessary

      • Apply a phylogeny reconstruction method

      • Fast but biologically not realistic

      • Phylogeny is highly dependent on accuracy of alignment (but so is the alignment on the phylogeny!)

    • Simultaneous alignment and phylogeny reconstruction

      • Output both an alignment and phylogeny

      • Computationally much harder

      • Biologically more realistic as insertions, deletions, and mutations occur during the evolutionary process


First methodology

  • Compute alignment (for now we assume we are given an alignment)

  • Construct a phylogeny (two approaches)

  • Distance-based methods

    • Input: Distance matrix of pairwise evolutionary distances, estimated statistically from the aligned sequences

    • Output: Phylogenetic tree

    • Fast but less accurate

  • Character-based methods

    • Input: Sequence alignment

    • Output: Phylogenetic tree

    • Accurate but computationally very hard


Distance-based methods


Evolution on a single edge

  • Poisson process

    • Number of changes in a fixed time interval t is independent of changes in any other non-overlapping time interval u

    • Number of changes in time interval t is proportional to the length of the interval

    • No changes in time interval of length 0

  • Let X be the number of nucleotide changes on a single edge. We assume X is generated by a Poisson process

  • Probability theory then dictates that X follows a Poisson distribution


Evolution on a single edge

  • We want to compute p(e), the probability of observing a nucleotide change on edge e

  • The probability of observing a change is just the sum of the probabilities of k changes having occurred, over all odd values of k (even values are excluded because an even number of changes cannot be seen)


Evolution on a single edge

  • The expected number of nucleotide changes on a given edge can be computed from p(e), the probability of change on that edge

  • Key: the expected number of changes is additive
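One way to write out the quantities described on these slides, as a sketch rather than the lecture's own notation: assume a two-state model in which an even number of changes at a site is invisible, and write $\lambda(e)$ (an assumed symbol) for the expected number of changes on edge $e$.

```latex
\begin{align*}
P(X = k) &= \frac{e^{-\lambda(e)}\,\lambda(e)^{k}}{k!} \\
p(e) &= \sum_{k\ \mathrm{odd}} \frac{e^{-\lambda(e)}\,\lambda(e)^{k}}{k!}
      = e^{-\lambda(e)} \sinh \lambda(e)
      = \tfrac{1}{2}\left(1 - e^{-2\lambda(e)}\right) \\
\lambda(e) &= -\tfrac{1}{2} \ln\bigl(1 - 2\,p(e)\bigr)
\end{align*}
```

The last line inverts the second, so an estimate of the probability of change on an edge can be converted into an estimate of the expected number of changes.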


Additivity

  • Assume we have a path of k edges and that p1, p2, …, pk are the probabilities of change on each edge of the path

  • Using induction we can derive the probability of change along the whole path

  • That expression contains a multiplicative term that is hard to deal with and does not easily decompose into a product or sum of pi’s


Additivity

  • But the expected number of nucleotide changes on the path is elegant: it is the sum of the expected numbers of changes on the edges of the path
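The additivity claim can be checked numerically. The sketch below assumes the two-state model, in which an edge with an expected number of changes lam has probability of change p = (1 - exp(-2 lam)) / 2; the function names and the example probabilities are illustrative and not from the lecture.

```python
import math


def change_probability(lam):
    """Probability of an observed change, given the expected number of changes lam."""
    return 0.5 * (1.0 - math.exp(-2.0 * lam))


def expected_changes(p):
    """Expected number of changes, given the probability p of an observed change."""
    return -0.5 * math.log(1.0 - 2.0 * p)


def path_change_probability(ps):
    """Probability of an observed change along a path with per-edge probabilities ps.

    This is the multiplicative form: 1 - 2p(path) is the product of (1 - 2p_i).
    """
    product = 1.0
    for p in ps:
        product *= 1.0 - 2.0 * p
    return 0.5 * (1.0 - product)


# Hypothetical per-edge probabilities of change on a three-edge path.
edge_probabilities = [0.1, 0.2, 0.05]
per_edge_sum = sum(expected_changes(p) for p in edge_probabilities)
whole_path = expected_changes(path_change_probability(edge_probabilities))
```

Under these assumptions `per_edge_sum` and `whole_path` agree up to floating-point error, while the path probability itself is not the sum of the per-edge probabilities.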


Evolutionary models

  • Simple 0,1 alphabet evolutionary model

    • i.i.d. model

    • uniformly random root sequence

  • Jukes-Cantor:

    • Uniformly random root sequence

    • i.i.d. model


Evolutionary models

  • General Markov Model

    • Uniformly random root sequence

    • i.i.d. model

    • For time reversible models


Variation across sites

  • Standard assumption of how sites can vary is that each site has a multiplicative scaling factor

  • Typically these scaling factors are drawn from a Gamma distribution (or Gamma plus invariant)
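A minimal sketch of drawing per-site scaling factors, assuming a Gamma distribution with shape alpha and mean 1 and an optional proportion of invariant sites. The function name, parameter names, and default values are hypothetical.

```python
import random


def sample_site_rates(num_sites, alpha, proportion_invariant=0.0, seed=0):
    """Draw one multiplicative rate per site.

    With probability proportion_invariant a site gets rate 0 (it never changes);
    otherwise its rate is Gamma distributed with shape alpha and scale 1/alpha,
    so that the mean rate of the variable sites is 1.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(num_sites):
        if rng.random() < proportion_invariant:
            rates.append(0.0)
        else:
            rates.append(rng.gammavariate(alpha, 1.0 / alpha))
    return rates


rates = sample_site_rates(num_sites=20000, alpha=0.5)
mean_rate = sum(rates) / len(rates)
```

A small alpha concentrates most sites near rate 0 with a few fast sites, while a large alpha makes all sites evolve at nearly the same rate.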


Special issues

  • Molecular clock: the expected number of changes for a site is proportional to time

  • No-common-mechanism model: there is a random variable for every combination of edge and site


Evolutionary distance estimation


Estimating evolutionary distances

  • For sequences A and B what is the evolutionary distance under the Jukes-Cantor model?

    • A: ACCTGTGGGTAACCACCC

    • B: ACCTGAGGGATAGGTCCG

  • But we don’t know p, the probability of a nucleotide change between A and B


Estimating evolutionary distances

  • Assume nucleotide changes are Bernoulli trials (i.i.d. trials of success or failure)

  • p is the probability of heads (a change at a site) in each of n Bernoulli trials (n is the sequence length)

  • Compute a maximum likelihood estimate for p

  • ACCTGTGGGTAACCACCC

  • ACCTGAGGGATAGGTCCG

  • 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 1


Estimating evolutionary distance

  • We want to find the value of p that maximizes the probability of observing k changes in n sites: $P = \binom{n}{k} p^{k} (1-p)^{n-k}$

  • Set dP/dp to 0 and solve for p to get $\hat{p} = k/n$


Estimating evolutionary distances

  • p = 7/18 (7 of the 18 sites differ)

  • Continuing in this manner we estimate p for all pairs of sequences in the alignment

  • We now have a distance matrix under a biologically sound evolutionary model

  • ACCTGTGGGTAACCACCC

  • ACCTGAGGGATAGGTCCG

  • 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 1
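The estimate can be reproduced with a short script. This is a sketch: the function names are hypothetical, and the conversion from the estimated probability of change to a distance uses the standard Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3), which is an assumption here because the formula is not part of the slide text.

```python
import math


def mismatch_indicator(a, b):
    """Return a 0/1 list marking the aligned sites where a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return [0 if x == y else 1 for x, y in zip(a, b)]


def estimate_p(a, b):
    """Maximum likelihood estimate of the probability of change: k / n."""
    indicator = mismatch_indicator(a, b)
    return sum(indicator) / len(indicator)


def jukes_cantor_distance(p):
    """Expected substitutions per site under Jukes-Cantor for an observed fraction p."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


seq_a = "ACCTGTGGGTAACCACCC"
seq_b = "ACCTGAGGGATAGGTCCG"
p_hat = estimate_p(seq_a, seq_b)
distance = jukes_cantor_distance(p_hat)
```

Applying `estimate_p` and `jukes_cantor_distance` to every pair of sequences in an alignment fills in the distance matrix used by the distance-based methods.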


Distance methods


Distance methods

  • UPGMA: similar to hierarchical clustering but not additive

  • Neighbor-joining: more sophisticated and additive

  • What is additivity?


Additivity


UPGMA

UPGMA is not additive but works for ultrametric trees. Takes O(n^2) time

Distance matrix:

      B    C    D
A     6   26   26
B          26   26
C                6

[Figure: ultrametric tree on leaves A, B, C, D with edge lengths 3, 3, 3, 3, 10, 10]

UPGMA

  • Initialize n clusters where each cluster i contains the sequence i

  • Find the closest pair of clusters i, j, using distances in matrix D

  • Make them neighbors in the tree by adding new node (ij), and set distance from (ij) to i and j as Dij/2

  • Update distance matrix D: for all clusters k compute $D_{(ij),k} = \frac{n_i D_{ik} + n_j D_{jk}}{n_i + n_j}$ (ni and nj are the sizes of clusters i and j respectively)

  • Delete columns and rows for i and j in D and add new ones corresponding to cluster (ij) with distances as computed above

  • Go to step 2 (finding the closest pair) until only one cluster is left
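The steps above can be sketched as follows. This is an illustrative implementation, not the lecture's code: it assumes the size-weighted average as the cluster distance update, scans all pairs on every round (simple, but slower than an optimized implementation), and reports each merge with the height Dij/2 of the new node. The function name and the input format are hypothetical.

```python
from itertools import combinations


def upgma(distances):
    """Run UPGMA on {(a, b): distance} and return the merges as (i, j, height).

    Leaves are strings; a merged cluster is represented by the tuple (i, j).
    """
    d = {frozenset(pair): float(value) for pair, value in distances.items()}
    sizes = {name: 1 for pair in distances for name in pair}
    merges = []
    while len(sizes) > 1:
        # Step 2: find the closest pair of clusters.
        candidates = combinations(sorted(sizes, key=str), 2)
        i, j = min(candidates, key=lambda pair: d[frozenset(pair)])
        dij = d[frozenset((i, j))]
        # Step 3: join them under a new node placed at height Dij / 2.
        new = (i, j)
        merges.append((i, j, dij / 2.0))
        ni, nj = sizes.pop(i), sizes.pop(j)
        # Steps 4 and 5: size-weighted average distance to every other cluster.
        for k in sizes:
            d[frozenset((new, k))] = (
                ni * d[frozenset((i, k))] + nj * d[frozenset((j, k))]
            ) / (ni + nj)
        sizes[new] = ni + nj
    return merges


# The two example matrices from the UPGMA slides.
ultrametric = {("A", "B"): 6, ("A", "C"): 26, ("A", "D"): 26,
               ("B", "C"): 26, ("B", "D"): 26, ("C", "D"): 6}
non_ultrametric = {("A", "B"): 13, ("A", "C"): 16, ("A", "D"): 26,
                   ("B", "C"): 12, ("B", "D"): 19, ("C", "D"): 13}
ultrametric_merges = upgma(ultrametric)
non_ultrametric_merges = upgma(non_ultrametric)
```

On the second matrix the first merge joins B and C, because 12 is the smallest entry, even though they are not neighbors in the true tree.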


UPGMA

[Figure: the same distance matrix (A-B 6, A-C 26, A-D 26, B-C 26, B-D 26, C-D 6) and the tree on leaves A, B, C, D with edge labels 3, 3, 3, 3 and 13, 13]


UPGMA

Doesn’t work (in general) for non-ultrametric trees

Distance matrix:

      B    C    D
A    13   16   26
B          12   19
C               13

[Figure: true tree on leaves B, C, D, A with edge lengths 3, 3, 3, 3, 10, 10]


UPGMA

UPGMA constructs an incorrect tree here

[Figure: the same distance matrix (A-B 13, A-C 16, A-D 26, B-C 12, B-D 19, C-D 13) and the tree built by UPGMA on leaves B, A, D, C, with edge labels 7.25 and 6]


UPGMA

Bipartition (BC,AD) is not in the true tree

[Figure: the true tree (leaves B, C, D, A; edge lengths 3, 3, 3, 3, 10, 10) next to the UPGMA tree (leaves B, A, D, C; edge labels 7.25 and 6)]


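Neighbor joining

  • Additive and O(n^3) time

  • Initialization: same as UPGMA

  • For each species i compute $u_i = \sum_{k \neq i} \frac{D_{ik}}{n-2}$

  • Select i and j for which $D_{ij} - u_i - u_j$ is minimum

  • Make them neighbors in the tree by adding new node (ij), and set distance from (ij) to i and j as $d_{i,(ij)} = \frac{1}{2} D_{ij} + \frac{1}{2}(u_i - u_j)$ and $d_{j,(ij)} = \frac{1}{2} D_{ij} + \frac{1}{2}(u_j - u_i)$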


Neighbor joining

  • Update distance matrix D: for all clusters k compute $D_{(ij),k} = \frac{1}{2}(D_{ik} + D_{jk} - D_{ij})$

  • Delete columns and rows for i and j in D and add new ones corresponding to cluster (ij) with distances as computed above

  • Go to step 3 (the per-species computation) until two nodes/clusters are left
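A compact sketch of neighbor joining, assuming the textbook formulation of the algorithm: a per-node average u_i = sum of D_ik over k, divided by n - 2, the selection criterion D_ij - u_i - u_j, and the update D_(ij),k = (D_ik + D_jk - D_ij) / 2. The function name, input format, and output format are hypothetical, and the code favors readability over speed.

```python
from itertools import combinations


def neighbor_joining(distances):
    """Run neighbor joining on {(a, b): distance}; return edges (node, other, length).

    Leaves are strings; an internal node created by joining i and j is (i, j).
    """
    d = {frozenset(pair): float(value) for pair, value in distances.items()}
    nodes = sorted({name for pair in distances for name in pair})
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # Average distance from each node to the others, scaled by n - 2.
        u = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
             for i in nodes}
        # Pick the pair minimizing the corrected distance.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]])
        dij = d[frozenset((i, j))]
        new = (i, j)
        edges.append((i, new, 0.5 * dij + 0.5 * (u[i] - u[j])))
        edges.append((j, new, 0.5 * dij + 0.5 * (u[j] - u[i])))
        # Replace i and j by the new node and compute its distances.
        nodes = [k for k in nodes if k not in (i, j)]
        for k in nodes:
            d[frozenset((new, k))] = 0.5 * (
                d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
        nodes.append(new)
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# The matrix on which UPGMA joined B and C first.
slide_matrix = {("A", "B"): 13, ("A", "C"): 16, ("A", "D"): 26,
                ("B", "C"): 12, ("B", "D"): 19, ("C", "D"): 13}
slide_edges = neighbor_joining(slide_matrix)
```

On this matrix the first pair joined is A and B rather than B and C, because the correction terms penalize pairs that are close only because both are close to everything else.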


NJ

NJ constructs the correct tree for additive matrices

Distance matrix:

      B    C    D
A    13   16   26
B          12   19
C               13

[Figure: tree on leaves B, C, D, A with edge lengths 3, 3, 3, 3, 10, 10]


Simulation studies


Simulation studies

  • The true evolutionary tree is never known in practice. Simulation allows us to study the accuracy of methods under biologically realistic scenarios

  • The mathematics behind phylogenetics is often complex and challenging. Simulation allows us to study algorithms when theoretical analysis is not possible, and to examine algorithm performance under various conditions such as different evolutionary rates, sequence lengths, or numbers of taxa


Statistical consistency

  • As sequence lengths tend to infinity the distance estimation improves and eventually leads to the true additive matrix

  • If a method like NJ is then applied we get the true tree.

  • In practice, however, we have limited sequence length. Therefore we want to know how much sequence length a method requires to achieve low error
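The effect of sequence length on distance estimation can be illustrated with a small simulation. The sketch below assumes the Jukes-Cantor model for a single pair of sequences, so each site differs independently with probability p = (3/4)(1 - exp(-4d/3)) for a true distance d. The function names, the true distance 0.3, and the chosen lengths are all hypothetical.

```python
import math
import random


def jc_change_probability(d):
    """Probability that a site differs between two sequences at JC distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc_distance(p):
    """Jukes-Cantor distance for an observed fraction p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def estimated_distance(true_d, length, rng):
    """Simulate `length` independent sites and estimate the distance from them."""
    p = jc_change_probability(true_d)
    differing = sum(1 for _ in range(length) if rng.random() < p)
    p_hat = differing / length
    if p_hat >= 0.75:
        return math.inf  # saturated sample: the correction is undefined
    return jc_distance(p_hat)


rng = random.Random(1)
true_d = 0.3
# Estimation error for increasing sequence lengths.
errors = {length: abs(estimated_distance(true_d, length, rng) - true_d)
          for length in (100, 10000, 100000)}
```

The error typically shrinks as the length grows, which is the single-pair version of the consistency argument; a full study would repeat this over whole trees and many replicates.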


Convergence rates

Can be studied experimentally or theoretically

Theoretical results offer loose bounds

Experiments (under simulation) provide more realistic bounds on sequence lengths


Sequence length requirements


Typical performance study


Sequence lengths for NJ

Sequence lengths required to obtain 90% accuracy


Error rate of NJ


Improving sequence length requirements

  • Later we will look at Disk-Covering Methods and study sequence length requirements of other methods (in addition to NJ)


Maximum Parsimony

  • Character-based method

  • NP-hard (reduction to the Steiner tree problem)

  • Widely used in phylogenetics

  • Slower than NJ but more accurate

  • Faster than ML

  • Assumes i.i.d.


Maximum Parsimony

  • Input: Set S of n aligned sequences of length k

  • Output: A phylogenetic tree T

    • leaf-labeled by sequences in S

    • additional sequences of length k labeling the internal nodes of T

      such that $\sum_{(u,v) \in E(T)} H(u,v)$ is minimized, where H(u,v) is the Hamming distance between the sequences labeling u and v.


Maximum parsimony (example)

  • Input: Four sequences

    • ACT

    • ACA

    • GTT

    • GTA

  • Question: which of the three trees has the best MP score?


Maximum Parsimony

[Figure: the three possible unrooted tree topologies on the leaves ACT, ACA, GTT, GTA]


Maximum Parsimony

[Figure: the three trees with labeled internal nodes and per-edge costs; their MP scores are 7, 5, and 4, and the tree with MP score = 4 is the optimal MP tree]


Maximum Parsimony: computational complexity

Optimal labeling can be computed in linear time O(nk)

Finding the optimal MP tree is NP-hard

[Figure: tree with leaves ACA, ACT, GTA, GTT, internal nodes labeled ACA and GTA, edge costs 2, 1, 1, and MP score = 4]
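One standard way to score a fixed tree in O(nk) time is Fitch's algorithm, sketched below. The slides do not name the method, so treating it as the intended one is an assumption; the tree encoding (nested pairs with sequences at the leaves, rooted on an arbitrary edge) and the function names are hypothetical.

```python
def fitch_site(node, site):
    """Return (possible states, number of changes) for one site below `node`."""
    if isinstance(node, str):
        return {node[site]}, 0
    left, right = node
    left_states, left_changes = fitch_site(left, site)
    right_states, right_changes = fitch_site(right, site)
    common = left_states & right_states
    if common:
        # The children can agree on a state, so no extra change is needed.
        return common, left_changes + right_changes
    # Otherwise one change is charged and either child's state may be kept.
    return left_states | right_states, left_changes + right_changes + 1


def first_leaf(node):
    """Return any leaf sequence, used only to read off the sequence length."""
    while not isinstance(node, str):
        node = node[0]
    return node


def parsimony_score(tree):
    """Minimum number of changes over all internal labelings of a binary tree."""
    k = len(first_leaf(tree))
    return sum(fitch_site(tree, site)[1] for site in range(k))


# The three topologies on the four example sequences.
topologies = [
    (("ACT", "ACA"), ("GTT", "GTA")),
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACT", "GTA"), ("ACA", "GTT")),
]
scores = [parsimony_score(tree) for tree in topologies]
best = topologies[scores.index(min(scores))]
```

The score returned is the minimum over all internal labelings of a topology, so it can be lower than the length of one particular labeling drawn on a figure.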


Local search strategies

[Figure: cost plotted over the space of phylogenetic trees, with a local optimum and the global optimum marked]


Local search for MP

  • Determine a candidate solution s

  • While s is not a local minimum

    • Find a neighbor s’ of s such that MP(s’)<MP(s)

    • If found set s=s’

    • Else return s and exit

  • Time complexity: unknown. The search could take very long or end quickly depending on the starting tree and the local move

  • Need to specify how to construct starting tree and local move
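The loop above can be sketched on the smallest possible case: four taxa, where a tree is a pair of pairs and the local move swaps one leaf across the internal edge. The scoring function is a Fitch-style parsimony count, and all names here are hypothetical; a real implementation would work on trees with many taxa.

```python
def site_changes(node, site):
    """Fitch pass for one site: return (possible states, number of changes)."""
    if isinstance(node, str):
        return {node[site]}, 0
    (ls, lc), (rs, rc) = site_changes(node[0], site), site_changes(node[1], site)
    if ls & rs:
        return ls & rs, lc + rc
    return ls | rs, lc + rc + 1


def mp_score(tree, k=3):
    """Parsimony score of a tree whose leaves are sequences of length k."""
    return sum(site_changes(tree, site)[1] for site in range(k))


def quartet_neighbors(tree):
    """The two other topologies reachable by swapping a leaf across the edge."""
    (a, b), (c, d) = tree
    return [((a, c), (b, d)), ((a, d), (b, c))]


def local_search(start, neighbors, cost):
    """Move to the first strictly better neighbor until none exists."""
    s = start
    while True:
        better = next((t for t in neighbors(s) if cost(t) < cost(s)), None)
        if better is None:
            return s  # s is a local minimum
        s = better


start = (("ACT", "GTA"), ("ACA", "GTT"))
result = local_search(start, quartet_neighbors, mp_score)
```

With only three topologies the local minimum found is also the global one; on larger trees the same loop can stop at a tree that is only locally optimal.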


Starting tree for MP

  • Random phylogeny: O(n) time

  • Greedy-MP


Greedy-MP

Greedy-MP takes O(n^2k^2) time


Local moves for MP: NNI

For each internal edge we get two different topologies

Neighborhood size is 2n-6


Local moves for MP: SPR

Neighborhood size is quadratic in number of taxa

Computing the minimum number of SPR moves between two rooted phylogenies is NP-hard


Local moves for MP: TBR

  • Neighborhood size is cubic in number of taxa

  • Computing the minimum number of TBR moves between two unrooted phylogenies is NP-hard


Local optima are a problem


Iterated local search: escape local optima by perturbation

[Figure sequence: local search reaches a local optimum; a perturbation moves the solution away from it; local search then restarts from the output of the perturbation]


ILS for MP

  • Ratchet

  • Iterative-DCM3

  • TNT
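The general pattern behind iterated local search can be sketched on a toy problem. This is not an implementation of Ratchet, Iterative-DCM3, or TNT: the one-dimensional cost landscape, the random jump used as the perturbation, and all names are hypothetical stand-ins for tree space and tree moves.

```python
import random

# Hypothetical costs: index 2 is a local optimum, index 8 the global optimum.
LANDSCAPE = [5, 4, 3, 4, 5, 6, 2, 1, 0, 1, 2]


def cost(i):
    return LANDSCAPE[i]


def neighbors(i):
    """Positions one step to the left or right, staying inside the landscape."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


def random_jump(i, rng):
    """Perturbation: jump up to four positions away, clipped to the landscape."""
    return max(0, min(len(LANDSCAPE) - 1, i + rng.randint(-4, 4)))


def local_search(start, neighbors, cost):
    """Move to the first strictly better neighbor until none exists."""
    s = start
    while True:
        better = next((t for t in neighbors(s) if cost(t) < cost(s)), None)
        if better is None:
            return s
        s = better


def iterated_local_search(start, neighbors, cost, perturb, rounds, rng):
    """Alternate perturbation and local search, keeping the best solution seen."""
    best = local_search(start, neighbors, cost)
    for _ in range(rounds):
        candidate = local_search(perturb(best, rng), neighbors, cost)
        if cost(candidate) < cost(best):
            best = candidate
    return best


plain = local_search(0, neighbors, cost)
improved = iterated_local_search(0, neighbors, cost, random_jump, 50,
                                 random.Random(0))
```

Plain local search from position 0 stops at the local optimum, while the iterated version can only match or improve on it, since it keeps the best solution found across rounds.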


Next time

  • Performance studies on local search for MP

  • Maximum likelihood

  • Alignment

