Phylogenetic Estimation of Context-Dependent Substitution Rates by Maximum Likelihood

Authors: Adam Siepel, David Haussler
Presented by: Keerthika Venugopal

OUTLINE
• Definitions
• Problem statement
• Phylogenetic model
• Likelihood
• Maximum likelihood estimation
• Markov dependence
• Test data and results

BASIC DEFINITIONS
• Phylogenetics: the study of evolutionary relationships among a group of organisms.
• Phylogenetic tree (evolutionary tree): a branching diagram showing the inferred evolutionary relationships among biological species or other groups, called taxa (singular: taxon).
• A phylogenetic tree can be rooted or unrooted.

DNA
• Coding region: composed of exons, which code for protein.
• Noncoding region: does not encode protein sequences and has often been regarded as "junk" DNA.
• Nonfunctional DNA is a major line of evidence for common descent.
• E.g., the genomes of human and mouse: the two species have a common ancestor, and about 20% of the conserved regions lie in exons, while the rest lie in introns (noncoding regions).

DNA is built from four nucleotides: A (adenine), T (thymine), C (cytosine), and G (guanine).
• Adenine (A) and guanine (G) are purines (R).
• Thymine (T) and cytosine (C) are pyrimidines (Y).
• Transition: a substitution of the form R->R or Y->Y.
• Transversion: a substitution of the form R->Y or Y->R.
• CpG island:
  - CpG is a YpR dinucleotide, i.e., a pyrimidine followed by a purine. The YpR dinucleotides are CpG, TpA, TpG, and CpA.
  - A CpG island is a GC-rich region (GC content > 50%) at least 200 bp long, with an observed-to-expected CpG ratio greater than 60%.
  - The observed-to-expected CpG ratio is calculated as (number of CpG / (number of C × number of G)) × total number of nucleotides in the sequence.
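The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names (`substitution_type`, `cpg_stats`, `looks_like_cpg_island`) are hypothetical, and the island test simply applies the three thresholds listed on the slide to a whole sequence rather than scanning with a sliding window.

```python
from typing import Tuple

PURINES = {"A", "G"}      # R
PYRIMIDINES = {"C", "T"}  # Y


def substitution_type(a: str, b: str) -> str:
    """Classify a single-base change a -> b as a transition or a transversion."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "none"
    # R->R or Y->Y is a transition; R->Y or Y->R is a transversion.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def cpg_stats(seq: str) -> Tuple[float, float]:
    """Return (GC fraction, observed-to-expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # (Num of CpG / (Num of C * Num of G)) * total number of nucleotides
    obs_exp = (cpg / (c * g)) * n if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq: str) -> bool:
    """Apply the slide's thresholds: >= 200 bp, GC > 50%, obs/exp CpG > 0.6."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= 200 and gc_fraction > 0.5 and obs_exp > 0.6
```

For example, `substitution_type("C", "T")` is a transition, which is the kind of change that occurs at an elevated rate at CpG sites.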

PROBLEM STATEMENT
• Context-dependent substitution has been modeled in the case of two sequences and an unrooted phylogenetic tree, but it has only been accommodated in limited ways with more general phylogenies.
• Previous estimates of context-dependent substitution rates in noncoding DNA have used simple counting methods and pairwise alignments of highly similar sequences. This does not allow for multiple substitutions per site.
• The authors provide extensions to standard phylogenetic models that allow for better handling of context-dependent substitution at reasonable computational cost.

PHYLOGENETIC MODEL
• A phylogenetic model is a four-tuple ψ = (Q, π, τ, β):
  - Q: a substitution rate matrix
  - π: a vector of equilibrium frequencies
  - τ = (V, E): a (rooted) binary tree with nodes V and branches E
  - β: a set of branch lengths
• The model is defined with respect to an alphabet Σ of size d, e.g., Σ = {A, C, G, T}.

SUBSTITUTION MATRIX
• Describes the rate at which one character in a sequence changes to other character states over time.
• An N × N matrix whose rows and columns are indexed by the characters (e.g., A, C, G, T); entry (i, j) describes the change of the ith character into the jth character.
• Commonly used substitution matrices:
  - Identity matrix
  - Log-odds matrix

SUBSTITUTION MODELS
HKY (Hasegawa-Kishino-Yano, 1985) model:
• Each substitution rate is defined as the product of
  - the mean substitution rate
  - the transition (α) to transversion (β) rate ratio, κ = α/β
  - the equilibrium frequencies [πA, πC, πG, πT]
REV (Yang 1994) model:
• General reversible model of nucleotide substitution.
• Five substitution parameters: [α₁, α₂, α₃, α₄, α₅]
• Frequencies: [πA, πC, πG, πT]
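As a rough illustration of how an HKY rate matrix is assembled from κ and the equilibrium frequencies, the sketch below builds Q and derives the transition probabilities P(t) = exp(Qt) for a branch of length t. It assumes the common convention that Q is scaled to one expected substitution per unit time. The function names are hypothetical, and the matrix exponential is a simple scaling-and-squaring Taylor series written out to keep the example dependency-light; it is not taken from the paper.

```python
import numpy as np

BASES = "ACGT"
PURINES = {"A", "G"}


def hky_rate_matrix(pi, kappa):
    """Build an HKY rate matrix Q from frequencies pi (order A, C, G, T) and kappa."""
    pi = np.asarray(pi, dtype=float)
    Q = np.zeros((4, 4))
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i == j:
                continue
            # Transition: both purines or both pyrimidines.
            is_transition = (a in PURINES) == (b in PURINES)
            Q[i, j] = (kappa if is_transition else 1.0) * pi[j]
    # Diagonal entries make every row sum to zero.
    Q[np.diag_indices(4)] = -Q.sum(axis=1)
    # Assumed convention: scale so the expected rate at equilibrium is 1.
    Q /= -np.dot(pi, np.diag(Q))
    return Q


def transition_probabilities(Q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by a Taylor series with scaling and squaring."""
    A = Q * t / (2 ** squarings)
    P = np.eye(Q.shape[0])
    term = np.eye(Q.shape[0])
    for k in range(1, terms + 1):
        term = term @ A / k
        P = P + term
    for _ in range(squarings):
        P = P @ P
    return P
```

With `pi = [0.3, 0.2, 0.2, 0.3]` and `kappa = 2.0`, each row of `transition_probabilities(hky_rate_matrix(pi, kappa), 0.5)` sums to 1, and `pi` is left unchanged by the matrix, as expected for equilibrium frequencies.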

N-TH ORDER CONTEXT-DEPENDENT MODELS
UN model (Tavaré 1986):
• No restrictions are placed on the rate matrix.
• 12 free parameters.
• Allows a free parameter at every off-diagonal position.
RN model:
• Similar to the UN model, but under the constraint on rates wAvG = wGvA; wTvC = wCvT, i.e., A<->G and C<->T.
• The strand-symmetric models UNS and RNS are versions of UN and RN.
• U1 is equivalent to UNR.

LIKELIHOOD
• L = P(data | tree, model)
• The likelihood (L) is the probability of the data (the alignment), given a tree (with topology and branch lengths specified) and a probabilistic model of evolution.
• Assumptions:
  - The probability that a position has a certain state at time 1 depends only on its state at time 0; knowing which states it had before time 0 is irrelevant. This is called a Markov process.
  - The data (individual sites) are independent.

OBSERVED AND UNOBSERVED VARIABLES
• The parameters of a phylogenetic model are estimated from a multiple alignment of n sequences of length L.
• Let {Xu,i} be a set of random variables, such that Xu,i represents the ith character (1 ≤ i ≤ L) in the sequence corresponding to node u ∈ V.
• X• is the set of observed variables, X• = {Xu,i | u ∈ L}, where L here denotes the set of leaf nodes.
• X∘ is the set of latent (unobserved) variables, X∘ = {Xu,i | u ∈ I}, where I denotes the set of internal nodes.
• x• and x∘ are instances of X• and X∘.
• The likelihood of a phylogenetic model ψ with respect to an alignment x• is obtained by summing over all possible values of the latent variables: L(ψ | x•) = P(x• | ψ) = Σ over x∘ of P(x•, x∘ | ψ).

FELSENSTEIN'S PRUNING ALGORITHM
• Many calculations can be done just once and then reused many times.
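To make the reuse of calculations concrete, here is a minimal sketch of the pruning recursion for a single alignment column. It is an illustration under simplifying assumptions: it uses the Jukes-Cantor (JC69) model with uniform base frequencies instead of the richer models discussed in the talk, and the tree encoding (a leaf is a name, an internal node is a tuple `(left, left_length, right, right_length)`) and all function names are hypothetical.

```python
import numpy as np

BASES = "ACGT"


def jc69(t):
    """Jukes-Cantor transition probabilities for a branch of length t."""
    e = np.exp(-4.0 * t / 3.0)
    # Off-diagonal: 1/4 - e/4; diagonal: 1/4 + 3e/4.
    return np.full((4, 4), 0.25 - 0.25 * e) + np.eye(4) * e


def partial_likelihoods(node, column):
    """Vector v with v[a] = P(observed leaves below node | node has base a)."""
    if isinstance(node, str):  # leaf: the observed base has probability 1
        vec = np.zeros(4)
        vec[BASES.index(column[node])] = 1.0
        return vec
    left, t_left, right, t_right = node
    # Each subtree is summarized once and reused for all four parent states.
    from_left = jc69(t_left) @ partial_likelihoods(left, column)
    from_right = jc69(t_right) @ partial_likelihoods(right, column)
    return from_left * from_right


def column_likelihood(tree, column, pi=(0.25, 0.25, 0.25, 0.25)):
    """Likelihood of one column: sum over root states weighted by pi."""
    return float(np.dot(pi, partial_likelihoods(tree, column)))


def alignment_log_likelihood(tree, alignment):
    """Sum of per-column log-likelihoods, assuming independent sites."""
    names = list(alignment)
    length = len(alignment[names[0]])
    return sum(
        np.log(column_likelihood(tree, {n: alignment[n][i] for n in names}))
        for i in range(length)
    )
```

A small usage example: with `tree = (("human", 0.1, "chimp", 0.1), 0.2, "mouse", 0.5)`, calling `alignment_log_likelihood(tree, {"human": "ACGT", "chimp": "ACGT", "mouse": "ACTT"})` sums the log-likelihoods of the four columns.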

MAXIMUM LIKELIHOOD
• The aim of maximum likelihood is to choose the parameter values that maximize the probability of the observed data.
• For a phylogenetic model, obtain the maximum likelihood estimates (m.l.e.) of Q and β, conditional on π and τ, using the EM (expectation-maximization) algorithm.
• Why not Newton's method?
  - Newton-Raphson and quasi-Newton methods are general numerical optimization methods.
  - Applied directly to the likelihood, they are not well suited to phylogenetic models with large rate matrices.
  - Instead, a quasi-Newton method is used to solve the optimization problem that arises on each iteration of EM.

EM ALGORITHM
• Iterative approach.
• Starting from some initial guess, each iteration consists of
  - an E step (expectation step)
  - an M step (maximization step)
• Alternate between the E and M steps until convergence, i.e., until the likelihood stops improving.

• E-step:
  - Compute the expected numbers of substitutions of b ∈ Σ^N for a ∈ Σ^N, E[S(b, a, u) | z•, Q̂t, π, τ, t].
• M-step:
  - Find the parameter values (Q̂, β) that maximize the function f(Q, β) along each edge of the tree, where Cb,a,u = E[S(b, a, u) | z•, Q̂t, π, τ, t] is treated as a constant.
  - Use a quasi-Newton algorithm to maximize f(Q, β).
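The alternation between the two steps can be illustrated with a deliberately simple toy problem. It is not the phylogenetic EM from the paper: instead of expected substitution counts on a tree, the latent variable is which of two biased coins produced each batch of tosses, and the M-step has a closed form rather than requiring a quasi-Newton search. All names and the data in the usage note are illustrative assumptions.

```python
import math


def em_two_coins(heads, n, theta=(0.6, 0.5), iterations=50):
    """Toy EM: each entry of `heads` is the number of heads in n tosses of one
    of two coins, chosen with probability 1/2; which coin was used is latent."""
    theta_a, theta_b = theta
    history = []  # log-likelihood (up to a constant) before each update
    for _ in range(iterations):
        # E step: expected heads and tosses attributed to each coin.
        exp_heads = [0.0, 0.0]
        exp_tosses = [0.0, 0.0]
        log_lik = 0.0
        for h in heads:
            like_a = theta_a ** h * (1 - theta_a) ** (n - h)
            like_b = theta_b ** h * (1 - theta_b) ** (n - h)
            total = like_a + like_b
            log_lik += math.log(0.5 * total)
            for k, like in enumerate((like_a, like_b)):
                weight = like / total  # posterior probability of coin k
                exp_heads[k] += weight * h
                exp_tosses[k] += weight * n
        history.append(log_lik)
        # M step: maximize the expected complete-data log-likelihood.
        theta_a = exp_heads[0] / exp_tosses[0]
        theta_b = exp_heads[1] / exp_tosses[1]
    return theta_a, theta_b, history
```

Calling `em_two_coins([5, 9, 8, 4, 7], n=10)` returns the two estimated head probabilities and a log-likelihood trace that never decreases from one iteration to the next, which is the property that makes EM a safe iterative scheme.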

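MARKOV DEPENDENCE
• With overlapping tuples of columns, Markov dependence between sites is combined with context-dependent substitution models; this also helps to distinguish between sites of different types.
• The likelihood is given by P(x• | ψ) = Π over i of P(x•,i | x•,i−N+1, … , x•,i−1, ψ).
• Each probability P(x•,i | x•,i−N+1, … , x•,i−1, ψ) can be computed as the ratio P(x•,i−N+1, … , x•,i | ψ) / P(x•,i−N+1, … , x•,i−1 | ψ), where the denominator is obtained by summing the numerator over all possible values of column i.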
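For N = 2, the product of conditional probabilities can be sketched as follows. This is a minimal illustration under stated assumptions: `pair_prob(a, b)` is a hypothetical stand-in for the joint probability of two adjacent columns (which in the paper's setting would come from the pruning algorithm under a dinucleotide model), `alphabet` is the set of possible column values, and the first column is scored by its marginal probability.

```python
import math


def markov_log_likelihood(columns, pair_prob, alphabet):
    """Log-likelihood of a column sequence under first-order Markov dependence:
    log P(x_1) + sum over i >= 2 of log [ P(x_{i-1}, x_i) / P(x_{i-1}) ]."""

    def marginal(a):
        # P(a), obtained by summing the pair probability over the second column.
        return sum(pair_prob(a, b) for b in alphabet)

    log_lik = math.log(marginal(columns[0]))
    for prev, cur in zip(columns, columns[1:]):
        # Conditional probability as a ratio of joint to marginal.
        log_lik += math.log(pair_prob(prev, cur) / marginal(prev))
    return log_lik
```

When `pair_prob` factorizes as `p[a] * p[b]`, the conditionals reduce to `p[b]` and the result equals the ordinary independent-sites log-likelihood, so any gain over that baseline reflects dependence between neighboring columns.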

TEST DATA
• Noncoding data set: a 1.8-megabase region of human chromosome 7 containing the CFTR gene, and homologous sequences from eight other eutherian mammals.
• Coding data set: mRNA sequences from various mammalian species.
• The sequences were aligned with a new program called TBA ("Threaded Blockset Aligner"), which was designed specifically for alignment of megabase-sized regions of multiple mammalian genomes.

RESULT: LIKELIHOOD OF NONCODING DATA
• Improvements in likelihood are obtained by allowing rate variation and overlapping column tuples.
• Allowing rate variation with this data set makes only a small difference in likelihood.
• When Markov dependence between columns is introduced, the improvement over the single-nucleotide models is nearly doubled for dinucleotide models and increases by about half for nucleotide-triplet models, despite no change in the number of parameters of the model.

RESULT: CpG EFFECTS
• The estimated rates in the substitution matrix Q span a wide range of values and tend to cluster into three main groups, corresponding to transversions, transitions, and CpG transitions.

RESULT: LIKELIHOOD OF CODING DATA
• When Markov dependence between sites was introduced with the U3 and R3 models, another large improvement in likelihood occurred, indicating that context effects that cross codon boundaries are important.
• It remains to be determined to what extent the CpG effect is responsible for the observed improvements in likelihood.

SUMMARY
• The observed improvements derive from:
  - the ability to capture context effects within independent N-tuples of sites
  - the ability to allow the N-tuples to overlap
  - a rich parameterization of the substitution process
• In noncoding regions, the CpG effect is the single strongest clearly identifiable example of context dependence, but it does not appear to be sufficient to explain the advantage of context-dependent models.
• In coding regions, a richly parameterized substitution model allows for significantly higher likelihoods than those of Goldman and Yang's codon model.
• These models rely on a large number of parameters, larger than in any phylogenetic model currently in wide use, which increases the computational burden of parameter estimation.

SUMMARY
• Context-dependent models might therefore be helpful in certain cases:
  - cases with hard-to-resolve branchings near the root of the tree
  - discriminating between different types of sites, such as in gene finding
  - understanding the evolution of isochores, in which the CpG effect has been suggested to play an important role

THANK YOU
