Towards optimal distance functions for stochastic substitution models

This paper discusses the development and performance of distance-based algorithms for phylogenetic reconstruction under stochastic substitution models. The goal is to reconstruct the "true" tree as accurately as possible by minimizing the effect of the noise introduced by finite sampling. The paper covers the Kimura 2-Parameter (K2P) model, substitution rate functions, and methods for optimizing distances in the K2P model, and presents simulation results and performance evaluations.



Presentation Transcript


  1. Towards optimal distance functions for stochastic substitution models. Ilan Gronau, Shlomo Moran, Irad Yavneh. Technion, Israel

  2. Preview: The Phylogenetic Reconstruction Problem

  3. Evolution is modeled by a tree. (All our sequences are DNA sequences over {A,G,C,T}.) [Figure: a tree with a DNA sequence at every node. Internal-node sequences: ACGGTCA, AAAGTCA, ACGGATA, ACGGGTA, AAAGGCG, AAACACA, AAAGCTG, TCTGGTA, ACCCGTG. Leaf sequences: GGGGATT, GAACGTA, AATCCTG, AATGGGC, AAACCGA, TCTGGGA, ATAGCTG, ACCGTTG, TCCGGAA, AGCCGTG.]

  4. Phylogenetic Reconstruction. Only the leaf sequences are given: GGGGATT, GAACGTA, AATCCTG, AATGGGC, AAACCGA, TCTGGGA, ATAGCTG, ACCGTTG, TCCGGAA, AGCCGTG.

  5. Phylogenetic Reconstruction. Input: A: AATGGGC, B: AATCCTG, C: ATAGCTG, D: GAACGTA, E: AAACCGA, F: GGGGATT, G: TCTGGGA, H: TCCGGAA, I: AGCCGTG, J: ACCGTTG. Goal: reconstruct the ‘true’ tree as accurately as possible. [Figure: the sequences A to J are reconstructed into a rooted tree.]

  6. Road Map • Distance based reconstruction algorithms • The Kimura 2 Parameter (K2P) Model • Performance of distance methods in the K2P model • Substitution models and substitution rate functions • Properties of SR functions • Unified Substitutions Models • Optimizing Distances in the K2P model • Simulation results

  7. Distance Based Phylogenetic Reconstruction: Exact vs. Noisy distances. The edge-weighted ‘true’ tree defines exact (additive) distances between species. In practice only estimated distances are available, obtained by distance estimation using finite sampling, and the reconstructed tree is built from them. Challenge: minimize the effect of the noise introduced by the sampling. [Figure: an edge-weighted ‘true’ tree and a reconstructed tree on the species A to G.]

  8. Road Map • Distance based reconstruction algorithms • The Kimura 2 Parameter (K2P) Model • Performance of known distance methods in the K2P model • Substitution models and substitution rate functions • Properties of SR functions • Unified Substitutions Models • Optimizing Distances in the K2P model • Simulation results

  9. The Kimura 2 Parameter (K2P) model [Kimura80]: each edge (u, v) corresponds to a “Rate Matrix”. In the K2P generic rate matrix, each transition occurs at rate α and each transversion at rate β.

  10. K2P standard distance: Δtotal = total substitution rate. The total substitution rate of a K2P rate matrix R is $\Delta_{total}(R) = \alpha + 2\beta$. This is the expected number of mutations per site. It is an additive distance: if edge (u, v) has total rate α + 2β and edge (v, w) has total rate α’ + 2β’, then the path from u to w has total rate (α+α’) + 2(β+β’).

  11. Estimation of Δtotal(Ruv) = dK2P(u,v) from the sequences at u and v is a noisy stochastic process. [Diagram: the K2P total rate “distance correction” procedure.]
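
The “distance correction” step of slide 11 can be sketched in code. This is a minimal sketch, assuming the standard closed form of Kimura's two-parameter correction, d = −½ ln(1 − 2P − Q) − ¼ ln(1 − 2Q), where P and Q are the observed fractions of transition and transversion sites. The function name `k2p_distance` and the choice to return infinity on saturation are illustrative assumptions, not taken from the talk.

```python
import math

# A<->G and C<->T are transitions; every other change is a transversion.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def k2p_distance(seq_u, seq_v):
    """Estimate the K2P total substitution rate between two aligned sequences."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned and non-empty")
    n = len(seq_u)
    transitions = sum(1 for a, b in zip(seq_u, seq_v) if TRANSITION[a] == b)
    transversions = sum(
        1 for a, b in zip(seq_u, seq_v) if a != b and TRANSITION[a] != b
    )
    p = transitions / n      # observed fraction of transition sites
    q = transversions / n    # observed fraction of transversion sites
    mu = 1.0 - 2.0 * p - q
    lam = 1.0 - 2.0 * q
    if mu <= 0.0 or lam <= 0.0:
        return math.inf      # saturated: the correction is undefined
    return -0.5 * math.log(mu) - 0.25 * math.log(lam)
```

Because the input is a finite alignment, P and Q are random quantities, which is exactly where the noise in the estimated distance comes from.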

  12. Road Map • Distance based reconstruction algorithms • The Kimura 2 Parameter (K2P) Model • Performance of distance methods in the K2P model • Substitution models and substitution rate functions • Properties of SR functions • Unified Substitutions Models • Optimizing Distances in the K2P model • Simulation results

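  13. Check the performance of K2P “standard” distances in resolving quartet splits. There are 3 possible quartet topologies on {A, B, C, D}: AB|CD, AC|BD and AD|BC. [Figure: the three topologies; the internal edge has weight wsep.] • Distance methods reconstruct the true split by the 4-point condition: if AB|CD is the true split, then $d(A,B) + d(C,D) + 2w_{sep} = d(A,C) + d(B,D) = d(A,D) + d(B,C)$. The 4-point condition for noisy distances is: $\hat{d}(A,B) + \hat{d}(C,D) < \min\{\hat{d}(A,C) + \hat{d}(B,D),\ \hat{d}(A,D) + \hat{d}(B,C)\}$.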
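
Resolving a quartet from (possibly noisy) distances can be sketched as picking the pairing with the smallest sum of within-pair distances, which is the usual reading of the 4-point condition. The function name `resolve_quartet` and the dictionary layout of the distances are assumptions made for this illustration.

```python
def resolve_quartet(dist):
    """Return the split whose within-pair distance sum is smallest.

    `dist` maps frozenset({x, y}) to the (estimated) distance between
    taxa x and y, for the four taxa "A", "B", "C", "D".
    """
    def d(x, y):
        return dist[frozenset((x, y))]

    sums = {
        "AB|CD": d("A", "B") + d("C", "D"),
        "AC|BD": d("A", "C") + d("B", "D"),
        "AD|BC": d("A", "D") + d("B", "C"),
    }
    return min(sums, key=sums.get)


# Additive distances from a tree with split AB|CD:
# pendant edges A=1, B=2, C=3, D=4 and internal edge 0.5.
example = {
    frozenset(("A", "B")): 3.0,
    frozenset(("C", "D")): 7.0,
    frozenset(("A", "C")): 4.5,
    frozenset(("B", "D")): 6.5,
    frozenset(("A", "D")): 5.5,
    frozenset(("B", "C")): 5.5,
}
print(resolve_quartet(example))
```

With exact additive distances the two larger sums are equal and exceed the smallest by twice the internal edge weight; with estimated distances the same comparison can fail when the noise exceeds that margin.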

  14. We evaluate the accuracy of the K2P distance estimation by the Split Resolution Test. [Figure: a rooted model quartet with leaves A, B, C, D.] t is “evolutionary time”. The diameter of the quartet is 22t.

  15. Phase A: simulate evolution. [Figure: the model quartet with leaves A, B, C, D.]

  16. Phase B: reconstruct the split by the 4p condition. Estimate the distances between the sequences, then apply the 4p condition. Was the correct split found? Repeat this process 10,000 times and count the number of failures.

  17. The split resolution test was applied to the model quartet with various diameters. • For each diameter, mark the fraction (percentage) of the simulations in which the 4p condition failed (next slide)
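
The two phases of the split resolution test can be sketched as a small simulation. This is a sketch under stated assumptions, not the authors' setup: the quartet has true split AB|CD with hypothetical edge lengths (pendant edges 10t, internal edge 2t, chosen only so that the diameter is 22t), the transition/transversion ratio `ratio` = α/β is a made-up parameter, and the root is placed at an internal node, which does not change the leaf distribution for the K2P model with a uniform root distribution.

```python
import math
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def evolve(seq, length, ratio, rng):
    """Phase A: evolve seq along one edge with total rate alpha + 2*beta = length."""
    beta = length / (ratio + 2.0)
    alpha = ratio * beta
    lam = math.exp(-4.0 * beta)
    mu = math.exp(-2.0 * alpha - 2.0 * beta)
    p_ts = 0.25 + 0.25 * lam - 0.5 * mu  # probability of the transition
    p_tv = 0.5 - 0.5 * lam               # probability of either transversion
    out = []
    for base in seq:
        r = rng.random()
        if r < p_ts:
            out.append(TRANSITION[base])
        elif r < p_ts + p_tv:
            out.append(rng.choice(TRANSVERSIONS[base]))
        else:
            out.append(base)
    return out


def k2p(s, t):
    """K2P total-rate estimate; infinity when the correction is undefined."""
    n = len(s)
    ts = sum(1 for a, b in zip(s, t) if TRANSITION[a] == b)
    tv = sum(1 for a, b in zip(s, t) if a != b and TRANSITION[a] != b)
    mu = 1.0 - 2.0 * ts / n - tv / n
    lam = 1.0 - 2.0 * tv / n
    if mu <= 0.0 or lam <= 0.0:
        return math.inf
    return -0.5 * math.log(mu) - 0.25 * math.log(lam)


def split_failure_rate(t, n_sites=500, trials=200, ratio=2.0, seed=0):
    """Fraction of trials in which the 4p condition misses the split AB|CD."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        left = [rng.choice("ACGT") for _ in range(n_sites)]  # parent of A, B
        right = evolve(left, 2 * t, ratio, rng)              # parent of C, D
        leaf = {
            "A": evolve(left, 10 * t, ratio, rng),
            "B": evolve(left, 10 * t, ratio, rng),
            "C": evolve(right, 10 * t, ratio, rng),
            "D": evolve(right, 10 * t, ratio, rng),
        }

        def d(x, y):
            return k2p(leaf[x], leaf[y])

        # Phase B: the true split must have the strictly smallest sum.
        true_sum = d("A", "B") + d("C", "D")
        other = min(d("A", "C") + d("B", "D"), d("A", "D") + d("B", "C"))
        if not true_sum < other:
            failures += 1
    return failures / trials
```

Calling `split_failure_rate(t)` for a range of `t` values gives one failure fraction per diameter 22t; saturated estimates are counted as failures here, which is a design choice of this sketch.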

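  18. Performance of K2P distances in resolving quartets, small diameters: 0.01-0.2. [Plot: results of the split resolution test on the template quartet.]

  19. Performance for larger diameters: “site saturation”. [Plot: results of the split resolution test for larger diameters.]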

  20. When β < α, we can postpone the “site saturation” effect. For this, use another distance function for the same model, Δtv, which counts only transversions. This is actually the CFN model [Cavender78, Farris73, Neyman71]. [Figure: rate diagram with transitions at rate α and transversions at rate β.]

  21. Apply the same split resolution test to the transversions-only distance. [Diagram: the transversions-only distance correction procedure.]

  22. Transversions only performs better on large rates and worse on small rates. [Plot: transversions only vs. total K2P rate.]

  23. Conclusion: Distance based reconstruction methods should be adaptive: find a distance function d which is good for the input. We do a small step in this direction. Input: an alignment of the sequences at u, v. Output: a (near)-optimal distance function, which minimizes the expected noise in the estimation procedure.

  24. Example: An adaptive distance method (max-optimal) based on this talk:

  25. Road Map • Distance based reconstruction algorithms • The Kimura 2 Parameter (K2P) Model • Performance of distance methods in the K2P model • Substitution models and Substitution Rate functions • Properties of SR functions • Unified Substitutions Models • Optimizing Distances in the K2P model • Simulation results

  26. Steps in finding optimal distance functions: • Define substitution model. • Characterize the available distance functions. • Select a function which is optimal for the input sequences, i.e. least sensitive to stochastic noise.

  27. From Rate matrices to Substitution matrices. Rate matrices imply stochastic substitution matrices: the evolution of a finite sequence along an edge (u, v), with unknown model parameters α, β, is described by a stochastic substitution matrix Puv.

  28. A substitution model M: a set of stochastic substitution matrices, closed under matrix product: P,Q∈M ⇒ PQ∈M. Also required: P>0, 0<det(P)<1 for all P∈M. Motivation to the definition: [Figure: a path through the nodes u, v, w.]

  29. Model tree over M = <Tree Topology> + <DNA distribution at the root> + <M-substitution matrices at the edges>. [Figure: a tree with root r and a uniform distribution at the root; every edge carries a substitution matrix, e.g. Prv on the edge (r, v).]

  30. Distances for a given model are defined by Substitution Rate functions: • Δ: M → ℝ is an SR function for M iff for all P,Q in M: • Δ(PQ) = Δ(P) + Δ(Q) (additivity) • Δ(P) > 0 (positivity)
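
The two SR conditions can be checked numerically for one concrete candidate. The sketch below uses Δ(P) = −ln det(P) on two K2P substitution matrices; the base order A, G, C, T, the helper names and the sample probabilities are assumptions made for this illustration.

```python
import math


def k2p_matrix(p_ts, p_tv):
    """K2P substitution matrix for base order A, G, C, T (assumed ordering)."""
    s = 1.0 - p_ts - 2.0 * p_tv
    return [
        [s, p_ts, p_tv, p_tv],
        [p_ts, s, p_tv, p_tv],
        [p_tv, p_tv, s, p_ts],
        [p_tv, p_tv, p_ts, s],
    ]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
    return result


def logdet_distance(p):
    return -math.log(det(p))


P = k2p_matrix(0.10, 0.03)   # edge (u, v)
Q = k2p_matrix(0.05, 0.02)   # edge (v, w)
PQ = matmul(P, Q)            # path from u to w

# Additivity: the two printed values agree up to rounding.
print(logdet_distance(PQ), logdet_distance(P) + logdet_distance(Q))
```

Additivity follows from det(PQ) = det(P)·det(Q), and positivity from the requirement 0 < det(P) < 1.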

  31. Road Map • Distance based reconstruction algorithms • The Kimura 2 Parameter (K2P) Model • Performance of distance methods in the K2P model • Substitution models and substitution rate functions • Properties of SR functions • Unified Substitutions Models • Optimizing Distances in the K2P model • Simulation results

  32. 1st question: Given a model M, what are its SR functions? SR functions are additive functions which are strictly positive.

  33. Example 1: The logdet function [Lake94, Steel93] is an SR function for the most general model, Muniv: Muniv = {P: P is a stochastic 4×4 matrix, 0<det(P)<1}.

  34. Example 2: The log eigenvalue function

  35. Both “logdet” and the “log eigenvalue” functions are special cases of a general technique: Generalized logdet, which is given below:

  36. Linearity of additive functions: • If Δ1 and Δ2 are additive functions for M, so is c1Δ1 + c2Δ2. • The set of additive functions for M forms a vector space, denoted AD_M; dimension(AD_M) is the dimension of this vector space. • A large dimension implies more “independent” distance functions. • If dimension(AD_M) = 1, then M admits a single distance function (up to multiplication by a scalar), so selecting the best SR function in such a model is trivial. Thus, the adaptive approach is useful only when dimension(AD_M) > 1.

  37. Road Map • Distance based reconstruction algorithms • The Kimura 2 Parameter (K2P) Model • Performance of distance methods in the K2P model • Substitution models and substitution rate functions • Properties of SR functions • Unified Substitutions Models: models for which the adaptive approach is potentially useful. • Optimizing Distances in the K2P model • Simulation results

  38. Unified Substitution Models: Def: A model M is unified if there is a matrix U s.t. for each P∈M it holds that: U⁻¹PU = Using Lemma GLD, we have:

  39. Strongly Unified Substitution Models Def: A model M is strongly unified if there is a matrix U s.t. for each P∈M it holds that: U⁻¹PU =

  40. A simple strongly unified model: the Jukes Cantor model [1969]. MJC = {P: every off-diagonal entry of P equals p and every diagonal entry equals 1 − 3p}, 0 < p < 0.25. MJC is strongly unified by a matrix U (e.g. the 4×4 Hadamard matrix): for all P∈MJC, U⁻¹PU = diag(1, 1 − 4p, 1 − 4p, 1 − 4p). Claim: dimension(AD_MJC) = 1. Hence the adaptive approach is irrelevant to this model.

  41. Another model M for which dimension(AD_M) = 1. Recall: Muniv consists of all DNA transition matrices. Claim 2: dimension(AD_Muniv) = 1. This means that all the additive functions of Muniv are proportional to logdet. Hence the adaptive approach is irrelevant also to this model. Luckily, the additive functions of “intermediate” unified models have dimensions > 1, hence the adaptive approach is useful for them. Next we return to the Kimura 2 parameter model.

  42. Back to K2P: For every K2P substitution matrix P, with U the matrix U of the JC model: U⁻¹PU = diag(1, λP, μP, μP) (up to the order of the diagonal entries), where λP = 1 − 4Pβ = e^(−4β) and μP = 1 − 2Pβ − 2Pα = e^(−2α−2β), with 0 < λP < 1 and 0 < μP < 1. Conclusion: dimension(AD_MK2P) = 2.
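
The eigenvalues λP and μP can be checked numerically. The sketch below assumes that U is the 4×4 Hadamard matrix and that bases are ordered A, G, C, T; both are assumptions of this illustration (the transcript does not preserve U), as are the helper names.

```python
# Hadamard matrix; H is symmetric and H*H = 4*I, so its inverse is H/4.
H = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, 1, -1],
    [1, -1, -1, 1],
]


def k2p_matrix(p_ts, p_tv):
    """K2P substitution matrix for base order A, G, C, T (assumed ordering)."""
    s = 1.0 - p_ts - 2.0 * p_tv
    return [
        [s, p_ts, p_tv, p_tv],
        [p_ts, s, p_tv, p_tv],
        [p_tv, p_tv, s, p_ts],
        [p_tv, p_tv, p_ts, s],
    ]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def conjugate(p):
    """Return H^{-1} P H, which is diagonal for every K2P matrix P."""
    h_inv = [[x / 4.0 for x in row] for row in H]
    return matmul(matmul(h_inv, p), H)


p_ts, p_tv = 0.10, 0.03            # P_alpha and P_beta in the slide's notation
D = conjugate(k2p_matrix(p_ts, p_tv))
lam = 1.0 - 4.0 * p_tv             # lambda_P = 1 - 4 P_beta
mu = 1.0 - 2.0 * p_tv - 2.0 * p_ts # mu_P = 1 - 2 P_beta - 2 P_alpha
print([D[i][i] for i in range(4)], (1.0, lam, mu, mu))
```

The diagonal shows two distinct non-trivial eigenvalues, λP once and μP twice, which is what gives the K2P model two independent additive functions.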

  43. The functions Δλ(P) = −ln(λP) and Δμ(P) = −ln(μP) form a basis of AD_K2P. The standard “total rate” distance is: ΔK2P(P) = −(ln(λP) + 2ln(μP))/4 = −ln(det P)/4, i.e. one quarter of the logdet distance −ln(det P). The “transversion only” distance is: Δtr(P) = −ln(λP)/4.
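
Every additive function of the K2P model can then be written as c1·Δλ + c2·Δμ. The sketch below evaluates such combinations and recovers the two distances of slide 43 as the coefficient choices (1/4, 1/2) and (1/4, 0); the function names and the sample rates α = 0.2, β = 0.05 are hypothetical.

```python
import math


def probabilities_from_rates(alpha, beta):
    """Return (P_alpha, P_beta) implied by lambda = e^{-4b}, mu = e^{-2a-2b}."""
    lam = math.exp(-4.0 * beta)
    mu = math.exp(-2.0 * alpha - 2.0 * beta)
    p_tv = (1.0 - lam) / 4.0            # from lambda = 1 - 4 P_beta
    p_ts = (1.0 - mu) / 2.0 - p_tv      # from mu = 1 - 2 P_beta - 2 P_alpha
    return p_ts, p_tv


def basis(p_ts, p_tv):
    """The basis functions (Delta_lambda, Delta_mu) of the K2P model."""
    lam = 1.0 - 4.0 * p_tv
    mu = 1.0 - 2.0 * p_tv - 2.0 * p_ts
    return -math.log(lam), -math.log(mu)


def combined_distance(p_ts, p_tv, c1, c2):
    d_lam, d_mu = basis(p_ts, p_tv)
    return c1 * d_lam + c2 * d_mu


p_ts, p_tv = probabilities_from_rates(alpha=0.2, beta=0.05)
total_rate = combined_distance(p_ts, p_tv, 0.25, 0.5)   # equals alpha + 2*beta
tv_only = combined_distance(p_ts, p_tv, 0.25, 0.0)      # equals beta
print(total_rate, tv_only)
```

With exact probabilities every choice of (c1, c2) is additive; the choices differ only in how strongly sampling noise in the estimated probabilities is amplified, which is what the optimization in the following slides addresses.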

  44. Road Map • Distance based reconstruction algorithms • The Kimura 2 Parameter (K2P) Model • Performance of distance methods in the K2P model • Substitution models and substitution rate functions • Properties of SR functions • Unified Substitutions Models • Optimizing Distances in the K2P model • Simulation results

  45. K2P distance estimation: where the noise comes from. [Diagram labels: inherent noise; implied noise propagation; “user controlled” noise propagation.]

  46. Selection of c1, c2. For the edge (u, v): Estimated distance = True distance + Expected error.

  47. Expected Relative Error = Expected error / True distance.

  48. Minimizing the expected relative error

  49. A basic property of Normalized Mean Square Error: This means that equivalent SR functions have the same NMSE
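
The invariance stated on slide 49 can be illustrated numerically. This sketch assumes that the normalized mean square error is the mean squared deviation of the estimates divided by the squared true distance, and that “equivalent” SR functions are those that differ by a positive scalar factor; the slide's own formula is not preserved in the transcript, so both are assumptions, as is the function name `nmse`.

```python
def nmse(estimates, true_value):
    """Mean squared error of the estimates, normalized by true_value squared."""
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return mse / true_value ** 2


estimates = [0.9, 1.1, 1.3]   # hypothetical estimates of a true distance 1.0
c = 4.0                       # switch to the equivalent function c * Delta

original = nmse(estimates, 1.0)
rescaled = nmse([c * e for e in estimates], c * 1.0)
print(original, rescaled)     # the two values agree up to rounding
```

Scaling multiplies both the squared error and the squared true distance by c², so the ratio is unchanged; only the direction of (c1, c2) matters when searching for an optimal function.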

  50. A Proper Disclosure on our optimal functions:
