SuperTriplets: a triplet-based supertree approach to phylogenomics Vincent Ranwez, Alexis Criscuolo and Emmanuel J.P. Douzery
Introduction: inferring phylogeny (1 gene) SuperTriplets: ISMB 2010
Introduction: inferring phylogeny (3 genes)
[Figure: sequence alignments for Gene 1, Gene 2 and Gene 3, combined either as a SuperMatrix or as a SuperTree]
Introduction: inferring phylogeny (more data)
[Figure: sequence alignments for Gene 2 … Gene 1000, plus SNP / morphological / bibliographic data, combined either as a SuperMatrix or as a SuperTree]
Supertree overview: MRP
• MRP [Baum 1992, Ragan 1992]
  • 1 binary sequence per taxon
  • 1 site per clade (1 = in the clade; 0 = outside; ? = missing)
• Relation contradicted by all source trees [Goloboff and Pol, 2002]
[Figure: three source trees on taxa A-F, their MRP matrix, and the resulting MRP supertree]
Example MRP matrix from the slide:
0100101001?11?0100
01??0?011?0???0010
??0011010??001????
0100010??00??001?0
111??0101000????01
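The coding rule above (one row per taxon, one site per clade) can be sketched in Python. This is only an illustration under stated assumptions: trees are rooted and written as nested tuples of taxon names, and the helper names `leaves`, `clades` and `mrp_matrix` are hypothetical, not taken from any MRP software.

```python
def leaves(tree):
    """Return the set of taxon names found under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """List the leaf set of every internal node except the root."""
    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            yield frozenset(leaves(node))
        for child in node:
            yield from walk(child, False)
    return list(walk(tree, True))


def mrp_matrix(forest):
    """Build one character string per taxon, with one site per clade."""
    all_taxa = sorted(set().union(*(leaves(t) for t in forest)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in forest:
        tree_taxa = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in tree_taxa:
                    rows[taxon].append("?")   # taxon absent from this source tree
                elif taxon in clade:
                    rows[taxon].append("1")   # taxon inside the clade
                else:
                    rows[taxon].append("0")   # taxon in the tree but outside the clade
    return {taxon: "".join(sites) for taxon, sites in rows.items()}


# Two small overlapping source trees (illustrative data, not from the slides).
forest = [((("A", "B"), "C"), "D"), (("B", "C"), "E")]
matrix = mrp_matrix(forest)
for taxon, row in matrix.items():
    print(taxon, row)
```

Taxa missing from a source tree receive "?" at every site contributed by that tree, which is why sparse forests produce matrices dominated by missing data.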
Supertree overview: intuitive approach
• The supertree problem (intuitive formulation)
  • Input: a collection of overlapping trees (a forest)
  • Output: the tree that best represents this collection
• A major question: how to define "best represents"?
• Visualizing supertree candidates within the tree space
• Median supertree
  • Intuitive solution
  • Generalization of the consensus tree
  • Good theoretical properties [Steel and Rodrigo, 2008]
Supertree overview: median tree
• Initial trees
• Tree restriction
• Tree decomposition as:
  • split set
  • quartet set
  • triplet set
[Figure: distance d between two trees, computed on their decompositions; the operands of the formula were tree drawings]
Supertree overview: MRP and median tree
[Figure: an input forest T1, T2, T3 on taxa A-H, encoded either as a standard MRP matrix or as a triplet matrix representation (Triplet MR) with a rooting column]
Standard MRP matrix:
0100101001?11?0100
01??0?011?0???0010
??0011010??001????
0100010??00??001?0
111??0101000????01
Triplet MR, one site per triplet (AB|C, AB|D, …, GH|F, …, FH|G, …), columns ABCDEFGH plus the root:
110?????0
11?0????0
…
?????1010
?????0110
…
Supertree overview: MRP and median tree
• The parsimony value is related to the triplet distance:
  • 1 parsimony step for triplets within the supertree
  • 2 parsimony steps for the others
  • parsimony score = nbSites + (triplet distance)/2
• The MRP approach is ill-suited to triplet encoding
  • for 100 taxa, 97% of the matrix is "?"
  • for 1000 taxa, 99.7% of the matrix is "?"
  • unnecessarily huge matrices
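The percentages above can be checked with a short calculation. The sketch below assumes that each triplet site is informative for exactly three taxa (the rooting row is ignored) and that a fully resolved rooted tree on n taxa resolves all C(n, 3) triplets; the function names are hypothetical.

```python
import math


def missing_fraction(n_taxa):
    """Fraction of '?' in a triplet site: all taxa but the three in the triplet."""
    return 1 - 3 / n_taxa


def triplet_sites_per_binary_tree(n_taxa):
    """Number of triplet sites contributed by one fully resolved rooted tree."""
    return math.comb(n_taxa, 3)


def parsimony_score(nb_sites, triplet_distance):
    """Relation stated on the slide: nbSites + (triplet distance) / 2."""
    return nb_sites + triplet_distance / 2


for n in (100, 1000):
    print(n, f"{missing_fraction(n):.1%}", triplet_sites_per_binary_tree(n))
```

With 100 taxa a single fully resolved source tree already contributes 161700 sites, almost all of them "?", which illustrates why a dedicated algorithm working directly on triplet counts is preferable to running parsimony on the encoded matrix.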
Supertriplets: a few notations
• Given a forest F of input trees
  • N+(xy|z): number of occurrences of xy|z in F
  • N-(xy|z) = N+(xz|y) + N+(yz|x) (alternative resolutions in F)
• Once these counts are stored, the input trees are no longer needed (forest size has little impact)
• Searching for the (asymmetric) triplet median tree T
  • median: [formula shown as an image on the slide]
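The counts N+ and N- defined above can be computed as follows. This is a minimal sketch, assuming rooted input trees written as nested tuples of taxon names; the helper names (`triplets`, `count_plus`, `n_plus`, `n_minus`) and the tiny forest are illustrative, not the SuperTriplets implementation.

```python
from collections import Counter
from itertools import combinations


def leaves(tree):
    """Return the frozenset of taxon names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def triplets(tree):
    """Return every rooted triplet xy|z of the tree as (frozenset({x, y}), z)."""
    found = set()
    if isinstance(tree, str):
        return found
    child_leaves = [leaves(child) for child in tree]
    for i, inside in enumerate(child_leaves):
        outside = frozenset().union(
            *(s for j, s in enumerate(child_leaves) if j != i))
        # x and y share a child of this node while z hangs from another child.
        for x, y in combinations(sorted(inside), 2):
            for z in outside:
                found.add((frozenset((x, y)), z))
        found |= triplets(tree[i])
    return found


def count_plus(forest):
    """N+ for every triplet: number of input trees that display it."""
    counts = Counter()
    for tree in forest:
        counts.update(triplets(tree))
    return counts


def n_plus(counts, x, y, z):
    return counts[(frozenset((x, y)), z)]


def n_minus(counts, x, y, z):
    """N-(xy|z) = N+(xz|y) + N+(yz|x), the two alternative resolutions."""
    return n_plus(counts, x, z, y) + n_plus(counts, y, z, x)


forest = [
    ((("homo", "pan"), "mus"), "bos"),
    (("homo", "pan"), "mus"),
    (("homo", "mus"), "pan"),
]
counts = count_plus(forest)
print(n_plus(counts, "homo", "pan", "mus"), n_minus(counts, "homo", "pan", "mus"))
```

After `count_plus` has run, every later step only needs the `counts` table, which matches the remark that the input trees themselves are no longer required.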
Supertriplets: general overview
• Triplet decomposition: O(n^3 |F|)
• First sketch, NJ-like strategy: O(n^3) + consistency
• Improvement, NNI local search: O(n^3) to test all branches once
• Branch support and collapse: O(n^3)
[Figure: table of triplet counts, e.g. N+(homo pan|mus), N-(homo pan|mus), N+(pan bos|mus), N-(pan bos|mus), N+(homo pan|bos), N-(homo pan|bos), N+(mus pan|bos), N-(mus pan|bos), …]
Supertriplets: agglomerative process
[Figure: trees T0 to T3 on taxa A-E, built by successive agglomerations: C1={A}, C2={B}; then C1={A,B}, C2={C}; then C1={D}, C2={E}]
Triplets(T3): AB|C, AB|D, AB|E, AC|D, BC|D, AC|E, BC|E, DE|A, DE|B, DE|C
Supertriplets: agglomerative process
• Agglomeration of (C_A, C_B)
  • Transforms T into T'
  • Resolves some new triplets (AB|X) with A ∈ C_A, B ∈ C_B, X ∉ C_A ∪ C_B
  • d3(T', F) = d3(T, F) - ( ∑ N+(AB|X) - ∑ N-(AB|X) )
• We select the pair maximizing
  • Score(C_A, C_B) = ( ∑ N+(AB|X) - ∑ N-(AB|X) ) / ( ∑ N+(AB|X) + ∑ N-(AB|X) )
• The whole process is O(n^3): when C_A and C_B are agglomerated
  • Score(C_D, C_E) is unchanged
  • Score(C_{AB}, C_D) is easily derived from Score(C_A, C_D) and Score(C_B, C_D)
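The pair score above can be written directly from the N+ counts. The sketch below is a naive version that recomputes the sums for each pair rather than using the incremental O(n^3) update mentioned on the slide; the function names, the example counts, and the choice to return 0.0 when no triplet informs a pair are all assumptions made for illustration.

```python
from itertools import combinations, product


def n_plus(counts, a, b, x):
    """N+(ab|x): occurrences of the triplet ab|x in the forest."""
    return counts.get((frozenset((a, b)), x), 0)


def n_minus(counts, a, b, x):
    """N-(ab|x) = N+(ax|b) + N+(bx|a)."""
    return n_plus(counts, a, x, b) + n_plus(counts, b, x, a)


def agglomeration_score(counts, cluster_a, cluster_b, all_taxa):
    """(sum N+ - sum N-) / (sum N+ + sum N-) over the newly resolved triplets."""
    outside = set(all_taxa) - set(cluster_a) - set(cluster_b)
    plus = minus = 0
    for a, b, x in product(cluster_a, cluster_b, outside):
        plus += n_plus(counts, a, b, x)
        minus += n_minus(counts, a, b, x)
    total = plus + minus
    return (plus - minus) / total if total else 0.0  # assumption: no evidence -> 0


def best_pair(counts, clusters, all_taxa):
    """Return the pair of clusters with the highest agglomeration score."""
    return max(combinations(clusters, 2),
               key=lambda pair: agglomeration_score(counts, *pair, all_taxa))


# Illustrative N+ table (hypothetical values).
counts = {
    (frozenset(("homo", "pan")), "mus"): 2,
    (frozenset(("homo", "mus")), "pan"): 1,
    (frozenset(("homo", "pan")), "bos"): 1,
}
taxa = ["homo", "pan", "mus", "bos"]
clusters = [frozenset([t]) for t in taxa]
print(best_pair(counts, clusters, taxa))
```

Dividing by the total amount of evidence keeps the score in [-1, 1], so a pair supported by few but unanimous triplets is not penalized relative to a pair with many conflicting ones.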
Supertriplets: NNI optimisation
• The variation d3(T', F) - d3(T, F) depends on few triplets (highlighted in the slide figure)
• All these variations are initially evaluated in O(n^3)
• Once an NNI is done, few NNIs have to be re-evaluated (4 adjacent edges)
• NNI optimisation is therefore very fast
[Figure: trees T and T'; 2 possible NNIs per edge]
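Supertriplets: edge supports
• Local support
  • ∑ N+( ) / [ ∑ N+( ) + ∑ N-( ) ] (the triplets summed over were drawn on the slide)
  • If < 0.5, collapsing the edge improves d3(T, F)
• Global support
  • Also takes into account further triplets (drawn on the slide)
  • N+( ) and N-( ) impact two edges
• Final edge support: min(local, global)
[Figure: tree T showing the triplets involved in each support]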
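The 0.5 threshold follows from the agglomeration formula: resolving a set of triplets changes d3 by -(∑N+ - ∑N-), so undoing that resolution, that is collapsing the edge, changes it by +(∑N+ - ∑N-), which is negative exactly when the local support is below 0.5. The sketch below illustrates this under the assumption that the two sums for the edge have already been computed; the function names are hypothetical, returning `None` for an edge without evidence is an arbitrary choice, and the global support is taken as an input because its exact definition is not recoverable from the slide.

```python
def local_support(sum_plus, sum_minus):
    """Local support of an edge: sum N+ / (sum N+ + sum N-)."""
    total = sum_plus + sum_minus
    if total == 0:
        return None  # assumption: support left undefined without evidence
    return sum_plus / total


def collapse_gain(sum_plus, sum_minus):
    """Decrease of d3(T, F) obtained by collapsing the edge (positive = better)."""
    return sum_minus - sum_plus


def should_collapse(sum_plus, sum_minus):
    """Collapse when the local support is below 0.5, i.e. when the gain is positive."""
    support = local_support(sum_plus, sum_minus)
    return support is not None and support < 0.5


def final_support(local, global_):
    """Final edge support as stated on the slide: min(local, global)."""
    return min(local, global_)


print(local_support(3, 1), should_collapse(3, 1))
print(local_support(1, 3), should_collapse(1, 3), collapse_gain(1, 3))
```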
Supertriplets: simulation protocol
[Figure: simulation protocol [Eulenstein et al. 2004] [Criscuolo et al. 2006]; trees are compared ("Are they similar?") with a triplet/split measure]
Supertriplets: simulation results
[Figure: results plotted for triplets and for splits, with the annotations "contain errors", "less resolved", "very few errors", "perfect" and "lack of resolution"]
Supertriplets: phylogenomic case study
• Supertree of 33 mammals
  • Species: complete genomes (EnsEMBL v54)
  • Sequences: orthologous CDS (orthoMaM v5)
  • Gene trees: 13 000 ML trees (inferred using PAUP)
• Output supertree
  • Computed in 30 s
  • Congruent with [Prasad et al. 2008]
Conclusion & prospects
• (Asymmetric) median supertree
  • Easy to understand
  • Makes tree weighting natural
• MRP, triplets and median supertree
  • Understanding the criterion optimized by MRP
  • Designing a dedicated algorithm to optimize it
  • http://www.supertriplets.univ-montp2.fr/
• Supertrees & supermatrices are complementary
  • 1 000 vertebrate genome project
  • Divide and conquer approach
    i) trees based on multiple CDSs (supermatrix)
    ii) assembling those trees (supertree)
Supertree overview: asymmetric median tree
[Figure: two forests F1 and F2 of trees on taxa A-E compared with a reference tree (REF); for each forest the slide gives d(F, ·) = d(· + ·) and d(F, ·) = 3 * d(· + ·), where the operands were tree drawings]