
Selecting Genomes for Reconstruction of Ancestral Genomes

Selecting Genomes for Reconstruction of Ancestral Genomes. Louxin Zhang, Department of Mathematics, National University of Singapore. Boreoeutherian Ancestor. The Genome Selection for Reconstruction problem.



  1. Selecting Genomes for Reconstruction of Ancestral Genomes
     Louxin Zhang, Department of Mathematics, National University of Singapore

  2. Boreoeutherian Ancestor

  3. The Genome Selection for Reconstruction problem
     Instance: A phylogeny P of a set of genomes, an integer k and a reconstruction method T (say, parsimony).
     Solution: k genomes in the phylogeny that give the highest accuracy of reconstructing the ancestral genome at the root of the phylogeny, using method T.

  4. Two reasons
     • It is often impossible to sequence all descendant genomes below an ancestor;
     • More taxa do not necessarily give a higher accuracy for the reconstruction of ancestral character states in general (examples will be given below).

  5. Outline
     • Introduction to reconstruction accuracy analysis
     • More genomes are not necessarily better for reconstruction
     • Greedy algorithms for genome selection
     Joint work with G. Li, J. Ma and M. Steel

  6. 1. Reconstruction and its Accuracy
     There are different methods for reconstructing the ancestral character states:
     • Parsimony
     • Maximum likelihood methods (Koshi & Goldstein ’96, Yang et al. ’95)
     • Bayesian methods (Yang et al. ’95)
     In this work, we study the problem under Fitch parsimony and maximum likelihood in the Jukes-Cantor evolutionary model.

  7. Jukes-Cantor Model
     • Characters evolve by a symmetric, reversible Markov process.
     • The probability of a substitution of any sort is the same on a branch.
     • For simplicity, we assume there are two states, 0 and 1.

  8. Reconstruction Accuracy
     In the symmetric Jukes-Cantor model, the reconstruction accuracy of a method is independent of the prior distribution of the states at the root.

  9. D denotes a state configuration at the leaves:
     • it has one state for each leaf;
     • there are $2^n$ state configurations for $n$ leaves, since there are 2 possible states at each leaf.
     $I(0, D, K)$ is 1 if the method K reconstructs state 0 from D, and 0 otherwise.
     $\Pr(D \mid 0)$ is the probability that state 0 at the root evolves into D.

  10. D denotes a state configuration at the leaf nodes:
      • it has one state for each leaf;
      • there are $2^n$ state configurations for $n$ leaves, since there are 2 possible states at each leaf.
      The reconstruction accuracy is the sum of the generating probabilities of the state configurations that allow the true state 0 to be recovered by the method K, that is, $\sum_{D} \Pr(D \mid 0)\, I(0, D, K)$.
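
The definition above can be checked by brute force on small trees. The following is a minimal sketch, not the authors' code: it assumes a hypothetical tree encoding (a leaf is a string, an internal node is a pair of `(subtree, p)` edges with `p` the conservation probability on that branch) and uses the Fitch method as the method K. The helper names `leaves`, `prob_config`, `fitch` and `accuracy` are illustrative.

```python
from itertools import product

# Assumed encoding: a leaf is a name (str); an internal node is a pair of
# (subtree, p) edges, where p is the conservation probability on that branch.

def leaves(tree):
    if isinstance(tree, str):
        return [tree]
    return [name for child, _ in tree for name in leaves(child)]

def prob_config(tree, state, config):
    """Pr(D | state): probability that `state` at this node evolves into D."""
    if isinstance(tree, str):
        return 1.0 if config[tree] == state else 0.0
    total = 1.0
    for child, p in tree:
        conserved = prob_config(child, state, config)
        flipped = prob_config(child, 1 - state, config)
        total *= p * conserved + (1 - p) * flipped
    return total

def fitch(tree, config):
    """Fitch state set at this node: intersection if non-empty, else union."""
    if isinstance(tree, str):
        return {config[tree]}
    (left, _), (right, _) = tree
    a, b = fitch(left, config), fitch(right, config)
    return a & b if a & b else a | b

def accuracy(tree, method):
    """Sum of Pr(D | 0) over all configurations D from which `method` returns {0}."""
    names = leaves(tree)
    total = 0.0
    for states in product((0, 1), repeat=len(names)):
        config = dict(zip(names, states))
        if method(tree, config) == {0}:      # indicator I(0, D, K)
            total += prob_config(tree, 0, config)
    return total
```

For a two-leaf tree such as `(("a", 0.9), ("b", 0.9))`, only the configuration with both leaves in state 0 yields `{0}`, so the sum reduces to 0.9 × 0.9. The enumeration is exponential in the number of leaves and is meant only for sanity checks.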

  11. Previous Analysis Work
      • Simulation studies (Martins ’99, Mooers ’04, Salisbury & Kim ’01, Zhang & Nei ’97, Yang et al. ’95);
      • Theoretical studies (Mossel ’01, Lucena and Haussler ’05, Maddison ’95)

  12. Fitch Method
      Given a state configuration of the leaves, the Fitch method reconstructs a subset of states at each internal node recursively, from the leaves to the root.
      [Figure: example tree with nodes A, B and C, leaf states 0, 0, 1 and reconstructed state sets {0} and {0, 1}]

  13. Calculating the Reconstruction Accuracy of the Fitch Method
      The unambiguous reconstruction accuracy is $P_{\text{accuracy}} = P[\{1\} \mid 1] = P[\{0\} \mid 0]$, where $P[\{1\} \mid 1]$ is the probability that the Fitch method outputs exactly the true state at the root.
      $P[\{0\} \mid 1]$, $P[\{1\} \mid 1]$ and $P[\{0, 1\} \mid 1]$ can be calculated by a dynamic programming approach (Maddison, 1995).
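
A dynamic program of this kind can be sketched as follows. This is an illustrative reconstruction in the spirit of the approach cited above, not a transcription of Maddison's algorithm: it assumes a hypothetical tree encoding (a leaf is a string, an internal node is a pair of `(subtree, p)` edges with conservation probability `p`) and computes, bottom-up, the probability of each Fitch set at a node given the true state at that node.

```python
from itertools import product

ZERO, ONE, BOTH = frozenset({0}), frozenset({1}), frozenset({0, 1})
SETS = (ZERO, ONE, BOTH)

def fitch_set_probs(tree):
    """Return table[s][X] = P[Fitch set at this node is X | true state here is s]."""
    if isinstance(tree, str):  # a leaf always shows its own state
        return {s: {X: float(X == frozenset({s})) for X in SETS} for s in (0, 1)}
    child_tables = []
    for child, p in tree:
        below = fitch_set_probs(child)
        # Mix over the child's true state: conserved with prob. p, flipped with 1 - p.
        child_tables.append({s: {X: p * below[s][X] + (1 - p) * below[1 - s][X]
                                 for X in SETS} for s in (0, 1)})
    left, right = child_tables
    table = {s: {X: 0.0 for X in SETS} for s in (0, 1)}
    for s in (0, 1):
        # The two subtrees evolve independently given the state at this node.
        for X, Y in product(SETS, repeat=2):
            joined = (X & Y) or (X | Y)   # Fitch rule: intersection, else union
            table[s][joined] += left[s][X] * right[s][Y]
    return table
```

With `t = fitch_set_probs(tree)`, the unambiguous accuracy is `t[0][ZERO]` and the ambiguous accuracy is `t[0][ZERO] + 0.5 * t[0][BOTH]`. The running time is linear in the number of nodes, in contrast to enumerating all leaf configurations.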

  14. Outline
      • Introduction to reconstruction accuracy analysis
      • More genomes are not necessarily better for reconstruction accuracy
      • Greedy algorithms for genome selection

  15. 2. Reconstruction accuracy is not a monotone function of the size of taxon sampling
      Unbalanced tree:
      • There is a large clade with a long stem
      • A short single sister lineage
      Such a phylogeny is used when both the fossil record and data from extant species are used for reconstruction (Finarelli and Flynn, 2006).

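  16. Setting: the root A has two children, Y and Z. $p_1$ is the conservation probability on AY; $p_2$ is the conservation probability on AZ.
      Theorem 1: $A_{\text{parsimony}} < p_1$ if $\tfrac{1}{2} < p_2 \le p_1$.
      Proof. The Fitch method outputs $\{0\}$ at A when Y is in state 0 and the set reconstructed at Z is $\{0\}$ or $\{0, 1\}$. Writing $\Pr_{AY}[ab]$ for the probability that state $a$ at A becomes state $b$ at Y:
      \[
      \begin{aligned}
      P_A[\{0\} \mid 0] &= \Pr_{AY}[00] \times \bigl( \Pr_{AZ}[00]\, P_Z[\{0\} \text{ or } \{0,1\} \mid 0] + \Pr_{AZ}[01]\, P_Z[\{0\} \text{ or } \{0,1\} \mid 1] \bigr) \\
      &= p_1 \bigl( p_2 (1 - P_Z[\{1\} \mid 0]) + (1 - p_2)(1 - P_Z[\{1\} \mid 1]) \bigr) \\
      &= p_1 \bigl( 1 - p_2\, P_Z[\{1\} \mid 0] - (1 - p_2)\, P_Z[\{1\} \mid 1] \bigr).
      \end{aligned}
      \]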

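  17. Setting: the root A has two children, Y and Z. $p_1$ is the conservation probability on AY; $p_2$ is the conservation probability on AZ; $\tfrac{1}{2} < p_2 \le p_1$.
      The Fitch method outputs $\{0, 1\}$ at A when Y is in state 0 and the set at Z is $\{1\}$, or Y is in state 1 and the set at Z is $\{0\}$. Using the symmetry $P_Z[\{1\} \mid 1] = P_Z[\{0\} \mid 0]$ and $P_Z[\{0\} \mid 1] = P_Z[\{1\} \mid 0]$:
      \[
      P_A[\{0,1\} \mid 0] = [p_1 p_2 + (1 - p_1)(1 - p_2)] \times P_Z[\{1\} \mid 0] + [p_1 (1 - p_2) + p_2 (1 - p_1)] \times P_Z[\{0\} \mid 0]
      \]
      \[
      \begin{aligned}
      A_{\text{parsimony}} &= P_A[\{0\} \mid 0] + \tfrac{1}{2} P_A[\{0,1\} \mid 0] \\
      &= p_1 + \tfrac{1}{2}(1 - p_1 - p_2)\, P_Z[\{1\} \mid 0] + \tfrac{1}{2}(p_2 - p_1)\, P_Z[\{0\} \mid 0] \\
      &< p_1,
      \end{aligned}
      \]
      since $1 - p_1 - p_2 < 0$ and $p_2 - p_1 \le 0$.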
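
The algebra in the proof of Theorem 1 can be sanity-checked numerically. The sketch below is illustrative only: it assumes that Z is the parent of two leaves, each reached by a branch with conservation probability `q` (a hypothetical choice, not from the slides), so that $P_Z[\{0\} \mid 0] = q^2$ and $P_Z[\{1\} \mid 0] = (1-q)^2$. It evaluates the accuracy once from the two Fitch probabilities and once from the simplified closed form.

```python
def accuracy_from_parts(p1, p2, a, b):
    """A_parsimony = P_A[{0}|0] + 1/2 * P_A[{0,1}|0].

    a = P_Z[{1}|0]; b = P_Z[{0}|0], which equals P_Z[{1}|1] by symmetry.
    """
    unambiguous = p1 * (1 - p2 * a - (1 - p2) * b)
    ambiguous = (p1 * p2 + (1 - p1) * (1 - p2)) * a \
              + (p1 * (1 - p2) + p2 * (1 - p1)) * b
    return unambiguous + 0.5 * ambiguous

def accuracy_closed_form(p1, p2, a, b):
    """Simplified expression from the last step of the proof."""
    return p1 + 0.5 * (1 - p1 - p2) * a + 0.5 * (p2 - p1) * b

# Assumed example: Z is the parent of two leaves with conservation probability q.
p1, p2, q = 0.9, 0.8, 0.7
a, b = (1 - q) ** 2, q ** 2
direct = accuracy_from_parts(p1, p2, a, b)
closed = accuracy_closed_form(p1, p2, a, b)
print(direct, closed, direct < p1)
```

Both expressions agree for these values, and the result stays below `p1`, the accuracy obtained from the single leaf Y alone.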

  18. The reconstruction accuracy on comb-shaped trees in the limit case

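  19. Setting: the root A has two children, Y and Z. $p_1$ is the conservation probability on AY; $p_2$ is the conservation probability on AZ; $D_Z$ is a state configuration below Z.
      Theorem 2: $A_{\text{ML}} = p_1$ if $\tfrac{1}{2} < p_2 \le p_1$.
      Marginal ML method: given state 0 at Y and configuration $D_Z$ below Z, is the state at A 0 or 1?
      $\Pr_A(0 D_Z \mid s)$ is the probability that state $s$ at A evolves into the state configuration $0 D_Z$, for $s = 0, 1$.
      \[
      \begin{aligned}
      \Pr_A(0 D_Z \mid 0) &= p_1 \times [\, p_2 \Pr_Z(D_Z \mid 0) + (1 - p_2) \Pr_Z(D_Z \mid 1) \,] \\
      \Pr_A(0 D_Z \mid 1) &= (1 - p_1) \times [\, (1 - p_2) \Pr_Z(D_Z \mid 0) + p_2 \Pr_Z(D_Z \mid 1) \,] \\
      \Pr_A(0 D_Z \mid 0) - \Pr_A(0 D_Z \mid 1) &= (p_1 + p_2 - 1) \Pr_Z(D_Z \mid 0) + (p_1 - p_2) \Pr_Z(D_Z \mid 1) > 0
      \end{aligned}
      \]
      The marginal ML method outputs 0 at A iff the state at Y is 0.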
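
Theorem 2 can be illustrated by enumerating every leaf configuration of a small instance. This is a sketch under an assumption that is not in the slides: Z is taken to be the parent of a few leaves, each on a branch with conservation probability `q`. The function names are illustrative.

```python
from itertools import product

def pr_z(dz, s, q):
    """Pr_Z(D_Z | s) when Z is the parent of leaves with conservation prob. q each."""
    out = 1.0
    for d in dz:
        out *= q if d == s else 1 - q
    return out

def pr_a(y, dz, s, p1, p2, q):
    """Pr_A(y D_Z | s): state s at A yields state y at Y and configuration D_Z below Z."""
    edge_y = p1 if y == s else 1 - p1
    below_z = p2 * pr_z(dz, s, q) + (1 - p2) * pr_z(dz, 1 - s, q)
    return edge_y * below_z

def marginal_ml_accuracy(p1, p2, q, leaves_below_z=2):
    """Probability that marginal ML returns 0 at A when the true state at A is 0."""
    acc = 0.0
    for y in (0, 1):
        for dz in product((0, 1), repeat=leaves_below_z):
            like0 = pr_a(y, dz, 0, p1, p2, q)
            like1 = pr_a(y, dz, 1, p1, p2, q)
            if like0 > like1:        # marginal ML outputs 0 at A
                acc += like0
    return acc

def ml_follows_y(p1, p2, q, leaves_below_z=2):
    """True if the ML state at A equals the state at Y for every configuration."""
    return all(
        (pr_a(y, dz, 0, p1, p2, q) > pr_a(y, dz, 1, p1, p2, q)) == (y == 0)
        for y in (0, 1)
        for dz in product((0, 1), repeat=leaves_below_z)
    )
```

For parameters with ½ < p2 ≤ p1, `ml_follows_y` holds and `marginal_ml_accuracy` equals `p1` up to rounding, whatever the configuration below Z contributes.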

  20. Another Example showing the Non-monotone Property of Reconstruction Accuracy

  21. Simulation
      Experiment setup:
      • Yule birth-death model
      • Conservation probability along branches: 0.5~1
      • Count the number of random trees in which the ambiguous accuracy of using a single (longest or shortest) path is better than that from the full phylogeny

  22. Simulation results: counting the bad trees (+: using the shortest path)

  23. Comparison of Parsimony (MP), Joint ML (JML) and Marginal ML (MML)
      • 500 random trees with 12 leaves generated:
        • Yule birth-death model
        • branch length is uniform from 0 to 1
      • MML outperforms JML and MP.
      • In 80% of instances, MML is strictly better than JML.
      • In 99% of instances, JML is strictly better than MP.

  24. Outline
      • Introduction to reconstruction accuracy analysis
      • More genomes are not necessarily better for reconstruction accuracy
      • Greedy algorithms for the genome selection problem

  25. Genome selection for reconstruction: the problem
      • Instance: A phylogeny P over n genomes, an integer k and a reconstruction method T
      • Question: Find k genomes that allow the ancestral genome at the root of P to be reconstructed with the maximum accuracy, using method T.

  26. Our approaches
      • The genome selection problem is unlikely to be polynomial-time solvable (no hardness proof yet).
      • As a result, we propose two greedy algorithms for the problem: a forward greedy algorithm and a backward greedy algorithm.

  27. Forward Greedy Algorithm (S is the set of selected genomes)
      1. Set S ← ∅;
      2. For i = 1, 2, ..., k do
         2.1) for each genome g not in S, compute the accuracy A(g) of the reconstruction obtained by applying method T to S ∪ {g};
         2.2) add the genome g with the maximum accuracy A(g) to S;
      3. Output S.
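
The forward greedy steps can be sketched as below. The `accuracy` argument stands in for "the reconstruction accuracy of method T on a genome subset" and must be supplied by the caller; the `toy_accuracy` function and its scores are purely hypothetical placeholders so that the sketch runs, not a real accuracy computation.

```python
from math import prod

def forward_greedy(genomes, k, accuracy):
    """Select k genomes by repeatedly adding the one that maximizes accuracy.

    `accuracy(subset)` must return the reconstruction accuracy of method T
    on the given list of genomes; ties are broken by input order.
    """
    selected = []                                   # step 1: S is empty
    for _ in range(k):                              # step 2: k rounds
        candidates = [g for g in genomes if g not in selected]
        # step 2.1: A(g) is the accuracy of S plus g, for each g not in S
        best = max(candidates, key=lambda g: accuracy(selected + [g]))
        selected.append(best)                       # step 2.2
    return selected                                 # step 3

# Hypothetical stand-in for a real accuracy function (illustration only).
toy_scores = {"human": 0.9, "mouse": 0.7, "dog": 0.8}

def toy_accuracy(subset):
    return 1 - prod(1 - toy_scores[g] for g in subset)

print(forward_greedy(list(toy_scores), 2, toy_accuracy))
```

Each round evaluates the accuracy once per remaining genome, so selecting k of n genomes costs on the order of k·n accuracy evaluations rather than one per k-subset.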

  28. Backward Greedy Algorithm
      1. Let S contain all the given genomes;
      2. For i = 1, 2, ..., n − k do
         2.1) for each genome g in S, compute the accuracy A’(g) of the reconstruction obtained by applying method T to S − {g};
         2.2) remove from S the genome g for which A’(g) is the maximum over all g in S;
      3. Output S.
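
A matching sketch of the backward greedy steps follows. As before, `accuracy` is a caller-supplied function standing in for the accuracy of method T on a subset, and the `demo_accuracy` function with its per-genome weights is a hypothetical placeholder used only to make the example executable.

```python
from math import prod

def backward_greedy(genomes, k, accuracy):
    """Start from all genomes and drop one per round until k remain.

    In each round the genome whose removal leaves the most accurate
    subset is discarded; ties are broken by input order.
    """
    selected = list(genomes)                        # step 1: S holds everything
    while len(selected) > k:                        # step 2: n - k rounds
        def accuracy_without(g):
            # step 2.1: A'(g) is the accuracy of S with g removed
            return accuracy([h for h in selected if h != g])
        drop = max(selected, key=accuracy_without)  # step 2.2
        selected.remove(drop)
    return selected                                 # step 3

# Hypothetical stand-in for a real accuracy function (illustration only).
demo_weights = {"human": 0.9, "mouse": 0.7, "dog": 0.8}

def demo_accuracy(subset):
    return 1 - prod(1 - demo_weights[g] for g in subset)

print(backward_greedy(list(demo_weights), 2, demo_accuracy))
```

The backward variant evaluates larger subsets than the forward one, which matters when the accuracy is not monotone in the subset: a genome that looks useful alone may be the first to be dropped from the full set.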

  29. Validation test – Trees with the same height
      Experiment setup:
      • Random trees with N (9 or 16) leaves generated by the program Evolver in PAML with the following parameters:
        • Birth rate = 10; death rate = 5; sampling fraction = 1.
        • Tree height = 0.1, 0.2, 0.5, 1, 2 or 5.

  30. Performance of the selection method for reconstruction with Parsimony

  31. Performance of the selection method for reconstruction with Marginal Maximum Likelihood

  32. Performance of the selection method for reconstruction with Joint Maximum Likelihood

  33. Marginal Maximum Likelihood

  34. Parsimony Method

  35. Concluding remarks
      • Reconstruction accuracy is not monotonically increasing with the taxon sampling size in unbalanced trees for the parsimony method: another kind of “inconsistency”.
        1. One implication of this observation is that the parsimony and ML methods might not exploit the full power of incorporating the fossil record into current data. Hence, some modification is probably needed.
        2. Caution should be used in drawing conclusions when testing hypotheses with ancestral state reconstruction.

  36. 3. Is the reconstruction accuracy function monotone in an ultrametric phylogeny?
      It seems true when the number of taxa is large.
      Consider the complete binary tree: when the conservation probability on each branch is less than 7/8, (the ambiguous reconstruction accuracy) = (the accuracy of using just one taxon) = 1/2 in the limit case. (A formula exists; see Steel ’89.)

  37. Concluding remarks
      • Formulated the genome selection for reconstruction problem
      • Proposed two greedy algorithms for the problem
      • Validation tests show that the reconstruction accuracy obtained from the genomes selected by the greedy algorithms is comparable to the maximum reconstruction accuracy.

  38. Thank You!

  39. A Biological Example
      • Boreoeutherian ancestor
      • From the ENCODE project
      • 4 states at leaf nodes
      • Expected accuracy at the root node

  40. A Biological Example – Results
      • The backward algorithm always gives results similar to those of the exhaustive search.
      • With 8 leaf nodes, the accuracy from the backward algorithm is 93.6%, close to the accuracy of 94.6% with the full phylogeny.

  41. Outline
      • Introduction to phylogeny reconstruction accuracy analysis
      • More genomes are not necessarily better for reconstruction accuracy
      • Greedy algorithms for the genome selection problem
      • Validation test
      • Conclusion

  42. Conclusion
      • Formulated the genome selection for reconstruction problem
      • Proposed two greedy algorithms for the problem
      • Validation tests show that the reconstruction accuracy obtained from the genomes selected by the greedy algorithms is comparable to the maximum reconstruction accuracy.

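  43. Fitch Parsimony method
      Given character states at the leaf nodes, the method reconstructs a subset of states at each internal node by the following rule: the set at a node is the intersection of its children's sets if that intersection is non-empty, and their union otherwise.
      [Figure: example with leaf states 0, 0, 1 and reconstructed state sets {0} and {0, 1}]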

  44. More Genomes Are Not Necessarily Better – An example with 4 leaves • The complete tree

  45. More Genomes Are Not Necessarily Better – An example with 4 leaves
      • The unambiguous reconstruction accuracy of using one genome is $P_{\text{path}} = p^2 + (1-p)^2$;
      • The unambiguous reconstruction accuracy of using all 4 genomes is $P_{\text{whole}} = P_{\text{path}} - 3p^2(1-p)^2$.
      More genomes give more noise!
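
The two formulas on this slide can be verified by exhaustive enumeration. The sketch below assumes, as the formulas suggest, that the complete tree has two internal nodes below the root, four leaves, and the same conservation probability `p` on every branch; the function names are illustrative.

```python
from itertools import product

def edge(parent, child, p):
    """Probability of the child state given the parent state on one branch."""
    return p if parent == child else 1 - p

def fitch_join(a, b):
    """Fitch rule: intersection if non-empty, otherwise union."""
    return a & b if a & b else a | b

def p_whole(p):
    """P[Fitch outputs {0} at the root | root state 0], using all 4 leaves."""
    total = 0.0
    # u, v: states of the two internal nodes; a, b below u; c, d below v.
    for u, v, a, b, c, d in product((0, 1), repeat=6):
        prob = (edge(0, u, p) * edge(0, v, p)
                * edge(u, a, p) * edge(u, b, p)
                * edge(v, c, p) * edge(v, d, p))
        root_set = fitch_join(fitch_join({a}, {b}), fitch_join({c}, {d}))
        if root_set == {0}:
            total += prob
    return total

def p_path(p):
    """Accuracy from a single leaf: the state survives two branches or flips twice."""
    return p ** 2 + (1 - p) ** 2

for p in (0.6, 0.8, 0.95):
    print(p, p_path(p), p_whole(p), p_path(p) - 3 * p**2 * (1 - p)**2)
```

The last two printed columns coincide, and `p_whole(p)` is below `p_path(p)` for every `p` strictly between 0 and 1, which is the sense in which the three extra genomes reduce the unambiguous accuracy.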

  46. A Small Example with Six Leaves
      (The ambiguous reconstruction accuracy) < (the unambiguous accuracy on the shortest path) when 0.5 < p < 0.65.

  47. Reconstruction accuracy on the complete phylogeny in the limit case
      When the conservation rate on each branch is less than 7/8, (the ambiguous reconstruction accuracy) = (the accuracy of using just one genome) = 1/2.
