
Fast and Accurate Reconstruction of Evolutionary Trees: a Model-based Study

Ming-Yang Kao, Department of Computer Science, Northwestern University, Evanston, Illinois, U.S.A.


Presentation Transcript


  1. Fast and Accurate Reconstruction of Evolutionary Trees: a Model-based Study. Ming-Yang Kao, Department of Computer Science, Northwestern University, Evanston, Illinois, U.S.A.

  2. Perspectives: computer science and biology • Use biology ideas to solve computer science problems • Use computer science tools to solve biology problems (this talk)

  3. Use Biology to Solve CS Problems • DNA Computing • DNA Self-Assembly • Genetic Algorithms • Neural Networks • Others

  4. Use CS to Solve Biology Problems • Bioinformatics or Computational Biology: data mining (this talk) • Related fields: computational neuroscience, computational ecology, medical informatics, … many more …

  5. Example Research Areas of Bioinformatics • DNA sequencing • DNA microarray analysis • DNA self-assembly for nano-structures • DNA word design • RNA secondary structure prediction • Protein sequencing (my talk #4) • Proteomics • Protein database search • Protein sequence design (my talk #3) • Protein landscape analysis • Phylogeny reconstruction (this talk) • Phylogeny comparison (my talk #1)

  6. Evolutionary Trees • Definition: a tree with distinct labels at leaves • Leaf labels: species, organisms, DNAs, RNAs, proteins, features, etc. [Figure: a tree descending from an ancestral species to the present-day species wheat, rice, peach, plum, and bird (just a joke!)]

  7. Evolutionary Trees • Leaf labels: DNA sequences [Figure: tree with leaves wheat CGGC, rice CGGG, peach CCAT, plum CCAG, bird AAGT (just a joke!)]

  8. Problem Formulation • Input: DNA sequences of present-day species • Output: the true evolutionary tree • Question: What is “true”? Need a model! [Figure: the example tree over wheat, rice, peach, plum, and bird with its leaf sequences (just a joke!)]

  9. A Fundamental Problem of Biology • Problem, open since the time of Charles Darwin: reconstruct the evolutionary history of all known species. • Importance: intellectually fascinating; practical benefits in medicine, food, … • Charles Robert Darwin, 1809-1882 • Origin of Species, 1859

  10. Main Difficulties • Availability of data: there are hundreds of millions of species, and their data are unlikely to all be available any time soon, or ever. But DNA sequences of more and more species are becoming available. • Extracting information from data (focus of this talk)

  11. Today’s Technical Focus • Input: DNA sequences of present-day species • Output: the true evolutionary tree • Question: What is “true”? Need a model! • Collaborators: Csuros & Kim [Figure: the example tree over wheat, rice, peach, plum, and bird with its leaf sequences]

  12. Main Result An algorithm that constructs an evolutionary tree from biomolecular sequences • Provable high accuracy • Short sequence length • Optimal running time • Optimal memory space

  13. Outline of Technical Discussion • Define the model of evolution. • Formulate the computational problem. • Discuss the theoretical performance of our algorithm. • Discuss the empirical performance. • Describe and analyze the algorithm. • Further research.

  14. Outline of Technical Discussion (1) • Define the model of evolution. • Formulate the computational problem. • Discuss the theoretical performance of our algorithm. • Discuss the empirical performance. • Describe and analyze the algorithm. • Further research.

  15. Model of Evolution: Intuitions • Mutation occurs probabilistically. • edge length ~ time • edge length ~ mutation probability • edge length ~ dissimilarity (or distance) [Figure: example tree labeled with sequences ACGTACT AGTTCCT AGGAGAA CAGGAGTTTTAA]

  16. Jukes-Cantor Model of Evolution (1): Edge Mutation Probability [Figure: an edge from base A to base X] • No insertion or deletion. • X = A with probability 1 - 0.6 = 0.4 • X = C, G, or T with probability 0.6/3 = 0.2 each
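The edge rule on this slide can be sketched in a few lines of Python. This is only an illustration: the names `BASES`, `edge_distribution`, and `mutate_base` are made up here, and the value 0.6 is the edge mutation probability used in the slide's example.

```python
import random

BASES = "ACGT"

def edge_distribution(parent, p):
    """Exact distribution of the child base X given the parent base.

    With edge mutation probability p, the base is kept with probability
    1 - p and replaced by each of the other three bases with probability p / 3.
    """
    return {b: (1 - p if b == parent else p / 3) for b in BASES}

def mutate_base(parent, p, rng):
    """Sample the child base X for one edge (no insertions or deletions)."""
    if rng.random() < p:
        # A mutation happened: pick one of the three other bases uniformly.
        return rng.choice([b for b in BASES if b != parent])
    return parent

# The slide's example: parent base A, edge mutation probability 0.6.
dist = edge_distribution("A", 0.6)
rng = random.Random(0)
sample = [mutate_base("A", 0.6, rng) for _ in range(10000)]
```

Here `dist` holds the exact probabilities from the slide (0.4 for A, 0.2 for each of C, G, T), and the empirical frequencies in `sample` should land close to them.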

  17. Jukes-Cantor Model of Evolution (2): Independent Mutations along All Edges [Figure: tree with node bases A, A, C, G, G and edge mutation probabilities 0.2, 0.6, 0.65, 0.7]

  18. Jukes-Cantor Model of Evolution (3): i.i.d. mutations at every character [Figure: tree with node sequences AAGT, AGTT, CAGG, GGTG, GTTG and edge mutation probabilities 0.2, 0.6, 0.65, 0.7]
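Putting the three Jukes-Cantor slides together, sequences can be generated down a tree by mutating every character independently along every edge. The following is a minimal sketch under assumptions: the nested-tuple tree format, the node names, and the exact topology are hypothetical (the figure did not survive extraction); only the root sequence AAGT and the four edge probabilities come from the slide.

```python
import random

BASES = "ACGT"

def mutate_sequence(seq, p, rng):
    """Mutate each character independently with probability p (i.i.d. sites)."""
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def evolve(tree, root_seq, rng):
    """Generate a sequence for every node of the tree.

    tree is (name, [(edge_probability, subtree), ...]); a leaf has no children.
    Returns a dict mapping node name to its generated sequence.
    """
    name, children = tree
    seqs = {name: root_seq}
    for p, child in children:
        # Mutations along different edges are independent of each other.
        child_seq = mutate_sequence(root_seq, p, rng)
        seqs.update(evolve(child, child_seq, rng))
    return seqs

# Hypothetical topology using the edge probabilities shown on the slide.
tree = ("root", [
    (0.2, ("u", [(0.65, ("leaf1", [])), (0.7, ("leaf2", []))])),
    (0.6, ("leaf3", [])),
])
seqs = evolve(tree, "AAGT", random.Random(1))
```

A reconstruction algorithm would then be given only the leaf entries of `seqs`, not the internal sequences or the tree.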

  19. Outline of Technical Discussion (2) • Define the model of evolution. • Formulate the computational problem. • Discuss the theoretical performance of our algorithm. • Discuss the empirical performance. • Describe and analyze the algorithm. • Further research.

  20. Problem Formulation • True tree (not known to the algorithm): pick any sequence for the root (also unknown to the algorithm), then generate the other sequences. • Input: the leaf sequences, but not the other sequences, nor the tree. • Output: the unrooted tree. [Figure: true tree with node sequences CAGGT, CGTTT, AGTGT, CGTGT, ATCGT, CAGGT, GTACT, GGTAC, TGGAC and edge mutation probabilities 0.3, 0.2, 0.2, 0.5, 0.7, 0.6, 0.7, 0.1]

  21. Computational Objectives • Input: DNA sequences • Output: [Figure: tree] • Minimize: running time; memory space; probability of incorrect output; sample size, i.e., length of the input sequences

  22. Outline of Technical Discussion (3) • Define the model of evolution. • Formulate the computational problem. • Discuss the theoretical performance of our algorithm. • Discuss the empirical performance. • Describe and analyze the algorithm. • Further research.

  23. Triplets • A triplet is formed by three leaves. • P is the center of XYZ. [Figure: leaves X, Y, Z joined at center P]

  24. G-depth of a Triplet [Figure: triplet X, Y, Z, annotated “# of edges between X and Y” and the values 5, 8, 7]

  25. G-depth of a Tree • the smallest d such that the triplets of g-depth at most d cover the entire tree • the best case: g-depth = 4

  26. G-depth of a Tree • the smallest d such that the triplets of g-depth at most d cover the entire tree • the worst case: g-depth = 2 log n

  27. G-depth of a Tree • the smallest d such that the triplets of g-depth at most d cover the entire tree • at most 2 log n • can be O(1)

  28. Our New Result (1)

  29. Our New Result (2) polynomial sample size

  30. Our New Result (3) polynomial sample size provable high accuracy

  31. Our New Result (4) polynomial sample size provable high accuracy optimal time & space

  32. Comparison with Previous Results this talk

  33. Outline of Technical Discussion (4) • Define the model of evolution. • Formulate the computational problem. • Discuss the theoretical performance of our algorithm. • Discuss the empirical performance. • Describe and analyze the algorithm. • Further research.

  34. Experimental Study Design • Step 1 -- Pick a model tree T. • Step 2 -- Use T to generate sequences. • Step 3 -- Use an algorithm to reconstruct a tree T’ from the sequences (without knowing T). • Step 4 -- Compare T’ and T.

  35. Wrong and Right Edges [Figure: a true tree and a reconstructed tree over leaves X1 to X5; edges of the reconstructed tree are marked good or bad]
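Step 4 of the experimental design, comparing T’ with T, can be sketched by counting wrong edges. The sketch assumes one common convention that the slides do not spell out: an internal edge of the reconstructed tree counts as right if it splits the leaves into the same two groups as some edge of the true tree. The function names and the two five-leaf example trees are hypothetical.

```python
def splits(tree, all_leaves):
    """Leaf bipartitions induced by the internal edges of an unrooted tree.

    tree is a nested tuple whose leaves are strings; all_leaves is a frozenset.
    Edges to single leaves are skipped because every tree contains them.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        rest = all_leaves - below
        if len(below) > 1 and len(rest) > 1:
            found.add(frozenset([below, rest]))
        return below

    leaves_below(tree)
    return found

def wrong_edge_fraction(true_tree, reconstructed, all_leaves):
    """Fraction of internal edges of the reconstructed tree not in the true tree."""
    true_splits = splits(true_tree, all_leaves)
    recon_splits = splits(reconstructed, all_leaves)
    if not recon_splits:
        return 0.0
    return len(recon_splits - true_splits) / len(recon_splits)

# Made-up example: the reconstruction swaps X2 and X3.
leaves = frozenset(["X1", "X2", "X3", "X4", "X5"])
true_tree = (("X1", "X2"), "X3", ("X4", "X5"))
reconstructed = (("X1", "X3"), "X2", ("X4", "X5"))
fraction = wrong_edge_fraction(true_tree, reconstructed, leaves)
```

In this example the edge separating X4 and X5 from the rest is right and the edge grouping X1 with X3 is wrong, so `fraction` is 0.5.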

  36. Experiment #1 • the 135-taxon African-Eve tree (courtesy of Huson and Maddison) • algorithms compared: HGT and bioNJ (Olivier Gascuel) • parameters: sequence length and percentage of wrong edges • edge mutation probabilities: between 0.47 and 0.088 • # of simulations = 20 per sequence length • more experiments in progress

  37. 135-taxon African Eve Tree

  38. Results of Experiment #1

  39. Experiment #2 • a 1892-taxon tree of eukaryotes • algorithms compared: HGT and bioNJ • parameters: sequence length and percentage of wrong edges • edge mutation probabilities: between 0.47 and 0.088 • # of simulations = 20 per sequence length • more experiments in progress • several variants of the basic HGT

  40. Results of Experiment #2

  41. Results of Experiment #2

  42. Results of Experiment #2

  43. Outline of Technical Discussion (5) • Define the model of evolution. • Formulate the computational problem. • Discuss the theoretical performance of our algorithm. • Discuss the empirical performance. • Describe and analyze the algorithm. • Further research.

  44. Our New Result (4) polynomial sample size provable high accuracy optimal time & space

  45. Outline of Technical Discussion (5) • Describe the HGT algorithm. • Prove the sample size bound (and high probability for accuracy). • Prove the optimal time & space.

  46. Outline of Technical Discussion (5/1) • Describe the HGT algorithm. • Prove the sample size bound (and high probability for accuracy). • Prove the optimal time & space.

  47. Closeness and Distance of Two Leaves • Closeness is multiplicative. • Distance is additive! • The larger the closeness, the more accurately we can estimate the distance. [Figure: tree with node sequences AAGT, AGTT, CAGG, GGTG, GTTG, leaves X and Y, and edge mutation probabilities 0.2, 0.65, 0.7]
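The two properties can be checked numerically for the Jukes-Cantor model. The formulas below are assumptions not written out on the slide: closeness of an edge with mutation probability p is taken to be 1 - 4p/3, and distance is taken to be the negative logarithm of closeness. The helper names are illustrative.

```python
import math

def closeness(p):
    """Assumed Jukes-Cantor closeness of an edge with mutation probability p."""
    return 1.0 - 4.0 * p / 3.0

def distance(p):
    """Assumed distance: negative log of closeness (requires p < 0.75)."""
    return -math.log(closeness(p))

def compose(p1, p2):
    """Mutation probability of a path made of two consecutive edges.

    The end base equals the start base if neither edge mutates, or if the
    first edge mutates and the second happens to mutate back (p2 / 3).
    """
    same = (1 - p1) * (1 - p2) + p1 * p2 / 3
    return 1 - same

# Two of the edge probabilities shown in the slide's figure.
p_path = compose(0.2, 0.65)
product_of_closeness = closeness(0.2) * closeness(0.65)
sum_of_distances = distance(0.2) + distance(0.65)
```

Under these assumptions `closeness(p_path)` equals `product_of_closeness` and `distance(p_path)` equals `sum_of_distances`, up to floating-point error.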

  48. Closeness = Cubic Root of Determinant [Figure: sequences AAGT and CAGG and a matrix indexed by A, C, G, T]
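As a sanity check of the slide title, the sketch below builds the 4x4 Jukes-Cantor mutation matrix for one edge and takes the cube root of its determinant. It assumes the determinant in question is that of the matrix of mutation probabilities between the two sequences, which the extracted slide does not state explicitly; the function names are illustrative.

```python
import math

def jc_matrix(p):
    """4x4 Jukes-Cantor matrix: 1 - p on the diagonal, p / 3 elsewhere."""
    return [[1 - p if i == j else p / 3 for j in range(4)] for i in range(4)]

def determinant(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def closeness_from_matrix(m):
    """Real cube root of the determinant."""
    d = determinant(m)
    return math.copysign(abs(d) ** (1.0 / 3.0), d)

c = closeness_from_matrix(jc_matrix(0.6))
```

For the Jukes-Cantor matrix the eigenvalues are 1 and 1 - 4p/3 (three times), so the cube root of the determinant is 1 - 4p/3; with p = 0.6 the value `c` is 0.2 up to rounding.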

  49. Closeness of Triplet • The larger the closeness, the more accurately we can estimate the three pairwise distances. [Figure: tree with node sequences AAGT, AGTT, CAGG, GGTG, GTTG, leaves X, Y, Z, and edge mutation probabilities 0.2, 0.65, 0.7]

  50. Assemble Triplets Into Tree via Distance Additivity (I) [Figure: two diagrams with nodes P, X, A, Y; one with edge labels a, b, c, the other with numeric labels given as 6 25 3]
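Distance additivity is what lets the center of a triplet be located from pairwise leaf distances alone. The sketch below uses the standard identity for additive tree distances; it is not taken from the slide, whose figure did not survive extraction, and the function name and the numbers are made up.

```python
def center_distances(d_xy, d_xz, d_yz):
    """Distances from leaves X, Y, Z to the center P of the triplet XYZ.

    If distances add along paths, then d_xy = a + b, d_xz = a + c and
    d_yz = b + c, where a, b, c are the distances from X, Y, Z to P.
    Solving these three equations gives the values below.
    """
    a = (d_xy + d_xz - d_yz) / 2  # X to P
    b = (d_xy + d_yz - d_xz) / 2  # Y to P
    c = (d_xz + d_yz - d_xy) / 2  # Z to P
    return a, b, c

# Made-up example: a = 2, b = 3, c = 4 gives pairwise distances 5, 6, 7.
a, b, c = center_distances(5.0, 6.0, 7.0)
```

Two triplets that share leaves can then be compared through these center distances, which is one way pieces of a tree can be fitted together.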
