
Simulation, Modeling, and Benchmarks

Simulation, Modeling, and Benchmarks. U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve Fisher, Sheng Guo, Lisan Wang, Shirley Cohen U Texas : David Hillis, Lauren Meyers Eric Miller, Tracy Heath, Derrick Zwickl NC State: Spencer Muse Errol Strain


Presentation Transcript


  1. Simulation, Modeling, and Benchmarks • U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson, Yifeng Zheng, Steve Fisher, Sheng Guo, Lisan Wang, Shirley Cohen • U Texas: David Hillis, Lauren Meyers, Eric Miller, Tracy Heath, Derrick Zwickl • NC State: Spencer Muse, Errol Strain • Yale: Paul Turner • and Bernard Moret, Tandy Warnow, Robert Jensen, Randy Linder

  2. Goal: Develop validated datasets of sufficient complexity and scale to realistically benchmark the latest tree algorithms

  3. Problems • Large-scale simulation is computationally demanding and difficult to reproduce independently • The model parameter space explodes combinatorially as model complexity increases • Experimental design for large-scale algorithm tests is extremely difficult to manage • Branching structure specification is critical, but the standard options are limited for very large trees • A credible simulation model acceptable to the community is difficult to establish

  4. Problems • Large-scale simulation is computationally demanding and difficult to reproduce independently • The model parameter space explodes combinatorially as model complexity increases • Experimental design for large-scale algorithm tests is extremely difficult to manage • Branching structure specification is critical, but the standard options are limited for very large trees • A credible simulation model acceptable to the community is difficult to establish

  5. Simulation Design • Problems addressed: critical branching structure specification; computational demand; difficulty of reproduction; need for a model acceptable to the community; difficulty of management; combinatorial complexity • Pre-generate a very large dataset (>10^6 positions) over a very large complex tree (>10^6 taxa) using a suite of complex models of evolution • Store the data in a database • Retrieve subsets of the data by various sampling schemes
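The pre-generate, store, and retrieve design above can be illustrated with a minimal sketch. It is not the project's actual database: the table name `sequences`, the functions `build_demo_db` and `sample_subset`, and the use of SQLite are assumptions made for illustration, and the stored sequences are random placeholders rather than the output of an evolutionary model.

```python
import random
import sqlite3


def build_demo_db(n_taxa=50, n_sites=200, seed=1):
    """Store toy 'pre-generated' sequences, one row per taxon (hypothetical schema)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sequences (taxon_id INTEGER PRIMARY KEY, seq TEXT)")
    rows = [
        (i, "".join(rng.choice("ACGT") for _ in range(n_sites)))
        for i in range(n_taxa)
    ]
    conn.executemany("INSERT INTO sequences VALUES (?, ?)", rows)
    conn.commit()
    return conn


def sample_subset(conn, n_taxa, n_sites, seed):
    """Retrieve a random subset of taxa and alignment positions, reproducibly."""
    rng = random.Random(seed)
    all_ids = [r[0] for r in conn.execute(
        "SELECT taxon_id FROM sequences ORDER BY taxon_id")]
    taxa = sorted(rng.sample(all_ids, n_taxa))
    total_sites = len(conn.execute(
        "SELECT seq FROM sequences LIMIT 1").fetchone()[0])
    sites = sorted(rng.sample(range(total_sites), n_sites))
    placeholders = ",".join("?" * len(taxa))
    query = (
        "SELECT taxon_id, seq FROM sequences "
        f"WHERE taxon_id IN ({placeholders}) ORDER BY taxon_id"
    )
    # Keep only the sampled columns of each retrieved sequence.
    return {tid: "".join(seq[j] for j in sites)
            for tid, seq in conn.execute(query, taxa)}
```

Because the data are generated once and every subset is a function of a seed, two groups running the same query should obtain the same benchmark dataset, which speaks to the reproducibility problem listed earlier.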

  6. Simulation and Data Access (diagram) • Model Characterization • Simulators • Character Evolution Simulators: HyPhy, Micro-evolution, Others • Tree Topology Simulators: Pure Birth, Birth-Death, Empirical Fit, Others • Tree/Char Combined: Experimental Evolution, Virtual Cell, etc. • Database • Taxon Sampling • Model Sampling • Data Subset with Associated Subtree • Format Translators: PAUP*, etc.

  7. Crimson Simulation DB

  8. Obligatory Schema Diagram (Don’t Look)

  9. Database Performance: Constant or Linear Time Queries • Implemented tree-based taxon sampling queries: random, stratified, MRC subtree • Plot panels: Select 20 fixed taxa from tree of size t (100 to 600); Select n random taxa from 2000-taxon tree; Select 20 random taxa from tree of size t (100 to 600)

  10. Query Options • Species Selection: Select All; Random Selection (num species); Select By Depth (num species, depth threshold); Manual Selection • Sequence Selection: Select All; Random Selection (num bp); Manual Selection (positions) • Repeat Query (num runs) • Rerun Query (seed)
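A minimal sketch of how seeded query options like those above might behave. The `Query` class, `run_query`, and `repeat_query` are hypothetical names, not the tool's real interface. Two assumptions are made: "Select By Depth" keeps taxa whose depth is at most the threshold, and "Repeat Query" reruns the query with consecutive seeds.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Query:
    """Hypothetical query object: random species and sequence selection."""
    num_species: int
    num_bp: int
    seed: int
    depth_threshold: Optional[float] = None


def run_query(query, taxon_depths, n_sites):
    """Return (species, positions) chosen reproducibly from the query's seed."""
    rng = random.Random(query.seed)
    # Assumed semantics: keep taxa no deeper than the threshold, if one is set.
    candidates = sorted(
        t for t, d in taxon_depths.items()
        if query.depth_threshold is None or d <= query.depth_threshold
    )
    species = sorted(rng.sample(candidates, query.num_species))
    positions = sorted(rng.sample(range(n_sites), query.num_bp))
    return species, positions


def repeat_query(query, taxon_depths, n_sites, num_runs):
    """Run the query num_runs times, using seed, seed + 1, ... (an assumption)."""
    return [
        run_query(replace(query, seed=query.seed + i), taxon_depths, n_sites)
        for i in range(num_runs)
    ]
```

Rerunning with the same seed returns the same species and positions, which is what makes a saved query a reusable benchmark definition.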

  11. Query Management • Load queries from the database • Save queries to the database • Import queries from a text file • Export queries to a text file • Create local queries (i.e., not stored in the database) • Delete queries from the local session and the database • Access query objects through the command line • Manipulate query objects within Jython scripts

  12. Simulation • Tree Topology Simulation: generate the temporal branching structure of populations/species • Character Simulation: generate the evolution of sequences/morphology/etc. over the tree generated above

  13. Tree Topology Simulation (Tracy Heath and David Hillis, UT Austin) • Standard Approach: simulate a homogeneous branching process (e.g., pure-birth model); sub-sample from a large homogeneous branching process • Problems: larger trees are self-similar to smaller trees; most biologists don’t think trees in simulations “look” like “real” trees

  14. Tree Topology Simulation • Modified code from Phyl-O-Gen - a tree simulation program (Rambaut) • Birth-death process • After a speciation event, the rates of each daughter lineage are mutated • The new rate is obtained by multiplying the parent rate by a gamma-distributed multiplier centered on 1 • The new rate is accepted in proportion to a prior distribution on birth and death rates
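The rate-mutation step described above can be sketched as follows. This is an illustration, not the modified Phyl-O-Gen code. Several details are assumptions: the gamma shape value, the bump-shaped prior `bump_prior`, redrawing until a proposal is accepted, and falling back to the parent rate after `max_tries` rejections.

```python
import math
import random


def bump_prior(rate, mode=1.0, width=0.5):
    """Hypothetical unnormalized prior on a rate, scaled so its maximum is 1."""
    return math.exp(-0.5 * ((rate - mode) / width) ** 2)


def mutate_rate(parent_rate, rng, shape=20.0, prior=bump_prior, max_tries=1000):
    """Propose parent_rate * gamma multiplier (mean 1); accept in proportion to the prior."""
    for _ in range(max_tries):
        # shape * scale = 1, so the multiplier is centered on 1.
        multiplier = rng.gammavariate(shape, 1.0 / shape)
        candidate = parent_rate * multiplier
        # Prior values lie in (0, 1], so they serve directly as acceptance probabilities.
        if rng.random() < prior(candidate):
            return candidate
    return parent_rate  # assumed fallback if every proposal is rejected


def speciate(birth_rate, death_rate, rng):
    """Return mutated (birth, death) rate pairs for the two daughter lineages."""
    return [
        (mutate_rate(birth_rate, rng), mutate_rate(death_rate, rng))
        for _ in range(2)
    ]
```

A smaller `shape` gives wider multipliers and therefore faster drift of rates along lineages, while the prior keeps rates from wandering without bound.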

  15. Tree Shape (figure: balanced versus imbalanced trees)

  16. Tree Shape Expectation under the equal rates Markov (ERM) model

  17. Tree Shape • Weighted mean imbalance (I) • Expectation under the equal rates Markov (ERM) model: I = 0.5 • (figure: example tree with nodes labeled I = 0, I = 0.5, and I = 1)
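The slides do not give a formula for the weighted mean imbalance, so the sketch below rests on assumptions. It uses the node imbalance of Fusco and Cronk, I = (B - m)/(M - m), where B is the size of the larger daughter clade of a node with n tips, M = n - 1, and m = ceil(n/2). It then applies the node weights, as I recall them, from Purvis et al. (2002), under which the expected value under ERM is 0.5. Trees are represented as nested 2-tuples with strings as tips, a representation chosen here for brevity.

```python
def clade_size(tree):
    """Number of tips below a node; a tip is any non-tuple."""
    if not isinstance(tree, tuple):
        return 1
    return sum(clade_size(child) for child in tree)


def node_imbalances(tree):
    """Yield (n, I) for every bifurcating node with n >= 4 tips (I is undefined below 4)."""
    if not isinstance(tree, tuple):
        return
    left, right = tree
    n_left, n_right = clade_size(left), clade_size(right)
    n = n_left + n_right
    if n >= 4:
        larger = max(n_left, n_right)   # B
        smallest = (n + 1) // 2         # m = ceil(n / 2)
        largest = n - 1                 # M
        yield n, (larger - smallest) / (largest - smallest)
    yield from node_imbalances(left)
    yield from node_imbalances(right)


def node_weight(n, imbalance):
    """Assumed weights: 1 for odd n; (n-1)/n for even n with I > 0; 2(n-1)/n for even n with I = 0."""
    if n % 2 == 1:
        return 1.0
    if imbalance > 0:
        return (n - 1) / n
    return 2 * (n - 1) / n


def weighted_mean_imbalance(tree):
    """Weighted mean of node imbalance over all informative nodes of the tree."""
    pairs = list(node_imbalances(tree))
    if not pairs:
        raise ValueError("tree has no node with at least 4 tips")
    total_weight = sum(node_weight(n, i) for n, i in pairs)
    return sum(node_weight(n, i) * i for n, i in pairs) / total_weight
```

A fully pectinate (comb-like) tree scores 1 and a fully balanced tree with 8 tips scores 0. The recursion here recomputes clade sizes at every node and is meant for small examples, not for trees of 10000 taxa.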

  18. Tree Topology Simulation • Simulated trees were compared with published phylogenies using measures of tree shape • 200 trees of 10000 taxa under the constant-rates (standard) model • 200 trees of 10000 taxa under the variable-rates model (our model) • 433 published trees were collected from various sources and sorted by the method used to estimate the phylogeny and the proportion of the ingroup sampled • Weighted mean imbalance (I) was used to compare the simulated trees with the published trees

  19-24. Comparing Trees (six plots; axes: weighted mean imbalance (I) and ln(node size))

  25. Million-taxon Trees • Three trees ranging from simple to complex were simulated • Equal rates tree • Variable rates tree • Variable rates tree with mass extinctions

  26. Multi-layered simulations for character evolution • Key molecule simulation (Muse, Hillis) • RNA macro-evolution simulation (Kim) • RNA micro-evolution simulation (Kim, Meyers) • Experimental viral evolution (Turner)

  27. Key molecule simulation (Muse, Hillis, Holder) • Estimate statistical parameters for real molecules (e.g., rbcL) using HyPhy, extend the model family to include more discrete rate distributions and positional dependencies, and finally generate a very large tree of 10^6 to 10^7 taxa using the key molecule models as its basis • rbcL model family estimated under a codon-specific model (Muse) • rRNA gene model (including secondary structure; Hillis and Gutell) • (figure: rbcL with invariable sites; three example parameter sets, symbols not recoverable: 0.8, 0.5, (0.1,..,0.5); 2.1, 1.3, (0.1,..,0.2); 1.1, 1.7, (0.3,..,0.2))

  28. Simulation of complex evolutionary processes • Reflect more complex dynamics • Heterogeneous rates: lineage- and site-specific mutation rates; genomic-context-dependent rates • Phenotypic effects • Selection • Population interaction

  29. RNA and its secondary structure as a model system for genotype-phenotype evolution

  30. Micro-Macro simulation model (Meyers, Kim) • Generate a population of molecules incorporating a fitness model and a speciation process based on RNA folding. Fitness is derived from (1) similarity to known 16S RNA (~67k seqs); (2) similarity to known 16S structure (~200 crystal structures); (3) folding stability • Experimental viral evolution (Turner; non-ITR funding for empirical work) • Use the RNA bacteriophage phi-6 system to generate an experimental phylogeny (~64-taxon tree with host switching and horizontal transfer)

  31. Individual-based simulation (E. Miller and L. Ancel) (figure labels: different adaptive peaks; more fit)

  32. Strategy for macro-evolution • Compute the probability of fixation of different mutation types using Kimura’s derivations • Draw the waiting time for each event from an exponential process • (figure labels: mutation, fixation)
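A minimal sketch of this strategy, under stated assumptions. The slides do not say which of Kimura's results were used, so the code assumes his diffusion approximation for a new mutation in a diploid population of size N with effective size equal to N: P_fix = (1 - e^(-2s)) / (1 - e^(-4Ns)), with the neutral limit 1/(2N). The substitution rate 2N * mu * P_fix and the function names are likewise illustrative.

```python
import math
import random


def fixation_probability(s, n_pop):
    """Kimura's diffusion approximation for a new mutation (assumes diploid, Ne = N)."""
    if abs(s) < 1e-12:
        return 1.0 / (2 * n_pop)          # neutral limit
    exponent = -4.0 * n_pop * s
    if exponent > 700:                     # strongly deleterious: avoid overflow
        return 0.0
    # (1 - e^{-2s}) / (1 - e^{-4Ns}), written with expm1 for numerical accuracy.
    return math.expm1(-2.0 * s) / math.expm1(exponent)


def waiting_time_to_fixation_event(s, n_pop, mu, rng):
    """Draw the time to the next fixed mutation from an exponential distribution."""
    # New mutations arise at rate 2N * mu per generation; each fixes with P_fix.
    rate = 2 * n_pop * mu * fixation_probability(s, n_pop)
    return rng.expovariate(rate)
```

Skipping directly from one fixation event to the next, instead of simulating every individual in every generation, is what makes a macro-evolutionary simulation over very large trees tractable.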

  33. Mutations in RNA: advantageous, neutral, or deleterious?

  34. Folding energy based fitness model • Example structures (figure) with folding energies of -491.07 J/mol and -636.71 J/mol • Assumption: a thermodynamically more stable structure is more fit

  35. A Free Energy Based Schema (diagram: molecules M0 (E0), M1 (E1), M2 (E2), M3 (E3), ..., Mi (Ei), ..., Mn (En), each labeled with its free energy)

  36. Computation • For each ancestral RNA molecule, enumerate all its mutants • Compute Ei, the free energy of RNA molecule Mi • RNAeval from the Vienna RNA package computes Ei for all possible single mutants of an RNA molecule in 5~6 minutes on one 2 GHz CPU • Draw the new descendant molecule according to the convolution of the mutation probability and the fixation probability obtained from the free energy calculations
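The enumerate-then-draw computation above can be sketched in a few lines. This is not the authors' simulator. `toy_energy` is a hypothetical stand-in for an RNAeval call, mutations are assumed equally probable, and the mapping from an energy difference to a fixation weight is passed in as a parameter because the slides do not specify it.

```python
import math
import random

BASES = "ACGU"


def single_mutants(seq):
    """Yield every sequence that differs from seq by one substitution (3 per site)."""
    for i, old in enumerate(seq):
        for new in BASES:
            if new != old:
                yield seq[:i] + new + seq[i + 1:]


def toy_energy(seq):
    """Hypothetical stand-in for RNAeval: lower (more stable) when GC-rich."""
    return -float(seq.count("G") + seq.count("C"))


def draw_descendant(ancestor, energy, fixation_weight, rng):
    """Pick one mutant with probability proportional to its fixation weight."""
    e0 = energy(ancestor)
    mutants = list(single_mutants(ancestor))
    # Uniform mutation probability is assumed, so only the fixation weight matters.
    weights = [fixation_weight(energy(m) - e0) for m in mutants]
    return rng.choices(mutants, weights=weights, k=1)[0]


def boltzmann_like_weight(delta_e):
    """Assumed weight: mutants with lower energy than the ancestor are favored."""
    return math.exp(-delta_e)
```

For example, `draw_descendant("GGAUCC", toy_energy, boltzmann_like_weight, random.Random(1))` returns one of the 18 single mutants, with those that raise GC content drawn more often.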

  37. Acceptance-Rejection Method • Enumerate the energy (= fitness) landscape around the ancestor and find the minimum (most fit) • In the descendant, assume that the energy differential to the local minimum is the same as in the ancestor • Sample a new mutation, and accept or reject it as a conditional event vis-à-vis the local minimum

  38. New RNA macro simulator • Can simulate folding-energy dependent evolution efficiently (estimated 30 days for 1 million taxa on a 20-CPU 2 GHz cluster) • Produces secondary structure changes and records the history of changes • Produces indel events and an alignment history; will output files with indels and the correct alignment • Parameterized with empirical data statistics (Hillis, Gutell)

  39. Alignment • Top is the homologous alignment; bottom is the ClustalW alignment • The first sequence is the root RNA; the others are randomly chosen leaf RNAs

  40. (figure panels) Statistics from 100 Eukaryote ssRNA • Statistics from RNASim

  41. Heuristic Search Landscape Properties • Datasets: rbcL (467 taxa, 660 sites); RNA simulator (512 taxa); seqgen (512 taxa; rate heterogeneity: none (no gamma distribution) or gamma = 1) • Sample 660 sites from each dataset without replacement • Call PAUP Hsearch with default settings and a time limit of 6 hrs • Report the best parsimony score at each second

  42. Normalized Parsimony Score Excess (NPSE) • Let B(t) be the best parsimony score at time t; B(0) is the score of the starting tree • B is monotonically non-increasing • Assume we run the heuristic search for 6 hrs. The NPSE is defined as NPSE(t) = (B(t) - B(6hrs)) / (B(0) - B(6hrs))
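The NPSE definition translates directly into code. In this small sketch the function name `npse` and the input layout are assumptions: a list of best-so-far scores sampled once per second, whose last entry is the score at the end of the run.

```python
def npse(best_scores):
    """Normalized parsimony score excess for each recorded time point.

    best_scores[0] is B(0), the score of the starting tree, and
    best_scores[-1] is the best score at the end of the run (B(6hrs) in the slides).
    """
    start, final = best_scores[0], best_scores[-1]
    if start == final:
        raise ValueError("search never improved on the starting tree; NPSE undefined")
    return [(b - final) / (start - final) for b in best_scores]
```

As a worked example, `npse([120, 110, 105, 100])` gives `[1.0, 0.5, 0.25, 0.0]`: the curve starts at 1, ends at 0, and its shape shows how quickly the search approaches its final score, which makes runs on datasets with different absolute scores comparable.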

  43. Experimental Evolution of Phi 6 virus (figure: 350 generations; labels: P phaseo.; P pseudo.; P phaseo. b’neck; P pseudo. b’neck; alternating b’neck; increasing P. phaseo pop size; decreasing P. phaseo pop size)

  44. (figure: tree of clones for sequencing over 350 generations, with clone labels such as A71 through P71) • Whole-genome sequencing 50% complete (Penn Genomics Inst funds)
