
Simulation, Modeling, and Benchmarks

U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson

Yifeng Zheng, Steve Fisher, Sheng Guo, Lisan Wang, Shirley Cohen

U Texas: David Hillis, Lauren Meyers

Eric Miller, Tracy Heath, Derrick Zwickl

NC State: Spencer Muse

Errol Strain

Yale: Paul Turner

and

Bernard Moret

Tandy Warnow

Robert Jensen

Randy Linder


Goal: Develop validated datasets of sufficient complexity and scale to realistically benchmark the latest tree algorithms


Problems

  • Large-scale simulations are computationally demanding and difficult to reproduce independently

  • The model parameter space explodes combinatorially as model complexity increases

  • Experimental design for large-scale algorithm tests is extremely difficult to manage

  • Branching structure specification is critical, but the standard options are limited for very large trees

  • A credible simulation model acceptable to the community is difficult to establish


Simulation Design

  • Pre-generate a very large dataset (>10^6 positions) over a very large, complex tree (>10^6 taxa) using a suite of complex models of evolution

  • Store the data in a database

  • Retrieve subsets of the data by various sampling schemes


Simulation and Data Access

Model Characterization

Simulators

  • Character Evolution Simulators

    • HyPhy

    • Micro-evolution

    • Others

  • Tree Topology Simulators

    • Pure Birth

    • Birth-Death

    • Empirical Fit

    • Others

  • Others

    • Tree/Char Combined

    • Experimental Evolution

    • Virtual Cell

    • etc.

Database

  • Taxon Sampling

  • Model Sampling

Data Subset with Associated Subtree

Format Translators (PAUP*, etc.)




Database Performance: Constant or Linear Time Queries

  • Implemented tree-based taxon sampling query: random, stratified, MRC subtree

  • Benchmark plots (figures not reproduced):

    • Select 20 fixed taxa from tree of size t (100 to 600)

    • Select n random taxa from 2000-taxon tree

    • Select 20 random taxa from tree of size t (100 to 600)


Query Options

  • Species Selection

    • Select All

    • Random Selection (num species)

    • Select By Depth (num species, depth threshold)

    • Manual Selection

  • Sequence Selection

    • Select All

    • Random Selection (num bp)

    • Manual Selection (positions)

  • Repeat Query (num runs)

  • Rerun Query (seed)
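
The random selection, repeat, and rerun options can be illustrated with a small Python sketch. It is only an illustration of seed-controlled sampling, not the project's query code: the function name `run_query`, its parameters, and the taxon labels are all hypothetical.

```python
import random


def run_query(taxa, num_sites, num_species, num_bp, seed, num_runs=1):
    """Return one (species subset, site positions) pair per run.

    Hypothetical stand-in for a stored query: random species selection,
    random sequence selection, repeated num_runs times.
    """
    rng = random.Random(seed)  # the same seed reproduces the same samples
    results = []
    for _ in range(num_runs):
        species = sorted(rng.sample(taxa, num_species))
        positions = sorted(rng.sample(range(num_sites), num_bp))
        results.append((species, positions))
    return results


# Made-up taxon labels standing in for a 2000-taxon tree.
taxa = [f"taxon{i}" for i in range(2000)]

first = run_query(taxa, num_sites=10_000, num_species=20, num_bp=660,
                  seed=42, num_runs=3)
again = run_query(taxa, num_sites=10_000, num_species=20, num_bp=660,
                  seed=42, num_runs=3)
print(first == again)  # rerunning with the same seed gives the same subsets
```

Keeping the seed as part of the query is what makes a sampled benchmark subset reproducible by another group.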


Query Management

  • Load queries from the database

  • Save queries to the database

  • Import queries from a text file

  • Export queries to a text file

  • Create local queries (i.e., not stored in the database)

  • Delete queries from local session and database

  • Access query objects through the command line

  • Manipulate query objects within Jython scripts


Simulation

  • Tree Topology Simulation:

    • Generate the temporal branching structure of populations/species

  • Character Simulation:

    • Generate the evolution of sequences/morphology/etc over the tree generated above


Tree Topology Simulation (Tracy Heath and David Hillis, UT Austin)

  • Standard Approach:

    • Simulate a homogeneous branching process (e.g., pure-birth model)

    • Sub-sample from a large homogeneous branching process

  • Problems:

    • Larger trees are self-similar to smaller trees

    • Most biologists don’t think trees in simulations “look” like “real” trees


Tree Topology Simulation

  • Modified code from Phyl-O-Gen - a tree simulation program (Rambaut)

  • Birth-death process

  • After a speciation event, the rates of each daughter lineage are mutated

    • The new rate is obtained by multiplying the parent rate by a gamma-distributed multiplier centered on 1

    • The new rate is accepted in proportion to a prior distribution on birth and death rates
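
A minimal Python sketch of a birth-death process with heritable rate changes, under stated assumptions. It is not the modified Phyl-O-Gen code. The gamma shape, the log-normal-shaped prior weight, the rule "keep the parent rate when the proposal is rejected", and all function names are assumptions made for illustration.

```python
import math
import random


def mutate_rate(parent_rate, rng, center, shape=20.0, sigma=1.0):
    """Propose parent_rate times a gamma multiplier with mean 1.

    The proposal is accepted with probability given by an assumed prior
    weight in (0, 1] that peaks at `center`; otherwise the parent rate
    is kept (also an assumption).
    """
    proposal = parent_rate * rng.gammavariate(shape, 1.0 / shape)
    weight = math.exp(-(math.log(proposal / center) ** 2) / (2 * sigma ** 2))
    return proposal if rng.random() < weight else parent_rate


def simulate_tree(n_tips, birth=1.0, death=0.2, seed=0):
    """Grow a tree until n_tips lineages are alive or all have gone extinct."""
    rng = random.Random(seed)
    parent = {0: None}           # node id -> parent id
    alive = {0: (birth, death)}  # extant lineage -> (birth rate, death rate)
    next_id, elapsed = 1, 0.0
    while 0 < len(alive) < n_tips:
        ids = list(alive)
        totals = [alive[i][0] + alive[i][1] for i in ids]
        elapsed += rng.expovariate(sum(totals))  # waiting time to next event
        node = rng.choices(ids, weights=totals)[0]
        b, d = alive.pop(node)
        if rng.random() < b / (b + d):  # speciation: two daughter lineages
            for _ in range(2):
                parent[next_id] = node
                alive[next_id] = (mutate_rate(b, rng, center=birth),
                                  mutate_rate(d, rng, center=death))
                next_id += 1
        # otherwise the lineage went extinct and stays removed
    return parent, alive, elapsed


# Retry with new seeds if the whole process dies out early.
for seed in range(100):
    parent, alive, elapsed = simulate_tree(n_tips=50, seed=seed)
    if alive:
        break
print(len(alive), "extant lineages after", round(elapsed, 2), "time units")
```

Because daughters inherit mutated rates, fast-diversifying clades can arise locally, which is what breaks the self-similarity of a homogeneous process.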


Tree Shape

  • Tree shapes range from balanced to imbalanced

  • [Figure: example tree with nodes labeled by imbalance, I = 0, 0.5, or 1]

  • Weighted mean imbalance (I)

  • Expectation under the equal rates Markov (ERM) model: I = 0.5
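
The slides do not define I. The Python sketch below assumes the node-imbalance measure of Fusco and Cronk, I = (B - m) / (M - m), where B is the size of the larger daughter clade, M = n - 1 is its largest possible value, and m = ceil(n / 2) is its smallest. It applies the measure to nodes with more than three tips and takes a plain mean. The weighting used for the weighted mean is not reproduced here, and the nested-tuple tree encoding is hypothetical.

```python
import math


def tip_count(tree):
    """Number of tips below a node; a leaf is any non-tuple object."""
    if not isinstance(tree, tuple):
        return 1
    return sum(tip_count(child) for child in tree)


def node_imbalances(tree):
    """Yield (node size, I) for each bifurcating node with more than 3 tips."""
    if not isinstance(tree, tuple):
        return
    left, right = tree
    n = tip_count(tree)
    if n > 3:
        larger = max(tip_count(left), tip_count(right))
        most_even = math.ceil(n / 2)  # m: most balanced split possible
        most_uneven = n - 1           # M: most imbalanced split possible
        yield n, (larger - most_even) / (most_uneven - most_even)
    yield from node_imbalances(left)
    yield from node_imbalances(right)


balanced = (("a", "b"), ("c", "d"))
ladder = ((("a", "b"), "c"), "d")
mixed = (("a", "b"), (("c", "d"), ("e", "f")))

print(list(node_imbalances(balanced)))  # [(4, 0.0)]
print(list(node_imbalances(ladder)))    # [(4, 1.0)]
print(list(node_imbalances(mixed)))     # [(6, 0.5), (4, 0.0)]
```

Recomputing `tip_count` at every node is quadratic in the worst case, so a tree with 10000 taxa would need cached clade sizes.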


Tree Topology Simulation

  • Simulated trees were compared with published phylogenies using measures of tree shape.

    • 200 trees of 10000 taxa under the constant-rates (standard) model

    • 200 trees of 10000 taxa under the variable-rates model (ours)

  • 433 trees were collected from various sources and sorted based on the method used to estimate the phylogeny and the proportion of the ingroup sampled.

  • Weighted mean imbalance (I) was used to compare the simulated trees with published trees


Comparing Trees

[Figure series, six slides; axes: weighted mean imbalance (I) and ln(node size)]


Million-taxon Trees

  • Three trees ranging from simple to complex were simulated

    • Equal rates tree

    • Variable rates tree

    • Variable rates tree with mass extinctions


Multi-layered simulations for character evolution

  • Key molecule simulation (Muse, Hillis)

  • RNA macro-evolution simulation (Kim)

  • RNA micro-evolution simulation (Kim, Meyers)

  • Experimental viral evolution (Turner)


  • Key molecule simulation (Muse, Hillis, Holder)

    • Estimate statistical parameters for real molecules (e.g., rbcL) using HyPhy, extend the model family to include more discrete rate distributions and positional dependencies, and finally generate a very large tree of 10^6 to 10^7 taxa using the key molecule models as its basis.

    • rbcL model family estimated under codon-specific model (Muse)

    • rRNA gene model (including secondary structure; Hillis and Gutell)

[Figure: rbcL model with invariable sites and three example parameter sets; the parameter symbols were lost in extraction. Surviving values, in slide order: (0.8, 0.5, (0.1, ..., 0.5)); (2.1, 1.3, (0.1, ..., 0.2)); (1.1, 1.7, (0.3, ..., 0.2))]



Simulation of complex evolutionary processes

  • Reflect more complex dynamics

  • Heterogeneous rates:

    • lineage and site specific mutation rates

    • genomic context dependent rates

  • Phenotypic effects

    • Selection

    • Population interaction



  • Micro-Macro simulation model of genotype-phenotype evolution (Meyers, Kim)

    • Generate a population of molecules incorporating a fitness model and a speciation process based on RNA folding. Fitness is derived from (1) similarity to known 16S RNA (~67k seqs); (2) similarity to known 16S structure (~200 crystal structures); (3) folding stability

  • Experimental viral evolution (Turner; non-ITR funding for empirical work)

    • Use the RNA bacteriophage phi-6 system to generate an experimental phylogeny (~64-taxon tree with host switching and horizontal transfer)


Individual-based simulation (E. Miller and L. Ancel)

[Figure labels: different adaptive peaks; more fit]


Strategy for macro-evolution

Compute the probability of fixation of different mutation types using Kimura’s derivations. Draw the waiting time for each event from an exponential process.

[Figure labels: mutation, fixation]
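
The strategy above can be sketched in Python under simplifying assumptions that are not from the slides: a diploid population whose effective size equals its census size N, a new mutant starting at frequency 1/(2N), and Kimura's diffusion approximation u = (1 - exp(-2s)) / (1 - exp(-4Ns)). The mutation classes, rates, and selection coefficients are made-up illustrative values.

```python
import math
import random


def fixation_probability(s, n):
    """Kimura's approximation for a new mutant at frequency 1/(2n)."""
    if s == 0:
        return 1.0 / (2 * n)  # neutral limit
    try:
        # expm1 keeps precision for small |s|; the ratio equals
        # (1 - exp(-2s)) / (1 - exp(-4ns)).
        return math.expm1(-2.0 * s) / math.expm1(-4.0 * n * s)
    except OverflowError:
        return 0.0  # strongly deleterious: effectively never fixes


def next_substitution(mutation_classes, n, rng):
    """Draw (waiting time, mutation class) for the next fixation event.

    mutation_classes maps a class name to (mutation rate per generation, s).
    Each class fixes at rate 2n * mu * u(s).
    """
    rates = {name: 2 * n * mu * fixation_probability(s, n)
             for name, (mu, s) in mutation_classes.items()}
    wait = rng.expovariate(sum(rates.values()))  # exponential waiting time
    kind = rng.choices(list(rates), weights=list(rates.values()))[0]
    return wait, kind


# Illustrative, made-up mutation classes.
classes = {
    "advantageous": (1e-6, 0.01),
    "neutral": (1e-5, 0.0),
    "deleterious": (1e-4, -0.01),
}
wait, kind = next_substitution(classes, n=1000, rng=random.Random(7))
print(kind, "fixed after", round(wait), "generations")
```

For the neutral class the substitution rate reduces to the mutation rate, which is a quick sanity check on the rate calculation.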


Mutations in RNA

[Figure: a new mutation (?) may be advantageous, neutral, or deleterious]


Folding energy based fitness model

[Figure: two example structures labeled -491.07 J/mol and -636.71 J/mol]

Assumption: A thermodynamically more stable structure is more fit.


2.2 A Free Energy Based Schema

[Diagram: molecules M0, M1, M2, M3, ..., Mi, ..., Mn, each labeled with its free energy E0, E1, E2, E3, ..., Ei, ..., En]


Computation

For each ancestral RNA molecule, enumerate all its mutants.

Compute Ei, the free energy of RNA molecule Mi.

RNAeval from the Vienna RNA package computes Ei for all possible single mutants of an RNA molecule in 5~6 minutes using one CPU (2 GHz).

Draw the new descendant molecule according to the convolution of mutation probability and fixation probability from the free energy calculations.
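
A toy Python sketch of the enumerate, score, and draw loop. The energy function is a placeholder that counts complementary end-to-end pairs and is not a call to RNAeval. The mutation probability is assumed uniform over single mutants, and the weight exp(-beta * (Ei - E0)) is an assumed stand-in for the fixation probability. All names are hypothetical.

```python
import math
import random

BASES = "ACGU"
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def toy_energy(seq):
    """Placeholder energy: -1 for each complementary end-to-end pair."""
    n = len(seq)
    return -float(sum((seq[i], seq[n - 1 - i]) in PAIRS
                      for i in range(n // 2)))


def single_mutants(seq):
    """All sequences that differ from seq at exactly one position."""
    return [seq[:i] + base + seq[i + 1:]
            for i in range(len(seq))
            for base in BASES if base != seq[i]]


def draw_descendant(seq, rng, energy=toy_energy, beta=1.0):
    """Sample one single mutant, favoring lower (more stable) energies."""
    e0 = energy(seq)
    mutants = single_mutants(seq)
    # Uniform mutation probability (assumed) times an assumed
    # fixation weight that grows as the energy drops.
    weights = [math.exp(-beta * (energy(m) - e0)) for m in mutants]
    return rng.choices(mutants, weights=weights)[0]


ancestor = "GGAACC"
child = draw_descendant(ancestor, random.Random(1))
print(ancestor, toy_energy(ancestor), "->", child, toy_energy(child))
```

A sequence of length L has 3L single mutants, so the energy evaluations dominate the cost, consistent with the minutes-per-molecule figure quoted above.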


Acceptance-Rejection Method

Enumerate the energy (= fitness) landscape around the ancestor and find the minimum (most fit).

In the descendant, assume that the energy differential to the local minimum is the same as in the ancestor. Sample a new mutation, then accept or reject it as a conditional event vis-à-vis the local minimum.


New RNA macro simulator

  • Can simulate folding-energy dependent evolution efficiently (estimated 30 days for 1 million taxa on a 20-CPU, 2 GHz cluster)

  • Produces secondary structure changes and records history of changes

  • Produces indel events and records the alignment history; will output files with indels and the correct alignment

  • Parameterized with empirical data statistics (Hillis, Gutell)


Alignment

Top is the homologous alignment. Bottom is the ClustalW alignment.

The first sequence is the root RNA; the others are randomly chosen leaf RNAs.


[Figure panels: Statistics from 100 Eukaryote ssRNA; Statistics from RNASim]


Heuristic Search Landscape Properties

  • rbcL: 467 taxa, 660 sites

  • RNA simulator: 512 taxa

  • Seq-Gen: 512 taxa

    • rate heterogeneity: 0 (no gamma dist.), gamma=1

  • Sample 660 sites from each dataset without replacement

  • Call PAUP* HSearch with default settings and time limit = 6 hrs

  • Report best parsimony score at each second


Normalized Parsimony Score Excess (NPSE)

  • Let B(t) be the best parsimony score at time t; let B(0) be the score of the starting tree

  • B is monotonically non-increasing

  • Assume we run the heuristic search for 6 hrs. The NPSE is defined as

    NPSE(t) = (B(t) - B(6 hrs)) / (B(0) - B(6 hrs))
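
The NPSE definition translates directly into Python. In this sketch the trace of best scores is made up, the last entry plays the role of B(6 hrs), and the function name is hypothetical.

```python
def npse(best_scores):
    """Normalized parsimony score excess for a trace of best scores.

    best_scores[0] is the starting-tree score B(0); the last entry is the
    score at the end of the search, B(6 hrs) in the slides.
    """
    start, final = best_scores[0], best_scores[-1]
    if start == final:
        raise ValueError("search never improved on the starting tree")
    return [(b - final) / (start - final) for b in best_scores]


# Made-up trace: best parsimony score recorded at successive seconds.
trace = [1200, 1150, 1150, 1110, 1100, 1100]
print(npse(trace))  # [1.0, 0.5, 0.5, 0.1, 0.0, 0.0]
```

The value starts at 1 and falls to 0, so curves from datasets with very different raw parsimony scores can be compared on one scale.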


Experimental Evolution of Phi 6 virus

[Figure: experimental treatments over 350 generations: P. phaseo.; P. pseudo.; P. phaseo. bottleneck; P. pseudo. bottleneck; alternating bottleneck; increasing P. phaseo. population size; decreasing P. phaseo. population size]


[Figure: experimental phylogeny marked "350 generations"; labels include 1, 11, A21, B21, A31 to D31, A41, E41, F41, H41, J41, L41, P41, A51 to H51, A61, and A71 to P71]

Clones for sequencing

Whole-genome sequencing 50% complete (Penn Genomics Inst funds)


350 generations

Next Steps:

Introduce Reticulations


Christina Burch, UNC

Paul Higgs, McMaster

Laura Landweber, Princeton

Carlo Maley, Wistar Inst.

Claus Wilke, UT Austin

John Yin, U Wisconsin


Year 3 Status

  • Complex Simulations

    • Scaling up to 1 million-taxon tree

      • Generated Complex Branching Models

      • Developed Novel Importance Sampling Scheme

      • Parallelized HyPhy

  • Simulation Database

    • Scaling and Code Hardening

      • Developed New Extensions to Indexing

      • Developed and tested with a test suite

  • Experimental Evolution

    • Generated 16 new lineages evolved for 350 generations under heterogeneous conditions

    • 50% of whole genomes sequenced

