Simulation, Modeling, and Benchmarks

U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson, Yifeng Zheng, Steve Fisher, Sheng Guo, Lisan Wang, Shirley Cohen

U Texas: David Hillis, Lauren Meyers, Eric Miller, Tracy Heath, Derrick Zwickl

NC State: Spencer Muse, Errol Strain

Yale: Paul Turner

and Bernard Moret, Tandy Warnow, Robert Jensen, Randy Linder


Goal: Develop validated datasets of sufficient complexity and scale to realistically benchmark the latest tree algorithms


Problems

  • Large-scale simulation is computationally demanding and difficult to reproduce independently

  • The model parameter space explodes in combinatorial complexity with increase in model complexity

  • Large-scale algorithm test experimental design is extremely difficult to manage

  • Branching structure specification is critical but the standard options are limited for very large trees

  • Credible simulation model acceptable to the community is difficult to establish


Simulation Design

  • Pre-generate a very large dataset (>10^6 positions) over a very large complex tree (>10^6 taxa) using a suite of complex models of evolution

  • Store the data in a database

  • Retrieve subsets of the data by various sampling schemes


Simulation and Data Access

[Diagram components]

  • Model Characterization

  • Simulators

    • Character Evolution Simulators: HyPhy, Micro-evolution, Others

    • Tree Topology Simulators: Pure Birth, Birth-Death, Empirical Fit, Others

    • Others: Tree/Char Combined, Experimental Evolution, Virtual Cell, etc.

  • Database

  • Taxon Sampling

  • Model Sampling

  • Data Subset with Associated Subtree

  • Format Translators (PAUP*, etc.)


Crimson Simulation DB


Obligatory Schema Diagram (Don’t Look)


Database Performance: Constant or Linear Time Queries

  • Implemented tree-based taxon sampling query (diagram labels: random, stratified, MRC subtree)

  • Benchmark plots:

    • Select 20 fixed taxa from tree of size t (100 to 600)

    • Select n random taxa from 2000-taxon tree

    • Select 20 random taxa from tree of size t (100 to 600)
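
A taxon sampling query returns the sampled taxa together with their associated subtree. The sketch below shows one way such a projection could work. It assumes a hypothetical child-to-parent dictionary as the tree representation; the database's actual indexing scheme is not described in these slides, and the function name `induced_subtree` is illustrative only.

```python
def induced_subtree(parent, sampled):
    """Return a child->parent map of the subtree induced by the sampled leaves.

    `parent` maps every node to its parent (the root maps to None).
    Internal nodes left with a single child are suppressed, and the
    result is rooted at the most recent common ancestor of the sample.
    """
    # Collect every node on a path from a sampled leaf to the root.
    keep = set()
    for leaf in sampled:
        node = leaf
        while node is not None and node not in keep:
            keep.add(node)
            node = parent[node]
    # Count kept children for each kept node.
    child_count = {}
    for node in keep:
        p = parent[node]
        if p is not None:
            child_count[p] = child_count.get(p, 0) + 1
    # A node survives if it is a sampled leaf or still has two or more children.
    sampled_set = set(sampled)
    survivors = {n for n in keep if n in sampled_set or child_count.get(n, 0) >= 2}
    # Reconnect each survivor to its nearest surviving ancestor.
    result = {}
    for node in survivors:
        p = parent[node]
        while p is not None and p not in survivors:
            p = parent[p]
        result[node] = p
    return result


# Toy tree ((A,B)x,(C,(D,E)y)z)r
toy_parent = {"r": None, "x": "r", "z": "r", "A": "x", "B": "x",
              "C": "z", "y": "z", "D": "y", "E": "y"}
subtree = induced_subtree(toy_parent, ["A", "D", "E"])
```

Each leaf-to-root walk stops as soon as it reaches an already visited node, so the work is proportional to the number of nodes touched rather than to the size of the full tree.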


Query Options

  • Species Selection

    • Select All

    • Random Selection (num species)

    • Select By Depth (num species, depth threshold)

    • Manual Selection

  • Sequence Selection

    • Select All

    • Random Selection (num bp)

    • Manual Selection (positions)

  • Repeat Query (num runs)

  • Rerun Query (seed)
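
The random selection, repeat, and rerun options can be illustrated with a small sketch. The `Query` class and its field names are hypothetical and are not the tool's actual query objects; the point is only that storing a seed with the query makes a random sample reproducible.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    """Hypothetical query object; field names are illustrative only."""
    num_species: Optional[int] = None   # None means "Select All"
    num_bp: Optional[int] = None        # None means "Select All"
    num_runs: int = 1                   # Repeat Query
    seed: int = 0                       # Rerun Query


def run_query(query, species, seq_length):
    """Return one (species, positions) selection per run."""
    rng = random.Random(query.seed)
    results = []
    for _ in range(query.num_runs):
        if query.num_species is None:
            chosen = list(species)
        else:
            chosen = sorted(rng.sample(species, query.num_species))
        if query.num_bp is None:
            positions = list(range(seq_length))
        else:
            positions = sorted(rng.sample(range(seq_length), query.num_bp))
        results.append((chosen, positions))
    return results


all_species = [f"sp{i}" for i in range(50)]
query = Query(num_species=5, num_bp=20, num_runs=3, seed=7)
first = run_query(query, all_species, seq_length=1000)
again = run_query(query, all_species, seq_length=1000)  # same seed, same samples
```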


Query Management

  • Load queries from the database

  • Save queries to the database

  • Import queries from a text file

  • Export queries to a text file

  • Create local queries (i.e., not stored in the database)

  • Delete queries from local session and database

  • Access query objects through the command line

  • Manipulate query objects within Jython scripts


Simulation

  • Tree Topology Simulation:

    • Generate the temporal branching structure of populations/species

  • Character Simulation:

    • Generate the evolution of sequences/morphology/etc over the tree generated above


Tree Topology Simulation (Tracy Heath and David Hillis, UT Austin)

  • Standard Approach:

    • Simulate a homogeneous branching process (e.g., pure-birth model)

    • Sub-sample from a large homogeneous branching process

  • Problems:

    • Larger trees are self-similar to smaller trees

    • Most biologists don’t think trees in simulations “look” like “real” trees


Tree Topology Simulation

  • Modified code from Phyl-O-Gen - a tree simulation program (Rambaut)

  • Birth-death process

  • After a speciation event, the rates of each daughter lineage are mutated

    • The new rate is obtained by multiplying the parent rate by a gamma-distributed multiplier centered on 1

    • The new rate is accepted in proportion to a prior distribution on birth and death rates
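
A minimal sketch of a birth-death process with rates that change at speciation is shown below. It is not the modified Phyl-O-Gen code. The prior (`prior_weight`), the gamma shape, and the choice to keep the parent rate when a proposal is rejected are all assumptions made for illustration, and the sketch tracks topology only (no branch lengths).

```python
import math
import random


def prior_weight(rate, center=1.0, sigma=1.0):
    """Hypothetical unnormalized log-normal-shaped prior, maximum 1 at `center`."""
    return math.exp(-((math.log(rate) - math.log(center)) ** 2) / (2 * sigma ** 2))


def mutate_rate(rate, rng, shape=10.0):
    """Multiply by a gamma multiplier with mean 1; accept by prior weight."""
    proposal = rate * rng.gammavariate(shape, 1.0 / shape)
    if proposal > 0 and rng.random() < prior_weight(proposal):
        return proposal
    return rate  # assumption: a rejected proposal keeps the parent rate


def simulate_tree(num_tips, birth=1.0, death=0.2, seed=0, max_events=10**6):
    """Birth-death simulation with rates that change at speciation events.

    Returns (parent, extant): a child->parent map over node ids and the
    list of extant node ids, or None if the process went extinct.
    """
    rng = random.Random(seed)
    parent = {0: None}
    rates = {0: (birth, death)}         # node id -> (birth rate, death rate)
    extant = [0]
    next_id = 1
    for _ in range(max_events):
        if not extant:
            return None                 # every lineage died out
        if len(extant) >= num_tips:
            return parent, extant
        # Pick the lineage with the next event, weighted by its total rate.
        weights = [sum(rates[n]) for n in extant]
        node = rng.choices(extant, weights=weights)[0]
        b, d = rates[node]
        extant.remove(node)
        if rng.random() < b / (b + d):  # speciation: two daughters
            for _daughter in range(2):
                parent[next_id] = node
                rates[next_id] = (mutate_rate(b, rng), mutate_rate(d, rng))
                extant.append(next_id)
                next_id += 1
        # otherwise: extinction, and the lineage is simply dropped
    return None


result = simulate_tree(num_tips=50, birth=1.0, death=0.0, seed=1)
```

Because daughters inherit mutated rates, fast-speciating lineages tend to produce fast-speciating descendants, which is what breaks the self-similarity of a homogeneous process.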


Tree Shape

[Figure: example balanced and imbalanced trees]

  • Weighted mean imbalance (I): example node values shown are I = 0, I = 0.5, and I = 1

  • Expectation under the equal rates Markov (ERM) model: I = 0.5


Tree Topology Simulation

  • Simulated trees were compared with published phylogenies using measures of tree shape.

    • 200 trees of 10000 taxa under constant rates (standard model)

    • 200 trees of 10000 taxa under variable rates (our model)

  • 433 trees were collected from various sources and sorted based on the method used to estimate the phylogeny and the proportion of the ingroup sampled.

  • Weighted mean imbalance (I) was used to compare the simulated trees with published trees
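
As an illustration of a node-based imbalance measure, the sketch below computes one common definition, the Fusco and Cronk node imbalance: I = (B - m) / (M - m), where B is the size of the larger daughter clade, m = ceil(S/2) is the most balanced split of S tips, and M = S - 1 is the most imbalanced one. The slides do not give the weighting scheme, so this sketch takes a plain unweighted mean; the nested-tuple tree format is also an assumption.

```python
import math


def count_tips(tree):
    """Number of leaves in a nested-tuple tree; leaves are strings."""
    if isinstance(tree, str):
        return 1
    return sum(count_tips(child) for child in tree)


def node_imbalances(tree):
    """Yield (S, I) for every bifurcating node with more than 3 tips."""
    if isinstance(tree, str):
        return
    left, right = tree
    n_left, n_right = count_tips(left), count_tips(right)
    size = n_left + n_right
    if size > 3:                         # I is undefined for 2 or 3 tips
        big = max(n_left, n_right)
        smallest = math.ceil(size / 2)   # most balanced split
        largest = size - 1               # most imbalanced split
        yield size, (big - smallest) / (largest - smallest)
    yield from node_imbalances(left)
    yield from node_imbalances(right)


def mean_imbalance(tree):
    """Unweighted mean of I over all informative nodes."""
    values = [i for _, i in node_imbalances(tree)]
    return sum(values) / len(values)


balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
six_tips = (balanced, ("e", "f"))
```

Here `balanced` has I = 0 at its root, `caterpillar` has I = 1, and the root of `six_tips` (a 4 versus 2 split) has I = 0.5.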


Comparing Trees

[Series of six plots: weighted mean imbalance (I) versus ln(node size)]


Million-taxon Trees

  • Three trees ranging from simple to complex were simulated

    • Equal rates tree

    • Variable rates tree

    • Variable rates tree with mass extinctions


Multi-layered simulations for character evolution

  • Key molecule simulation (Muse, Hillis)

  • RNA macro-evolution simulation (Kim)

  • RNA micro-evolution simulation (Kim, Meyers)

  • Experimental viral evolution (Turner)


  • Key molecule simulation (Muse, Hillis, Holder)

    • Estimate statistical parameters for real molecules (e.g., rbcL) using HyPhy, extend the model family to include more discrete rate distributions and positional dependencies, and finally generate a very large tree of 10^6 to 10^7 taxa using the key molecule models as its basis.

    • rbcL model family estimated under codon-specific model (Muse)

    • rRNA gene model (including secondary structure; Hillis and Gutell)

[Figure: rbcL model family with invariable sites. Example parameter sets, each with a scalar, a ratio, and a vector (parameter symbols not shown): 0.8, 0.5, (0.1, .., 0.5); 2.1, 1.3, (0.1, .., 0.2); 1.1, 1.7, (0.3, .., 0.2); and so on.]


Simulation of complex evolutionary processes

  • Reflect more complex dynamics

  • Heterogeneous rates:

    • lineage- and site-specific mutation rates

    • genomic-context-dependent rates

  • Phenotypic effects

    • Selection

    • Population interaction


RNA and its secondary structure as a model system for genotype-phenotype evolution


  • Micro-Macro simulation model (Meyers, Kim)

    • Generate a population of molecules incorporating a fitness model and speciation process based on RNA folding. Fitness is derived from (1) similarity to known 16S RNA (~67k seqs); (2) similarity to known 16S structure (~200 crystal structures); (3) folding stability

  • Experimental viral evolution (Turner; non-ITR funding for empirical work)

    • Use the RNA bacteriophage phi-6 system to generate an experimental phylogeny (~64-taxon tree with host switching and horizontal transfer)


Individual-based simulation (E. Miller and L. Ancel)

[Figure labels: Different adaptive peaks; More fit]


Strategy for macro-evolution

Compute the probability of fixation of different mutation types using Kimura’s derivations. Draw the waiting time for each event from an exponential process.

[Figure labels: mutation, fixation]
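
The two steps above can be sketched as follows. The fixation probability uses Kimura's diffusion approximation for a new mutation in a diploid population whose effective size equals its census size N; whether the simulator uses this exact form is an assumption, as are the function names and the example mutation classes.

```python
import math
import random


def fixation_probability(s, pop_size):
    """Kimura's approximation for a new mutation at frequency 1/(2N)."""
    if abs(s) < 1e-12:
        return 1.0 / (2 * pop_size)      # neutral limit
    exponent = -4 * pop_size * s
    if exponent > 700:                   # strongly deleterious: avoid overflow
        return 0.0
    return (1 - math.exp(-2 * s)) / (1 - math.exp(exponent))


def next_substitution(mutation_classes, pop_size, rng):
    """Draw the waiting time and class of the next fixed mutation.

    mutation_classes maps a label to (mutation rate per generation per
    genome copy, selection coefficient s).
    """
    # Substitution rate of a class = new mutations per generation x P(fixation).
    rates = {
        label: 2 * pop_size * mu * fixation_probability(s, pop_size)
        for label, (mu, s) in mutation_classes.items()
    }
    total = sum(rates.values())
    waiting_time = rng.expovariate(total)   # exponential waiting time
    label = rng.choices(list(rates), weights=list(rates.values()))[0]
    return waiting_time, label


classes = {                                 # made-up example values
    "advantageous": (1e-6, 0.01),
    "neutral": (1e-5, 0.0),
    "deleterious": (1e-4, -0.01),
}
wait, kind = next_substitution(classes, pop_size=1000, rng=random.Random(3))
```

Treating fixations as a Poisson process is what justifies the exponential waiting time: the next event arrives at the summed rate, and its type is chosen in proportion to each class's share of that rate.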


Mutations in RNA

[Figure: a mutation of unknown effect (?) classified as Advantageous, Neutral, or Deleterious]


Folding energy based fitness model

[Figure: two example structures with folding energies of -491.07 J/mol and -636.71 J/mol]

Assumption: A thermodynamically more stable structure is more fit.


2.2 A Free Energy Based Schema

[Diagram: molecules M0 (E0), M1 (E1), M2 (E2), M3 (E3), ..., Mi (Ei), ..., Mn (En), each labeled with its free energy]


Computation

For each ancestral RNA molecule, enumerate all its mutants.

Compute Ei, the free energy of RNA molecule Mi.

RNAeval from the Vienna RNA package computes Ei for all possible single mutants of an RNA molecule in 5~6 minutes using one CPU (2 GHz).

Draw the new descendant molecule according to the convolution of mutation probability and fixation probability from the free energy calculations.
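
The enumerate-and-draw loop can be sketched as below. `toy_energy` is a placeholder that stands in for a real folding-energy call such as RNAeval, and the exponential fixation weight and uniform mutation probability are simplifying assumptions, not the simulator's actual model.

```python
import math
import random

BASES = "ACGU"


def single_mutants(seq):
    """All sequences that differ from seq by one base substitution."""
    return [
        seq[:i] + base + seq[i + 1:]
        for i in range(len(seq))
        for base in BASES
        if base != seq[i]
    ]


def toy_energy(seq):
    """Placeholder for a folding-energy call: counts G and C as stabilizing.
    Purely illustrative, not a thermodynamic model."""
    return -1.0 * sum(base in "GC" for base in seq)


def draw_descendant(ancestor, rng, energy=toy_energy, scale=1.0):
    """Sample a mutant with weight = mutation probability x fixation weight."""
    mutants = single_mutants(ancestor)
    e0 = energy(ancestor)
    mutation_prob = 1.0 / len(mutants)      # assumption: uniform mutation
    # Assumption: lower energy is fitter, so the fixation weight grows
    # with the energy drop relative to the ancestor.
    weights = [
        mutation_prob * math.exp(-(energy(m) - e0) / scale) for m in mutants
    ]
    return rng.choices(mutants, weights=weights)[0]


ancestor = "GGGAAACCC"
descendant = draw_descendant(ancestor, random.Random(0))
```

A sequence of length L has 3L single-substitution mutants, which is why the energy evaluation dominates the cost per ancestral molecule.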


Acceptance-Rejection Method

Enumerate the energy (= fitness) landscape around the ancestor and find the minimum (most fit).

In the descendant, assume that the energy differential to the local minimum is the same as in the ancestor. Sample a new mutation and accept or reject it as a conditional event vis-à-vis the local minimum.


New RNA macro simulator

  • Can simulate folding-energy-dependent evolution efficiently (an estimated 30 days for 1 million taxa on a 20-CPU 2 GHz cluster)

  • Produces secondary structure changes and records history of changes

  • Produces indel events and alignment history; it will output files with indels and the correct alignment

  • Parameterized with empirical data statistics (Hillis, Gutell)


Alignment

Top is the homologous alignment. Bottom is the ClustalW alignment.

The first sequence is the root RNA; the others are randomly chosen leaf RNAs.


[Figure panels: Statistics from 100 Eukaryote ssRNA; Statistics from RNASim]


Heuristic Search Landscape Properties

  • rbcL: 467 taxa, 660 sites

  • RNA simulator: 512 taxa

  • Seq-Gen: 512 taxa

    • rate heterogeneity: 0 (no gamma dist.), gamma=1

  • Sample 660 sites from each dataset without replacement

  • Call PAUP* Hsearch with default settings and a time limit of 6 hrs

  • Report best parsimony score at each second


Normalized Parsimony Score Excess (NPSE)

  • Let B(t) be the best parsimony score at time t; let B(0) be the score of the starting tree

  • B is monotonically non-increasing

  • Assume we run the heuristic search for 6 hrs. The NPSE is defined as

    NPSE(t) = (B(t) - B(6 hrs)) / (B(0) - B(6 hrs))
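
A short sketch of the NPSE computation, assuming the best score is recorded at regular intervals with the first entry equal to B(0) and the last entry equal to the score when the search ends. The trajectory values are made up for illustration.

```python
def npse(scores):
    """Normalized parsimony score excess for a trajectory of best scores.

    scores[0] is B(0), the starting-tree score; scores[-1] is the score
    at the end of the search (6 hrs in the slides).
    """
    start, final = scores[0], scores[-1]
    if start == final:
        raise ValueError("search never improved on the starting tree")
    return [(b - final) / (start - final) for b in scores]


trajectory = [1200, 1150, 1110, 1102, 1100, 1100]   # made-up scores
values = npse(trajectory)
```

The normalization maps every search onto the interval from 1 (starting tree) to 0 (final score), so trajectories from datasets with very different raw parsimony scores can be plotted on the same axis.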


Experimental Evolution of Phi 6 virus

[Figure: experimental treatments over 350 generations: P. phaseo.; P. pseudo.; P. phaseo. b’neck; P. pseudo. b’neck; alternating b’neck; increasing P. phaseo. pop size; decreasing P. phaseo. pop size]


[Figure: experimental phylogeny over 350 generations, with clones for sequencing marked. Lineage labels: A71 through P71; A41, E41, F41, H41, J41, L41, P41; A61; A51 through H51; A31 through D31; A21, B21]

Whole-genome sequencing 50% complete (Penn Genomics Inst funds)


Next Steps: Introduce Reticulations


Christina Burch, UNC

Paul Higgs, McMaster

Laura Landweber, Princeton

Carlo Maley, Wistar Inst.

Claus Wilke, UT Austin

John Yin, U Wisconsin


Year 3 Status

  • Complex Simulations

    • Scaling up to 1 million-taxon tree

      • Generated Complex Branching Models

      • Developed Novel Importance Sampling Scheme

      • Parallelized HyPhy

  • Simulation Database

    • Scaling and Code Hardening

      • Developed New Extensions to Indexing

      • Developed and tested with a test suite

  • Experimental Evolution

    • Generated 16 new lineages evolved for 350 generations under heterogeneous conditions

    • 50% of whole genomes sequenced

