RNAsim/CRIMSON Algorithm Benchmark Suite

U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson, Steve Fisher, Sheng Guo
U Texas: David Hillis, Lauren Meyers, Tracey Heath, Derrick Zwickl
NC State: Spencer Muse
Florida State: Mark Holder
Yale: Paul Turner

Goal: Develop validated datasets of sufficient complexity and scale to realistically benchmark the latest tree algorithms

Benchmark Infrastructure (diagram labels): Model Characterization; Simulators; Character Evolution Simulators; Taxon Sampling; Database; Tree Topology Simulators; Data Subset with Associated Subtree; Others (Tree/Char Combined, Experimental Evolution, Virtual Cell, etc.); Model Sampling; Format Translators; RNAsim; CRIMSON; PAUP*, etc.

Benchmark Scheme
• Generate a very large dataset (>10^6 positions) over a very large tree (>10^6 taxa) using various models of evolution
• Store the data in a database
• Retrieve subsets of the data by various sampling schemes

RNA macro-evolution simulation (Sheng Guo, Lisan Wang)
• Incorporates secondary-structure constraints and indels, using a simulator based on edit mutations.
• A set of edit operators is implemented (for example, stem edit); each operates on the evolving strings with a characteristic wait time.
• The ancestral molecule is based on a known rRNA gene with a putative known secondary structure.
• Evolution of the secondary structure is tracked.
Figure (ancestor to descendant), edit operators shown: delete stem pair, change base, initiate new stem, insert base, delete base, add stem pair.
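The idea of edit operators with characteristic wait times can be sketched as competing exponential clocks (a Gillespie-style loop). This is a minimal illustration, not the RNAsim implementation: it covers only the three base-level operators (change, insert, delete), ignores secondary structure and stem operators entirely, and the rates in OPERATORS are hypothetical values chosen for the example.

```python
import random

BASES = "ACGU"

def change_base(seq, rng):
    # Replace one randomly chosen base with a different base.
    i = rng.randrange(len(seq))
    new = rng.choice([b for b in BASES if b != seq[i]])
    return seq[:i] + new + seq[i + 1:]

def insert_base(seq, rng):
    # Insert a random base at a random position (including either end).
    i = rng.randrange(len(seq) + 1)
    return seq[:i] + rng.choice(BASES) + seq[i:]

def delete_base(seq, rng):
    # Delete one base, but never shrink the molecule below length 1.
    if len(seq) <= 1:
        return seq
    i = rng.randrange(len(seq))
    return seq[:i] + seq[i + 1:]

# Hypothetical per-operator rates (events per unit time); the mean wait
# time of each operator is the reciprocal of its rate.
OPERATORS = [(change_base, 1.0), (insert_base, 0.1), (delete_base, 0.1)]

def evolve(seq, duration, rng):
    """Apply edit operators to seq along a branch of the given duration."""
    ops, rates = zip(*OPERATORS)
    total_rate = sum(rates)
    t = rng.expovariate(total_rate)          # time of the first event
    while t < duration:
        op = rng.choices(ops, weights=rates)[0]  # which clock fired
        seq = op(seq, rng)
        t += rng.expovariate(total_rate)     # time of the next event
    return seq
```

For example, evolve("GGGAAACCC", 5.0, random.Random(1)) returns a descendant string over the same alphabet whose length may differ from the ancestor's because of the indel operators.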

Fixation probability as a function of fitness
Parameters:
• Ne: effective population size
• neutral mutation rate
• s: fitness change
Cases: Neutral; Advantageous (s > 0) / Deleterious (s < 0); Compensatory Mutation
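The slide does not give the formula used, so the following is only a sketch under a stated assumption: Kimura's diffusion approximation for a new mutation in a diploid population whose census size equals Ne, P = (1 - exp(-2s)) / (1 - exp(-4 Ne s)), with the neutral limit 1 / (2 Ne). Compensatory mutations are not modeled here.

```python
import math

def fixation_probability(s, ne):
    """Fixation probability of a new mutation with fitness change s.

    Assumes Kimura's diffusion approximation for a diploid population
    with census size equal to the effective size ne (an assumption; the
    slides do not state which formula RNAsim uses).
    """
    if abs(s) < 1e-12:
        # Neutral limit of the formula below.
        return 1.0 / (2.0 * ne)
    exponent = -4.0 * ne * s
    if exponent > 700.0:
        # Strongly deleterious: the denominator overflows, P is ~0.
        return 0.0
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return -math.expm1(-2.0 * s) / -math.expm1(exponent)
```

With ne = 100, an advantageous change (s = 0.01) fixes more often than a neutral one, and a deleterious change (s = -0.01) less often, which is the qualitative pattern the slide describes.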

One-step mutation ensemble of an RNA

Weaker Selection

Calibration on Empirical Data (figure panels: Simulated RNA; 100 Eukaryotic ssRNA)

Example: Pairwise Similarity of 1000 locally optimal ML trees (MDS plot; panels: Empirical Data, RNAsim, ROSE, SeqGen)

CPU Time to reach local optimum (PAUP* ML, TBR)

1 Million Leaves (Tracey Heath; Birth-Death Model with variable rates)
• 20 data replicate partitions simulated and stored at SDSC
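As a rough illustration of how a birth-death process grows a tree topology, here is a minimal sketch. It is a deliberate simplification of what the slide describes: rates are constant rather than variable, branch lengths are not recorded, and extinct lineages are left in the parent map instead of being pruned. The function name and return format are hypothetical.

```python
import random

def simulate_birth_death(n_leaves, birth, death, rng):
    """Grow a topology until n_leaves lineages are alive.

    Returns (parent, alive): parent maps node id -> parent id (the root
    maps to None) and alive lists the ids of the extant tips. Constant
    rates only; the whole process restarts if every lineage dies out.
    """
    while True:
        parent = {0: None}
        alive = [0]
        next_id = 1
        while 0 < len(alive) < n_leaves:
            # Pick the lineage that experiences the next event and remove
            # it from the live list (swap with the last entry, then pop).
            i = rng.randrange(len(alive))
            node = alive[i]
            alive[i] = alive[-1]
            alive.pop()
            if rng.random() < birth / (birth + death):
                # Speciation: the lineage is replaced by two children.
                for _ in range(2):
                    parent[next_id] = node
                    alive.append(next_id)
                    next_id += 1
            # Otherwise the lineage goes extinct and leaves no children.
        if alive:
            return parent, alive
```

Calling simulate_birth_death(50, 1.0, 0.3, random.Random(7)) yields 50 extant tips, each of which can be traced back to node 0 through the parent map.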

Crimson (Stephen Fisher, Susan Davidson, Junhyong Kim)
• Facilitates the extraction of sub-trees from very large phylogenetic trees
• Trees loaded into a shared database (Oracle or MySQL)
• Extensive tree sampling options
• Save query output to NEXUS or PHYLIP files
• Include PAUP* commands in query output files
• Comprehensive graphical dialogs
• Command line interface allowing Python-like scripting
• Display trees with Walrus 3D Viewer

Query Options
• Species Selection
  • Select All
  • Random Selection
  • Select By Temporal Depth
    • Same number of samples per sub-tree
    • Weight sampling of sub-trees by number of leaves
  • Select By Species Level
    • Same number of samples per sub-tree
    • Weight sampling of sub-trees by number of leaves
  • Manual Selection
• Sequence Selection
  • Select All
  • Random Selection
  • Manual Selection
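The two per-sub-tree sampling modes listed above can be sketched as follows. This is not Crimson's code or API: the function names and the dictionary representation of sub-trees (sub-tree name mapped to its list of leaf names) are assumptions made for the example, and how Crimson actually weights its draws is not stated in the slides.

```python
import random

def sample_equal(subtrees, per_subtree, rng):
    """Same number of samples per sub-tree (fewer if a sub-tree is small)."""
    picked = []
    for name in sorted(subtrees):
        leaves = subtrees[name]
        picked.extend(rng.sample(leaves, min(per_subtree, len(leaves))))
    return picked

def sample_weighted(subtrees, total, rng):
    """Draw `total` leaves so that larger sub-trees contribute more.

    Sampling uniformly without replacement from the pooled leaves makes
    the expected number of picks per sub-tree proportional to its number
    of leaves.
    """
    pool = [leaf for name in sorted(subtrees) for leaf in subtrees[name]]
    return rng.sample(pool, min(total, len(pool)))
```

With sub-trees of 4 and 2 leaves, sample_equal with per_subtree=2 returns two leaves from each, while sample_weighted draws on average twice as many leaves from the larger sub-tree.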

Depth Threshold Distribution (figure; levels L-1 through L-8)

Crimson Interface

Current Benchmarking Effort
• Sample #1
  • 10 leaves per sampled tree
  • Repeat taxon sampling 40 times per replicate data partition
• Sample #2
  • 100 leaves per sampled tree
  • Repeat taxon sampling 30 times per replicate data partition
• Sample #3
  • 1,000 leaves per sampled tree
  • Repeat taxon sampling 20 times per replicate data partition
• Sample #4
  • 10,000 leaves per sampled tree
  • Repeat taxon sampling 10 times per replicate data partition
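The size of this sampling plan is easy to tally. The sketch below assumes that the plan is applied to each of the 20 replicate data partitions mentioned on the 1 Million Leaves slide; the slides do not state that explicitly, so N_PARTITIONS is an assumption, and the names are hypothetical.

```python
# (leaves per sampled tree, taxon-sampling repeats per replicate partition)
SAMPLING_PLAN = [(10, 40), (100, 30), (1_000, 20), (10_000, 10)]

# Assumption: the plan is run on all 20 replicate data partitions.
N_PARTITIONS = 20

def plan_totals(plan, n_partitions):
    """Count sampled trees and sampled leaves over all partitions."""
    trees_per_partition = sum(repeats for _, repeats in plan)
    leaves_per_partition = sum(leaves * repeats for leaves, repeats in plan)
    return {
        "sampled_trees": trees_per_partition * n_partitions,
        "sampled_leaves": leaves_per_partition * n_partitions,
    }
```

Under that assumption, plan_totals(SAMPLING_PLAN, N_PARTITIONS) gives 2,000 sampled trees (100 per partition) and 2,468,000 sampled leaves in total.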

Algorithms (to be expanded)
• Neighbor Joining (PAUP*)
  • breakties=random
• Parsimony (PAUP*)
  • set maxtrees=200 increase=no
  • hsearch timelimit=432000
  • contree all /strict=no majrule=yes
• RAxML (raxmlHPC)
  • -f a
  • -# 100
  • -m GTRGAMMA
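One way to turn the settings above into runnable inputs is sketched below. The PAUP* commands and the three RAxML options are taken from the slide; everything else is an assumption added for the example: the helper names, the seed value, and the file and run names. The -x, -p, -s, and -n options are standard raxmlHPC options that the rapid-bootstrap mode (-f a) needs in practice, but the slide does not list them. Nothing here executes either program.

```python
# Commands copied from the slide.
NEIGHBOR_JOINING = ["nj breakties=random"]
PARSIMONY = [
    "set maxtrees=200 increase=no",
    "hsearch timelimit=432000",
    "contree all /strict=no majrule=yes",
]

def paup_block(commands):
    """Wrap PAUP* commands in a NEXUS paup block (one command per line)."""
    lines = ["begin paup;"]
    lines += ["    " + cmd.rstrip(";") + ";" for cmd in commands]
    lines.append("end;")
    return "\n".join(lines)

def raxml_args(alignment, run_name, seed=12345):
    """Build a raxmlHPC argument list; seed and names are hypothetical."""
    return [
        "raxmlHPC",
        "-f", "a",          # from the slide
        "-#", "100",        # from the slide
        "-m", "GTRGAMMA",   # from the slide
        "-x", str(seed),    # assumption: rapid-bootstrap seed
        "-p", str(seed),    # assumption: parsimony seed
        "-s", alignment,    # assumption: input alignment file
        "-n", run_name,     # assumption: output run name
    ]
```

The text from paup_block(PARSIMONY) could be appended to a NEXUS data file, and the list from raxml_args could be passed to subprocess.run once raxmlHPC is installed.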

Benchmarking Stats

Distribution of False Positive Edges
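A false positive edge is commonly defined as a bipartition present in the estimated tree but absent from the true tree; the slides do not spell out the definition used here, so treat that as an assumption. The sketch below represents trees as nested tuples of leaf names, which is a representation chosen only for this example.

```python
def leaf_sets(tree, out):
    """Return the leaf set under `tree`, appending every clade to `out`."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves

def splits(tree):
    """Non-trivial bipartitions, each stored as the side that excludes a
    fixed reference leaf so that equal splits compare equal."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)
    result = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits (a single leaf versus the rest, or empty).
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return result

def false_positive_edges(estimated, true):
    """Bipartitions found in the estimated tree but not in the true tree."""
    return splits(estimated) - splits(true)
```

For the true tree ((("A","B"),"C"),("D","E")) and the estimate ((("A","C"),"B"),("D","E")), the only false positive edge is the one separating A and C from B, D, and E.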

Computational Difficulty of Dataset Versus Accuracy (figure; axis units shown: sec, hr, hr)

RAxML Computation Time (Heuristic) Over 30 Random 100-taxon Tree Replicates

Thanks to: Susan Davidson, Steve Fisher, Sheng Guo, David Hillis, Tracey Heath, Lisan Wang, Yifeng Zhang, Derrick Zwickl
Please ask and talk to: Steve Fisher, Sheng Guo, Lisan Wang
Please see the CRIMSON demo by Steve Fisher
