SysMO-DB: Sharing and Exchanging Data and Models in Systems Biology

Katy Wolstencroft University of Manchester SysMO-DB: Sharing and Exchanging Data and Models in Systems Biology

SysMO-DB DB A data access, model handling and data integration platform for Systems Biology: • To support and manage the diversity of • Data, Models and experimental protocols • Local data management systems • That promotes shared understanding • Using a common platform and common technologies

Systems Biology Challenges • Interdisciplinary work • Heterogeneous data and models • Modellers and experimentalists have different skills, training, experience • Modellers and experimentalists have different vocabularies and jargon • Working together

Systems Biology of Microorganisms http://www.sysmo.net • Pan European collaboration • Eleven individual projects, 91 institutes • Different research outcomes • A cross-section of microorganisms, incl. bacteria, archaea and yeast • Record and describe the dynamic molecular processes occurring in microorganisms in a comprehensive way • Present these processes in the form of computerized mathematical models • Pool research capacities and know-how • Already running since April 2007 • Runs for 3-5 years • This year, 2 new projects join and 6 leave

The Problem No one concept of experimentation or modelling No planned, shared infrastructure for pooling 

Types of data • Multiple omics • genomics, transcriptomics • proteomics, metabolomics • fluxomics, reactomics • Images • Molecular biology • Reaction Kinetics • Models • Metabolic, gene network, kinetic • Relationships between data sets/experiments • Procedures, experiments, data, results and models • Analysis of data

Started in June 2008 Web-based solution to facilitate: exchange of data, models and processes (intra- and inter- consortia) search for data, models and processes across the initiative maximisation of the "shelf life" and utility of the data, models and processes generated dissemination of results DB SysMO-DB

SysMO-DB Team Hits, Germany Stuart Owen Wolfgang Müller Carole Goble Isabel Rojas Olga Krebs SABIO-RK Katy Wolstencroft University of Manchester, UK Finn Bacall Taverna myExperiment Jacky Snoep JWS Online University of Stellenbosch, South Africa University of Manchester, UK

SysMO-DB PALS team Power Contributors. 21 Postdocs and PhD students Design and technical collaboration team Intense collaboration UK and Continental PALS Chapters Audits and Sharing. Methods, data, models, standards, software, schemas, spreadsheets, SOPs….. 20 questions Deployment into Projects

Principles… A series of small victories Realistic Don‘t reinvent Sustainable and extensible Migrate to standards Provide instant gratification Incremental development Fitting in with normal lab practices

The Lowest Hanging Fruit SysMO SEEK – a catalogue of assets SysMO Yellow Pages The people and their expertise The institutions and their facilities Data – experimental data sets Data – analysed results Data – external reference data sets Models Processes – laboratory protocols and bioinformatics analyses Publications The catalogue references assets held elsewhere

SEEK screenshot?

Harvesters JERM SysMOLab Wiki COSMIC Alfresco MOSES Wiki BaCell-SysMO Alfresco ANOTHER A DATA STORE

Why not a central Warehouse? • Protective of models • in progress vs published models. • Access and Version management • Curator-Rival conflict • Reluctant to share data • Even within their own projects • Legacy spreadsheets dominate • Curation practices vary • Centralised archive take-up • Point to Point Exchange • People don’t mind sharing methods • People want to advertise publications Nature461, 145 (10 Sept09)

Just Enough Sharing Access Permissions Reusing myExperiment

SysMO-DB Architecture Models Assets and Yellow Pages Catalogues SysMO-SEEK web interface SysMO DB JERM Data Processes

Making use of the Assets • Understanding the content of the data • Linking assets together • Linking assets to experimental context • Running comparisons between data files • Running model simulations • Running data analysis pipelines

What is the JERM? JERM • JERM “Just Enough Results Model” • Minimum information to exchange data • What type of data is it • Microarray, growth curve, enzyme activity… • What was measured • Gene expression, OD, metabolite concentration…. • What do the values in the datasets mean • Units, time series, repeats…. • Which experiment does it relate to? • How does it relate to models? • How was the data created • SOPs and protocols

CIMRCore Information for Metabolomics Reporting MIABEMinimal Information About a Bioactive Entity MIACAMinimal Information About a Cellular Assay MIAMEMinimum Information About a Microarray Experiment MIAME/EnvMIAME / Environmental transcriptomic experiment MIAME/NutrMIAME / Nutrigenomics MIAME/PlantMIAME / Plant transcriptomics MIAME/ToxMIAME / Toxicogenomics MIAPAMinimum Information About a Phylogenetic Analysis MIAPARMinimum Information About a Protein Affinity Reagent MIAPEMinimum Information About a Proteomics Experiment MIAREMinimum Information About a RNAi Experiment MIASEMinimum Information About a Simulation Experiment MIENSMinimum Information about an ENvironmental Sequence MIFlowCytMinimum Information for a FlowCytometry Experiment MIGenMinimum Information about a Genotyping Experiment MIGSMinimum Information about a Genome Sequence MIMIxMinimum Information about a Molecular Interaction Experiment MIMPPMinimal Information for Mouse Phenotyping Procedures MINIMinimum Information about a Neuroscience Investigation MINIMESSMinimal Metagenome Sequence Analysis Standard MINSEQEMinimum Information about a high-throughput SeQuencing Experiment MIPFEMinimal Information for Protein Functional Evaluation MIQASMinimal Information for QTLs and Association Studies MIqPCRMinimum Information about a quantitative Polymerase Chain Reaction experiment MIRIAMMinimal Information Required In the Annotation of biochemical Models MISFISHIEMinimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments STRENDAStandards for Reporting Enzymology Data TBCTox Biology Checklist BioPAX : Biological Pathways Exchangehttp://www.biopax.org/ FuGE Functional Genomics Experiment MGED: Microarray Experimental Conditions Minimum Information Models http://www.mibbi.org/index.php/MIBBI_portal

The Idea For each data type….. Transcriptomics Proteomics Metabolomics Single Cell Data ISA-TAB 1 Define a JERM….. • Top down analysis of standards • Bottom up analysis of practice 2 Generate and apply…. • JERM template • JERM extractor for data host • Subset registered in SEEK • Access / export through JERM interface / template 3

SEEK + JERM JERM Experimental Data Metadata People Investigation Homogenised terminology and values in the datasets themselves Study Projects Assay Models Experimental conditions SOPs Factors studied Workflows Based on ISA-TAB

For publishing JERM data needs to be related to SOPs, experimental context (ISA) and other data JERM must be “MIBBI” compliant for exporting to public repositories e.g. Microarray data needs to be MIAME compliant

ISA-TAB Relating data and its experimental context Investigation, Study, Assay TAB = tabular A format suitable for spreadsheets http://isatab.sourceforge.net/

ISA Provides.... A common framework for relating different types of data e.g. microarrays and proteomics Facilitates submission to international public repositories of genomics, transcriptomics and proteomics studies

Identifying Biological Objects • What do you have in your data? • Proteins/enzymes, genes/expression levels, metabolites • Where/how do these objects interact? • Pathways, flux, experimental conditions • What models describe these interactions Possible when using common frameworks, naming schemes and controlled vocabularies

BioPortal Integration for Searching • Repository for submitting and sharing Biological ontologies http://bioportal.bioontology.org/ • Search for concepts across all or selected ontologies • BioPortal provides a number of Restful Webservices • Search • Concept lookup • Visualisation • Integrated within SEEK as a plugin

Tools to help manage data:Annotation standards by stealth Controlled vocabulary plug in BioPortal

Following Standards • We recommend formats but we do not enforce them • Protocols and SOPs – Nature Protocols • Data – JERM models and community minimum information models • Models – SBML and related standards • Publications – PubMed and DOI • If you follow the prescribed formats, you get more out, but if you don’t, you can still participate • lowering the adoption barrier

Off the shelf • Except for the JERM, we have only used community resources, vocabularies and services You can get a long way by implementing community practices and providing ways to integrate them

SysMO-DB and Models

Nicolas Le Novere, Data Integration in the Life Sciences, Manchester, 2009

Models: Incentives for using Standards Models can be shared in SysMO-SEEK in any format SBML is the recommended format We also recommend MIRIAM compliance and SBO annotation If you use SBML, you can use JWS Online to run simulations in SEEK

Screenshot of JWS Online • JWS Online Plugin • online simulator, runs in your browser • upload models in SBML format • Web Service enabled • SBGN schemas, with annotations and external links

http://www.semanticsbml.org/aym Falko Krause, Humboldt-University, Berlin

Models Resources Models can be published in public repositories JWS-Online, BioModels Models can be annotated SBML, MIRIAM, SBO No public resources currently for sharing models with associated data, or for loading new data into models

Linking Data to Models Relating data and models Where did the data come from for developing the model? Where did the data come from for validating the model? What were the results of model simulations?

Current Functionality in SEEK • Show all data used for construction together with the model, such that process can be repeated • Uploaded models loaded with this data by default • Manually alter parameters and run simulations

Next Steps: Model Validation • Test/compare model with experimental data for complete system • Find data in SEEK • Upload data from elsewhere • Automatically load into model • Run simulations and compare with original results • JERM for models • Mapping tools – allows you to identify columns/rows in spreadsheets containing the right information

ISA for Models • Modelling and experimental work intersect • Investigations, Study, Assay.....or modelling analysis..... • Modelling analysis types • Metabolic models, gene networks • Modelling type • ODE, algebraic • Studies – combinations of experimental assays, modelling analyses, and informatics analyses

SysMO-DB the e-Laboratory An e-Laboratory is an information system for bringing together people, data and analytical methods at the point of investigation or decision-making

Current Status Finding things so that we can compare them Understanding who has what Understanding what can be compared with what – the experimental context

Where we are going… A dynamic resource for analysis as well as browsing Automatic comparison of data from inside files Understanding where and how data and models are linked Running simulations with new experimental data Running analyses and workflows over the data and models

Workflows from myExperiment • Data preparation, annotation and analysis • Systems Biology workflow Pack on myExperiment Microarray analysis and text mining Created by Afsaneh Maleki-Dizaji from SUMO, University of Sheffield Based on previous work by Paul Fisher, University of Manchester http://www.myexperiment.org/workflows/187

SEEK as a data analysis and meta analysis service • SBML model construction and population • Calibration workflow • Data requirements • Parameterised SBML model • Experimental data • Metabolite concentrations from key results database • Calibration by COPASI web service Peter Li

Data analysis and meta analysis SEEK Analysis Service with pre-cooked analysis tools. • Calibration workflow • Data requirements • Parameterised SBML model • Experimental data • Metabolite concentrations from key results database • Calibration by COPASI web service Load model: Load data: GO Peter Li

New Directions

Opening SysMO Out • Using SysMO as a dissemination space for the SysMO consortium • Supplementary material in publications • Data citation • Packaging software so that others can use it • Easy to install a SEEK for yourself • Packaging and exchanging JERM Templates • Helping with standardisation • Promotion and example work with SBRML and data and models linkage

SysMO-DB Approach in Other projects • SysMO2 – new projects and legacy • EraSysBio+ • Lungsys and SBCancer • Virtual Liver

New Considerations • Eukaryotic organisms • Interactions between host and pathogen • Human disease • multicellular interactions, tissues, organs • multiscale modelling

Outstanding Issues Keeping data at project sites has responsibilities • Reliability - Sites available continuously and promptly • Support - Must be proof against virus attacks, etc. • Archiving - Beyond the lifetime of the project.

SysMO-DB: Sharing and Exchanging Data and Models in Systems Biology