
The iPlant Collaborative Cyberinfrastructure


Presentation Transcript


  1. The iPlant Collaborative Cyberinfrastructure Matt Vaughn, Cold Spring Harbor Laboratory, April 2010

  2. What is iPlant? • Simply put, the mission of the iPlant Collaborative is to build Cyberinfrastructure to support the solution of the grand challenges of plant biology. • A "unique" aspect is that the grand challenges were not defined in advance, but are identified through ongoing engagement with the community. • Not a center, but a virtual organization forming grand challenge teams and relying on the national CI. • Long-term focus on sustainable food supply, climate change, biofuels, pharmaceuticals, etc. • Hundreds of participants from around the world; working group members at > 50 US academic institutions, USDA, DOE, etc.

  3. What is Cyberinfrastructure? (Originally about TeraGrid) "It was six men of Indostan, / To learning much inclined, / Who went to see the elephant / (Though all of them were blind), / That each by observation / Might satisfy his mind." WWW.TERAGRID.ORG. Like the blind men and the elephant, each observer sees something different: It's a Grid! It's a Network! They are HPC Centers! It's a Common Software Environment! It's Apps and Support! It's Storage! And more: visualization, facilities, data collections…

  4. The iPlant CI • Engagement with the CI Community to leverage best practice and new research • Unprecedented engagement with the user community to drive requirements • An exemplar virtual organization for modern computational science • A Foundation of Computational and Storage Capability • A single CI for all plant scientists, with customized discovery environments to meet grand challenges • Open source principles, commercial quality development process

  5. A Foundation of Computational and Storage Capability • iPlant is positioned to take advantage of *tremendous* amounts of NSF and institutional compute and storage resources. • Compute: Ranger, Lonestar, Stampede (UT/TeraGrid); Saguaro, Sonora (ASU); Marin, Ice (UA) • ~700 teraflops, more computing power than existed in all the Top 500 computers in the world 4 years ago • Storage: Corral, Ranch (UT), Ocotillo (ASU) • Well over 10 petabytes of storage can be made available for the project, on scalable systems capable of growing much more. • Visualization: Spur, Stallion (UT), Matinee (ASU), UA-Cave • Among the world's largest visualization systems • Virtualized/cloud services: iPlant (UA) and ASU virtual environments, vendor clouds • iPlant is positioned to use cloud technologies to deliver persistent gateways and services to users. In short, the physical aspects of cyberinfrastructure employed via iPlant, utilizing large-scale NSF investments, have capabilities second to none anywhere on the planet.

  6. A Single Cyberinfrastructure, Many Discovery Environments • iPlant is constructing one constantly evolving software environment • A single architecture and "core" • An ever-growing collection of many integrated tools and datasets (many will be externally sourced) • Transparently leveraging an evolving national physical infrastructure • Customized for particular problems/use cases through the creation of individual "Discovery Environments" (DEs): • Have an interface customized to the particular problem domain • Integrate a specific collection of tools • Utilize the common core • Several DEs may exist to address a single grand challenge • Think of these like 'applications'

  7. Open Source Philosophy, Commercial Quality Process • iPlant is open in every sense of the word: • Open access to source • Open API to build a community of contributors • Open standards adopted wherever possible • Open access to data (where users so choose) • iPlant code design, implementation, and quality control will be based on industry best practice

  8. Commercial Quality Process • Agile development methodology has been adopted • Complete product lifecycle in place: Product Definition, Requirements Elicitation, Solution Design, Software Development, Acceptance Testing • Code is only built after a rigorous requirements process: • Needs Analysis • User Personas • Problem Statement • User Stories • The Grand Challenge Engagement Team plays the role of "Product Champion" and "Customer Advocate" in this scheme

  9. Scope: What iPlant won't do • iPlant is not a funding agency: a large grant shouldn't become a bunch of small grants • iPlant does not fund data collection • iPlant will (generally) not continue funding for <favorite tool x> whose funding is ending • iPlant will not seek to replace all online data repositories • iPlant will not *impose* standards on the community

  10. Scope: What iPlant *will* do • Provide storage, computation, hosting, and lots of programmer effort to support grand challenge efforts • Work with the community to support and develop standards • Provide forums to discuss the role and design of CI in plant science • Help organize the community to collect data • Provide appropriate funding for time spent helping us design and test the CI

  11. What is the iPlant CI? • Two grand challenges defined to date: • iPlant Tree of Life (iPToL): build a single tree showing the evolutionary relationships of all green plant species on Earth • iPlant Genotype-to-Phenotype (iPG2P): construct a methodology whereby an investigator, given the genomic and environmental information about a given individual plant, can predict its characteristics. Taken together, these challenges are the key to unlocking many "holy grails" of plant biology, such as the creation of drought-resistant or pest-resistant crops, or breaking reliance on fossil-fuel-based fertilizer.

  12. What is the iPlant CI? • iPToL CI, five areas: data assembly and integration, visualization, scalable algorithms for large trees, trait evolution, tree reconciliation • iPG2P CI, five areas: data integration, visualization, modeling, statistical inference, next-gen sequencing tools • In both, a combination of applying compute resources, developing or enhancing new tools, and creating web-based "discovery environments" to integrate tools and facilitate collaboration.

  13. Genotype-to-Phenotype (G2P) Problem Statement • Given a particular: • species of plant (e.g. corn, rice) • genetic description of an individual (genotype) • growth environment • trait of interest (flowering time, yield, or any of hundreds of others) • Predict: the quantitative result (phenotype) • Reverse problem: what genotype will yield the desired result in a given environment? • A top-priority problem in plant biology (NRC)

  14. [Workflow diagram: sequence, expression, metabolic, whole-plant, and environment data feed through data integration (DI) components into visualization, modeling, and statistical inference; computational output leads to user-inferred hypotheses and experiments, with super-user and developer roles supporting the pipeline.]

  15. iPG2P Working Groups • Ultra High Throughput Sequencing: establishing an informatics pipeline that will allow the plant community to process NextGen sequence data • Statistical Inference: developing a platform using advanced computational approaches to statistically link genotype to phenotype • Modeling Tools: developing a framework to support tools for the construction, simulation, and analysis of computational models of plant function at various scales of resolution and fidelity • Visual Analytics: generating, adapting, and integrating visualization tools capable of displaying diverse types of data from laboratory, field, and in silico analyses and simulations • Data Integration: investigating and applying methods for describing and unifying data sets into virtual systems that support iPG2P activities

  16. UHTS Discovery Environment • Data sources: NCBI SRA, user local files, iPlant store • Data wrangling: quality control, preprocessing, transformation • Alignments: BWA, TopHat + Bowtie, producing SAM output • Downstream: Cufflinks for expression levels (RPKM), SAMtools for variants (VCF 3.3) • Metadata: MIAME, MINSEQE, SRA, handled by a metadata manager running on scalable services • User story: Arthur, an ecological genomics postdoc, is looking for gene regulators by eQTL-mapping expression data in a panel of recombinant inbred lines he has constructed and genotyped. Coming Q2 2010. (A sketch of the RPKM measure follows below.)
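
As a concrete anchor for the "expression levels (RPKM)" output named above, here is a minimal sketch of the standard RPKM calculation (reads per kilobase of transcript per million mapped reads). The counts and lengths in the example are made-up illustrative values, not iPlant data.

```python
# Minimal sketch of the RPKM expression measure named on this slide.
# RPKM = 10^9 * C / (L * N), where C = reads mapped to the transcript,
# L = transcript length in bp, N = total mapped reads in the library.

def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * read_count / (transcript_length_bp * total_mapped_reads)

# Example (made-up numbers): 500 reads on a 2 kb transcript
# in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # -> 25.0
```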

  17. Statistical Inference • Network Inference • QTL Mapping • Regression (fixed, random effects) • Maximum likelihood • Bayesian methods • Decision trees

  18. Computational Challenges • 6.5 million markers (two Arabidopsis-sized genomes @ 5% diversity) × 38,963 expression phenotypes (the number of transcripts in Arabidopsis measured by UHTS) • Single-SNP test: a few minutes; 100-replicate bootstrap: a few hours • Only gets larger for epistasis tests, forward model selection, and forward model selection plus bootstrapping. (A toy single-SNP scan is sketched below.)
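
To make the "single-SNP test" concrete, here is a toy, vectorized sketch of a per-marker simple-regression scan in NumPy. The array sizes and simulated data are tiny stand-ins for the 6.5-million-marker by 38,963-phenotype problem, and this is not iPlant's GLM kernel.

```python
# Toy single-SNP association scan: one linear regression of a phenotype
# on each marker, vectorized across markers.
import numpy as np

rng = np.random.default_rng(0)
n_lines, n_markers = 200, 5000                  # toy sizes, not 6.5M markers
G = rng.integers(0, 3, size=(n_lines, n_markers)).astype(float)  # 0/1/2 genotypes
y = rng.normal(size=n_lines)                    # one simulated expression phenotype

# Center genotypes and phenotype, then compute per-marker slope and t-statistic.
Gc = G - G.mean(axis=0)
yc = y - y.mean()
sxx = (Gc ** 2).sum(axis=0)                     # per-marker variance term
beta = Gc.T @ yc / sxx                          # per-marker effect estimate
resid_ss = (yc ** 2).sum() - beta ** 2 * sxx    # residual sum of squares
sigma2 = resid_ss / (n_lines - 2)
t_stat = beta / np.sqrt(sigma2 / sxx)           # test statistic per marker
print(t_stat[:5])
```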

  19. Statistical Genetics DE • Data: user local, iPlant store • Data wrangling: projection, imputation, conversion, transformation • Reconfigurable GLM computation kernel (C/MPI/ScaLAPACK, GPU, hybrid CPU) driven by user-specified configuration and driver code, deployed as a scalable service • Output: significant results • Command-line environment and API expected Q3 2010

  20. Modeling Tools • Integrated suite of tools for: • model construction & simulation • parameter estimation, sensitivity analysis • verification • Draw on existing SBML tools • Protocol converters for network models • Facilitate MIRIAM usage for code/model verification
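
As one illustration of "drawing on existing SBML tools", here is a hedged sketch that loads and sanity-checks a model with the python-libsbml bindings; the choice of that library and the model.xml path are assumptions for illustration, not something the slide specifies.

```python
# Hedged sketch: load and sanity-check an SBML model with python-libsbml
# (an assumed toolkit choice; the slide names SBML but no specific library).
import libsbml

doc = libsbml.readSBML("model.xml")   # placeholder path to an SBML file
if doc.getNumErrors() > 0:
    doc.printErrors()                 # report parse/validation problems
else:
    model = doc.getModel()
    # Basic structural summary of the loaded model.
    print(model.getId(), model.getNumSpecies(), model.getNumReactions())
```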

  21. Data Integration Principles G2P Biology is data-driven science. Integration is key: information curators already exist and do extremely good work. • No monolithic iPlant database(s) • Provide virtual databases via services • Provenance preservation • Foster and actively support standards adoption • Match orphan data sets with interested researchers & educators

  22. [Diagram: existing genetic and genomic data, plus existing expression, metabolomic, network, and physical phenotype data, flow through a data integration layer linking genotype to phenotype via powerful statistical inference; new genomic data come from re-sequencing and de novo sequencing, and new phenotype data from RNA-seq, high-throughput phenotyping, and image analysis.]

  23. High-throughput Image Analysis • Physical infrastructure: cameras, scanners, etc. • Inputs: serial images, multichannel images, volumetric data, movies • HTIP service layer: web GUI, workflow control, RESTful API (httpd), data intake and consumer processes • RDBMS: database schema for semantic storage and retrieval of images and metadata, plus storage of derived results from analysis procedures • Algorithm plugins (Python, C/C++, MATLAB) running as scalable services • Requirements elicitation ongoing. (A hypothetical plugin sketch follows below.)
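
Since the slide names Python among the algorithm-plugin languages, here is a purely hypothetical sketch of what a minimal plugin interface could look like. The class, method names, and "leaf area" example are illustrative inventions, not the HTIP API.

```python
# Hypothetical algorithm-plugin sketch for an HTIP-style service layer.
from dataclasses import dataclass

@dataclass
class ImageRecord:
    path: str        # location of the stored image
    metadata: dict   # semantic metadata as retrieved from the RDBMS

class LeafAreaPlugin:
    """Toy plugin: derive one result record per input image."""

    name = "leaf_area"   # identifier the service layer might register

    def run(self, record: ImageRecord) -> dict:
        # A real plugin would load the image and measure it; this
        # placeholder only echoes the shape of a derived result.
        return {"image": record.path, "leaf_area_cm2": None}

# The service layer would discover plugins and persist their outputs:
result = LeafAreaPlugin().run(ImageRecord("plant_001.png", {"channel": "RGB"}))
print(result)
```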

  24. [Diagram: CI empowerment strategy linking grand challenge solutions across evolutionary biology, plant ecology, plant biology, phenotyping, and plant genomics.]

  25. [Expanded strategy diagram: the same communities mapped to CI capabilities, including tree reconciliation, big trees, trait evolution, taxonomic intelligence, the green plant ToL, tree decoration, visualization, statistical inference, flowering phenology, stress & adaptation, modeling, C3/C4 evolution, image analysis, data integration, and next-gen sequencing.]

  26. Technology and the iPC CI • User → iPlant Discovery Environments: grand challenge workflows, iPlant interfaces • Tools and data: third-party tools, iPlant-built tools, community-contributed tools and data! • iPlant middleware: job submission, workflow management, service/data APIs (iRODS, grid technologies, Condor, RESTful services) • Physical infrastructure: compute, storage, persistent virtual machines (TeraGrid, Open Science Grid, UA/ASU/TACC) • Goal: build a CI that's robust, leverages national infrastructure, and can grow through community contribution! Technical questions? Contact Nirav Merchant – nirav@email.arizona.edu

  27. iPlant: Connecting Users, Ideas & Resources • Core CI foundation: • Data layer • Registry and Integration layer • Compute and Analysis layer • Interaction & Collaboration layer

  28. iPlant: Using proven technologies • Data layer: providing access to raw and ingested data sets, including high-throughput data transfers • iRODS • GridFTP, Aspera • DSpace (DuraSpace), Open Archives Initiative • Content distribution networks (CDN) • High-performance storage @ TACC (Lustre) • MySQL and Postgres database clusters • Connection to other DataONE, DataNet initiatives • Cloud-style storage (similar to Amazon S3 and Walrus). (A sketch of iRODS access from Python follows below.)
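
For a feel of the data layer, here is a hedged sketch of fetching a file from an iRODS store using the python-irodsclient package. The client choice, hostname, credentials, and path are all assumptions for illustration; the slide names iRODS but not this client.

```python
# Hedged sketch: read a data object from iRODS via python-irodsclient.
from irods.session import iRODSSession

# Host, zone, user, and path are placeholders, not real credentials.
with iRODSSession(host="data.example.org", port=1247,
                  user="alice", password="secret", zone="iplant") as session:
    obj = session.data_objects.get("/iplant/home/alice/reads.fastq")
    print(obj.name, obj.size)         # basic object metadata
    with obj.open("r") as f:          # stream the object's bytes
        first_chunk = f.read(1024)
```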

  29. iPlant: Using proven technologies • Registry and Integration layer: connecting services, data, and metadata elements using semantic understanding

  30. iPlant: Using proven technologies • Compute and Analysis layer: connecting tasks with scalable platforms and algorithms • Virtualization (Xen clusters) • High-performance computing at TACC and TeraGrid • Grid (Condor, BOINC, Gearman) • Cloud (Eucalyptus, Nimbus, Hadoop) • Reconfigurable hardware (GPU, FPGA) • Checkpoint & restart (DMTCP) • Scaling and parallelizing code (MPI)

  31. iPlant: Using proven technologies • Interaction and Collaboration layer: providing end-user access to unified services and data, from API to large-scale visualization • Google Web Toolkit (GWT-driven front end) • Messaging bus (Java Mule, XMPP/Jabber) • RESTful web services (web API access) • Single sign-on/identity management (Shibboleth, OAuth) • Integration with desktop applications (via web services) • Sharing data (DOI, persistent URL, CDN, social networks) • Large-scale visualization (Large Tree, ParaView, ENVISION)

  32. An Example Discovery Environment

  33. First DE • Support for one use case: independent contrasts (sketched below). But also… • Seamless remote execution of compute tasks on TeraGrid resources • Incorporation of existing informatics tools behind the iPlant interface • Parsing of multiple data formats into a Common Semantic Model • Seamless integration of online data resources • Role-based access and basic provenance support • Next version will support: • Ultra High Throughput Sequencing pipeline, variant detection, transcript quantification • Public RESTful API
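
Since independent contrasts is the first supported use case, here is a minimal sketch of Felsenstein's (1985) algorithm on a toy tree. The nested-tuple tree encoding is an illustrative assumption, not the DE's Common Semantic Model.

```python
# Minimal sketch of phylogenetic independent contrasts.
# A leaf is a float trait value; an internal node is
# ((left, v_left), (right, v_right)) with branch lengths v_*.

def contrasts(node):
    """Return (trait value, branch-length extension, list of contrasts)."""
    if isinstance(node, float):
        return node, 0.0, []
    (left, vl), (right, vr) = node
    xl, el, cl = contrasts(left)
    xr, er, cr = contrasts(right)
    vl, vr = vl + el, vr + er                      # extend branches below each child
    c = (xl - xr) / (vl + vr) ** 0.5               # standardized contrast
    x = (xl / vl + xr / vr) / (1 / vl + 1 / vr)    # weighted ancestral estimate
    return x, vl * vr / (vl + vr), cl + cr + [c]

# Toy 3-taxon tree: ((A:1.0, B:1.0):0.5, C:2.0) with trait values 1.2, 0.8, 2.0
tree = ((((1.2, 1.0), (0.8, 1.0)), 0.5), (2.0, 2.0))
print(contrasts(tree)[2])   # the two contrasts for this tree
```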

  34. Example Service API
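
The original slide shows the API as an image that does not survive in this transcript. In its place, here is a purely hypothetical example of calling a RESTful service of the kind slides 26 and 33 describe; the endpoint URL, payload, and response shape are invented for illustration.

```python
# Hypothetical REST call submitting a job to an iPlant-style service.
# Endpoint, parameters, and response fields are illustrative only.
import json
import urllib.request

req = urllib.request.Request(
    "https://api.example.org/v1/jobs",   # hypothetical endpoint
    data=json.dumps({"tool": "muscle",
                     "input": "/iplant/home/alice/seqs.fasta"}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    job = json.load(resp)                # e.g. {"id": "...", "status": "QUEUED"}
    print(job)
```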

  35. Acknowledgments • University of Arizona: Rich Jorgensen, Greg Andrews, Kobus Barnard, Rick Blevins, Sue Brown, Vicki Bryan, Vicki Chandler, John Hartman, Travis Huxman, Tina Lee, Nirav Merchant, Martha Narro, Sudha Ram, Steve Rounsley, Suzanne Westbrook, Ramin Yadegari • Cold Spring Harbor Laboratory, NY: Lincoln Stein, Matt Vaughn, Doreen Ware, Dave Micklos, Sheldon McKay, Jerry Lu, Liya Wang • Texas Advanced Computing Center: Dan Stanzione, Michael Gonzales, Chris Jordan, Greg Abram, Weijia Xu • University of North Carolina-Wilmington: Ann Stapleton • Funded by NSF

  36. Collaborating Institutions • CSHL: iPlant CI • EMEC: External Evaluator • TACC: iPlant CI • UNCW: iPlant CI • Field Museum of Natural History • MoBot: APWeb2 • BIEN: Taxonomic Intelligence • UCSB: Image Platform • UWISC: Image Platform • Boyce Thompson Inst.: iPG2P • KSU: iPG2P • UCD: iPG2P • VA Tech: iPG2P • Brown: iPToL • UFL: iPToL • UGA: iPToL • UPenn: iPToL • UTK: iPToL • Yale: iPToL

  37. Soft Collaborators • 1kP Consortium • ARS at USDA • BRIT: Botanical Research Institute of Texas • CGIAR and Generation Challenge Program • Cyberinfrastructure for Phylogenetic Research (CIPRES) • The Croquet Consortium • NIMBioS: National Institute for Mathematical and Biological Synthesis • Pittsburgh Supercomputing Center • pPOD: processing PhyloData • Syngenta Foundation • nanoHUB & HUBzero • ELIXIR • Fluxnet • Howard Hughes Medical Institute • Knowledgebase • NPN: National Phenology Network • PEaCE Lab: Pacific Ecoinformatics and Computational Ecology Lab • MORPH: Research Coordination Network (RCN) • NCEAS: National Center for Ecological Analysis and Synthesis • NEON: National Ecological Observatory Network • NESCent: National Evolutionary Synthesis Center

  38. Unprecedented Engagement with the Plant Science User Community • A unique engagement process • The Grand Challenge process has resulted in the most intensive user input of any large-scale CI project to date • iPlant will construct a single CI for plant science, driven by grand challenges and specific user needs • Grand Challenge Engagement Teams will continue this very close cooperation with the community • Work closely with the GC proposal team and the broader community • Build use cases to drive development

  39. An Exemplar Virtual Organization for Modern Computational Science • iPlant aims to be the gold standard against which other science-focused CI projects will be measured • One cyberinfrastructure team, many skills and roles • iPC CI creation is done by a diverse group: • Faculty, postdocs, staff, and students • Bioinformatics, biology, computing and information researchers, software engineers, database specialists, etc. • Arizona, Cold Spring Harbor, Texas, etc. • Many different tasks: engagement/requirements, tech evaluation, prototyping, software design (DE and core), data integration, systems, and many more • A single cyberinfrastructure team, where roles may change rapidly to match skill sets

  40. Timelines/Milestones • Growth in staffing & capability: from a few people in March 2009 to 47 now involved in CI across all sites • Architecture definition in August–September 2009; enough to get started, still evolving • Software environment, tools, and practices laid down about the same time • Real software development commenced in September 2009 • Serious prototyping and tool support in response to Engagement Team needs began ramping up in November

  41. Technology Eval Activities • Largest investment in semantic web activities, key for addressing the massive data integration challenges • Exploring alternate implementations of QTL mapping algorithms • Experimental reproducibility • Policy and technology for provenance management • Evaluation of HUBzero, workflow engines, and numerous other tools

  42. iPToL CI – A High-Level Overview • Goal: build very large trees, perhaps covering all green plant species • Needs: • Most of the data isn't collected, and a lot of what is collected isn't organized • Lots of analysis tools exist (probably plenty of them), but they don't work together and use many different data formats • The tree-builder tools take too long to run • The visualization tools don't scale to the tree sizes needed

  43. iPToL CI – High Level • Addressing these needs through CI: • MyPlant: the social networking site for phylogenetic data collection (organized by clade) • Provide a common repository for data without an NCBI home (e.g. 1kP) • Discovery Environment: build a common interface, data format, and API to unite tools • Enhance tree-builder tools (RAxML, NINJA, SATé) with parallelization and checkpointing (a generic checkpoint/restart pattern is sketched below) • Build a remote visualization tool capable of running where we can guarantee RAM resources
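
The tree-builder enhancements mention checkpointing; here is a generic checkpoint/restart pattern sketch, not RAxML, NINJA, or SATé code. The state shape and file names are assumptions, chosen to show the atomic-write idiom such an enhancement typically uses.

```python
# Generic checkpoint/restart pattern for a long-running search loop.
import os
import pickle

CKPT = "search.ckpt"   # placeholder checkpoint file name

def load_state():
    """Resume from the last checkpoint if one exists."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "best_score": float("-inf")}

def save_state(state):
    """Write to a temp file, then rename atomically, so a crash
    mid-write cannot corrupt the checkpoint."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

state = load_state()
while state["iteration"] < 1000:      # stand-in for a tree-search loop
    state["iteration"] += 1           # ... improve the tree here ...
    if state["iteration"] % 100 == 0:
        save_state(state)             # periodic checkpoint
```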
