Service-Oriented Bioscience Cluster at OSC Umit V. Catalyurek

Service-Oriented Bioscience Cluster at OSC Umit V. Catalyurek Associate Professor Dept. of Biomedical Informatics Dept. of Electrical & Computer Engineering The Ohio State University

Origins of caBIG • Goal: Enable investigators and research teams nationwide to combine and leverage their findings and expertise in order to meet the NCI 2015 Goal: “Relieve suffering and death due to cancer by the year 2015” • Strategy: Create a scalable, actively managed organization that will connect members of the NCI-supported cancer enterprise by building a biomedical informatics network

Driving needs: cancer Biomedical Informatics Grid • A multitude of “legacy” information systems, most of which cannot be readily shared between institutions • An absence of tools to connect different databases • An absence of common data formats • A huge and growing volume of data must be collected, analyzed, and made accessible • Few common vocabularies, making it difficult, if not impossible, to interlink diverse research and clinical results • Difficulty in identifying and accessing available resources • An absence of information infrastructure to share data within an institution, or among different institutions

What is caBIG? • Common, widely distributed infrastructure that permits the cancer research community to focus on innovation • Shared, harmonized set of terminology, data elements, and data models that facilitate information exchange • Collection of interoperable applications developed to common standards • Cancer research data available for mining and integration

What is caGrid? • A grid-based software infrastructure consisting of services, toolkits, APIs, and applications • A production grid deployment of the core services provided by that infrastructure • A community of developers leveraging that grid and infrastructure to provide applications and services to the cancer research community

What is caGrid? • Development project of Architecture Workspace • The Grid infrastructure for caBIG (the “G” in caBIG) • Driven from use cases and needs of cancer research community • Service Oriented Architecture • Based on federation • Model Driven • Object-Oriented, Semantically-Annotated Data Virtualization

What is caGrid? cont… • Builds on existing Grid technologies • Provides additional enterprise Grid components • Grid Service Graphical Development Toolkit • Metadata Infrastructure • Advertisement and Discovery • Semantic Services • Data Service Infrastructure • Analytical Service Infrastructure • Identifiers • Workflow • Security Infrastructure • Client tooling

caGrid Community Involvement • caGrid itself provides no real “data” or “analysis” to caBIG™; it is the enabling infrastructure that allows the community to do so • Community members add value to the grid as applications, services, and processes (for example: shared workflows) • caGrid provides the necessary core services, APIs, and tooling • The real “value” of the grid comes from bringing this information to the “end user” • Community members develop end-user applications that consume the resources provided by the grid

caGrid @ OSC • Goals: • Create an expandable caGrid Installation at OSC • Deploy Pilot Applications to demonstrate Service Oriented Access to HPC resources • Dorian, GTS and Index services are deployed • cagrid-dorian01.osc.edu • cagrid-gts01.osc.edu • cagrid-index01.osc.edu • SyncGTS along with Dorian and Index for performance • caGrid 1.2 was released this week, and we deployed it!

Pilot Application: TMA • Image Mining for Performing Comparative Analysis of Expression Patterns in Tissue Microarrays • Project funded by NIH R01 (PI: David Foran, Co-PI: Joel Saltz) • Development of innovative methods for analysis of tissue microarrays • Computation of features, annotations of image data based on features • Development of software support • to manage and share tissue microarray data and analysis results • to process large volumes of tissue microarray data on high performance systems • Development of ability to share data and analytical resources using caGrid • Supports the Help Defeat Cancer project, which involves 100,000 imaged histology specimens originating from breast, head & neck, and colorectal cancers.

TMA Analytical Service Implementation • TMA Application is a pipelined workflow • Several processing steps that need to be applied in sequence to the images • Build a prototype workflow orchestration system • Wraps a program execution • Stages the data in • Invokes the executable • Retrieves the output files • Uses caGrid’s bulk data transfer to move files from host to host • Interacts with a scheduler to allocate resources for the execution • Executable can be a parallel/distributed application • TMA user interface • Specify the workflow • List with executables and parameters • Invoke the service for the first stage
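
The stage-in / invoke / retrieve cycle on the slide above can be illustrated with a minimal Python sketch. This is not the caGrid API or the TMA service code: `Stage`, `run_stage`, `run_pipeline`, and `demo` are hypothetical names, a local file copy stands in for caGrid's bulk data transfer, and a direct `subprocess` call stands in for submission to a scheduler. The two toy "executables" are one-line Python scripts chosen only so the sketch runs anywhere.

```python
import shutil
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Stage:
    """One pipeline step (hypothetical structure, not a caGrid type)."""
    name: str
    command: list  # argv; "{input}" and "{output}" are replaced with file paths


def run_stage(stage, input_file, work_root):
    """Stage the data in, invoke the executable, and return the output path."""
    work_dir = Path(tempfile.mkdtemp(prefix=stage.name + "_", dir=work_root))
    # Stage in: a local copy stands in for a host-to-host bulk transfer.
    staged = work_dir / Path(input_file).name
    shutil.copy(input_file, staged)
    output = work_dir / (stage.name + ".out")
    argv = [arg.format(input=staged, output=output) for arg in stage.command]
    # Invoke: a real deployment would hand this to a scheduler instead.
    subprocess.run(argv, check=True)
    # Retrieve: the caller gets the output path and feeds it to the next stage.
    return output


def run_pipeline(stages, first_input, work_root):
    """Apply the stages in sequence; each stage consumes the previous output."""
    current = Path(first_input)
    for stage in stages:
        current = run_stage(stage, current, work_root)
    return current


def demo():
    """Run a two-stage toy workflow and return the final output text."""
    upper = "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
    reverse = "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read()[::-1])"
    stages = [
        Stage("upper", [sys.executable, "-c", upper, "{input}", "{output}"]),
        Stage("reverse", [sys.executable, "-c", reverse, "{input}", "{output}"]),
    ]
    with tempfile.TemporaryDirectory() as root:
        source = Path(root) / "image.txt"
        source.write_text("tma core 17")
        return run_pipeline(stages, source, root).read_text()
```

Calling `demo()` runs both stages in order. The list of `Stage` objects plays the role of the "list with executables and parameters" that the user interface would collect, and starting the loop corresponds to invoking the service for the first stage.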

What is next? • Next Pilot Application: Prof. Dan Janies’ Supramap • http://supramap.osu.edu • Builds a phylogenetic tree and projects it onto a map of the planet • Computationally expensive • Next Pilot Application(s): Your Application!? • More Info: http://bmi.osu.edu and http://www.cagrid.org • Contact: Umit V. Catalyurek email: catalyurek.1@osu.edu
