
Scientific Workflows: Some Examples and Technical Issues



  1. Scientific Workflows: Some Examples and Technical Issues Bertram Ludäscher ludaesch@sdsc.edu Data & Knowledge Systems (DAKS) San Diego Supercomputer Center University of California, San Diego

  2. Outline • Scientific Workflow (SWF) Examples • Genomics: Promoter Identification (DOE SciDAC/SDM) • Neuroscience: Functional MRI (NIH BIRN) • Ecology: Invasive Species, Climate Change (NSF SEEK) • SWFs & Analysis Pipelines … • vs Business WFs • vs Traditional Distributed Computing • Some Technical Issues

  3. Acknowledgements • NSF, NIH, DOE • GEOsciences Network (NSF) www.geongrid.org • Biomedical Informatics Research Network (NIH) www.nbirn.net • Science Environment for Ecological Knowledge (NSF) seek.ecoinformatics.org • Scientific Data Management Center (DOE) sdm.lbl.gov/sdmcenter/

  4. Scientific Workflow Examples Promoter Identification (Genomics) fMRI (Neurosciences) Invasive Species (Ecology) Bonus Material: Semantic Data Integration (Geology)

  5. Example: Promoter Identification Workflow (PIW) (simplified) From: SciDAC/SDM project and collaboration w/ Matt Coleman (LLNL)

  6. Conceptual Workflow (Promoter Identification Workflow, PIW) • [Workflow diagram; nodes include: Select gene-set (cluster-level); For each gene: retrieve matching cDNA, retrieve genomic sequence, extract promoter region (begin, end), retrieve transcription factors, arrange transcription factors, compute subsequence labels; Align promoters; Create consensus sequence; Compute clusters (min. distance); For each promoter, with all promoter models: compute joint promoter model]

  7. SWF: Promoter Identification • More dataflow than workflow • but some branching, looping, merging, … • not: documents/objects undergoing modifications • instead: dataset-out = analysis(dataset-in) • Need for “collection programming” (functional programming style) • Iterations over lists (foreach) • Filtering • Functional composition • Generic & higher-order operations (zip, map(f), …) • Need for abstraction and nested workflows • Need for rich user interaction / steering: • pause & resume • select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF • Need for persistence of intermediate products → data provenance (virtual data concept, e.g. GriPhyN)
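
The need for “collection programming” above is essentially a call for ordinary functional-language idioms. A minimal Haskell sketch of that style follows; the step names (retrieveCdna, extractPromoter) and their bodies are illustrative placeholders, not the actual PIW implementation:

-- Hypothetical stand-ins for PIW steps; real steps would call databases/web services
type GeneId   = String
type Sequence = String

retrieveCdna :: GeneId -> Sequence
retrieveCdna gid = "cDNA:" ++ gid

extractPromoter :: (Int, Int) -> Sequence -> Sequence
extractPromoter (begin, end) s = take (end - begin) (drop begin s)

-- foreach over a gene set = map; functional composition and zip are equally direct
promoterRegions :: (Int, Int) -> [GeneId] -> [(GeneId, Sequence)]
promoterRegions region geneSet =
  zip geneSet (map (extractPromoter region . retrieveCdna) geneSet)

main :: IO ()
main = mapM_ print (promoterRegions (0, 5) ["gene1", "gene2"])

The point is that foreach, filtering, and higher-order operations like map(f) and zip come for free once analysis steps are plain functions from input data sets to output data sets.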

  8. From: BIRN-CC, UCSD (Jeffrey Grethe)

  9. Details of the Functional MRI (Magnetic Resonance Imaging) Analysis Workflow (Jeffrey Grethe) • Collect data (K-space images in Fourier space) from the MR scanner while the subject performs a specific task • Reconstruct the K-space data to image data (this requires scanner parameters for the reconstruction) • Now have anatomical and functional data • Pre-process the functional data • Correct for differences in slice acquisition (each slice in a volume is collected at a slightly different time), so that all slices appear to be acquired at the same time • Correct for subject motion (head movement in the scanner) by realigning all functional images • Register the functional images with the anatomical image → all images are now in the same space (aligned with one another) • Move all subjects into template space through non-linear spatial normalization; atlas templates (made from many subjects) exist that one can normalize to, so that all subjects are in the same space, allowing direct comparison across subjects • DATA VERIFICATION: check whether all these procedures worked; if not, go back and try again (possibly tweaking some parameters for the routines or re-doing some steps by hand) • Move on to statistics: first single-subject statistics, which require information about the experimental paradigm in addition to the images; the results can be overlaid onto an anatomical image to create visual displays of brain activation during a particular task • Can also combine statistical data from multiple subjects for a group/population analysis and display those results → the interactive nature of these workflows is critical (data verification): can these steps be automated or semi-automated? → need metadata from the collection equipment and experimental design!
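
As a purely illustrative sketch (the step bodies below are placeholders, not the BIRN code), the pre-processing chain and its data-verification point can be modelled as a composition of functions with an explicit check that may trigger a re-run:

-- Placeholder fMRI pipeline: each step just tags the volume so the flow is visible
data Volume = Volume { volLabel :: String } deriving Show

reconstruct, sliceTimeCorrect, realign, coregister, normalize :: Volume -> Volume
reconstruct      v = v { volLabel = volLabel v ++ " reconstructed" }
sliceTimeCorrect v = v { volLabel = volLabel v ++ " slice-time-corrected" }
realign          v = v { volLabel = volLabel v ++ " realigned" }
coregister       v = v { volLabel = volLabel v ++ " coregistered" }
normalize        v = v { volLabel = volLabel v ++ " normalized" }

verify :: Volume -> Bool          -- stand-in for manual or automated data verification
verify _ = True

preprocess :: Volume -> Volume
preprocess = normalize . coregister . realign . sliceTimeCorrect . reconstruct

process :: Volume -> Either String Volume
process raw =
  let v = preprocess raw
  in if verify v then Right v else Left "verification failed: re-run with tweaked parameters"

main :: IO ()
main = print (process (Volume "k-space run 01"))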

  10. SEEK: Vision & Overview • Large collaborative NSF/ITR project: UNM, UCSB, SDSC/UCSD, UKansas, … • Fundamental improvements for researchers: global access to ecologically relevant data; rapidly locate and utilize distributed computation; capture, reproduce, and extend the analysis process • The EcoGrid provides unified access to distributed data stores, parameter ontologies, and stored analyses, plus runtime capabilities via the Execution Environment • The Semantic Mediation System (SMS) and the Analysis and Modeling System (AMS) use WSDL/UDDI to access services within the EcoGrid, enabling analytically driven data discovery and integration • SEEK is the combination of EcoGrid data resources and information services, coupled with advanced semantic and modeling capabilities • [Architecture diagram: an example analytical pipeline “AP0” built from analysis steps; a library of analysis steps, pipelines & results; parameter ontologies; a Semantic Mediation Engine (logic rules, query processing); raw data sets wrapped for integration w/ EML (KNB, SRB, TaxOn, …); and an execution environment (SAS, MATLAB, FORTRAN, etc.); example query: invasive species over time]

  11. GARP Invasive Species Pipeline • From: NSF SEEK (Deana Pennington et al) • [Pipeline diagram: EcoGrid queries against registered EcoGrid databases retrieve species presence & absence points (a) and environmental layers (b) for both the native range and the invasion area; layer integration produces integrated layer stacks (c); sample data are split into training and test samples (d); data calculation yields the GARP rule set (e); map generation produces the native range and invasion area prediction maps (f); validation produces model quality parameters (g); the user selects prediction maps (h), metadata are generated, and results are archived back to the EcoGrid]

  12. Details GARP Invasive Species Pipeline Modeling the distribution of a species using appropriate ecological niche modeling algorithms (e.g., GARP—the Genetic Algorithm for Rule-set Production) in an analytical pipeline environment. a) The EcoGrid is queried for data specifying the presence or absence of a particular species in a given area. b) Multiple environmental layers relevant to the species' distribution are selected with a second EcoGrid query. c) Environmental layers, representing the current range of the species (native range) and the range of interest to the invasion study, are spatially integrated into layer stacks. d) Samples are selected from the presence/absence data, and the corresponding values from the native-range environmental layer stack are retrieved. The sample is divided into a training set and a testing set. e) The GARP algorithm is run on the training set. The GARP rule set is then applied to both areas, creating predictive maps under current conditions (native range) and after invasion (invaded range). f) Predictive maps are sent to the scientist's workstation for further analysis. g) A comparison is made between the ground-truth occurrence data that were set aside as test data and the corresponding locations on the native-range predictive map. Error measures providing an indication of model quality are sent to the scientist's workstation. h) After multiple runs, the user may select maps for metadata generation and archiving back to the EcoGrid. From: NSF SEEK (Deana Pennington et al)
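
A hedged sketch of how the data products a)–h) could be given explicit types so that the chaining of pipeline steps becomes checkable; the type and function names are invented for illustration, and GARP itself is of course not implemented here:

type OccurrencePoints = [(Double, Double, Bool)]   -- (lat, long, presence)  (a)/(d)
type LayerStack       = [String]                   -- integrated environmental layers (c)
type RuleSet          = String                     -- GARP rule set (e)
type PredictionMap    = String                     -- predictive map (f)

splitSample :: OccurrencePoints -> (OccurrencePoints, OccurrencePoints)   -- training/test (d)
splitSample pts = splitAt (length pts `div` 2) pts

trainGarp :: OccurrencePoints -> LayerStack -> RuleSet                    -- (e)
trainGarp _ _ = "ruleset"

applyRules :: RuleSet -> LayerStack -> PredictionMap                      -- (e) applied -> (f)
applyRules rs layers = rs ++ " applied to " ++ show layers

modelQuality :: PredictionMap -> OccurrencePoints -> Double               -- error measure (g)
modelQuality _ test = fromIntegral (length test)   -- placeholder

main :: IO ()
main = do
  let pts            = [(35.1, -106.6, True), (36.0, -105.3, False)]
      nativeLayers   = ["elevation", "temperature"]
      invasionLayers = ["elevation", "temperature"]
      (train, test)  = splitSample pts
      rules          = trainGarp train nativeLayers
      nativeMap      = applyRules rules nativeLayers
      invasionMap    = applyRules rules invasionLayers
  print (nativeMap, invasionMap, modelQuality nativeMap test)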

  13. Data Integration: Spatial Integration Aspects • Integration of heterogeneous data formats: species samples (e.g., “Sample 1, lat, long, presence”) arriving in an Excel file and an Access file, plus environmental layers for vegetation cover type, elevation (m), and mean annual temperature (C) • Semantically-integrated species occurrence data is combined with spatially-integrated environmental data to produce sample data consisting of species occurrence (P = present, A = absent), vegetation type, elevation (m), and mean annual temperature (C) • Example integrated data: P, juniper, 2200 m, 16 C; P, pinyon, 2320 m, 14 C; A, creosote, 1535 m, 22 C • From: NSF SEEK (Deana Pennington et al)
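
A minimal sketch (with invented field values) of the integration step just described: occurrence samples are joined with spatially co-located environmental values to yield rows of (presence, vegetation, elevation, temperature):

import Data.List (find)

data Sample  = Sample  { sLat, sLong :: Double, sPresent :: Bool }
data EnvCell = EnvCell { eLat, eLong :: Double
                       , eVegetation :: String, eElevM :: Int, eTempC :: Int }

-- join each sample with the environmental cell at the same coordinates
integrate :: [Sample] -> [EnvCell] -> [(Char, String, Int, Int)]
integrate samples cells =
  [ (if sPresent s then 'P' else 'A', eVegetation c, eElevM c, eTempC c)
  | s <- samples
  , Just c <- [find (\cell -> eLat cell == sLat s && eLong cell == sLong s) cells] ]

main :: IO ()
main = mapM_ print $
  integrate [Sample 35.0 (-106.0) True, Sample 33.5 (-104.2) False]
            [ EnvCell 35.0 (-106.0) "juniper"  2200 16
            , EnvCell 33.5 (-104.2) "creosote" 1535 22 ]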

  14. SEEK Components • EcoGrid • Seamless access to distributed, heterogeneous data: ecological, biodiversity, environmental data • “Semantically” mediated and metadata driven • Centralized search & management portal(s) • Analysis and Modeling System (AMS) • Capture, reproduce, and extend analysis process • Declarative means for documenting analysis • “Pipeline” system for linking generic analysis steps • Strong version control for analysis steps • Easy-to-use interface between data and analysis • Semantic Mediation System (SMS) • “smart” data discovery, “type-correct” pipeline construction & data binding: • determine whether/how to link analytic steps • determine how data sets can be combined • determine whether/how data sets are appropriate inputs for analysis steps

  15. AMS Overview • Objective • Create a semi-automated system for analyzing data and executing models that provides documentation, archiving, and versioning of the analyses, models, and their outputs (visual programming language?) • Scope • Any type of analysis or model in ecology and biodiversity science • Massively streamline the analysis and modeling process • Archiving, rerunning analyses in SAS, Matlab, R, SysStat, C(++),… • …

  16. SEEK Analytical Pipeline • A “workflow” is one or more analytical processes chained together into an analytical pipeline • In the SEEK model, data ingestion/cleaning is metadata driven (specifically with EML) From: NSF SEEK (Chad Berkley, Matt Jones)

  17. Automation of data integration using workflows • Workflows can automate the integration process if data is described with adequate structured metadata • → the WF layer sits one level above the data integration/mediation layer

  18. Simple Data Integration (homogeneous data): Metadata (EML) may be good enough! • Integration of homogeneous or mostly homogeneous data via EML metadata is relatively straightforward

  19. Complex Data Integration (simple example!) • Integration of heterogeneous data requires much more advanced metadata and processing • Attributes must be semantically typed • Collection protocols must be known • Units and measurement scale must be known • Measurement mechanics must be known (i.e. that Density=Count/Area) • This is an advanced research topic within the SEEK project (SMS)
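
One concrete (and deliberately tiny) way to make measurement knowledge such as Density = Count/Area machine-checkable is to give each unit its own type; the sketch below uses invented names and is not SEEK's actual SMS machinery:

newtype Count   = Count   Double deriving Show
newtype AreaM2  = AreaM2  Double deriving Show   -- square metres
newtype Density = Density Double deriving Show   -- individuals per square metre

-- the only way to obtain a Density is from a Count and an Area
density :: Count -> AreaM2 -> Density
density (Count n) (AreaM2 a) = Density (n / a)

main :: IO ()
main = print (density (Count 42) (AreaM2 100))   -- Density 0.42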

  20. Semantic Typing • Label data with semantic types • Label inputs and outputs of analytical components with semantic types • Use Semantic Mediation System (SMS) to generate transformation steps • Beware of analytical constraints • Use SMS to discover relevant components • Ontology = specification of a conceptualization (a knowledge map) • [Diagram: ontology linking data and workflow components]
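
A toy illustration of the last point: if a small ontology is represented as a subsumption relation, deciding whether one component's output can bind to another's input becomes a reachability check. Concept names and isA edges below are invented for illustration:

type Concept = String

isA :: [(Concept, Concept)]                 -- direct subclass edges
isA = [ ("OccurrenceData", "BiodiversityData")
      , ("BiodiversityData", "EcologicalData") ]

-- c1 `subsumes` c2: c2 is (transitively) a kind of c1
subsumes :: Concept -> Concept -> Bool
subsumes c1 c2
  | c1 == c2  = True
  | otherwise = any (subsumes c1) [sup | (sub, sup) <- isA, sub == c2]

-- an upstream output may bind to a downstream input if the input concept subsumes it
canBind :: Concept -> Concept -> Bool
canBind output input = input `subsumes` output

main :: IO ()
main = print ( canBind "OccurrenceData" "EcologicalData"    -- True
             , canBind "EcologicalData" "OccurrenceData" )  -- False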

  21. SWF: Ecology Examples • Similar requirements as before: • Rich user interaction • Analysis pipelines running on an EcoGrid • Collection programming probably needed • Abstraction & nested workflows • Persistent intermediate steps (cf. e.g. Virtual Data concept) • Additionally: • Very heterogeneous data → need for semantic typing of data and analysis steps → semantic mediation support … • … for pipeline design • … for data integration at design time and at runtime

  22. Bonus Material: Semantic Data Integration “Geology Workbench” Kai Lin (GEON, SDSC)

  23. Domain knowledge / knowledge representation: AGE ONTOLOGY • Query over the Nevada geologic map: “Show formations where AGE = ‘Paleozoic’” (with the age ontology) vs. “Show formations where AGE = ‘Paleozoic’” (without the age ontology)
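
The effect of the age ontology can be illustrated with a small sketch: without it, a query for AGE = ‘Paleozoic’ only matches rows labelled literally “Paleozoic”; with it, the query term is first expanded along the ontology. The period list is an incomplete excerpt and the dataset rows are invented:

-- ontology-driven term expansion (excerpt of the Paleozoic sub-periods)
expand :: String -> [String]
expand "Paleozoic" = "Paleozoic" : ["Cambrian", "Ordovician", "Silurian",
                                    "Devonian", "Carboniferous", "Permian"]
expand t           = [t]

formations :: [(String, String)]            -- (formation, AGE attribute)
formations = [("Fm A", "Devonian"), ("Fm B", "Cretaceous"), ("Fm C", "Paleozoic")]

queryWithoutOntology, queryWithOntology :: String -> [String]
queryWithoutOntology age = [f | (f, a) <- formations, a == age]
queryWithOntology    age = [f | (f, a) <- formations, a `elem` expand age]

main :: IO ()
main = do
  print (queryWithoutOntology "Paleozoic")   -- ["Fm C"]
  print (queryWithOntology    "Paleozoic")   -- ["Fm A","Fm C"]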

  24. Navigable, Amalgamated Rocktype Ontology

  25. Geology Workbench: Initial State • An ontology-based mediator • [Screenshot: entry points to click on Ontologies, Datasets, and Applications]

  26. Geology Workbench: Uploading Ontologies • [Screenshot: click on Ontology Submission and choose an OWL file to upload; the ontology's namespace can be used to import it into other ontologies; click an entry to check its details]

  27. Geology Workbench: Data (to Ontology!) Registration, Step 1: Choose Classes • [Screenshot: choose an ontology class, select a shapefile, enter a data set name, and click on Submission]

  28. Geology Workbench: Data Registration, Step 2: Choose Columns for Selected Classes • [Screenshot: select the column that contains information about geologic age]

  29. Geology Workbench: Data Registration, Step 3: Resolve Mismatches • [Screenshot: two terms do not match any ontology terms; ‘algonkian’ is manually mapped into the ontology]

  30. Geology Workbench: Ontology-enabled Map Integrator • [Screenshot: choose the classes of interest and click on a name to display, e.g., all areas with the age Paleozoic]

  31. Geology Workbench: Change Ontology • [Screenshot: switch from the Canadian Rock Classification to the British Rock Classification by submitting an ontology mapping between the two classifications; a new query interface is generated and can be run]

  32. Scientific Workflows (SWF) vs. Business Workflows, and some Technical Issues

  33. Business Workflows • Business Workflows • show their office-automation ancestry • documents and “work-tasks” are passed around • no data streaming, no data-intensive pipelines • lots of standards to choose from: WfMC, BPML, BPEL4WS, XPDL, … • but often no clear semantics even for constructs as simple as the example shown here • Source: Expressiveness and Suitability of Languages for Control Flow Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002

  34. What is a Scientific Workflow? • A Misnomer … • … well, at least for a number of examples … • Scientific Workflows ≠ Business Workflows • Business Workflows: “control-flow-rich” • Scientific Workflows: “data-flow-rich” • … much more to say …

  35. More on Scientific WF vs Business WF • Business WF • Tasks, documents, etc. undergo modifications (e.g., a flight reservation going from reserved to ticketed), but the modified WF objects remain identifiable throughout • Complex control flow, task-oriented • Transactions w/o rollback (ticket: reserved → purchased) • … • SWF • data-in and data-out of an analysis step are not the same object! • dataflow, data-oriented (cf. AVS/Express, Khoros, …) • re-run automatically (à la distributed computing, e.g. Condor) or user-driven/interactively (based on failure type) • data integration & semantic mediation as part of the SWF framework! • …

  36. SWF vs Distributed Computing • Distributed Computing (e.g., à la Condor(-G)) • Batch oriented • Transparent distributed computing (“remote Unix/Java”; standard/Java universes in Condor) • HPC resource allocation & scheduling • SWF • Often highly interactive for decision making/steering of the WF and visualization (data analysis) • Transparent data access (Grid) and integration (database mediation & semantic extensions) • Desktop metaphor (“microworkflow”!?); often (but not always!) light-weight web service invocation

  37. Some Technical Issues (SWFs) • Design Environment • Intuitive “visual programming” interface (ideally w/o the “programming” part!!) • “Smart Typing” extensions (for data→task and task→task bindings) • Structural typing (e.g. XML Schema) • Semantic typing (e.g. OWL) • Specialized semantic types (SI unit system, measurement scales, …) • “resource typing”: token consumption/production, execution preconditions • Declarative programming extensions • Functional collection programming (e.g., Haskell-like; cf. also BioKleisli/CPL) • Also: consider what standards bring to the table (BPEL4WS) • Alternation of analysis and data transformation steps • Sophisticated dataflow execution models and hybrids thereof: • Ptolemy-II leads the way: Process Networks (PN), Synchronous Dataflow Networks (SDF), Continuous Time modeling (CT), Discrete Event modeling (DE) • Grid-enabling process networks and data integration: • Borrow from distributed computing technologies and tools (e.g. Globus, Condor) and distributed data access (e.g., SRB) and integration (mediators) • Virtualize and Grid-enable everything! analysis@LOC, data@LOC, …
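
To make the “dataflow execution model” point a bit more concrete, here is a toy process network in the spirit of the PN domain: two stages run concurrently and communicate only through a channel of tokens. This is a minimal sketch, not the Ptolemy-II/Kepler PN director:

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Monad (forM_, replicateM_)

-- upstream actor: emits a finite stream of tokens
producer :: Chan Int -> IO ()
producer ch = forM_ [1 .. 5] (writeChan ch)

-- downstream actor: one firing per token, here just scaling and printing
consumer :: Chan Int -> IO ()
consumer ch = replicateM_ 5 (readChan ch >>= print . (* 10))

main :: IO ()
main = do
  ch <- newChan
  _  <- forkIO (producer ch)
  consumer ch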

  38. Where do we go? From Ptolemy-II to Kepler: an example of what extensions are needed

  39. From Ptolemy-II to … Kepler • Ptolemy-II: extensible open-source tool (EECS, UC Berkeley) • Various combinable, clearly defined execution models (“domains”): PN, SDF, DE, CT • Kepler = PT-II extensions for scientific workflows • Adopted by SEEK, SciDAC/SDM, and hopefully others! (open source!) • Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

  40. Promoter Identification Workflow in Ptolemy-II (SSDBM’03) • [Screenshot of the PIW modelled in Ptolemy-II: hand-crafted web-service actors designed to fit; a hand-crafted control solution that also forces sequential execution; no data transformations available; complex backward control-flow]

  41. Simplified Process Network PIW • Back to a purely functional dataflow process network (= a data streaming model!) • Re-introducing map(f) to Ptolemy-II (it was there in PT Classic): no control-flow spaghetti, suited for data-intensive apps, free concurrent execution, free type checking, automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId] • map(f)-style iterators • Powerful type checking • Generic, declarative “programming” constructs • Generic data transformation actors • Forward-only, abstractable sub-workflow piw(GeneId)
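
The lifting mentioned above — from a per-gene sub-workflow piw(GeneId) to the whole pipeline as map(piw) over [GeneId] — is a one-liner in a functional notation. The body of piw below is a placeholder; only the shape of the lifting is the point:

type GeneId        = String
type PromoterModel = String

-- sub-workflow for a single gene (placeholder body)
piw :: GeneId -> PromoterModel
piw gid = "promoter model for " ++ gid

-- PIW := map(piw) over [GeneId]
pipeline :: [GeneId] -> [PromoterModel]
pipeline = map piw

main :: IO ()
main = mapM_ putStrLn (pipeline ["g1", "g2", "g3"])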

  42. Optimization by Declarative Rewriting I • PIW as a declarative, referentially transparent functional process → optimization via functional rewriting possible, e.g. map(f ∘ g) = map(f) ∘ map(g) • Use map(f ∘ g) instead of map(f) ∘ map(g); combination of map and zip • Details: technical report & PIW specification in Haskell, http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
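
A quick check (not a proof) of the rewrite used above, with arbitrary stand-in functions f and g: map (f . g) produces the same result as map f . map g, but traverses the list once instead of building an intermediate list:

f, g :: Int -> Int
f = (+ 1)
g = (* 2)

fused, unfused :: [Int] -> [Int]
fused   = map (f . g)        -- one pass
unfused = map f . map g      -- two passes, one intermediate list

main :: IO ()
main = print (fused [1 .. 10] == unfused [1 .. 10])   -- True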

  43. Optimization by Declarative Rewriting II • Rewritings require that the data transformation semantics is known, e.g., Haskell-like for FP and SQL (XQuery)-like for (XML) database querying • Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney

  44. Summary: Scientific Workflows Everywhere • Shown bits of scientific workflows in: • SciDAC/SDM, SEEK, BIRN, GEON, … • Many others are out there: • GriPhyN et al (virtual data concept): Chimera, Pegasus, DAGMan, Condor-G, …, GridAnt, … • E-Science: e.g., myGrid: XScufl, Taverna, DiscoveryNet • Pragma, iLTER, … • Commercial efforts: DiscoveryNet (InforSense), SciTegic, IBM, Oracle, … • One size fits all? • Most likely not (Business WFs ≠ Scientific WFs) • Some competition is healthy, and reinventing a round wheel is OK • But some coordination & collaboration can save … • reinventing the squared wheel • “leveraging” someone else’s wheel in a squared way … • Even within SWF, quite different requirements: • exploratory and ad-hoc vs. well-designed and high-throughput • interactive desktop (w/ lightweight web services/Grid) vs. distributed, batched

  45. Combine Everything: die eierlegende Wollmilchsau (the “egg-laying wool-milk-sow”, i.e., the all-in-one solution): • Database Federation/Mediation • query rewriting under GAV/LAV • w/ binding pattern constraints • distributed query processing • Semantic Mediation • semantic integrity constraints, reasoning w/ plans, automated deduction • deductive database/logic programming technology, AI “stuff” … • Semantic Web technology (OWL, …) • Scientific Workflow Management • more procedural than database mediation (often the scientist is the query planner) • deployment using grid services!

  46. F I N
