
Semantic Workflows: Metadata Meets Computational Experiments

Learn how to manage complex scientific applications using semantic workflows that formalize analyses, manage execution, and capture provenance records for new results.


Presentation Transcript


  1. Semantic Workflows: Metadata Meets Computational Experiments Yolanda Gil, PhD Information Sciences Institute and Department of Computer Science University of Southern California gil@isi.edu With Ewa Deelman, Jihie Kim, Varun Ratnakar, Paul Groth, Gonzalo Florez, Pedro Gonzalez, Joshua Moody

  2. Scientific Analysis Complex processes involving a variety of algorithms/software

  3. Issues in Managing Complex Scientific Applications • Managing a variety of software packages • Installation • Documentation • Conversion routines • Each lab has software to convert across packages • New algorithms/software available all the time • Cost of installing and understanding may be a deterrent • May require large scale computations • Some packages include treatment of large scale computations • May or may not be transparent to the user • Individual software components are only the ingredients of overall analysis method • Need to capture the method that combines the components

  4. Managing Scientific Applications as Computational Workflows Emerging paradigm for large-scale and large-scope scientific inquiry; large-scope science integrates diverse models, phenomena, and disciplines ("in-silico experimentation"). Workflows provide a formalization of scientific analyses: the analysis routines to be executed, the data flow among them, and relevant execution details, i.e. the overall "method" used to produce a new result. Workflows can represent semantic constraints on components and parameters that can be used to assist users in creating valid analyses. Workflow systems can reason about constraints and requirements, manage their execution, and automatically keep provenance records for new results.
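To make the idea of a workflow as an explicit, executable dataflow concrete, here is a minimal illustrative sketch in Python (not the Wings data model; component and dataset names are hypothetical): components connected by dataflow links, from which an execution order can be derived and recorded.

```python
# A minimal sketch (not the Wings data model): a workflow as components
# connected by dataflow links, so the overall "method" is explicit and an
# execution order can be derived from the data dependencies.
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    inputs: list
    outputs: list

@dataclass
class Workflow:
    components: list = field(default_factory=list)

    def add(self, comp):
        self.components.append(comp)

    def execution_order(self):
        # Topological order over the dataflow: a step can run once all of its
        # inputs have been produced (or are workflow-level inputs).
        all_outputs = {o for c in self.components for o in c.outputs}
        produced = {i for c in self.components for i in c.inputs} - all_outputs
        order, remaining = [], list(self.components)
        while remaining:
            ready = [c for c in remaining if set(c.inputs) <= produced]
            if not ready:
                raise ValueError("cyclic or unsatisfied dataflow")
            for c in ready:
                order.append(c.name)
                produced |= set(c.outputs)
                remaining.remove(c)
        return order

# Hypothetical two-step analysis: normalize raw data, then train a model.
wf = Workflow()
wf.add(Component("Normalize", ["raw_data"], ["clean_data"]))
wf.add(Component("Modeler", ["clean_data"], ["trained_model"]))
print(wf.execution_order())  # ['Normalize', 'Modeler']
```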

  5. NSF Workshop on Challenges of Scientific Workflows (2006, Gil and Deelman co-chairs) • Despite investments in CyberInfrastructure as an enabler of a significant paradigm change in science: • Exponential growth in compute, sensors, data storage, and networks, BUT the growth of science is not following the same exponential • Reproducibility, key to the scientific method, is threatened • What is missing: • Perceived importance of capturing and sharing process in accelerating the pace of scientific advances • Process (method/protocol) is increasingly complex and highly distributed • Workflows are emerging as a paradigm for process-model-driven science that captures the analysis itself • Workflows need to be first-class citizens in scientific CyberInfrastructure • Enable reproducibility • Accelerate scientific progress by automating processes • Interdisciplinary and intradisciplinary research challenges • Report available at http://www.isi.edu/nsf-workflows06

  6. Wings/Pegasus Workflow System for Large-Scale Data Analysis

  7. The Wings/Pegasus Workflow System [Gil et al 07; Deelman et al 03; Deelman et al 05; Kim et al 08; Gil et al forthcoming] WINGS: Knowledge-based workflow environment www.isi.edu/ikcap/wings • Ontology-based reasoning on workflows and data (W3C's OWL) • Workflow library of useful analyses • Proactive assistance + automation • Execution-independent workflows Pegasus: Automated workflow refinement and execution pegasus.isi.edu • Optimize for performance, cost, reliability • Assign execution resources • Manage execution through DAGMan • Daily operational use in many domains Grid services condor.wisc.edu www.globus.org • Sharing of distributed resources • Remote job submission • Scalable service-oriented architecture • Commercial quality, open source

  8. Physics-Based Seismic Hazard Analysis: SCEC’s CyberShake [Slide from T. Jordan of SCEC] Seismicity Paleoseismology Geologic structure Local site effects Faults Seismic Hazard Model Stress transfer Rupture dynamics Crustal motion Crustal deformation Seismic velocity structure

  9. A Wings Workflow Template for Seismic Hazard Analysis • Rich metadata descriptions for all data products • Intensional descriptions of parallel computations • Querying results of other data creation subworkflows • Intensional descriptions of data sets [Diagram legend: component collection, application component, single file, nested file collection, file collection, collection generation fringes. Data flow: Rupture Variations -> Strain Green Tensors -> Seismograms -> Spectral acceleration]

  10. Wings/Pegasus Workflows for Seismic Hazard Analysis [Gil et al 07] • Input data: a site and an earthquake forecast model • thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed • ~110,000 rupture variations to be simulated for that site • High-level template combines 11 application codes • 8,048 application nodes in the workflow instance generated by Wings • Provenance records kept for 100,000 workflow data products • Generated more than 2M triples of metadata • 24,135 nodes in the executable workflow generated by Pegasus, including: • data stage-in jobs, data stage-out jobs, data registration jobs • Executed on USC's HPCC cluster (1,820 nodes with dual processors, but only < 144 available) • Including MPI jobs, each running on hundreds of processors for 25-33 hours • Runtime was 1.9 CPU years

  11. Pegasus: Large Scale Distributed Execution [Deelman et al 06] Best Paper Award, IEEE Int'l Conference on e-Science, 2006 • Pegasus managed 1.8 years of computation in 23 days on NSF's TeraGrid • Processed 20 TB of data with 260,000 jobs • Daily operations of NSF's TeraGrid shared resources managed using Globus services • Pegasus managed ~840,000 individual tasks in a workflow over a period of three weeks

  12. Background: Benefits of Workflow Systems • Managing execution • Dependencies among steps • Failure recovery • Managing distributed computation • Move data when needed • Managing large data sets • Efficiency, reliability • Security and access control • Remote job submission • Provenance recording • Low-cost high-fidelity reproducibility

  13. This Talk: Benefits of Semantic Workflows [Gil JSP-09] Execution management: • Automation of workflow execution • Managing distributed computation • Managing large data sets • Security and access control • Provenance recording • Low-cost high-fidelity reproducibility Semantics and reasoning: • Workflow retrieval and discovery • Automation of workflow generation • Systematic exploration of design space • Validation of workflows • Automated generation of metadata • Guarantees of data pedigree • "Conceptual" reproducibility

  14. Parallel Data Processing in Workflows with Accuracy/Quality Tradeoffs Thanks to Vijay Kumar, Tahsin Kurc, Joel Saltz, and the rest of OSU's Multiscale Computing Lab and Emory's Center for Comprehensive Informatics groups

  15. Accuracy/Quality Tradeoffs in Large-Scale Biomedical Image Analysis PIQ: Pixel Intensity Quantification (from National Center for Microscopy and Imaging Research [Chow et al 06]) Terabyte-sized out-of-core image data Need to minimize execution time while preserving highest output quality Some operations are parallelizable, others must operate on entire images NC: Neuroblastoma Classification (from OSU’s BMI [Kong et al 07]) Diverse user queries: “Minimize time, 60% accuracy” “Most accurate classification of a region within 30 mins” “Classify image regions with some min accuracy but obtain higher accuracy for feature-rich regions” Easily parallelizable computations: per image chunk

  16. Architecture [Hall et al 07] Wings: workflow representation and reasoning Pegasus/Condor/Globus: workflow mapping and execution on available resources DataCutter (DCMPI): execution within a site (e.g., a cluster) using efficient filter/stream data processing Trade-off module: explores and selects parameter values
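The queries on the previous slide ("most accurate classification of a region within 30 mins") hint at what the trade-off module has to decide. Here is a hedged sketch of that selection step, not the actual module; the candidate parameter settings and their runtime/accuracy estimates are made up for illustration.

```python
# Illustrative sketch of what a trade-off module might do (not the actual
# component): given estimated runtime and accuracy for candidate parameter
# settings, answer a query such as "most accurate result within 30 minutes".
# The candidate settings and their estimates below are made-up numbers.
candidates = [
    {"chunks_per_layer": 16,  "est_minutes": 12, "est_accuracy": 0.58},
    {"chunks_per_layer": 64,  "est_minutes": 25, "est_accuracy": 0.71},
    {"chunks_per_layer": 256, "est_minutes": 48, "est_accuracy": 0.83},
]

def best_within_budget(cands, max_minutes):
    # Keep only settings whose estimated runtime fits the budget, then pick
    # the one with the highest estimated accuracy.
    feasible = [c for c in cands if c["est_minutes"] <= max_minutes]
    return max(feasible, key=lambda c: c["est_accuracy"]) if feasible else None

print(best_within_budget(candidates, 30))
# -> {'chunks_per_layer': 64, 'est_minutes': 25, 'est_accuracy': 0.71}
```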

  17. Starting point: high-level dataflow • [Sharma et al 07]: • 3D image layers • Chunked into tiles • Projected into the Z plane • Normalized and aligned • Stitched back into a single 2D image [Dataflow: Chunkisize -> Z-projection -> Normalize -> Auto-align -> Stitch -> Warp] Tiles and parallelism are not explicit!

  18. Workflow Template Sketch Controllable parameter: p is the number of chunks in an image layer. LEGEND: regular codes; parallel (MPI) codes with one computation node; nodes that will be expanded to several computation nodes. [Template diagram: sets of z layers, each with t tiles (.IMG), go through Format conversion (.DIM) and Chunkisize into collections of z layers, each with p chunks per layer and t tiles per chunk; Z-projection runs p ways (one per chunk stack), internally iterating over each chunk of the z-projection; Generate magic tiles produces 2 magic tiles; Normalize runs p ways (one per chunk of the z-projection) and p * z ways (one per chunk per layer), internally iterating over each chunk of the normalized z-projection; Auto-align produces scores per pair of chunks in the z-projection and an MST graph; Stitch internally iterates over each chunk of each layer, producing sets of stitched normalized chunks for all layers.]

  19. Managing Parallelism with WINGS Workflows Many degrees of freedom, many tradeoffs: Processors available: amount and capabilities Data processing operations: reduce I/O and communication overhead Parameters of application components: vary accuracy and performance Workflow parameters: degree of parallel branching based on chunking data into smaller datasets Workflow Template (Edited in Wings) Workflow Instance (Automatically generated by Wings)

  20. Exploring Tradeoffs [Kumar et al, forthcoming]

  21. Workflows for Genetic Studies of Mental Disorders In collaboration with Chris Mason, Ewa Deelman, Varun Ratnakar, Gaurang Mehta

  22. Workflow Portal for Genetic Studies of Mental Disorders Work with Christopher Mason from Yale and Cornell

  23. Workflows for Population Studies Work with Christopher Mason from Yale and Cornell • Workflows for common analysis types • Association tests • CNV detection • Variant discovery • Family-based association analysis (TDT) • Workflow components based on widely used software • Plink (Purcell, Harvard) • R (Chambers et al) • PennCNV (Penn) -- Hidden Markov Models • Gnosis (State, Yale) -- sliding windows • Allegro (Decode, Iceland) -- Multiterminal Binary Decision Diagrams • Structure (Pritchard, Chicago) -- structured association • FastLink (Schaffer, NCBI) • BWA (Burrows-Wheeler Aligner) (Li & Durbin) • SAMTools

  24. Workflow Portal for Population Studies in Genomics Association Test Transmission Disequilibrium Test (TDT) Variant Discovery from Resequencing CNV Detection

  25. Why Semantics: Choosing an Analysis • What type of analysis is appropriate for my data? • What workflow is appropriate for my data? • How do I set up the workflow parameters? Association Test Variant Discovery from Resequencing Association tests are best for large datasets that are not within a family Variant discovery is used for genomic data from the same individual CNV Detection Transmission Disequilibrium Test (TDT) TDT analysis requires no less than 100 families

  26. Why Semantics: Choosing a Workflow • What type of analysis is appropriate for my data? • What workflow is appropriate for my data? • How do I set up the workflow parameters? Association Test Applies population stratification to remove outliers Assumes outliers have been removed Uses CMH association Uses structured association Transmission Disequilibrium Test (TDT) Uses a standard test Incorporates parental phenotype information

  27. Why Semantics: Configuring Parameters • What type of analysis is appropriate for my data? • What workflow is appropriate for my data? • How do I set up the workflow parameters? Association Test: max population, max individuals per cluster ("mc") and merge distance p-value constraint ("ppc"). If Affymetrix data, set the cutoff ("miss") to 94%; if Illumina, 98%.
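As a hedged illustration of how such a platform-dependent rule could be pushed from documentation into the system, here is a minimal Python sketch; the metadata key "genotyping_platform" is hypothetical, not a WINGS or Plink identifier.

```python
# Hedged sketch: encoding the platform-dependent rule from this slide so the
# system, not the user, configures the call-rate cutoff ("miss") from dataset
# metadata. The metadata key "genotyping_platform" is hypothetical.
def configure_miss_cutoff(dataset_metadata):
    platform = dataset_metadata.get("genotyping_platform")
    if platform == "Affymetrix":
        return 0.94
    if platform == "Illumina":
        return 0.98
    raise ValueError(f"no configuration rule for platform: {platform!r}")

print(configure_miss_cutoff({"genotyping_platform": "Illumina"}))  # 0.98
```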

  28. There is documentation to answer all these questions…

  29. But the Problem is: • Will a user be inclined to read it all? • Would a user know what subset to read? • A user may read about something, but will she remember it? • How can a user be sure she is aware of all information relevant to making a given choice correctly? • When a new and better version/algorithm/workflow comes along, will a user have time to read up on it? This can lead to: 1) meaningless results 2) suboptimal analyses • There is a cost to advisors/colleagues/reviewers/public

  30. Our Approach: Semantic Workflows [Kim et al CCPEJ 08; Gil et al IEEE eScience 09; Gil et al K-CAP 09; Kim et al IUI 06; Gil et al IEEE IS 2010] • Workflows augmented with semantic constraints • Each workflow constituent has a variable associated with it • Nodes, links, workflow components, datasets • Workflow variables can represent collections of data as well as classes of software components • Constraints are used to restrict variables, and include: • Metadata properties of datasets • Constraints across workflow variables • Incorporate the function of workflow components: how data is transformed • Reasoning about semantic constraints in a workflow • Algorithms for semantic enrichment of workflow templates • Algorithms for matching queries against workflow catalogs • Algorithms for generating workflows from high-level user requests • Algorithms for generating metadata of new data products • Algorithms for assisting users w/ creation of valid workflow templates

  31. Semantic Workflows in WINGS • Workflow templates • Dataflow diagram • Each constituent (node, link, component, dataset) has a corresponding variable • Semantic properties • Constraints on workflow variables (TestData dcdom:isDiscrete false) (TrainingData dcdom:isDiscrete false)

  32. Semantic Constraints as Metadata Properties • Constraints on reusable template (shown below) • Constraints on current user request (shown above) [modelerInput_not_equal_to_classifierInput: (:modelerInput wflow:hasDataBinding ?ds1) (:classifierInput wflow:hasDataBinding ?ds2) equal(?ds1, ?ds2) (?t rdf:type wflow:WorkflowTemplate) -> (?t wflow:isInvalid "true"^^xsd:boolean)]
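The rule above (in Jena rule syntax) marks a template invalid when the modeler input and the classifier input are bound to the same dataset. A minimal Python stand-in for the same check, assuming bindings are held in a plain dictionary (the names are illustrative, not the WINGS API):

```python
# Minimal Python stand-in for the invalidation rule above: if the modeler
# input and the classifier input are bound to the same dataset, the candidate
# template is flagged invalid. Binding names are illustrative, not WINGS API.
def is_invalid(bindings):
    return bindings["modelerInput"] == bindings["classifierInput"]

print(is_invalid({"modelerInput": "dataset_A", "classifierInput": "dataset_A"}))  # True
print(is_invalid({"modelerInput": "dataset_A", "classifierInput": "dataset_B"}))  # False
```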

  33. Three Uses of Semantic Reasoning • Semantic enrichment of workflow templates • Enables workflow retrieval (discovery) • Workflow generation algorithm • Enables users to start with a high level specification of their analysis as an underspecified request • Metadata generation • Enables automatic generation of descriptions of new workflow data products for future reuse and provenance recording

  34. The Need for Semantic Enrichment of Workflow Templates • Semantic information is not provided when creating the workflow • Only components and dataflow links • We don't expect users to specify any properties of datasets in the workflow • E.g., when a user adds a NaiveBayesModeler, he wouldn't be expected to define that its output is a NaiveBayesModel, a Bayes Model (superclass), or a non-human-readable Model • However, retrieval queries are often based on metadata properties of datasets (either inputs or results) • E.g., "Find workflows that can normalize data which is continuous and has missing values [<- constraints on inputs] to create a decision tree model [<- constraint on intermediate data products]"

  35. An Algorithm for Semantic Enrichment of Workflow Templates [Gil et al K-CAP 09] • Phase 1: Goal Regression • Starting from final products, traverse workflow backwards • For each node, query component catalog for metadata constraints on inputs • Phase 2: Forward Projection • Starting from input datasets, traverse workflow forwards • For each node, query component catalog for metadata constraints on outputs
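The next slides walk through these two phases rule by rule. As a minimal sketch of the same idea over a linear workflow, the Python below reduces the component catalog to two toy tables: which constraints a component requires on its inputs, and which metadata properties it passes through. Component and property names echo the slides (ID3Modeler, RandomSampleN, Normalize, isDiscrete), but the code is illustrative, not the WINGS algorithm.

```python
# Illustrative sketch of the two enrichment phases over a linear workflow;
# the component catalog is reduced to two toy tables. Not the WINGS algorithm.
input_constraints = {
    "ID3Modeler": {"isDiscrete": True},   # the modeler requires discrete training data
}
passes_through = {
    "Normalize":     ["isDiscrete"],      # normalizing preserves discreteness
    "RandomSampleN": ["isDiscrete"],      # sampling preserves discreteness
    "ID3Modeler":    ["isDiscrete"],      # the model inherits it from the training data
}

workflow = ["Normalize", "RandomSampleN", "ID3Modeler"]  # components in dataflow order

def enrich(components):
    # One metadata dict per dataset slot: d0 -> Normalize -> d1 -> ... -> d3
    metadata = [dict() for _ in range(len(components) + 1)]

    # Phase 1: goal regression -- traverse backwards, pulling each component's
    # input constraints and pass-through properties toward the workflow inputs.
    for i in reversed(range(len(components))):
        comp = components[i]
        metadata[i].update(input_constraints.get(comp, {}))
        for prop in passes_through.get(comp, []):
            if prop in metadata[i + 1]:
                metadata[i][prop] = metadata[i + 1][prop]

    # Phase 2: forward projection -- traverse forwards, pushing properties of
    # each component's inputs onto its outputs.
    for i, comp in enumerate(components):
        for prop in passes_through.get(comp, []):
            if prop in metadata[i]:
                metadata[i + 1][prop] = metadata[i][prop]
    return metadata

print(enrich(workflow))
# Every dataset slot ends up with {'isDiscrete': True}, mirroring how
# ?TrainingData, ?Dataset3, ?Dataset4 and the models become discrete on the slides.
```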

  36. Semantic Enrichment Algorithm (Phase 1) Rule in Component Catalog: [modelerSpecialCase2: (?c rdf:type pcdom:ID3ModelerClass) (?c pc:hasInput ?idv) (?idv pc:hasArgumentID "trainingData") -> (?idv dcdom:isDiscrete "true"^^xsd:boolean)] ?Dataset4 dcdom:isDiscrete true Model5 Model6 Model7

  37. Semantic Enrichment Algorithm (Phase 1) ?Dataset3 dcdom:isDiscrete true Rule in Component Catalog: [samplerTransfer: (?c rdf:type pcdom:RandomSampleNClass) (?c pc:hasOutput ?odv) (?odv pc:hasArgumentID "randomSampleNOutputData") (?c pc:hasInput ?idv) (?idv pc:hasArgumentID "randomSampleNInputData") (?odv ?p ?val) (?p rdfs:subPropertyOf dc:hasMetrics) -> (?idv ?p ?val)] ?Dataset4 dcdom:isDiscrete true Model5 Model6 Model7

  38. Semantic Enrichment Algorithm (Phase 1) ?TrainingData dcdom:isDiscrete true ?Dataset3 dcdom:isDiscrete true Rule in Component Catalog: [normalizerTransfer: (?c rdf:type pcdom:NormalizeClass) (?c pc:hasOutput ?odv) (?odv pc:hasArgumentID "normalizeOutputData") (?c pc:hasInput ?idv) (?idv pc:hasArgumentID "normalizeInputData") (?odv ?p ?val) (?p rdfs:subPropertyOf dc:hasMetrics) -> (?idv ?p ?val)] ?Dataset4 dcdom:isDiscrete true Model5 Model6 Model7

  39. Semantic Enrichment Algorithm (Phase 2) ?TrainingData dcdom:isDiscrete true ?Dataset3 dcdom:isDiscrete true Rule in Component Catalog: [modelerTransferFwdData: (?c rdf:type pcdom:ModelerClass) (?c pc:hasOutput ?odv) (?odv pc:hasArgumentID "outputModel") (?c pc:hasInput ?idv) (?idv pc:hasArgumentID "trainingData") (?idv ?p ?val) (?p rdfs:subPropertyOf dc:hasDataMetrics) notEqual(?p dcdom:isSampled) -> (?odv ?p ?val)] ?Dataset4 dcdom:isDiscrete true Model5 Model6 Model7 ?Model5 dcdom:isDiscrete true ?Model6 dcdom:isDiscrete true ?Model7 dcdom:isDiscrete true

  40. Semantic Enrichment Algorithm (Phase 2) ?TrainingData dcdom:isDiscrete true ?Dataset3 dcdom:isDiscrete true Rule in Component Catalog: [voteClassifierTransferDataFwd10: (?c rdf:type pcdom:VoteClassifierClass) (?c pc:hasInput ?idvmodel1) (?idvmodel1 pc:hasArgumentID "voteInput1") (?c pc:hasInput ?idvmodel2) (?idvmodel2 pc:hasArgumentID "voteInput2") (?c pc:hasInput ?idvmodel3) (?idvmodel3 pc:hasArgumentID "voteInput3") (?c pc:hasInput ?idvdata) (?idvdata pc:hasArgumentID "voteInputData") (?idvmodel1 dcdom:isDiscrete ?val1) (?idvmodel2 dcdom:isDiscrete ?val2) (?idvmodel3 dcdom:isDiscrete ?val3) equal(?val1, ?val2), equal(?val2, ?val3) -> (?idvdata dcdom:isDiscrete ?val1)] ?Dataset4 dcdom:isDiscrete true Model5 Model6 Model7 ?Model5 dcdom:isDiscrete true ?Model6 dcdom:isDiscrete true ?Model7 dcdom:isDiscrete true ?TestData dcdom:isDiscrete true

  41. Result: Semantically Enriched Workflow Template ?TrainingData dcdom:isDiscrete true • Support semantic queries • Template will be retrieved only for discrete datasets • Many other metadata constraints, e.g.: • Model7 is human readable • Dataset4 is normalized • … ?Dataset3 dcdom:isDiscrete true ?Dataset4 dcdom:isDiscrete true Model5 Model6 Model7 ?Model5 dcdom:isDiscrete true ?Model6 dcdom:isDiscrete true ?Model7 dcdom:isDiscrete true ?TestData dcdom:isDiscrete true

  42. Three Uses of Semantic Reasoning • Semantic enrichment of workflow templates • Enables workflow retrieval (discovery) • Workflow generation algorithm • Enables users to start with a high level specification of their analysis as an underspecified request • Metadata generation • Enables automatic generation of descriptions of new workflow data products for future reuse and provenance recording

  43. The Need for Semantic Reasoning for Workflow Generation • Users provide high-level (underspecified) input to the system: an unspecified parameter value, an underspecified algorithm

  44. Automatic Template-Based Workflow Generation Algorithm • Input: Workflow request • Needs to specify: • High-level workflow template to be used • Additional constraints on data • User does not need to specify: • Data sources to be used • Particular component to be used • Parameter settings for the components • Output: workflow instances ready to submit for execution • By reasoning about semantic constraints of workflow & components and the properties of datasets, the algorithm: • Maintains a pool of possible workflow candidates • Elaborates existing candidates by propagating constraints • Creates a new candidate when several choices are possible for datasets, components, and parameters • Eliminates invalid workflow candidates • Pipeline: unified well-formed request -> Seed workflow from request -> seeded workflows -> Find input data requirements -> binding-ready workflows -> Data source selection -> bound workflows -> Parameter selection -> configured workflows -> Workflow instantiation -> workflow instances -> Workflow grounding -> ground workflows -> Workflow ranking -> top-k workflows -> Workflow mapping -> executable workflows (see the sketch below)
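A minimal illustrative sketch of the candidate-pool idea (not the WINGS implementation): open component and parameter choices are branched over, a catalog constraint is propagated, and candidates that violate it are eliminated. All choices, constraints, and names below are hypothetical.

```python
# Minimal sketch of the candidate-pool search (not the WINGS implementation):
# branch over open component/parameter choices, then eliminate candidates that
# violate a catalog constraint. All choices, constraints, and names are made up.
from itertools import product

component_choices = {"modeler": ["ID3Modeler", "NaiveBayesModeler"]}
parameter_choices = {"sampleSize": [100, 1000]}
dataset_metadata  = {"trainingData": {"isDiscrete": False}}

def valid(candidate):
    # Hypothetical catalog constraint: ID3 requires discrete training data.
    if candidate["modeler"] == "ID3Modeler":
        return dataset_metadata["trainingData"]["isDiscrete"]
    return True

def generate_candidates():
    names = list(component_choices) + list(parameter_choices)
    value_lists = list(component_choices.values()) + list(parameter_choices.values())
    pool = [dict(zip(names, values)) for values in product(*value_lists)]
    return [c for c in pool if valid(c)]

for candidate in generate_candidates():
    print(candidate)
# Only the NaiveBayesModeler candidates survive for this non-discrete dataset.
```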

  45. Automatic Workflow Generation: Exploring the Workflow Design Space • Assigns sensible parameter values • Eliminates invalid workflows • Suggests the use of novel algorithms

  46. Three Uses of Semantic Reasoning • Semantic enrichment of workflow templates • Enables workflow retrieval (discovery) • Workflow generation algorithm • Enables users to start with a high level specification of their analysis as an underspecified request • Metadata generation • Enables automatic generation of descriptions of new workflow data products for future reuse and provenance recording

  47. Metadata Generation: An Example in a Seismic Hazard Workflow • Rupture variation files such as 127_6.txt.variation-s0000-h0000 and 127_6.txt.variation-s0000-h0001, each with metadata: source_id: 127, rupture_id: 6, slip_realization_#: 0, hypo_center_#: 1 • Rupture variation metadata (RVM) file 127_6.rvm with metadata: source_id: 127, rupture_id: 6 • Strain Green Tensor (SGT) file FD_SGT/PAS_1/A/SGT161 with metadata: site_name: PAS, tensor_direction: 1, time_period: A, xyz_volume_id: 161 • Seismogram Generation combines the rupture variations and SGTs to produce Seismogram_PAS_127_6.grm, whose metadata is generated automatically: site_name: PAS, source_id: 127, rupture_id: 6
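The pattern on this slide is that the metadata of a new data product is derived from the metadata of the inputs that produced it. A hedged Python sketch of that derivation for the seismogram example (field names follow the slide; the function itself is illustrative, not the WINGS metadata generator):

```python
# Illustrative sketch of metadata generation for a new data product, mirroring
# the seismogram example on this slide: the output's metadata is derived from
# the metadata of the inputs that produced it. Field names follow the slide.
def seismogram_metadata(sgt_meta, rupture_variation_meta):
    return {
        "site_name": sgt_meta["site_name"],
        "source_id": rupture_variation_meta["source_id"],
        "rupture_id": rupture_variation_meta["rupture_id"],
    }

sgt = {"site_name": "PAS", "tensor_direction": 1, "time_period": "A", "xyz_volume_id": 161}
variation = {"source_id": 127, "rupture_id": 6, "slip_realization": 0, "hypo_center": 1}

print(seismogram_metadata(sgt, variation))
# {'site_name': 'PAS', 'source_id': 127, 'rupture_id': 6}
# e.g., recorded as the description of Seismogram_PAS_127_6.grm for provenance and reuse.
```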

  48. Conclusions: Benefits of Semantic Workflows [Gil JSP-09] Execution management: • Automation of workflow execution • Managing distributed computation • Managing large data sets • Security and access control • Provenance recording • Low-cost high-fidelity reproducibility Semantics and reasoning: • Workflow retrieval and discovery • Automation of workflow generation • Systematic exploration of design space • Validation of workflows • Automated generation of metadata • Guarantees of data pedigree • "Conceptual" reproducibility
