Building Scientific Workflows with Taverna and BPEL: a Comparative Study in caGrid

Wei Tan1, Paolo Missier2, Ravi Madduri1, Ian Foster1 (foster@mcs.anl.gov, http://www-fp.mcs.anl.gov/~foster/). 1 University of Chicago and Argonne National Laboratory, USA. 2 School of Computer Science, University of Manchester, Manchester, U.K.

Agenda • Introduction to caGrid • Why scientific workflows in caGrid? • BPEL and Taverna comparison • Service discovery • Service composition & workflow execution • Data-driven vs. control-driven modeling • Implicit vs. explicit definition of data • Implicit vs. explicit iteration on data • Workflow result analysis • Conclusion

Introduction: caBIG and caGrid

As of Oct 19, 2008: 122 participants, 105 services (70 data, 35 analytical)

Introduction: caGrid and workflow [Figure: the scientific workflow lifecycle (discovery, composition, execution, analysis, and community reuse) on top of caGrid, which provides connectivity, virtualization, and security over instruments, data, and computation resources]

Challenges faced by caGrid users • Locating needed services • Determining function • Accessing services from a workflow • GUI for building workflows easily • Persisting and visualizing results • Executing workflows efficiently • Sharing and reusing workflows [Figure: workflow lifecycle stages (discovery, composition, execution, analysis, community) on caGrid]

Our goals in this paper • Communicate practical experiences based on our work in the caGrid project • Cover the entire scientific workflow lifecycle, from service discovery to service composition, workflow execution, and workflow result analysis • Based on caGrid requirements for workflow language and tooling • Also applicable to other areas in data-intensive and exploratory science?

BPEL and Taverna • Not the only two options, but representative choices • BPEL • XML-based specification of process behavior built on web services • Industry standard adopted by IBM, SAP, Oracle, etc. • Has also attracted attention from the scientific community because of its support for the SOA paradigm • Taverna • Open-source, from the myGrid consortium in the UK • Design and execution of scientific workflows • Plug-in architecture for extension (access more applications, visualize more data types, etc.)

Querying semantic data in cancer research • Identify description logic concepts relating to a particular context, e.g., “caCore” • Query all projects related to context “caCore” • Find UML classes in each project • Use project and UML class information to query the semantic metadata • Retrieve the concept code • We adopt this query as a use case to guide our comparison

Support for service discovery • Before building a workflow • Need to find appropriate services to be composed • Service endpoints are not naturally known to users • Exact semantics of those services are not known • Taverna offers • An extensible scavenger interface for arbitrary service discovery according to users' needs (see next page) • A native semantic discovery facility called Feta: myGrid ontology-based service annotation and search • BPEL offers • UDDI, which is not widely adopted • Research efforts such as WSMO and OWL-S, which operate mostly at the specification level • No open-source tool is available that works with a service query component in an integrated way

Solution for caGrid: Metadata-based service query • caGrid service metadata • Types of query • String based • Property based • Semantic based • caGrid scavenger: query the CaDSR Service in the use case 1. Semantic/metadata based service discovery. 2. Build a workflow using the services obtained by discovery. 3. Execute the workflow and view the results.
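The three query types listed on this slide can be illustrated with a minimal sketch. It does not use the caGrid index service or the Taverna scavenger API: the `ServiceRecord` type, the in-memory `REGISTRY`, the second service entry, and the concept codes are all hypothetical placeholders chosen only to show how string, property, and semantic matching differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ServiceRecord:
    """Hypothetical service metadata record (not the caGrid metadata schema)."""
    name: str
    description: str
    properties: Dict[str, str] = field(default_factory=dict)
    concepts: Set[str] = field(default_factory=set)  # placeholder concept codes


# Hypothetical registry contents, for illustration only.
REGISTRY: List[ServiceRecord] = [
    ServiceRecord("CaDSRService", "Query data standards metadata",
                  {"type": "data"}, {"C-METADATA"}),
    ServiceRecord("ExampleAnalysis", "Hypothetical analytical service",
                  {"type": "analytical"}, {"C-CLUSTERING"}),
]


def by_string(records: List[ServiceRecord], text: str) -> List[str]:
    """String-based query: case-insensitive substring match on name and description."""
    needle = text.lower()
    return [r.name for r in records
            if needle in (r.name + " " + r.description).lower()]


def by_property(records: List[ServiceRecord], key: str, value: str) -> List[str]:
    """Property-based query: exact match on one metadata property."""
    return [r.name for r in records if r.properties.get(key) == value]


def by_concept(records: List[ServiceRecord], concept: str) -> List[str]:
    """Semantic-based query: match on an annotated concept code."""
    return [r.name for r in records if concept in r.concepts]
```

A real semantic query would also reason over the ontology (for example, matching sub-concepts); the exact-match lookup here is a simplification.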

Service composition & workflow execution • Data-driven vs. control-driven modeling • Implicit vs. explicit definition of data • Implicit vs. explicit iteration on data

Data-driven vs. control-driven modeling Comparison of BPEL and Taverna (Scufl) w.r.t. control/data-flow

Implicit vs. explicit definition of data • Taverna • Processors have input/output ports with an associated data type • Data travels from the output port of a processor to the input port of one or more downstream processors • Interaction among processors is defined entirely by the arcs in the dataflow graph • BPEL • Requires the explicit definition of variables, and explicit initialization for complex types • Data are shared among activities (i.e., are global) • More complexity, but more power and flexibility in data handling
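The contrast on this slide can be sketched in a few lines of Python. This is neither Scufl nor BPEL: `run_dataflow`, `run_with_variables`, and the string-processing steps are hypothetical stand-ins. The first function passes values only along arcs between processors; the second declares shared variables up front and assigns them explicitly, as a BPEL process would.

```python
from typing import Any, Callable, Dict, List, Tuple


def run_dataflow(source_value: Any,
                 processors: Dict[str, Callable[[Any], Any]],
                 arcs: List[Tuple[str, str]],
                 start: str) -> Any:
    """Dataflow style: each output travels along an arc to the next input.

    Simplification: handles a linear chain only (one downstream processor each).
    """
    next_of = dict(arcs)
    name = start
    value = processors[name](source_value)
    while name in next_of:
        name = next_of[name]
        value = processors[name](value)
    return value


def run_with_variables(context: str) -> str:
    """Variable style: shared variables are declared, then assigned by each step."""
    variables: Dict[str, Any] = {"request": None, "trimmed": None, "response": None}
    variables["request"] = context                           # explicit initialization
    variables["trimmed"] = variables["request"].strip()      # activity 1
    variables["response"] = variables["trimmed"].upper()     # activity 2
    return variables["response"]


# The same two steps expressed as processors connected by a single arc.
processors = {"trim": str.strip, "upper": str.upper}
arcs = [("trim", "upper")]
```

In the variable style any step can read any variable, which is where the extra flexibility (and the extra bookkeeping) comes from.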

Implicit vs. explicit iteration on data • Implicit iteration in Taverna • Occurs when an input port receives a list where a single element is expected • E.g., a processor that outputs a “list of strings” can legally be connected to a processor with an input port of type “string” • Taverna interprets this type mismatch as an indication that the destination processor must be invoked repeatedly, once for each element of the input list • This behavior is defined by Taverna's functional programming model • Explicit iteration in BPEL • BPEL does not allow type mismatches, so iteration must be defined explicitly • Again, BPEL offers more flexibility to define more advanced iteration patterns (with more complexity in the model, though)
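The implicit-iteration rule described on this slide can be sketched as a depth-mismatch check. This is an illustration of the idea, not Taverna's engine: `depth`, `invoke`, and the `expected_depth` parameter are hypothetical names, and only the single-input case is covered.

```python
from typing import Any, Callable


def depth(value: Any) -> int:
    """Nesting depth of a value: 0 for a single item, 1 for a flat list, and so on."""
    if isinstance(value, list):
        return 1 + max((depth(item) for item in value), default=0)
    return 0


def invoke(processor: Callable[[Any], Any], value: Any, expected_depth: int = 0) -> Any:
    """Call the processor once if depths match; otherwise once per list element."""
    if isinstance(value, list) and depth(value) > expected_depth:
        # Depth mismatch: iterate implicitly, preserving the list structure.
        return [invoke(processor, item, expected_depth) for item in value]
    return processor(value)


def to_upper(text: str) -> str:
    """A processor whose input port expects a single string."""
    return text.upper()


# A list of strings fed to a port of type string triggers one call per element.
result = invoke(to_upper, ["brca1", "tp53"])
```

Because each element is handled independently, an engine is free to run these invocations in parallel; the sketch runs them sequentially.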

Implicit vs. explicit iteration in CaDSR • findProjects returns an array Project [] • findClassesInProject receives type Project and finds all UML classes in this (single) project • In Taverna an xmlsplitter extracts the project array and feeds this directly into findClassesInProject • In BPEL a ForEach construct is needed for the iteration over array Project []
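The two styles for this CaDSR step can be compared side by side in a minimal sketch. The functions below are hypothetical stand-ins named after the operations on the slide; they do not call the real CaDSR service, and the project and class names in `_FAKE_CADSR` are invented for illustration.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the CaDSR service; not real caGrid data.
_FAKE_CADSR: Dict[str, List[str]] = {
    "ProjectA": ["Gene", "Protein"],
    "ProjectB": ["Specimen"],
}


def find_projects(context: str) -> List[str]:
    """Stand-in for findProjects: returns an array of projects."""
    return list(_FAKE_CADSR)


def find_classes_in_project(project: str) -> List[str]:
    """Stand-in for findClassesInProject: accepts a single project."""
    return _FAKE_CADSR[project]


def dataflow_style(context: str) -> List[List[str]]:
    """Taverna-like: connect the array output directly; iteration is implied."""
    return list(map(find_classes_in_project, find_projects(context)))


def foreach_style(context: str) -> List[List[str]]:
    """BPEL-like: declare variables and spell out the ForEach loop."""
    projects = find_projects(context)            # explicit variable for the array
    classes_per_project: List[List[str]] = []    # explicit result variable
    for counter in range(len(projects)):         # explicit counter, as in ForEach
        current_project = projects[counter]
        classes_per_project.append(find_classes_in_project(current_project))
    return classes_per_project
```

Both functions return the same nested list; the difference is only in how much of the iteration the workflow author has to write down.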

Workflow result analysis • Workflow provides a natural framework for data tracking and analysis, in both Taverna and BPEL • Taverna offers native provenance support • More precise linkage annotation between services' inputs and outputs • Semantic support • Not the focus of our project; see refs. [16], [17] for more details

Conclusion: Taverna offers lifecycle support • Provides a compact set of primitives that eases the modeling of data flows • Allows users to specify “what to do” instead of “how to do it” • Scufl: compact modeling of data flow • Built-in processors: Soaplab, BioMart, etc. • Customized processors as plug-ins • Scavenger: for customized service discovery • Feta: service annotation and discovery • Result persistence and visualization • Implicit iteration: handles parallel execution • A community for sharing workflows [Figure: workflow lifecycle stages (discovery, composition, execution, analysis, community) on caGrid]

Conclusion: BPEL offers unique features • Build-time • A comprehensive set of primitives to model processes of all flavors • Control-flow oriented • Data-flow oriented (although a little verbose) • Event driven, etc. • Full featured • Process logic, data manipulation, event and message processing, fault handling, etc. • Run-time • BPEL engines typically run inside application servers with • Persistent state storage • Reliability and scalability guarantees • Important for long-running and computation-intensive workflows • For now, the Taverna engine does not provide these capabilities

Conclusion • Factors in deciding which language/tool to choose • User IT expertise • Some prefer a scripting language, others a friendly GUI • Problem size • Taverna often runs on the desktop and handles problems of moderate size (currently common in bioinformatics) • Grid/server-based systems like Swift can deal with huge volumes of data and intensive computation (for example, applications in medical informatics, neuroscience, physics) • Applications involved • Web services, batch jobs, shell scripts, etc. • Future work • Enrich the caGrid workflow tool set based on Taverna • Build more real workflows to help scientific investigation • Address issues of scale as they arise

Thank you for your attention

