Presentation Transcript


  1. Accelerating the Scientific Exploration Process with Scientific Workflows Ilkay ALTINTAS Assistant Director, National Laboratory for Advanced Data Research Manager, Scientific Workflow Automation Technologies Laboratory San Diego Supercomputer Center, UCSD

  2. Why do we need to know about scientific workflow systems?

  3. The answer lies in how today's science is being conducted! Traditional scientific process: Observe → Hypothesize → Conduct experiment → Analyze data → Compare results and Conclude → Predict. "A process cannot be understood by stopping it. Understanding must move with the flow of the process, must join it and flow with it." (First Law of Mentat), Frank Herbert, Dune.

  4. The answer lies in how today's science is being conducted! Today's scientific process: Observe → Hypothesize → Conduct experiment → Analyze data → Compare results and Conclude → Predict • Observing / Data: microscopes, telescopes, particle accelerators, X-rays, MRIs, microarrays, satellite-based sensors, sensor networks, field studies… • Analysis, Prediction / Models and model execution: potentially large computation and visualization • More to add to this picture: network, Grid, portals, +++

  5. Increasing Usage of Technology in Geosciences: The Wishlist • Online data acquisition and access • Managing large databases • Indexing data on spatial and temporal attributes • Quick subsetting operations • Large scale resource sharing and management • Collaborative and distributed applications • Parallel gridding algorithms on large data sets using high performance computing • Integrate data with other related data sets, e.g. geologic maps and hydrology models • Provide easy-to-use user interfaces from portals and scientific workflow environments

  6. High-Throughput Computational Chemistry Wishlist • Perform synchronously many embarrassingly parallel calculations with different molecules, configurations, methods and/or other parameters • Use existing data, codes, procedures, and resources with the usual sophistication and accuracy • Let different programs seamlessly communicate with each other • Integrate preparation and analysis steps • Distribute jobs automatically onto a computational grid • Provide interfaces to database infrastructures to obtain starting structures and to allow data mining of results • Stay as general, flexible, manageable, and reusable as possible

  7. SWF Systems Requirements • Design tools -- especially for non-expert users • Ease of use -- a fairly simple user interface with more complex features hidden in the background • Reusable generic features • Generic enough to serve different communities but specific enough to serve one domain (e.g. geosciences); customizable • Extensibility for the expert user • Registration, publication & provenance of data products and "process products" (= workflows) • Dynamic plug-in of data and processes from registries/repositories • Distributed WF execution (e.g. Web and Grid awareness) • Semantics awareness • WF deployment: as a web site, as a web service, as "power apps" (a la SCIRun II)

  8. So, what is a scientific workflow?

  9. The Big Picture: Supporting the Scientist. From "napkin drawings" (conceptual SWF) … to executable workflows (executable SWF). Shown here: John Blondin, NC State; Terascale Supernova Initiative, SciDAC, DOE (astrophysics).

  10. Phylogeny Analysis Workflows: multiple sequence alignment → phylogeny analysis → tree visualization, with data on local disk.

  11. Promoter Identification Workflow Source: Matt Coleman (LLNL)

  12. Promoter Identification Workflow

  13. Enter initial inputs, Run and Display results

  14. Custom Output Visualizer

  15. Mineral Classification Workflow

  16. PointInPolygon algorithm

  17. Enter initial inputs, Run and Display results

  18. Output Visualizers: browser display of results

  19. TSI Workflow-2 (D. Swesty)

  20. TSI-2 Workflow Overview (diagram): submit batch request at NERSC; check job status (queued / running / done, with a delay between checks); when running or done, identify newly completed files; transfer files to HPSS and to SB, deleting each file only after the transfers have completed correctly; extract and get variables, create neutrino vars and Chem vars, remap coordinates, derive other vars, write diagnostic file; generate plots, thumbnails, and a movie; update the web page (Tool-1 … Tool-4).
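The control logic behind such a monitoring workflow fits in a few lines. The following is an illustrative sketch only, not Kepler code or part of the TSI-2 workflow; every helper method (submitBatchJob, jobStatus, newCompletedFiles, and so on) is a hypothetical placeholder for the corresponding workflow step above.

```java
import java.util.List;

/** Illustrative control loop for a simulation-monitoring workflow like the one above.
 *  This is not Kepler code; every helper method below is a hypothetical placeholder
 *  for the corresponding workflow step. */
public class MonitorLoopSketch {
    public static void main(String[] args) throws InterruptedException {
        String jobId = submitBatchJob();                  // submit batch request at the HPC site
        while (!"DONE".equals(jobStatus(jobId))) {        // Queued or Running
            for (String file : newCompletedFiles()) {     // identify newly completed output files
                boolean archived = transferToArchive(file);      // e.g. to HPSS
                boolean staged = transferToAnalysisSite(file);   // e.g. to SB
                if (archived && staged) {
                    delete(file);                         // free scratch space only after both copies succeed
                    generatePlotsAndThumbnails(file);     // plots, thumbnails, movie frames, web-page update
                }
            }
            Thread.sleep(60_000);                         // delay before checking job status again
        }
    }

    // Hypothetical stubs so the sketch compiles; a real workflow would call actors instead.
    static String submitBatchJob() { return "job-1"; }
    static String jobStatus(String id) { return "DONE"; }
    static List<String> newCompletedFiles() { return List.of(); }
    static boolean transferToArchive(String f) { return true; }
    static boolean transferToAnalysisSite(String f) { return true; }
    static void delete(String f) { }
    static void generatePlotsAndThumbnails(String f) { }
}
```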

  21. TSI-2 Executable Workflow Screenshot

  22. TSI-2 Web Interface for Monitoring

  23. TSI-2 Workflow Running Interface

  24. CPES Fusion Simulation Workflow • Fusion simulation codes: (a) GTC; (b) XGC with M3D • e.g. (a) currently 4,800 (soon: 9,600) nodes Cray XT3; 9.6TB RAM; 1.5TB simulation data/run • GOAL: automate remote simulation job submission; continuous file movement to a secondary analysis cluster for dynamic visualization & simulation control; … with runtime-configurable observables • Workflow components (diagram): select job manager, submit simulation job, submit file-mover job, execution log (=> data provenance) • Overall architect (& prototypical user): Scott Klasky (ORNL); WF design & implementation: Norbert Podhorszki (UC Davis)

  25. CPES Analysis Workflow • Concurrent analysis pipeline (@ Analysis Cluster): convert; analyze; copy-to-Web-portal • Easy configuration, re-purposing • Screenshot annotations: reusable actor "class", specialized actor "instances", pipelined execution model, inline documentation, inline display, easy-to-edit parameter settings • Overall architect (& prototypical user): Scott Klasky (ORNL); WF design & implementation: Norbert Podhorszki (UC Davis)

  26. Scientific Workflow Systems • Combine data integration, analysis, and visualization steps into a larger, automated "scientific process" • Mission of scientific workflow systems: • Promote "scientific discovery" by providing tools and methods to generate scientific workflows • Create an extensible and customizable graphical user interface for scientists from different scientific domains • Support computational experiment creation, execution, sharing, reuse and provenance • Design frameworks which define efficient ways to connect to existing data and integrate heterogeneous data from multiple resources • Make technology useful through the user's monitor!!!

  27. Kepler is a Scientific Workflow System (www.kepler-project.org) • Ptolemy II: a laboratory for investigating design • KEPLER: a problem-solving environment for scientific workflow; KEPLER = "Ptolemy II + X" for scientific workflows • … and a cross-project collaboration • 1st beta release (June 2, 2006) • Builds upon the open-source Ptolemy II framework

  28. Kepler is a Team Effort • Contributing projects include Ptolemy II, Griddles, SKIDL, Resurgence, SRB, Cipres, NLADR, and LOOKING • Other contributors: Cheshire (UK Text Mining Center), DART (Great Barrier Reef, Australia), National Digital Archives + UCSD-TV (US), … • Contributor names and funding info are at the Kepler website!!

  29. Kepler Software Development Practice • How does this all work? • Joint CVS -- special rules! Projects like SDM, Cipres, and Resurgence have their specialized releases out of a common infrastructure • Open-source (BSD) • Website: wiki -- http://kepler-project.org • Communications: busy IRC channel; mailing lists (kepler-dev, kepler-users, kepler-members); telecons for design discussions • 6-monthly hackathons • Focus group meetings: workshops and conference calls

  30. Kepler Users • Scientists, engineers, researchers • Workflow developers and software developers • User interface users and batch users • Portals and other workflow systems using Kepler as an engine

  31. Actors are the Processing Components (Actor-Oriented Design) • Actor: encapsulation of parameterized actions; interface defined by ports and parameters • Port: communication of input and output data, without call-return semantics • Model of computation: communication semantics among ports; flow of control; implementation is a framework • Examples: Simulink (The MathWorks), LabVIEW (National Instruments), Easy 5x (Boeing), ROOM (real-time object-oriented modeling), ADL (Wright), …
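To make the actor/port/parameter terminology concrete, here is a minimal sketch of an actor written against the Ptolemy II Java API that Kepler builds on (TypedAtomicActor, TypedIOPort, Parameter, fire()); the scaling behaviour itself is just an illustrative example, not an actor shipped with Kepler.

```java
import ptolemy.actor.TypedAtomicActor;
import ptolemy.actor.TypedIOPort;
import ptolemy.data.DoubleToken;
import ptolemy.data.expr.Parameter;
import ptolemy.data.type.BaseType;
import ptolemy.kernel.CompositeEntity;
import ptolemy.kernel.util.IllegalActionException;
import ptolemy.kernel.util.NameDuplicationException;

/** Illustrative actor: multiplies each input token by a configurable factor. */
public class ScaleSketch extends TypedAtomicActor {

    public TypedIOPort input;
    public TypedIOPort output;
    public Parameter factor;

    public ScaleSketch(CompositeEntity container, String name)
            throws IllegalActionException, NameDuplicationException {
        super(container, name);
        // Ports define the actor's interface; data flows through them
        // without call-return semantics.
        input = new TypedIOPort(this, "input", true, false);
        output = new TypedIOPort(this, "output", false, true);
        input.setTypeEquals(BaseType.DOUBLE);
        output.setTypeEquals(BaseType.DOUBLE);
        // Parameters make the encapsulated action configurable.
        factor = new Parameter(this, "factor");
        factor.setExpression("2.0");
    }

    /** The director decides when to fire; the actor only defines what one firing does. */
    public void fire() throws IllegalActionException {
        super.fire();
        if (input.hasToken(0)) {
            double in = ((DoubleToken) input.get(0)).doubleValue();
            double f = ((DoubleToken) factor.getToken()).doubleValue();
            output.send(0, new DoubleToken(in * f));
        }
    }
}
```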

  32. Some actors in place for… • Generic Web Service Client and Web Service Harvester • Customizable RDBMS query and update • Command line wrapper tools (local, ssh, scp, ftp, etc.) • Some Grid actors: Globus Job Runner, GridFTP-based file access, Proxy Certificate Generator • SRB support • Native R and Matlab support • Interaction with Nimrod and APST • Communication with ORBs through actors and services • Imaging, gridding, vis support • Textual and graphical output • …more generic and domain-oriented actors…

  33. Directors are the WF Engines that… • Implement different computational models • Define the semantics of execution of actors and workflows, and of interactions between actors • Ptolemy and Kepler are unique in combining different execution models in heterogeneous models! • Kepler is extending Ptolemy directors with specialized ones for web-service-based workflows and distributed workflows • Available models of computation: dataflow, time triggered, synchronous/reactive, discrete event, wireless, process networks, rendezvous, publish and subscribe, continuous time, finite state machines
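As a rough illustration of how the director supplies the execution semantics for an otherwise unchanged graph of actors, the sketch below builds a tiny two-actor model with the Ptolemy II Java API and runs it under an SDF (dataflow) director; swapping in a different director would change only the line that creates it. The Ramp-to-Display model and the iteration count are illustrative choices, not part of any Kepler workflow.

```java
import ptolemy.actor.Manager;
import ptolemy.actor.TypedCompositeActor;
import ptolemy.actor.lib.Ramp;
import ptolemy.actor.lib.gui.Display;
import ptolemy.domains.sdf.kernel.SDFDirector;
import ptolemy.kernel.util.Workspace;

/** Sketch: the director, not the actors, fixes the model of computation. */
public class DirectorSketch {
    public static void main(String[] args) throws Exception {
        Workspace workspace = new Workspace("demo");
        TypedCompositeActor model = new TypedCompositeActor(workspace);
        model.setName("demo");

        // Swapping this one line for another director would run the same actor
        // graph under different execution semantics (e.g. process networks).
        SDFDirector director = new SDFDirector(model, "director");
        director.iterations.setExpression("5");    // fire the graph five times

        Ramp ramp = new Ramp(model, "ramp");       // produces 0, 1, 2, ...
        Display display = new Display(model, "display");
        model.connect(ramp.output, display.input); // dataflow edge between ports

        Manager manager = new Manager(workspace, "manager");
        model.setManager(manager);
        manager.execute();                         // the director schedules and fires the actors
    }
}
```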

  34. Dataflow as a Computation Model • Dataflow: abstract representation of how data flows in the system • Alternative to stored-program (von Neumann) execution • A dataflow program is a graph: nodes represent operations, edges represent data paths • Sound, simple, powerful model of parallel computation • NOT having a locus of control makes it simple! • Naturally distributed model of computation: asynchronous; many actors can be ready to fire simultaneously • Execution ("firing") of a node starts when (matching) data is available at the node's input ports • Locally controlled events: events correspond to the "firing" of an actor • An actor can be a single instruction or a sequence of instructions • Actors fire when all the inputs are available
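A toy sketch (plain Java, not Kepler or Ptolemy code) of the firing rule just described: each node holds a queue per input edge and fires as soon as every queue has a token, so there is no global locus of control and independent nodes may fire concurrently.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.BinaryOperator;

/** Toy illustration (not Kepler/Ptolemy code) of the dataflow firing rule:
 *  a node fires as soon as every one of its input queues holds a token. */
public class DataflowSketch {

    static class Node {
        final Queue<Double> left = new ArrayDeque<>();
        final Queue<Double> right = new ArrayDeque<>();
        final BinaryOperator<Double> op;
        Node(BinaryOperator<Double> op) { this.op = op; }

        boolean ready() { return !left.isEmpty() && !right.isEmpty(); }

        double fire() {   // consumes one token per input port, produces one result
            return op.apply(left.poll(), right.poll());
        }
    }

    public static void main(String[] args) {
        Node add = new Node((a, b) -> a + b);
        Node mul = new Node((a, b) -> a * b);

        // Tokens arrive asynchronously; there is no global locus of control.
        add.left.add(1.0);
        add.right.add(2.0);
        mul.right.add(10.0);

        if (add.ready()) {
            mul.left.add(add.fire());          // 1 + 2 flows downstream as a new token
        }
        if (mul.ready()) {
            System.out.println(mul.fire());    // prints 30.0
        }
    }
}
```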

  35. Vergil is the GUI for Kepler • Panes for data search and actor search • Actor ontology and semantic search for actors • Search -> drag and drop -> link via ports • Metadata-based search for datasets

  36. Actor Search • Kepler Actor Ontology • Used in searching actors and creating conceptual views (= folders) • Currently more than 200 Kepler actors added!

  37. Data Search and Usage of Results • Kepler DataGrid • Discovery of data resources through local and remote services: SRB, Grid and Web services, DB connections • Registration of datasets on the fly using workflows

  38. Kepler can be used as a batch execution engine • Driven from a portal, executing on the Grid • Pipeline stages (diagram): subset → move → process (analyze) → move → render → display (visualize), with configuration, monitoring/translation, and scheduling/output-processing phases • Subset: DB2 query on DataStar • Interpolate: Grass RST, Grass IDW, GMT… • Visualize: Global Mapper, FlederMaus, ArcIMS

  39. Kepler System Architecture (layered diagram): Kepler GUI extensions (including authentication GUI) on top of Vergil; Kepler core extensions below (documentation, smart re-run / failure recovery, provenance framework, Kepler Object Manager, SMS, type system extensions, actor & data search); Ptolemy at the base.

  40. Kepler Authentication Framework • Actors manage data, programs, and computing resources in distributed & heterogeneous environments, under various secure administration • How can ONE system handle all of the authentication jobs? • Data: database, SRB, XML, file system, … • Programs: command line, MPI parallel, online CGI, web service, Grid application • Resources: mobile device, laptop, desktop, cluster, supercomputer, Grid • Job management: OS, Condor, PBS qsub, GRAM, web portal, …

  41. Kepler Archives • Purpose: Encapsulate WF data and actors in an archive file • … inlined or by reference • … version control • More robust workflow exchange • Easy management of semantic annotations • Plug-in architecture (Drop in and use) • Easy documentation updates • A jar-like archive file (.kar) including a manifest • All entities have unique ids (LSID) • Custom object manager and class loader • UI and API to create, define, search and load .kar files
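Since a KAR file is described above as a jar-like archive with a manifest, the sketch below shows one way such an archive could be assembled using only the standard java.util.jar API; the entry names, the "lsid" manifest attribute, and the layout are illustrative assumptions, not the official KAR specification.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

/** Sketch of packing a workflow and its annotations into a jar-like .kar archive.
 *  Uses only the standard java.util.jar API; entry names and manifest attributes
 *  here are illustrative assumptions, not the actual KAR format. */
public class KarPackSketch {
    public static void main(String[] args) throws Exception {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        // Every archived entity carries a unique life science identifier (LSID).
        attrs.putValue("lsid", "urn:lsid:kepler-project.org:workflow:1:1");

        try (JarOutputStream kar =
                new JarOutputStream(new FileOutputStream("myWorkflow.kar"), manifest)) {
            for (String name : new String[] {"workflow.xml", "annotations.owl"}) {
                kar.putNextEntry(new JarEntry(name));   // entities can be inlined or referenced
                try (FileInputStream in = new FileInputStream(name)) {
                    in.transferTo(kar);
                }
                kar.closeEntry();
            }
        }
    }
}
```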

  42. Kepler Object Manager • Designed to access local and distributed objects • Objects: data, metadata, annotations, actor classes, supporting libraries, native libraries, etc. archived in kar files • Advantages: • Reduce the size of Kepler distribution • Only ship the core set of generic actors and domains • Easy exchange of full or partial workflows for collaborations • Publish full workflows with their bound data • Becomes a provenance system for derived data objects => Separate workflow repository and distributions easily

  43. Kepler Provenance Framework • OPTIONAL! • Modeled as a separate concern in the system • Listens to the execution and saves information customized by a set of parameters • Context: who, what, where, when, and why that is associated with the run • Input data and its associated metadata • Workflow outputs and intermediate data products • Workflow definition (entities, parameters, connections): a specification of what exists in the workflow and can have a context of its own • Information about the workflow evolution -- workflow trail • Types of Provenance Information: • Data provenance • Intermediate and end results including files and db references • Process provenance • Keep the wf definition with data and parameters used in the run • Error and execution logs • Workflow design provenance
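One way to picture "listening to the execution" as a separate, optional concern is a listener attached to the workflow's manager. The sketch below uses the Ptolemy II ExecutionListener interface that Kepler inherits; the log file and record format are illustrative assumptions, not the actual Kepler provenance schema.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.time.Instant;
import ptolemy.actor.ExecutionListener;
import ptolemy.actor.Manager;

/** Sketch of provenance recording as a separate, optional concern: a listener that
 *  only observes the execution and appends run context to a log. ExecutionListener
 *  and Manager are Ptolemy II classes; the log format is an illustrative assumption. */
public class ProvenanceListenerSketch implements ExecutionListener {

    public void managerStateChanged(Manager manager) {
        record("state=" + manager.getState());     // e.g. preinitializing, iterating, idle
    }

    public void executionFinished(Manager manager) {
        record("execution finished");              // end of the run
    }

    public void executionError(Manager manager, Throwable throwable) {
        record("error=" + throwable);              // error and execution logs
    }

    private void record(String event) {
        try (FileWriter log = new FileWriter("provenance.log", true)) {
            // Context: who and when, plus what happened.
            log.write(Instant.now() + " user=" + System.getProperty("user.name")
                    + " " + event + System.lineSeparator());
        } catch (IOException ignored) {
            // Provenance is optional; never fail the run because logging failed.
        }
    }
}
// Attach with: manager.addExecutionListener(new ProvenanceListenerSketch());
```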

  44. Kepler Provenance Recording Utility • Parametric and customizable • Different report formats • Variable levels of detail • Verbose-all, verbose-some, medium, on error • Multiple cache destinations • Saves information on • User name, Date, Run, etc…

  45. What other system functions does provenance relate to? • Failure recovery • Smart re-runs (re-run only the updated/failed parts) • Semantic extensions • Kepler Data Grid • Reporting and documentation (guided documentation generation and updates) • Authentication • Data registration
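As a toy illustration (plain Java, not Kepler code) of how recorded provenance enables smart re-runs, the sketch below keys cached results by a step's definition and input, so only steps whose definition or inputs changed, or that previously failed, are executed again; all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy illustration (not Kepler code) of a provenance-backed smart re-run:
 *  a step executes only when no cached result exists for its (definition, input) key,
 *  so only the updated or failed parts of a workflow run again. */
public class SmartRerunSketch {

    private final Map<String, Object> provenanceCache = new HashMap<>();

    Object runStep(String stepDefinition, Object input, Function<Object, Object> step) {
        String key = stepDefinition + "|" + input;          // hypothetical provenance key
        return provenanceCache.computeIfAbsent(key, k -> {
            System.out.println("re-running " + stepDefinition);
            return step.apply(input);                        // executed only on a cache miss
        });
    }

    public static void main(String[] args) {
        SmartRerunSketch wf = new SmartRerunSketch();
        wf.runStep("align-v1", "sequences.fasta", in -> "alignment");   // runs
        wf.runStep("align-v1", "sequences.fasta", in -> "alignment");   // skipped: nothing changed
        wf.runStep("align-v2", "sequences.fasta", in -> "alignment");   // runs: step definition updated
    }
}
```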

  46. Advantages of Scientific Workflow Systems • Formalization of the scientific process • Easy to share, adapt and reuse • Deployable, customizable, extensible • Management of complexity and usability • Support for hierarchical composition • Interfaces to different technologies from a unified interface • Can be annotated with domain-knowledge • Tracking provenance of the data and processes • Keep the association of results to processes • Make it easier to validate/regenerate results and processes • Enable comparison between different workflow versions • Execution monitoring and fault tolerance • Interaction with multiple tools and resources at once

  47. Evolving Challenges For Scientific Workflows • Access to heterogeneous data and computational resources and link to different domain knowledge • Interface to multiple analysis tools and workflow systems • One size doesn't fit all! • Support computational experiment creation, execution, sharing, reuse and provenance • Manage complexity, user and process interactivity • Extensions for adaptive and dynamic workflows • Track provenance of workflow design (= evolution), execution, and intermediate and final results • Efficient failure recovery and smart re-runs • Support various file and process transport mechanisms: main memory, Java shared file system, …

  48. Evolving Challenges For Scientific Workflows • Support the full scientific process • Use and control instruments, networks and observatories in observing steps • Scientifically and statistically analyze and control the data collected by the observing steps, • Set up simulations as testbeds for possible observatories • Come up with efficient and intuitive workflow deployment methods • Do all these in a secure and usable way!!!
