Kepler: Towards a Grid-Enabled System for Scientific Workflows

Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher*, Steve Mock

[email protected]

San Diego Supercomputer Center (SDSC)

University of California, San Diego (UCSD)


Outline

  • Motivation: Scientific Workflows (SEEK, SDM, GEON, ..)

  • Current Features of the Kepler Scientific Workflows System

  • Extending Kepler:

    • Grid-Enabling Kepler:

      • 3rd party transfer

    • WF planning & optimization

      • Shipping and Handling Algebra (SHA)

      • Web Service Composition as Declarative Query Plans

    • Semantic Types for Scientific Workflows

  • Conclusions


Kepler Team, Projects, Sponsors

  • Ilkay Altintas: SDM

  • Chad Berkley: SEEK

  • Shawn Bowers: SEEK

  • Jeffrey Grethe: BIRN

  • Christopher H. Brooks: Ptolemy II

  • Zhengang Cheng: SDM

  • Efrat Jaeger: GEON

  • Matt Jones: SEEK

  • Edward A. Lee: Ptolemy II

  • Kai Lin: GEON

  • Bertram Ludäscher: BIRN, GEON, SDM, SEEK

  • Steve Mock: NMI

  • Steve Neuendorffer: Ptolemy II

  • Jing Tao: SEEK

  • Mladen Vouk: SDM

  • Yang Zhao: Ptolemy II


Example: SEEK– Science Environment for Ecological Knowledge (large NSF ITR)

  • Analysis & Modeling System

    • Design and execution of ecological models and analysis

    • End user focus

    • application-/upperware

  • Semantic Mediation System

    • Data Integration of hard-to-relate sources and processes

    • Semantic Types and Ontologies

    • upper middleware

  • EcoGrid

    • Access to ecology data and tools

    • middle-/underware

Architecture Overview

(cf. Cyberinfrastructure)


Ecology: GARP Analysis Pipeline for Invasive Species Prediction

[Figure: pipeline diagram. Steps shown: EcoGrid Query (against registered EcoGrid databases), Layer Integration, Data Calculation on sample data, GARP, Map Generation, Validation, Generate Metadata, Archive to EcoGrid. Data products labeled: (a) species presence & absence points, (b) environmental layers, (c) integrated layers, each for the native range and the invasion area; (d) training sample and test sample; (e) GARP rule set; (f) native range and invasion area prediction maps; (g) model quality parameter; (h) selected prediction maps.]

Source: NSF SEEK (Deana Pennington et al., UNM)


Genomics Example: Promoter Identification Workflow (PIW)

Source: Matt Coleman (LLNL)


Source: NIH BIRN (Jeffrey Grethe, UCSD)


Scientific “Workflows”: Some Findings

  • More dataflow than (business control-/) workflow

    • DiscoveryNet, Kepler, SCIRun, Scitegic, Taverna, Triana, …

  • Need for “programming extension”

    • Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)

  • Need for abstraction and nested workflows

  • Need for data transformations (WS1 → DT → WS2)

  • Need for rich user interaction & workflow steering:

    • pause / revise / resume

    • select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF

  • Need for high-throughput transfers (“grid-enabling”, “streaming”)

  • Need for persistence of intermediate products and provenance
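
The “programming extensions” named in the findings above map onto standard list combinators. The following is a minimal Haskell sketch; the sample data and the names `readings`, `calibrate`, and `label` are illustrative assumptions and do not come from any of the listed systems.

```haskell
-- Illustrative only: list-style constructs a workflow language might expose.
readings :: [Double]
readings = [0.5, 12.0, 7.25, 30.0]

-- Hypothetical per-item processing step.
calibrate :: Double -> Double
calibrate x = x * 2

-- Hypothetical step that combines an index with a value.
label :: Int -> Double -> String
label i x = show i ++ ": " ++ show x

main :: IO ()
main = do
  let kept      = filter (< 20) readings           -- filtering
      processed = map (calibrate . (+ 1)) kept     -- map(f) with functional composition
      labelled  = zipWith label [1 ..] processed   -- zip two lists and combine pairwise
  mapM_ putStrLn labelled                          -- foreach: one action per list item
```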


In a Flux: Workflow “Standards”

Source: W.M.P. van der Aalst et al. http://tmitwww.tm.tue.nl/research/patterns/

http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html


Commercial & Open Source Scientific “Workflow” (well Dataflow) Systems

Kensington Discovery Edition from InforSense

Triana

Taverna


SCIRun: Problem Solving Environments for Large-Scale Scientific Computing

  • SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations

  • New collaboration under Kepler/SDM

  • Component model, based on generalized dataflow programming

Steve Parker (cs.utah.edu)


Our Starting Point: Ptolemy II & Dataflow Process Networks

[Figure callouts: see! read! try!]

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/


Why Ptolemy II?

  • Ptolemy II Objective:

    • “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”

  • Data & Process oriented: Dataflow process networks

  • Natural Data Streaming Support

  • User-Orientation

    • (“application-ware”, not middle-/under-ware)

    • Workflow design & exec console (Vergil GUI)

  • PRAGMATICS

    • mature, actively maintained, well-documented (500+pp)

    • open source system

    • developed across multiple projects (NSF/ITRs SEEK and GEON, DOE SciDAC SDM, …)

    • hoping to leverage e-sister projects (e.g. Taverna, …)


Dataflow Process Networks: Putting Computation Models (“Orchestration”) first!

[Figure: two actors with typed i/o ports, connected by a FIFO channel]

  • Synchronous Dataflow Network (SDF)

    • Statically schedulable single-threaded dataflow

      • Can execute multi-threaded, but the firing-sequence is known in advance

    • Maximally well-behaved, but also limited expressiveness

  • Process Network (PN)

    • Multi-threaded dynamically scheduled dataflow

    • More expressive than SDF (dynamic token rate prevents static scheduling)

    • Natural streaming model

  • Other Execution Models (“Domains”)

    • Implemented through different “Directors”

advanced push/pull
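
To make “statically schedulable” concrete: in SDF every actor produces and consumes a fixed number of tokens per firing, so the number of firings per schedule iteration can be computed before execution by solving balance equations. The sketch below does this for a simple chain of actors. It is an illustration under stated assumptions, not Ptolemy II code; the `Edge` type and the `repetitions` function are hypothetical names.

```haskell
import Data.Ratio (denominator, numerator, (%))

-- An edge between two consecutive actors in a chain: tokens produced per
-- firing by the upstream actor, tokens consumed per firing by the downstream one.
data Edge = Edge Integer Integer

-- Smallest positive firing counts r such that, on every edge,
-- r_upstream * produced == r_downstream * consumed (the balance equations).
repetitions :: [Edge] -> [Integer]
repetitions edges = map (\r -> numerator (r * fromInteger scale)) rates
  where
    -- Relative firing rates, with the first actor fixed at 1.
    rates = scanl (\r (Edge p c) -> r * (p % c)) 1 edges
    -- Scaling by the lcm of all denominators yields the smallest integer solution.
    scale = foldr (lcm . denominator) 1 rates

main :: IO ()
main = print (repetitions [Edge 2 3, Edge 1 2])  -- prints [3,2,1]
```

With dynamic token rates, as in PN, no such fixed vector exists, which is why PN needs dynamic scheduling.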


Actor-/Dataflow Orientation vs. Object-/Control-flow Orientation

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/


Marrying or Divorcing Control- & Dataflow

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/


Overview: Scientific Workflows in Kepler

  • Modeling and Workflow Design

  • Web services = individual components (“actors”)

  • “Minute-Made” Application Integration:

    • Plugging-in and harvesting web service components is easy, fast

  • Rich SWF modeling semantics (“directors”):

    • Different and precise dataflow models of computation

    • Clear and composable component interaction semantics

       Web service composition and application integration tool

  • Coming soon:

    • Shrink-wrapped, pre-packaged “Kepler-to-Go”

    • Structural and semantic typing (better design support)

    • Grid-enabled web services (for big data, big computations,…)

    • Different deployment models (web service, web site, applet, …)


The KEPLER GUI: Vergil (Steve Neuendorffer, Ptolemy II)

Drag and drop utilities, director and actor libraries.


Running a Genomics WF (Ilkay Altintas, SDM)


Support for Multiple Workflow Granularities

Abstraction: Sand to Rocks

[Figure labels: Boulders, Plumbing, Powder, Sand]


Directors and Combining Different Component Interaction Semantics

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/


Application Examples: Mineral Classification with Kepler … (Efrat Jaeger, GEON)


… inside the Classifier


Standard BrowserUI: Client-Side SVG


SWF Reengineering (Ashraf, Efrat, Kai, GEON)


DataMapper Sub-Workflow


Result launched via BrowserUI actor (coupling with ESRI’s ArcIMS)


Distributed Workflows in KEPLER

  • Web and Grid Service plug-ins

    • WSDL (now) and Grid services (stay tuned …)

    • ProxyInit, GlobusGridJob, GridFTP, DataAccessWizard

    • SSH, SCP, SDSC SRB, OGS?-???… coming

  • WS Harvester

    • Import query-defined WS operations as Kepler actors

  • XSLT and XQuery Data Transformers

    • to link not “designed-to-fit” web services

  • WS-deployment interface (planned)


Generic Web Service Actor (Ilkay Altintas)

  • Given a WSDL and the name of an operation of a web service, the actor dynamically customizes itself to implement and execute that operation.

[Screenshot: Configure, select service operation]


Set Parameters and Commit

[Screenshot: set parameters and commit]


Specialized WS Actor (after instantiation)


Web Service Harvester (Ilkay Altintas, SDM)

  • Imports the web services in a repository into the actor library.

  • Has the capability to search for web services based on a keyword.


Composing 3rd-Party WSs (NMI, Steve Mock)

[Figure labels: Output of previous web service; User interaction & Transformations; Input of next web service]


A Special Generic Ingestion Actor for EML Data (SEEK, Chad Berkley)

  • Ingests any data format described by EML metadata

  • Converts raw data to Ptolemy format

  • Data can then be operated on with other actors


Wrapping Legacy Applications


Promoter Identification Workflow (PIW)

Source: Matt Coleman (LLNL)


Promoter Identification Workflow in Ptolemy-II [SSDBM’03]

[Figure label: Execution Semantics]


[Figure annotations on the Ptolemy-II PIW:]

  • designed to fit

  • hand-crafted Web-service actor

  • hand-crafted control solution; also: forces sequential execution!

  • No data transformations available

  • Complex backward control-flow


Promoter Identification Workflow in FP

genBankG       :: GeneId -> GeneSeq
genBankP       :: PromoterId -> PromoterSeq
blast          :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac       :: PromoterRegion -> [TFBS]
gpr2str        :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"                -- start with some gene-id
d1 = genBankG d0            -- get its gene sequence from GenBank
d2 = blast d1               -- BLAST to get a list of potential promoters
d3 = map genBankP d2        -- get list of promoter sequences
d4 = map promoterRegion d3  -- compute list of promoter regions and ...
d5 = map transfac d4        -- ... get transcription factor binding sites
d6 = zip d2 d4              -- create list of pairs promoter-id/region
d7 = map gpr2str d6         -- pretty print into a list of strings
d8 = concat d7              -- concat into a single "file"
d9 = putStr d8              -- output that file
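
The slide gives only type signatures for the services, so the plan above does not compile on its own. The sketch below makes it runnable by adding stub implementations. Every type definition and function body here is a placeholder assumption (the real steps call GenBank, BLAST, and TRANSFAC); only the signatures and the shape of the pipeline come from the slide.

```haskell
-- Stub types; the string payloads are placeholders, not real sequence data.
newtype GeneId         = Gid String
newtype GeneSeq        = GeneSeq String
newtype PromoterId     = Pid String
newtype PromoterSeq    = PromoterSeq String
newtype PromoterRegion = PromoterRegion String
newtype TFBS           = TFBS String

-- Stub services with the signatures from the slide.
genBankG :: GeneId -> GeneSeq
genBankG (Gid g) = GeneSeq ("seq-of-gene-" ++ g)

genBankP :: PromoterId -> PromoterSeq
genBankP (Pid p) = PromoterSeq ("seq-of-" ++ p)

blast :: GeneSeq -> [PromoterId]
blast (GeneSeq s) = [Pid (s ++ "/p" ++ show n) | n <- [1 :: Int, 2]]

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion (PromoterSeq s) = PromoterRegion ("region(" ++ s ++ ")")

transfac :: PromoterRegion -> [TFBS]
transfac (PromoterRegion r) = [TFBS ("tfbs@" ++ r)]

gpr2str :: (PromoterId, PromoterRegion) -> String
gpr2str (Pid p, PromoterRegion r) = p ++ "\t" ++ r ++ "\n"

-- The pipeline d0..d8 packaged as one function of a gene id.
-- (d5, the TRANSFAC step, is computed on the slide but not used in the output.)
piw :: GeneId -> String
piw d0 = concat (map gpr2str (zip d2 d4))
  where
    d2 = blast (genBankG d0)
    d4 = map (promoterRegion . genBankP) d2

main :: IO ()
main = putStr (concat (map piw [Gid "7", Gid "8"]))  -- map piw over a list of gene ids
```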


Cleaned up Process Network PIW

  • Back to purely functional dataflow process network (= also a data streaming model!)

  • Re-introducing map(f) to Ptolemy-II (was there in PT Classic)

  • no control-flow spaghetti

  • data-intensive apps

  • free concurrent execution

  • free type checking

  • automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]

[Figure annotations: map(f)-style iterators; Powerful type checking; Generic, declarative “programming” constructs; Generic data transformation actors; Forward-only, abstractable sub-workflow piw(GeneId)]


Optimization by Declarative Rewriting I

  • PIW as a declarative, referentially transparent functional process

  • optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g)

  • Technical report & PIW specification in Haskell

[Figure annotations: map(f o g) instead of map(f) o map(g); Combination of map and zip]

http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
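
The two rewrites named on this slide can be checked directly in Haskell. This is a small sketch; `f`, `g`, `xs`, and `ys` are arbitrary illustrative choices, and the equalities hold for any pure functions and lists of matching types.

```haskell
f, g :: Int -> Int
f = (* 2)
g = (+ 3)

xs, ys :: [Int]
xs = [1 .. 5]
ys = [10, 20 .. 50]

main :: IO ()
main = do
  -- Map fusion: two list traversals are rewritten into a single one.
  print (map f (map g xs) == map (f . g) xs)
  -- Combining map and zip: zip followed by map collapses into one zipWith.
  print (map (uncurry (+)) (zip xs ys) == zipWith (+) xs ys)
```

Both lines print `True`. In a workflow setting the fused form saves one intermediate list, which corresponds to one intermediate data product that never needs to be materialized or shipped.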


Optimizing II: Streams & Pipelines

  • Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens); e.g. mapS f • mapS g = mapS (f • g)

Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney


Middle/Underware Access: Querying Databases

  • Database connection actor:

    • Opening a database connection and passing it to all actors accessing this database.

  • Database query actor:

    • A generic actor that queries a database and provides its result.

  • DBConnection type and DBConnectionToken:

    • A new IOPort type and a token to distinguish a database connection from any general type.


Database Connection Actor

  • OpenDBConnection actor:

    • Input: database connection information

    • Output: DBConnectionToken (reference to a DB connection instance, via a DBConnection output port)


Database Query Actor

  • Database Query actor:

    • Input: SQL query string and a DB connection token

    • Parameters:

      • output type: XML, Record, or String

      • tuple-at-a-time vs set-at-a-time

    • Process:

      • execute query

      • produce results according to parameters
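
The division of labor between the two database actors can be sketched as two functions that share a connection token. This is a simplified model, not the Kepler implementation: the “database” is an in-memory table store, the “query” is just a table name standing in for SQL, and all names apart from `DBConnectionToken` are hypothetical.

```haskell
import Data.List (intercalate)
import qualified Data.Map as Map

-- A row as column/value pairs; the token wraps an in-memory stand-in for a
-- live connection (assumption: real tokens reference a JDBC-style connection).
type Row = [(String, String)]
newtype DBConnectionToken = DBConnectionToken (Map.Map String [Row])

-- The output-type parameter of the query actor.
data OutputType = AsXML | AsRecord | AsString

-- OpenDBConnection: connection information in, connection token out.
openDBConnection :: [(String, [Row])] -> DBConnectionToken
openDBConnection = DBConnectionToken . Map.fromList

renderRow :: OutputType -> Row -> String
renderRow AsXML    row = concat ["<" ++ k ++ ">" ++ v ++ "</" ++ k ++ ">" | (k, v) <- row]
renderRow AsRecord row = "{" ++ intercalate ", " [k ++ " = " ++ v | (k, v) <- row] ++ "}"
renderRow AsString row = unwords (map snd row)

-- Query actor: takes a token and a query, produces rendered result tuples.
-- The result list is lazy, so a consumer can take it one tuple at a time or whole.
queryActor :: OutputType -> DBConnectionToken -> String -> [String]
queryActor out (DBConnectionToken db) table =
  map (renderRow out) (Map.findWithDefault [] table db)

main :: IO ()
main = do
  let conn = openDBConnection [("minerals", [[("name", "quartz")], [("name", "albite")]])]
  mapM_ putStrLn (queryActor AsXML conn "minerals")      -- tuple-at-a-time
  putStr (unlines (queryActor AsRecord conn "minerals")) -- set-at-a-time
```

Giving the token its own type means a connection cannot be wired into a port that expects ordinary data, which is the purpose of the separate DBConnection port type.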


Querying Example


An (oversimplified) Model of the Grid

[Figure: dataflow X → f → Y → g → Z]

  • Hosts: {h1, h2, h3, …}

  • [email protected]: d1@{hi}, d2@{hj}, …

  • [email protected]: f1@{hi}, f2@{hj}, …

  • Given: data/workflow:

  • … as a functional plan: […; Y := f(X); Z := g(Y); …]

  • … as a logic plan: […; f(X,Y) ∧ g(Y,Z); …]

  • Find Host Assignment: di → hi, fj → hj for all di, fj

    … s.t. […; [email protected] := [email protected]([email protected]), …] is a valid plan
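
The host-assignment problem can be sketched as a small backtracking search. The validity rule used here is an assumption made for illustration: a step may run on any host that holds both its function and its input, and the output then resides on that host. Shipping data or functions between hosts is deliberately left out. The names `Step`, `Locations`, and `assignments` are hypothetical.

```haskell
import Data.List (intersect)
import Data.Maybe (fromMaybe)

type Host = String
type Locations = [(String, [Host])]   -- which hosts hold a given function or data item

-- A functional plan step "output := function(input)".
data Step = Step String String String

-- All host assignments for the outputs of a plan (list monad = backtracking).
assignments :: Locations -> Locations -> [Step] -> [[(String, Host)]]
assignments _ _ [] = [[]]
assignments fns dat (Step o f i : rest) =
  [ (o, h) : more
  | h    <- locate f fns `intersect` locate i dat      -- hosts holding both f and its input
  , more <- assignments fns ((o, [h]) : dat) rest      -- the output now lives on h
  ]
  where
    locate k tbl = fromMaybe [] (lookup k tbl)

main :: IO ()
main = print (assignments fns dat plan)  -- prints [[("Y","h2"),("Z","h2")]]
  where
    fns  = [("f", ["h1", "h2"]), ("g", ["h2"])]
    dat  = [("X", ["h1", "h2"])]
    plan = [Step "Y" "f" "X", Step "Z" "g" "Y"]
```

Running `f` on `h1` is locally possible but leads to a dead end, because `g` is only available on `h2`; the search backtracks and finds the single valid assignment.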


Shipping and Handling Algebra (SHA)

Logical view

  • plan [email protected] = [email protected] of [email protected] =

Physical view: SHA Plans

  • (1) [ [email protected] to A, [email protected] := [email protected]([email protected]), [email protected] to C ]

  • (2) [ [email protected] => B, [email protected] := [email protected]([email protected]), [email protected] to C ]

  • (3) [ [email protected] to C, [email protected] => C, [email protected] := [email protected]([email protected]) ]


Grid-Enabling PTII: Handles

  • (1) A → GA: get_handle

  • (2) GA → A: return &X

  • (3) A → B: send &X

  • (4) B → GB: request &X

  • (5) GB → GA: request &X

  • (6) GA → GB: send *X

  • (7) GB → B: send done(&X)

  • Example:

    • &X = “GA.17”

    • *X = <some_huge_file>

  • Candidate Formalisms:

    • GridFTP

    • SSH, SCP

    • SDSC SRB

    • OGS?-??? … WSRF?

Logical token transfer (3) requires get_handle (1, 2), then exec_handle (4, 5, 6, 7) for completion.

[Figure: actors A and B in Keplerspace, grid nodes GA and GB in Gridspace, with arrows numbered 1 to 7]
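
The idea behind handles is that only a small reference travels through the workflow engine, while the bulk data moves directly between grid nodes. The following is a minimal pure model of that idea; the `Store` representation and the function names are illustrative assumptions, not a Kepler or Globus API.

```haskell
import qualified Data.Map as Map

type Handle = String                  -- a reference such as "GA.17" (&X)
type Store  = Map.Map Handle String   -- data held by one grid node (*X)

-- get_handle: actor A obtains a handle from its grid node GA.
getHandle :: String -> Int -> Handle
getHandle node n = node ++ "." ++ show n

-- exec_handle: B's grid node pulls the data behind the handle directly
-- from the source node; the payload never passes through A or B.
execHandle :: Handle -> Store -> Store -> Maybe Store
execHandle h src dst = do
  payload <- Map.lookup h src         -- GB requests &X, GA sends *X
  pure (Map.insert h payload dst)     -- GB now holds *X and can report done(&X)

main :: IO ()
main = do
  let h  = getHandle "GA" 17
      ga = Map.fromList [(h, "<some_huge_file>")]
      gb = Map.empty
  putStrLn ("token sent from A to B: " ++ h)   -- only the handle is sent
  print (fmap Map.toList (execHandle h ga gb))
```

A failed lookup yields `Nothing`, which models a handle that the source node cannot resolve.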


Extensions: Semantic Type

  • Take concepts and relationships from an ontology to “semantically type” the data-in/out ports

  • Application: e.g., design support:

    • smart/semi-automatic wiring, generation of “massaging actors”

[Figure: actor m1 (normalize) with ports p3 and p4. Takes Abundance Count Measurements for Life Stages; returns Mortality Rate Derived Measurements for Life Stages.]


Semantic Types

  • The semantic type signature

    • Type expressions over the (OWL) ontology

[Figure: actor m1 (normalize) with ports p3 and p4]

SemType m1 ::
    Observation & itemMeasured.AbundanceCount & hasContext.appliesTo.LifeStageProperty
    -> DerivedObservation & itemMeasured.MortalityRate & hasContext.appliesTo.LifeStageProperty
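
One way semantic types can support wiring is a subsumption check between what an output port provides and what an input port requires. The sketch below is heavily simplified: a semantic type is reduced to a set of concept names (no property paths such as `itemMeasured`), and the subclass edges are hypothetical rather than taken from the SEEK ontologies.

```haskell
type Concept = String

-- Hypothetical subclass edges (child, parent); assumed for illustration only.
subclassOf :: [(Concept, Concept)]
subclassOf = [("DerivedObservation", "Observation"), ("MortalityRate", "Rate")]

-- Reflexive-transitive subsumption over the edges above (assumes no cycles).
isA :: Concept -> Concept -> Bool
isA c d = c == d || or [isA p d | (c', p) <- subclassOf, c' == c]

-- A semantic type, simplified to a conjunction of concepts.
type SemType = [Concept]

-- An output may be wired to an input if every concept the input requires
-- is subsumed by some concept the output provides.
canWire :: SemType -> SemType -> Bool
canWire provided required = all (\r -> any (`isA` r) provided) required

main :: IO ()
main = do
  let m1Out = ["DerivedObservation", "MortalityRate"]
  print (canWire m1Out ["Observation"])     -- True under the assumed edges
  print (canWire m1Out ["AbundanceCount"])  -- False: no provided concept matches
```

When the check fails, a design tool could search for a “massaging actor” whose input accepts the provided type and whose output satisfies the required one.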


Extended Type System (here: OWL Semantic Types)

SemType m1 ::
    Observation & itemMeasured.AbundanceCount & hasContext.appliesTo.LifeStageProperty
    -> DerivedObservation & itemMeasured.MortalityRate & hasContext.appliesTo.LifeStageProperty

Substructure association:

XML raw-data =(X)Query=> object model =link=> OWL ontology


Semantic Types for Scientific Workflows


Deriving Data Transformations from Semantic Service Registration

[Bowers-Ludaescher, DILS’04]


Structural and Semantic Mappings

[Bowers-Ludaescher, DILS’04]


Workflow Planning as Planning Queries with Limited Access Patterns

  • User query Q: answer(ISBN, Author, Title) ←

    book(ISBN, Author, Title),

    catalog(ISBN, Author),

    not library(ISBN).

  • Limited (web service) Access Patterns (API)

    • Src1.books: in: ISBN out: Author, Title

    • Src1.books: in: Author out: ISBN, Title

    • Src2.catalog: in: {} out: ISBN, Author

    • Src3.library: in: {} out: ISBN

  • Q is not executable, but feasible (equivalent to executable Q’: catalog ; book ; not library)

     ICDE (poster), EDBT, PODS (papers), [Nash-Ludaescher,2004]
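
The reordering from Q to the executable Q’ can be reproduced with a simple greedy planner. This sketch covers only the ordering step: a literal becomes callable once some access pattern has all its input variables bound, and a negated literal additionally needs all of its variables bound. It does not decide feasibility in general, and the `Literal` record and `plan` function are hypothetical names.

```haskell
import Data.List (partition)

type Var = String

-- A query literal: its variables, whether it is negated, and the input
-- variable sets of its available access patterns.
data Literal = Literal
  { name     :: String
  , vars     :: [Var]
  , negated  :: Bool
  , patterns :: [[Var]]
  }

-- Callable if some pattern's inputs are all bound; a negated literal
-- only filters, so all of its variables must already be bound.
callable :: [Var] -> Literal -> Bool
callable bound l
  | negated l = all (`elem` bound) (vars l) && ok
  | otherwise = ok
  where
    ok = any (all (`elem` bound)) (patterns l)

-- Greedy ordering: repeatedly schedule every literal that is callable now.
plan :: [Var] -> [Literal] -> Maybe [String]
plan _ [] = Just []
plan bound ls = case partition (callable bound) ls of
  ([], _)      -> Nothing   -- stuck: the remaining literals are not executable
  (now, later) -> (map name now ++) <$> plan (bound ++ concatMap vars now) later

main :: IO ()
main = print (plan [] query)  -- prints Just ["catalog","book","library"]
  where
    query =
      [ Literal "book"    ["ISBN", "Author", "Title"] False [["ISBN"], ["Author"]]
      , Literal "catalog" ["ISBN", "Author"]          False [[]]
      , Literal "library" ["ISBN"]                    True  [[]]
      ]
```

As written, Q starts with `book`, which cannot be called with nothing bound; calling `catalog` first binds ISBN and Author, after which `book` and the negated `library` check become executable.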


Conclusions

  • Summary

    • Kepler Scientific Workflow System

    • Open source, cross-project collaboration (SEEK, GEON, SDM,…)

    • Actor & Dataflow-oriented Modeling, Design, Execution (Ptolemy II heritage)

    • Prototyping, static analysis, web services, data transformations

  • Next Steps

    • First official release (“Kepler-to-Go”) April/May ’04

      • e-Science meeting NeSC, Edinburgh

    • Grid-enabling

      • 3rd party transfer, planning, optimization, …

    • Semantic Typing [DILS’04]

    • Provenance, Fault tolerance, …

    • Link-Up w/ e.g. Taverna, Pegasus, …

    • Become a member or co-developer (You!)

