
Towards Scientific Workflows Based on Dataflow Process Networks (or from Ptolemy to Kepler)

Bertram Ludäscher

San Diego Supercomputer Center

[email protected]


A Note on the Style of the following Slides

Due to lack of time, most of the following slides will be “by reference” only ;-)

…Each speaker was given four minutes to present his paper, as there were so many scheduled -- 198 from 64 different countries. To help expedite the proceedings, all reports had to be distributed and studied beforehand, while the lecturer would speak only in numerals, calling attention in this fashion to the salient paragraphs of his work. ... Stan Hazelton of the U.S. delegation immediately threw the hall into a flurry by emphatically repeating: 4, 6, 11, and therefore 22; 5, 9, hence 22; 3, 7, 2, 11, from which it followed that 22 and only 22!! Someone jumped up, saying yes but 5, and what about 6, 18, or 4 for that matter; Hazelton countered this objection with the crushing retort that, either way, 22. I turned to the number key in his paper and discovered that 22 meant the end of the world… [The Futurological Congress, Stanislaw Lem, translated from the Polish by Michael Kandel, Futura 1977]

Acknowledgements
NSF, NIH, DOE

GEOsciences Network (NSF)

www.geongrid.org

Biomedical Informatics Research Network (NIH)

www.nbirn.net

Science Environment for Ecological Knowledge (NSF)

seek.ecoinformatics.org

Scientific Data Management Center (DOE)

sdm.lbl.gov/sdmcenter/


Example: Promoter Identification Workflow (PIW) (simplified)

From: SciDAC/SDM project and collaboration w/ Matt Coleman (LLNL)


Conceptual Workflow (Promoter Identification Workflow, PIW)

(Workflow diagram; recoverable steps:)
  • Retrieve transcription factors; arrange transcription factors
  • Retrieve matching cDNA; retrieve genomic sequence
  • Extract promoter region (begin, end); align promoters; create consensus sequence
  • Compute clusters (min. distance); select gene-set (cluster-level)
  • For each promoter: compute subsequence labels
  • For each gene, with all promoter models: compute joint promoter model

Details of the Functional MRI (Magnetic Resonance Imaging) Analysis Workflow (Jeffrey Grethe)
  • Collect data (K-Space images in Fourier space) from MR scanner while subject performs a specific task
  • Reconstruct K-Space data to image data (this requires scanner parameters for the reconstruction)
  • Now have anatomical and functional data
  • Pre-process the functional data
    • Correct for differences in slice acquisition (each slice in a volume is collected at a slightly different time); try to correct for these differences so that all slices appear to be acquired at the same time
    • Now correct for subject motion (head movement in the scanner) by realigning all functional images
  • Register the functional images with the anatomical image → all images are now in the same space (all aligned with one another)
  • Move all subjects into template space through non-linear spatial normalization. There exist atlas templates (made from many subjects) that one can normalize to so that all subjects are in the same space, allowing for direct comparison across subjects.
  • DATA VERIFICATION - check if all these procedures worked. If not, go back and try again (possibly tweaking some parameters for the routines or by re-doing some of it by hand).
  • Move on to statistics. First we do single-subject statistics: in addition to the images, information about the experimental paradigm is required. The resulting statistical maps can be overlaid onto an anatomical image to create visual displays of brain activation during a particular task.
  • Can also combine statistical data from multiple subjects and do a group/population analysis and display these results.

→ Interactive nature of these workflows is critical (data verification): can these steps be automated or semi-automated?

→ need metadata from collection equipment and experimental design!


(Workflow diagram; recoverable labels:)
  • Inputs: species presence & absence points (a) and environmental layers (b), for both the native range and the invasion area; data obtained via EcoGrid Query from registered EcoGrid databases
  • Layer Integration → integrated layers (c) (native range / invasion area)
  • User Sample Data → training sample (d) and test sample (d)
  • Data Calculation → GARP rule set (e)
  • Map Generation → native range prediction map (f) / invasion area prediction map (f)
  • Validation (against test sample) → model quality parameter (g); selected prediction maps (h)
  • Generate Metadata; Archive to EcoGrid
  • (Annotation points marked +A1, +A2, +A3)

GARP Invasive Species Pipeline

From: NSF SEEK (Deana Pennington et al)

Scientific Workflows: Some Findings
  • More dataflow than workflow
    • but some branching, looping, merging, …
    • not: documents/objects undergoing modifications
    • instead: dataset-out = analysis(dataset-in)
  • Need for “collection/functional-style programming” (FP)
    • Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
  • Need for abstraction and nested workflows
  • Need for data transformations (compute/transform alternations)
  • Need for rich user interaction / steering:
    • pause & resume
    • select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
  • Need for high-throughput transfers (“grid-enabling”, “streaming”)
  • Need for persistence of intermediate products

→ data provenance (“virtual data”; cf. several ITR and e-Science projects)
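The “collection/functional-style programming” called for above can be sketched in a few lines of Python. This is illustrative only; the names `analysis` and `pipeline` are invented for the example:

```python
# Illustrative sketch: the "dataset-out = analysis(dataset-in)" style,
# built from generic collection-oriented operations.

def analysis(record):
    # hypothetical per-item analysis step
    return record * 2

def pipeline(dataset):
    doubled = [analysis(x) for x in dataset]   # foreach / map(f)
    kept = [x for x in doubled if x > 2]       # filtering
    paired = list(zip(dataset, doubled))       # generic zip
    return doubled, kept, paired

doubled, kept, paired = pipeline([1, 2, 3])
```

Note how each step consumes one collection and produces another, rather than mutating a shared document or object.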

(Analytical) Pipelines … (Scientific) Workflows
  • Spectrum of languages & formalisms:
    • Pipelines (a la Unix)
    • Dataflow languages:
      • Synchronous dataflow networks (SDF)
      • Kahn’s process networks (PN)
    • “Web page-flow”:
      • Active XML, WebML, …
      • Hesitating-weak-alternating-tree-automata-ML
    • (Business) Workflows:
      • WfMC’s XPDL, WSFL, BPEL4WS, …
Business Workflows
  • Business Workflows
    • show their office automation ancestry
    • documents and “work-tasks” are passed
    • no data streaming, data-intensive pipelines
    • lots of standards to choose from: WfMC, BPML, BPEL4WS, … XPDL, …
    • but often no clear semantics for constructs as simple as this:

Source: Expressiveness and Suitability of Languages for Control Flow Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002

The ZOO of Workflow Standards and Systems

Source: W.M.P. van der Aalst et al.

http://tmitwww.tm.tue.nl/research/patterns/

More on Scientific WF vs Business WF
  • Business WF
    • Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects still identifiable throughout
    • Complex control flow, task-oriented
    • Transactions w/o rollback (ticket: reserved → purchased)
  • SWF
    • data-in and data-out of an analysis step are not the same object!
    • dataflow, data-oriented (cf. AVS/Express, Khoros, …)
    • re-run automatically (a la distrib. comp., e.g. Condor) or user-driven/interactively (based on failure type)
    • data integration & semantic mediation as part of SWF framework!
SWF vs Distributed Computing
  • Distributed Computing (e.g. a la Condor-(G) )
    • Batch oriented
    • Transparent distributed computing (“remote Unix/Java”; standard/Java universes in Condor)
    • HPC resource allocation & scheduling
  • SWF
    • Often highly interactive for decision making/steering of the WF and visualization (data analysis)
    • Transparent data access (Grid) and integration (database mediation & semantic extensions)
    • Desktop metaphor (“microworkflow”!?); often (but not always!) light-weight web service invocation

Ptolemy-II

Recommendations following:

must read

must see (now: snippets following; watch for new ways to compress slides ;-)

must try

Bottom line:

a sophisticated system to do “simple” things (dataflows) as well as highly complex things (hybrid models)

(compare to your favorite standard/approach/system)


Dataflow Process Networks and Ptolemy-II

see!

read!

try!

Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/


In our (SEEK) terminology:

Think of it as “Workflow Execution Model++”

Kahn Process Networks (PN)
  • Concurrent processes communicating through one-way FIFO channels with unbounded capacity
  • A functional process F maps a set of input sequences to a set of output sequences (sounds like XSM!)
  • for an arbitrary process: an increasing chain of sets of sequences → outputs may not increase!
  • Consider increasing chains (w.r.t. prefix ordering “<”) of streams
  • A process F is continuous if lub(Xs) exists for all increasing chains Xs and
    • F(lub(Xs)) ≤ lub(F(Xs))
  • Continuous implies monotonic:
    • if Xs < Ys then F(Xs) ≤ F(Ys)
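A minimal sketch of the PN idea in Python, not taken from the slides: processes are threads, channels are unbounded blocking FIFOs, and each process is functional (its output sequence depends only on its input sequence). The `None` end-of-stream marker is a convention of this sketch, not part of the model:

```python
# Sketch of a two-process Kahn network: producer -> doubler.
# Channels are one-way FIFO queues; reads block until a token arrives.
import queue
import threading

def producer(out_ch):
    for i in range(5):
        out_ch.put(i)            # write a token to the channel
    out_ch.put(None)             # end-of-stream marker (sketch convention)

def doubler(in_ch, out_ch):
    while True:
        tok = in_ch.get()        # blocking read: stalls until a token arrives
        if tok is None:
            out_ch.put(None)
            break
        out_ch.put(tok * 2)      # output depends only on the input sequence

c1, c2 = queue.Queue(), queue.Queue()   # unbounded FIFO channels
threads = [threading.Thread(target=producer, args=(c1,)),
           threading.Thread(target=doubler, args=(c1, c2))]
for t in threads:
    t.start()
for t in threads:
    t.join()

result = []
while True:
    tok = c2.get()
    if tok is None:
        break
    result.append(tok)
# result == [0, 2, 4, 6, 8]
```

Because each process reads with blocking gets and never tests a channel for emptiness, the network's behavior is deterministic regardless of thread scheduling, which is the essential PN property.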
Process Networks (cont’d)
  • PN in essence: simultaneous relations between sequences
  • Network of functional processes can be described by a mapping

X = F(X,I)

    • X denotes all the sequences in the network (inputs I + outputs)
  • X that forms a solution is a fixed point
  • Continuity implies exactly one “minimal” fixed point
    • minimal in the sense of prefix ordering, for any inputs I
    • execution of the network: start with the empty sequences (X = ⊥) and compute the minimal fixed point (works because of the monotonicity property)
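A toy illustration of this least-fixed-point view, not from the slides: take F to be a feedback loop that prepends an initial token 0 and routes the output through an increment actor, i.e. F(X) = [0] ++ map(+1, X). Iterating from the empty sequence, each iterate is a prefix of the next, converging to the stream 0, 1, 2, …:

```python
# Kleene iteration toward the least fixed point of a continuous
# stream function, starting from bottom (the empty sequence).
def F(X):
    # initial token 0, followed by the increment of the feedback stream
    return [0] + [x + 1 for x in X]

X = []                    # bottom: the empty sequence
for _ in range(5):        # truncated iteration, for illustration
    X = F(X)
# X == [0, 1, 2, 3, 4]; each iterate extends the previous one (prefix order)
```

Monotonicity guarantees the chain only ever grows, so tokens once produced are never retracted; that is what makes the fixed-point semantics executable.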
Synchronous Data Flow Networks (SDF)
  • Special case of PN
  • Ptolemy-II SDF overview
    • SDF supports efficient execution of dataflow graphs that lack control structures
    • (with control structures, use Process Networks (PN))
    • requires that the token rates on the ports of all actors be known beforehand
    • and that they do not change during execution
    • in systems with feedback, delays (represented by initial tokens on relations) must be explicitly noted → SDF uses this rate and delay information to determine the execution sequence of the actors before execution begins
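The pre-execution analysis SDF enables can be sketched for the two-actor case, assuming fixed token rates (illustrative only; real SDF schedulers solve balance equations over the whole graph):

```python
# Sketch: solving the SDF balance equation for one edge A -> B,
#   produce(A) * firings(A) == consume(B) * firings(B),
# for the smallest positive integer firing counts (repetition vector).
from math import gcd

def repetitions(produce, consume):
    g = gcd(produce, consume)
    return consume // g, produce // g

fa, fb = repetitions(2, 3)   # A emits 2 tokens/firing, B consumes 3/firing
# fa == 3, fb == 2: fire A three times and B twice per iteration,
# after which the channel returns to its initial token count
```

Because these counts are computable before execution, an SDF schedule can be fixed statically, with bounded buffers and no run-time scheduling overhead.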
Extended Kahn-MacQueen Process Networks
  • A process is considered active from its creation until its termination
  • An active process can block when trying to read from a channel (read-blocked), when trying to write to a channel (write-blocked) or when waiting for a queued topology change request to be processed (mutation-blocked)
  • A deadlock is when all the active processes are blocked
    • real deadlock: all the processes are blocked on a read
    • artificial deadlock: all processes are blocked, at least one process is blocked on a write → increase the capacity of the receiver with the smallest capacity amongst all the receivers on which a process is blocked on a write. This breaks the deadlock.
    • If the increase results in a capacity that exceeds the value of maximumQueueCapacity, then instead of breaking the deadlock, an exception is thrown. This can be used to detect erroneous models that require unbounded queues.
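The resolution rule above can be sketched as a small decision procedure. This is a sketch, not Ptolemy-II code: the function shape and the doubling growth policy are assumptions of the example, and `MAX_QUEUE_CAPACITY` stands in for the `maximumQueueCapacity` parameter mentioned above:

```python
# Sketch of bounded-queue deadlock resolution: classify a deadlock as
# real or artificial, and for an artificial one grow the smallest
# capacity among channels with a write-blocked process.
MAX_QUEUE_CAPACITY = 64   # stand-in for maximumQueueCapacity

def resolve_deadlock(blocked, capacities):
    """blocked: list of ('read'|'write', channel_id); capacities: dict."""
    write_blocked = [ch for kind, ch in blocked if kind == 'write']
    if not write_blocked:
        # every active process is blocked on a read: nothing can help
        return 'real deadlock', capacities
    smallest = min(write_blocked, key=lambda ch: capacities[ch])
    grown = capacities[smallest] * 2          # growth policy: assumption
    if grown > MAX_QUEUE_CAPACITY:
        raise RuntimeError('model appears to require unbounded queues')
    capacities = dict(capacities, **{smallest: grown})
    return 'artificial deadlock resolved', capacities

status, caps = resolve_deadlock([('write', 'c1'), ('read', 'c2')],
                                {'c1': 4, 'c2': 8})
# status == 'artificial deadlock resolved', caps['c1'] == 8
```

Bounding queues and growing them only on artificial deadlock is what lets a PN execute in bounded memory whenever a bounded execution exists, while still flagging models that genuinely need unbounded buffers.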

Towards SciMod/SDMSWE/Kepler/…

(my vote is for ‘Kepler’…)

Scientific Workflows = Dataflow Process Networks + X
  • Kepler = current Ptolemy-II plus X, where X = …
    • Extended type system (structural & semantic extensions)
    • Collection programming extensions (declarative/FP)
    • Rich user interactions/workflow steering
    • Rich data transformations (compute/transform alternations)
    • (Eco-)Grid extensions:
      • Actors as web/grid services
      • 3rd party data transfer, high-throughput data streaming
      • Data and service repositories, discovery
    • Data provenance
      • (semi-)automatic meta-data creation
    • What else???
  • … minus upcoming Ptolemy-II extensions!
    • The slower we are, the less we have to do ourselves ;-)
Extended Type System (here: OWL Semantic Types)

SemType m1 ::

Observation & itemMeasured.AbundanceCount & hasContext.appliesTo.LifeStageProperty

→ DerivedObservation & itemMeasured.MortalityRate & hasContext.appliesTo.LifeStageProperty

Substructure association:

XML raw data =(X)Query=> object model =link=> OWL ontology

Actor Repositories (here: a commercial tool)

See why we said user-definable (or auto-generated) actor libraries?


Promoter Identification Workflow in Ptolemy-II (SSDBM’03)

(Screenshot annotations:)
  • designed to fit
  • hand-crafted web-service actor
  • hand-crafted control solution; also: forces sequential execution!
  • no data transformations available
  • complex backward control-flow

Promoter Identification Workflow in FP

genBankG :: GeneId -> GeneSeq
genBankP :: PromoterId -> PromoterSeq
blast :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac :: PromoterRegion -> [TFBS]
gpr2str :: (PromoterId, PromoterRegion) -> String

d0 = Gid "7"               -- start with some gene-id
d1 = genBankG d0           -- get its gene sequence from GenBank
d2 = blast d1              -- BLAST to get a list of potential promoters
d3 = map genBankP d2       -- get list of promoter sequences
d4 = map promoterRegion d3 -- compute list of promoter regions and ...
d5 = map transfac d4       -- ... get transcription factor binding sites
d6 = zip d2 d4             -- create list of pairs promoter-id/region
d7 = map gpr2str d6        -- pretty print into a list of strings
d8 = concat d7             -- concat into a single "file"
d9 = putStr d8             -- output that file

Back to purely functional dataflow process network

(= a data streaming model!)

Re-introducing map(f) to Ptolemy-II (was there in PT Classic)

no control-flow spaghetti

data-intensive apps

free concurrent execution

free type checking

automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]

Simplified Process Network PIW

map(f)-style

iterators

Powerful type checking

Generic, declarative “programming” constructs

Generic data transformation actors

Forward-only, abstractable sub-workflow piw(GeneId)

PIW as a declarative, referentially transparent functional process

optimization via functional rewriting possible

e.g. map(f ∘ g) = map(f) ∘ map(g)

Details:

Technical report & PIW specification in Haskell

Optimization by Declarative Rewriting I

map(f ∘ g) instead of map(f) ∘ map(g)

Combination of map and zip

http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
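The map-fusion rewrite above is easy to check concretely. A Python sketch (the functions are invented for the example; in the slides' setting the same law is applied to Haskell-style streams):

```python
# map fusion: map(f . g) computes in one pass what map(f) . map(g)
# computes in two, avoiding the intermediate list.
def compose(f, g):
    return lambda x: f(g(x))

f = lambda x: x + 1
g = lambda x: x * 2

xs = [1, 2, 3]
two_pass = list(map(f, map(g, xs)))       # map(f) . map(g)
one_pass = list(map(compose(f, g), xs))   # map(f . g)
# both yield [3, 5, 7]
```

Referential transparency is what licenses the rewrite: because the actors have no side effects, the fused pipeline is observably identical to the two-stage one.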

Rewritings require that data transformation semantics is known

e.g., Haskell-like for FP and SQL (XQuery)-like for (XML) database querying

Optimization by Declarative Rewriting II

Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney

Data Transformation Actors: Our Approach (proposal)
  • Manual
    • XQuery, XSLT, Perl, Python, … transformation actor (development)
  • (Semi-)automatic
    • Semantic-type guided transformation generation (research)
  • Also: Web Service Composition is …
    • … a hot topic
    • … a reincarnation of many “old” ideas
    • (e.g., AI-style planning born-again; functional composition; query composition; … )
    • … a separate topic
User Interaction
  • Browser Actor demo … (Ilkay)

F I N (addtl. material follows)

FYI: Flow-based programming has been re-discovered/re-invented several times:

Flow-based Programming, http://www.jpaulmorrison.com/fbp/index.shtm
