
Using explicit control processes in distributed workflows to gather provenance



  1. Using explicit control processes in distributed workflows to gather provenance Sergio M. S. Cruz Fernando Seabra Chirigati Rafael Dahis Maria Luiza M. Campos Marta Mattoso Federal University of Rio de Janeiro (UFRJ), Brazil

  2. Agenda • Introduction • Motivation • Control flow in data-centric workflows • Objective • Provenance Gathering in Distributed Workflows with Explicit Control Flows • Use Case • Control Flow on VisTrails • Conclusion

  3. Distribution & Heterogeneity in Workflows • Scientific Wf enables data intensive analyses • Use of grid x remote parallel machines • Use of different WfMS • Different provenance capture mechanisms • Use Centralized x Distributed WfMS • often offer disjoint set of capabilities How to obtain a homogeneous provenance representation and capture mechanism?

  4. Control flow matters in data-centric workflows • Scientific workflows also need control structures to specify how the data flow should be directed • Goderis et al. [6] stress the importance of combining different models of computation in one scientific workflow • Bowers et al. [5] say that: • “modeling control-flow using only dataflow constructs can quickly lead to overly complex workflows that are hard to understand, reuse, reconfigure, maintain, and schedule” • Tudruj et al. [7] state the importance of general dynamic control flow, but focus on synchronization of parallel execution • They presented a set of generic control structures and proposed the use of a monitoring middleware

  5. A real example: OrthoSearch workflow Detects distant homologies in five parasites associated with neglected tropical diseases

  6. OrthoSearch specification in Kepler [Workflow diagram: MAFFT/HMMER packages (time-consuming tasks), Best Hits Finder, FormatDB, BLAST, InterPRO] • Some lightweight tasks can run locally • Suppose we need to execute MAFFT/HMMER in a High Performance Environment • Just send it to a grid!

  7. OrthoSearch - loops, choice, … [Workflow diagram: MAFFT/HMMER packages, Best Hits Finder, FormatDB, BLAST, InterPRO] How to map this to the grid language?

  8. OrthoSearch - loops, choice, … [Workflow diagram: MAFFT/HMMER packages sent remotely; Best Hits Finder, FormatDB, BLAST, InterPRO kept LOCAL] Alternatively, send one job at a time to execute remotely. Can be very inefficient!

  9. OrthoSearch - loops, choice, … Rewrite this in the grid language, e.g., Triana supports loops! But how to bring provenance data back to Kepler? How to register loop iterations?

  10. OrthoSearch - loops, choice: other issues What if my available grid supports another WfMS? What if the grid WfMS does not support loops? What if my available grid does not have a WfMS? Generic control flow modules with remote provenance gathering!

  11. Motivation • Workflow design • Different WfMS present their own control structures, parallel execution models, etc. • Expose different modeling semantics to the users! • Provenance gathering • WfMS register provenance in their own schema • Often encompassing specific grid features • Based on application domain attributes Many challenges in changing the WfMS for the same workflow: a lot of mappings and conversions!

  12. Objective • Diminish the dependence of the workflow definition on the WfMS • decoupling the provenance gathering system from the WfMS • having some control flow of execution independent of the WfMS workflow specification language • Plugging control-flow and provenance-gathering modules alongside the workflow's original tasks • the workflow specification can be executed almost independently of the current WfMS • provenance can be gathered uniformly
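
To make the idea above concrete, here is a minimal sketch (ours, not the authors' implementation) of a WfMS-independent wrapper: it runs a task's command line and appends a uniform provenance record, so the same record format can be produced whether the task runs on a desktop WfMS or on a remote grid node. Function and field names are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): run a workflow task and record
# provenance in a WfMS-independent format.
import datetime
import subprocess

def run_with_provenance(task_name, command, log):
    """Execute a task's command line and append a uniform provenance record."""
    record = {"task": task_name,
              "command": command,
              "start": datetime.datetime.utcnow().isoformat()}
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    record["end"] = datetime.datetime.utcnow().isoformat()
    record["exit_code"] = result.returncode
    log.append(record)     # same record format on any WfMS or grid node
    return result.stdout

provenance_log = []
# e.g.: run_with_provenance("formatdb", "formatdb -i proteins.fasta", provenance_log)
```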

  13. Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al.

  14. Scientific Workflow Control Flows [Workflow diagram: implicit LOOP around MAFFT, hmmbuild, hmmcalibrate (HMMER), fed by the COGs DB; implicit DECISION between hmmsearch and hmmpfam against the Ptn DB; Reciprocal Best Hits Finder; fastacmd, formatdb, BLAST; InterPRO producing reannotated genes]

  15. Scientific Workflow with Explicit Control Flows [Meta-workflow diagram: an initial condition feeds a MUX; an explicit LOOP around MAFFT, hmmbuild, hmmcalibrate (HMMER); an explicit DECISION via an IF with T/F branches to hmmsearch and hmmpfam] The meta-workflow eases migration of a Wf from one WfMS to another! • All these modules can be sent to execute in any HPC environment • Provenance gathering mechanisms can be inserted in the control flow modules or other specific modules
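
A minimal sketch of the explicit LOOP built from a MUX and a DECISION, as in the meta-workflow above; the function names are ours, and the loop body stands in for one MAFFT/hmmbuild/hmmcalibrate iteration.

```python
# Sketch of the explicit LOOP: a MUX merges the initial input with the feedback
# branch and an IF-style test decides whether to iterate, so the loop is visible
# to any WfMS. Function names here are illustrative.
def mux(initial, feedback):
    """Prefer the feedback branch when present, otherwise the initial input."""
    return feedback if feedback is not None else initial

def explicit_loop(initial, body, keep_looping):
    feedback = None
    while True:
        current = mux(initial, feedback)
        if not keep_looping(current):      # the explicit DECISION step
            return current
        feedback = body(current)           # e.g. one MAFFT + hmmbuild + hmmcalibrate pass

# toy run: refine a score until it reaches 5
print(explicit_loop(0, lambda score: score + 1, lambda score: score < 5))   # -> 5
```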

  16. Control flow modules on VisTrails • All these control flow modules were made available on VisTrails • Advantages: • More explicit control is now available • Remote execution can keep the specified control • Remote execution can bring provenance data back to VisTrails in a compatible structure
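
A hedged sketch of how one of these control modules could be wrapped as a VisTrails Python module. The import path and the compute/get_input/set_output accessors are assumptions about the VisTrails module API rather than verified signatures; the fallback base class lets the sketch run standalone.

```python
# Sketch of a MUX control module in the style of a VisTrails Python module.
# The import path and accessor names are assumptions, not taken from VisTrails docs.
try:
    from vistrails.core.modules.vistrails_module import Module
except ImportError:                  # allows running the sketch outside VisTrails
    class Module(object):
        def __init__(self):
            self.inputs, self.outputs = {}, {}
        def get_input(self, port):
            return self.inputs.get(port)
        def set_output(self, port, value):
            self.outputs[port] = value

class Mux(Module):
    """Explicit merge: prefer the loop's feedback port over the initial input."""
    def compute(self):
        initial = self.get_input("initial")
        feedback = self.get_input("feedback")
        self.set_output("value", feedback if feedback is not None else initial)
```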

  17. OrthoSearch on VisTrails [Workflow screenshot: explicit DECISION; external LOOP via parameter exploration] • All these inner modules (sub-workflow) can be sent to execute in a grid or HPC environment • Provenance gathering mechanisms can be inserted in the control flow modules or other specific modules • In VisTrails the loop could not be implemented because it is a DAG-based WfMS

  18. Scientific Workflow - Heterogeneity [Workflow diagram, as on slide 14: COGs DB, MAFFT, hmmbuild, hmmcalibrate (HMMER); Ptn DB, hmmsearch, hmmpfam (time-consuming); Reciprocal Best Hits Finder; fastacmd, formatdb, BLAST; InterPRO producing reannotated genes]

  19. OrthoSearch on VisTrails [Diagram: REMOTE PARALLEL EXECUTION of BLAST] • BLAST modules should be sent to execute in a PC cluster • Provenance gathering mechanisms can be inserted in the control flow modules to be sent to the parallel environment • In VisTrails this can be achieved using the MidMon modules

  20. MidMon on VisTrails Implementation • Monitoring tool that checks scientific processes running on distributed environments • Message exchange-based tool • Decoupled, with a modular infrastructure • Support for legacy applications on distributed resources [Diagram: control modules and data modules around BLAST]
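
A minimal sketch of the message-exchange idea: a remote task sends a small status message that the scientist's desktop monitor can receive. The transport (UDP) and the message fields are our assumptions, not MidMon's actual protocol.

```python
# Sketch of the message exchange behind the monitoring idea (field names and
# transport are assumptions): each remote task sends a small status message
# that the desktop monitor receives.
import json
import socket

def send_status(monitor_host, monitor_port, task, state):
    """Send one UDP status message from a compute node to the desktop monitor."""
    message = json.dumps({"task": task, "state": state, "host": socket.gethostname()})
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(message.encode("utf-8"), (monitor_host, monitor_port))
    sock.close()

# on a cluster node, e.g.: send_status("desktop.example.org", 9999, "BLAST", "RUNNING")
```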

  21. Concluding • We share the same motivation as Bowers et al., Goderis et al. and Tudruj et al. • And the same as Groth et al. • We propose: • A set of generic control-flow structures independent of the WfMS • Our implementation has shown that: • Control-flow structures can allow generic sub-workflow remote execution • Remote process provenance can be captured in the same representation as the workflow • Workflow refactoring is facilitated • Control-flow structures can be coupled to monitoring middleware Using explicit control flow, provenance becomes independent of a WfMS

  22. Conclusion • Distribution & heterogeneity are inevitable in scientific workflows • Adding control-flow modules to the scientific workflow specification can help execution by heterogeneous WfMS running on distributed environments • Acts as documentation of the workflow's execution control • Allows evaluating and monitoring the activities of the workflow • Helps to gather provenance from heterogeneous and independent environments with low programming effort • MidMon on top of VisTrails • Enables scientists to monitor the status of submitted jobs on their desktops • Preserves the workflows' original features

  23. Future work • Use workflow views, e.g. ZOOM* • Our solution makes the workflow very verbose • Use software component reuse and refactoring techniques to help the automatic incorporation of these modules • “Using Provenance to Improve Workflow Design”, Tosta et al. • Work with other workflows from bioinformatics and the oil industry

  24. Thanks! Using explicit control processes in distributed workflows to gather provenance Sergio M. S. da Cruz Fernando Seabra Chirigati Rafael Dahis Maria Luiza M. Campos Marta Mattoso Federal University of Rio de Janeiro, Brazil

  25. Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al. MUX - Describes a convergence of two or more input ports into just one outgoing branch

  26. Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al. DEMUX - Represents an incoming branch that diverges into two or more parts. Just one of the outgoing branches is enabled, depending on an associated condition

  27. Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al. STRING CONTROL - The workflow is divided into two or more branches, and just one of them can be enabled; the other outgoing branches are withdrawn

  28. Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al. NUMBER CONTROL - All output data are produced simultaneously

  29. Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al. NUMBER COMPARE - Two or more incoming branches become one outgoing branch, which is enabled only after all the input data have been activated.
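
A toy sketch of the synchronization behaviour just described, assuming a simple dictionary-based join; class and method names are ours, not the paper's code.

```python
# Illustrative join: the outgoing branch is enabled only after every incoming
# branch has delivered its data.
class NumberCompare:
    def __init__(self, n_branches):
        self.expected = n_branches
        self.received = {}

    def deliver(self, branch_id, value):
        """Record one incoming branch; return the merged output once all have arrived."""
        self.received[branch_id] = value
        if len(self.received) == self.expected:
            return [self.received[b] for b in sorted(self.received)]
        return None                         # output port stays disabled

join = NumberCompare(2)
assert join.deliver("hmmsearch", "hits_a.txt") is None   # first branch: still disabled
print(join.deliver("hmmpfam", "hits_b.txt"))              # both arrived: output enabled
```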

  30. Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al. IF - Same pattern as the DEMUX, but with two differences: IF has only two input ports and takes a logical expression, in which scientists can express any condition they need.
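
A toy sketch of the IF behaviour: a logical expression supplied by the scientist decides which of the two outgoing branches receives the data. The use of a restricted Python eval is our illustrative choice, not the actual implementation.

```python
# Illustrative IF module: the scientist's logical expression enables exactly
# one of the two outgoing branches.
class If:
    def __init__(self, expression):
        self.expression = expression              # e.g. "e_value < 1e-5"

    def route(self, value, **variables):
        """Enable the true or the false branch according to the expression."""
        enabled = eval(self.expression, {"__builtins__": {}}, variables)
        return {"true_branch": value if enabled else None,
                "false_branch": None if enabled else value}

decide = If("e_value < 1e-5")
print(decide.route("candidate ortholog", e_value=3e-7))
# {'true_branch': 'candidate ortholog', 'false_branch': None}
```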

  31. MidMon • Offers a generic and lightweight monitoring tool that checks scientific processes running on distributed environments • Message exchange-based, two-layered modular infrastructure • Decoupled and lightweight, crossing different network boundaries • Easy to deploy and manage • Support for legacy applications on distributed resources

  32. MidMon Monitoring Data • Task state data can be monitored • The state of the environment can be monitored • Service availability can be monitored

  33. MidMon – State Data • Task state data that it may be possible to monitor: • Progress of a service - Rely on checkpoints within the service, or a service may be able to provide an estimate of its progress • Completion of a service - This could be a simple event indicating that a service has produced all of its output files • Data consumption rate of a service - This is a measure of the rate at which a service is consuming data from its input files • Data production rate of a service - This is a measure of the rate at which a service is generating data for its output files
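
As an illustration of the consumption and production rates listed above, the sketch below samples a running service's I/O counters with psutil; the library choice and sampling interval are ours, and per-process I/O counters are only available on some platforms (e.g. Linux).

```python
# Illustrative sketch: derive consumption/production rates for a running
# service from its process I/O counters (platform-dependent in psutil).
import time
import psutil

def io_rates(pid, interval=5.0):
    """Return (consumption, production) in bytes per second for the given process."""
    proc = psutil.Process(pid)
    before = proc.io_counters()
    time.sleep(interval)
    after = proc.io_counters()
    return ((after.read_bytes - before.read_bytes) / interval,
            (after.write_bytes - before.write_bytes) / interval)

# e.g. monitor a running BLAST job: io_rates(blast_pid, interval=10.0)
```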

  34. MidMon – State of the environment • Useful data that it may be possible to monitor about the state of the environment: • Available execution nodes - This could be a list of changes in the available execution nodes in the environment • Load on an execution node - This is a measure of the load on an execution node. It could be a single value, a tuple, or a composite of measures, e.g., the CPU load, the number of processes, and the free resources of the execution node • Load on a network link - This is a measure of the usage of a network link, in terms of the available bandwidth • Memory usage on an execution node - This is a measure of the usage of memory on an execution node
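
A small sketch of probing these environment-state measures on one execution node, again using psutil as an example probe; MidMon's own probes are not described at this level in the slides.

```python
# Illustrative node-state probe: CPU load, process count, memory and disk usage.
import psutil

def node_state():
    return {"cpu_load_percent": psutil.cpu_percent(interval=1.0),
            "process_count": len(psutil.pids()),
            "memory_used_percent": psutil.virtual_memory().percent,
            "free_disk_percent": 100.0 - psutil.disk_usage("/").percent}

# each node could periodically send node_state() to the monitor (see the message sketch above)
```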

  35. MidMon – Service availability • Useful data that it may be possible to monitor about service availability: • Available services - This could be a list of the services available as mapping targets for tasks in a workflow. The data could also include, e.g., the status of services currently deployed • Available data resources - This could be a list of the data resources available as mapping targets for inputs and outputs in a workflow

  36. OrthoSearch – SSH version • Without Control-Flow modules

  37. OrthoSearch on Kepler 1/3 [Screenshot: hmmSearch, hmmPFam]

  38. OrthoSearch on Kepler 2/3 [Screenshot: FormatDB, FastaCmd]

  39. OrthoSearch on Kepler 3/3 [Screenshot: InterPro]
