WINGS/Pegasus Provenance Challenge
Ewa Deelman, Yolanda Gil, Jihie Kim, Gaurang Mehta, Varun Ratnakar
USC Information Sciences Institute
WINGS/Pegasus: Workflow Instance Generation and Selection
• Workflow templates specify complex analysis sequences
• Workflow instances specify data
[Diagram: the WINGS/Pegasus pipeline]
• Stages: Workflow Creation, Workflow Selection, Data Selection, Component Specification
• Artifacts: Workflow Template, Workflow Instance, Executable Workflow (produced by Pegasus, run with DAGMan/Globus)
• Resources: Workflow Libraries; Application Components; Data Repositories (preexisting data collections, workflow execution results); Ontologies (OWL): domain terms, component types, workflow products
• Component specifications: specify data requirements, specify execution requirements
• User requests shown in the diagram:
  • EXPERT SCIENTIST: “Validate this workflow based on the component specs”
  • SCIENTIST: “Show me workflows that generate hazard maps”
  • SCIENTIST: “Run that with the USGS data set”
  • SCIENTIST RESEARCHING NEW MODELS: “Here is a new wave propagation model, takes in a series of fault ruptures, is compiled for MPI”
Workflow Template
[Figure: a workflow template; labelled elements: Collections, Computational nodes]
Metadata Constraints (in OWL ontology)
• Constraints on files
  • Metadata attributes: data types and default values
• Constraints on collections and collections of collections
  • Type of each element
  • Relations between the metadata of a collection and the metadata of its individual items
• Component-level constraints on metadata attributes of input/output files or collections
  • Deriving metadata of output files from metadata of input files
• Template-level constraints on metadata attributes of files or collections
  • Input/output files of different components can have the same metadata
  • Checking the number of items in collections
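The component-level and template-level constraints above can be sketched in plain Python. This is only an illustration, not the OWL encoding used by WINGS: the class `FileSpec`, the functions `derive_output_metadata` and `check_collection_size`, the attribute names `patientID` and `periodID`, and the file names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FileSpec:
    """A file with metadata attributes (hypothetical stand-in for an OWL individual)."""
    name: str
    metadata: dict = field(default_factory=dict)


def derive_output_metadata(inputs, propagate):
    """Component-level rule: copy the listed attributes from the inputs to the output.

    Raises ValueError if two inputs disagree on a propagated attribute.
    """
    derived = {}
    for key in propagate:
        values = {f.metadata[key] for f in inputs if key in f.metadata}
        if len(values) > 1:
            raise ValueError(f"inputs disagree on {key!r}: {sorted(values)}")
        if values:
            derived[key] = values.pop()
    return derived


def check_collection_size(items, expected):
    """Template-level rule: a collection must hold the expected number of items."""
    return len(items) == expected


# Hypothetical inputs to one component invocation.
image = FileSpec("anatomy3.img", {"patientID": "P112", "periodID": "p1"})
header = FileSpec("anatomy3.hdr", {"patientID": "P112", "periodID": "p1"})

# The output inherits the metadata shared by its inputs.
warp = FileSpec("warp3.warp",
                derive_output_metadata([image, header], ["patientID", "periodID"]))
```

Raising an error on disagreement is one possible design choice; a reasoner over the ontology would instead report the template as inconsistent.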
Provenance records
• Workflow templates specify complex analysis sequences
• Workflow instances specify data
[Diagram: the same WINGS/Pegasus pipeline (Workflow Creation, Workflow Selection, Data Selection, Component Specification; Workflow Template, Workflow Instance, Executable Workflow; Workflow Libraries, Application Components, Data Repositories, OWL ontologies), here with the VDS PTC shown alongside Pegasus and DAGMan/Globus]
Queries answered
• Keys to provenance
  • Capturing the correct metadata and propagating it through the template and instance
  • Capturing runtime information
• Used SPARQL (with scripting) and SQL to pose queries
  • Queries 1, 2, 5, 6, 8: posed against the File and Workflow Instance ontologies
  • Query 4: posed against the VDS PTC
  • Queries 3, 7, 9: not answered for lack of time
Constraints on Nested Collections
[Figure: OWL representation of nested collections, in four layers]
• Domain independent definitions: File, FileList, Collection, CollectionList, FileCollection, CollOf; properties hasType, hasItems; metadata types Metadata:Int, Metadata:String; constraints on collection element types
• Domain dependent definitions: AnatomyImageFile, AnatomyImagesOfPatientInPeriod, AnatomyImagesOfPatient; properties hasIndexID, hasPeriodID, hasPatientID
• Skolem instance definitions (metadata constraints on collections and their elements): AnatomyImage-Skolem, C-AnatomyImages-Skolem, CC-AnatomyImages-Skolem; values IndexID1, PeriodID1, PatientID1; properties hasIndexID, hasTimePeriodID, hasPatientID
• Example files and collections: CC-AnaImages-for-Patient112 hasItems C-AnaImages_P112_p1, C-AnaImages_P112_p2, …, C-AnaImages_P112_p12; each of these hasItems image part files (e.g. img112_12.part1, 112_12.part2, …)
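A minimal sketch of how metadata constraints could be checked across a collection of collections. It is plain Python rather than OWL, and every name in it (`ImageFile`, `Collection`, `violations`, the attribute names, and the file names `img_a` and `img_b`) is a hypothetical chosen for illustration; only the two collection names are taken from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ImageFile:
    """Leaf item carrying its own metadata (hypothetical attribute names)."""
    name: str
    patient_id: str
    period_id: str
    index_id: int


@dataclass
class Collection:
    """A collection whose metadata every nested item is expected to share."""
    name: str
    metadata: dict
    items: list = field(default_factory=list)


def violations(coll, inherited=None):
    """List items whose metadata contradict an enclosing collection."""
    expected = dict(inherited or {})
    problems = []
    # A nested collection must not contradict the metadata it inherits.
    for key, value in coll.metadata.items():
        if key in expected and expected[key] != value:
            problems.append(f"{coll.name}: {key}={value!r}, expected {expected[key]!r}")
    expected.update(coll.metadata)
    for item in coll.items:
        if isinstance(item, Collection):
            problems.extend(violations(item, expected))
        else:
            for key, value in expected.items():
                actual = getattr(item, key)
                if actual != value:
                    problems.append(f"{item.name}: {key}={actual!r}, expected {value!r}")
    return problems


period1 = Collection(
    "C-AnaImages_P112_p1",
    {"period_id": "p1"},
    [ImageFile("img_a", "112", "p1", 1),
     ImageFile("img_b", "112", "p2", 2)],  # deliberately wrong period
)
patient = Collection("CC-AnaImages-for-Patient112", {"patient_id": "112"}, [period1])

for message in violations(patient):
    print(message)
```

The outer collection fixes the patient ID and the inner one fixes the period ID, so a file is checked against both, which mirrors the relation between collection metadata and item metadata described on the slide.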
Refinement provenance (in design)
• We consider not only the provenance of the executing application but also that of the refinement process, which maps an abstract workflow (workflow instance) onto a set of resources
• The refinement process can be multi-staged
• Stages of the refinement can execute on a variety of resources
• We capture the provenance of the entire workflow as well as of its constituents
• The representations of the refinement provenance and of the workflow provenance are uniform
[Figure: Original Workflow; Workflow 1]
Definition of refinement and execution provenance
<object id> [[I/O] data input/output [function performed] [performance info] [optional annotations]]
• Could include a justification of the reasons for the tasks performed
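The record layout above maps naturally onto a small data structure. The sketch below is an assumption about how such a record might be held in memory and printed in the bracketed notation; the class `ProvenanceRecord`, its field names, and the `render` method are hypothetical and are not part of WINGS or Pegasus.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProvenanceRecord:
    """One refinement or execution step, with the five parts named on the slide."""
    object_id: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    function: str = ""
    performance: List[str] = field(default_factory=list)
    annotations: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the record in a bracketed notation modelled on the slide."""
        def tags(values):
            return "".join(f"<{v}>" for v in values)

        return (
            f"<{self.object_id}>[[I:{tags(self.inputs)}] [O:{tags(self.outputs)}] "
            f"[<{self.function}>] [{tags(self.performance)}] [{tags(self.annotations)}]]"
        )


record = ProvenanceRecord(
    object_id="id6",
    inputs=["workflow7"],
    outputs=["workflow8"],
    function="clustering",
    performance=["host12", "12mins"],
)
print(record.render())
```

Because refinement steps and executed tasks share the same five fields, one class can hold both kinds of record, which is the uniformity the design aims for.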
Provenance records relating to the refinement process

<Workflow8>[[I:<AnatomyImage3@S1><AnatomyHeader3@S1><ReferenceImage@S2><ReferenceHeader@S2><AnatomyImage4@S1><AnatomyHeader4@S1>]
  [O:<WarpParams3><WarpParams4><ReslicedImage3@S1><ReslicedHeader3@S1><ReslicedImage4><ReslicedHeader4>]
  [<description of tasks in workflow 8> (could be in the form of a DAX, the XML DAG used by Pegasus), <task_id1_align_warp><task_id2_align_warp><task_id3_reslice><task_id4_reslice>]
  [<R1,R2><20hours (cumulative time)>…..] []]

<task_id1_align_warp>[[I:<AnatomyImage3@S1><AnatomyHeader3@S1><ReferenceImage@S2><ReferenceHeader@S2>] [O:<WarpParams3>] [<R1><1hr>…] []]

<id1>[[I:<workflow1>] [O:<workflow2><workflow3>] [<partition>] [<host22><2 mins><…>] [<planning horizon set at 5 hours>]]
<id2>[[I:<workflow2>] [O:<workflow4>] [<reduction>] [<…..>] [<……>]]
<id5>[[I:<workflow6>] [O:<workflow7>] [<registration>] [<…>] [<using primary RLS host14>]]
<id6>[[I:<workflow7>] [O:<workflow8>] [<clustering>] [<host12><12mins>] []]
<id7>[[I:<workflow8>] [O:<Ø>] [<dagman_exec>] [] []]

Thanks to Luc Moreau for his input!
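Records of this kind can be chained to recover how an executable workflow was derived. The sketch below encodes only the five refinement records shown on the slide (id1, id2, id5, id6, id7) as tuples and walks backwards from a workflow to its ancestors. The tuple layout and the function `trace_back` are assumptions for illustration, and the walk assumes each refinement step has a single input workflow.

```python
# (record id, input workflows, output workflows, refinement function),
# transcribed from the slide; id3 and id4 are not shown there and are omitted.
RECORDS = [
    ("id1", ["workflow1"], ["workflow2", "workflow3"], "partition"),
    ("id2", ["workflow2"], ["workflow4"], "reduction"),
    ("id5", ["workflow6"], ["workflow7"], "registration"),
    ("id6", ["workflow7"], ["workflow8"], "clustering"),
    ("id7", ["workflow8"], [], "dagman_exec"),
]


def trace_back(target, records):
    """Return the refinement steps that led to `target`, most recent first."""
    # Index each output workflow by the record that produced it.
    produced_by = {
        out: (rid, ins, fn)
        for rid, ins, outs, fn in records
        for out in outs
    }
    steps = []
    current = target
    while current in produced_by:
        rid, ins, fn = produced_by[current]
        steps.append((rid, fn, ins[0], current))
        current = ins[0]  # assumption: one input workflow per refinement step
    return steps


for rid, fn, source, result in trace_back("workflow8", RECORDS):
    print(f"{result} <- {fn} ({rid}) <- {source}")
```

With only the records listed on the slide, the trace from workflow8 stops at workflow6, since the steps that produced workflow6 are not shown.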