DIRAC High-level workflow management L.Arrabito, J.Bregeon, LUPM/CNRS, A.Tsaregorodtsev, CPPM/IN2P3/CNRS ISGC’2019, 5 April 2019, Taipei
Plan • DIRAC: quick overview of basic subsystems • Workflow management with the Transformation System • Production system • Conclusions
Interware • DIRAC provides all the components needed to build ad-hoc grid infrastructures interconnecting computing resources of different types, ensuring interoperability and simplifying interfaces. This is why DIRAC is referred to as interware.
Users/communities/VOs • A framework shared by multiple experiments/projects in HEP, astronomy, and the life sciences • Experiment agnostic • Extensible • Flexible
WMS • Pilot jobs are submitted to computing resources by specialized Pilot Directors • After starting, pilots check the execution environment and form the resource description • OS, capacity, disk space, software, etc. • The resource description is presented to the Matcher service, which chooses the most appropriate user job from the Task Queue • The user job description is delivered to the pilot, which prepares its execution environment and executes the user application • Finally, the pilot uploads the results and output data to a predefined destination
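The pilot/Matcher handshake above can be sketched as a simple matching loop. This is an illustrative toy, not the DIRAC API: a pilot presents its resource description, and the Matcher returns the first queued job whose requirements fit it.

```python
# Toy sketch of the Matcher: pick a job from the Task Queue whose
# requirements fit the pilot's resource description (illustrative only).

def matcher(resource, task_queue):
    """Return the first queued job whose requirements fit the resource."""
    for job in task_queue:
        req = job["requirements"]
        # Missing requirements default to "anything is fine".
        if (req.get("os", resource["os"]) == resource["os"]
                and req.get("disk_gb", 0) <= resource["disk_gb"]):
            task_queue.remove(job)   # the job leaves the Task Queue
            return job
    return None  # no matching job: the pilot waits or exits

pilot_resource = {"os": "CentOS7", "disk_gb": 20}
queue = [
    {"id": 1, "requirements": {"os": "SL6", "disk_gb": 5}},
    {"id": 2, "requirements": {"os": "CentOS7", "disk_gb": 10}},
]
print(matcher(pilot_resource, queue)["id"])  # → 2
```

The real Matcher ranks many more attributes (software tags, VO shares, priorities), but the shape of the exchange is the same.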
DMS • DIRAC provides an abstraction of File Catalog and Storage Element services with their specific implementations: • File Catalogs: DFC, LFC, Transformation, … • Storage Elements: most of the popular access protocols (DIPS, SRM, XRootD, HTTP, …); many implementations rely on the gfal2 library • The DataManager API is a single client interface for logical data operations
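The "single client interface" idea can be illustrated with a toy facade (class and method names here are simplified stand-ins, not the actual DIRAC DataManager API): the client asks for a logical file name, the catalog resolves replicas, and the facade talks to whichever Storage Element holds one.

```python
# Toy sketch: a DataManager-like facade over a catalog and storage
# elements. Illustrative only; the real DIRAC API differs.

class FileCatalog:
    def __init__(self):
        self.replicas = {}   # lfn -> {se_name: physical_url}

    def add_replica(self, lfn, se, url):
        self.replicas.setdefault(lfn, {})[se] = url

    def get_replicas(self, lfn):
        return self.replicas.get(lfn, {})

class DataManager:
    """Single entry point for logical data operations."""
    def __init__(self, catalog):
        self.catalog = catalog

    def get_file(self, lfn):
        replicas = self.catalog.get_replicas(lfn)
        if not replicas:
            raise FileNotFoundError(lfn)
        se, url = next(iter(replicas.items()))   # pick any replica
        return f"downloaded {lfn} from {se} ({url})"

fc = FileCatalog()
fc.add_replica("/vo/data.root", "DISK-1", "root://se1//vo/data.root")
dm = DataManager(fc)
print(dm.get_file("/vo/data.root"))
```

The point of the abstraction is that the caller never deals with protocols or physical URLs directly; those stay behind the Storage Element layer.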
RMS • The Request Management System queues requests for arbitrary operations, including data operations • Requests are executed by dedicated agents • Possibly using external services • Example operations • Wrap any DIRAC service RPC call • ReplicateAndRegister (e.g. using FTS) • RemoveFile / RemoveReplica • Custom (using the plugin mechanism), e.g. ReplicateUsingAnotherService
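The queue-and-dispatch pattern above can be sketched in a few lines. This is a conceptual illustration (executor names and the request layout are invented for the example, not DIRAC's): a request carries a list of operations, and an agent dispatches each to the matching executor, possibly one that delegates to an external service such as FTS.

```python
# Toy sketch of RMS request execution: an agent dispatches each queued
# operation to an executor. Names and structure are illustrative.

EXECUTORS = {
    "ReplicateAndRegister": lambda op: f"FTS transfer of {op['lfn']}",
    "RemoveFile": lambda op: f"removed {op['lfn']}",
}

def execute_request(request):
    """Run every operation of a request through its executor."""
    return [EXECUTORS[op["type"]](op) for op in request["operations"]]

req = {"operations": [
    {"type": "ReplicateAndRegister", "lfn": "/vo/a"},
    {"type": "RemoveFile", "lfn": "/vo/b"},
]}
print(execute_request(req))
```

The plugin mechanism mentioned above corresponds to adding new entries to such an executor registry without touching the agent itself.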
Massive computing and data operations • The Workload, Data and Request Management systems are the basic services for managing jobs and files • Large communities need the automatic creation of large numbers of jobs and the automatic execution of multiple data operations • This is achieved with the DIRAC Transformation System (TS)
What is a Transformation? • A transformation is defined by • An input data filter: a metadata query defining the input dataset • A task template for generating jobs or data management requests, plus configurable plugins (see below) • A list of files satisfying the input data filter, together with their status
TS Plugins • The transformation definition includes a specification of the plugins to be used, i.e. recipes for task generation • Examples: • Grouping input files for jobs • Default: group files according to replica location • BySize: group files until they reach a certain size • ByShare: group files according to a predefined share and location • Data replication • Broadcast: take files at a given SE and broadcast them to a given number of locations • Job generation • BySE: set the job destination according to the input data location • ByJobType: different rules for site destination, as specified in the CS • Cf. Rucio policies
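The BySize recipe is easy to state precisely. The sketch below is a simplified stand-in for such a grouping plugin (the function name and signature are invented for illustration, not the DIRAC plugin interface): accumulate files into a task until the group reaches a size threshold.

```python
# Toy sketch of a "BySize" grouping plugin: group input files into
# tasks until each group reaches a size threshold. Illustrative only.

def group_by_size(files, max_bytes):
    """files: list of (lfn, size) pairs -> list of LFN groups."""
    groups, current, total = [], [], 0
    for lfn, size in files:
        current.append(lfn)
        total += size
        if total >= max_bytes:
            groups.append(current)
            current, total = [], 0
    if current:          # flush the last, possibly undersized group
        groups.append(current)
    return groups

files = [("/vo/f1", 3), ("/vo/f2", 4), ("/vo/f3", 2), ("/vo/f4", 6)]
print(group_by_size(files, 5))  # → [['/vo/f1', '/vo/f2'], ['/vo/f3', '/vo/f4']]
```

The other plugins follow the same pattern with a different grouping key: replica location for the default plugin, a configured share for ByShare.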
Example of ByJobType plugin configuration • Define the rules for each JobType, e.g.: • 'Merge' jobs with input data at LCG.SARA.nl can run both at LCG.SARA.nl and LCG.NIKHEF.nl
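A CS section expressing this rule might look roughly as follows. The section layout below is a sketch of the general DIRAC CS syntax; the exact option names for the ByJobType mapping should be checked against the DIRAC documentation:

```
JobTypeMapping
{
  Merge
  {
    Exclude = ALL
    Allow
    {
      # hypothetical rule: a site = the data-holding site it may read from
      LCG.NIKHEF.nl = LCG.SARA.nl
    }
  }
}
```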
Transformations for data management • The Transformation System generates requests: • ReplicateAndRegister • Remove • etc. • Requests are executed asynchronously by the RMS, optionally using external services, e.g. FTS
Data driven transformations • The TS also implements the FileCatalog interface • Each time new data is registered, the TS gets this information and updates the Transformation File Tables if needed • The corresponding tasks are then created with minimal latency • Transformations can be defined for data that are not yet available • They come into effect as soon as the new data arrive • This allows data production scenarios to be programmed • Transformations creating computing tasks and data management requests can be grouped together in a single workflow
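The data-driven mechanism above can be sketched conceptually: because the TS sits behind the FileCatalog interface, each registration is also offered to the active transformations, whose input filters decide whether the file lands in their file tables. Class and function names here are invented for the illustration.

```python
# Conceptual sketch of data-driven task creation in the TS.
# Illustrative only; not the DIRAC implementation.

class Transformation:
    def __init__(self, name, meta_filter):
        self.name = name
        self.meta_filter = meta_filter   # e.g. {"type": "corsika"}
        self.files = []                  # the transformation file table

    def matches(self, metadata):
        return all(metadata.get(k) == v for k, v in self.meta_filter.items())

def register_file(lfn, metadata, transformations):
    """Called on each catalog registration; feeds matching transformations."""
    for t in transformations:
        if t.matches(metadata):
            t.files.append({"lfn": lfn, "status": "Unused"})

sim = Transformation("telescope-sim", {"type": "corsika"})
register_file("/cta/run1.corsika", {"type": "corsika"}, [sim])
print(sim.files[0]["status"])  # → Unused
```

A transformation defined before its input data exists is just an entry with an empty file table; the first matching registration starts filling it, which is what "minimal latency" refers to above.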
Massive job submission • Jobs created by the TS are actually parametric jobs • The common definition is specified in the job template • Specific details are provided as job parameters • This enables the DIRAC WMS bulk job submission mechanism • Up to 100 jobs submitted per second • Transaction protection to avoid incomplete submissions
Workflows • Transformations process input data and produce output data, which in turn can be specified as input data for yet another transformation • This allows chains and graphs of transformations of arbitrary complexity to be defined • Transformations creating computing tasks and data management requests can be grouped together in a single workflow
Workflow: CTA example • CTA MC Production workflow:
TS: why is it not enough? • The TS automates a single step of workflow execution • Need to monitor tens of transformations at once • Manually defining each transformation (job description, input data filter, ...) is error prone • A higher-level system is needed to automate the execution of full workflows • LHCb, ILC and Belle II developed specific Production Systems on top of the TS • They found many commonalities • A common, general Production System can benefit several communities
Production System • Enhancement of the transformation definition to characterize its inputs and outputs through meta-queries • A production is a set of transformations with their associations ('links') • It is specified through a description consisting of several 'production steps' • Each production step corresponds to a transformation, with the optional specification of a 'linked' transformation • Two transformations are linked if their input and output meta-queries intersect
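The linking rule, "two transformations are linked if their input and output meta-queries intersect", can be made concrete with a toy check. This simplifies meta-queries to exact key/value dictionaries (real meta-queries also support operators and ranges):

```python
# Toy link check between two production steps: the output meta-query of
# step 1 must be able to select files in common with the input
# meta-query of step 2. Simplified to exact key/value matching.

def queries_intersect(q1, q2):
    """True if the two queries can select common files."""
    shared = set(q1) & set(q2)
    # no shared keys: nothing constrains both queries the same way
    return bool(shared) and all(q1[k] == q2[k] for k in shared)

step1_output = {"type": "corsika", "site": "paranal"}
step2_input = {"type": "corsika"}
print(queries_intersect(step1_output, step2_input))  # → True
```

This is the kind of consistency check a validator can run over a production description before any transformation is instantiated.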
Production System • Available in the DIRAC v7r0 release • Automatic transformation instantiation based on the production definition • Fully data-driven • Support of multiple workflow schemes • Examples of simple workflows
PS Architecture • The user provides a production description • All the transformations of the workflow and their metadata queries • The ProdValidator utility checks the validity of the description • Verifying the links between transformations • If valid, the production is stored in the DB • The user activates the production • The ProdTransManager utility creates the associated transformations • Implemented in the DIRAC software framework • Easily extendable using plugin mechanisms
Monitoring transformations • Example: the CTA Transformation Monitor • Monitor the status of each transformation • Number of jobs in different states • Overall progress
Production example (1) • CTA recent production (Oct-Dec 2018) • Goal: evaluate the performance of different camera + telescope configurations (for small telescopes only) • Total jobs: 563 000 • Total disk: 592 TB distributed over 3 SEs • 1.3 M replicas in 64 'datasets' • Workflow • Air shower simulation -> 360 TB of 'corsika' data • Telescope simulation processing corsika data for 5 different telescope + camera configurations -> 230 TB of 'simtel' data • Processing of 'simtel' data for event reconstruction -> 0.6 TB • Realized with 68 transformations
Production example (2) • All LHCb data management and processing operations are performed using the DIRAC TS, with a custom Production System on top of it • Centralized production management: a team of 3-4 persons • The custom Production System includes complex procedures for production run approval, result validation, etc.
Conclusions • Large user communities need tools for managing massive work and data flows • DIRAC provides the Transformation System for the creation, execution and monitoring of automated, data-driven workflows • The Production System was developed based on the experience of several large HEP and astrophysics experiments • Available as one of the DIRAC core services, the Production System helps with the description, instantiation and execution of complex workflows
Bulk data transfers • Replication/Removal Requests with multiple files are stored in the RMS • By users, data managers, Transformation System • The Request Executing Agent invokes a Replication Operation executor • Performs the replication itself or • Delegates replication to an external service • E.g. File Transfer Service (FTS) • A dedicated FTSManager service keeps track of the submitted FTS requests • FTSMonitor Agent monitors the request progress, updates the FileCatalog with the new replicas • Other data moving services can be connected as needed • EUDAT, OneData
Example transformation definition • MC simulation • You want to generate many identical jobs with a varying parameter (and no input files) • The varying parameter should be built from @{JOB_ID}, which corresponds to the TaskID, and is used in the job workflow, e.g.: • Create an MC Transformation
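Putting the pieces of this slide together, an MC transformation amounts to: no input data filter, a task template carrying @{JOB_ID}, and a requested number of tasks whose IDs substitute the placeholder. The field names below are invented for the illustration, not DIRAC's transformation schema.

```python
# Toy sketch of an MC transformation definition and its expansion into
# per-task commands via @{JOB_ID}. Illustrative only.

def make_mc_transformation(name, template, n_tasks):
    """Build a transformation with no input filter and n_tasks tasks."""
    return {
        "name": name,
        "type": "MCSimulation",
        "input_query": None,    # MC simulation: no input files
        "tasks": [template.replace("@{JOB_ID}", str(i))
                  for i in range(1, n_tasks + 1)],
    }

trans = make_mc_transformation(
    "prod_sims", "corsika --run @{JOB_ID} --output sim_@{JOB_ID}.root", 2)
print(trans["tasks"][0])  # → corsika --run 1 --output sim_1.root
```

Because each task ID is unique, deriving the varying parameter (a run number or random seed) from @{JOB_ID} guarantees that the otherwise identical jobs do not collide.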