DIRAC High-level workflow management

  1. DIRAC High-level workflow management L.Arrabito, J.Bregeon, LUPM/CNRS, A.Tsaregorodtsev, CPPM/IN2P3/CNRS ISGC’2019, 5 April 2019, Taipei

  2. Plan • DIRAC: quick overview of basic subsystems • Workflow management with the Transformation System • Production system • Conclusions

  3. Interware • DIRAC provides all the necessary components to build ad-hoc grid infrastructures interconnecting computing resources of different types, enabling interoperability and simplifying interfaces. This is why DIRAC is referred to as interware.

  4. Users/communities/VOs • A framework shared by multiple experiments/projects in HEP, astronomy, and life sciences • Experiment agnostic • Extensible • Flexible

  5. WMS • Pilot jobs are submitted to computing resources by specialized Pilot Directors • After starting, pilots check the execution environment and form the resource description • OS, capacity, disk space, software, etc. • The resource description is presented to the Matcher service, which chooses the most appropriate user job from the Task Queue • The user job description is delivered to the pilot, which prepares the execution environment and executes the user application • Finally, the pilot uploads the results and output data to a predefined destination
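From the user side, jobs enter the WMS Task Queue through the DIRAC Python API. A minimal submission sketch, assuming an installed DIRAC client and a valid proxy; the executable, arguments and site name are placeholders:

```python
from DIRAC.Core.Base import Script
Script.parseCommandLine()  # initialize the DIRAC client environment

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

# Describe the payload that a matched pilot will eventually execute
job = Job()
job.setName("hello_dirac")
job.setExecutable("echo", arguments="Hello from a DIRAC pilot")
job.setDestination("LCG.SARA.nl")  # optional; otherwise the Matcher decides

# Put the job into the Task Queue; a pilot with a matching
# resource description will pick it up and run it
result = Dirac().submitJob(job)
print(result)  # S_OK/S_ERROR structure; job ID in result["Value"]
```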

  6. DMS • DIRAC provides an abstraction of File Catalog and Storage Element services with their specific implementations: • File Catalogs: DFC, LFC, Transformation, … • Storage Elements: most of the popular access protocols (DIPS, SRM, XRootD, HTTP, …). Many implementations rely on the gfal2 library • The DataManager API is a single client interface for logical data operations
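A sketch of typical DataManager calls; the LFN, local path and SE names below are placeholders:

```python
from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.DataManagementSystem.Client.DataManager import DataManager

dm = DataManager()
lfn = "/vo.example.org/data/run001.dat"  # hypothetical logical file name

# Upload a local file to a Storage Element and register it in the catalog
res = dm.putAndRegister(lfn, "/tmp/run001.dat", "EXAMPLE-DISK-SE")

# Create and register a second replica at another Storage Element
if res["OK"]:
    res = dm.replicateAndRegister(lfn, "EXAMPLE-TAPE-SE")
```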

  7. RMS • The Request Management System allows queueing requests for arbitrary operations, including data operations • Requests are executed by dedicated agents • Possibly using external services • Example operations: • Wrapping of any DIRAC service RPC call • ReplicateAndRegister (e.g. using FTS) • RemoveFile/RemoveReplica • Custom (using the plugin mechanism), e.g. ReplicateUsingAnotherService
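A sketch of queueing a ReplicateAndRegister operation through the RMS client; the request name, LFN and target SE are placeholders:

```python
from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.RequestManagementSystem.Client.Request import Request
from DIRAC.RequestManagementSystem.Client.Operation import Operation
from DIRAC.RequestManagementSystem.Client.File import File
from DIRAC.RequestManagementSystem.Client.ReqClient import ReqClient

request = Request()
request.RequestName = "replicate_run001"

# A request may chain several operations, executed in order
op = Operation()
op.Type = "ReplicateAndRegister"
op.TargetSE = "EXAMPLE-TAPE-SE"

f = File()
f.LFN = "/vo.example.org/data/run001.dat"
op.addFile(f)

request.addOperation(op)
ReqClient().putRequest(request)  # dedicated agents execute it asynchronously
```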

  8. Massive computing and data operations • The Workload, Data and Request Management systems are the basic services for managing jobs and files • For large communities there is a need for automatic creation of large numbers of jobs and automatic execution of multiple data operations • This is achieved by using the DIRAC Transformation System (TS)

  9. What is a Transformation? • A Transformation is defined by: • An input data filter: a metadata query defining the input dataset • A task template for generating jobs or data management requests, plus configurable plugins (see below) • A list of files satisfying the input data filter, together with their status
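The input data filter is an ordinary catalog metadata query. As a sketch, the same query can be tested interactively against the DIRAC File Catalog; the metadata fields and path below are hypothetical:

```python
from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.Resources.Catalog.FileCatalog import FileCatalog

# Hypothetical metadata query, as a transformation input data filter
query = {"application": "corsika", "prodID": 12345, "outputType": "Data"}

res = FileCatalog().findFilesByMetadata(query, path="/vo.example.org")
if res["OK"]:
    print("%d files currently match the filter" % len(res["Value"]))
```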

  10. TS Plugins • The Transformation definition includes the specification of plugins to be used – recipes for task generation • Examples: • Grouping input files for jobs • Default – group files according to replica location • BySize – group files until they reach a certain size • ByShare – group files according to a predefined share and location • Data replication • Broadcast – take files at a given SE and broadcast them to a given number of locations • Job generation • BySE – set the job destination according to the input data location • ByJobType – different rules for site destination, as specified in the CS • Cf. Rucio policies

  11. Example of ByJobType plugin configuration • Define the rules for each JobType, e.g.: • 'Merge' jobs having input data at LCG.SARA.nl can run both at LCG.SARA.nl and LCG.NIKHEF.nl
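The original slide showed the corresponding CS snippet as an image. A hedged reconstruction following the conventions of the ByJobType task plugin; the exact section path and keys may differ between DIRAC versions and installations:

```
# In the Configuration Service, under the Operations section
JobTypeMapping
{
  Merge
  {
    Exclude = ALL  # start with no allowed extra sites
    Allow
    {
      # <site allowed to run the job> = <site holding the input data>
      LCG.SARA.nl = LCG.SARA.nl
      LCG.NIKHEF.nl = LCG.SARA.nl
    }
  }
}
```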

  12. TS architecture

  13. Transformations for data management • The Transformation System generates requests: • ReplicateAndRegister • Remove • etc. • Requests are executed asynchronously by the DMS, optionally using external services, e.g. FTS
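A sketch of a replication transformation, assuming the list-of-operations body format of recent DIRAC releases; the transformation name and SE names are placeholders:

```python
from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.TransformationSystem.Client.Transformation import Transformation

t = Transformation()
t.setTransformationName("Replicate_run001_set")  # must be unique
t.setType("Replication")
t.setDescription("Broadcast files to an extra location")
t.setLongDescription("Example data management transformation")
t.setPlugin("Broadcast")  # replicate each input file to the target SEs
t.setSourceSE(["EXAMPLE-DISK-SE"])
t.setTargetSE(["EXAMPLE-TAPE-SE"])
# The body maps each task to RMS operations (format may vary by version)
t.setBody([("ReplicateAndRegister", {"TargetSE": "EXAMPLE-TAPE-SE"})])
t.addTransformation()  # create the transformation
t.setStatus("Active")
t.setAgentType("Automatic")
```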

  14. Data driven transformations • The TS also implements the FileCatalog interface • Each time new data is registered, the TS gets this information and updates the Transformation File tables if needed • The corresponding tasks are then created with minimal latency • Transformations can be defined for data not yet available • They come into effect as soon as the new data arrive • This allows programming data production scenarios • Transformations creating computing tasks and data management requests can be grouped together in a single workflow

  15. Massive job submission • Jobs created by the TS are actually parametric jobs • The common definition is specified in the job template • Specific details are provided as job parameters • This allows the DIRAC WMS to use its bulk job submission mechanism • Up to 100 jobs submitted per second • Transaction protection to avoid incomplete submissions
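A sketch of the parametric job mechanism the TS relies on: one template plus a parameter sequence, submitted in a single bulk operation; the payload script is a placeholder:

```python
from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("mc_sim")
# One value per task; each value yields one job at submission time
job.setParameterSequence("seed", [str(i) for i in range(100)])
job.setExecutable("simulate.sh", arguments="--seed %(seed)s")  # placeholder

# A single call creates all 100 jobs via the bulk submission mechanism
print(Dirac().submitJob(job))
```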

  16. Workflows • Transformations process input data and produce output data, which in turn can be specified as input data for yet another Transformation • This allows defining chains and graphs of Transformations of arbitrary complexity • Transformations creating computing tasks and data management requests can be grouped together in a single workflow

  17. Workflow: CTA example • CTA MC Production workflow:

  18. TS: why is it not enough? • The TS automates a single step of workflow execution • Need to monitor tens of transformations at once • Manually defining each transformation (job description, input data filter, ...) is error prone • A higher level system is needed to automate the execution of full workflows • LHCb, ILC and Belle II developed specific Production Systems on top of the TS • Found many commonalities • A common general Production System can benefit several communities

  19. Production System • Enhancement of the transformation definition to characterize its inputs and outputs through meta-queries • A production is a set of transformations with their associations ('links') • It is specified through a description consisting of several 'production steps' • Each production step corresponds to a transformation, with the optional specification of a 'linked' transformation • Two transformations are linked if their input and output meta-queries intersect

  20. Production System • Available in the DIRAC v7r0 release • Automatic transformation instantiation based on the production definition • Fully data-driven • Support of multiple workflow schemes • Examples of simple workflows

  21. PS Architecture • The user provides a production description • All the transformations of the workflow and their meta-data queries • The ProdValidator utility checks the validity of the description • Verifying the links between transformations • If valid, the production is stored into the DB • The user activates the production • The ProdTransManager utility creates the associated transformations • Implemented in the DIRAC software framework • Easily extendable using plugin mechanisms
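A sketch of a two-step production description using the DIRAC v7r0 Production System client; step names, metadata fields and payload scripts are placeholders, and the exact client API may differ slightly between releases:

```python
import json

from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.Interfaces.API.Job import Job
from DIRAC.ProductionSystem.Client.ProductionClient import ProductionClient
from DIRAC.ProductionSystem.Client.ProductionStep import ProductionStep

prodClient = ProductionClient()

def make_body(executable):
    """Build a job template and return its workflow XML as the step body."""
    job = Job()
    job.setExecutable(executable)  # placeholder payload
    return job.workflow.toXML()

# Step 1: simulation, no input query, produces 'sim' data
simStep = ProductionStep()
simStep.Name = "Simulation"
simStep.Type = "MCSimulation"
simStep.Outputquery = {"datatype": "sim"}
simStep.Body = make_body("simulate.sh")
prodClient.addProductionStep(simStep)

# Step 2: linked to step 1 because its input meta-query
# intersects the output meta-query of the simulation step
procStep = ProductionStep()
procStep.Name = "Processing"
procStep.Type = "DataProcessing"
procStep.ParentStep = simStep
procStep.Inputquery = {"datatype": "sim"}
procStep.Outputquery = {"datatype": "reco"}
procStep.Body = make_body("reconstruct.sh")
prodClient.addProductionStep(procStep)

# Store the validated description, then activate the production:
# the associated transformations are created automatically
prodClient.addProduction("MyProduction", json.dumps(prodClient.prodDescription))
prodClient.startProduction("MyProduction")
```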

  22. Monitoring transformations • Example: the CTA Transformation Monitor • Monitor the status of each Transformation • Number of jobs in different states • Overall progress

  23. Production example (1) • Recent CTA production (Oct-Dec 2018) • Goal: evaluate the performance of different camera+telescope configurations (for small telescopes only) • Total jobs = 563 000 • Total disk = 592 TB distributed over 3 SEs • 1.3 M replicas in 64 'datasets' • Workflow • Air shower simulation -> 360 TB of 'corsika' data • Telescope simulation processing corsika data for 5 different telescope+camera configurations -> 230 TB of 'simtel' data • Processing of 'simtel' data for event reconstruction -> 0.6 TB • Realized with 68 transformations

  24. Production example (2) • All LHCb data management and processing operations are performed using the DIRAC TS with a custom Production System on top of it • Centralized production management: a team of 3-4 persons • The custom Production System includes complex procedures for production run approval, results validation, etc.

  25. Conclusions • Large user communities need tools for managing massive work and data flows • DIRAC provides the Transformation System for the creation, execution and monitoring of automated, data-driven workflows • The Production System has been developed based on the experience of several large HEP and astrophysics experiments • Available as one of the DIRAC core services, the Production System helps in the description, instantiation and execution of complex workflows

  26. Questions?

  27. Backup slides

  28. Bulk data transfers • Replication/Removal Requests with multiple files are stored in the RMS • By users, data managers, Transformation System • The Request Executing Agent invokes a Replication Operation executor • Performs the replication itself or • Delegates replication to an external service • E.g. File Transfer Service (FTS) • A dedicated FTSManager service keeps track of the submitted FTS requests • FTSMonitor Agent monitors the request progress, updates the FileCatalog with the new replicas • Other data moving services can be connected as needed • EUDAT, OneData

  29. Example transformation definition • MC simulation • You want to generate many identical jobs with a varying parameter (and no input files) • The varying parameter should be built from @{JOB_ID}, which corresponds to the TaskID, and is used in the job workflow, e.g.: • Create an MC Transformation
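The original slide showed this as a screenshot. A minimal sketch of what creating such an MC transformation could look like; the script name, transformation name and task limit are placeholders:

```python
from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.Interfaces.API.Job import Job
from DIRAC.TransformationSystem.Client.Transformation import Transformation

# Job template: @{JOB_ID} is expanded by the TS to the task ID,
# giving every generated job a different seed
job = Job()
job.setName("MCSim_@{JOB_ID}")
job.setExecutable("simulate.sh", arguments="--seed @{JOB_ID}")  # placeholder

t = Transformation()
t.setTransformationName("MCSimulation_example")  # must be unique
t.setType("MCSimulation")
t.setDescription("Example MC transformation")
t.setLongDescription("Many identical jobs with a varying @{JOB_ID} parameter")
t.setMaxNumberOfTasks(1000)  # no input files: bound the number of tasks
t.setBody(job.workflow.toXML())
t.addTransformation()  # create the transformation
t.setStatus("Active")
t.setAgentType("Automatic")
```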
