Introduction to Ganga 6 Alexander Richards (a.richards@imperial.ac.uk) Thanks to Patrick Owen, ICL for some of the slide content
Outline Introduction Output control Post processing Parallel user code Future work
Introduction • Ganga 6 is the first major release of Ganga for ~4 years. • Several new features have been implemented, which will be introduced in the coming sections. • There have been some teething problems, but I think we are past them now and I recommend people switch to Ganga 6. • The purpose of this talk is to provide an update on the new functionality users will find in Ganga 6, complete with examples to make things more concrete. • If you require a basic introduction to Ganga in general, please see: • https://indico.cern.ch/getFile.py/access?contribId=10&sessionId=4&resId=1&materialId=slides&confId=159521 • As always, help is available: • Interactively, using help(object) • Online, via the Ganga manuals or FAQs: • http://ganga.web.cern.ch/ganga/user/index.php • https://twiki.cern.ch/twiki/bin/view/LHCb/GangaLHCbFAQ • Via the LHCb DAST mailing list (lhcb-distributed-analysis@cern.ch)
Outline Introduction Output control Post processing Parallel user code Future work
A more generic output definition system • Previously: • Two separate job attributes, 'outputsandbox' and 'outputdata'. • Limited scope for automating the output destination for a given job: • outputsandbox -> local • outputdata -> DIRAC SE / mass storage (depending on backend) • Not obvious where to expect any given piece of data. • Any future expansion would have been a hack around this structure. • Motivation: • To provide a general output definition system that makes it obvious to the user where their data will end up. • To provide a system scalable enough to incorporate any future changes/additions to the output specification. • To give the user more flexibility about where a given job's data will end up, i.e. decouple the output data location from the backend where the job runs. • To provide the user with useful metadata about any given piece of data.
Implementation • The implementation involves: • Amalgamating outputsandbox and outputdata into the single job attribute 'outputfiles'. • This outputfiles list attribute can take any type of outputfile object. It is this object that is responsible for making sure the job's output turns up in the right place. • The currently defined outputfile objects include: • SandboxFile -> return file to local workspace • DiracFile -> upload file to DIRAC SE • LCGSEFile -> upload to grid SE (for LHCb, just use DiracFile) • MassStorageFile -> upload to mass storage (e.g. CASTOR) • The copy command etc. can be changed from the config so that this can also just represent a scratch file on disk; a new file type with that setup as default may be added.
Outputfile objects • All outputfiles should have, as a minimum, a 'namePattern' attribute. • The localDir attribute is the local directory which the file will come back to or be uploaded from. The default None means the job's working dir is used (or the current working dir for standalone use, see later). • Sketch of the DiracFile class:

class DiracFile(IOutputFile):
    namePattern = ''
    localDir = None
    lfn = ''
    guid = ''
    locations = []

    def get(self):
        '''Get the file locally.'''

    def put(self):
        '''Upload the file to DIRAC SE.'''
Example • Here we see how to replicate the behaviour of the old outputsandbox and outputdata attributes, starting from a normal job setup (sketch below).
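The original slide showed this as a screenshot; a minimal sketch of the equivalent Ganga session follows, with illustrative file names and backend:

j = Job(backend=Dirac())                      # normal job setup here
j.outputfiles = [SandboxFile('summary.xml'),  # was: j.outputsandbox -> local workspace
                 DiracFile('ntuple.dst')]     # was: j.outputdata -> DIRAC SE
j.submit()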
Automatic type detection • To ease the user's entry of these file types, automatic type detection is employed; by default it sends all *.dst files to DIRAC while returning everything else to the local workspace. • For example, we can reproduce our previous example in a shortened, quicker form (sketch below).
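A sketch of the shortened form, assuming the default [Output] config; plain strings are promoted to outputfile objects by extension:

j = Job(backend=Dirac())                      # normal job setup here
# '*.dst' matches DiracFile's fileExtensions -> DiracFile;
# anything else falls back to SandboxFile
j.outputfiles = ['ntuple.dst', 'summary.xml']
j.submit()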
Using the config file to change the auto file type detection

[Output]
MassStorageFile = {'fileExtensions' : ['*.dummy'],
    'backendPostprocess' : {'LSF' : 'WN', 'LCG' : 'client',
                            'CREAM' : 'client', 'Localhost' : 'WN'},
    'uploadOptions' : {'mkdir_cmd' : 'nsmkdir', 'cp_cmd' : 'rfcp', 'ls_cmd' : 'nsls',
                       'path' : '/castor/cern.ch/user/i/idzhunov/ganga'}}
DiracFile = {'fileExtensions' : ['*.dst'],
    'backendPostprocess' : {'Dirac' : 'WN', 'LSF' : 'WN',
                            'LCG' : 'WN', 'CREAM' : 'WN', 'Localhost' : 'WN'},
    'uploadOptions' : {}}

• If the namePattern of an outputfile doesn't match any of the fileExtensions then it defaults to a SandboxFile. • Uploading either happens on the WN or, if the backend is set to 'client', is performed via a two-step process: • First the file returns locally. • Then the file is uploaded to its destination from the local client.
Wildcards • Wildcards are fully supported and can be used in a file's namePattern. • Once a job has completed, an outputfile with a wildcard in its namePattern is automatically expanded, e.g. DiracFile('*.root') turns into [ DiracFile('a.root'), DiracFile('b.root') ]. • For consistency with the original job, a copy of the job will reduce the wildcard outputfile back to its original state, in this case DiracFile('*.root'). When the copy has finished running it too will be expanded.
Getting a job's files • Getting a file back locally should be just a case of calling the file object's 'get' method. • Alternatively, if running a DIRAC job, one can use the 'getOutputData' method of the Dirac backend to get all DiracFiles locally (sketch below).
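The slide's screenshots are missing; a minimal sketch of both routes, assuming a completed DIRAC job j with DiracFile outputs (file name illustrative):

# Fetch a single file via its 'get' method
f = j.outputfiles.get('ntuple.dst')[0]   # filter by namePattern, see later slides
f.get()

# Or fetch all DiracFiles at once via the Dirac backend
j.backend.getOutputData()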
Removing a DIRAC SE file • If you wish to remove a DiracFile from a DIRAC SE you can use the 'remove' method (this method is not available for all outputfile types). • Note: if in your .gangarc file you set the config variable AutoRemoveFilesWithJob = True (False by default), then removing a job will automatically remove its DiracFiles, as well as any other outputfile types listed in the AutoRemoveFileTypes config var:

[Output]
#AutoRemoveFilesWithJob = False
#AutoRemoveFileTypes = ['DiracFile']
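Removing by hand is a one-liner; a sketch, assuming j has a completed DiracFile output:

# Remove the uploaded replica(s) from the DIRAC SE
j.outputfiles.get(DiracFile)[0].remove()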
Filtering of the job.outputfiles list • To ease the user's experience with the outputfiles attribute, and to avoid users having to code list filters themselves, a 'get' method has been added to the GangaList. • This method takes one argument and, in the case of the outputfiles list, returns a subset based on the argument given. • For outputfiles, this argument can be: • A string (including wildcards), which is matched against the namePattern attribute • A class type, which is matched against the class type • A class instance, which performs an == style match (matching both type and namePattern)
Examples • By way of example, let's set up the following job (see the sketch below). • Then we try the filter system: • First with a plain string • Second with a wildcard string
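The setup and the first two filters were screenshots in the original; a sketch, with illustrative file names:

j = Job()
j.outputfiles = [SandboxFile('summary.xml'),
                 DiracFile('a.root'),
                 DiracFile('b.root')]

j.outputfiles.get('a.root')   # string   -> [DiracFile('a.root')]
j.outputfiles.get('*.root')   # wildcard -> [DiracFile('a.root'), DiracFile('b.root')]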
Examples • Third, filtering with a type • Finally, with an instance of an outputfile (continued in the sketch below)
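Continuing the same sketch:

j.outputfiles.get(DiracFile)             # type -> both DiracFiles
j.outputfiles.get(SandboxFile)           # type -> [SandboxFile('summary.xml')]
j.outputfiles.get(DiracFile('a.root'))   # instance: matches type and namePattern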
Labouring the point… • As shown previously, the getOutputData method of the Dirac backend loops over the job's DiracFile outputfiles and gets the data back locally. • Using the outputfiles filters, one can achieve the same thing for any backend in one line (sketch below).
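A sketch of that one-liner:

# Backend-independent equivalent of getOutputData
for f in j.outputfiles.get(DiracFile): f.get()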
Standalone • Outputfile objects can be used standalone as well. • Here, with localDir = None (the default), the file is returned to, and uploaded from, the current working dir. • You can even store these outputfile objects in the box for later downloading, or just for record keeping (sketch below). • Note that unless stored in the box, standalone files will not be persisted, so information like an uploaded file's LFN will be lost between Ganga sessions.
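A sketch of a standalone upload kept in the box (the box label is illustrative):

df = DiracFile('ntuple.dst')   # file sits in the current working dir
df.put()                       # upload; df.lfn is now filled
box.add(df, 'ntuple upload')   # persist it, so the LFN survives this session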
Examples – Uploading to / downloading from CASTOR • By way of example, let's look at uploading to CASTOR. • Uploading is done using the 'put' method. • Once uploaded, the MassStorageFile should have its metadata filled. • The file can then be retrieved locally again using the 'get' method (sketch below).
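The screenshots are missing from this slide; a minimal standalone sketch, assuming mass storage is configured as in the [Output] section shown earlier:

msf = MassStorageFile('summary.xml')   # file in the current working dir
msf.put()    # upload to mass storage (e.g. CASTOR)
msf          # display the object: its metadata should now be filled
msf.get()    # retrieve the file locally again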
Examples – DiracFile • Using the same syntax we can upload a file to a DIRAC SE. • The localDir default means the upload is from the current dir. • Again, 'put' performs the upload. • DiracFile has different metadata from MassStorageFile. • Retrieve the file locally with the 'get' method. • Remove the file from the DIRAC SE with the 'remove' method (sketch below).
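The corresponding standalone DiracFile sketch:

df = DiracFile('ntuple.dst')
df.put()      # upload from the current dir; fills lfn, guid, locations
df.lfn        # DiracFile-specific metadata
df.get()      # retrieve the file locally
df.remove()   # delete the replica(s) from the DIRAC SE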
Examples – Setting up a DiracFile for an existing LFN • If you know the LFN of a file on a DIRAC SE you can manually make a DiracFile object to represent it. • Using the 'getMetadata' method you can automatically populate the DiracFile's metadata. • Then, using the 'get' method, you can fetch it as normal (sketch below).
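A sketch, with a placeholder LFN:

df = DiracFile(lfn='/lhcb/user/a/another/ntuple.dst')   # illustrative LFN
df.getMetadata()   # populate the metadata from the file catalogue
df.get()           # then download as normal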
Outline Introduction Output control Post processing Parallel user code Future work
Post-completion job activities • Previously: • The only thing a user could specify to happen when a job finished was the merging of output data. • This was done by attaching a merger object to the job's merger attribute. • A job was marked completed or failed purely on the return status (either exit code or DIRAC status). • The only way to know a job had finished was to look at Ganga regularly. • Motivation: • To provide a more powerful and flexible set of post-job tools that the user can use as and when they see fit. • To provide a more general attribute framework for these tools, allowing easy addition not only of new tools but of new types as well. • To provide tools that allow the final status of a job to be determined in a smarter and more customisable way. • To provide an email notification system to inform users of job progress.
Implementation • The implementation involves: • Replacing the restrictive job.merger attribute with the far more general job.postprocessors. • job.postprocessors is a list to which you can add multiple postprocessor objects. • Migrating the mergers to use this new framework. • Currently there are three types of postprocessor: • Mergers -> as before, with the addition of LHCbFileMerger • Checkers -> objects which can fail a job based on some criteria • Notifier -> an email notification system • Postprocessors are executed by class in the order merger -> checker -> notifier, and within a class type in the original list order.
Mergers • Only minor changes to the mergers from the user's point of view: • The mergers have been re-written but the interface stays the same. • MultipleMerger is removed; just add the separate mergers to the postprocessors list. • DSTMerger is removed; instead use the new LHCbFileMerger, which should merge all LHCb file types (.dst, .sim, etc.). • For more info use help(LHCbFileMerger). • Mergers are attached to jobs via the postprocessors attribute, as with all postprocessor objects (sketch below).
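A sketch of attaching mergers through the new attribute; since the merger interface is unchanged, the 'files' attribute is as before (file names illustrative):

j = Job()
j.postprocessors.append(RootMerger(files=['hist.root']))
# No MultipleMerger needed: just append further mergers to the list
j.postprocessors.append(LHCbFileMerger(files=['out.dst']))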
Checkers • The following types of checker are currently in Ganga 6: • FileChecker -> checks the listed output files and fails the job if a particular string is found (or not found). • LHCbMetaDataChecker -> can fail the job based on the job's metadata. • CustomChecker -> mimicking the CustomMerger, gives the user the ability to define custom actions to perform when checking jobs. • RootFileChecker -> checks that all ROOT files are mergeable and have the same structure, and fails the job when this is not the case. • New (not in a release yet), so not really discussed in this talk. • Care is needed: auto-resubmission will happen even when a job is failed by a checker. This may not make sense: if the job ran properly but the user fails it in a checker, the resubmitted job will always fail again.
FileChecker • The FileChecker can fail a job based on the presence or absence of a search string. • Specify the files to search through as a list with the 'files' attribute. • The text strings to search for are defined as a list with the 'searchStrings' attribute. • Whether it fails the job on the first appearance of a string, or on its absence, is specified with the boolean 'failIfFound' attribute (sketch below).
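A sketch, failing the job if 'Segmentation fault' appears in stdout (the file and search string are illustrative):

fc = FileChecker()
fc.files = ['stdout']
fc.searchStrings = ['Segmentation fault']
fc.failIfFound = True   # True: fail if found; False: fail if absent
j.postprocessors.append(fc)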
LHCbMetaDataChecker • LHCb Gaudi-type jobs populate the metadata attribute upon completion. • The LHCbMetaDataChecker allows the user to check a simple expression involving the job's metadata. • The expression uses keywords to represent the metadata items. • Currently only three are understood: • 'inputevents' = j.events['input'] • 'outputevents' = j.events['output'] • 'lumi' = j.lumi (as a float)
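A sketch, assuming the expression is supplied via an 'expression' attribute (the attribute name is an assumption; the keyword comes from the list above):

mc = LHCbMetaDataChecker()
mc.expression = 'inputevents > 0'   # fail the job if no events were read
j.postprocessors.append(mc)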
CustomChecker • The CustomChecker, much like the CustomMerger, allows users to have their own code run as a checker. • It requires them to create a python file (e.g. Checker.py) defining a check function that takes one argument, the job to be checked. • The return value should be a boolean indicating whether the job should be marked completed (True) or failed (False). • Then attach the CustomChecker to your job, pointing to your file via the module attribute (sketch below).
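A sketch of the two pieces; the file path and the check itself are illustrative:

# ~/Checker.py
def check(job):
    '''Return True to mark the job completed, False to fail it.'''
    return len(job.outputfiles.get('*.root')) > 0

# In the Ganga session
cc = CustomChecker(module='~/Checker.py')
j.postprocessors.append(cc)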
Notifier • Attach a Notifier object to your job to receive emails (sketch below). • The default behaviour is to email when master jobs complete or fail, and when subjobs fail. • Important: emails will only be sent while Ganga is running, so this is only useful if you run Ganga in the background. • Please don't reply to the emails; if you have questions, email the DAST list instead: lhcb-distributed-analysis@cern.ch
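A sketch, assuming the recipient is set via an 'address' attribute (attribute name and address are illustrative):

n = Notifier(address='a.n.other@cern.ch')
j.postprocessors.append(n)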
Outline Introduction Output control Post processing Parallel user code Future work
Parallel processing of user tasks • Previously: • While monitoring was performed on a separate thread, all user activity happened synchronously. • This occasionally meant the user had to wait for something to complete (that could have been performed without further user input, e.g. in the background) before being able to interact with Ganga again. • While the monitoring was threaded, it too acted synchronously when dealing with completed jobs, i.e. downloading outputs and performing actions on these jobs sequentially.
Parallel processing of user tasks • Motivation: • Provide the user with a convenient way to asynchronously execute commands via Ganga-aware threads. • Allow the monitoring to be truly independent of post-job activities, like sandbox downloading, via asynchronous use of Ganga-aware threads. • Balance performance by having a thread pool of fixed size and the ability to queue up execution tasks. • Provide a way to asynchronously execute, and capture output from, shell commands, as well as running python commands in a separate process within the DIRAC environment (experts only). • Provide a convenient, user-friendly and familiar way to monitor the status of the thread pool and queues.
Implementation • The implementation involves: • The addition of a new command, similar to jobs, called 'queues', that allows the user to inspect the current state of the threads in the pool as well as the state of the queues. • Allowing the queues command to also act as the interface for users to submit execution blocks (python callables) to the queue for asynchronous execution. • Allowing the user to specify the number of worker threads in the thread pool via a config option (default is 5):

[DIRAC]
#NumWorkerThreads = 5
Queues • Using the queues command we can see the current status of the thread pool and queues, for the user threads as well as the monitoring threads. • Only the user thread pool can be manipulated via the queues interface. • The monitoring thread pool is shown for information only.
Adding code to run asynchronously • To add a python callable object to the queue, simply pass it to the queues 'add' method along with any arguments required. • It doesn't have to be a user-made callable; any callable will do, including methods commonly used with jobs such as remove, submit, etc. • Beware the common pitfall of adding the return value of a callable instead of the callable itself (not asynchronous!): queues.add( f( 123 ) ) evaluates f( 123 ) immediately, which returns None. A sketch of the correct usage follows.
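A minimal sketch, assuming queues.add accepts the callable plus its arguments separately (the function is illustrative):

def f(x):
    print(x)

queues.add(f, args=(123,))   # correct: f runs later on a worker thread
# queues.add(f(123))         # wrong: f runs now, and None gets queued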
Busy worker threads (example only, DON'T DO THIS!) • The last example finishes too quickly to show a worker busy, so we re-define the function f (sketch below). • Now, using the queues command, we can see a worker thread busy processing the function.
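A sketch of such a deliberately slow function (again, for demonstration only):

import time

def f(x):
    time.sleep(60)   # keep the worker thread busy for a while
    print(x)

queues.add(f, args=(123,))
queues               # the queues display now shows a busy worker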
Queuing up executable items • Users may add more callables than there are threads available. • All additions go into the queue, from which threads pick items when idle. • Here we can see that all workers are busy and there are three extra items in the queue waiting to be processed (sketch below).
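A sketch that over-fills the default pool of 5 workers, leaving three items queued:

for i in range(8):
    queues.add(f, args=(i,))   # 5 run immediately, 3 wait in the queue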
Purging the queue • Unfortunately threads cannot be interrupted, so make sure that whatever you give them to execute will not get stuck. • The queue itself, however, can be purged using the queues 'purge' method.
Exiting Ganga with active workers or a full queue • As queues' user threads are Ganga-aware, you will be warned if they are still active when you request Ganga exit, just as with other Ganga threads. • Forcing the exit will terminate all threads immediately, irrespective of what they were doing. • This is most obvious to the user when jobs are finalising, as they can then get stuck in completing. • Items in the queue will be dropped and lost, just as if purged. • The exception is job_finalisation in the monitoring queue: here queue items are dropped at exit but reinstated at start-up, without a call to DIRAC to check the status.
Outline Introduction Output control Post processing Parallel user code Future work
Future plans • Implementing the auto-resubmit tool as a dedicated postprocessor. • Unifying the approach to job input with the new output framework. • Adding the possibility of locked 'named' templates, which can represent anything from debug cases to best-practice examples. • The ability for external package authors (e.g. DaVinci) to maintain best-practice / recommended-workflow named templates, which could be imported into Ganga at build time.
Finally… • For more info on the output system, see: • https://twiki.cern.ch/twiki/bin/view/ArdaGrid/GangaOutputTutorial • For more info on the post-processing tools, see: • https://twiki.cern.ch/twiki/bin/view/ArdaGrid/PostProcessTutorial
Backup (Advanced features)
Backup • Getting all user LFNs: • Users can obtain a list of all their files on DIRAC SEs using the 'getDiracFiles' command. • This returns a GangaList of ready-made DiracFile objects (sketch below). • It uses the dirac command dirac-dms-user-lfns, and the time taken will depend on the number of LFNs the user has. • DIRAC environment: • All DIRAC commands are performed in a subprocess exposed to the LHCbDirac environment. • Users can run python commands directly in this environment using the diracAPI-type commands.
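A sketch of the LFN listing:

files = getDiracFiles()   # GangaList of DiracFile objects, one per LFN
print(files[0].lfn)       # each behaves like any other DiracFile
files[0].get()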
Backup • For the diracAPI command, the returned object is tried in this order: • The object passed to output() (unless output is redefined) • The evaluated stdout • The stdout as a string • Note the environment contains the definitions:

from Dirac.Interfaces.API.Dirac import Dirac
from Dirac.Interfaces.API.DiracAdmin import DiracAdmin
from LHCbDirac.Interfaces.API.DiracLHCb import DiracLHCb
dirac = Dirac()
diraclhcb = DiracLHCb()

• Along with some Ganga functions which can be seen in: • GangaDirac/Lib/Server/DiracCommands.py • GangaLHCb/Lib/Server/DiracLHCbCommands.py
Backup • As well as the straight diracAPI command, two other variants exist: • diracAPI_async • diracAPI_interactive • diracAPI_async allows users to put their command onto the queue for asynchronous execution (sketch below). • Note that nothing is returned here, so it is best used with commands where the return value is unimportant, e.g. kill. • diracAPI_interactive allows users to interactively supply commands to, and retrieve output from, the subprocess using sockets. • This is useful for interactive debugging. • In this case a separate prompt appears.
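A sketch of both forms; the command strings are illustrative, and the output() convention follows the previous slide:

# Straight (synchronous) call: result returned via output(), evaluated stdout,
# or the stdout string, as described above
diracAPI('output(dirac.status([123]))')

# Asynchronous variant: queued, nothing returned, suited to e.g. kill
diracAPI_async('dirac.kill(123)')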
Backup • Generalised process control: • The examples above are just specialisations of the more general control system, accessed via the queues monitoring system with the 'addProcess' method:

def addProcess(self, command, timeout, env, cwd, shell,
               priority, callback_func, callback_args, callback_kwargs)

• The user can attach to the queue any command-line or python commands, which will be executed within the DIRAC environment as described earlier. • They will be executed within a separate DIRAC-aware subprocess. • Note: nothing is returned, as the result of execution goes to the dedicated callback function.