PROOF developments

PROOF developments. G. Ganis, CAF meeting, ALICE offline week, 11 July 2008. Overview: recent / current developments focus mostly on solving instabilities and improving error recovery, improving the user interface, and resource control in multiuser


  1. PROOF developments
     G. Ganis, CAF meeting, ALICE offline week, 11 July 2008

  2. Overview
     • Recent / current developments focus mostly on
       • Solving instabilities and improving error recovery
       • Improving the user interface
       • Resource control in multiuser
     • CAF is one of the main sources of feedback to
       • Understand problems
       • Spot missing functionality
     G. Ganis, CAF, Alice offline week

  3. Today’s Subjects
     • Stability issues
       • New XrdProofd plug-in
       • Related issues
     • New Log box
     • Monitoring of the memory consumption
     • Dataset management
     • Scheduling developments

  4. New XrdProofd plug-in (1)
     • Addresses stability issues typically observed after a failure and the subsequent attempt to reset the session
     • We traced these back to deadlocks caused by concurrent actions that were not well protected
     • The new plug-in implements a redesigned interaction between components, significantly reducing the number of locks
     • The changes for the user are minimal
       • But the level of asynchronism introduced may confuse people looking at the process tables, because processes are cleaned up with some delay

  5. New XrdProofd plug-in (2)
     New features
     • Resilience to xrootd failures/glitches
       • Applications attempt to restore the connections for 10 minutes
       • Solves the problem of restarting xrootd to change the configuration
     • Directive to define workers in the xrootd config file
       • Gets rid of proof.conf
       • Example: on CAF DEV the workers are defined with
           xpd.worker master lxb6043
           xpd.worker worker lxb60[41-42,44]
           xpd.worker worker lxb60[41-42,44]
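
The bracket notation in the xpd.worker example can be read as a compact list of host names. The following is a minimal Python sketch of that reading, assuming the pattern means "prefix followed by each number in the listed ranges". `expand_hosts` is a hypothetical helper written for illustration; it is not part of xrootd or PROOF, and the daemon's own parser may differ in details.

```python
import re

def expand_hosts(pattern):
    """Expand 'lxb60[41-42,44]' into ['lxb6041', 'lxb6042', 'lxb6044'].

    Hypothetical helper: assumes one bracket group of comma-separated
    numbers or inclusive number ranges.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket group: treat as a single host name
    prefix, body, suffix = match.groups()
    hosts = []
    for item in body.split(","):
        first, _, last = item.strip().partition("-")
        last = last or first  # a lone number is a range of length one
        width = len(first)    # keep zero padding such as '01-03'
        for number in range(int(first), int(last) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts

print(expand_hosts("lxb60[41-42,44]"))
```

Under this reading, each `xpd.worker worker lxb60[41-42,44]` line names three hosts.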

  6. Related Improvements
     • Automatic shutdown of orphaned sessions
       • Gets rid of proofserv processes hanging around
     • Improved notification in case of a worker death

  7. New Log Dialog box
     A. Kreshuk
     • Uses TProof::Mgr(master)->GetSessionLogs()
     • Should work even if the session hangs

  8. Memory usage monitoring
     A. Kreshuk
     • Worker: RAM vs events processed
     • Master: RAM vs objects merged
     • Should make it easy to spot memory leaks
     • Additional analysis with another tool: TMemStat?

  9. Memory consumption monitoring
     A. Kreshuk
     • Normal level
       • Workers monitor their memory usage and save the info in the log file
       • The client gets warned of high usage
       • The session may eventually be killed
     • Advanced level
       • Possibility to save very detailed information in a dedicated tree (TProofStats), e.g. via an interface to Marian Ivanov’s memstat tool
       • To be run as a second pass when a problem shows up
     • First version in SVN in the coming days
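
The "normal level" behaviour described above (log, warn, eventually kill) can be sketched as a small decision function. This is an illustrative Python sketch only: the thresholds, the `MemoryPolicy` class and the `check_memory` function are assumptions made for the example and do not correspond to actual PROOF settings or APIs.

```python
from dataclasses import dataclass

@dataclass
class MemoryPolicy:
    warn_mb: float  # hypothetical threshold: warn the client at or above this
    kill_mb: float  # hypothetical threshold: kill the session at or above this

def check_memory(rss_mb, events_processed, policy, log):
    """Record one memory sample and return the action to take.

    Returns "ok", "warn" or "kill". The sample is always appended to
    `log`, mirroring the idea of workers saving the info in the log file.
    """
    log.append(f"events={events_processed} rss_mb={rss_mb:.1f}")
    if rss_mb >= policy.kill_mb:
        return "kill"
    if rss_mb >= policy.warn_mb:
        return "warn"
    return "ok"

# Usage with made-up numbers: memory grows as more events are processed.
policy = MemoryPolicy(warn_mb=1000.0, kill_mb=2000.0)
log = []
for events, rss in [(1000, 400.0), (2000, 1200.0), (3000, 2100.0)]:
    print(events, check_memory(rss, events, policy, log))
```

Plotting the logged samples against the number of events processed gives the "RAM vs events" view used to spot leaks.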

  10. Dataset management (1)
      JFGO
      • Hot topic for T2/T3
      • Dataset: metadata about a set of files
        • TFileCollection: list of TFileInfo
        • TFileInfo
          • UUID, TUrls of the file
          • TFileInfoMeta: one per TTree with name, entries, …
      • Datasets are identified by name
      • Info may come from different places: catalogs, SQL databases, file systems
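
The containment described above (a named collection of file records, each carrying per-tree metadata) can be illustrated with plain Python dataclasses. This is a sketch of the data layout only; the class and field names are hypothetical stand-ins and are not the ROOT classes themselves, and the URLs and tree name are made up.

```python
from dataclasses import dataclass, field

@dataclass
class TreeMeta:
    """Stand-in for the per-tree metadata: tree name and number of entries."""
    name: str
    entries: int

@dataclass
class FileInfo:
    """Stand-in for one file record: a UUID, its URLs, and its tree metadata."""
    uuid: str
    urls: list
    trees: list = field(default_factory=list)

@dataclass
class FileCollection:
    """Stand-in for a dataset: a name plus a list of file records."""
    name: str
    files: list = field(default_factory=list)

    def total_entries(self, tree_name):
        # Sum the entries of the named tree over all files in the dataset.
        return sum(
            meta.entries
            for info in self.files
            for meta in info.trees
            if meta.name == tree_name
        )

dataset = FileCollection("example_dataset", [
    FileInfo("uuid-1", ["root://host.example//data/a.root"], [TreeMeta("events", 100)]),
    FileInfo("uuid-2", ["root://host.example//data/b.root"], [TreeMeta("events", 200)]),
])
print(dataset.total_entries("events"))
```

Keeping the entry counts in the metadata is what allows a later step to know the dataset size without opening the files.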

  11. Dataset manager (2)
      JFGO
      • TProofDataSetManager: abstract interface describing the basic functionality
        • RegisterDataSet, GetDataSet, VerifyDataSet, …
        • VerifyDataSet opens the files, i.e. may trigger staging
      • TProofDataSetManagerFile: implementation handling the information via ROOT files (datasetname.root)
        • Stored on the master in a dedicated subdirectory
        • <DatasetDir>/group/user/dataset
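
The split between an abstract manager and a file-based implementation can be sketched in a few lines of Python. This is not the ROOT implementation: JSON files are used here as a stand-in for the ROOT files, the class and method names are hypothetical, and only the `<DatasetDir>/group/user/dataset` layout is taken from the slide.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path

class DataSetManager(ABC):
    """Hypothetical abstract interface: register and retrieve datasets by name."""

    @abstractmethod
    def register(self, group, user, name, files):
        ...

    @abstractmethod
    def get(self, group, user, name):
        ...

class FileDataSetManager(DataSetManager):
    """Stores each dataset in its own file under <dataset_dir>/group/user/."""

    def __init__(self, dataset_dir):
        self.dataset_dir = Path(dataset_dir)

    def path_for(self, group, user, name):
        # JSON replaces the ROOT file format purely for illustration.
        return self.dataset_dir / group / user / f"{name}.json"

    def register(self, group, user, name, files):
        path = self.path_for(group, user, name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(list(files)))
        return path

    def get(self, group, user, name):
        return json.loads(self.path_for(group, user, name).read_text())

# Usage with made-up group, user and file names.
with tempfile.TemporaryDirectory() as tmp:
    manager = FileDataSetManager(tmp)
    manager.register("somegroup", "someuser", "run1", ["root://host.example//data/a.root"])
    print(manager.get("somegroup", "someuser", "run1"))
```

A different backend (for example an SQL one) would implement the same two methods without changing the calling code.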

  12. Dataset manager (3)
      JFGO
      • TProofDataSetManagerFile is what is used on CAF
        • Users can register, scan, get
        • Verify is disallowed (to avoid staging overload)
          • It is run by a dedicated daemon (JFGO)
      • Datasets can be processed by name
        • Provides a way to cache the information needed at the validation step, speeding it up considerably
      • TProofDataSetManager can also be used locally to organize your datasets or chains
        • No need for a dedicated macro to create the chain (CreateESDchain)

  13. Dataset manager (4)
      JFGO
      • ATLAS is very interested
        • They are oriented towards a MySQL backend and validity tokens for the datasets
        • Will provide TProofDataSetManagerSQL
      • Other issues raised by ATLAS
        • Possibility to use multiple dataset sources concurrently, e.g. file-based and SQL-based
        • Problem of datasets in federated clusters (multi-master), which is challenging on the PROOF side too

  14. Scheduling developments
      J. Iwaszkiewicz
      • Control resources and how they are used
      • Improving efficiency
        • Assigning to a job those nodes that hold the data to be analyzed
      • Implementing different scheduling policies
        • e.g. fair share, group priorities & quotas
      • Efficient use even in case of congestion

  15. Scheduling developments (2)
      • Assigning a set of workers to a job based on:
        • The dataset location
        • User priority (quota + historical usage)
          • Can be taken from an external source
        • The current load of the cluster
      • Create (priority) queues for queries that cannot be started
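
A priority queue for queries that cannot be started right away can be sketched with Python's `heapq`. This is only an illustration of the queueing idea: `QueryQueue` and its methods are hypothetical names, and the priority is assumed to be a single number supplied by the caller (how it would be derived from quota and historical usage is not specified here).

```python
import heapq
import itertools

class QueryQueue:
    """Holds queries that cannot start yet; highest priority leaves first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: submission order

    def hold(self, query, priority):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), query))

    def next_query(self):
        """Return the next query to start, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

# Usage with made-up query names and priorities.
queue = QueryQueue()
queue.hold("query-a", priority=1.0)
queue.hold("query-b", priority=3.0)
queue.hold("query-c", priority=3.0)
print(queue.next_query(), queue.next_query(), queue.next_query())
```

Queries with equal priority leave in submission order, which keeps the queue fair within a priority level.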

  16. Scheduling developments (3)
      • Implementation exists with:
        • # of workers ≈ relativePriority * nFreeCPUs
        • Assign least loaded workers first
      • Missing pieces
        • Dynamic worker setup (advanced prototype exists)
        • Worker nodes auto-registration
        • Improved load monitoring
        • Support for “put-on-hold” submission (prototype)
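
The two rules of the existing implementation (number of workers proportional to relative priority, least loaded workers first) can be sketched as follows. This is a Python illustration under stated assumptions, not the PROOF scheduler code: `pick_workers` is a hypothetical function, the product is truncated to an integer, and at least one worker is always granted; the slide specifies neither the rounding nor a minimum.

```python
def pick_workers(loads, relative_priority, n_free_cpus):
    """Choose workers for one job.

    loads: dict mapping worker name to its current load (lower is better).
    The number of workers follows  # of workers ~ relativePriority * nFreeCPUs,
    truncated to an integer and clamped to [1, number of known workers]
    (both the truncation and the clamping are assumptions of this sketch).
    """
    wanted = int(relative_priority * n_free_cpus)
    wanted = max(1, min(wanted, len(loads)))
    # Least loaded first; the name is a tie-breaker to keep the order stable.
    by_load = sorted(loads, key=lambda worker: (loads[worker], worker))
    return by_load[:wanted]

# Usage with made-up worker names and loads.
loads = {"w1": 0.9, "w2": 0.1, "w3": 0.5, "w4": 0.2}
print(pick_workers(loads, relative_priority=0.5, n_free_cpus=4))
```

With a relative priority of 0.5 and four free CPUs the job gets two workers, and the two least loaded ones are chosen.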

  17. Scheduling schema
      [Diagram. Components: Client, PROOF master, Dataset Lookup, Scheduler]
      Numbered steps in the diagram:
        1: Job {dataset, …}
        2: dataset
        3: file locations
        4: Job info
        5: workers
        6: workers
      Other labels: Load, history, policy, …; Start workers

  18. Other developments
      • PROOFLITE
        • Version of PROOF optimized for multicore machines, with workers started directly by the ROOT session (no daemon)
        • Useful to quickly test code in a real PROOF environment
        • Will be used to study I/O issues on multicore machines
        • Almost ready to go into the trunk
      • PROOF / Condor integration
        • Possible ATLAS model for T3 farms not dedicated to PROOF
        • Condor provides a mechanism to give high priority to PROOF queries when required, by suspending/hibernating batch jobs

  19. Questions?
      • Credits
        • G.G., J. Iwaszkiewicz, A. Kreshuk, F. Rademakers
        • M. Meoni, J.F. Grosse-Oetringhaus (ALICE)
        • F. Furano, A. Peters (CERN/IT)
        • A. Hanushevsky (SLAC)
