
New CERN CAF facility: parameters, usage statistics, user support

New CERN CAF facility: parameters, usage statistics, user support. Marco MEONI Jan Fiete GROSSE-OETRINGHAUS CERN - Offline Week – 24.10.2008. Outline. New CAF: features CAF1 vs CAF2 Processing Rate comparison Current Statistics Users, Groups Machines, Files, Disks, Datasets, CPUs

Presentation Transcript


  1. New CERN CAF facility: parameters, usage statistics, user support • Marco MEONI, Jan Fiete GROSSE-OETRINGHAUS • CERN - Offline Week – 24.10.2008

  2. Outline • New CAF: features • CAF1 vs CAF2 • Processing Rate comparison • Current Statistics • Users, Groups • Machines, Files, Disks, Datasets, CPUs • Staging problems • Conclusions

  3. New CAF • Timeline • 28.09 startup of the new CAF cluster • 01.10 first day with users on the new cluster • 07.10 old CAF decommissioned by IT • Usage • 26 workers instead of 33 (but much faster, see later) • Head node is « alicecaf » instead of « lxb6046 » • GSI-based authentication, AliEn certificate needed • Announced since July, but many last-minute users had an AliEn account != afs account or an unknown server certificate • Datasets cleaned up, only the latest data production staged (First physics - stage 3) • AF v4-15 meta package redistributed

  4. Technical Differences • Cmsd (Cluster Management Service Daemon) • Why? Olbd is no longer supported • What? Dynamic load balancing of files and data name-space • How? The stager daemon can benefit from: • bulk prepare replaces touch file • bulk prepare allows files to be "co-located" on the same node • GSI authentication • Secure communication using user certificates and LDAP-based configuration management

  5. Architectural Differences • Why « only » 26 workers? • You could use 104 if you are alone • With 26 workers 4 users can effectively run concurrently • Estimate average of 8 concurrent users… • Processing units 6.5x faster than old CAF
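
The worker arithmetic on this slide can be checked with a short sketch. It assumes that 104 is the total number of worker slots on the cluster and that every session takes the default 26 workers; the function names are illustrative and not part of PROOF.

```python
# Figures quoted on the slide.
WORKERS_PER_SESSION = 26   # default number of workers per session
TOTAL_WORKER_SLOTS = 104   # assumption: total worker slots on the cluster


def sessions_at_full_size(total_slots, workers_per_session):
    """Number of sessions that fit without sharing any worker slot."""
    return total_slots // workers_per_session


def load_factor(concurrent_users, workers_per_session, total_slots):
    """Requested workers divided by available slots (above 1 means sharing)."""
    return concurrent_users * workers_per_session / total_slots


print(sessions_at_full_size(TOTAL_WORKER_SLOTS, WORKERS_PER_SESSION))
print(load_factor(8, WORKERS_PER_SESSION, TOTAL_WORKER_SLOTS))
```

With the estimated average of 8 concurrent users, the requested workers exceed the available slots, so sessions would have to share them.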

  6. Outline • CAF2: features • CAF1 vs CAF2 • Processing Rate comparison • Current Statistics • Users, Groups • Machines, Files, Disks, Datasets, CPUs • Staging problems • Conclusions

  7. CAF1 vs CAF2 (Processing Rate) • Test Dataset • First physics (stage 3) pp, Pythia6, 5kG, 10TeV • /COMMON/COMMON/LHC08c11_10TeV_0.5T • 1840 files, 276k events • Tutorial task that runs over ESDs and displays Pt distribution • Other comparison test: RAW data reconstruction (Cvetan)

  8. Reminder • The test depends on how the files of the dataset are distributed • Parallel code: • Creation of workers • File validation (workers opening the files) • Event loop (execution of the selector on the dataset) • Serial code: • Initialization of PROOF master, session and query objects • File lookup • Packetizer (distribution of file slices) • Merging (biggest task)
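
One common way to reason about a split into parallel and serial code is Amdahl's law. It is not on the slide and is brought in here only as an illustration; the parallel fraction used below is hypothetical, not a CAF measurement.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the job runs in parallel.

    parallel_fraction: share of single-worker run time that parallelises
    workers: number of workers sharing the parallel part
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical example: 90% of the run time is in the parallel steps.
for n in (26, 33, 104):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

Under this model the serial steps (lookup, packetizer, merging) cap the gain, so quadrupling the workers does not quadruple the rate.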

  9. Task executed 5 times and averaged

  10. Processing Rate Comparison (1) • The final average rate is the only important information • The final tail reflects workers stopping one by one • data unevenly distributed • A longer tail shows a worker overloaded on the last packet(s) • At most 3 workers help on the same «slow» packet • Plots: 104 workers, 200k events; 104 workers, 276k events

  11. Processing Rate Comparison (2) • Plots: Events/sec vs #events, MB/sec vs #events • Curves: 104 workers, 26 workers, 33 workers

  12. Outline • CAF2: features • CAF1 vs CAF2 • Processing Rate comparison • Current Statistics • Users/Groups • Machines, Files, Disks, Datasets, CPUs • Staging problems • Conclusions

  13. CAF Usage • Available resources in CAF must be fairly used • Highest attention to how disks and CPUs are used • Users are grouped (sub-detectors / physics working groups) • Each group • has a disk space (quota) which is used to stage datasets from AliEn • has a CPU fairshare target (priority) to regulate concurrent queries

  14. CAF Groups • 19 registered groups • 145 (60) registered users • In brackets () the situation at the previous offline week

  15. CAFStatusTable

  16. Files Distribution • Max: 1863 • Min: 1727 • Max difference: 8% • Nodes with more files can produce tails in processing rate • Above a defined threshold files are not stored any longer

  17. Disk Usage • Min: 105 • Max: 116 • Max difference: 10%
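
The "Max difference" figures on the Files Distribution and Disk Usage slides can be reproduced with a short sketch. It assumes the difference is taken relative to the minimum value, which is the reading that matches the rounded 8% and 10%; the slides do not state the formula or the unit of the disk usage numbers.

```python
def max_difference_percent(lowest, highest):
    """Spread between the extremes, as a percentage of the lowest value."""
    return 100.0 * (highest - lowest) / lowest


# Values quoted on the slides.
files_spread = max_difference_percent(1727, 1863)
disk_spread = max_difference_percent(105, 116)

print(round(files_spread), round(disk_spread))
```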

  18. Dataset Monitoring • 28TB disk space for staging • PWG0: 4TB • PWG1: 1TB • PWG2: 1TB • PWG3: 1TB • PWG4: 1TB • ITS: 0.2TB • COMMON: 2TB
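
As a quick cross-check, the quotas listed on this slide can be summed against the 28TB staging space. The sketch covers only the groups shown on the slide; other registered groups may hold quotas that are not listed here.

```python
STAGING_SPACE_TB = 28.0

# Group quotas in TB, as listed on the slide.
quotas_tb = {
    "PWG0": 4.0,
    "PWG1": 1.0,
    "PWG2": 1.0,
    "PWG3": 1.0,
    "PWG4": 1.0,
    "ITS": 0.2,
    "COMMON": 2.0,
}

allocated_tb = sum(quotas_tb.values())
unallocated_tb = STAGING_SPACE_TB - allocated_tb

print(f"listed quotas: {allocated_tb:.1f} TB")
print(f"not covered by listed quotas: {unallocated_tb:.1f} TB")
```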

  19. CPU Quotas • The default group is no longer the largest consumer

  20. Outline • CAF2: features • CAF1 vs CAF2 • processing rate comparison • Current Statistics • Users, Groups • Machines, Files, Disks, Datasets, CPUs • File Staging • Conclusions

  21. File Stager • CAF intensively uses 'prepare' • 0-size files in Castor2 cannot be staged, but replicas are ok • Check at stager level to avoid spawning infinite prepare on the same empty file unable to get online • Flowchart: Loop over the replicas (CERN, if any, taken first) • replica[i] in Castor && size==0? • Skip it • Add to StageLIST • replica[i] is not staged? • STOP • Copy replica (API service) • File corrupted. Skip it • STOP • Stage StageLIST
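
The replica check described on this slide can be sketched as follows. This is one possible reading of the flattened flowchart, not the actual stager code: the `Replica` type and all function names are hypothetical, and the "not staged?" and "Copy replica (API service)" branches are left out because their wiring is not recoverable from the transcript.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    site: str        # storage site name, e.g. "CERN"
    in_castor: bool  # True if the replica lives in Castor
    size: int        # size in bytes as reported by the catalogue


def order_replicas(replicas, preferred_site="CERN"):
    """Put replicas at the preferred site first, keeping the original order otherwise."""
    return sorted(replicas, key=lambda r: r.site != preferred_site)


def select_replica(replicas):
    """Return the first replica that can be staged, or None if there is none."""
    for replica in order_replicas(replicas):
        if replica.in_castor and replica.size == 0:
            continue  # 0-size Castor file: a prepare would never bring it online
        return replica
    return None


def build_stage_list(files):
    """Split files into (name, replica) pairs to stage and names to skip."""
    stage_list, skipped = [], []
    for name, replicas in files.items():
        chosen = select_replica(replicas)
        if chosen is None:
            skipped.append(name)  # assumption: treated as a corrupted file
        else:
            stage_list.append((name, chosen))
    return stage_list, skipped
```

Skipping the empty Castor replica before issuing any request is what avoids repeated prepare calls on a file that can never get online.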

  22. Outline • CAF2: features • CAF1 vs CAF2 • Processing Rate comparison • Current Statistics • Files Distribution • Users/Groups • Staging • Conclusions

  23. Conclusions • If (ever) you cannot connect just drop a mail and wait for… … « please try again » • CAF Usage • Subscribe to alice-project-analysis-task-force@cern.ch using CERN SIMBA (http://listboxservices.web.cern.ch/listboxservices) • Web page at http://aliceinfo.cern.ch/Offline/Analysis/CAF • CAF tutorial once a month • New CAF • Faster machines, more space, more fun • Shaky behavior due to higher user activity is under intensive investigation • Credits • PROOF Team and IT for the prompt support
