
LCG Job Reliability



Presentation Transcript


  1. LCG Job Reliability
  Julia Andreeva, Benjamin Gaidioz, Juha Herrala, Birger Koblitz, Massimo Lamanna, Pablo Saiz, Andrea Sciabà

  2. Table of contents • Motivation for Job Reliability • List of most common errors • Retry count • No compatible resources • CondorG queue • Errors per site • Conclusions

  3. Situation at the last LHCC referees’ meeting

  4. Popular “last messages”
  • Data recorded in the experiment dashboards
  • Initially only data from CMS (dashboard); now more and more data from ATLAS as well
  • CMS: mostly analysis; ATLAS: dominated by production
  • We expect to have “all” types of jobs soon

  Last message                                          Count
  “Job RetryCount” family                               48757
  Job proxy is expired                                  17465
  Cannot plan: BrokerHelper: no compatible resources    16646
  Job got an error while in the CondorG queue            5694
  Cannot retrieve previous matches for …                 2410
  Job successfully submitted to Globus                     948
  Unable to receive data                                   291
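As an illustration, a frequency table like the one above can be built by counting the last message recorded for each job. Below is a minimal Python sketch; the record layout and field names are assumptions, not the actual dashboard schema.

```python
from collections import Counter

# Hypothetical dashboard export: one record per job, carrying the last
# message reported for it. Field names are illustrative, not the real schema.
job_records = [
    {"job_id": "j0001", "last_message": "Job proxy is expired"},
    {"job_id": "j0002", "last_message": "Cannot plan: BrokerHelper: no compatible resources"},
    {"job_id": "j0003", "last_message": "Job proxy is expired"},
    # ... in practice, hundreds of thousands of records
]

def last_message_table(records):
    """Count how often each 'last message' occurs, most frequent first."""
    counts = Counter(r["last_message"] for r in records)
    return counts.most_common()

for message, count in last_message_table(job_records):
    print(f"{count:8d}  {message}")
```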

  5. Job Retry Count
  • The Logging & Bookkeeping service is triggered via messages generated at each change of state of the system
  • The last message is not necessarily the important one…
  • Access to the full log is by default granted only to the user who submitted the job

  6. Logging & Bookkeeping
  • A first analysis was done in CMS SC3 (last year): parse the log generated in verbose mode
  • Improvements: in the long run, L&B could provide more information (digest its own log and answer queries)
  • Problems/limitations: the messages are effectively available via R-GMA; some problems with lost messages (not impacting the “statistical” analysis)

  7. Cannot plan: BrokerHelper: no compatible resources

  RB                 #Jobs    #Failures   %
  lcg16.sinp.msu       1516      1213    80.01
  prod-rb-01.pd         504       133    26.39
  rb.isabella.gr        147        15    10.20
  rb106.cern.ch       33312      3337    10.02
  gdrb01.cern.ch     199116      5762     2.89
  gdrb06.cern.ch      94550      2541     2.69
  rb01.pic.es:90       5143       120     2.33
  gdrb08.cern.ch     161843      2934     1.81
  lcgrb01.gridpp       3398        40     1.18
  egee-rb-03.cnaf     31217       356     1.14
  gdrb03.cern.ch     165124      1849     1.12
  egee-rb-01.cnaf     22232        54     0.24
  laranja.iihe        14794         6     0.04
  grid-rb0.desy        8324         1     0.01
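The percentage column is simply failures / jobs × 100 per Resource Broker; a short sketch of that computation, with a few rows copied from the table above, follows.

```python
# Failure rate per Resource Broker, computed as failures / jobs * 100.
# The figures are copied from a few rows of the table above.
rb_stats = {
    "lcg16.sinp.msu": (1516, 1213),
    "prod-rb-01.pd":  (504, 133),
    "rb106.cern.ch":  (33312, 3337),
    "gdrb01.cern.ch": (199116, 5762),
}

for rb, (jobs, failures) in sorted(rb_stats.items(),
                                   key=lambda kv: kv[1][1] / kv[1][0],
                                   reverse=True):
    print(f"{rb:16s} {jobs:7d} {failures:6d} {100.0 * failures / jobs:6.2f}%")
```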

  8. Cannot plan: BrokerHelper: no compatible resources in gdrb01.cern.ch (time dependence)

  9. BDII oscillation: number of CEs in the ATLAS BDII over a period of ~6 hours

  10. BDII oscillations

  11. A possible explanation: how the BDII works
  The top-level BDII harvests data from the other BDIIs by:
  • forking 200+ processes, each contacting one BDII and retrieving all its data
  • starting a new, empty LDAP server locally
  • filling it with all the harvested (LDIF) data
  • switching to the new LDAP server through port forwarding
  Possible problems:
  • a short timeout for the processes gathering data (in order to meet the deadline for the switch)
  • forking 200+ processes may make the system unresponsive and cause processes to miss their deadline
  Currently studying an instrumented BDII
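To make the suspected failure mode concrete, here is a minimal, hypothetical Python sketch (not the actual BDII implementation, which forks processes and loads an LDAP server): a parallel harvest with a tight deadline, where site queries that overrun the deadline are dropped from that snapshot, so the published CE count fluctuates from one harvest to the next. The names, timings and the three-CEs-per-site figure are all illustrative.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

HARVEST_WINDOW = 0.05  # seconds; deliberately tight, like the real harvest deadline

def query_site_bdii(site):
    """Stand-in for an LDAP search against one site BDII."""
    time.sleep(random.uniform(0.01, 0.1))        # variable response time
    return [f"{site}-ce{i}" for i in range(3)]   # pretend each site publishes 3 CEs

def build_snapshot(sites):
    """Query all site BDIIs in parallel; anything that misses the deadline
    is simply absent from this snapshot, so the published CE count dips."""
    deadline = time.monotonic() + HARVEST_WINDOW
    snapshot = []
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        futures = {pool.submit(query_site_bdii, s): s for s in sites}
        for future in futures:
            try:
                remaining = max(0.0, deadline - time.monotonic())
                snapshot.extend(future.result(timeout=remaining))
            except TimeoutError:
                pass  # this site is dropped until the next harvest
    return snapshot

sites = [f"site{i:03d}" for i in range(200)]
for _ in range(5):
    print("CEs in this snapshot:", len(build_snapshot(sites)))
```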

  12. Job got an error while in the CondorG queue

  RB               #Jobs    #Failures   %Failed
  gdrb08.cern.ch   161811      2733      1.69
  gdrb03.cern.ch   165124      1280      0.78
  gdrb01.cern.ch   199107      1237      0.62
  gdrb06.cern.ch    94533      1039      1.10

  13. Job got an error while in the CondorG queue
  Checking on the RB with edg-get-logging-info and edg-get-job-status reveals 4 reasons for this error:
  • 77%: proxy expired (either while the job was running or while it was waiting in the queue)
  • 13%: the job manager could not lock the state lock file
  • 8.5%: unspecified gridmanager error
  • 1.5%: Cannot plan: BrokerHelper: no compatible resources
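As an illustration of how such a breakdown could be produced, the hedged Python sketch below classifies failure-reason strings (as they might be extracted from edg-get-logging-info output; the extraction itself is not shown) into the four categories. The matching substrings and the sample strings are assumptions, not the exact wording produced by the tools.

```python
from collections import Counter

# Four buckets for failures behind "Job got an error while in the CondorG
# queue". The matching substrings are assumptions, not the exact wording
# produced by edg-get-logging-info.
BUCKETS = [
    ("proxy expired",                      "Proxy expired"),
    ("could not lock the state lock file", "State lock file"),
    ("no compatible resource",             "No compatible resources"),
    ("gridmanager error",                  "Unspecified gridmanager error"),
]

def classify(reason):
    """Map a raw failure-reason string onto one of the four buckets."""
    lowered = reason.lower()
    for needle, label in BUCKETS:
        if needle in lowered:
            return label
    return "Other"

def summarize(reasons):
    """Print the percentage of failures falling into each bucket."""
    counts = Counter(classify(r) for r in reasons)
    total = sum(counts.values())
    for label, n in counts.most_common():
        print(f"{label:30s} {100.0 * n / total:5.1f}%")

# 'reasons' would be the failure-reason strings pulled out of the
# edg-get-logging-info output of each failed job (extraction not shown);
# the three sample strings below are invented for illustration.
summarize([
    "Got a job held event, reason: proxy expired while job was running",
    "Globus error 158: the job manager could not lock the state lock file",
    "Unspecified gridmanager error",
])
```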

  14. Job got an error while in the CondorG queue → Proxy

  15. Job got an error while in the CondorG queue → Proxy
  Strong suspicion that the error was introduced by the use of VOMS certificates

  16. Logging and Bookkeeping records the path of a job through the system as a series of states
  Not always straightforward to follow…
  Moving backwards through the graph, one can try to identify the important messages
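A minimal sketch of the idea of “moving back in the graph”: given the ordered list of events recorded for one job, walk it backwards and skip bookkeeping messages until an informative one is found. The state names, messages and the list of “uninformative” prefixes below are purely illustrative, not real L&B output.

```python
# Hypothetical ordered event list for one job, oldest first. The states,
# messages and 'uninformative' prefixes are illustrative, not real L&B data.
events = [
    ("SUBMITTED",   "Job successfully submitted to Globus"),
    ("RUNNING",     "Job started"),
    ("DONE_FAILED", "Globus error 131: the user proxy expired, job is still running"),
    ("RESUBMITTED", "Job RetryCount (2) hit"),
    ("ABORTED",     "Job RetryCount (0) hit"),
]

UNINFORMATIVE_PREFIXES = ("Job RetryCount", "Job successfully submitted")

def first_informative_error(events):
    """Walk the job history backwards, skipping bookkeeping messages,
    to find the message that actually explains the failure."""
    for state, message in reversed(events):
        if not message.startswith(UNINFORMATIVE_PREFIXES):
            return state, message
    return None

print(first_informative_error(events))
# -> ('DONE_FAILED', 'Globus error 131: the user proxy expired, job is still running')
```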

  17.
  37048  "Unspecified gridmanager error"
  11515  "Globus error 24: the job manager detected an invalid script response"
   6687  "Globus error 131: the user proxy expired, job is still running"
   1790  "76 cannot access cache files in ~/.globus/.gass_cache, check permissions, quota and disk space"
   1578  "Globus error 3: an I/O operation failed"
   1342  "7 authentication failed: GSS Major Status: Authentication Fail…"
    981  "Globus error 12: the connection to the server failed, check host and port"
    584  "Globus error 7: authentication failed: GSS Major Status: Unexpected Gatekeeper or Service Name…"
    581  "Globus error 158: the job manager could not lock the state lock file"
    473  "7 authentication failed: GSS Major Status: Unexpected Gatekeeper or Service Name…"
    470  "10 data transfer to the server failed"
    456  "93 the gatekeeper failed to find the requested service"

  18. Snapshot (last meeting)

  19. CMS SC4 jobs (a few days ago)
  A few misconfigured worker nodes were spotted and fixed → dramatic improvement of the site behaviour

  20. Example: 20th of June, JobRobot (CMS SC4)
  • 6 sites showing problems
  • Top “good” sites (“grid” efficiency):
    MIT = 99.6%   DESY = 100%   Bari = 100%   Pisa = 100%   FNAL = 100%
    ULB-VUB = 96.8%   KBFI = 100%   CNAF = 99.6%   ITEP = 100%
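One plausible way to compute such a per-site “grid” efficiency is the fraction of jobs at a site that did not fail for grid-level reasons; the slides do not spell out the exact JobRobot definition, and the counts in the sketch below are made up for illustration.

```python
# One plausible per-site "grid" efficiency: jobs that did not fail for
# grid-level reasons divided by all jobs sent to the site. The counts
# below are made up for illustration.
site_counts = {
    "MIT":  {"jobs": 500, "grid_failures": 2},
    "DESY": {"jobs": 400, "grid_failures": 0},
    "KBFI": {"jobs": 120, "grid_failures": 0},
}

for site, c in sorted(site_counts.items()):
    efficiency = 100.0 * (c["jobs"] - c["grid_failures"]) / c["jobs"]
    print(f"{site:6s} grid efficiency = {efficiency:5.1f}%")
```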

  21. CMS SC4

  Day       Top “good” sites        Comments          Overall efficiency*
  June 17   FNAL, DESY, Legnaro                       88%
  June 19   UNL, Pisa, Bari         bad sites         72%
  June 20   MIT, Bari, Pisa         bad sites         64%
  June 21   Pasadena, Bari, Pisa    match problems    69%
  June 22   Pasadena, Bari, FNAL    match problems    73%
  June 23   MIT, Pasadena, PIC      bad site          77%

  * all sites included (“good” and “bad” ones)

  22. Snapshots of CMS analysis

  Date       Top “good” site grid efficiency   Overall grid efficiency   Comments
  April 24   95%-100%                          50%                       match problems
  May 24     84%-99%                           60%                       1 bad site
  June 21    99%-100%                          87%                       last days ~100%; lots of jobs pending

  Date       Top “good” site grid efficiency   Overall grid efficiency   Comments
  April 24   no entries
  May 24     95%-100%                          25%-50%                   big site problems
  June 23    84%-99%                           91%                       very stable
  June 24    99%-100%                                                    lots of jobs running

  23. Conclusions
  • Work in progress: data from CMS and ATLAS; preparing/adapting tools to identify problems
  • Identify problems: making them visible is an incentive for their resolution; the components to control/instrument are clearly visible
  • The frequency table will be discussed in the TCG (next meetings) → priorities
  • Per-site views show the importance of keeping configuration problems under control
  • Making summaries available to operations on a regular basis
