
LCG Job Reliability



Presentation Transcript


  1. LCG Job Reliability
  Julia Andreeva, Benjamin Gaidioz, Juha Herrala, Birger Koblitz, Massimo Lamanna, Pablo Saiz, Andrea Sciabà

  2. Table of contents • Motivation for Job Reliability • List of most common errors • Retry count • No compatible resources • CondorG queue • Errors per site • Conclusions

  3. Situation at the last LHCC referees’ meeting

  4. Popular “last messages”
  • Data recorded in the experiment dashboards
  • Initially only data from CMS (dashboard); now more and more data from ATLAS as well
  • CMS: mostly analysis; ATLAS: dominated by production
  • We expect to have “all” types of jobs soon

  Last message                                          Count
  “Job RetryCount” family                               48757
  Job proxy is expired                                  17465
  Cannot plan: BrokerHelper: no compatible resources    16646
  Job got an error while in the CondorG queue            5694
  Cannot retrieve previous matches for …                 2410
  Job successfully submitted to Globus                     948
  Unable to receive data                                   291
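As an illustration, a frequency table like the one above can be built by counting the last message recorded for each job. Below is a minimal Python sketch; the record layout and field names are assumptions, not the actual dashboard schema.

```python
from collections import Counter

# Hypothetical dashboard export: one record per job, carrying the last
# message reported for it. Field names are illustrative, not the real schema.
job_records = [
    {"job_id": "j0001", "last_message": "Job proxy is expired"},
    {"job_id": "j0002", "last_message": "Cannot plan: BrokerHelper: no compatible resources"},
    {"job_id": "j0003", "last_message": "Job proxy is expired"},
    # ... in practice, hundreds of thousands of records
]

def last_message_table(records):
    """Count how often each 'last message' occurs, most frequent first."""
    counts = Counter(r["last_message"] for r in records)
    return counts.most_common()

for message, count in last_message_table(job_records):
    print(f"{count:8d}  {message}")
```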

  5. Job Retry Count
  • The Logging & Bookkeeping service is triggered via messages generated at each change of state of the system
  • The last message is not necessarily the important one…
  • Access to the full log is by default granted only to the user who submitted the job

  6. Logging & Bookkeeping
  • A first analysis was done in CMS SC3 (last year): parse the log generated in verbose mode
  • Improvements: in the long run, L&B could provide more information (digest its own log and answer queries)
  • Problems/limitations: the messages are effectively available via R-GMA; some problems with lost messages (not impacting the “statistical” analysis)

  7. Cannot plan: BrokerHelper: no compatible resources

  RB                 #Jobs    #Failures   %
  lcg16.sinp.msu       1516      1213    80.01
  prod-rb-01.pd         504       133    26.39
  rb.isabella.gr        147        15    10.20
  rb106.cern.ch       33312      3337    10.02
  gdrb01.cern.ch     199116      5762     2.89
  gdrb06.cern.ch      94550      2541     2.69
  rb01.pic.es:90       5143       120     2.33
  gdrb08.cern.ch     161843      2934     1.81
  lcgrb01.gridpp       3398        40     1.18
  egee-rb-03.cnaf     31217       356     1.14
  gdrb03.cern.ch     165124      1849     1.12
  egee-rb-01.cnaf     22232        54     0.24
  laranja.iihe        14794         6     0.04
  grid-rb0.desy        8324         1     0.01
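The percentage column is simply failures / jobs × 100 per Resource Broker; a short sketch of that computation, with a few rows copied from the table above, follows.

```python
# Failure rate per Resource Broker, computed as failures / jobs * 100.
# The figures are copied from a few rows of the table above.
rb_stats = {
    "lcg16.sinp.msu": (1516, 1213),
    "prod-rb-01.pd":  (504, 133),
    "rb106.cern.ch":  (33312, 3337),
    "gdrb01.cern.ch": (199116, 5762),
}

for rb, (jobs, failures) in sorted(rb_stats.items(),
                                   key=lambda kv: kv[1][1] / kv[1][0],
                                   reverse=True):
    print(f"{rb:16s} {jobs:7d} {failures:6d} {100.0 * failures / jobs:6.2f}%")
```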

  8. Cannot plan: BrokerHelper: no compatible resources in gdrb01.cern.ch (time dependence)

  9. BDII oscillation: number of CEs in the ATLAS BDII over a period of ~6 hours

  10. BDII oscillations

  11. A possible explanation: how the BDII works
  The top-level BDII harvests data from the other BDIIs by:
  • forking 200+ processes, each contacting one BDII and retrieving all its data
  • starting a new, empty LDAP server locally
  • filling it with all the harvested (LDIF) data
  • switching to the new LDAP server through port forwarding
  Possible problems:
  • a short timeout for the processes gathering data (in order to meet the deadline for the switch)
  • forking 200+ processes may make the system unresponsive and cause processes to miss their deadline
  Currently studying an instrumented BDII
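To make the suspected failure mode concrete, here is a minimal, hypothetical Python sketch (not the actual BDII implementation, which forks processes and loads an LDAP server): a parallel harvest with a tight deadline, where site queries that overrun the deadline are dropped from that snapshot, so the published CE count fluctuates from one harvest to the next. The names, timings and the three-CEs-per-site figure are all illustrative.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

HARVEST_WINDOW = 0.05  # seconds; deliberately tight, like the real harvest deadline

def query_site_bdii(site):
    """Stand-in for an LDAP search against one site BDII."""
    time.sleep(random.uniform(0.01, 0.1))        # variable response time
    return [f"{site}-ce{i}" for i in range(3)]   # pretend each site publishes 3 CEs

def build_snapshot(sites):
    """Query all site BDIIs in parallel; anything that misses the deadline
    is simply absent from this snapshot, so the published CE count dips."""
    deadline = time.monotonic() + HARVEST_WINDOW
    snapshot = []
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        futures = {pool.submit(query_site_bdii, s): s for s in sites}
        for future in futures:
            try:
                remaining = max(0.0, deadline - time.monotonic())
                snapshot.extend(future.result(timeout=remaining))
            except TimeoutError:
                pass  # this site is dropped until the next harvest
    return snapshot

sites = [f"site{i:03d}" for i in range(200)]
for _ in range(5):
    print("CEs in this snapshot:", len(build_snapshot(sites)))
```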

  12. Job got an error while in the CondorG queue

  RB               #Jobs    #Failures   %Failed
  gdrb08.cern.ch   161811      2733      1.69
  gdrb03.cern.ch   165124      1280      0.78
  gdrb01.cern.ch   199107      1237      0.62
  gdrb06.cern.ch    94533      1039      1.10

  13. Job got an error while in the CondorG queue
  Checking on the RB with edg-get-logging-info and edg-get-job-status reveals 4 reasons for this error:
  • 77%: proxy expired (either while the job was running or while it was waiting in the queue)
  • 13%: the job manager could not lock the state lock file
  • 8.5%: unspecified gridmanager error
  • 1.5%: Cannot plan: BrokerHelper: no compatible resources
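As an illustration of how such a breakdown could be produced, the hedged Python sketch below classifies failure-reason strings (as they might be extracted from edg-get-logging-info output; the extraction itself is not shown) into the four categories. The matching substrings and the sample strings are assumptions, not the exact wording produced by the tools.

```python
from collections import Counter

# Four buckets for failures behind "Job got an error while in the CondorG
# queue". The matching substrings are assumptions, not the exact wording
# produced by edg-get-logging-info.
BUCKETS = [
    ("proxy expired",                      "Proxy expired"),
    ("could not lock the state lock file", "State lock file"),
    ("no compatible resource",             "No compatible resources"),
    ("gridmanager error",                  "Unspecified gridmanager error"),
]

def classify(reason):
    """Map a raw failure-reason string onto one of the four buckets."""
    lowered = reason.lower()
    for needle, label in BUCKETS:
        if needle in lowered:
            return label
    return "Other"

def summarize(reasons):
    """Print the percentage of failures falling into each bucket."""
    counts = Counter(classify(r) for r in reasons)
    total = sum(counts.values())
    for label, n in counts.most_common():
        print(f"{label:30s} {100.0 * n / total:5.1f}%")

# 'reasons' would be the failure-reason strings pulled out of the
# edg-get-logging-info output of each failed job (extraction not shown);
# the three sample strings below are invented for illustration.
summarize([
    "Got a job held event, reason: proxy expired while job was running",
    "Globus error 158: the job manager could not lock the state lock file",
    "Unspecified gridmanager error",
])
```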

  14. Job got an error while in the CondorG queue → Proxy

  15. Job got an error while in the CondorG queue → Proxy
  Strong suspicion that the error was introduced by the use of VOMS certificates

  16. Logging and Bookkeeping records the path of a job through the system as a series of states
  Not always straightforward to follow…
  Moving backwards through the graph, one can try to identify the important messages
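A minimal sketch of the idea of “moving back in the graph”: given the ordered list of events recorded for one job, walk it backwards and skip bookkeeping messages until an informative one is found. The state names, messages and the list of “uninformative” prefixes below are purely illustrative, not real L&B output.

```python
# Hypothetical ordered event list for one job, oldest first. The states,
# messages and 'uninformative' prefixes are illustrative, not real L&B data.
events = [
    ("SUBMITTED",   "Job successfully submitted to Globus"),
    ("RUNNING",     "Job started"),
    ("DONE_FAILED", "Globus error 131: the user proxy expired, job is still running"),
    ("RESUBMITTED", "Job RetryCount (2) hit"),
    ("ABORTED",     "Job RetryCount (0) hit"),
]

UNINFORMATIVE_PREFIXES = ("Job RetryCount", "Job successfully submitted")

def first_informative_error(events):
    """Walk the job history backwards, skipping bookkeeping messages,
    to find the message that actually explains the failure."""
    for state, message in reversed(events):
        if not message.startswith(UNINFORMATIVE_PREFIXES):
            return state, message
    return None

print(first_informative_error(events))
# -> ('DONE_FAILED', 'Globus error 131: the user proxy expired, job is still running')
```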

  17.
  37048  "Unspecified gridmanager error"
  11515  "Globus error 24: the job manager detected an invalid script response"
   6687  "Globus error 131: the user proxy expired, job is still running"
   1790  "76 cannot access cache files in ~/.globus/.gass_cache, check permissions, quota and disk space"
   1578  "Globus error 3: an I/O operation failed"
   1342  "7 authentication failed: GSS Major Status: Authentication Fail…"
    981  "Globus error 12: the connection to the server failed, check host and port"
    584  "Globus error 7: authentication failed: GSS Major Status: Unexpected Gatekeeper or Service Name…"
    581  "Globus error 158: the job manager could not lock the state lock file"
    473  "7 authentication failed: GSS Major Status: Unexpected Gatekeeper or Service Name…"
    470  "10 data transfer to the server failed"
    456  "93 the gatekeeper failed to find the requested service"

  18. Snapshot (last meeting)

  19. CMS SC4 jobs (a few days ago)
  A few misconfigured worker nodes were spotted and fixed → dramatic improvement of the site behaviour

  20. Example: 20th of June, JobRobot (CMS SC4)
  • 6 sites showing problems
  • Top “good” sites (“grid” efficiency):
    MIT = 99.6%   DESY = 100%   Bari = 100%   Pisa = 100%   FNAL = 100%
    ULB-VUB = 96.8%   KBFI = 100%   CNAF = 99.6%   ITEP = 100%
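One plausible way to compute such a per-site “grid” efficiency is the fraction of jobs at a site that did not fail for grid-level reasons; the slides do not spell out the exact JobRobot definition, and the counts in the sketch below are made up for illustration.

```python
# One plausible per-site "grid" efficiency: jobs that did not fail for
# grid-level reasons divided by all jobs sent to the site. The counts
# below are made up for illustration.
site_counts = {
    "MIT":  {"jobs": 500, "grid_failures": 2},
    "DESY": {"jobs": 400, "grid_failures": 0},
    "KBFI": {"jobs": 120, "grid_failures": 0},
}

for site, c in sorted(site_counts.items()):
    efficiency = 100.0 * (c["jobs"] - c["grid_failures"]) / c["jobs"]
    print(f"{site:6s} grid efficiency = {efficiency:5.1f}%")
```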

  21. CMS SC4

  Day       Top “good” sites        Comments          Overall efficiency*
  June 17   FNAL, DESY, Legnaro                       88%
  June 19   UNL, Pisa, Bari         bad sites         72%
  June 20   MIT, Bari, Pisa         bad sites         64%
  June 21   Pasadena, Bari, Pisa    match problems    69%
  June 22   Pasadena, Bari, FNAL    match problems    73%
  June 23   MIT, Pasadena, PIC      bad site          77%

  * all sites included (“good” and “bad” ones)

  22. Snapshots of CMS analysis

  Date       Top “good” site grid efficiency   Overall grid efficiency   Comments
  April 24   95%-100%                          50%                       match problems
  May 24     84%-99%                           60%                       1 bad site
  June 21    99%-100%                          87%                       last days ~100%; lots of jobs pending

  Date       Top “good” site grid efficiency   Overall grid efficiency   Comments
  April 24   no entries
  May 24     95%-100%                          25%-50%                   big site problems
  June 23    84%-99%                           91%                       very stable
  June 24    99%-100%                                                    lots of jobs running

  23. Conclusions
  • Work in progress: data from CMS and ATLAS; preparing/adapting tools to identify problems
  • Identify problems: making them visible is an incentive for their resolution; the components to control/instrument are clearly visible
  • The frequency table will be discussed in the TCG (next meetings) → priorities
  • Per-site views show the importance of keeping configuration problems under control
  • Making summaries available to operations on a regular basis
