
Physics piquet services during the CERN annual closure



1. Physics piquet services during the CERN annual closure

2. Why this report?
• Services in ground state, with little excitement
• Simple use of services, mainly production users
• No (or fewer) experts improving services
• Easier to identify problems in our working practices (missing procedures, inter-service issues, etc.)
• Piquet coverage for physics services was provided
• This report is based on the experiences of the PDB, GMoD and SMoD piquets
• The experiences may be interesting for the Piquet Working Group

3. Services support flows
[Diagram: Lemon alarms and user/experiment problem reports flow through the support tiers below]
• CC operators: 24 x 7 coverage, 1st-level alarm handling, driven by procedures
• SysAdmin team: piquet service; manages hardware repairs for important machines
• Service managers: PDB, GMoD, SMoD piquet; entry point for support lines (still many of them!)
• Service experts
Problem reports come to the Service Managers via many different flows, using many different tools, directly and indirectly. This still needs some tuning; the escalation chain is sketched below.
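To make the tiering concrete, here is a minimal sketch of the escalation chain as described on this slide. The `Report` type, the routing rules and the tier labels are illustrative assumptions, not the actual CERN tooling:

```python
# Hypothetical sketch of the escalation chain on this slide.
# Tier names mirror the diagram; the routing logic itself is an assumption.

from dataclasses import dataclass

@dataclass
class Report:
    source: str          # "lemon_alarm" or "user_report"
    service: str         # e.g. "lfc", "castor", "rb"
    has_procedure: bool  # can 1st-level handling resolve it?
    needs_expert: bool

def route(report: Report) -> list[str]:
    """Return the support tiers a report passes through."""
    chain = []
    if report.source == "lemon_alarm":
        chain.append("CC operators (24x7, procedure-driven)")
        if report.has_procedure:
            return chain
        chain.append("SysAdmin team (piquet, hardware repairs)")
    # User/experiment reports, and alarms without a working procedure,
    # reach the piquet service managers directly.
    chain.append("Service manager (PDB/GMoD/SMoD piquet)")
    if report.needs_expert:
        chain.append("Service expert")
    return chain

print(route(Report("lemon_alarm", "castor", has_procedure=False, needs_expert=True)))
```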

4. Number of operator alarms
[Chart: operator alarm rates over the closure — 93 alarms/day, 26 alarms/day, 82 alarms/day …and one incident…]

5. Activity on LXBATCH
[Chart: LXBATCH job activity — grid production activity!]

6. CASTOR activity
• CMS most active, data export of 100–250 MB/s
• Steady ATLAS activity, with an interruption between Jan 6 and 8

7. General observations
• Most services ran without particular problems…
• The usual hardware and software failures were handled in the usual way
• CC operators and the SysAdmin piquet handled most of the alarms
• Service infrastructure largely in place, including alarms, procedures and documentation
• We have added some automatic recovery actions and procedures
• The experiments expressed their thanks

8. Grid Services
• The GMoDs handled quite a few different problems:
  • system crash on rb114, hardware failure on gdrb06, …
  • service restarts, high loads, full filesystems
  • affecting RBs, CEs, the BDII, the LB, …
  • GGUS tickets, interaction with service experts
• FTS: the GMoD reports stable running, with ~100% service availability

9. Oracle RAC
• 5 service degradations between Dec 26 and Dec 31
• One node in a cluster gets stuck (RHES-3?) and needs a reboot:
  • Dec 26: itrac16 hung + reboot (ATLAS RAC)
  • Dec 26: itrac20 hung + reboot (ATLAS RAC); the node was problematic to boot
  • Dec 27: itrac16 hung + reboot (ATLAS RAC)
  • Dec 29: itrac04 hung + reboot (LCG RAC)
  • Dec 31: itrac11 hardware failure, affecting the LHCb RAC
• All were single-node failures, causing temporary service degradation
• Normal procedures were applied successfully (the "hung node → reboot" step is sketched below)
• The H/W and O/S of these RACs are being upgraded
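A minimal sketch of that "hung node → reboot" procedure, assuming a simple ping-based liveness check; `power_cycle()` is a hypothetical stand-in for the real remote power-control tooling:

```python
# Hypothetical sketch of the procedure applied to the itracNN RAC nodes.
# check_alive() and power_cycle() stand in for whatever ping/console
# tooling was actually used; they are assumptions, not CERN tools.

import subprocess

RAC_NODES = {"atlas": ["itrac16", "itrac20"], "lcg": ["itrac04"]}

def check_alive(host: str, timeout_s: int = 5) -> bool:
    """True if the host answers a single ping within the timeout."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True,
    )
    return result.returncode == 0

def power_cycle(host: str) -> None:
    """Placeholder for the real remote power-control command."""
    print(f"would power-cycle {host} via its management interface")

for rac, nodes in RAC_NODES.items():
    for node in nodes:
        if not check_alive(node):
            # A single hung node only degrades the RAC; the cluster keeps
            # serving from the surviving nodes while we reboot this one.
            print(f"{node} ({rac} RAC) unresponsive")
            power_cycle(node)
```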

10. Alice grid jobs on lxbatch
• On Dec 29, 200 lxbatch nodes running alicesgm jobs went into high load and needed to be rebooted
• The same happened to 2 of the Alice VO boxes
• The normal flow worked fine:
  • The operator alerted the SysAdmin, who escalated to the SMoD
  • The SMoD alerted Alice and asked for the nodes to be rebooted (the triage step is sketched below)
  • The problem was understood by Alice, and solved
• Shame about the other jobs on the worker nodes…
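The triage step implied here, finding the batch nodes that are both under high load and running jobs for a given account, could look like the following sketch; the `loads` and `jobs` tables are invented stand-ins for the real monitoring and batch-system queries, which this report does not detail:

```python
# Hypothetical sketch: intersect "high load" nodes with "runs alicesgm jobs".

def high_load_nodes(loads: dict[str, float], threshold: float = 20.0) -> set[str]:
    """Nodes whose load average is at or above the threshold."""
    return {node for node, load in loads.items() if load >= threshold}

def nodes_running(owner: str, jobs: dict[str, str]) -> set[str]:
    """jobs maps node -> job owner (one job per node, for brevity)."""
    return {node for node, who in jobs.items() if who == owner}

loads = {"lxb0001": 45.0, "lxb0002": 1.2, "lxb0003": 38.5}   # invented data
jobs  = {"lxb0001": "alicesgm", "lxb0002": "cms001", "lxb0003": "alicesgm"}

to_reboot = high_load_nodes(loads) & nodes_running("alicesgm", jobs)
print(sorted(to_reboot))  # -> ['lxb0001', 'lxb0003']
```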

11. LFC-lhcb degraded
• Jan 2: lfc processes on lfc104 start to time out, but monitoring does not pick this up… This degrades the service (2 nodes in a load-balanced pair); a probe that would catch this is sketched below
• Two lhcb users report the problem:
  • Mail to lfc.support@cern.ch → Remedy ticket for the SMoD → service restored on Jan 4
  • GGUS ticket → GMoD → forwarded later to lfc.support@cern.ch
• Q: Can we streamline the workflow?
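The monitoring gap on this slide, a daemon that is alive but times out on real requests, suggests a latency-aware probe. A minimal sketch, where the node names, the port and the threshold are assumptions:

```python
# Hypothetical latency probe for the two nodes behind a load-balanced alias.

import socket
import time

NODES = ["lfc103", "lfc104"]   # assumed pair behind the lfc alias
PORT = 5010                    # assumed LFC daemon port
SLOW_S = 3.0                   # responding slower than this -> degraded

for host in NODES:
    start = time.monotonic()
    try:
        with socket.create_connection((host, PORT), timeout=10):
            elapsed = time.monotonic() - start
        status = "DEGRADED (slow)" if elapsed > SLOW_S else "ok"
        print(f"{host}: connect in {elapsed:.2f}s -> {status}")
    except OSError as exc:
        print(f"{host}: DOWN ({exc})")

# NB: a bare TCP connect can succeed even when the daemon hangs, so a real
# probe would also issue an application-level request and time the reply.
```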

12. CASTORATLAS stager database
• Two problems, interfering destructively…
• Dec 28: vendor fixes a trivial hardware problem
  • …but the machine remains out of the alarm handling (a check for exactly this is sketched below)
• Jan 1: a new hardware problem develops
  • …and goes unnoticed
• Jan 4: a high CPU load is investigated by DES
  • …nothing found
• Jan 5: the machine crashes; this is noticed by chance a few hours later
  • SMoD, CASTOR and Oracle experts check; service partially restored
• Jan 8: the reason for the high load is found & fixed
• This required expert-level intervention!
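The costly step above was a machine left out of alarm handling after its repair was finished. A minimal sanity check for that failure mode might flag any host whose alarm mask has outlived its intervention; the mask table below is invented for illustration, and a real version would query the monitoring system:

```python
# Hypothetical check: flag hosts still masked from alarm handling although
# their intervention is closed (or simply very old).

from datetime import datetime

# host -> (masked_since, intervention closed?)  -- invented example data
ALARM_MASKS = {
    "castor-db-01": (datetime(2006, 12, 28), True),   # vendor fix done
    "lxb1234":      (datetime(2007, 1, 5),   False),  # repair in progress
}

def stale_masks(now: datetime, max_days: int = 3) -> list[str]:
    """Hosts still masked although their intervention is finished or old."""
    stale = []
    for host, (since, closed) in ALARM_MASKS.items():
        if closed or (now - since).days > max_days:
            stale.append(host)
    return stale

print(stale_masks(datetime(2007, 1, 4)))  # -> ['castor-db-01']
```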

13. Conclusions
• Services ran (in general) stably
  • And they were being used!
• Few service degradations, spread over different services
• Service infrastructure is in place, and it is working
  • Several targeted improvements were deployed
• No gaping holes, but some small ones
• To-do: make sure that it still works under normal conditions ☺
