Best Practices and Use Cases
David Bouvet, Ioannis Liabotis
COD – 18, Abingdon, 02/12/2008
Best Practices – General COD activity
• Follow up tickets assigned to developers
• Use the CIC mailing lists
• Report problems with tests on the CIC mailing list so that other CODs are aware of them
• Read BROADCAST messages, such as downtime announcements for core OPS tools
• Escalate tickets properly
• Minimize the number of tickets per site
• Use alarm masking
• Answer comments from site administrators and try to change the template escalation mails
• Do not leave overdue tickets or alarms open at the end of the shift
• Report inactivity
Best Practices – Hand Over Logs
• For OPS meeting
  • List of sites for escalation (ROC, site, GGUS #, reason)
  • Operational tools problems
  • Issues with COD procedures and OPS manual
  • Problems in Grid core services
  • Tickets that need attention
• For COD use
  • Identification of complex tickets and of new use cases
  • Strange alarms that are not easily transformed into tickets
  • Open tickets not related to alarms
  • New issues arising in the OPS meeting
• The lead team updates the log with all this information after the weekly OPS meeting
Best practices – Details (1)
Operational use cases
• If a COD shifter detects an operational use case, it is recommended to create an entry for it on the tWiki use cases page: https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalUseCasesAndStatus
• The COD lead team is supposed to verify and update the tWiki page during its shift and to raise it at the WLCG Ops meeting.
Handover
• For tickets in the last escalation step, put the following in the handover:
  • ROC
  • Site name
  • GGUS ticket number
  • Short summary of the reason why the site is asked for suspension
• Follow up the tickets that are in the last escalation step, “Case transferred to political instances”, when the ROC said it has to discuss with its site:
  • The ROC should give an answer during the week following the WLCG Ops meeting
  • If the decision is suspension, verify after the WLCG Ops meeting that the site has really been suspended by the ROC
  • If there is still no action (no answer and/or no suspension), put the ticket in the next handover again
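The handover items above can be sketched as a small record, purely as an illustration. The class name, field names, and the `"suspend"`/`"keep"` decision values are assumptions made for this example and are not part of any COD tool; the sample ROC, site, and ticket number are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HandoverEntry:
    """One last-escalation-step ticket listed in the handover (hypothetical model)."""
    roc: str
    site: str
    ggus_ticket: int
    reason: str                          # why the site is asked for suspension
    roc_decision: Optional[str] = None   # assumed values: "suspend", "keep", or None (no answer)
    site_suspended: bool = False

    def summary(self) -> str:
        # The four fields the handover should contain for each ticket.
        return f"{self.roc} | {self.site} | GGUS #{self.ggus_ticket} | {self.reason}"

    def carry_over(self) -> bool:
        # Put the ticket in the next handover again if the ROC has not answered,
        # or if it decided on suspension but the site is still not suspended.
        if self.roc_decision is None:
            return True
        return self.roc_decision == "suspend" and not self.site_suspended


entry = HandoverEntry("ROC-X", "SITE-A", 12345, "no answer to tickets")
print(entry.summary())
print(entry.carry_over())
```

An entry stops being carried over only once the ROC has answered and, where the answer is suspension, the suspension has actually been verified.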
Point to discuss at COD meeting and to raise at ROC managers meeting
Last escalation step/Site suspension follow-up (use case #9 on twiki page)
Context: the follow-up of the last escalation step by OCC and ROC is not done correctly.
• When the last step is reached, as stated in the Operational Manual, the ROC should normally discuss in private with its site, and then say at the next Weekly Operation meeting whether the site should be suspended or not.
• Most of the time, at the Weekly Operation meeting, the ROC says that it has to discuss, and then there is no more news. The site stays in the last escalation step for several weeks.
➔ COD work seems not to be considered.
• At the WLCG Ops meeting on Nov. 17th, regarding the suspension of the ITPA-LCG2 site, ROC North said it had seen neither the CODs' mails nor the CODs' tickets because it receives too many mails... This is not acceptable!
Point to discuss at COD meeting and to raise at ROC managers meeting
Some examples of "long" last steps:
GGUS #40521: RU-Phys-SPbSU (1 month and a half)
• 25/09/2008: last escalation step
• 06/10/2008: raised at WLCG Ops meeting
• 06/11/2008: still in last step and not suspended
• 06/11/2008: Cyril L'Orphelin (COD-FR) sent mail to Maite, Steve and Nick
• 06/11/2008: Maite sent mail to Russian ROC
• 06/11/2008: site suspended by Russian ROC
GGUS #42015: ITPA-LCG2 (4 weeks)
• 24/10/2008: last escalation step
• 27/10/2008: raised at WLCG Ops meeting
• 03/11/2008: raised again at WLCG Ops meeting
• 07/11/2008: still in last step and not suspended
• 10/11/2008: raised again at WLCG Ops meeting
• 17/11/2008: still in last step and not suspended. ROC suspended. ROC North is present at WLCG Ops meeting and will check with site.
• 18/11/2008: finally fixed by site
Point to discuss at COD meeting and to raise at ROC managers meeting
In the Operational Manual: "If no progress is made, COD make sure that OMC is informed of the situation, and the site status is set to “suspended” in GOCDB by COD unless OMC say differently."
Proposed solution (COD has the rights to suspend a site):
• If the ROC is not present at the Weekly Operation meeting or has not sent a mail about the problem, COD suspends the site.
• If the ROC is present and asks for a discussion with its site, OCC should put an action on the ROC in the list of actions of the Weekly Operation meeting so that it is followed up at the next meeting. The ROC should answer or suspend the site within the next 3 days; as acknowledgement, a mail should be sent to both the OCC and COD mailing lists. If not, the site is suspended by COD after these 3 days.
➔ We should agree on a solution to propose to the ROC managers (the one above or another). It should then be discussed at the ROC managers meeting here in Abingdon, to be integrated later into the Ops manual.
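The proposed rule can be written down as a small decision function, which makes the 3-day deadline explicit. This is only a sketch of one reading of the proposal: it assumes that COD suspends immediately when the ROC has given no input at all (neither present at the meeting nor a mail), and that the 3 days are calendar days counted from the meeting date. All names and return values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(days=3)  # "within the next 3 days" in the proposal


@dataclass
class LastStepCase:
    """A site in the last escalation step, as seen after a weekly meeting."""
    site: str
    meeting_date: date
    roc_present: bool
    roc_sent_mail: bool
    # Date of the acknowledgement mail to the OCC and COD lists, if any.
    roc_acknowledged_on: Optional[date] = None


def cod_action(case: LastStepCase, today: date) -> str:
    # ASSUMPTION: "no input" means neither present nor a mail about the problem.
    if not (case.roc_present or case.roc_sent_mail):
        return "suspend"
    deadline = case.meeting_date + GRACE_PERIOD
    # ROC answered (or suspended) in time and acknowledged it by mail.
    if case.roc_acknowledged_on is not None and case.roc_acknowledged_on <= deadline:
        return "roc_answered"
    # Deadline passed without acknowledgement: COD suspends the site.
    if today > deadline:
        return "suspend"
    return "await_roc"


case = LastStepCase("SITE-A", date(2008, 11, 17), roc_present=True, roc_sent_mail=False)
print(cod_action(case, date(2008, 11, 19)))
print(cod_action(case, date(2008, 11, 21)))
```

Whether the deadline counts calendar or working days, and from which event, is exactly the kind of detail that would need to be agreed before the rule goes into the Ops manual.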
Best practices
Alarm handling and masking
• Shifter duty should not be a competition over who processes the highest number of alarms
• Before assigning a ticket to an alarm, check whether it is related to another alarm using the “Related alarms” table
• Mask the alarm instead of creating 2 tickets for the same problem
• Example: 2 alarms which can be masked by the current one
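The masking practice can be illustrated with a short sketch that opens one ticket per group of related alarms and masks the rest. The relatedness test used here (same site and same node) is an assumption made for the example; the dashboard's actual “Related alarms” criteria are not described in these slides. Site, node, and test names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alarm:
    alarm_id: int
    site: str
    node: str
    test: str


def related(a: Alarm, b: Alarm) -> bool:
    # ASSUMPTION: alarms on the same node of the same site share one root cause.
    return a.site == b.site and a.node == b.node


def plan_tickets(alarms: List[Alarm]) -> Dict[int, List[int]]:
    """Map each ticketed alarm id to the ids of the alarms it masks."""
    ticketed: List[Alarm] = []
    plan: Dict[int, List[int]] = {}
    for alarm in alarms:
        for main in ticketed:
            if related(main, alarm):
                # Same problem as an alarm that already has a ticket: mask it.
                plan[main.alarm_id].append(alarm.alarm_id)
                break
        else:
            # No related alarm has a ticket yet: this alarm gets its own ticket.
            ticketed.append(alarm)
            plan[alarm.alarm_id] = []
    return plan


alarms = [
    Alarm(1, "SITE-A", "ce01", "test-a"),
    Alarm(2, "SITE-A", "ce01", "test-b"),
    Alarm(3, "SITE-A", "ce01", "test-c"),
    Alarm(4, "SITE-B", "se01", "test-d"),
]
print(plan_tickets(alarms))
```

With these four alarms, two tickets are opened instead of four: alarm 1 masks alarms 2 and 3, and alarm 4 stands alone.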
Point to discuss at COD meeting
Training for new COD members
• Every new COD member should be trained on the COD tasks by his COD team or by another team
• We need to define how to do it. Probably on-demand training
• Training materials available from the COD dashboard?