Best Practices and Use cases
David Bouvet, Ioannis Liabotis
COD – 18, Abingdon, 02/12/2008
Best Practices – General COD activity
Follow up tickets assigned to developers
Use the CIC mailing lists
Report problems with tests in CIC mailing list so that other COD are aware of them
Operational use cases
If a COD shifter detects an operational use case, it is recommended to create an entry for it in the tWiki use cases page: https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalUseCasesAndStatus
The COD lead team is supposed to verify and update the tWiki page during its shift and to raise it at the WLCG Ops meeting.
For tickets in last escalation step, put in handover the following:
GGUS ticket number
short summary of the reason why site is asked for suspension
Need to follow up the tickets which are in last escalation step “Case transferred to political instances”
when the ROC said it has to discuss with its site
ROC should give an answer during the week following the WLCG Ops meeting
verify that site is really suspended by ROC after the WLCG Ops meeting if the decision is suspension.
If still no action (no answer and/or no suspension), put again in next handover
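The handover rule above can be sketched as a small filter. This is only an illustration in Python: the `EscalatedTicket` record, its field names and the helper functions are hypothetical and do not correspond to any COD dashboard or GGUS interface.

```python
from dataclasses import dataclass


@dataclass
class EscalatedTicket:
    """Hypothetical record for a ticket in the last escalation step."""
    ggus_id: int                  # GGUS ticket number
    site: str
    reason: str                   # short summary of why suspension is requested
    roc_answered: bool            # ROC answered after the WLCG Ops meeting
    site_suspended: bool          # site is really suspended by the ROC
    decision_is_suspension: bool = True


def needs_carry_over(t: EscalatedTicket) -> bool:
    """True if the ticket must be put again in the next handover."""
    if not t.roc_answered:
        return True
    # An answer alone is not enough when the decision was suspension.
    return t.decision_is_suspension and not t.site_suspended


def handover_lines(tickets):
    """Build one handover line per ticket that still needs follow-up."""
    return [f"GGUS #{t.ggus_id} ({t.site}): {t.reason}"
            for t in tickets if needs_carry_over(t)]
```

A ticket drops out of the handover only once the ROC has answered and, where the decision was suspension, the site is actually suspended.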
Last escalation step/Site suspension follow-up (use case #9 on twiki page)
Context: Follow-up of last escalation step by OCC and ROC not correctly done. When the last step is reached, as stated in the Operational Manual, the ROC should normally discuss in private with its site, and then tell at the next Weekly Operation meeting if the site should be suspended or not. Most of the time, at the Weekly Operation meeting, the ROC says that it has to discuss, and then there is no more news. The site stays in the last escalation step for several weeks.
➔ COD work seems not to be considered. At the WLCG Ops meeting on Nov. 17th, regarding suspension of the ITPA-LCG2 site, ROC North said it had not seen CODs' mails nor CODs' tickets because it receives too many mails... This is not acceptable!
Some examples of "long" last steps:
GGUS #40521: RU-Phys-SPbSU (1 month and a half)
25/09/2008: last escalation step
06/10/2008: raised at WLCG Ops meeting
06/11/2008: still in last step and not suspended
06/11/2008: Cyril L'Orphelin (COD-FR) sent mail to Maite, Steve and Nick
06/11/2008: Maite sent mail to Russian ROC
06/11/2008: site suspended by Russian ROC
GGUS #42015: ITPA-LCG2 (4 weeks)
24/10/2008: last escalation step
27/10/2008: raised at WLCG Ops meeting
03/11/2008: raised again at WLCG Ops meeting
07/11/2008: still in last step and not suspended
10/11/2008: raised again at WLCG Ops meeting
17/11/2008: still in last step and not suspended. ROC suspended. ROC North is present at WLCG Ops meeting and will check with site.
18/11/2008: finally fixed by site
In Operational Manual: "If no progress is made, COD make sure that OMC is informed of the situation, and the site status is set to “suspended” in GOCDB by COD unless OMC say differently."
Proposed solution: As COD has rights to suspend a site
if ROC is not present at Weekly Operation meeting or has not sent a mail about that problem, COD suspends the site
if ROC is present and asks for discussion with its site, OCC should put an action on ROC in the list of actions of the Weekly Operation meeting so it will be followed up at the next meeting. Answer or suspension by ROC should be done within the next 3 days: as acknowledgement, a mail should be sent to both OCC and COD mailing lists. If not, the site is suspended by COD after these 3 days.
➔ we should agree on a solution to propose to ROC managers (the one above or another).
Then it should be discussed at the ROC managers meeting here in Abingdon, to be later integrated into the Ops manual.
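The proposed rule can be written down as a small decision function, which may help when discussing edge cases. This is a minimal sketch, not an agreed procedure: the function and parameter names are hypothetical, the first bullet is read as "ROC neither attended nor sent a mail", and a ROC that only sent a mail is assumed to get the same 3-day period as one that asked for discussion.

```python
from datetime import date, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(days=3)  # "within the next 3 days" in the proposal


def cod_should_suspend(meeting_day: date, today: date,
                       roc_present: bool, roc_sent_mail: bool,
                       roc_acknowledged_on: Optional[date] = None) -> bool:
    """Return True if COD should suspend the site under the proposed rule.

    roc_acknowledged_on is the day the ROC mailed its answer (or the
    suspension) to both the OCC and COD mailing lists, if it did so.
    """
    # Assumed reading: ROC neither attended nor mailed about the problem.
    if not roc_present and not roc_sent_mail:
        return True
    deadline = meeting_day + GRACE_PERIOD
    # An acknowledgement within the grace period closes the action.
    if roc_acknowledged_on is not None and roc_acknowledged_on <= deadline:
        return False
    # Otherwise COD suspends once the 3 days have passed.
    return today > deadline
```

For example, with a meeting on a Monday and no acknowledgement from the ROC, the function returns True from the Friday of the same week onwards.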
Alarm handling and masking
Shifter duty should not be a competition over who processes the highest number of alarms
Before assigning a ticket to an alarm, check if it is related to another alarm with “Related alarms” table
Mask alarm instead of creating 2 tickets for the same problem
2 alarms which can be masked by the current one
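The check described above can be sketched as follows. The `Alarm` record and the `action_for` helper are hypothetical and do not reflect the dashboard's real data model; the sketch assumes that the "Related alarms" table gives the list of related alarms and whether each already has a ticket.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alarm:
    """Hypothetical alarm record; not the dashboard's real data model."""
    alarm_id: int
    test_name: str
    ticket_id: Optional[int] = None   # ticket already assigned, if any


def action_for(new_alarm: Alarm, related: List[Alarm]) -> str:
    """Suggest masking when a related alarm already carries a ticket."""
    for other in related:
        if other.ticket_id is not None:
            # Same problem already tracked: avoid a second ticket.
            return (f"mask alarm {new_alarm.alarm_id} "
                    f"under ticket {other.ticket_id}")
    return f"create ticket for alarm {new_alarm.alarm_id}"
```

The point of the check is only to avoid two tickets for one problem; whether two alarms really share a cause remains the shifter's judgement.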
Training for new COD members
Every new COD member should be trained on the COD tasks by his COD team or by another team
We need to define how to do it.
Probably on demand training
Training materials available from the COD dashboard?