Best Practices and Use cases

David Bouvet, Ioannis Liabotis

COD – 18, Abingdon, 02/12/2008


Best Practices – General COD activity

  • Follow up tickets assigned to developers

  • Use the CIC mailing lists

  • Report problems with tests on the CIC mailing list so that other COD teams are aware of them

  • Read BROADCAST messages such as downtime announcements for Core OPS tools.

  • Escalate tickets properly

  • Minimize number of tickets per site

  • Use alarm masking

  • Answer comments from site administrators and try to customize the template escalation mails.

  • Do not leave overdue tickets or alarms open at the end of the shift

  • Report inactivity


Best Practices – Hand Over Logs

  • For OPS Meeting

    • List of Sites for escalation (ROC, Site, GGUS #, reason)

    • Operational tools problems

    • Issues with COD procedures and OPS manual

    • Problems in Grid Core services

    • Tickets that need attention

  • For COD use

    • Identification of complex tickets – new use cases identification

    • Strange alarms – not easily transformed into tickets

    • Open tickets not related to alarms

    • New issues arising in the OPS meeting

  • The lead team updates the log with all this information after the weekly OPS meeting


Best Practices – Details (1)

Operational use cases

If a COD shifter detects an operational use case, it is recommended to create an entry for it on the tWiki use cases page: https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalUseCasesAndStatus

The COD lead team is expected to verify and update the tWiki page during its shift and to raise new use cases at the WLCG Ops meeting.

Handover

  • For tickets in the last escalation step, put the following in the handover:

    • ROC

    • Site name

    • GGUS ticket number

    • Short summary of the reason why the site is asked for suspension

  • Follow up the tickets which are in the last escalation step “Case transferred to political instances”:

    • When the ROC said it has to discuss with its site, the ROC should give an answer during the week following the WLCG Ops meeting.

    • If the decision is suspension, verify after the WLCG Ops meeting that the site has really been suspended by the ROC.

    • If there is still no action (no answer and/or no suspension), put the ticket in the next handover again.
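
The handover items above can be pictured as a small record with a carry-over check. The Python sketch below is only an illustration: the class `HandoverEntry`, its field names and the `roc_decision` values are hypothetical and do not come from the COD dashboard or the Ops manual, and the reason text is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HandoverEntry:
    """One ticket in the last escalation step, as listed in the handover log.

    Hypothetical structure for illustration; not a COD dashboard API.
    """
    roc: str
    site: str
    ggus_ticket: int
    reason: str                         # short summary of why suspension is requested
    roc_decision: Optional[str] = None  # None (no answer yet), "suspend" or "keep"
    site_suspended: bool = False

    def needs_carry_over(self) -> bool:
        """True if the entry must be put in the next handover again."""
        if self.roc_decision is None:
            # The ROC has not answered yet.
            return True
        # The ROC decided on suspension but the site is not suspended yet.
        return self.roc_decision == "suspend" and not self.site_suspended

    def as_line(self) -> str:
        """Format the entry as one line of the handover log."""
        return f"{self.roc} | {self.site} | GGUS #{self.ggus_ticket} | {self.reason}"


entry = HandoverEntry("Russian ROC", "RU-Phys-SPbSU", 40521, "<reason for suspension>")
print(entry.as_line())
print(entry.needs_carry_over())
```

An entry stays in the handover as long as `needs_carry_over()` is true, which matches the "no answer and/or no suspension" condition.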


Point to discuss at COD meeting and to raise at ROC managers meeting

Last escalation step/Site suspension follow-up (use case #9 on twiki page)

Context: The follow-up of the last escalation step by OCC and the ROC is not done correctly. When the last step is reached, as stated in the Operational Manual, the ROC should normally discuss in private with its site, and then say at the next Weekly Operation meeting whether the site should be suspended or not. Most of the time, at the Weekly Operation meeting, the ROC says that it has to discuss, and then there is no more news. The site stays in the last escalation step for several weeks.

➔ COD work seems not to be considered. At the WLCG Ops meeting on Nov. 17th, regarding the suspension of the ITPA-LCG2 site, ROC North said it had seen neither the CODs' mails nor the CODs' tickets because it receives too many mails... This is not acceptable!


Point to discuss at COD meeting and to raise at ROC managers meeting

Some examples of a "long" last step:

  • GGUS #40521: RU-Phys-SPbSU (1 month and a half)

    • 25/09/2008: last escalation step

    • 06/10/2008: raised at WLCG Ops meeting

    • 06/11/2008: still in last step and not suspended

    • 06/11/2008: Cyril L'Orphelin (COD-FR) sent a mail to Maite, Steve and Nick

    • 06/11/2008: Maite sent a mail to the Russian ROC

    • 06/11/2008: site suspended by the Russian ROC

  • GGUS #42015: ITPA-LCG2 (4 weeks)

    • 24/10/2008: last escalation step

    • 27/10/2008: raised at WLCG Ops meeting

    • 03/11/2008: raised again at WLCG Ops meeting

    • 07/11/2008: still in last step and not suspended

    • 10/11/2008: raised again at WLCG Ops meeting

    • 17/11/2008: still in last step and not suspended. ROC suspended. ROC North is present at WLCG Ops meeting and will check with site.

    • 18/11/2008: finally fixed by site


Point to discuss at COD meeting and to raise at ROC managers meeting

In Operational Manual: "If no progress is made, COD make sure that OMC is informed of the situation, and the site status is set to “suspended” in GOCDB by COD unless OMC say differently."

Proposed solution: as COD has the rights to suspend a site,

  • if the ROC is not present at the Weekly Operation meeting or has not sent a mail about the problem, COD suspends the site;

  • if the ROC is present and asks for a discussion with its site, OCC should put an action on the ROC in the list of actions of the Weekly Operation meeting, so that it is followed up at the next meeting. The answer or the suspension by the ROC should come within the next 3 days: as acknowledgement, a mail should be sent to both the OCC and COD mailing lists. Otherwise, the site is suspended by COD after these 3 days.

➔ We should agree on a solution to propose to the ROC managers (the one above or another).

It should then be discussed at the ROC managers meeting here in Abingdon, to be integrated later into the Ops manual.
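
The proposed rule can be written down as a small decision function, which may help when discussing edge cases. This is a sketch under assumptions, not an agreed procedure: the function name, its parameters and the returned strings are hypothetical. The first case is read here as "the ROC is absent and has sent no mail", which is one interpretation of the wording above. The 3 days are counted in calendar days from the meeting, also an assumption.

```python
from datetime import date, timedelta

# Deadline from the proposal: answer or suspension within the next 3 days.
ANSWER_DEADLINE = timedelta(days=3)


def cod_decision(meeting: date, today: date, roc_present: bool,
                 roc_mailed: bool, roc_acknowledged: bool) -> str:
    """Return what COD should do for a site in the last escalation step.

    Hypothetical helper. `roc_acknowledged` means the ROC has sent its answer
    (or confirmed the suspension) to both the OCC and COD mailing lists.
    """
    if not roc_present and not roc_mailed:
        # Assumed reading: ROC neither attended the meeting nor sent a mail.
        return "suspend"
    if roc_acknowledged:
        # The ROC answered or suspended the site itself.
        return "no action"
    if today - meeting > ANSWER_DEADLINE:
        # No acknowledgement after the 3 days: COD suspends the site.
        return "suspend"
    return "wait"


meeting = date(2008, 11, 17)
print(cod_decision(meeting, date(2008, 11, 19), True, False, False))
print(cod_decision(meeting, date(2008, 11, 21), True, False, False))
```

Whether the deadline should count calendar days or working days is left open by the proposal and would need to be fixed before it goes into the Ops manual.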


Best practices

Alarm handling and masking

Shifter duty should not be a competition over who processes the highest number of alarms

Before assigning a ticket to an alarm, check whether it is related to another alarm using the “Related alarms” table

Mask the alarm instead of creating two tickets for the same problem

2 alarms which can be masked by the current one
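
The idea of one ticket per problem can be illustrated with a short Python sketch. Everything in it is hypothetical: the `Alarm` fields, the site, host and test names, and above all the rule that two alarms are related when they share the same site and host. In practice the relation comes from the “Related alarms” table of the dashboard, not from this rule.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm record; not the dashboard's data model."""
    alarm_id: int
    site: str
    host: str
    test: str


def plan_tickets(alarms: List[Alarm]) -> Dict[int, List[int]]:
    """Map each alarm that gets a ticket to the alarms it masks.

    Assumption: alarms on the same site and host are related.
    """
    groups = defaultdict(list)
    for alarm in alarms:
        groups[(alarm.site, alarm.host)].append(alarm)

    plan = {}
    for related in groups.values():
        # The first alarm of a group gets the ticket, the others are masked.
        primary, *others = related
        plan[primary.alarm_id] = [a.alarm_id for a in others]
    return plan


alarms = [
    Alarm(1, "SITE-A", "ce01.example.org", "job-submit"),
    Alarm(2, "SITE-A", "ce01.example.org", "ca-version"),
    Alarm(3, "SITE-A", "ce01.example.org", "software-version"),
    Alarm(4, "SITE-A", "se01.example.org", "storage-put"),
]
print(plan_tickets(alarms))  # {1: [2, 3], 4: []}
```

Here alarm 1 gets the ticket and masks alarms 2 and 3, so the site receives two tickets instead of four.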


Point to discuss at COD meeting

Training for new COD members

Every new COD member should be trained on the COD tasks by their own COD team or by another team

We need to define how to do it.

Probably on demand training

Training materials available from the COD dashboard?

