WLCG Service Report

WLCG Service Report. [email protected] ~~~ WLCG Management Board, 13th July 2010. WLCG Operations Report – Summary.

Presentation Transcript


WLCG Service Report

[email protected]

~~~

WLCG Management Board, 13th July 2010


WLCG Operations Report – Summary

The response to alarms (expert intervention and problem resolution) continues to be well within targets. Should we instead establish a metric related to the frequency and nature of such alarms? (We want to see progress, even if slow…)


[Availability plots; annotated items 0.1, 1.1-1.4, 3.1, 4.1 and 4.2 are analysed below]

Analysis of the availability plots

COMMON TO ALL EXPERIMENTS

0.1 CERN-PROD: CASTOR-related problem exporting data from T0; all attempts to write to and read from T0 were timing out. The problem was traced to very high levels of logging. Logging daemon reset.

ATLAS

1.1 NDGF-T1: Scheduled downtime from 12:00 to 14:00 for a dCache upgrade on the head nodes and OS patching. Aiming to keep the actual outage much shorter, if all goes well.

1.2 Taiwan-LCG2: Temporary SRMv2 Test timeout.

1.3 INFN-T1: SRM Test terminated due to temporary communication error.

1.4 NIKHEF: SRM was overloaded with ls operations by a biomed user, and other users got timeouts. Fixed by asking the user to stop.

ALICE

Nothing to report.

CMS

3.1 KIT: Problem with CMS head nodes for dCache - down for about 3 hours. H/W failure.

LHCb

4.1 CERN-PROD: SAM tests have been failing against CERN for a week because a disk server in the lhcbuser pool used for the tests has a filesystem problem.

4.2 CNAF: LHCb storage out due to network (switch) failures. Fixed early morning around.


[Availability plots; annotated items 0.1, 0.2, 1.1-1.3, 2.1, 2.2, 3.1, 3.2 and 4.1 are analysed below]


Analysis of the availability plots

COMMON TO ALL EXPERIMENTS

0.1 RAL-LCG2: Unscheduled outage. Site in downtime due to site wide networking issue.

0.2 FZK-LCG2: GridKa had a complete power failure. Compute node down till Monday.

ATLAS

1.1 INFN-T1: Temporary test failures

1.2 TAIWAN-LCG2: Temporary test failures

1.3 RAL-LCG2: Some problems with ATLAS s/w server end of week and into weekend

ALICE

2.1 FZK-LCG2: Momentary VOBOX-Proxy-Registration test failure

2.2 NIKHEF: alice-box-proxyrenewal service test failed

CMS

3.1 KIT: Temporary test failures.

3.2 CERN: Problems with srm-cern caused transfers to CERN to fail.

LHCb

4.1 NIKHEF: SRM outage, extended until Monday. The problem is difficult to pinpoint and reproduce. Vendor suspects a firmware issue.


GGUS summary (2 weeks)

8


Support-related events since last MB

The SIR by KIT for the 2010/05/12 .de DNS incident is still pending. Details in savannah:114518

Prolonged infrastructure downtimes should IOHO be included as part of “WLCG prolonged downtime strategies” -> WLCG T1SCM

The cases of failing GGUS email notifications to SARA and from CERN have now been traced to parsing scripts at both locations and fixed. Successful ALARM test tickets GGUS:59769 and GGUS:59775 confirm this. Details in savannah:115137

The GridKa cooling system failure incident of 2010/07/10 requires a SIR.

WLCG MB Report WLCG Service Report


https://gus.fzk.de/ws/ticket_info.php?ticket=59441

ATLAS ALARM->CERN CASTOR

https://gus.fzk.de/ws/ticket_info.php?ticket=59547

CMS ALARM->CERN AFS

https://gus.fzk.de/ws/ticket_info.php?ticket=59643

LHCB ALARM->INFN-T1 STORM

https://gus.fzk.de/ws/ticket_info.php?ticket=59848

ATLAS ALARM->CERN CASTOR SRM

https://gus.fzk.de/ws/ticket_info.php?ticket=59850

ATLAS ALARM->CERN CASTOR SRM

Alarm Summary
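
The metric on alarm frequency and nature raised in the operations summary could be sketched roughly as follows. This is only an illustration: the ticket numbers, VOs, sites and services are transcribed from the five alarm slides above, while the function name, the tuple layout and the assumption that these five tickets are the complete set for the two-week reporting period are ours.

```python
from collections import Counter

# Alarm tickets as listed on the slides above: (GGUS ticket, VO, site, service).
ALARMS = [
    (59441, "ATLAS", "CERN", "CASTOR"),
    (59547, "CMS", "CERN", "AFS"),
    (59643, "LHCb", "INFN-T1", "StoRM"),
    (59848, "ATLAS", "CERN", "CASTOR SRM"),
    (59850, "ATLAS", "CERN", "CASTOR SRM"),
]

# Assumption: the report covers two weeks, as in "GGUS summary (2 weeks)".
PERIOD_WEEKS = 2


def alarm_counts(alarms, field):
    """Count alarms by one tuple field: 1 = VO, 2 = site, 3 = service."""
    return Counter(row[field] for row in alarms)


by_vo = alarm_counts(ALARMS, 1)
by_site = alarm_counts(ALARMS, 2)
by_service = alarm_counts(ALARMS, 3)
alarms_per_week = len(ALARMS) / PERIOD_WEEKS

for label, counts in (("VO", by_vo), ("site", by_site), ("service", by_service)):
    print(label, dict(counts.most_common()))
print("alarms per week:", alarms_per_week)
```

Tracking these counts from one report to the next would show whether alarm frequency is falling and which services account for most of it.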


Summary

  • Good response to GGUS alarms continues – frequency high (but bearable in the short term?) for support staff as well as for users…

  • No significant reduction can be expected without an analysis of where the most impact could be achieved – and change – which comes with risk

  • Good match between Site Usability plots and problems reported through daily meetings


Workshop Actions - Draft

  • SIR template and MoU-based wording to categorize service degradation / downtimes;

  • Monitoring;

  • Prolonged site (service) downtimes;

  • Squid “as a WLCG service”

  • None of these are new items – most have been discussed explicitly at WLCG T1 SCM meetings earlier this year

  • Need to review summary slides to ensure that list is exhaustive and prioritize – including matching against EGI InSPIRE SA3 manpower (now largely in place or agreed)

  • Also proposed to hold daily WLCG operations meetings – chaired e.g. by a Tier1 – when CERN is closed (Jeune Genevois etc.)

