
Tier-1 Overview

Andrew Sansum

21 November 2007


Overview of Presentations

  • Morning Presentations

    • Overview (Me)

      • Not really an overview – at Tony's request, mainly MoU commitments

    • CASTOR (Bonny)

      • Storing the data and getting it to tape

    • Grid Infrastructure (Derek Ross)

      • Grid Services

      • dCache future

      • Grid Only Access

    • Fabric Talk (Martin Bly)

      • Procurements

      • Hardware infrastructure (inc Local Network)

      • Operation

  • Afternoon Presentations

    • Neil (RAL benefits)

    • Site Networking (Robin Tasker)

    • Machine Rooms (Graham Robinson)


What I’ll Cover

  • Mainly going to cover MoU commitments

    • Response Times

    • Reliability

    • On-Call

    • Disaster planning

  • Also cover staffing


GRIDPP2 Team Organisation

  • Teams: Grid Services; Grid/exp Support; Fabric (H/W and OS); CASTOR SW/Robot

  • Staff: Ross, Condurache, Hodges, Klein (PPS), Bly, Wheeler, Thorne, White (OS support), Adams (HW support), Corney (GL), Strong (Service Manager), Folkes (HW Manager), deWitt, Jensen, Kruk, Ketley, Jackson (CASE), Prosser (Contractor), plus two vacancies (nominally 5.5 FTE)

  • Project Management (Sansum/Gordon/(Kelsey)) (1.5 FTE)

  • Database Support (Brown) (0.5 FTE)

  • Machine Room operations (1.5 FTE)

  • Networking Support (0.5 FTE)


Staff Evolution to GRIDPP3

  • Level

    • GRIDPP2 (13.5 GRIDPP + 3.0 e-Science)

    • GRIDPP3 (17.0 GRIDPP + 3.4 e-Science)

  • Main changes

    • Hardware repair effort 1->2 FTE

    • New incident response team (2 FTE)

    • Extra CASTOR effort (0.5 FTE) (though this is effort that has already been working on CASTOR, unreported)

    • Small changes elsewhere

  • Main problem

    • We have temporarily injected 2 FTE of effort into CASTOR. The long-term GRIDPP3 plan funds less effort than current experience suggests we need.


WLCG/GRIDPP MoU Expectations

[1] Prime service hours are 08:00-18:00 during the working week of the centre, except public holidays.


Response Time

  • Time to acknowledge fault ticket

  • 12-48 hour response time outside prime shift

  • The on-call system should easily cover this, provided it is possible to automatically classify problem tickets by the level of service required.

  • Cover during the prime shift is more challenging (2-4 hours), but is already a routine task for the Admin on Duty.

  • To hit the availability target, response must be much faster (2 hours or less).
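The automatic-classification requirement above could look something like the sketch below. The priority names, the 2/4/12/48-hour deadlines, and the `ack_deadline_hours` helper are illustrative assumptions layered on the response windows quoted in this slide; only the 08:00-18:00 prime-hours window comes from the MoU footnote later in the talk.

```python
# Hypothetical ticket triage: map a ticket's priority and arrival time to a
# maximum acknowledgement time, loosely following the response windows above.
from datetime import datetime

PRIME_START, PRIME_END = 8, 18  # prime service hours, 08:00-18:00

# (prime-shift deadline, out-of-hours deadline) in hours -- illustrative values
RESPONSE_HOURS = {
    "critical": (2, 12),   # availability-threatening: fastest response
    "normal":   (4, 48),   # routine fault tickets
}

def ack_deadline_hours(priority: str, arrived: datetime) -> int:
    """Return the maximum hours allowed to acknowledge a ticket."""
    in_prime = (
        arrived.weekday() < 5                       # Monday-Friday
        and PRIME_START <= arrived.hour < PRIME_END
    )
    prime_sla, offhours_sla = RESPONSE_HOURS[priority]
    return prime_sla if in_prime else offhours_sla

# e.g. a critical ticket on a Wednesday morning must be acknowledged in 2 h
print(ack_deadline_hours("critical", datetime(2007, 11, 21, 10, 0)))
```

A real classifier would also need the public-holiday exception from the MoU and a mapping from ticket content to priority, which is the hard part the bullet alludes to.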


Reliability

  • Have made good progress in last 12 months

    • Prioritised issues affecting SAM test failures.

    • Introduced “issue tracking” and weekly reviews of outstanding issues.

    • Introduced resilience into trouble spots (but more still to do)

    • Moved services to appropriate-capacity hardware, separated services, etc.

    • Introduced new team role: “Admin on Duty”. Monitoring farm operation, ticket progression, EGEE broadcast info.

  • Best Tier-1 averaged over last 3 months (other than CERN).



MoU Commitments (Availability)

  • Really reliability (availability while scheduled up)

  • Still tough – 97-99% service availability will be hard (1% is just 87 hours per year).

    • OPN reliability predicted to be 98% without resilience, site SJ5 connection is much better (Robin will discuss).

    • Most faults (75%) will fall outside normal working hours

    • Software components still changing (eg CASTOR upgrades, WMS) etc.

    • Many faults in 2008 will be “new”, emerging only as WLCG ramps up to full load.

    • Emergent faults can take a long time to diagnose and fix (days)

  • To improve on current availability will need to:

    • Improve automation

    • Speed up manual recovery process

    • Improve monitoring further

    • Provide on-call
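The availability figures above translate directly into an annual downtime budget, which is what makes the target so tight. A minimal sketch (the `downtime_budget_hours` helper is for illustration; the 97-99% range is the one quoted above):

```python
# Convert a service-availability target into an annual downtime budget.
HOURS_PER_YEAR = 365 * 24  # 8760 hours

def downtime_budget_hours(availability: float) -> float:
    """Hours per year a service may be down while still meeting `availability`."""
    return (1.0 - availability) * HOURS_PER_YEAR

for target in (0.97, 0.98, 0.99):
    print(f"{target:.0%} availability -> "
          f"{downtime_budget_hours(target):.1f} h/year allowed downtime")
```

At 99% the budget is about 87.6 hours per year, matching the “1% is just 87 hours” remark; with 75% of faults falling out of hours and emergent faults taking days to fix, a handful of incidents can consume it.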


On-Call

  • On-Call will be essential in order to meet response and availability targets.

  • On-Call project now running (Matt Hodges), target is to have on-call operational by March 2008.

  • Automation/recovery/monitoring all important parts of on-call system. Avoid callouts by avoiding problems.

  • May be possible to have some weekend on-call cover before March for some components.

  • On-call will continue to evolve after March as we learn from experience.


Disaster Planning (I)

  • The extreme end of the availability problem. A risk analysis exists, but it is aging and not fully developed.

  • Highest Impact risks:

    • Extended environment problem in machine room

      • Fire

      • Flood

      • Power Failure

      • Cooling failure

    • Extended network failure

    • Major data loss through loss of CASTOR metadata

    • Major security incident (site or Tier-1)


Disaster Planning (II)

  • Some disaster plan components exist

    • A disaster plan for the machine room: assuming equipment is undamaged, relocate and endeavour to sustain functions, but at much reduced capacity.

    • Datastore (ADS) disaster recovery plan developed and tested

    • Network plan exists

    • Individual Tier-1 systems have documented recovery processes and fire-safe backups, or can be re-instantiated from the kickstart server. Not all of these are simple, nor are all fully tested.

  • Key Missing Components

    • National/global services (RGMA/FTS/BDII/LFC/…). Address by distributing them elsewhere. Probably feasible, and necessary – estimated at 6 months.

    • CASTOR – all our data holdings depend on the integrity of the catalogue. Recovery from first principles has not been tested. Flagged as a priority area, but this must be balanced against the need to make CASTOR work.

    • A second, independent Tier-1 build infrastructure, allowing us to rebuild the Tier-1 at a new physical location. This would let us address major incidents such as fire. A major project – what priority?


Conclusions

  • Made a lot of progress in many areas this year. Availability improving, hardware reliable, CASTOR working quite well and upgrades on-track.

  • Main challenges for 2008 (data taking)

    • Large hardware installations and almost immediate next procurement

    • CASTOR at full load

    • On-call and general MoU processes

