Tier-1 Overview

Andrew Sansum

21 November 2007

Overview of Presentations

  • Morning Presentations

    • Overview (Me)

      • Not really an overview – at Tony's request, mainly MoU commitments

    • CASTOR (Bonny)

      • Storing the data and getting it to tape

    • Grid Infrastructure (Derek Ross)

      • Grid Services

      • dCache future

      • Grid Only Access

    • Fabric Talk (Martin Bly)

      • Procurements

      • Hardware infrastructure (inc Local Network)

      • Operation

  • Afternoon Presentations

    • Neil (RAL benefits)

    • Site Networking (Robin Tasker)

    • Machine Rooms (Graham Robinson)

What I’ll Cover

  • Mainly going to cover MoU commitments

    • Response Times

    • Reliability

    • On-Call

    • Disaster planning

  • Also cover staffing

GRIDPP2 Team Organisation

[Organisation chart; the diagram itself did not survive the transcript. Surviving labels:]

  • Grid Services

  • Grid/exp Support

  • (H/W and OS)

  • Klein (PPS)

  • White (OS support)

  • Adams (HW support)

  • Corney (GL)

  • Strong (Service Manager)

  • Folkes (HW Manager)

  • Jackson (CASE)

  • Prosser (Contractor)

  • (Nominally 5.5 FTE)

  • Project Management (Sansum/Gordon/(Kelsey)) (1.5 FTE)

  • Database Support (Brown) (0.5 FTE)

  • Machine Room operations (1.5 FTE)

  • Networking Support (0.5 FTE)

Staff Evolution to GRIDPP3

  • Level

    • GRIDPP2 (13.5 GRIDPP + 3.0 e-Science)

    • GRIDPP3 (17.0 GRIDPP + 3.4 e-Science)

  • Main changes

    • Hardware repair effort 1->2 FTE

    • New incident response team (2 FTE)

    • Extra CASTOR effort (0.5 FTE) (but this formalises effort that has already been working on CASTOR unreported)

    • Small changes elsewhere

  • Main problem

    • We have temporarily injected 2 FTE of effort into CASTOR; the long-term GRIDPP3 plan funds less effort than current experience suggests we need.

WLCG/GRIDPP MoU Expectations

[Table of MoU service-level targets not captured in the transcript.]

[1] Prime service hours are 08:00-18:00 during the working week of the centre, except public holidays.

Response Time

  • Time to acknowledge fault ticket

  • 12-48 hour response time outside prime shift

  • The on-call system should easily cover this, provided it is possible to classify problem tickets automatically by the level of service required.

  • Cover during the prime shift is more challenging (2-4 hours) but is already a routine task for the Admin on Duty.

  • To hit the availability target, response must be much faster (2 hours or less).
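The classification step above amounts to a lookup from (service level, shift) to the required acknowledgement window. A minimal sketch follows; the level names and hour values are invented placeholders, not the actual MoU figures:

```python
# Illustrative only: map a ticket's service level and arrival shift to the
# maximum hours allowed to acknowledge it. Values are placeholders.
RESPONSE_HOURS = {
    ("critical", "prime"): 2,
    ("critical", "outside"): 12,
    ("standard", "prime"): 4,
    ("standard", "outside"): 48,
}

def ack_deadline_hours(service_level: str, shift: str) -> int:
    """Return the maximum hours allowed to acknowledge a fault ticket."""
    try:
        return RESPONSE_HOURS[(service_level, shift)]
    except KeyError:
        # Unclassifiable tickets fall back to the tightest window, so a
        # misclassification can never cause a missed MoU deadline.
        return min(RESPONSE_HOURS.values())
```

Defaulting unknown tickets to the tightest window is the safe direction of error for the automatic classifier the slide calls for.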

Reliability

  • Have made good progress in last 12 months

    • Prioritised issues affecting SAM test failures.

    • Introduced “issue tracking” and weekly reviews of outstanding issues.

    • Introduced resilience into trouble spots (but more still to do)

    • Moved services to appropriately sized hardware, separated services, etc.

    • Introduced a new team role, "Admin on Duty": monitoring farm operation, ticket progression and EGEE broadcast information.

  • Best Tier-1 averaged over last 3 months (other than CERN).

MoU Commitments (Availability)

  • Really reliability (availability while scheduled up)

  • Still tough – 97-99% service availability will be hard (1% is just 87 hours per year).

    • OPN reliability predicted to be 98% without resilience, site SJ5 connection is much better (Robin will discuss).

    • Most faults (75%) will fall outside normal working hours

    • Software components still changing (eg CASTOR upgrades, WMS) etc.

    • Many faults in 2008 will be “new” only emerging as WLCG ramps up to full load.

    • Emergent faults can take a long time to diagnose and fix (days)

  • To improve on current availability we will need to:

    • Improve automation

    • Speed up manual recovery process

    • Improve monitoring further

    • Provide on-call
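The arithmetic behind the "1% is just 87 hours per year" point, and the slide's definition of reliability as availability while scheduled up, can be made explicit with a short sketch (plain Python, nothing site-specific):

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_budget_hours(availability: float) -> float:
    """Hours of downtime per year permitted at a given availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR

def reliability(up_hours: float, scheduled_hours: float) -> float:
    """Reliability as defined on the slide: availability while scheduled up."""
    return up_hours / scheduled_hours

# ~87.6 h/year at 99%, ~262.8 h/year at 97%
print(round(downtime_budget_hours(0.99), 1))
print(round(downtime_budget_hours(0.97), 1))
```

With 75% of faults outside working hours and emergent faults taking days to diagnose, a single unlucky incident can consume most of a year's 99% budget, which is why on-call and automation dominate the list above.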

On-Call

  • On-Call will be essential in order to meet response and availability targets.

  • On-Call project now running (Matt Hodges), target is to have on-call operational by March 2008.

  • Automation/recovery/monitoring all important parts of on-call system. Avoid callouts by avoiding problems.

  • May be possible to have some weekend on-call cover before March for some components.

  • On-call will continue to evolve after March as we learn from experience.
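The "avoid callouts by avoiding problems" idea amounts to attempting automated recovery before paging anyone. A minimal control-flow sketch, with invented callables standing in for real probes and actions (this is not the Tier-1's actual on-call tooling):

```python
def handle_alarm(check, recover, page):
    """Run the health check; attempt one automated recovery before paging.

    check, recover and page are caller-supplied callables - e.g. a service
    probe, a restart/failover action, and the callout mechanism.
    """
    if check():
        return "ok"         # false alarm, nobody is woken up
    recover()               # e.g. restart the service, fail over
    if check():
        return "recovered"  # automation handled it
    page()                  # only now call out the on-call person
    return "paged"
```

When recovery succeeds, no human is involved; only genuinely novel faults generate callouts, which keeps the March 2008 on-call rota sustainable.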

Disaster Planning (I)

  • Extreme end of the availability problem. A risk analysis exists, but it is ageing and not fully developed.

  • Highest Impact risks:

    • Extended environment problem in machine room

      • Fire

      • Flood

      • Power Failure

      • Cooling failure

    • Extended network failure

    • Major data loss through loss of CASTOR metadata

    • Major security incident (site or Tier-1)

Disaster Planning (II)

  • Some disaster plan components exist

    • Disaster plan for the machine room: assuming equipment is undamaged, relocate and endeavour to sustain functions, but at much reduced capacity.

    • Datastore (ADS) disaster recovery plan developed and tested

    • Network plan exists

    • Individual Tier-1 systems have documented recovery processes and fire-safe backups, or can be rebuilt from the kickstart server. Not all of these are simple, nor are all fully tested.

  • Key Missing Components

    • National/Global services (RGMA/FTS/BDII/LFC/…). Address by distributing them to other sites. Probably feasible, and necessary – 6 months.

    • CASTOR – all our data holdings depend on the integrity of the catalogue. Recovery from first principles is not tested. Flagged as a priority area, but must be balanced against the need to make CASTOR work.

    • Second, independent Tier-1 build infrastructure to allow us to rebuild the Tier-1 at a new physical location. Would allow us to address major risks such as fire. Major project – priority?
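For the catalogue risk above, one cheap ingredient of any recovery plan is confirming that a metadata dump is still intact before it is needed. A hedged sketch using plain checksumming (illustrative only, not the actual CASTOR or database tooling):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Checksum a backup dump in 1 MiB chunks, so a fire-safe or off-site
    copy can be compared against the digest recorded at backup time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_is_intact(path: str, recorded_digest: str) -> bool:
    """True if the dump on disk still matches its recorded digest."""
    return sha256_of(path) == recorded_digest
```

Routinely verifying digests turns "recover from first principles not tested" into at least "the inputs to recovery are known good".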

Conclusions

  • Made a lot of progress in many areas this year: availability improving, hardware reliable, CASTOR working quite well and upgrades on track.

  • Main challenges for 2008 (data taking)

    • Large hardware installations and almost immediate next procurement

    • CASTOR at full load

    • On-call and general MoU processes