Tier-1 Status

Andrew Sansum
GRIDPP18, 21 March 2007

Staff Changes

  • Steve Traylen left in September

  • Three new Tier-1 staff

    • Lex Holt (Fabric Team)

    • James Thorne (Fabric Team)

    • James Adams (Fabric Team)

  • One EGEE funded post to operate a PPS (and work on integration with NGS):

    • Marian Klein

Team Organisation

[Organisation chart: Corney (GL); Strong (Service Manager); Folkes (HW Manager); Project Management: Sansum/Gordon/(Kelsey). H/W and OS: Bly (team leader), White (OS support), Adams (HW support). Grid Services: Klein (PPS); ~2.5 FTE effort. Database Support: Brown. Machine Room Operations. Networking Support.]

Hardware Deployment - CPU

  • 64 dual-core/dual-CPU Intel Woodcrest 5130 systems delivered in November (about 550 KSI2K)

  • Completed acceptance tests over Christmas; into production mid-January

  • CPU farm capacity now (approximately):

    • 600 systems

    • 1250 cores

    • 1500 KSI2K

Hardware Deployment - Disk

  • 2006 was a difficult year with deployment hold-ups:

    • March 2006 delivery: 21 servers, Areca RAID controller – 24*400GB WD (RE2) drives. Available: January 2007

    • November 2006 delivery: 47 servers, 3Ware RAID controller – 16*500GB WD (RE2). Accepted February 2007 (but still deploying to CASTOR)

    • January 2007 delivery: 39 servers, 3Ware RAID controller – 16*500GB WD (RE2). Accepted March 2007; ready to deploy to CASTOR

Disk Deployment - Issues

  • March 2006 (Clustervision) delivery:

    • Originally delivered with 400GB WD400YR drives

    • Many drive ejects under normal load testing (the drives had worked OK when we tested in January)

    • Drive specification found to have changed – compatibility problems with the RAID controller (despite the drive being listed as compatible)

    • Various firmware fixes tried – improvements, but not fixed

    • August 2006: WD offer to replace with 500YS drives

    • September 2006: load tests of the new configuration begin to show occasional (but unacceptably frequent) drive ejects (a different problem)

    • Major diagnostic effort by Western Digital – Clustervision also trying various fixes; lots of theories – vibration, EM noise, protocol incompatibility (progress slow as the failure rate was quite low)

    • Fault hard to trace, but eventually traced (early December) to faulty firmware

    • Firmware updated; load testing shows the problem fixed (mid-December). Load test completes in early January and deployment begins

Disk Deployment - Cause

  • Western Digital working at two sites – logic analysers on the SATA interconnect.

  • Eventually the fault was traced to a “missing return” in the firmware:

    • If the drive head stays too long in one place, it repositions to allow lubricant to migrate

    • Only shows up under certain workload patterns

    • No completion is returned following the reposition, and 8 seconds later the controller ejects the drive
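The failure mode above can be illustrated with a toy model (hypothetical Python, not the actual drive or controller firmware): the reposition path forgets to report completion, so the controller's 8-second timeout fires and the drive is ejected.

```python
# Toy model of the firmware fault (illustrative only, not real WD firmware).
# The drive occasionally repositions its head to let lubricant migrate; the
# buggy code path never acknowledges the outstanding command afterwards, so
# the RAID controller's 8-second timeout expires and it ejects the drive.

CONTROLLER_TIMEOUT = 8.0  # seconds the controller waits for a response


def drive_handle_command(dwell_too_long: bool, firmware_fixed: bool):
    """Return the drive's response time in seconds, or None if it never replies."""
    service_time = 0.01            # normal command service time
    if dwell_too_long:
        service_time += 0.5        # head reposition to migrate lubricant
        if not firmware_fixed:
            return None            # bug: no completion returned after reposition
    return service_time


def controller_poll(response_time):
    """Eject the drive if no response arrives within the timeout."""
    if response_time is None or response_time > CONTROLLER_TIMEOUT:
        return "ejected"
    return "ok"


# Normal workload: fine either way.
assert controller_poll(drive_handle_command(False, firmware_fixed=False)) == "ok"
# Workload that parks the head in one place: buggy firmware gets ejected...
assert controller_poll(drive_handle_command(True, firmware_fixed=False)) == "ejected"
# ...while the fixed firmware survives the same access pattern.
assert controller_poll(drive_handle_command(True, firmware_fixed=True)) == "ok"
```

This also shows why the fault was hard to catch: the eject only triggers on workloads that keep the head in one place long enough to provoke a reposition.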

Hardware Deployment - Tape

  • SL8500 tape robot upgraded to 10,000 slots in August 2006.

  • GRIDPP bought 3 additional T10K tape drives in February 2007 (6 drives now owned by GRIDPP)

  • Further purchase of 350TB of tape media in February 2007.

  • Total tape capacity now 850-900TB (but not all immediately allocated – some to assist the CASTOR migration, some needed for CASTOR operations)

Hardware Deployment - Network

  • 10Gb line from CERN available since August 2006

  • RAL was scheduled to attach to the Thames Valley Network (TVN) at 10Gb by November 2006:

    • Change of plan in November – I/O rates from the Tier-1 were already visible to UKERNA. Decided to connect the Tier-1 by a 10Gb resilient connection direct into the SJ5 core (planned mid-Q1)

    • Connection delayed but now scheduled for the end of March

  • GRIDPP load tests identified several issues at the RAL firewall. These were resolved, but the plan is now to bypass the firewall for SRM traffic from SJ5.

  • A number of internal Tier-1 topology changes while we enhanced the LAN backbone to 10Gb in preparation for SJ5



Tier-1 LAN

[Network diagram: switch stacks of 2, 3, 5 and 6 x 5510 (each with a 5530 uplink) serving CPU and disk nodes; a 4 x 5530 core; N x 1Gb/s links; connections to SJ4 and to Tier-2]

New Machine Room

  • Tender underway; planned completion August 2008

  • 800 m² – can accommodate 300 racks + 5 robots

  • 2.3MW power/cooling capacity (some UPS)

  • Office accommodation for all e-Science staff

  • Combined Heat and Power (CHP) generation on site

  • Not all for GRIDPP (but you get most of it)!

Last 12 Months CPU Occupancy

[Chart: CPU occupancy over the last 12 months; annotations: +260 KSI2K, May 2006; +550 KSI2K, January 2007]

Recent CPU Occupancy (4 Weeks)

[Chart: CPU occupancy over the last 4 weeks; annotation: air-conditioning work (300 KSI2K offline)]

CPU Efficiencies

[Chart: CPU efficiency by experiment; annotations: CMS merge jobs hang on CASTOR; ATLAS/LHCb jobs hanging on dCache; BaBar jobs running slow, reason unknown]

3D Service

  • Used by ATLAS and LHCb to distribute conditions data via Oracle Streams

  • RAL one of five sites that deployed a production service during Phase 1.

  • Small SAN cluster – 4 nodes, 1 Fibre Channel RAID array.

  • RAL takes a leading role in the project.


  • Reliability matters to the experiments.

    • Use SAM monitoring to identify priority areas

    • Also concerned about job loss rates

  • Priority at RAL is to improve reliability:

    • Fix the faults that degrade our SAM availability

    • New exception monitoring and automation system based on Nagios

  • Reliability is improving, but the work feels like an endless treadmill: fix one fault and find a new one.
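Nagios-style checks report state through their exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN). A custom exception check of the kind described might look like this sketch; the backlog metric and thresholds are hypothetical, not the actual RAL checks:

```python
# Minimal Nagios-plugin-style sketch (hypothetical check: queued-job backlog).
# Nagios interprets the exit code: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
import sys

WARN, CRIT = 500, 1000  # hypothetical backlog thresholds


def check_backlog(queued: int) -> int:
    """Print a one-line status and return the Nagios exit code."""
    if queued >= CRIT:
        print(f"CRITICAL - {queued} jobs queued")
        return 2
    if queued >= WARN:
        print(f"WARNING - {queued} jobs queued")
        return 1
    print(f"OK - {queued} jobs queued")
    return 0


if __name__ == "__main__":
    sys.exit(check_backlog(int(sys.argv[1]) if len(sys.argv) > 1 else 0))
```

An automation layer can then react to WARNING/CRITICAL transitions (restart a daemon, hold a job) instead of waiting for a SAM failure.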

Reliability - CE

  • PBS server and CE split a long time ago

  • Split CE and local BDII

  • Site BDII times out on the CE info provider

    • CPU usage very high on the CE – info provider “starved”

    • Upgraded CE to 2 cores

  • Site BDII still times out on the CE info provider

    • CE system disk I/O bound

    • Reduced load (changed backups etc.)

    • Finally replaced the system drive with a faster model

Job Scheduling

  • SAM jobs failing to be scheduled by MAUI

    • SAM tests run under the operations VO but share a gid with dteam; dteam had used all its resource, so MAUI started no more jobs

    • Changed scheduling to favour the ops VO (long-term plan to split ops and dteam)

  • PBS server hanging after communications problems

    • A job stuck in the pending state jams the whole batch system (no jobs start – site unavailable!)

    • Auto-detect pending jobs in this state and hold them – remaining jobs start and availability is good

    • But held jobs now impact the ETT and we receive less work from the RB – have to delete held jobs
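The auto-detect-and-hold workaround can be sketched as follows (hypothetical helper; a production version would shell out to the PBS qstat/qhold commands rather than take a dict): scan the queue and hold any job that has sat pending longer than a threshold, so the scheduler keeps starting the rest.

```python
# Sketch of a stuck-pending-job watchdog (hypothetical; a real version would
# parse qstat output and call qhold instead of operating on a dict).
import time

STUCK_THRESHOLD = 15 * 60  # seconds a job may sit pending before we hold it


def find_stuck_jobs(pending_jobs, now=None):
    """pending_jobs maps job id -> epoch time it entered the pending state.

    Returns the job ids that should be held so the scheduler can move on.
    """
    now = time.time() if now is None else now
    return [jid for jid, since in pending_jobs.items()
            if now - since > STUCK_THRESHOLD]


now = 1_000_000.0
pending = {
    "1234.pbs": now - 3600,   # pending for an hour: jammed, hold it
    "1235.pbs": now - 60,     # pending for a minute: normal, leave it
}
assert find_stuck_jobs(pending, now=now) == ["1234.pbs"]
```

As the slide notes, holding is only half a fix: the held jobs still inflate the ETT, so they eventually have to be deleted.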

Jobs de-queued at CE

  • Jobs reach the CE and are successfully submitted to the scheduler, but shortly afterwards the CE decides to de-queue the job.

    • Only impacts SAM monitoring occasionally

    • May be impacting users more than SAM, but we cannot tell from our logs

    • Logged a GGUS ticket, but no resolution


  • RB very busy for extended periods during the summer:

    • Second RB (rb02) added in early November, but there is no transparent way of advertising it – UIs need manual configuration (see the GRIDPP wiki).

  • Jobs found to abort on rb01, linked to the size of its database

    • Database needed cleaning (was over 8GB)

  • Job cancels may (but not reproducibly) break the RB (it may go 100% CPU bound) – no fix for this ticket.

RB Load

[Chart: RB CPU load over time; annotations: rb02 deployed; rb02 high CPU load; drained to fix hardware]

Top Level BDII

  • Top level BDII not reliably responding to queries

    • Query rate too high

    • UK sites failing SAM tests for extended periods

  • Upgraded BDII to two servers on DNS round robin

    • Sites occasionally fail SAM test

  • Upgraded BDII to 3 servers (last Friday)

    • Hope problem fixed – please report timeouts.
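The round-robin upgrade can be sketched as a simulation (not the production setup): successive lookups of one DNS alias rotate through the server pool, so the query load divides roughly evenly and tripling the pool cuts the per-server rate to about a third.

```python
# Simulate a DNS round-robin alias rotating over a BDII server pool
# (server names are illustrative, not the real RAL hostnames).
from collections import Counter
from itertools import cycle

servers = ["bdii01", "bdii02", "bdii03"]
resolver = cycle(servers)  # stand-in for round-robin A-record rotation

queries = 9000
load = Counter(next(resolver) for _ in range(queries))

# Each server now sees ~1/3 of the queries instead of all of them.
assert all(count == queries // len(servers) for count in load.values())
```

Real DNS round robin is only approximately even (caching resolvers pin clients to one record for the TTL), but the aggregate effect is the same: per-server query rate drops as servers are added.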


  • Reasonably reliable service

    • Based on a single server

    • Monitoring and automation to watch for problems

  • At next upgrade (soon) move from single server to two pairs:

    • One pair will handle transfer agents

    • One pair will handle web front end.


  • Problems with gridftp doors hanging

    • Partly helped by changes to network tuning

    • But still impacts SAM tests (and experiments). Decided to move the SAM CE replica-manager test from dCache to CASTOR (a cynical manoeuvre to help SAM)

  • Had hoped this month’s upgrade to version 1.7 would resolve the problem

    • It didn’t

    • Have now upgraded all gridftp doors to Java 1.5. No hangs since the upgrade last Thursday.


  • Autumn 2005/Winter 2005:

    • Decide to migrate tape service to CASTOR

    • Decision that CASTOR will eventually replace dCache for disk pool management - CASTOR2 deployment starts

  • Spring/Summer 2006: Major effort to deploy and understand CASTOR

    • Difficult to establish a stable pre-production service

    • Upgrades extremely difficult to make work – test service down for weeks at a time following upgrade or patching.

  • September 2006:

    • Originally planned to have a full production service by now

    • Eventually, after heroic effort, the CASTOR team establish a pre-production service for CSA06

  • October 2006:

    • But we don’t have any disk – have to borrow – BIG THANK YOU PPD!

    • CASTOR performs well in CSA06

  • November/December 2006: work on a CASTOR upgrade, but the upgrade eventually fails

  • January 2007: declare the CASTOR service production quality

  • Feb/March 2007:

    • Continue working with CMS as they expand the range of tasks expected of CASTOR – significant load-related operational issues identified (e.g. CMS merge jobs cause LSF meltdown)

    • Start work with ATLAS, LHCb and MINOS on migrating to CASTOR

CASTOR Layout

[Diagram: CASTOR instance layout and service classes]
SL4 and gLite

  • Preparing to migrate some batch workers to SL4 for experiment testing.

  • Some gLite testing (on SL3) is already underway, but we are becoming increasingly nervous about the risks associated with late deployment of the forthcoming SL4 gLite release

Grid Only

  • Long standing milestone that Tier-1 will offer a “Grid Only” service by the end of August 2007.

  • Discussed at the January UB. Considerable discussion about what “Grid Only” means.

  • Basic target confirmed by the Tier-1 board, but details of exactly what remains needed are still to be fixed.


  • Last year was tough, but we eventually made good progress.

    • A lot of problems encountered

    • A lot accomplished

  • This year focus will be on:

    • Establishing a stable CASTOR service that meets the needs of the experiments

    • Deploying required releases of SL4/gLite

    • Meeting (hopefully exceeding) availability targets

    • Hardware ramp-up as we move towards GRIDPP3