Tier-1 Status

Andrew Sansum
GRIDPP18
21 March 2007

Staff Changes
  • Steve Traylen left in September
  • Three new Tier-1 staff
    • Lex Holt (Fabric Team)
    • James Thorne (Fabric Team)
    • James Adams (Fabric Team)
  • One EGEE funded post to operate a PPS (and work on integration with NGS):
    • Marian Klein
Team Organisation

(Organisation chart, shown here as a list.)
  • Project Management: Sansum / Gordon / (Kelsey)
  • Corney (GL), Strong (Service Manager), Folkes (HW Manager)
  • Grid Services: Klein (PPS)
  • Fabric (H/W and OS): Bly (team leader), White (OS support), Adams (HW support)
  • Database Support: Brown
  • Machine Room operations
  • Networking Support
  • 2.5 FTE effort
Hardware Deployment - CPU
  • 64 dual-CPU/dual-core Intel Woodcrest 5130 systems delivered in November (about 550 KSI2K)
  • Completed acceptance tests over Christmas; entered production in mid-January
  • CPU farm capacity now (approximately):
    • 600 systems
    • 1250 cores
    • 1500 KSI2K
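These figures hang together arithmetically; a quick check (the per-core rating at the end is a derived estimate, not a number from the slides):

```python
# Arithmetic behind the November delivery and the farm totals above.
systems_delivered = 64
cores_delivered = systems_delivered * 2 * 2   # dual CPU x dual core
delivery_ksi2k = 550                          # approximate slide figure

assert cores_delivered == 256
# Derived estimate (not a slide figure): per-core rating of the new kit.
ksi2k_per_core = delivery_ksi2k / cores_delivered
print(round(ksi2k_per_core, 2))
```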
Hardware Deployment - Disk
  • 2006 was a difficult year with deployment hold-ups:
    • March 2006 delivery: 21 servers, Areca RAID controller – 24*400GB WD (RE2) drives. Available: January 2007
    • November 2006 delivery: 47 servers, 3Ware RAID controller – 16*500GB WD (RE2). Accepted February 2007 (but still deploying to CASTOR)
    • January 2007 delivery: 39 servers, 3Ware RAID controller – 16*500GB WD (RE2). Accepted March 2007. Ready to deploy to CASTOR
Disk Deployment - Issues
  • March 2006 (Clustervision) delivery:
    • Originally delivered with 400GB WD400YR drives
    • Many drive ejects under normal load test (drives had worked OK when tested in January)
    • Drive specification found to have changed – compatibility problems with the RAID controller (despite the drive being listed as compatible)
    • Various firmware fixes tried – improvements, but not fixed
    • August 2006: WD offered to replace with the 500YS drive
    • September 2006: load tests of the new configuration began to show occasional (but unacceptably frequent) drive ejects (a different problem)
    • Major diagnostic effort by Western Digital – Clustervision also trying fixes; many theories (vibration, EM noise, protocol incompatibility) – progress slow as the failure rate was quite low
    • Fault hard to trace, but eventually traced (early December) to faulty firmware
    • Firmware updated; load test showed the problem fixed (mid-December). Load test completed in early January and deployment began
Disk Deployment - Cause
  • Western Digital working at 2 sites – logic analysers on the SATA interconnect
  • Fault eventually traced to a "missing return" in the firmware:
    • If the drive head stays too long in one place, it repositions to allow lubricant to migrate
    • Only shows up under certain work patterns
    • No return was issued after the reposition, and 8 seconds later the controller ejected the drive
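The failure mode can be illustrated with a toy model (purely illustrative, not actual controller or firmware code): the controller ejects any drive whose command has not returned within 8 seconds, so a reposition with a missing completion return looks identical to a dead drive.

```python
# Toy model of the controller-side behaviour (illustrative only, not real
# RAID-controller code): if a command has not returned within 8 seconds,
# the controller ejects the drive.
TIMEOUT_S = 8

def controller_poll(drive, elapsed_s):
    """Controller's view of a drive some seconds after issuing a command."""
    if drive["completed"]:
        return "ok"
    if elapsed_s >= TIMEOUT_S:
        return "ejected"
    return "waiting"

# Healthy firmware returns after the lubricant-migration reposition.
print(controller_poll({"completed": True}, 2))
# Buggy firmware never returns; 8 seconds later the drive is ejected.
print(controller_poll({"completed": False}, 8))
```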
Hardware Deployment - Tape
  • SL8500 tape robot upgraded to 10,000 slots in August 2006
  • GRIDPP bought 3 additional T10K tape drives in February 2007 (6 drives now owned by GRIDPP)
  • Further purchase of 350TB of tape media in February 2007
  • Total tape capacity now 850-900TB (but not all immediately allocated – some to assist the CASTOR migration, some needed for CASTOR operations)
Hardware Deployment - Network
  • 10Gb line from CERN available in August 2006
  • RAL was scheduled to attach to the Thames Valley Network (TVN) at 10Gb by November 2006:
    • Change of plan in November – I/O rates from the Tier-1 already visible to UKERNA. Decided to connect the Tier-1 by a resilient 10Gb connection direct into the SJ5 core (planned mid Q1)
    • Connection delayed but now scheduled for end of March
  • GRIDPP load tests identified several issues at the RAL firewall. These were resolved, but the plan is now to bypass the firewall for SRM traffic from SJ5.
  • A number of internal Tier-1 topology changes as we enhanced the LAN backbone to 10Gb in preparation for SJ5


Tier-1 LAN

(Diagram: LAN topology – CPU groups on stacks of 2, 3, 5 and 6 × 5510 switches, each stack with a 5530 uplink; a core of 4 × 5530; N × 1Gb/s links; Tier-1 and Tier-2 equipment and a link to SJ4.)
New Machine Room
  • Tender underway, planned completion: August 2008
  • 800 m² can accommodate 300 racks + 5 robots
  • 2.3MW Power/Cooling capacity (some UPS)
  • Office accommodation for all E-Science staff
  • Combined Heat and Power Generation (CHP) on site
  • Not all for GRIDPP (but you get most)!
Last 12 months CPU Occupancy

(Chart: CPU occupancy over the last 12 months, with capacity additions of +260 KSI2K in May 2006 and +550 KSI2K in January 2007.)

Recent CPU Occupancy (4 weeks)

(Chart: dip during air-conditioning work – 300 KSI2K offline.)

CPU Efficiencies

(Chart annotations: CMS merge jobs hang on CASTOR; ATLAS/LHCB jobs hang on dCache; BaBar jobs running slow – reason unknown.)

3D Service
  • Used by ATLAS and LHCB to distribute conditions data by Oracle streams
  • RAL one of 5 sites that deployed a production service during Phase 1
  • Small SAN cluster – 4 nodes, 1 Fibre channel RAID array.
  • RAL takes a leading role in the project.
  • Reliability matters to the experiments.
    • Use the SAM monitoring to identify priority areas
    • Also worrying about job loss rates
  • Priority at RAL to improve reliability:
    • Fix the faults that degrade our SAM availability
    • New exception monitoring and automation system based on Nagios
  • Reliability is improving, but work feels like an endless treadmill. Fix one fault and find a new one.
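The Nagios-based exception monitoring mentioned above would be built from small checks along the following lines (the queue-depth metric and thresholds are hypothetical; only the exit-code convention 0 = OK, 1 = WARNING, 2 = CRITICAL is defined by Nagios):

```python
# Sketch of a Nagios-style check function. Nagios interprets a plugin's
# exit code: 0 = OK, 1 = WARNING, 2 = CRITICAL; a real plugin would print
# the status line and exit with the code. The "queue depth" metric and
# thresholds here are hypothetical.

def check_queue_depth(depth, warn=50, crit=200):
    """Map a metric value to an (exit_code, status_line) pair."""
    if depth >= crit:
        return 2, "CRITICAL - queue depth %d" % depth
    if depth >= warn:
        return 1, "WARNING - queue depth %d" % depth
    return 0, "OK - queue depth %d" % depth

print(check_queue_depth(10))   # a healthy reading
print(check_queue_depth(500))  # would raise an alarm
```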
Reliability - CE
  • Split the PBS server and CE a long time ago
  • Split the CE and local BDII
  • Site BDII timed out on the CE info provider
    • CPU usage very high on the CE – info provider "starved"
    • Upgraded the CE to 2 cores
  • Site BDII still timed out on the CE info provider
    • CE system disk I/O bound
    • Reduced load (changed backups etc.)
    • Finally replaced the system drive with a faster model
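A check like the following sketch can catch the "starved info provider" symptom described above (the command and the timeout value are placeholders, not actual GIP/BDII settings):

```python
# Sketch: time an info-provider run the way the site BDII effectively
# does, flagging runs that exceed the query timeout. The command and the
# 30 s default are placeholders.
import subprocess
import time

def time_provider(cmd, timeout_s=30.0):
    """Run a command; return (elapsed_seconds, timed_out)."""
    start = time.monotonic()
    try:
        subprocess.run(cmd, capture_output=True, timeout=timeout_s, check=False)
        return time.monotonic() - start, False
    except subprocess.TimeoutExpired:
        return time.monotonic() - start, True

# A fast provider completes well inside the window...
elapsed, timed_out = time_provider(["sleep", "0.1"], timeout_s=5.0)
# ...while a starved one would come back with timed_out == True.
```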
Job Scheduling
  • SAM jobs failing to be scheduled by MAUI
    • SAM tests run under the operations VO but share a gid with dteam; dteam had used all its resource, so MAUI started no more jobs
    • Changed scheduling to favour the ops VO (long-term plan to split ops and dteam)
  • PBS server hanging after communications problems
    • A job stuck in a pending state jams the whole batch system (no jobs start – site unavailable!)
    • Now auto-detect the state of pending jobs and hold them – remaining jobs start and availability is good
    • But held jobs now impact the ETT and we receive less work from the RB – have to delete held jobs
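The pending-job auto-hold amounts to a simple triage loop; a sketch with made-up job records and a hypothetical one-hour threshold:

```python
# Sketch of the pending-job auto-hold (job records and the one-hour
# threshold are made up for illustration).
MAX_PENDING_S = 3600  # hold anything pending longer than this

def triage(jobs, now):
    """Split job ids into ones to hold and ones to leave alone."""
    to_hold, leave = [], []
    for job in jobs:
        if job["state"] == "pending" and now - job["since"] > MAX_PENDING_S:
            to_hold.append(job["id"])
        else:
            leave.append(job["id"])
    return to_hold, leave

jobs = [
    {"id": "j1", "state": "pending", "since": 0},     # stuck: hold it
    {"id": "j2", "state": "running", "since": 0},     # fine
    {"id": "j3", "state": "pending", "since": 7000},  # recent: leave it
]
held, rest = triage(jobs, now=7200)
print(held, rest)
```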
Jobs de-queued at CE
  • Jobs reach the CE and are successfully submitted to the scheduler, but shortly afterwards the CE decides to de-queue the job
    • Only impacts SAM monitoring occasionally
    • May be impacting users more than SAM, but we cannot tell from our logs
    • Logged a GGUS ticket but no resolution
  • RB running very busy for extended periods during the summer:
    • Second RB (rb02) added early November, but there is no transparent way of advertising it – UIs need manual configuration (see GRIDPP wiki)
  • Jobs found to abort on rb01 – linked to the size of the database
    • Database needed cleaning (was over 8GB)
  • Job cancels may (but not reproducibly) break the RB (RB may go 100% CPU bound) – no fix to this ticket
RB Load

(Chart: RB load over time – rb02 deployed; rb02 high CPU load; drained to fix hardware.)

Top Level BDII
  • Top-level BDII not reliably responding to queries
    • Query rate too high
    • UK sites failing SAM tests for extended periods
  • Upgraded the BDII to two servers on DNS round robin
    • Sites still occasionally failed SAM tests
  • Upgraded the BDII to 3 servers (last Friday)
    • Hope the problem is fixed – please report timeouts
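From the client side, a DNS round-robin alias simply yields several addresses; a sketch of trying each in turn until one answers (hostnames hypothetical; 2170 is the standard top-level BDII LDAP port):

```python
# Sketch of client-side behaviour against a DNS round-robin BDII alias:
# try each resolved host until one accepts a connection within a timeout.
# Hostnames are hypothetical; 2170 is the standard BDII LDAP port.
import socket

def first_responding(hosts, port=2170, timeout=5.0):
    """Return the first host accepting a TCP connection, else None."""
    for host in hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host
        except OSError:
            continue
    return None
```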
dCache
  • Reasonably reliable service
    • Based on a single server
    • Monitoring and automation to watch for problems
  • At next upgrade (soon) move from single server to two pairs:
    • One pair will handle transfer agents
    • One pair will handle web front end.
  • Problems with gridftp doors hanging
    • Partly helped by changes to network tuning
    • But still impacts SAM tests (and experiments). Decided to move the SAM CE replica-manager test from dCache to CASTOR (a cynical manoeuvre to help SAM)
  • Had hoped this month's upgrade to version 1.7 would resolve the problem
    • It didn't help
    • Have now upgraded all gridftp doors to Java 1.5. No hangs since the upgrade last Thursday.
CASTOR
  • Autumn 2005/Winter 2005:
    • Decide to migrate tape service to CASTOR
    • Decision that CASTOR will eventually replace dCache for disk pool management - CASTOR2 deployment starts
  • Spring/Summer 2006: Major effort to deploy and understand CASTOR
    • Difficult to establish a stable pre-production service
    • Upgrades extremely difficult to make work – test service down for weeks at a time following upgrade or patching.
  • September 2006:
    • Originally planned to have a full production service
    • Eventually – after heroic effort – the CASTOR team established a pre-production service for CSA06
  • October 2006:
    • But we don't have any disk – have to – BIG THANK YOU PPD!
    • CASTOR performed well in CSA06
  • November/December: worked on a CASTOR upgrade but eventually failed to upgrade
  • January 2007: declared the CASTOR service production quality
  • Feb/March 2007:
    • Continued work with CMS as they expand the range of tasks expected of CASTOR – significant load-related operational issues identified (e.g. CMS merge jobs causing LSF meltdown)
    • Started work with ATLAS, LHCB and MINOS on migration to CASTOR
CASTOR Layout

(Diagram not captured in the transcript.)

Service Classes

(Diagram not captured in the transcript.)
SL4 and gLite
  • Preparing to migrate some batch workers to SL4 for experiment testing.
  • Some gLite testing (on SL3) already underway, but becoming increasingly nervous about the risks associated with late deployment of the forthcoming SL4 gLite release
Grid Only
  • Long-standing milestone that the Tier-1 will offer a "Grid Only" service by the end of August 2007
  • Discussed at the January UB; considerable discussion WRT what "Grid Only" means
  • Basic target confirmed by the Tier-1 board, but details still to be fixed WRT exactly what remains as needed
Summary
  • Last year was a tough year, but we eventually made good progress
    • A lot of problems encountered
    • A lot accomplished
  • This year focus will be on:
    • Establishing a stable CASTOR service that meets the needs of the experiments
    • Deploying required releases of SL4/gLite
    • Meeting (hopefully exceeding) availability targets
    • Hardware ramp up as we move towards GRIDPP3