Tier-1 Status

Andrew Sansum

GRIDPP18

21 March 2007


Staff Changes

  • Steve Traylen left in September

  • Three new Tier-1 staff:

    • Lex Holt (Fabric Team)

    • James Thorne (Fabric Team)

    • James Adams (Fabric Team)

  • One EGEE-funded post to operate a PPS (and work on integration with NGS):

    • Marian Klein


Team Organisation

[Organisation chart]

  • Grid Services (Grid/Support): Ross, Condurache, Hodges, Klein (PPS), one vacancy
  • Fabric (H/W and OS): Bly (team leader), Wheeler, Holt, Thorne, White (OS support), Adams (HW support)
  • CASTOR SW/Robot: Corney (GL), Strong (Service Manager), Folkes (HW Manager), deWitt, Jensen, Kruk, Ketley, Bonnet – 2.5 FTE effort
  • Also: Project Management (Sansum/Gordon/(Kelsey)), Database Support (Brown), Machine Room operations, Networking Support


Hardware Deployment - CPU

  • 64 dual-core/dual-CPU Intel Woodcrest 5130 systems delivered in November (about 550 KSI2K)

  • Completed acceptance tests over Christmas; went into production in mid-January

  • CPU farm capacity now (approximately; a rough arithmetic cross-check follows this list):

    • 600 systems

    • 1250 cores

    • 1500 KSI2K
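
As a rough cross-check of the figures above (a sketch only; the per-core rating is inferred from these numbers, not stated on the slide):

```python
# Rough capacity arithmetic from the quoted figures.
systems_new = 64                    # Woodcrest 5130 systems delivered in November
cores_new = systems_new * 2 * 2     # dual CPU x dual core = 256 cores
ksi2k_new = 550                     # quoted capacity of that delivery

per_core = ksi2k_new / cores_new    # ~2.15 KSI2K per new core (inferred)

farm_cores, farm_ksi2k = 1250, 1500
print(f"new delivery: {cores_new} cores, ~{per_core:.2f} KSI2K/core")
print(f"whole farm:   ~{farm_ksi2k / farm_cores:.2f} KSI2K/core on average "
      "(older cores pull the average down)")
```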


Hardware Deployment - Disk

  • 2006 was a difficult year with deployment hold-ups:

    • March 2006 delivery: 21 servers, Areca RAID controller – 24*400GB WD (RE2) drives. Available: January 2007

    • November 2006 delivery: 47 servers, 3Ware RAID controller – 16*500GB WD (RE2). Accepted February 2007 (but still deploying to CASTOR)

    • January 2007 delivery: 39 servers, 3Ware RAID controller – 16*500GB WD (RE2). Accepted March 2007. Ready to deploy to CASTOR


Disk Deployment - Issues

  • March 2006 (Clustervision) delivery:

    • Originally delivered with 400GB WD400YR drives

    • Many drive ejects under normal load test (had worked OK when we tested in January).

    • Drive specification found to have changed – compatibility problems with RAID controller (despite drive being listed as compatible)

    • Various firmware fixes tried – improvements, but not a fix.

    • August 2006: WD offered to replace the drives with the 500YS model.

    • September 2006 – load tests of the new configuration began to show occasional (but unacceptably frequent) drive ejects (a different problem).

    • Major diagnostic effort by Western Digital; Clustervision also trying various fixes. Lots of theories – vibration, EM noise, protocol incompatibility – and various fixes tried (progress slow as the failure rate was quite low).

    • Fault hard to trace, but eventually traced (early December) to faulty firmware.

    • Firmware updated and load testing showed the problem fixed (mid-December). Load test completed in early January and deployment began.


Disk Deployment - Cause

  • Western Digital worked at 2 sites – logic analysers on the SATA interconnect.

  • Eventually the fault was traced to a “missing return” in the firmware:

    • If the drive head stays too long in one place, it repositions to allow lubricant to migrate.

    • Only shows up under certain work patterns

    • No return is issued following the reposition, so 8 seconds later the controller ejects the drive (see the illustrative sketch below)
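
Purely as an illustration of the failure mode (the real firmware is proprietary and not shown here): one code path repositions the head but never reports completion, so the RAID controller's timeout expires and the drive is ejected. A toy sketch:

```python
CONTROLLER_TIMEOUT_S = 8  # per the slide: the controller ejects the drive after 8 s

def handle_command(head_idle_too_long: bool):
    """Toy model of the buggy firmware path (illustrative only)."""
    if head_idle_too_long:
        reposition_head_for_lubricant_migration()
        # BUG: missing "return 'COMPLETED'" here, so this path never
        # reports completion back to the controller.
    else:
        return "COMPLETED"

def reposition_head_for_lubricant_migration():
    pass  # stand-in for the real head-repositioning routine

def controller_view(status):
    """What the RAID controller sees for each command."""
    if status is None:
        return f"no completion within {CONTROLLER_TIMEOUT_S}s -> drive ejected"
    return "ok"

print(controller_view(handle_command(head_idle_too_long=True)))   # drive ejected
print(controller_view(handle_command(head_idle_too_long=False)))  # ok
```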


Disk Deployment


Hardware Deployment - Tape

  • SL8500 tape robot upgraded to 10000 slots in August 2006.

  • GRIDPP bought 3 additional T10K tape drives in February 2007 (now 6 drives owned by GRIDPP)

  • Further purchase of 350TB of tape media in February 2007.

  • Total tape capacity is now 850-900TB (but not all is immediately allocated – some will assist the CASTOR migration and some is needed for CASTOR operations).


Hardware Deployment - Network

  • 10Gb/s line from CERN available in August 2006

  • RAL was scheduled to attach to the Thames Valley Network (TVN) at 10Gb/s by November 2006:

    • Change of plan in November – I/O rates from the Tier-1 were already visible to UKERNA. Decided to connect the Tier-1 by a 10Gb/s resilient connection direct into the SJ5 core (planned for mid Q1)

    • Connection delayed but now scheduled for end of March

  • GRIDPP load tests identified several issues at the RAL firewall (a toy throughput-test sketch follows this list). These were resolved, but the plan is now to bypass the firewall for SRM traffic from SJ5.

  • A number of internal Tier-1 topology changes while we have enhanced the LAN backbone to 10Gb/s in preparation for SJ5
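
The slide does not say how the load tests were performed (iperf-style memory-to-memory transfers between hosts on either side of the firewall are the usual approach). Purely as an illustration, a minimal memory-to-memory throughput test; the port number and hostname are made up:

```python
import socket
import time

PORT = 5001                  # arbitrary test port (assumption)
CHUNK = 1024 * 1024          # 1 MiB per send
TOTAL_BYTES = 500 * CHUNK    # 500 MiB per test run

def serve():
    """Run on the receiving host: accept one connection and discard the data."""
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            while conn.recv(CHUNK):
                pass

def send(host):
    """Run on the sending host: time a memory-to-memory transfer."""
    chunk = b"\0" * CHUNK
    start = time.time()
    with socket.create_connection((host, PORT)) as sock:
        sent = 0
        while sent < TOTAL_BYTES:
            sock.sendall(chunk)
            sent += len(chunk)
    print(f"{sent * 8 / (time.time() - start) / 1e9:.2f} Gb/s")

# e.g. serve() on one side of the firewall, then send("receiver.example.org") on the other.
```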


Tier-1 LAN

[Network diagram: the Tier-1 LAN is built from 5510/5530 switch stacks (2 x 5510 + 5530, 3 x 5510 + 5530, 4 x 5530, 5 x 5510 + 5530, 6 x 5510 + 5530, plus a 5510 and a 5530) connecting CPUs + disks, the ADS caches and the Oracle systems; Router A with 10Gb/s links and a 1Gb/s link to SJ4; an OPN router with 10Gb/s to CERN; the RAL Tier-2 attached at N x 1Gb/s.]


New Machine Room

  • Tender underway, planned completion: August 2008

  • 800 m² can accommodate 300 racks + 5 robots

  • 2.3MW power/cooling capacity (some UPS) – see the rough per-rack arithmetic after this list

  • Office accommodation for all E-Science staff

  • Combined Heat and Power Generation (CHP) on site

  • Not all of it is for GRIDPP (but you get most of it)!
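
For a rough sense of scale from the quoted figures (a sketch; the even split across racks is my assumption, not from the slide):

```python
# Rough per-rack arithmetic from the quoted machine-room figures.
power_mw, racks, area_m2 = 2.3, 300, 800

print(f"~{power_mw * 1000 / racks:.1f} kW of power/cooling per rack if spread evenly")
print(f"~{area_m2 / racks:.1f} m^2 of floor area per rack (ignoring robots and aisles)")
```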


Tier-1 Capacity delivered to WLCG (2006)


Last 12 months CPU Occupancy

[Chart: CPU occupancy over the last 12 months; capacity additions marked at +260 KSI2K (May 2006) and +550 KSI2K (January 2007).]


Recent CPU Occupancy (4 weeks)

[Chart: CPU occupancy over the last 4 weeks; air-conditioning work left 300 KSI2K offline.]


CPU Efficiencies


CPU Efficiencies

[Chart annotations:]

  • CMS merge jobs – hang on CASTOR
  • ATLAS/LHCB jobs hanging on dCache
  • Babar jobs running slow – reason unknown


3D Service

  • Used by ATLAS and LHCB to distribute conditions data via Oracle Streams

  • RAL is one of 5 sites that deployed a production service during Phase I.

  • Small SAN cluster – 4 nodes, 1 Fibre Channel RAID array.

  • RAL takes a leading role in the project.


Reliability

  • Reliability matters to the experiments.

    • Use the SAM monitoring to identify priority areas

    • Also worrying about job loss rates

  • Priority at RAL to improve reliability:

    • Fix the faults that degrade our SAM availability

    • New exception-monitoring and automation system based on Nagios (a minimal check-plugin sketch follows this list)

  • Reliability is improving, but work feels like an endless treadmill. Fix one fault and find a new one.
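
Nagios check plugins follow a simple contract: print one status line and exit 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The slide gives no details of the actual checks used at RAL, so the example below (timing an LDAP query against an information provider) and its host, port and search-base values are illustrative assumptions only:

```python
#!/usr/bin/env python
"""Minimal Nagios-style check: does the info provider answer within a timeout?

Illustrative sketch only -- the host, port and LDAP base are assumptions,
not the configuration actually used at RAL.
"""
import subprocess
import sys
import time

HOST = "ce.example.ac.uk"   # hypothetical CE / site BDII host
PORT = 2170                 # common BDII LDAP port (assumption)
BASE = "o=grid"             # common GLUE information base (assumption)
WARN_S, CRIT_S = 10, 30

start = time.time()
try:
    subprocess.run(
        ["ldapsearch", "-x", "-H", f"ldap://{HOST}:{PORT}", "-b", BASE],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        timeout=CRIT_S, check=True,
    )
except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
    print(f"CRITICAL: info provider query failed or took longer than {CRIT_S}s")
    sys.exit(2)

elapsed = time.time() - start
if elapsed > WARN_S:
    print(f"WARNING: info provider slow ({elapsed:.1f}s)")
    sys.exit(1)
print(f"OK: info provider answered in {elapsed:.1f}s")
sys.exit(0)
```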


Reliability - CE

  • Split the PBS server and CE a long time ago

  • Split CE and local BDII

  • Site BDII times out on CE info provider

    • CPU usage very high on the CE – info provider “starved”

    • Upgraded CE to 2 cores.

  • Site BDII still times out on CE info provider

    • CE system disk I/O bound

    • Reduced load (changed backups etc.)

    • Finally replaced system drive with faster model.


CE Load


Job Scheduling

  • SAM jobs failing to be scheduled by MAUI

    • SAM tests run under the operations VO but share a gid with dteam; dteam had used all of its resource, so MAUI started no more jobs

    • Changed the scheduling to favour the ops VO (long-term plan is to split ops and dteam)

  • PBS server hanging after communications problems

    • A job stuck in the pending state jams the whole batch system (no jobs start – site unavailable!)

    • Auto-detect the state of pending jobs and hold them – remaining jobs start and availability is good (see the sketch after this list)

    • But held jobs now impact the ETT and we receive less work from the RB – have to delete the held jobs
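
A minimal sketch of the kind of automation described above, assuming a Torque/PBS-style batch system where `qstat -f` reports job attributes and `qhold` places a hold on a job. The `looks_stuck` predicate is a placeholder, since the slide does not say how stuck pending jobs were recognised:

```python
import subprocess

def pbs_jobs():
    """Parse `qstat -f` output into {job_id: {attribute: value}} (Torque-style format)."""
    out = subprocess.run(["qstat", "-f"], capture_output=True, text=True, check=True).stdout
    jobs, current = {}, None
    for line in out.splitlines():
        if line.startswith("Job Id:"):
            current = line.split("Job Id:", 1)[1].strip()
            jobs[current] = {}
        elif current and line.startswith("    ") and " = " in line:
            key, _, value = line.strip().partition(" = ")
            jobs[current][key] = value
    return jobs

def looks_stuck(attrs):
    """Placeholder: the real rule for spotting a stuck pending job is not given on the slide."""
    return False  # e.g. inspect attrs.get("job_state"), time spent queued, etc.

def hold_stuck_jobs():
    """Hold anything that looks stuck so the rest of the batch system keeps scheduling."""
    for job_id, attrs in pbs_jobs().items():
        if looks_stuck(attrs):
            subprocess.run(["qhold", job_id], check=False)
            print(f"held {job_id}")

if __name__ == "__main__":
    hold_stuck_jobs()
```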


Jobs de-queued at CE

  • Jobs reach the CE and are successfully submitted to the scheduler, but shortly afterwards the CE decides to de-queue the job.

    • Only impacts SAM monitoring occasionally

    • May be impacting users more than SAM but we cannot tell from our logs

    • Logged a GGUS ticket but no resolution


RB

  • The RB ran very busy for extended periods during the summer:

    • A second RB (rb02) was added in early November, but there is no transparent way of advertising it – UIs need to be configured manually (see the GRIDPP wiki).

  • Jobs found to abort on rb01, linked to the size of its database

    • Database needed cleaning (was over 8GB)

  • Job cancels may (but not reproducibly) break the RB (it may go 100% CPU bound) – no fix for this ticket yet.


RB Load

[Chart: RB CPU load over time, annotated: rb02 deployed; rb02 high CPU load; drained to fix hardware.]


Top Level BDII

  • Top level BDII not reliably responding to queries

    • Query rate too high

    • UK sites failing SAM tests for extended periods

  • Upgraded the BDII to two servers on a DNS round robin (see the sketch after this list)

    • Sites occasionally fail SAM test

  • Upgraded BDII to 3 servers (last Friday)

    • Hopefully the problem is fixed – please report timeouts.
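
For context, a DNS round robin just publishes several A records under one alias, so successive lookups rotate across the servers. A small client-side sketch (the alias name is hypothetical, not the real hostname):

```python
import socket

ALIAS = "lcg-bdii.example.ac.uk"  # hypothetical round-robin alias
PORT = 2170                       # common BDII LDAP port (assumption)

# getaddrinfo returns every address behind the alias; resolvers typically
# rotate their order, which spreads naive clients across the servers.
addresses = sorted({info[4][0] for info in socket.getaddrinfo(ALIAS, PORT, proto=socket.IPPROTO_TCP)})
print(f"{ALIAS} resolves to {len(addresses)} servers: {addresses}")

# A client that falls through to the next address on timeout also gains
# resilience against a single slow or dead BDII instance.
for addr in addresses:
    try:
        with socket.create_connection((addr, PORT), timeout=5):
            print(f"connected to {addr}")
            break
    except OSError:
        print(f"{addr} timed out, trying the next server")
```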


FTS

  • Reasonably reliable service

    • Based on a single server

    • Monitoring and automation to watch for problems

  • At the next upgrade (soon), move from a single server to two pairs:

    • One pair will handle transfer agents

    • One pair will handle web front end.


dCache

  • Problems with gridftp doors hanging

    • Partly helped by changes to network tuning

    • But it still impacts SAM tests (and experiments). Decided to move the SAM CE replica-manager test from dCache to CASTOR (a cynical manoeuvre to help SAM)

  • Had hoped this month’s upgrade to version 1.7 would resolve the problem

    • Didn’t help

    • Have now upgraded all gridftp doors to Java 1.5. No hangs since upgrade last Thursday.


SAM Availability


CASTOR

  • Autumn 2005/Winter 2005:

    • Decide to migrate tape service to CASTOR

    • Decision that CASTOR will eventually replace dCache for disk pool management - CASTOR2 deployment starts

  • Spring/Summer 2006: Major effort to deploy and understand CASTOR

    • Difficult to establish a stable pre-production service

    • Upgrades extremely difficult to make work – test service down for weeks at a time following upgrade or patching.

  • September 2006:

    • Originally we planned to have a full production service

    • Eventually – after heroic effort – the CASTOR team established a pre-production service for CSA06

  • October 2006

    • But we didn’t have any disk – had to borrow from PPD – BIG THANK YOU PPD!

    • CASTOR performs well in CSA06

  • November/December: worked on a CASTOR upgrade, but the upgrade eventually failed

  • January 2007: declared the CASTOR service production quality

  • Feb/March 2007:

    • Continuing work with CMS as they expand the range of tasks expected of CASTOR – significant load-related operational issues identified (e.g. CMS merge jobs cause LSF meltdown).

    • Starting work with ATLAS, LHCB and MINOS to migrate to CASTOR


CASTOR Layout

[Diagram: SRM 1 endpoints (ralsrma, ralsrmb, ralsrmc, ralsrmd, ralsrme, ralsrmf) map onto service classes (atlasD1T0prod, atlasD1T0test, atlasD1T0usr, atlasD0T1test, atlasD1T1, cmsFarmRead, cmswanout, CMSwanin, lhcbD1T0, tmpD0T1, prdD0T1), which sit on the underlying D1T0 and D0T1 disk pools.]


CMS


Phedex Rate to CASTOR (RAL Destination)


Phedex Rate to CASTOR (RAL Source)


SL4 and gLite

  • Preparing to migrate some batch workers to SL4 for experiment testing.

  • Some gLite testing (on SL3) is already underway, but we are becoming increasingly nervous about the risks associated with late deployment of the forthcoming SL4 gLite release


Grid Only

  • Long-standing milestone that the Tier-1 will offer a “Grid Only” service by the end of August 2007.

  • Discussed at January UB. Considerable discussion WRT what “Grid Only” means.

  • Basic target confirmed by the Tier-1 board, but details still to be fixed WRT exactly what remains needed.


Conclusions

  • Last year was tough, but we eventually made good progress.

    • A lot of problems encountered

    • A lot accomplished

  • This year focus will be on:

    • Establishing a stable CASTOR service that meets the needs of the experiments

    • Deploying required releases of SL4/gLite

    • Meeting (hopefully exceeding) availability targets

    • Hardware ramp up as we move towards GRIDPP3

