
Tier1 Report

HEPSysMan @ Cambridge

23rd October 2006

Martin Bly


Overview

  • Tier-1

  • Hardware changes

  • Services

RAL Tier-1

  • RAL hosts the UK WLCG Tier-1

    • Funded via GridPP2 project from PPARC

    • Supports WLCG and UK Particle Physics users and collaborators

      • VOs:

        • LHC: Atlas, CMS, LHCb, Alice, (dteam, ops)

        • BaBar, CDF, D0, H1, Zeus

        • bio, cedar, esr, fusion, geant4, ilc, magic, minos, pheno, t2k, …

      • Other experiments:

        • Mice, SNO, UKQCD

      • Theory users

Staff / Finance

  • Bid to PPARC for ‘GridPP3’ project

    • For exploitation phase of LHC

    • September 2007 to March 2011

    • Increase in staff and hardware resources

    • Result early 2007

  • Tier-1 is recruiting

    • 2 x systems admins, 1 x hardware technician

    • 1 x grid deployment

    • Replacement for Steve Traylen to head grid deployment and user support group

  • CCLRC internal reorganisation

    • Business Units

      • The Tier-1 service is run by the e-Science department, which is now part of the Facilities Business Unit (FBU)

New building

  • Funding approved for a new computer centre building

    • 3 floors

      • Computer rooms on ground floor, offices above

    • 240 m² low power density room

      • Tape robots, disk servers etc.

      • Minimum heat density 1.0 kW/m², rising to 1.6 kW/m² by 2012

    • 490 m² high power density room

      • Servers, CPU farms, HPC clusters

      • Minimum heat density 1.8 kW/m², rising to 2.8 kW/m² by 2012

  • UPS computer room

    • 8 racks + 3 telecoms racks

    • UPS system to provide continuous power of 400 A / 92 kVA three-phase for equipment plus power to air conditioning (total approx. 800 A / 184 kVA)

  • Overall

    • Space for 300 racks (+ robots, telecoms)

    • Power: 2,700 kVA initially, max 5,000 kVA by 2012 (incl. air-con); see the sketch after this list for a rough arithmetic check

    • UPS capacity to meet an estimated 1000 A / 250 kVA for 15-20 minutes, enough for specific hardware to ride out short breaks or shut down cleanly

  • Shared with HPC and other CCLRC computing facilities

  • Planned to be ready by summer 2008
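A rough sanity check on the figures quoted above, as a minimal sketch. The conversion of the amp figures to kVA assumes they are totals at 230 V per phase (UK mains); that assumption is mine, not stated on the slide.

    # Rough sanity check of the new-building figures quoted above.
    # Assumption (not from the slide): quoted currents are totals at 230 V per phase.

    LOW_DENSITY_AREA_M2 = 240    # low power density room
    HIGH_DENSITY_AREA_M2 = 490   # high power density room

    # Heat densities quoted for 2012 (kW/m²)
    low_load_kw = LOW_DENSITY_AREA_M2 * 1.6    # ~384 kW
    high_load_kw = HIGH_DENSITY_AREA_M2 * 2.8  # ~1372 kW
    it_load_kw = low_load_kw + high_load_kw    # ~1756 kW of IT heat load by 2012,
                                               # well inside the 5,000 kVA 2012 maximum

    # UPS room: 400 A for equipment, ~800 A including air conditioning.
    equipment_kva = 400 * 230 / 1000   # 92 kVA, matching the quoted figure
    total_kva = 800 * 230 / 1000       # 184 kVA, matching the quoted figure

    print(f"IT heat load by 2012: ~{it_load_kw:.0f} kW")
    print(f"UPS room: {equipment_kva:.0f} kVA equipment, {total_kva:.0f} kVA total")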

Hardware changes

  • FY05/06 capacity procurement March 06

    • 52 x 1U twin dual-core AMD 270 units

      • Tyan 2882 motherboard

      • 4GB RAM, 250GB SATA HDD, dual 1Gb NIC

      • 208 job slots, 200kSI2K

      • Commissioned May 06, running well

    • 21 x 5U 24-bay disk servers

      • 168TB (210TB) data capacity (arithmetic in the sketch after this list)

      • Areca 1170 PCI-X 24-port controller

      • 22 x 400GB (500GB) SATA data drives, RAID 6

      • 2 x 250GB SATA system drives, RAID 1

      • 4GB RAM, dual 1Gb NIC

      • Commissioning delayed (more…)
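The 168TB / 210TB figures follow from the per-server RAID 6 layout quoted above. A minimal sketch of that arithmetic (decimal terabytes, two drives' worth of parity per array):

    # Usable capacity of the March 06 disk procurement (decimal TB).
    # RAID 6 gives up two drives' worth of capacity per array to parity.

    servers = 21
    data_drives_per_server = 22
    raid6_parity_drives = 2

    def usable_tb(drive_tb):
        per_server = (data_drives_per_server - raid6_parity_drives) * drive_tb
        return servers * per_server

    print(usable_tb(0.4))  # 400GB drives -> 168 TB, as quoted
    print(usable_tb(0.5))  # 500GB drives -> 210 TB, the figure in brackets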

Hardware changes (2)

  • FY 06/07 capacity procurements

    • 47 x 3U 16-bay disk servers: 282TB data capacity

      • 3Ware 9550SX-16ML PCI-X 16-port SATA RAID controller

      • 14 x 500GB SATA data drives, RAID 5

      • 2 x 250GB SATA system drives, RAID 1

      • Twin dual-core Opteron 275 CPUs, 4GB RAM, dual 1Gb NIC

      • Delivery expected October 06

    • 64 x 1U twin dual-core Intel Woodcrest 5130 units (550kSI2K)

      • 4GB RAM, 250GB SATA HDD, dual 1Gb NIC

      • Delivery expected November 06

  • Upcoming in FY 06/07:

    • Further 210TB disk capacity expected December 06

      • Same spec as above

    • High Availability systems with UPS

      • Redundant PSUs, hot-swap paired HDDs etc

    • AFS replacement

    • Enhancement to Oracle services (disk arrays or RAC servers)

Hardware changes (3)

  • SL8500 tape robot

    • Expanded from 6,000 to 10,000 slots

    • 10 drives shared between all users of service

    • Additional 3 x T10K tape drives for PP

    • More when the CASTOR service is working

  • STK Powderhorn

    • Decommissioned and removed

Storage commissioning

  • Problems with March 06 procurement:

    • WD4000YR on Areca 1170, RAID 6

      • Many instances of multiple drive dropouts

      • Unwarranted drive dropouts, followed by re-integration of the same drives

    • Drive electronics (ASIC) on 4000YR (400GB) units changed with no change of model designation

      • We got the updated units

    • Firmware updates to Areca cards did not solve the issues

    • WD5000YS (500GB) units swapped-in by WD

      • Fixes most issues but…

    • Status data and logs from the drives show several additional problems

      • Testing under high load to gather statistics

    • Production further delayed

Air-con issues

  • Setup

    • 13 x 80 kW units in the lower machine room; several work together as pairs

  • Several ‘hot’ days (for the UK) in July

    • Sunday: dumped ~70 jobs

      • Alarm system failed to notify operators

      • Pre-emptive automatic shutdown not triggered

      • Ambient air temp reached >35°C, machine exhaust temperature >50°C!

      • HPC services not so lucky

    • Mid week 1: problems over two days

      • Attempts to cut load by suspending batch services to protect data services

      • Forced to dump 270 jobs

    • Mid week 2: two hot days predicted

      • Pre-emptive shutdown of batch services in the lower machine room (see the sketch after this list)

      • No jobs lost, data services remained available

  • Problem

    • High ambient air temperature tripped high pressure cut-outs in refrigerant gas circuits

    • Cascade failure as individual air-con units work harder

    • Loss of control of machine room temperature

  • Solutions

    • Sprinklers under units

      • Successful but banned due to Health and Safety concerns

    • Up-rated refrigerant gas pressure settings to cope with higher ambient air temperature
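In outline, the pre-emptive shutdown mentioned above is a temperature watchdog that drains batch work before the room overheats, so data services can stay up. A minimal sketch, assuming a hypothetical read_ambient_temp() sensor helper and Torque's qstop command for pausing queues; the threshold and queue names are illustrative, not the Tier-1's actual settings.

    # Illustrative temperature watchdog: pause batch queues before the room
    # overheats, rather than dumping running jobs in an emergency.

    import subprocess
    import time

    PAUSE_THRESHOLD_C = 30.0   # illustrative trip point
    CHECK_INTERVAL_S = 60

    def read_ambient_temp():
        # Hypothetical helper: return machine-room ambient temperature in °C,
        # e.g. from an environmental monitoring unit.
        raise NotImplementedError

    def pause_batch_queues(queues=("prod", "short")):
        for q in queues:
            # Torque/PBS 'qstop' stops a queue from starting new jobs;
            # already-running jobs are left to finish.
            subprocess.run(["qstop", q], check=True)

    def watchdog():
        while True:
            if read_ambient_temp() > PAUSE_THRESHOLD_C:
                pause_batch_queues()   # in practice, also alert the operators
                break
            time.sleep(CHECK_INTERVAL_S)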

Operating systems

  • Grid services, batch workers, service machines

    • SL3, mainly 3.0.3, 3.0.5, 4.2, all ix86

    • SL4 before Xmas

      • Considering x86_64

  • Disk storage

    • SL4 migration in progress

  • Tape systems

    • AIX: caches

    • Solaris: controller

    • SL3/4: CASTOR systems, newer caches

  • Oracle systems

    • RHEL3/4

  • Batch system

    • Torque/MAUI

      • Fair-shares, with allocations set by the User Board (illustrated in the sketch after this list)
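For context, fair-share in MAUI-style schedulers adjusts a group's priority according to its decayed historical usage versus its target share. A minimal sketch of that general calculation; the decay factor, window depth, weight and example numbers are illustrative, not the Tier-1's configuration.

    # Illustrative decayed fair-share calculation, MAUI-style.
    # usage_windows[0] is the most recent accounting window (e.g. one day each).

    def fairshare_priority(usage_windows, total_windows, target_share,
                           decay=0.8, weight=100.0):
        """Positive result boosts a group running under its target share,
        negative result penalises one running over it."""
        used = sum(u * decay**i for i, u in enumerate(usage_windows))
        total = sum(t * decay**i for i, t in enumerate(total_windows))
        actual_share = used / total if total else 0.0
        return weight * (target_share - actual_share)

    # Example: a VO with a 25% target that has recently used ~36% of the farm
    # gets a negative adjustment, letting under-served VOs' jobs start first.
    print(fairshare_priority([400, 350, 300], [1000, 1000, 1000], 0.25))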

Databases

  • 3D project

    • Participating since early days

      • Single Oracle server for testing

      • Successful

    • Production service

      • 2 x Oracle RAC clusters

        • Two servers per RAC

          • Redundant PSUs, hot-swap RAID1 system drives

        • Single SATA/FC data array

        • Some transfer rate issues

        • UPS to come

Storage Resource Management

  • dCache

    • Performance issues

      • LAN performance very good

      • WAN performance and tuning problems

    • Stability issues

    • Now better:

      • Increased the number of open file descriptors (see the sketch after this list)

      • Increased the number of logins allowed

  • ADS

    • In-house system many years old

      • Will remain for some legacy services

  • CASTOR2

    • Replace both dCache disk and tape SRMs for major data services

    • Replace T1 access to existing ADS services

    • Pre-production service for CMS

    • LSF for transfer scheduling
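On the dCache tuning noted above: the per-process open-file-descriptor limit can be inspected and raised up to the hard limit. A minimal sketch using Python's resource module; the 65536 target is illustrative, and since dCache itself is a Java service the real change belongs in its startup environment or limits.conf.

    # Illustrative check/raise of the per-process open file descriptor limit,
    # the kind of limit that can cap concurrent dCache transfers.

    import resource

    def raise_nofile_limit(target=65536):
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        # An unprivileged process may raise its soft limit only up to the hard limit.
        new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
        if new_soft > soft:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        return resource.getrlimit(resource.RLIMIT_NOFILE)

    print(raise_nofile_limit())   # (soft, hard) after any adjustment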

Monitoring

  • Nagios

    • Production service implemented

    • 3 servers (1 master + 2 slaves)

    • Almost all systems covered

      • 600+

    • Replacing SURE

    • Add call-out facilities

Networking

  • All systems have 1Gb/s connections

    • Except oldest fraction of the batch farm

  • 10Gb/s links almost everywhere

    • 10Gb/s backbone within Tier-1

      • Complete November 06

      • Nortel 5530/5510 stacks

    • 10Gb/s link to RAL site backbone

      • 10Gb/s backbone links at RAL expected end November 06

      • 10Gb/s link to RAL Tier-2

    • 10Gb/s link to UK academic network SuperJanet5 (SJ5)

      • Expected in production by end of November 06

      • Firewall still an issue

        • Planned bypass for Tier1 data traffic as part of RAL<->SJ5 and RAL backbone connectivity developments

    • 10Gb/s OPN link to CERN active

      • September 06

      • Using pre-production SJ5 circuit

      • Production status at SJ5 handover

Security

  • Notified of intrusion at Imperial College London

  • Searched logs

    • Unauthorised use of an account from a suspect source

    • Evidence of harvesting password maps

    • No attempt to conceal activity

    • Unauthorised access to other sites

    • No evidence of root compromise

  • Notified sites concerned

    • Incident widespread

  • Passwords changed

    • All inactive accounts disabled

  • Cleanup

    • Changed NIS to use shadow password map

    • Reinstall all interactive systems
