RAL Site Report

HEPiX 20th Anniversary

Fall 2011, Vancouver

24-28 October

Martin Bly, STFC-RAL


Overview

  • General

  • Hardware

  • Storage

  • Networking

RAL Site Report - HEPiX Spring 2011


General

  • New CEO for STFC

    • John Womersley takes over from Keith Mason on 1st November

      • To 31st March 2015

  • Staffing @ Tier1

    • 5 staff posts open due to staff moving

    • Replacements agreed despite restrictions

    • Recruitments underway

  • Power

    • ‘Partial Discharge’ (arcing) detected in 11kV bus in transformer room

    • Isolated to the join between two bus segments (bus-coupler)

    • Loose bolt in bus bar identified and tightened up – fixed



    Hardware changes

    • Summary of previous report:

      • 13 x Dell R610 tape servers (10GbE) for T10KC drives

      • 14 x T10KC tape drives

      • Arista 7124S 24-port 10GbE switch + twinax copper interconnects

      • 5 x Avaya 5650 switches + various 10/100/1000 switches

    • New since May

      • Various Dell R510s as small data servers for the Facilities Data Service, which provides interfaces into Castor for RAL site facilities and others.

      • 68 x 40TB 4U servers ordered for capacity storage – two suppliers

        • 10GbE, 2TB HDD, single CPU, 24GB RAM, 2.66PB total

        • Note that disks may be hard to get 

      • 15,000 HEP-SPEC tender completed evaluation, result just announced

    • To come

      • 40GbE/10GbE and 10GbE/1GbE switches, management switches, more tape servers, T10KC tape drives and tapes, iSCSI arrays, ...

    • Gone: 22 x 10TB servers - 2005 generation

    • To go: 86 x 6TB servers – 2006 generation
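    The capacity figure quoted for the new disk servers can be checked with a little arithmetic. The sketch below is only a sanity check, not part of the original slides: it assumes the "2.66PB total" was obtained by dividing the terabyte total by 1024 rather than 1000, which is a guess that happens to reproduce the quoted number. The variable names are illustrative.

    ```python
    # Sanity check of the storage order: 68 servers at 40 TB each.
    SERVERS = 68          # from the slide
    TB_PER_SERVER = 40    # from the slide

    total_tb = SERVERS * TB_PER_SERVER

    # Two ways of expressing the total in PB. Which one the slide used is
    # an assumption; dividing by 1024 matches the quoted 2.66 PB.
    total_pb_div_1000 = total_tb / 1000
    total_pb_div_1024 = total_tb / 1024

    print(f"Total: {total_tb} TB")
    print(f"  / 1000 -> {total_pb_div_1000:.2f} PB")
    print(f"  / 1024 -> {total_pb_div_1024:.2f} PB")
    ```

    Dividing by 1000 instead gives 2.72 PB, so the two conventions differ by about 2% at this scale.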



    Storage Issues

    • Issue with some 3ware controllers throwing out perfectly healthy WD drives

      • Due to firmware not recognising and handling failure mode on newer WD drives of the same model

      • Firmware update has fixed this, rollout completed

    • Issue with Adaptec controllers and StorageManager software

      • SM reports many SMART errors when drives are healthy

        • reports unhealthy ones too

      • Firmware update has fixed this, rolling out shortly

    • Problem with T10KC drives

      • Early production batch issue

      • Firmware fix

      • No recurrence

    • Production storage now using the most recent sets of hardware, with older (smaller-capacity) hardware held as ‘spinning reserve’



    Castor Status

    • Castor manages disk and tape storage

      • 18 million files (at Oct 2011)

    • Recent news:

      • Moved to T10KC tape media in production in September (Atlas, LHCb)

      • New (non-Tier1) production instance for Diamond synchrotron

        • Part of a new, complete Facilities Data Service which provides transparent data aggregation (StorageD), a metadata service (ICAT), and web (TopCAT) and FUSE frontends to access data

    • Coming up (Jan-Mar):

      • Move to new database hardware and a more resilient architecture (using DataGuard) over the next 6 months

      • Major upgrade of CASTOR with a new optimized scheduler and new tape functionality – better for small files

      • New service ‘head nodes’ in test: Dell R410 and Transtec



    Networking

    • WAN

      • UK NREN JANET now has a 100Gb/s backbone.

      • Funding for the next upgrade of the NREN, SuperJANET6, has recently been approved

    • Site

      • Sporadic packet loss in site core networking (few %)

        • Still present to a very small degree – intermittent problems with access to LFC dropping for remote users (T2s). May be load related.

    • Asymmetric Data Transfer rates in/out of Tier1

      • Many possible causes: load, FTS settings, disk server settings, TCP/IP tuning, network (LAN and WAN) performance

      • Have modified FTS settings with some success

      • Looking at Tier1-UK Tier2 transfers

    • LAN

      • Another failed 10GbE XFP transceiver, and a death in service of a Nortel 5510

      • Three subnets in use for Tier1

      • Lots of packet discards into stacks, investigating...

    • Developments

      • Looking to provide large bandwidth in Tier1 core with ‘mesh-type’ arrangement linked at multiple 40Gb/s with storage connectivity at 10Gb/s.
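    To illustrate why packet loss and TCP/IP tuning both appear among the possible causes of poor transfer rates, the sketch below uses the well-known Mathis approximation for the steady-state throughput of a single loss-limited TCP flow: rate ≤ (MSS / RTT) × (C / √p), with C ≈ √(3/2). It is not from the original slides. The MSS and round-trip time are hypothetical values chosen for illustration, not RAL measurements, and the model is a rough bound for classic Reno-style congestion control only.

    ```python
    import math


    def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(3 / 2)):
        """Approximate upper bound (bits/s) for one loss-limited TCP flow."""
        if not 0 < loss_rate <= 1:
            raise ValueError("loss_rate must be in (0, 1]")
        if rtt_s <= 0 or mss_bytes <= 0:
            raise ValueError("rtt_s and mss_bytes must be positive")
        return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


    # Hypothetical path parameters (assumptions, not measured values).
    MSS_BYTES = 1460   # typical Ethernet MSS
    RTT_S = 0.010      # 10 ms round trip

    for loss in (0.0001, 0.01, 0.03):
        rate = mathis_throughput_bps(MSS_BYTES, RTT_S, loss)
        print(f"loss {loss:>7.2%}: ~{rate / 1e6:.1f} Mb/s per flow")
    ```

    Under this model throughput falls with the square root of the loss rate, so a hundredfold increase in loss cuts the per-flow bound by a factor of ten regardless of link speed.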



    Databases

    • Small but significant Oracle installation

      • Castor, 3D, LFC, FTS

    • Castor database server hardware to be replaced

      • Old: 2 x 5-node (32bit) RACs, EMC AX4 arrays

      • New: 2 pairs of 3-node (64bit) RACs, EMC AX4 + Infortrend Arrays

      • Different ASM architecture – single volumes rather than paired

      • Dataguard from Production RAC to Standby RAC for resilience

      • Standby RACs in different building

      • Backups off the Standby set

    • LFC/FTS

      • Standby set to be added to the existing setup: Dataguard and backup as per Castor, single-volume data, ASM volume architecture changes

    • 3D

      • ASM volume architecture changes



    Virtualisation

    • Evaluated MS Hyper-V as the services virtualisation platform

      • Beginning to roll out local-storage virtualisation for services that don’t need fast failover

    • Struggled for a long time with iSCSI storage arrays (and poor support)

      • New iSCSI arrays ordered

      • To support fast-failover etc

    • Cloud project

      • Department initiative looking at cloud use

        • Talk by Ian Collier



    Projects

    • Quattor

      • Batch and Storage systems under Quattor management

        • ~6200 cores, 700+ systems (batch), 500+ systems (storage)

        • Significant time saving

      • Significant rollout on Grid services node types

    • CernVM-FS

      • Major deployment at RAL to cope with software distribution issues

      • More news in talk by Ian Collier later this week



    Questions?
