RAL Site Report

HEPiX 20th Anniversary

Fall 2011, Vancouver

24-28 October

Martin Bly, STFC-RAL

Overview
  • General
  • Hardware
  • Storage
  • Networking

RAL Site Report - HEPiX Spring 2011

General
  • New CEO for STFC
    • John Womersley takes over from Keith Mason on 1st November
        • To 31st March 2015
  • Staffing @ Tier1
    • 5 staff posts open due to staff moving
    • Replacements agreed despite restrictions
    • Recruitments underway
  • Power
    • ‘Partial Discharge’ (arcing) detected in 11kV bus in transformer room
    • Isolated to the join between two bus segments (bus-coupler)
    • Loose bolt in bus bar identified and tightened up – fixed

Hardware changes
  • Summary of previous report:
    • 13 x Dell R610 tape servers (10GbE) for T10KC drives
    • 14 x T10KC tape drives
    • Arista 7124S 24-port 10GbE switch + twinax copper interconnects
    • 5 x Avaya 5650 switches + various 10/100/1000 switches
  • New since May
    • Various Dell R510s as small data servers for the Facilities Data Service, which provides interfaces into Castor for RAL site facilities and others.
    • 68 x 40TB 4U servers ordered for capacity storage – two suppliers
      • 10GbE, 2TB HDD, single CPU, 24GB RAM, 2.66PB total
      • Note that disks may be hard to get 
    • 15,000 HEP-SPEC tender completed evaluation, result just announced
  • To come
    • 40GbE/10GbE and 10GbE/1GbE switches, management switches, more tape servers, T10KC tape drives and tapes, iSCSI arrays, ...
  • Gone: 22 x 10TB servers - 2005 generation
  • To go: 86 x 6TB servers – 2006 generation
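
The capacity figure for the 68-server order can be checked with a short calculation. This is a minimal sketch: the slides do not say which unit convention was used, so the 1024 TB/PB divisor below is an assumption made only because it reproduces the quoted 2.66PB; the function and variable names are illustrative.

```python
# Capacity check for the capacity-storage order (68 servers of 40TB each).
# Assumption: the quoted 2.66PB uses 1024 TB per PB; the slides do not say so.
SERVERS = 68
TB_PER_SERVER = 40


def total_capacity(servers: int, tb_per_server: int) -> dict:
    """Return the total capacity in TB and in PB under both conventions."""
    total_tb = servers * tb_per_server
    return {
        "total_tb": total_tb,
        "pb_decimal": total_tb / 1000,  # 1 PB = 1000 TB
        "pb_1024": total_tb / 1024,     # 1 PB = 1024 TB
    }


capacity = total_capacity(SERVERS, TB_PER_SERVER)
print(
    f"{capacity['total_tb']} TB = {capacity['pb_decimal']:.2f} PB (1000 TB/PB) "
    f"or {capacity['pb_1024']:.2f} PB (1024 TB/PB)"
)
```

With a divisor of 1000 the same order comes to 2.72PB, so the two conventions differ by about 60TB here.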

Storage Issues
  • Issue with some 3ware controllers throwing out perfectly healthy WD drives
    • Due to firmware not recognising and handling failure mode on newer WD drives of the same model
    • Firmware update has fixed this, rollout completed
  • Issue with Adaptec controllers and StorageManager software
    • SM reports many SMART errors when drives are healthy
      • reports unhealthy ones too
    • Firmware update has fixed this, rolling out shortly
  • Problem with T10KC drives
    • Early production batch issue
    • Firmware fix
    • No recurrence
  • Production storage now using most recent sets of hardware, with older (smaller capacity) hardware held as ‘spinning reserve’

Castor Status
  • Castor manages disk and tape storage
    • 18 million files (at Oct 2011)
  • Recent news:
    • Moved to T10KC tape media in production in September (Atlas, LHCb)
    • New (non-Tier1) production instance for Diamond synchrotron
      • Part of a new complete Facilities Data Service which provides transparent data aggregation (StorageD), a metadata service (ICAT), and web (TopCAT) and FUSE frontends to access data
  • Coming up (Jan-Mar):
    • Move to new database hardware and a more resilient architecture (using DataGuard) over next 6 months
    • Major upgrade of CASTOR with a new optimized scheduler and new tape functionality – better for small files
    • New service ‘head nodes’ in test: Dell R410 and Transtec

Networking
  • WAN
    • UK NREN JANET now has a 100Gb/s backbone.
    • Funding for the next upgrade of the NREN, SuperJANET6, has recently been approved
  • Site
    • Sporadic packet loss in site core networking (few %)
      • Still present to a very small degree – intermittent problems with access to LFC dropping for remote users (T2s). May be load related.
  • Asymmetric Data Transfer rates in/out of Tier1
    • Many possible causes: load, FTS settings, disk server settings, TCP/IP tuning, network (LAN and WAN) performance
    • Have modified FTS settings with some success
    • Looking at Tier1-UK Tier2 transfers
  • LAN
    • Another failed 10GbE XFP transceiver, and a death in service of a Nortel 5510
    • Three subnets in use for Tier1
    • Lots of packet discards into stacks, investigating...
  • Developments
    • Looking to provide large bandwidth in Tier1 core with ‘mesh-type’ arrangement linked at multiple 40Gb/s with storage connectivity at 10Gb/s.
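
As a rough illustration of why packet loss and TCP/IP tuning both appear among the possible causes of poor transfer rates, the well-known Mathis approximation bounds the steady-state throughput of a single TCP stream by MSS/RTT × C/√p, where p is the loss rate and C ≈ √(3/2). This is a sketch under stated assumptions, not an analysis of the RAL network: the MSS, round-trip time and loss rates below are hypothetical values chosen for illustration, and the function name is made up.

```python
import math

# Mathis approximation for single-stream TCP throughput under random loss:
#   rate <= (MSS / RTT) * (C / sqrt(p)),  with C ~ sqrt(3/2)
# All inputs below are hypothetical; none are measurements from the slides.
MATHIS_C = math.sqrt(3.0 / 2.0)


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound on one TCP stream's throughput in bits/s."""
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be strictly between 0 and 1")
    if rtt_s <= 0.0:
        raise ValueError("rtt_s must be positive")
    return (mss_bytes * 8 / rtt_s) * (MATHIS_C / math.sqrt(loss_rate))


# Hypothetical path: 1460-byte MSS, 10 ms round-trip time.
for loss in (0.01, 0.001, 0.0001):
    rate_mbps = mathis_throughput_bps(1460, 0.010, loss) / 1e6
    print(f"loss {loss:.2%}: about {rate_mbps:.1f} Mbit/s per stream")
```

Under this model the per-stream rate scales with 1/√p, so cutting loss by a factor of 100 raises the bound tenfold; running several streams in parallel is one common way to work around the limit.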

  • Small but significant Oracle installation
    • Castor, 3D, LFC, FTS
  • Castor database server hardware to be replaced
    • Old: 2 x 5-node (32bit) RACs, EMC AX4 arrays
    • New: 2 pairs of 3-node (64bit) RACs, EMC AX4 + Infortrend Arrays
    • Different ASM architecture – single volumes rather than paired
    • Dataguard from Production RAC to Standby RAC for resilience
    • Standby RACs in different building
    • Backups off the Standby set
    • Standby set to be added to the existing setup, Dataguard and backup as per Castor, single volume data, ASM volume architecture changes
  • 3D
    • ASM volume architecture changes

  • Evaluated MS Hyper-V for services virtualization platform
    • Beginning to roll out local-storage virtualisation for services that don’t need fast failover
  • Struggled for a long time with iSCSI storage arrays (and poor support)
    • New iSCSI arrays ordered
    • To support fast-failover etc
  • Cloud project
    • Department initiative looking at cloud use
      • Talk by Ian Collier

  • Quattor
    • Batch and Storage systems under Quattor management
      • ~6200 cores, 700+ systems (batch), 500+ systems (storage)
      • Significant time saving
    • Significant rollout on Grid services node types
  • CernVM-FS
    • Major deployment at RAL to cope with software distribution issues
    • More news in talk by Ian Collier later this week
