
RAL Tier1/A Site Report






  1. RAL Tier1/A Site Report Martin Bly HEPiX – Brookhaven National Laboratory 18-20 October 2004 HEPiX - Brookhaven

  2. Overview • Introduction • Hardware • Software • Security HEPiX - Brookhaven

  3. RAL Tier1/A • RAL is the Tier 1 centre in the UK • Supports all VOs but with priority to ATLAS, CMS, LHCb • LCG Core site • Tier A centre for the BaBar collaboration • Support for other experiments: • D0, H1, SNO, UKQCD, MINOS, Zeus, Theory, … • Various test environments for grid projects HEPiX - Brookhaven

  4. Pre-Grid Upgrade [plots dated 1 July 2000 and 1 October 2000] HEPiX - Brookhaven

  5. Post-Grid Upgrade [Grid load plot, 21-28 July] Full again in 8 hours! HEPiX - Brookhaven

  6. LCG in Production • Since June the Tier1 LCG service has evolved to become a full-scale production facility • Sort of sneaked up on us! A gradual change from a test/development environment to full-scale production. • Availability and reliability of the LCG service are now a high priority for RAL staff. • Now the largest single CPU resource at RAL HEPiX - Brookhaven

  7. GRID Production HEPiX - Brookhaven

  8. Hardware • Main Farms: 884 CPUs, approx 880kSI2K • 312 CPUs x P3 @ 1.4GHz, • 160 CPUs x P4/Xeon @ 2.66GHz, HT off • 512 CPUs x P4/Xeon @ 2.8GHz, HT off • Disk: approx 226TB • 52 x 800GB R5 IDE/SCSI arrays, • 22 x 2TB R5 IDE/SCSI arrays, • 40 x 4TB R5 EonStor SATA/SCSI arrays • Tape: • 6000 slot Powderhorn Silo, 200GB/tape, 8 drives. • Misc: • SUN disk servers, AIX (AFS cell) • 140 CPUs x P3 @ 1GHz HEPiX - Brookhaven

  9. Hardware Issues • CPU and disks delivered June 16 • CPU units: • 6 of 256 failed under testing – memory, motherboard faults • Installed into production after ~4 weeks • Disk systems: • Riser cards failing – looks to be a bad batch • Issues with EonStor firmware – fixes from vendor • Going into production about now HEPiX - Brookhaven

  10. Enhancements • FY 2004/05 CPU/disk procurement starting shortly • expect lower volume of CPU and disk • CPU technology: Xeon/Opteron • Disk technology: SATA/SCSI, SATA/FC, … • Sun systems services and data migrating to SL3 • mail, NIS -> SL3 • data -> RH7.3, SL3 • Due Xmas ’04. • AFS cell migration to SL3/OpenAFS • Investigating SANs, iSCSI, SAS HEPiX - Brookhaven

  11. Environment • Farms dispersed over three machine rooms • Extra temporary air conditioning capacity for summer • Actually survived with it mostly idle! • New air conditioning for lower machine room (A5L), independent from main building air-con system. 5 Units, 400kW; arrives November • Extra power distribution (but not new power) • All new rack kit to be located in A5L, shared with other high availability services (HPC etc). • Issues: • New Nocona chips use more power – and create more heat • Rack weight on raised floors – latest kit is around 8 tonnes • Air con unit weight + power HEPiX - Brookhaven

  12. [slide content not captured in transcript] HEPiX - Brookhaven

  13. Network • Site link – 2.5Gb/s to TVN • Site backbone @ 1Gb/s. • Tier1/A backbone @ 1Gb/s on Summit 7i and 3Com switches. • Latest purchases have single or dual 1Gb/s NIC • All batch workers connected @ 100Mb/s to 3Com fan-out switches with 1Gb/s uplink • Disk servers connected @ 1Gb/s to backbone switches • Upgrades • All new hardware to have 1Gb/s NIC • Upgrade CPU rack network switches where necessary to 1Gb/s fan-out • New backbone switches: • stackable units with 40Gb/s interlink and where possible, with 10Gb/s upgrade path to site router • Joining UKLight network • 10Gb/s • Fewer hops to HEP sites • Multiple Gb/s links to Tier1/A HEPiX - Brookhaven

  14. Software • Transition to SL3 • Farms: • Scientific Linux 3 (Fermi) • Babar batch, prototype frontend • RedHat 7.n • 7.3: LCG batch, Tier1 batch, frontend systems • 7.2: Babar frontend systems • Servers: • SL3 • Systems services (mail, NIS, loggers, scheduler) • RedHat 7.2/7.3 • Disk servers (custom kernels) • Fedora Core • Consoles, personal desktops • Solaris 2.6, 8, 9 • SUN systems • AIX • AFS cell HEPiX - Brookhaven

  15. Software Issues • SL3 • Easy to install with PXE/Kickstart • Migration of the Babar community from the RH 7.3 batch service was smooth once the installation had been validated by Babar for batch work • Batch system uses Torque/Maui versions from LCG rebuilt for SL3, with some local patches to config parameters (more jobs, more classes). Stable. • RedHat 7.n • Security a big concern (!) • Speed of patching • Custom kernels a problem • Enterprise (RHEL, SL) • Disk I/O performance (both read and write) with the SL 2.4.21-15.0.n kernels is not as good as can be achieved with RH 7.n (9) • Need to test the more recent kernels • NFS, LVM and Megaraid controllers don't mix! HEPiX - Brookhaven
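The slide only says that the LCG Torque/Maui build was patched locally for "more jobs, more classes"; as a rough sketch of what that kind of change looks like in Torque (not the actual RAL configuration), the qmgr directives below raise the server job limit and add an extra queue. Queue name and limits are hypothetical.

```python
#!/usr/bin/env python
# Illustrative sketch only: drive Torque's qmgr to raise the job limit and
# add an extra queue ("class").  Queue name and limits are hypothetical,
# not the actual RAL settings.
import subprocess

QMGR_COMMANDS = [
    'set server max_running = 1000',           # allow more concurrent jobs
    'create queue long',                       # an extra job class
    'set queue long queue_type = Execution',
    'set queue long resources_max.walltime = 72:00:00',
    'set queue long enabled = True',
    'set queue long started = True',
]

for cmd in QMGR_COMMANDS:
    # qmgr -c "<directive>" must run on the Torque server host with
    # manager privileges.
    subprocess.check_call(['qmgr', '-c', cmd])
```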

  16. Projects • Quattor • Ongoing preparation for implementation • Infrastructure data challenge • Joining the effort to test high-speed / high-availability / high-bandwidth data transfers to simulate LCG requirements • RSS news service • dCache • Disk pool manager and SRM combined • Software is complex to configure • Multiple layers – difficult to drill down to find exactly why a problem has occurred; somewhat sensitive to hardware/system configurations • Working test deployment • 1 head node, 2 pool nodes • Next steps: • Create a multi-terabyte instance for CMS in LCG HEPiX - Brookhaven
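As an illustration of what exercising such a test deployment typically involves, the sketch below pushes a file through the SRM interface with the srmcp client and reads it back. The head-node host name and /pnfs path are entirely hypothetical, and a valid grid proxy is assumed; this is not a description of the RAL test instance.

```python
#!/usr/bin/env python
# Sketch: smoke-test a small dCache/SRM deployment by copying a file in
# and back out through the SRM door with the srmcp client.  Host name and
# /pnfs path below are placeholders.
import filecmp
import subprocess

LOCAL_IN = '/tmp/srm-test-in'
LOCAL_OUT = '/tmp/srm-test-out'
REMOTE = 'srm://dcache-head.example.ac.uk:8443/pnfs/example.ac.uk/data/test/smoke'

open(LOCAL_IN, 'w').write('dcache smoke test\n' * 1000)

# Requires a valid grid proxy (e.g. from grid-proxy-init) before running.
subprocess.check_call(['srmcp', 'file:///' + LOCAL_IN, REMOTE])
subprocess.check_call(['srmcp', REMOTE, 'file:///' + LOCAL_OUT])

print('round trip OK' if filecmp.cmp(LOCAL_IN, LOCAL_OUT) else 'MISMATCH')
```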

  17. Security • Firewall at RAL is default Deny inbound • Keeps many but not all badguys™ out • Specific hosts have inbound Permit for specific ports • Sets of rules for LCG components (CE, SE, RB etc) or services (AFS) • Outbound: generally open, port 80 via cache • X11 port was open but not to Tier1/A (closed 1997!) • Now closed site-wide as of 8th Oct • The badguys™ still get in… HEPiX - Brookhaven
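The rule structure described above (default Deny inbound, per-host Permits for specific LCG service ports) can be sketched as follows. This is only an illustration using generic host-style iptables syntax rather than the site firewall's own configuration, and the addresses and ports are placeholders.

```python
#!/usr/bin/env python
# Sketch of a default-deny inbound policy with per-host, per-port permits,
# in the spirit of the rule sets described above.  All addresses and
# ports are placeholders.
PERMITS = [
    # (host, protocol, port, comment)
    ('192.0.2.10', 'tcp', 2119, 'CE: gatekeeper'),
    ('192.0.2.20', 'tcp', 2811, 'SE: GridFTP control'),
    ('192.0.2.30', 'tcp', 7772, 'RB: network server'),
]

rules = [
    'iptables -P INPUT DROP',                    # default Deny inbound
    'iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT',
]
for host, proto, port, comment in PERMITS:
    rules.append('iptables -A INPUT -p %s -d %s --dport %d -j ACCEPT  # %s'
                 % (proto, host, port, comment))

print('\n'.join(rules))
```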

  18. Recent Incident (1) • A keyboard logger installed at remote site A exposes the password of an account at remote site B • Attacker gains access to exposed@siteB • Scans the account's known_hosts for possible targets • exposed@siteB has ssh keys unprotected by a pass-phrase • Result: unchallenged access to any account@host listed in known_hosts on which the corresponding public key is installed • !”£$%^&*#@;¬?>| HEPiX - Brookhaven
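The attack path above works because a known_hosts file plus passphrase-less keys hands the intruder a ready-made target list. A small audit sketch of that exposure, assuming the plain-text known_hosts format that was standard at the time (hashed entries are skipped):

```python
#!/usr/bin/env python
# Sketch: list the hosts an account's known_hosts file would hand to an
# intruder as follow-on targets.  Assumes the classic plain-text format
# (one "host[,host...] keytype key" entry per line).
import os

def known_hosts_targets(path=os.path.expanduser('~/.ssh/known_hosts')):
    targets = set()
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith(('#', '|1|')):
                    continue            # skip comments and hashed entries
                targets.update(line.split(None, 1)[0].split(','))
    except IOError:
        pass
    return sorted(targets)

if __name__ == '__main__':
    hosts = known_hosts_targets()
    print('%d potential follow-on targets in known_hosts' % len(hosts))
    for host in hosts:
        print('  ' + host)
```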

  19. Recent Incident (2) • Aug 26 at 23:05 BST, Badguy™ uses the unprotected key of the compromised account at remote site B to enter two RedHat 7.2 systems at RAL • Downloads a custom IRC bot based on Energy Mech • Contains a klogd binary which is the IRC bot • Possibly tries for privilege escalation • Installs the IRC bot (klogd), attempting to usurp the system klogd or possibly other rogue klogds. Fails to kill the system klogd. • Two klogds now running: the system one owned by root and the badguy™ version owned by the compromised user. • At some time later the directory containing the bot code (/tmp/.mc) is deleted. HEPiX - Brookhaven

  20. Recent Incident (3) • Oct 7, a.m.: we are told by legitimate remote IRC server admins, who monitor for suspicious activity, that the system has been behaving suspiciously. Systems removed from the network and forensic investigation begins • Dump of the bot/klogd process shows 4800+ hosts listed – it appears the system was part of an IRC network • The badguy™ bot/klogd listens on ports tcp:8181 and udp:34058 • Contacts IRC servers at 4 addresses (port 6667), as "XzIbIt" • Firewall logs show a relatively small amount of traffic from the affected host • No trace of root exploits • Second host was a user frontend system: no evidence of any IRC activity or root compromise HEPiX - Brookhaven
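One of the tell-tale signs above was the bot listening on unexpected ports (tcp:8181). As a minimal sketch of the kind of check that flags this, the script below reads /proc/net/tcp directly, rather than trusting netstat/lsof binaries an intruder might have replaced, and reports any listener not on an allowlist; the allowlist here is a placeholder.

```python
#!/usr/bin/env python
# Sketch: report listening TCP ports by reading /proc/net/tcp directly
# and flag anything not on a (hypothetical) allowlist.
EXPECTED = {22, 111, 2049}          # placeholder allowlist: ssh, rpc, nfs

def listening_tcp_ports(table='/proc/net/tcp'):
    ports = set()
    with open(table) as fh:
        next(fh)                                   # skip header line
        for line in fh:
            fields = line.split()
            local_addr, state = fields[1], fields[3]
            if state == '0A':                      # 0A == TCP LISTEN
                ports.add(int(local_addr.split(':')[1], 16))
    return ports

if __name__ == '__main__':
    for port in sorted(listening_tcp_ports() - EXPECTED):
        print('unexpected listener on tcp port %d' % port)
```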

  21. Lessons • Unprotected ssh keys are bad news • If a key is unprotected on your system then all keys owned anywhere by that user are likely unprotected too • Use ssh-agent or similar • There are still .netrc files in use for production userids • Communication • Lack of news from upstream sites a disappointment • If we had been told of the exploit at the remote site, and the time frames involved, we would have found the IRC bot within hours • Protect infrastructure from user-accessible hosts • Firewalling • Staff time: 2-3 staff-weeks HEPiX - Brookhaven
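In the spirit of the first lesson, a quick audit sketch: flag private keys under users' ~/.ssh that carry no passphrase (old-style PEM keys include an ENCRYPTED header when protected) and lingering .netrc files. The paths and the heuristic are assumptions for illustration, not a description of what RAL actually ran.

```python
#!/usr/bin/env python
# Sketch: flag passphrase-less private SSH keys and .netrc files under
# /home.  For PEM-format keys (the usual format in this era) an encrypted
# key contains a "Proc-Type: 4,ENCRYPTED" header; its absence suggests
# the key is unprotected.
import glob
import os

KEY_NAMES = ('id_rsa', 'id_dsa', 'identity')    # common private key names

for home in glob.glob('/home/*'):
    for name in KEY_NAMES:
        key = os.path.join(home, '.ssh', name)
        if os.path.isfile(key):
            with open(key) as fh:
                if 'ENCRYPTED' not in fh.read():
                    print('unprotected key: %s' % key)
    netrc = os.path.join(home, '.netrc')
    if os.path.isfile(netrc):
        print('.netrc present:   %s' % netrc)
```

Note that ssh-agent only helps once the key itself has a passphrase; ssh-keygen -p can add one to an existing key.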
