
HEPiX Trip Report: Jefferson Laboratory, 9-13 October 2006



  1. HEPiX Trip Report: Jefferson Laboratory, 9-13 October 2006
  Martin Bly – RAL Tier1
  HEPSysMan – Cambridge, 23 October 2006

  2. Introduction
  • Site issues
  • Subject talks

  3. Sites: CERN
  • Successfully negotiated new LCG-wide licences for Oracle
  • All physics databases now migrated to Oracle RAC hosting
  • SLC4 for LHC start-up; SLC3 support ends October 2007
  • Lemon Alarm System (LAS) replacing SURE
  • Central CVS service running well
    – Looking at Subversion
  • First Opteron systems in the CERN CC
  • Insecure mail protocols forbidden/blocked: POP/IMAP etc. must use SSL (see the sketch below)
  • No compromise on disk-server performance in order to get 'fat' systems
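The mail-protocol item is a policy change visible to every client: the plaintext POP/IMAP ports are blocked, so mail readers must use the SSL-wrapped equivalents. A minimal Python sketch of the client side using the standard library's imaplib; the host name and credentials are placeholders, not real CERN values.

```python
# Minimal sketch of the client-side effect of blocking plaintext IMAP:
# clients must connect over SSL (implicit TLS, port 993). Host and
# credentials below are placeholders, not real CERN values.
import imaplib

HOST = "imap.example.org"   # placeholder mail server

conn = imaplib.IMAP4_SSL(HOST, 993)     # TLS from the first byte
conn.login("user", "password")          # placeholder credentials
status, counts = conn.select("INBOX", readonly=True)
print(status, counts)
conn.logout()
```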

  4. Sites: FermiLab
  • Multiple 10Gb/s connections to Starlight
  • Efforts to automate computer security
    – Replacing home-grown tools with commercial utilities
  • New computer rooms
    – Overhead power and networking
    – Plastic curtains to trap cold air in front of machines
  • US-CMS
    – 700TB of dCache space, expected to reach 2.5PB by autumn 2007
    – 700-node cluster expanding to 1600 nodes
  • BlueArc NAS for online storage
    – Expensive…

  5. Sites: GridKa
  • Issues with recent Opteron procurement
    – MSI K1-1000D motherboards, AMD Opteron 270s
    – BIOS issues; BMC and NIC firmware updates needed
  • Issues with water-cooled racks traced to leaks in the chillers
  • NEC supplying 4500TB of storage
    – 28 storage controllers, RAID 6, 60 file servers
  • Report on latest benchmarks
    – Woodcrest performs very well

  6. Sites: NERSC (National Energy Research Scientific Computing Center, Berkeley)
  • NERSC Global Filesystem (NGF) in production
    – 70TB of project file space (subject of a separate talk)
    – Aim is to procure 'just storage'
  • 10Gb/s internal/external networks
    – 10Gb/s 'jumbo' network
  • Cray 'Hood' system
    – 19000+ CPUs, 70TB disk, 102 cabinets
  • Nagios for monitoring, being extended to the Cray
  • Computer room full; more power and space needed

  7. Sites: INFN
  • 10Gb/s link to the GARR backbone
    – T2s now at 1Gb/s
  • GPFS now robust enough to be adopted by many sites
    – Lustre also being tested by a few sites
  • Testing iSCSI
    – Satisfactory but not completely satisfying
    – Looking at a new EMC device and home-grown solutions to try to resolve the issues

  8. Sites: GSI Darmstadt
  • Issues with a large storage farm
    – 100 of 120 nodes failed to boot after a move to new racks
    – Had been fine for the previous 6 months in the old racks
    – Traced to vibration resonance between the disk and CPU cooling fans
  • Issues with cooling in racks
    – Keep cold and warm air flows separate
    – Blanking plates are important

  9. Sites: SLAC
  • SLAC now a US-Atlas site
    – Procurements to start soon
  • Non-HEP experiment computing building up
    – Many old clusters being decommissioned to make space
  • Plan for a 150/200-node InfiniBand cluster
    – Model check-pointing is a challenge
  • Testing Lustre
  • Need to move away from AFS (K4) token passing
    – SSH/K5 with GSSAPI to pass K5 tickets
  • New wireless registration scheme so that users can be contacted should their machine cause problems

  10. Sites: INFN-CNAF
  • CPU capacity upgrade delayed while the cooling system is upgraded, after cooling issues during the summer
  • Using Quattor/Lemon
    – CERN customisations sometimes a problem
  • Staying with SLC3 (v3.0.8 supports Woodcrest)
    – SLC4 when EGEE moves

  11. Sites: LAL
  • VMware still the preferred Linux-on-desktop solution
  • Installed gLite3 on SL4 without modification
  • Using Quattor and Lemon
    – Having removed the CERN specifics

  12. Sites: General
  • Moving to specifying computing capacity requirements for CPUs in performance terms (see the sketch below)
    – Needs 'common' benchmarking
    – Require vendors to do it (and prove it!)
  • Corresponding interest in benchmarking and how to do it so that it means something
  • 10Gb links now very common
  • Big Condor pools in use at some sites
  • Waiting for Grid middleware to be ported to SL4
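Once a common benchmark unit is agreed, specifying a procurement in performance terms reduces to simple arithmetic. A hypothetical Python sketch; the SPECint2000 ("SI2K") figures and node configuration are invented for illustration, not vendor or site data.

```python
# Illustrative only: converting a capacity requirement expressed in a
# benchmark unit (here SPECint2000, "SI2K") into a node count. The
# per-core score and node configuration are made-up example numbers.
required_si2k = 500_000        # total capacity the site must deliver
per_core_si2k = 1_500          # assumed score of one core on the benchmark
cores_per_node = 4             # e.g. a dual-socket, dual-core box

nodes = -(-required_si2k // (per_core_si2k * cores_per_node))  # ceiling division
print(f"{nodes} nodes of {per_core_si2k * cores_per_node} SI2K each")
```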

  13. Scientific Linux Update
  • UK top by downloads (no stats from mirrors)
  • FTP repository moved from GFS to NAS
  • New Plone version for the scientificlinux.org site
  • SL 4.4 (Oct 2006) for i386 and x86_64
  • SL 3.0.8 release candidate available soon
    – Now available…
  • Bug-fix repositories for SL variants: bugfixNN, where NN is the version (see the sketch below)
  • SL 3.0.8 should be the last of the 3 series
    – Support plan as previously published: until autumn 2007
  • Working on SL5 (installers etc.)
    – SL5 alphas to be based on TUV beta releases
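A hedged sketch of what the bugfixNN scheme might look like as a yum repository stanza, built and printed from Python. The repository id follows the naming given in the talk; the baseurl and name strings are placeholders, not the real Scientific Linux mirror layout.

```python
# Hypothetical illustration of the "bugfixNN" repository naming, where NN
# is the SL version. The baseurl is a placeholder, not the actual
# Scientific Linux mirror path; check the real repo definition before use.
version = "44"   # SL 4.4 -> "bugfix44" under the naming scheme described

stanza = (
    f"[bugfix{version}]\n"
    f"name=Scientific Linux {version} bug fixes (placeholder)\n"
    f"baseurl=http://mirror.example.org/sl/bugfix{version}/\n"
    "enabled=1\n"
    "gpgcheck=1\n"
)

# In practice this would be saved under /etc/yum.repos.d/ (root only);
# here we just print it.
print(stanza)
```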

  14. Core Services/Infrastructure (1)
  • The tale of FermiLab's run-in with SpamCop
    – SpamCop doesn't respond to any requests; it takes 24 hrs to 'fall off' the list
  • Mitigations (illustrated below):
    – Remove bounce messages and verify local addresses
    – Trap obvious spam
    – Have alternative IP addresses for the email gateways
  • Proposal for a 'white list' of HEP sites…
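The first two mitigations can be made concrete with a toy example. This is illustrative logic only, not FermiLab's actual gateway code: rejecting unknown recipients at SMTP time means the gateway never accepts mail it must later bounce (backscatter), which is what tends to hit blocklist spam traps.

```python
# Illustrative logic only, not FermiLab's gateway code. The address set
# and spam-score threshold are placeholders for a real directory lookup
# and a real content filter.
local_addresses = {"alice@example.gov", "bob@example.gov"}   # placeholder directory

def accept_at_gateway(recipient: str, spam_score: float) -> bool:
    """Toy accept/reject decision for an inbound mail gateway."""
    if recipient not in local_addresses:
        return False          # verify local addresses: reject, don't bounce
    if spam_score > 5.0:      # trap obvious spam (threshold is made up)
        return False
    return True

print(accept_at_gateway("alice@example.gov", 0.5))    # True
print(accept_at_gateway("nobody@example.gov", 0.5))   # False: unknown user
```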

  15. Core Services/Infrastructure (2)
  • Service Level Status (SLS)
    – CERN tool for displaying the status of services rather than individual nodes
    – Status defined by managers in terms of dependencies and dependants, and of what the service availability levels mean
    – Covers services and meta-services
    – Displays Key Performance Indicators of service levels compared to targets
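A toy sketch of the dependency idea behind such a display: a meta-service's availability is derived from the services it depends on. The "minimum of dependencies" rule here is an illustrative choice, not SLS's actual formula; in SLS the managers define what availability means for each service.

```python
# Toy model: leaf services report measured availability; a meta-service
# derives its availability from its dependencies. The "min of deps" rule
# is one plausible definition, assumed for illustration. Assumes the
# dependency graph is acyclic.
services = {
    "database": {"availability": 98.0, "depends_on": []},
    "batch":    {"availability": 92.5, "depends_on": []},
    "analysis": {"availability": None, "depends_on": ["database", "batch"]},
}

def availability(name: str) -> float:
    svc = services[name]
    if svc["availability"] is not None:
        return svc["availability"]
    return min(availability(dep) for dep in svc["depends_on"])

print(f"analysis: {availability('analysis'):.1f}%")   # -> 92.5%
```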

  16. Core Services/Infrastructure (3)
  • RT used to manage the installation workflow (SLAC)
  • High-availability methods and experiences at GSI
  • Scientific Linux Inventory Project (FermiLab)
    – Need to monitor the software inventory and hardware of a machine (see the sketch below)
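A hedged sketch of the kind of collection an inventory project needs on an RPM-based system such as Scientific Linux: the list of installed packages plus a little host information. This is a generic illustration, not the Scientific Linux Inventory Project's own code.

```python
# Generic inventory collection sketch for an RPM-based host: list the
# installed packages and some host identity info. Not the Scientific
# Linux Inventory Project's actual implementation.
import platform
import subprocess

packages = subprocess.run(
    ["rpm", "-qa", "--queryformat", "%{NAME} %{VERSION}-%{RELEASE}\n"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

print(f"host: {platform.node()}  kernel: {platform.release()}")
print(f"{len(packages)} packages installed, e.g. {packages[:3]}")
```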

  17. Compute Clusters & Storage
  • Hazards of Fast Tape Drives (JLab)
    – Is your memory buffer big enough to prevent the tape drive having to stop, rewind, and take a run-up to speed when more data becomes available to write? (see the sketch below)
    – CERN reports 100MB/s using two-stage tape serving, with large (8GB) RAM on the L1 caches
  • NGF: NERSC's Global File System (NERSC)
  • Benchmark Updates (CERN)
    – spec.org results unreliable for HEP purposes: they don't match our conditions
    – Requires vendors to use a 'fixed' configuration of the SPEC2000 benchmark
    – HPL used to benchmark 'power' performance
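The buffer question is back-of-envelope arithmetic: if the drive empties its buffer faster than the data arrives, the buffer drains at the difference of the two rates, and once it is empty the drive must stop, reposition, and come back up to speed ("shoe-shining"). A worked Python example; the rates are illustrative numbers, only the 8GB buffer size comes from the slide.

```python
# Back-of-envelope model of the tape-stall hazard: a buffer drains at
# (drive rate - feed rate) while the drive streams. Rates are made-up
# example figures; 8 GB matches the L1 cache RAM mentioned in the talk.
drive_rate_mb_s = 100.0    # sustained tape write speed
feed_rate_mb_s  = 60.0     # rate data actually arrives at the buffer
buffer_mb       = 8192.0   # 8 GB RAM buffer

drain_mb_s = drive_rate_mb_s - feed_rate_mb_s
seconds_until_stall = buffer_mb / drain_mb_s
print(f"a full {buffer_mb / 1024:.0f} GB buffer keeps the drive streaming "
      f"for {seconds_until_stall:.0f} s before it must stop and reposition")
```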

  18. Security
  • No Bob Cowles, therefore no 'scare the pants off everyone' talk
  • But:
    – The Stakkato Intrusion: the tale of the long-running intrusion at the Swedish National Supercomputer Centre, 2004-2005
    – Network Security Monitoring: how it is done at Brookhaven National Lab, with Sguil

  19. Grid Projects
  • Issues and problems around Grid Site Management (+ discussions) – Ian Bird
    – Measuring site availability: T1s poor, and instabilities in site availabilities observed
    – Strategies: improve sites, improve job direction
  • SAM (Site Availability Monitor)
    – An expansion of SFT functionality
    – Sensors integrated with the submission framework, or standalone
    – Integrated tests done by test-job submission (see the sketch below)
    – Analysis of job efficiencies (failure rates): the reasons are non-trivial
    – Good sites change daily!
    – Plan to use job wrappers to test from the submitting-VO view rather than the OPS-VO view, for a better view of the system 'weather'
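A toy illustration of the availability-by-test-job idea: a site's availability over a window is the fraction of test submissions that passed. The results below are invented, not SAM data or its real metric.

```python
# Toy availability measurement: fraction of test-job submissions that
# succeeded per site over some window. Site names and results are
# invented for illustration.
test_results = {
    "site-A": [True, True, False, True, True, True, False, True],
    "site-B": [True, False, False, True, False, True, False, False],
}

for site, results in sorted(test_results.items()):
    availability = sum(results) / len(results)
    print(f"{site}: {availability:.0%} of test jobs succeeded")
```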

  20. IHEPCCC
  • IHEPCCC discussing collaboration with HEPiX on areas of mutual interest, particularly benchmarking and global file systems
  • RTAG format proposed: short-term study groups reporting to HEPiX/IHEPCCC
  • Lots of interest in participating, particularly in benchmarking and in discussing whether SPEC2006 is appropriate

  21. Next meetings
  • Spring 2007: 23-27 April at DESY, Hamburg
    – Suggested topics include benchmarking, cluster file systems, VoIP and, in general, 'discussion topics' (as opposed to LCG workshops) likely to attract LCG Tier 2 sites
  • Autumn/Fall 2007: possibly early November at either Berkeley or FermiLab, hopefully in the week preceding Supercomputing'07 in Reno
  • Spring 2008: CERN

  22. References
  • Abstracts and slides from HEPiX Fall 2006: https://indico.fnal.gov/conferenceDisplay.py?confId=384
  • Alan Silverman's comprehensive trip report: https://www.hepix.org/mtg/fall_06_jlab/HEPiX%20_Lab_Trip_Report_silverman.pdf
