HEPiX Report

  1. HEPiX Report Helge Meinhard, Edoardo Martelli, Giuseppe Lo Presti / CERN-IT Technical Forum/Computing Seminar, 11 November 2011

  2. Outline • Meeting organisation; site reports (Helge Meinhard) • Networking and security; computing; cloud, grid, virtualisation (Edoardo Martelli) • Storage; IT infrastructure (Giuseppe Lo Presti) • 20 years of HEPiX (Helge Meinhard)

  3. HEPiX • Global organisation of service managers and support staff providing computing facilities for HEP • Covering all platforms of interest (Unix/Linux, Windows, Grid, …) • Aim: present recent work and future plans, share experience, advise managers • Meetings ~2/year (spring in Europe, autumn typically in North America)

  4. HEPiX Autumn 2011 (1) • Held 24 – 28 October at Simon Fraser University, Vancouver, BC, Canada • Hosted jointly by TRIUMF, SFU and the University of Victoria • Excellent local organisation • Steven McDonald and his team proved up to expectations for the 20th anniversary meeting • Nice auditorium • Vancouver: very lively city, all kinds and classes of restaurants; nice parks, mountains within easy reach… • Banquet at 1’100 m altitude in the snow, grizzly bears not far • Special session on the occasion of HEPiX’s 20th anniversary • Sponsored by a number of companies

  5. HEPiX Autumn 2011 (2) • Format: pre-defined tracks with conveners and invited speakers per track • Extremely rich, interesting and packed agenda • Judging by the number of submitted abstracts, no real hot spot: 8 infrastructure, 8 Grid/clouds/virtualisation, 7 network and security, 6 storage, 4 computing… plus 17 site reports • Special track on the 20th anniversary with 5 contributions • Some abstracts submitted late (Thu/Fri before the meeting!), making planning difficult • Full details and slides: http://indico.cern.ch/conferenceDisplay.py?confId=138424 • Trip report by Alan Silverman available, too: http://cdsweb.cern.ch/record/1397885

  6. HEPiX Autumn 2011 (3) • 98 registered participants, of which 10/11 from CERN • Cass, Lefebure, Lo Presti, Martelli, Meinhard, Rodrigues Moreira, Salter, Schröder, (Silverman), Toebbicke, Wartel • Many sites represented for the first time: Canadian T2s, Melbourne, Ghent, Trieste, Wisconsin, Frascati, … • Vendor representation: AMD, Dell, RedHat • Compare with GSI (spring 2011): 84 participants, of which 14 from CERN; Cornell U (autumn 2010): 47 participants, of which 11 from CERN • Record attendance for a North American meeting!

  7. HEPiX Autumn 2011 (4) • 55 talks, of which 15 from CERN • Compare with GSI: 54 talks, of which 13 from CERN • Compare with Cornell U: 62 talks, of which 19 from CERN • Next meetings: • Spring 2012: Prague (April 23 to 27) • Autumn 2012: Beijing (hosted by IHEP; date to be decided, probably 2nd half of October)

  8. Site reports (1): Hardware • CPU servers: same trends • 12...48 core boxes, AMD and Intel mentioned equally frequently, 2...4 GB/core. Some nodes with 128 GB, even 512 GB • Quite a number of problems reported with A-brand suppliers and their products • Disk servers • Still a number of problems in the interplay of RAID controllers with disk drives – controllers throwing out perfectly healthy drives • Severity of the disk drive supply situation not yet known at HEPiX • Tapes • A number of sites mentioned T10kC in production (preferred over LTO at major sites such as FNAL) • LTO very popular, many sites investigating (or moving to) LTO5

  9. Site reports (2): Software • OS • Quite a few sites mentioned migration to RHEL 6 / SL 6 • FNAL hired a replacement for Troy Dawson • Triggers bug in Nehalem sleep states • Windows 7 is in production at many sites • Exotics: Tru64, Solaris, CentOS • Storage • Lustre: used at at least 7 sites • CVMFS mentioned in at least 6 site reports (of 17) • EOS at CMS T1 at FNAL – they are quite happy • NFS: GSI getting out; BNL reported bad results with NFS 4.1 tests using NetApp and BlueArc

  10. Site reports (3): Software (cont’d) • Batch schedulers • Grid Engine rather popular; all but IN2P3 going for the Univa version. In fact, not much mention of Oracle at all this time… • Some (scalability?) problems with PBSpro / Torque-Maui, negative comments about PBSpro support • Condor and SLURM mentioned – mostly positively • Virtualisation • Many sites experimenting with KVM, Xen on its way out (often linked with the SL5 to SL6 migration) • Some very aggressive use of virtualisation (gatekeepers, AFS servers, Condor and ROCKS masters, Lustre MGS, …) • Service management • FNAL and PDSF migrating from Remedy to Service-now

  11. Site reports (4): Infrastructure • Infrastructure • Cube prototype for FAIR: 2 storeys, 96 racks, PUE 1.07 • LBNL data centre construction hindered by lawsuits • Configuration management • Puppet mentioned a number of times • Chef, cfengine2/3 used as well

  12. Site reports (5): Miscellaneous • Tendency is towards multidisciplinary labs • More focus on HPC and GPUs than in HEP • IP telephony / VoIP mentioned at least twice • Business continuity is a hot topic for major sites • Dedicated track at the next meeting

  13. Report from HEPiX 2011: Computing, Networking, Security, Clouds, Virtualization. Geneva – 11th November 2011. edoardo.martelli@cern.ch

  14. Computing

  15. AMD Interlagos New AMD 16-core processor: Interlagos. Interlagos uses the Bulldozer design: two parallel threads, extended instruction set, power efficiency (unused cores are switched off), best value per unit. Better to add cores rather than Hz: 50% more performance requires 3 times the power. Evolution: 2005: 2 cores, 1.8-3.2 GHz, 7-13 Gflops, 95 W; 2007: 3 cores, 1.9-2.5 GHz, 20-30 Gflops, 95 W; 2008: 4 cores, 2.5-2.9 GHz, 40-46 Gflops, 95 W; 2009: 6 cores, 1.8-2.8 GHz, 43-67 Gflops, 95 W; 2010: 8-12 cores, 1.8-2.6 GHz, 58-120 Gflops, 105 W.
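
  A quick way to see the efficiency trend behind the “add cores rather than Hz” argument is to compute performance per watt from the figures quoted above (a rough sketch using the slide’s own upper-bound numbers, nothing beyond what is listed):

      # Rough sketch: performance-per-watt trend from the figures quoted on the slide.
      # Upper-bound Gflops and TDP values are taken as given; treat the result as a trend only.
      generations = [
          (2005, 2, 13, 95),     # year, cores, peak Gflops, TDP (W)
          (2007, 3, 30, 95),
          (2008, 4, 46, 95),
          (2009, 6, 67, 95),
          (2010, 12, 120, 105),
      ]
      for year, cores, gflops, watts in generations:
          print(f"{year}: {cores:2d} cores, {gflops / watts:.2f} Gflops/W")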

  16. Intel Sandy Bridge / Dell Stampede Dell is building Stampede, which will be among the top ten supercomputers. Commissioned by TACC (Texas Advanced Computing Center); 27.5 M USD from NSF. 10 petaflops peak. 12,800 Intel Sandy Bridge processors. 272 TB of memory. 14 PB of storage, with a 150 GB/s Lustre file system. Intel Sandy Bridge can execute one floating-point instruction per clock cycle. Will be available in 2012 Q1. Intel MIC architecture: many cores with many threads per core. HEPspec: AMD Interlagos has slower single-core speed, but the total processor power is higher (16 cores vs 8 for Sandy Bridge).

  17. CPU benchmarking at GridKa Presented the new generation of chips: AMD Interlagos (16 cores) and Intel Sandy Bridge (8 cores). Benchmarking for tenders is difficult because performance varies depending on the version of the software used and on the OS type (32- or 64-bit).

  18. Observations While the aggregated computing capacity of processors is increasing, the single core is getting slower. Thus, single-threaded applications will run slower than before. To take advantage of new processors, applications have to be rewritten to support multi-threading. CPU power is more abundant than disk space and network bandwidth.
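
  A minimal illustration of the multi-threading point (a sketch only, not code from any of the talks; the per-event workload below is invented for the example):

      # Minimal sketch: spread an embarrassingly parallel workload over all cores,
      # since single-core speed is no longer improving. The workload is hypothetical.
      from multiprocessing import Pool

      def simulate_event(seed):
          x = seed
          for _ in range(100_000):            # stand-in for real per-event computation
              x = (x * 1103515245 + 12345) % 2**31
          return x

      if __name__ == "__main__":
          with Pool() as pool:                # one worker per available core by default
              results = pool.map(simulate_event, range(1000))
          print(len(results), "events processed")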

  19. Networking and Security

  20. LHCONE The WLCG computing model is changing and moving towards a full-mesh interconnection of the sites. LHCONE is the dedicated network that will interconnect Tier1s and major Tier2s and Tier3s. LHCONE is built on top of Open Exchange Points interconnected by long-distance links provided by R&E network operators. A work in progress.

  21. IPv6 at CERN and FZU IPv6 deployment has started at CERN and FZU. IPv6 is still lacking some functionality, but it will be necessary. Changes to management tools will require time and money. It's not only a matter for the network department: developers, sysadmins and operations will have to act.
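
  As a minimal illustration of what dual-stack readiness means from an application's point of view (a sketch, not part of the deployment work reported; the host name is only an example):

      # Sketch: check which address families a host resolves to on a dual-stack client.
      import socket

      host = "www.cern.ch"   # example host, used here purely for illustration
      families = {info[0] for info in socket.getaddrinfo(host, 443)}
      print("IPv4:", socket.AF_INET in families, "IPv6:", socket.AF_INET6 in families)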

  22. HEPiX IPv6 WG 16 groups from Europe and the US and one experiment (CMS) have joined the WG. Testbed activity: an IPv6 VO hosted by INFN has been created, with five connected sites. Tests of grid data transfers will start next month. If OK, CMS will do data-transfer tests from December. Gap-analysis activity: the WG will perform a gap analysis of the readiness of grid applications. A survey is being prepared. Collaboration with EGI on a source-code checker.

  23. Computer Security Attackers are becoming professionals, motivated by profit. Trust is being compromised: - Certification Authorities compromised - Social networks used to drive users to malicious sites - Popular web sites used to spread infections - Governments using spying software. Smartphones are easier to compromise than personal computers. HEP is also a target: CPU power needed to mine bitcoins. Primary infection vector: stolen accounts.

  25. IPv6 Security IPv6 has many security weaknesses: - by design: it was designed when many IPv4 weaknesses had not yet been exploited - by implementation: many stacks are still only partially implemented; specs and RFCs are often inconsistent - by configuration: with dual stack, running two protocols at the same time may help attackers evade packet inspection. The huge address space is more difficult to control or block. Everything will have to be verified.

  26. Observations Jefferson Lab was hacked: undetected for 6 weeks, offline for 2 weeks, and it took a long time to get back to full speed. Lots of interest in LHCONE. Most (if not all) of the new servers come with 10G NICs; thus many sites are buying “cheap”, high-density 10G switches. No mention of 40G or 100G. Not many sites are planning for IPv6, although there is a lot of interest.

  27. Grids, Clouds and Virtualization

  28. Clouds and Virtualization Several tools for cloud management were presented: - Cloudman - OpenNebula - Eucalyptus - OpenStack. Lxcloud: several tools and hypervisors evaluated (OpenNebula, OpenStack, LSF, Amazon EC2). Clouds and virtualization at RAL: Hyper-V was chosen; now evaluating OpenStack and StratusLab. Virtualization WG: working on policy and tools for image distribution.

  29. Observations No clear best/preferred tool. Many activities ongoing.

  30. Thank you

  31. HEPiX Fall 2011 Highlights: IT Infrastructure, Storage. Giuseppe Lo Presti / IT-DSS, CERN, November 11th, 2011

  32. IT Infrastructure • 8 Talks • CERN Computing Facilities • Deska, a Fabric Management Tool • Scientific Linux • SINDES: Secure file storage and transfer • Use of OCS for hw/sw inventory • Configuration Management at GSI • Hardware failures at CERN • TSM Monitoring at CERN

  33. CERN CC • An overview of the current status and plans of the CC • Cooling issues (as at most sites): addressed for now by increasing room temperature and using outside fresh air • Estimated gain: ~GWh per year! • Civil engineering works well advanced, to finish by December 2011 • Some water leaks… • Luckily without any serious consequences for equipment • Large-scale hosting off-site: call for tender is out • And an overview of the most common failures in the CC • Largely dominating: hard drive failures • MTTF measured at 320 khours, specs say 1.2 Mhours • A rather long debate after the talk…
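
  To put the two MTTF figures into perspective, here is a back-of-the-envelope conversion to an approximate annualised failure rate (a sketch assuming a constant failure rate, not a calculation from the talk):

      # Back-of-the-envelope: annualised failure rate implied by each MTTF figure,
      # using the approximation AFR ~= hours_per_year / MTTF (valid when MTTF >> 1 year).
      HOURS_PER_YEAR = 8766

      for label, mttf_hours in [("measured", 320_000), ("vendor spec", 1_200_000)]:
          afr = HOURS_PER_YEAR / mttf_hours
          print(f"{label}: MTTF {mttf_hours} h -> ~{afr:.1%} of drives failing per year")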

  34. Fabric Management • Different solutions in different centres… • DESKA at FZU, Prague • Chef at GSI, Darmstadt • OCS for hardware inventory at CERN • (I know, not exactly fitting the same scope) • Same issue: no one has a clean solution to be happy with • Complaints range from missing features to scalability issues • What follows is an overview of some of the software used at different centres

  35. Fabric Management • DESKA: a language to describe hardware configuration • Based on PostgreSQL + PgPython + git for version control • CLI, Python binding • Not yet deployed; concerns about being ‘too’ flexible • You can describe pretty much anything – what is the real effort in describing a CC? • Chef at GSI • A ‘buzzword bingo’ • Based on the Ruby language • Sysadmins were trained • Tried in real life on a brand-new batch cluster • OCS: an external tool being adopted at CERN to do an inventory of computing resources

  36. Scientific Linux • A “standard” update on SL releases and usage • Starting with a quote from Linux Format • “if it’s good enough for CERN, it’s good enough for us” • Well, kind of… • People • Troy Dawson left Fermilab to join RedHat • Two new members have joined the team • SL 6.1 released in July • Overall world-wide usage greatly increasing • Mostly SL5, SL6 ramping up, SL3(!) still used

  37. Secure file storage and transfer • With SINDES, the Secure INformation DElivery System • New version 2 • To overcome shortcomings of the current version • E.g. lack of flexibility in authorizations • A number of new features • E.g. plug-ins for authentication and authorization, versioning • To be deployed at CERN during 2012

  38. Storage and File Systems • 6 Talks • Storage at TRIUMF • EMI, the 2nd year • Storage WG update • Migrating from dCache to Hadoop • CASTOR and EOS at CERN • CVMFS update

  39. Storage at TRIUMF • Disk: 2.1 PB usable (ATLAS) • Tape: 5.5 PB on LTO4 & LTO5 cartridges • Using an IBM high-density library • Quite painful experience during 2010; issues with tape inventory only fixed after IBM released firmware in Oct 2010 • Optimizing tape read performance • Tapeguy, an in-house development • Reorders staging requests to minimize mounts • Provided they’re large enough. Not always the case…
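
  The idea behind reordering staging requests is simply to group them by the tape that holds each file, so that every tape is mounted only once per pass; a hedged sketch (not Tapeguy’s actual code, request list invented for illustration):

      # Sketch: group stage requests per tape so each tape is mounted once.
      from collections import defaultdict

      requests = [("fileA", "tape01"), ("fileB", "tape02"), ("fileC", "tape01")]  # hypothetical
      by_tape = defaultdict(list)
      for filename, tape in requests:
          by_tape[tape].append(filename)

      for tape, files in by_tape.items():
          print(f"mount {tape} once, stage {len(files)} file(s): " + ", ".join(files))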

  40. EMI Status • A (partially political) update on EMI by P. Fuhrmann • Goal: bringing together different existing grid/DM middlewares and ensuring long-term operations • However, long-term planning still not clear • First release just out • Highlights on Data Management • pNFS is a ‘done deal’ • WebDAV frontend for LFC and SEs with HTTP redirects • Completely ignoring the SRM semantics • dCache labs (preliminary): a data access abstraction layer to plug in any storage • Working on a proof of concept with Hadoop
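
  For illustration, the WebDAV/HTTP access model mentioned here boils down to an ordinary HTTP client following redirects from the catalogue to the storage element holding the data; a hedged sketch with a hypothetical endpoint (real LFC/SE access would also require X.509 credentials, omitted here):

      # Sketch only: read a file through an HTTP/WebDAV frontend that redirects to
      # the storage element. The URL is a hypothetical placeholder, not a real endpoint.
      import urllib.request

      url = "https://lfc.example.org/webdav/grid/myvo/user/file.root"   # hypothetical
      with urllib.request.urlopen(url) as resp:    # urlopen follows HTTP redirects by default
          data = resp.read()
      print(len(data), "bytes read via", resp.geturl())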

  41. Storage WG Update • Goal: compare storage solutions adopted by HEP • Report about recent (October 2011) tests at FZK • AFS, NFS, xroot, Lustre, GPFS • Use cases: taken from ATLAS and CMS • Disclaimer: moving target! • Andrei provides details on the setup for each FS • Results: quite a number of plots • Xroot closing the (previous) gap • CMS use case is CPU-bound client-side • Next candidates to test: Swift (OpenStack), probably HDFS, …

  42. Migrating from dCache to HDFS • Report about the experience at a Tier2 • UW Madison, part of US CMS • 1.1 PB usable storage • Very happy with dCache, still willing to migrate to Hadoop • And a technical opportunity came in Spring 2011: migrating to Hadoop in less time than converting dCache to Chimera? • Many constraints: being rollback-capable, idempotent, online to the maximum extent… • Exploiting the Hadoop FUSE plugin • Took 2 months, one day of downtime • Now ‘happy’, and able to leverage experience in cloud computing when hiring
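
  Since the Hadoop FUSE plugin exposes HDFS as an ordinary POSIX path, a migration of this kind can be driven by plain file operations; a hedged sketch of an idempotent copy loop (mount point and namespace paths are hypothetical, not the site’s actual tooling):

      # Sketch: idempotent copy loop over a FUSE-mounted HDFS tree. Paths are hypothetical.
      import os, shutil

      SRC = "/pnfs/example.org/data"     # hypothetical dCache namespace export
      DST = "/mnt/hadoop/store"          # hypothetical HDFS FUSE mount point

      for root, _dirs, files in os.walk(SRC):
          for name in files:
              src = os.path.join(root, name)
              dst = os.path.join(DST, os.path.relpath(src, SRC))
              if os.path.exists(dst) and os.path.getsize(dst) == os.path.getsize(src):
                  continue               # idempotent: skip files already copied
              os.makedirs(os.path.dirname(dst), exist_ok=True)
              shutil.copy2(src, dst)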

  43. CASTOR and EOS at CERN • Recap on strategy: CASTOR for the Tier0, EOS for end-user analysis • Recent improvements in CASTOR • Transfer Manager for disk scheduling • Buffered Tape Marks for improving tape migration • EOS is being moved into a production service • A review of the basic design principles • Ramping up installed capacity, migrating CASTOR pools • A few comments on EOS • J. Gordon: “It seems you like doing many things from scratch”… • Support for SRM/BeStMan • …

  44. To conclude… Vancouver downtown seen from Grouse Mountain. HEPiX Banquet, October 27th, 2011

  45. 20th anniversary (1) • Banquet on Thursday night • Warm thanks to Alan for 20 years in a pivotal role for HEPiX (“HEPiX elder statesman”) • 5 talks on Friday morning – quite a few early HEPiX attendees present • Alan Silverman: HEPiX from the beginning • Les Cottrell: Networking • Thomas Finnern: HEPi-X-perience • Rainer Toebbicke: 20 years of AFS at CERN • Corrie Kost: A personal overview of computing

  46. 20th anniversary (2) • HEPiX from the beginning – Alan Silverman • Learning from previous experience of HEP-wide collaboration on VM and VMS • Parallel meetings in Europe and North America until 1995 • Windows (HEPNT) joined in 1997 • HEPiX working groups: HEPiX scripts; AFS; large cluster SIG; mail; X11 scripts; security; benchmarking; storage; virtualisation; IPv6 • Another success story: HEP-wide adoption of Scientific Linux • Alan’s personal rankings: • Most western meeting(s): Vancouver (not much more so than SLAC) • Most eastern meeting: Taipei • Most northern meeting: Umeå • Most southern meeting: Rio (most dangerous as well…) • Most secure meeting: BNL • Most exotic meeting: Taipei • …

  47. 20th anniversary (3) • Alan’s conclusion: HEPiX gives value to the labs for the money they spend • Michel Jouvin, current European co-chair: “HEPiX is healthy after 20 years, with plenty of topics to discuss for the next 20!”
