
Prague Site Report



Presentation Transcript


  1. Prague Site Report • Jiří Chudoba, Institute of Physics, Prague • 23.4.2012, HEPiX meeting, Prague

  2. Local Organization • Institute of Physics: • 2 locations in Prague, 1 in Olomouc • 786 employees (281 researchers + 78 doctoral students) • Department of Networking and Computing Techniques (SAVT) • networking up to offices, mail and web servers, central services • Computing centre (CC) • large-scale calculations • part of SAVT (except its leader, Jiri Chudoba) • Division of Elementary Particle Physics • Department of detector development and data processing • head: Milos Lokajicek • started large-scale calculations, later transferred to the CC • the biggest hardware contributor (LHC computing) • participates in the CC operation

  3. Server room I • Server room I (Na Slovance) • 62 m², ~20 racks • 350 kVA motor generator, 200 + 2 × 100 kVA UPS • 108 kW air cooling, 176 kW water cooling • continuous changes • hosts computing servers and central services

  4. Other server rooms • New server room for SAVT • located next to server room I • independent UPS (24 kW now, max 64 kW n+1), motor generator (96 kW), cooling 25 kW (n+1) • dedicated to central services • 16 m², now 4 racks (room for 6) • very high reliability required • first servers moved in last week • Server room Cukrovarnicka • another building in Prague • 14 m², 3 racks (max 5), 20 kW central UPS, 2 × 8 kW cooling • backup servers and services • Server room UTIA • 3 racks, 7 kW cooling, 3 + 5 × 1.5 kW UPS • dedicated to the Department of Condensed Matter Theory

  5. (image-only slide)

  6. Clusters in CC – Dorje • Dorje: Altix ICE 8200, 1.5 racks • 512 cores on 64 diskless worker nodes, InfiniBand, 2 disk arrays (6 + 14 TB) • only local users: solid state physics, condensed matter theory • 1 admin for administration and user support • relatively small number of jobs, MPI jobs up to 256 processes • Torque + Maui, SLES10 SP2, SGI Tempo, MKL, OpenMPI, ifort • users run mostly: Wien2k, vasp, fireball, apls
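
The batch environment on Dorje (Torque + Maui with OpenMPI) is normally driven by a PBS job script submitted with qsub. The following Python sketch only illustrates that workflow under stated assumptions: the queue name "dorje", the walltime and the application name "my_mpi_application" are hypothetical, and it assumes 8 cores per node (512 cores on 64 nodes).

    # Illustrative sketch of submitting an OpenMPI job through Torque/Maui.
    # Queue name, walltime and application name are hypothetical.
    import subprocess
    import textwrap

    def submit_mpi_job(ncores=256, job_name="mpi_run"):
        """Write a PBS job script and submit it with qsub; return the job id."""
        nodes = ncores // 8                     # Dorje: 512 cores on 64 nodes -> 8 cores/node
        script = textwrap.dedent(f"""\
            #!/bin/bash
            #PBS -N {job_name}
            #PBS -q dorje
            #PBS -l nodes={nodes}:ppn=8
            #PBS -l walltime=24:00:00
            cd $PBS_O_WORKDIR
            mpirun -np {ncores} ./my_mpi_application
        """)
        with open("job.pbs", "w") as fh:
            fh.write(script)
        out = subprocess.run(["qsub", "job.pbs"], capture_output=True, text=True, check=True)
        return out.stdout.strip()               # qsub prints the job id on stdout

    print(submit_mpi_job())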

  7. Cluster LUNA • 2 servers SunFire X4600 • 8 CPUs (32 cores), 256 GB RAM • 4 servers SunFire V20z, V40z • operated by CESNET Metacentrum, the distributed computing activity of NGI_CZ • Metacentrum: • 9 locations • 3500 cores • 300 TB

  8. Cluster Thsun, small group servers • Thsun • a “private” cluster • small number of users • power users with root privileges • 12 servers of varying hardware • servers for groups • managed by the groups in collaboration with the CC

  9. Cluster Golias • upgraded every year; several subclusters of identical hardware • 3812 cores, 30700 HS06 • almost 2 PB of disk space • the newest (March 2012) subcluster, rubus: • 23 nodes SGI Rackable C1001-G13 • 2 × Opteron 6274 (16 cores each), 64 GB RAM, 2 × 300 GB SAS • 374 W (full load) • 232 HS06 per node, 5343 HS06 total
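
As a quick cross-check, the per-node and total HS06 figures are related by simple multiplication; the plain-Python sketch below is illustrative only and uses the rounded 232 HS06 per node from the slide, which is why it lands slightly below the quoted 5343 HS06.

    # Rough arithmetic check of the rubus subcluster figures (illustrative only).
    nodes = 23
    hs06_per_node = 232                    # rounded value from the slide
    print(nodes * hs06_per_node)           # 5336, vs. 5343 quoted (per-node value is rounded)

    # rubus share of the whole Golias cluster (30700 HS06 total)
    print(f"{5343 / 30700:.1%}")           # ~17.4% of the cluster's capacity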

  10. Golias shares • (charts: planned vs. real usage by walltime; subclusters' contribution to the total performance)

  11. WLCG Tier2 • cluster Golias @ FZU + xrootd servers @ Rez • 2012 pledges: • ATLAS 10000 HS06, 1030 TiB; 11861 HS06 and 1300 TB available • ALICE 5000 HS06, 420 TiB; 7564 HS06 and 540 TB available • delivery of almost 600 TB delayed due to floods • 66% efficiency is assumed for WLCG accounting • sometimes under 100% of pledges • low cputime/walltime ratio for ALICE (see the sketch below) • not only on our site • tests with limits on the number of concurrent jobs (last week): • “no limit” (about 900 jobs): 45% • limit of 600 jobs: 54%
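
The cputime/walltime efficiency quoted above is simply the ratio of summed CPU time to summed wall-clock time over the accounted jobs. A minimal sketch, with invented job records chosen only to illustrate a ~45% ratio:

    # Minimal sketch of the cputime/walltime efficiency metric; the job records
    # below are invented for illustration, not real Golias accounting data.
    def cpu_efficiency(jobs):
        """jobs: iterable of (cputime_s, walltime_s) pairs."""
        cpu = sum(c for c, _ in jobs)
        wall = sum(w for _, w in jobs)
        return cpu / wall if wall else 0.0

    sample_jobs = [(1800, 4000), (2500, 5500), (3000, 6700)]
    print(f"cputime/walltime = {cpu_efficiency(sample_jobs):.0%}")   # ~45%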

  12. Utilization • very high average utilization • several different projects, different tools for production • D0 – production submitted locally by 1 user • ATLAS – PanDA, Ganga, local users; DPM • ALICE – VO box; xrootd • (utilization chart with per-project breakdown: D0, ALICE, ATLAS)

  13. Networking • CESNET upgraded our main Cisco router • 6506 -> 6509 • supervisor SUP720 -> SUP2T • new 8 × 10G X2 card • planned upgrade of power supplies: 2 × 3 kW -> 2 × 6 kW • (2 cards 48 × 1 Gbps, 1 card 4 × 10 Gbps, FW service module)

  14. (image-only slide)

  15. (image-only slide)

  16. (image-only slide)

  17. External connection • exclusive: 1 Gbps (to FZK) + 10 Gbps (CESNET) • shared: 10 Gbps (PASNET – GEANT) • not enough for the ATLAS T2D limit (5 MB/s to/from T1s) • perfSONAR installed • (charts: PASNET link, FZK -> FZU and FZU -> FZK traffic)
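
A site can check itself against the T2D requirement by comparing measured average transfer rates per Tier-1 with the 5 MB/s threshold. The sketch below is purely illustrative: the rate values are made up, and in practice they would come from perfSONAR or ATLAS transfer monitoring rather than a hard-coded dictionary.

    # Illustrative check of per-T1 transfer rates against the ATLAS T2D limit.
    # The measured values below are placeholders, not real monitoring data.
    T2D_LIMIT_MB_S = 5.0

    measured_rates = {"FZK": 7.2, "SARA": 4.1, "RAL": 3.8}   # MB/s, hypothetical

    for t1, rate in measured_rates.items():
        status = "OK" if rate >= T2D_LIMIT_MB_S else "below T2D limit"
        print(f"{t1:5s} {rate:4.1f} MB/s  {status}")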

  18. Miscellaneous items • Torque server performance • W jobs, sometimes long response times • divide Golias into 2 clusters with 2 Torque instances? • memory limits for the ATLAS and ALICE queues (see the sketch below) • CVMFS • used by ATLAS, works well • some older nodes have too small disks -> excluded for ATLAS • Management • Cfengine v2 used for production • Puppet used for the IPv6 testbed • 2 new 64-core nodes • SGI Rackable H2106-G7, 128 GB RAM, 4 × Opteron 6274 2.2 GHz, 446 HS06 • frequent crashes when loaded with jobs • another 2 servers with Intel SB expected • small subclusters with different hardware
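
Per-queue memory limits of the kind mentioned for the ATLAS and ALICE queues are normally set through Torque's qmgr interface. The sketch below is an assumption-laden illustration: the queue names follow the slide, but the attribute chosen (resources_max.pvmem) and the limit values are examples, not the site's actual configuration.

    # Sketch: setting per-queue memory limits with Torque's qmgr (run as the
    # Torque administrator). Limit values are hypothetical examples.
    import subprocess

    QUEUE_MEM_LIMITS = {"atlas": "3gb", "alice": "4gb"}      # example values only

    for queue, limit in QUEUE_MEM_LIMITS.items():
        cmd = f"set queue {queue} resources_max.pvmem = {limit}"
        subprocess.run(["qmgr", "-c", cmd], check=True)      # one management command per call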

  19. Water cooling • Active vs passive cooling doors • 1 new rack with cooling doors • 2 new cooling doors on APC racks

  20. Water cooling • good sealing crucial • (photos/charts: worker nodes and diskservers, cooling on/off with a divider added; rubus01, diskservers)

  21. Distributed Tier2, Tier3s • networking infrastructure (provided by CESNET) connects all Prague institutions involved • Academy of Sciences of the Czech Republic • Institute of Physics (FZU, Tier-2) • Nuclear Physics Institute • Charles University in Prague • Faculty of Mathematics and Physics • Czech Technical University in Prague • Faculty of Nuclear Sciences and Physical Engineering • Institute of Experimental and Applied Physics • currently only NPI hosts resources visible in the Grid • many reasons why the others do not: manpower, suitable rooms, lack of IPv4 addresses • Data Storage group at CESNET • deployment for LHC projects under discussion

  22. Thanks to my colleagues for their help with the preparation of these slides: • Marek Eliáš • Lukáš Fiala • Jiří Horký • Tomáš Hrubý • Tomáš Kouba • Jan Kundrát • Miloš Lokajíček • Petr Roupec • Jana Uhlířová • Ota Velínský • Jiri.Chudoba@fzu.cz
