

  1. INFN-T1 site report Andrea Chierici On behalf of INFN-T1 staff HEPiX Spring 2014

  2. Outline • Common services • Network • Farming • Storage Andrea Chierici

  3. Common services

  4. Cooling problem in March • Problem with the cooling system: we had to switch the whole center off • Obviously the problem happened on a Sunday at 1 am • Took almost a week to completely recover and have our center 100% back on-line • But the LHC experiments were back on-line after 36h • We learned a lot from this (see separate presentation) Andrea Chierici

  5. New dashboard Andrea Chierici

  6. Example: Facility Andrea Chierici

  7. Installation and configuration • CNAF is seriously evaluating a move to Puppet + Foreman as the common installation and configuration infrastructure • INFN-T1 has historically been a Quattor supporter • New manpower, a wider user base and new activities are pushing us to change • Quattor will stay around as long as needed • at least 1 year to allow for the migration of some critical services Andrea Chierici
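For inventory and provisioning checks during such a migration, the Foreman REST API can be queried directly; the minimal Python sketch below lists the hosts a Foreman instance knows about. The server URL, credentials, CA bundle path and printed fields are illustrative assumptions, not CNAF's actual setup.

# Minimal sketch: list the hosts registered in a Foreman instance via its
# REST API.  URL, credentials and CA bundle path are placeholders, not the
# actual INFN-T1 configuration.
import requests

FOREMAN_URL = "https://foreman.example.org"   # hypothetical Foreman server
AUTH = ("api_user", "api_password")           # hypothetical credentials

def list_hosts():
    """Return the host entries reported by Foreman's /api/hosts endpoint."""
    resp = requests.get(FOREMAN_URL + "/api/hosts",
                        auth=AUTH,
                        params={"per_page": 100},
                        verify="/etc/ssl/certs/ca-bundle.crt")
    resp.raise_for_status()
    # Foreman's API wraps list responses in a "results" key.
    return resp.json().get("results", [])

if __name__ == "__main__":
    for host in list_hosts():
        print(host.get("name"), host.get("operatingsystem_name"))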

  8. Heartbleed • No evidence of compromised nodes • Updated SSL and certificates on bastion hosts and critical services (grid nodes, Indico, wiki) • Some hosts were not exposed due to the older OpenSSL version installed Andrea Chierici
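Exposure depends on the OpenSSL branch: CVE-2014-0160 affects releases 1.0.1 through 1.0.1f, while the older 0.9.8 and 1.0.0 branches are not vulnerable, which is why some hosts were safe. The Python sketch below is a simplified triage of plain version strings; it deliberately ignores vendor backports (e.g. a distribution shipping a patched 1.0.1e), which have to be checked against the errata instead.

# Rough triage of OpenSSL version strings against Heartbleed (CVE-2014-0160).
# Vulnerable releases are 1.0.1 through 1.0.1f; 1.0.1g and the older
# 0.9.8/1.0.0 branches are not affected.  Vendor backports (a patched
# 1.0.1e keeping the same version string) are NOT detected here.
import re

def heartbleed_suspect(version: str) -> bool:
    """Return True if the plain version string falls in the vulnerable range."""
    m = re.match(r"1\.0\.1([a-z]?)$", version.strip())
    if not m:
        return False                        # 0.9.8, 1.0.0, 1.0.2+ are out of range
    letter = m.group(1)
    return letter == "" or letter <= "f"    # 1.0.1 .. 1.0.1f

if __name__ == "__main__":
    for v in ["0.9.8e", "1.0.0", "1.0.1e", "1.0.1f", "1.0.1g"]:
        print(v, "suspect" if heartbleed_suspect(v) else "ok")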

  9. Grid Middleware status • EMI-3 update status • All core services updated • All WNs updated • Some legacy services (mainly UIs) still at EMI-1/2, will be phased out asap Andrea Chierici

  10. Network

  11. WAN connectivity (Cisco 7600 + Nexus) [network diagram] • LHC OPN / LHC ONE peering via GARR (Bo1): RAL, PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, IN2P3, SARA • 10 Gb/s for General IP connectivity • 10 Gb/s CNAF-FNAL link for CDF (Data Preservation) • 40 Gb physical link (4x10 Gb) shared for LHCOPN and LHCONE towards the T1 resources Andrea Chierici

  12. Current connection model [diagram: LHCOPN/ONE and general Internet traffic enter through the Cisco 7600, BD8810 and Nexus 7018 core at 10 Gb/s and 4x10 Gb/s; disk servers attach at 2x10 Gb/s; farming switches uplink at up to 4x10 Gb/s (old 2009-2010 resources at 4x1 Gb/s), with 20 worker nodes per switch] • Core switches and routers are fully redundant (power, CPU, fabrics) • Every switch is connected with load sharing on different port modules • Core switches and routers have a strict SLA (next solar day) for maintenance Andrea Chierici

  13. Farming

  14. Computing resources • 150K HS06 • Reduced compared to the last workshop • Old nodes have been phased out (2008 and 2009 tenders) • Whole farm running on SL6 • Supporting a few VOs that still require SL5 via WNoDeS Andrea Chierici

  15. New CPU tender • 2014 tender delayed • Funding issues • We were running over-pledged resources • Trying to take into account TCO (energy consumption), not only the sales price (see the back-of-envelope sketch below) • Support will cover 4 years • Trying to open it up as much as possible • Last tender had only 2 bidders • “Relaxed” support constraints • Would like to have a way to easily share specs, experiences and hints about other sites’ procurements Andrea Chierici
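A minimal sketch of how energy consumption can be folded into the comparison of bids: total cost of ownership as purchase price plus electricity over the 4-year support period. All figures (prices, power draw, PUE, node counts) are hypothetical placeholders, not the actual tender parameters.

# Back-of-envelope TCO comparison for a CPU tender: purchase price plus
# energy cost over the 4-year support period.  All numbers are hypothetical
# placeholders, not the actual tender figures.

def tco(purchase_eur, node_watts, nodes, years=4,
        eur_per_kwh=0.15, pue=1.5):
    """Total cost of ownership: hardware + electricity (cooling included via PUE)."""
    hours = years * 365 * 24
    energy_kwh = node_watts * nodes * hours / 1000.0 * pue
    return purchase_eur + energy_kwh * eur_per_kwh

if __name__ == "__main__":
    # Hypothetical bids: cheaper but power-hungry vs. pricier but efficient.
    bid_a = tco(purchase_eur=400_000, node_watts=350, nodes=200)
    bid_b = tco(purchase_eur=440_000, node_watts=280, nodes=200)
    print(f"bid A: {bid_a:,.0f} EUR   bid B: {bid_b:,.0f} EUR")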

  16. Monitoring & Accounting (1) Andrea Chierici

  17. Monitoring & Accounting (2) Andrea Chierici

  18. New activities (last WS) • Did not migrate to Grid Engine, we stuck with LSF • Mainly an INFN-wide decision • Manpower • Testing Zabbix as a platform for monitoring computing resources • More time required • Evaluating APEL as an alternative to DGAS as grid accounting system: not done yet Andrea Chierici

  19. New activities • Configure oVirt cluster to manage service VMs: done • Standard libvirt mini-cluster for backup, with GPFS shared storage • Upgrade LSF to v9 • Setup of a new HPC cluster (Nvidia GPUs + Intel MIC) • Multicore task force • Implement a log analysis system (Logstash, Kibana; see the parsing sketch below) • Move some core grid services to an OpenStack infrastructure (the first one will be the site-BDII) • Evaluation of the Avoton CPU (see separate presentation) • Add more VOs to WNoDeS Andrea Chierici
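To illustrate the log analysis activity: the job Logstash does before Kibana can plot anything is turning free-form log lines into structured records. The regular expression and the sample line below are illustrative placeholders, not our production pipeline configuration.

# Minimal illustration of what the Logstash stage does before Kibana:
# turn a free-form syslog line into a structured (JSON-ready) record.
import json
import re

SYSLOG_RE = re.compile(
    r"(?P<timestamp>\w{3}\s+\d+\s[\d:]{8})\s"
    r"(?P<host>\S+)\s"
    r"(?P<program>[\w\-/]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)"
)

def parse_line(line: str):
    """Return a dict of named fields, or a tagged parse failure."""
    m = SYSLOG_RE.match(line)
    return m.groupdict() if m else {"message": line, "tags": ["_parsefailure"]}

if __name__ == "__main__":
    sample = "Apr 14 01:02:03 wn-example sshd[1234]: Accepted publickey for user"
    print(json.dumps(parse_line(sample), indent=2))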

  20. Storage

  21. Storage resources • Disk space: 15 PB-N (net) on-line • 4 EMC² CX3-80 + 1 EMC² CX4-960 (~1.4 PB) + 80 servers (2x1 Gb/s connections) • 7 DDN S2A 9950 + 1 DDN SFA 10K + 1 DDN SFA 12K (~13.5 PB) + ~90 servers (10 Gb/s) • Upgrade of the latest system (DDN SFA 12K) was completed in 1Q 2014; aggregate bandwidth: 70 GB/s • Tape library SL8500, ~16 PB on-line, with 20 T10KB drives, 13 T10KC drives and 2 T10KD drives • 7500 x 1 TB tapes, ~100 MB/s of bandwidth per drive • 2000 x 5 TB tapes, ~200 MB/s of bandwidth per drive; the 2000 tapes can be re-used with the T10KD technology at 8.5 TB per tape (see the capacity sketch below) • Drives interconnected to library and servers via a dedicated SAN (TAN); 13 Tivoli Storage Manager HSM nodes access the shared drives • 1 Tivoli Storage Manager (TSM) server common to all GEMSS instances • A tender for an additional 3000 x 5 TB/8.5 TB tapes for 2014-2017 is ongoing • All storage systems and disk servers on SAN (4 Gb/s or 8 Gb/s) Andrea Chierici
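A back-of-envelope sketch of the nominal media capacity implied by the tape counts above, and of what re-using the 5 TB media at 8.5 TB with the T10KD drives would give; it only multiplies the figures quoted on this slide.

# Nominal tape media capacity from the figures on this slide:
# 7500 x 1 TB tapes and 2000 x 5 TB tapes, the latter re-usable at
# 8.5 TB with the T10KD drives.
def capacity_pb(tapes_1tb=7500, tapes_5tb=2000, reuse_at_8_5tb=False):
    big = 8.5 if reuse_at_8_5tb else 5.0
    return (tapes_1tb * 1.0 + tapes_5tb * big) / 1000.0   # TB -> PB

if __name__ == "__main__":
    print(f"current media:      {capacity_pb():.1f} PB")
    print(f"re-used with T10KD: {capacity_pb(reuse_at_8_5tb=True):.1f} PB")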

  22. Storage configuration • All disk space is partitioned into ~10 GPFS clusters served by ~170 servers • One cluster per main (LHC) experiment • GPFS deployed on the SAN implements a full high-availability system • The system is scalable to tens of PB and able to serve thousands of concurrent processes with an aggregate bandwidth of tens of GB/s • GPFS coupled with TSM offers a complete HSM solution: GEMSS • Access to storage is granted through standard interfaces (POSIX, SRM, XRootD and WebDAV; see the WebDAV sketch below) • File systems directly mounted on the WNs Andrea Chierici
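Of the standard interfaces listed above, WebDAV is the easiest to exercise from a plain script: a depth-1 PROPFIND lists the entries of a collection. The endpoint URL below is a hypothetical placeholder, and authentication (X.509 or otherwise) is left out for brevity.

# Minimal WebDAV directory listing via a depth-1 PROPFIND request.
# Endpoint and credentials are hypothetical placeholders, not the
# actual INFN-T1 storage front-end.
import xml.etree.ElementTree as ET
import requests

ENDPOINT = "https://webdav.example.org/storage/atlas/"   # hypothetical URL

def list_collection(url, auth=None):
    """Return the hrefs reported in the 207 Multi-Status response."""
    resp = requests.request("PROPFIND", url, auth=auth,
                            headers={"Depth": "1"})
    resp.raise_for_status()
    tree = ET.fromstring(resp.content)
    # Each entry appears as a <D:response><D:href> element in the DAV: namespace.
    return [href.text for href in tree.iter("{DAV:}href")]

if __name__ == "__main__":
    for entry in list_collection(ENDPOINT):
        print(entry)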

  23. Storage research activities • Studies on more flexible and user-friendly methods for accessing storage over the WAN • Storage federations based on HTTP/WebDAV for ATLAS (production) and LHCb (testing) • Evaluation of different file systems (Ceph) and storage solutions (EMC² Isilon with OneFS) • Integration between the GEMSS storage system and XRootD in order to match the requirements of CMS, ATLAS, ALICE and LHCb, using ad-hoc XRootD modifications • This is currently in production Andrea Chierici

  24. LTDP • Long Term Data Preservation (LTDP) for the CDF experiment • The FNAL-CNAF data copy mechanism is completed • The copy of the data will follow this timetable: end 2013 - early 2014 → all data and MC user-level n-tuples (2.1 PB); mid 2014 → all raw data (1.9 PB) + databases • Bandwidth of 10 Gb/s reserved on the transatlantic link CNAF ↔ FNAL • 940 TB already at CNAF (see the transfer-time estimate below) • Code preservation: the CDF legacy software release (SL6) is under test • Analysis framework: in the future, CDF services and analysis computing resources will possibly be instantiated on demand on pre-packaged VMs in a controlled environment Andrea Chierici
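As a rough consistency check on the timetable, the sketch below estimates how long the remaining copy would take over the reserved 10 Gb/s link, using the volumes quoted on this slide; the 70% link efficiency is an assumption for illustration, not a measured figure.

# Rough estimate of the remaining CDF copy time over the reserved 10 Gb/s
# CNAF-FNAL link, using the volumes quoted on this slide.  The 70% link
# efficiency is an assumed illustration, not a measured figure.
def days_to_copy(remaining_tb, link_gbps=10.0, efficiency=0.70):
    rate_tb_per_day = link_gbps / 8.0 * efficiency * 86400 / 1000.0  # Gb/s -> TB/day
    return remaining_tb / rate_tb_per_day

if __name__ == "__main__":
    total_tb = 2100 + 1900          # n-tuples (2.1 PB) + raw data (1.9 PB)
    already_at_cnaf_tb = 940
    remaining = total_tb - already_at_cnaf_tb
    print(f"~{days_to_copy(remaining):.0f} days at 10 Gb/s "
          f"(70% efficiency) for the remaining {remaining} TB")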
