
Infrastructure and Provisioning at the Fermilab High Density Computing Facility

Steven C. Timm

Fermilab

HEPiX conference

May 24-26, 2004

Outline

  • Current Fermilab facilities

  • Expected need for future Fermilab facilities

  • Construction activity at High Density Computing Facility

  • Networking and power infrastructure

  • Provisioning and management at remote location

A cast of thousands…

  • HDCF design done by Fermilab Facilities Engineering

  • Construction by outside contractor

  • Managed by CD Operations (G. Bellendir et al)

  • Requirements planning by a task force of Computing Division personnel, including system administrators, department heads, networking, and facilities people.

  • Rocks development work by S. Timm, M. Greaney, J. Kaiser

Current Fermilab facilities

  • Feynman Computing Center built in 1988 (to house a large IBM-compatible mainframe).

  • ~18000 square feet of computer rooms

  • ~200 tons of cooling

  • Maximum input current 1800A

  • Computer rooms backed up with UPS

  • Full building backed up with generator

  • ~1850 dual-CPU compute servers, ~200 multi-TB IDE RAID servers in FCC right now

  • Many other general-purpose servers, file servers, tape robots.

Current facilities, continued

  • Satellite computing facility in former experimental hall “New Muon Lab”

  • Historically for Lattice QCD clusters (208512)

  • Now contains >320 other nodes awaiting construction of the new facility

The long hot summer

  • In summer it takes considerably more energy to run the air conditioning.

  • Dependent on shallow pond for cooling water.

  • In May the building already came within 25 A (out of 1800 A) of having to shut down equipment to shed power load and avoid a brownout.

  • Current equipment exhausts the cooling capacity of the Feynman Computing Center as well as its electrical capacity.

  • No way to increase either in the existing building without long service outages.

Computers just keep getting hotter

  • Anticipate that in fall ’04 we can buy dual Intel 3.6 GHz “Nocona” chips, ~105 W apiece

  • Expect at least 2.5 A of current draw per node, possibly more: 12-13 kVA per rack of 40 nodes (see the arithmetic sketch after this list)

  • In FCC we have 32 computers per rack, 8-9 kVA

  • Have problems cooling the top nodes even now.

  • New facility will have 5x more cooling, 270 tons for 2000 square feet

  • New facility will have up to 3000A of electrical current available.
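The per-rack figure follows from simple arithmetic. The short Python sketch below reproduces it from the numbers quoted above; the 120 VAC circuit voltage matches the PM-10 power strips described later, and everything else comes from this slide.

```python
# Back-of-the-envelope rack power check, using the figures quoted on this slide
# (~2.5 A per node, 40 nodes per rack) and the 120 VAC circuits of the PM-10 strips.
VOLTS = 120.0          # nominal circuit voltage
AMPS_PER_NODE = 2.5    # expected draw per dual-"Nocona" 1U node
NODES_PER_RACK = 40    # planned HDCF rack density

va_per_node = VOLTS * AMPS_PER_NODE                    # ~300 VA per node
kva_per_rack = va_per_node * NODES_PER_RACK / 1000.0   # ~12 kVA per rack
print("kVA per rack: %.1f" % kva_per_rack)             # consistent with the 12-13 kVA estimate
```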

We keep needing more computers

  • Moore’s law doubling time isn’t holding true in the commodity market

  • Computing needs are growing faster than Moore’s law and must be met with more computers

  • 5-year projections are based on plans from the experiments.

Fermi Cycles as a function of time

[Chart: Fermi Cycles vs. time, fitted to Y = R*2^(X/F). Moore’s law says F = 1.5 years; the measured fit gives F = 2.02 years and growing. 1000 Fermi Cycles corresponds to a 1 GHz Pentium III.]
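As a worked example of the doubling-time model quoted on the chart, the sketch below compares the Moore’s-law prediction with the observed fit; the baseline of 1000 Fermi Cycles (a 1 GHz PIII-class node) is taken from the chart’s definition, while the projection horizon is purely illustrative.

```python
# Doubling-time model from the chart: Y = R * 2**(X / F), where R is the baseline
# rating, X is elapsed years, and F is the doubling time.
def fermi_cycles(x_years, r_baseline, f_doubling):
    """Projected Fermi Cycles after x_years, starting from r_baseline."""
    return r_baseline * 2 ** (x_years / f_doubling)

BASELINE = 1000.0   # 1000 Fermi Cycles = one 1 GHz Pentium III (per the chart)
for years in (0, 2, 4):
    moore = fermi_cycles(years, BASELINE, 1.5)      # Moore's law: F = 1.5 yr
    observed = fermi_cycles(years, BASELINE, 2.02)  # measured fit: F = 2.02 yr
    print("%d yr: Moore %.0f vs. observed fit %.0f" % (years, moore, observed))
```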

Fermi Cycles per ampere as a function of time

[Chart: Fermi Cycles per ampere vs. time.]

Fermi Cycles per dollar as a function of time

[Chart: Fermi Cycles per dollar vs. time.]

Strategy

  • Feynman Computing Center will remain the UPS- and generator-backed facility for important servers

  • New HDCF will have UPS for graceful shutdown but no generator backup. Designed for high-density compute nodes (plus a few tape robots).

  • 10-20 racks of existing 1U nodes will be moved to the new facility and re-racked

  • Anticipate 10-15 racks of newly purchased nodes this fall, also in the new building

Location of HDCF

1.5 miles away from FCC

No administrators will be housed there; the facility will be managed “lights out”

Floor plan of HDCF

Room for 72 racks in each of 2 computer rooms.

Cabling plan

Network Infrastructure

Will use bundles of individual Cat-6 cables

Current status

  • Construction began early May

  • Occupancy Nov/Dec 2004 (est).

  • Phase I+II, space for 56 racks at that time.

  • Expected cost: US$2.8M.

Power/console infrastructure

  • Cyclades AlterPath series

  • Includes console servers, network-based KVM adapters, and power strips

  • AlterPath ACS48 runs PPC Linux

  • Supports Kerberos 5 authentication

  • Access control can be set per port

  • Any number of power strip outlets can be associated with each machine on each console port.

  • All configurable via command line or Java-based GUI
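To illustrate how those per-port and per-outlet associations might be tracked on the management side, here is a minimal sketch; the host name, node names, port numbers, and outlet assignments are all hypothetical, and in practice the association itself lives in the AlterPath configuration described above.

```python
# Hypothetical inventory for one rack: each node is tied to a console-server
# port and to the power-strip outlets feeding it, so console access and power
# control can be driven from the same record.  All names are illustrative.
CONSOLE_SERVER = "acs48-rack01"   # hypothetical AlterPath ACS48 host name

RACK01 = {
    # node         console port    power strip      outlets
    "node0101": {"console_port": 1, "strip": "pm10-a", "outlets": [1]},
    "node0102": {"console_port": 2, "strip": "pm10-a", "outlets": [2]},
    # ... one entry per node in the rack
}

def locate(node):
    """Return (console server, port, strip, outlets) for a node, for whatever
    tool actually opens the console session or toggles the outlets."""
    rec = RACK01[node]
    return CONSOLE_SERVER, rec["console_port"], rec["strip"], rec["outlets"]

print(locate("node0102"))
```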

Power/console infrastructure (continued)

PM-10 Power strip

120VAC 30A

10 nodes/circuit

Four units/rack
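These figures are consistent with the rack-level estimate earlier in the talk: at the expected ~2.5 A per node, ten nodes put roughly 25 A on each 30 A circuit, and four strips of ten nodes give the 40-node, roughly 12 kVA rack described above.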

Installation with NPACI-Rocks

  • NPACI (National Partnership for Advanced Computational Infrastructure); lead institution is the San Diego Supercomputer Center

  • Rocks is the “ultimate cluster-in-a-box tool.” It combines a Linux distribution, a database, a heavily modified installer, and a large set of parallel-computing applications such as PBS, Maui, SGE, MPICH, Atlas, and PVFS.

  • Rocks 3.0 based on Red Hat Linux 7.3

  • Rocks 3.1 and greater based on the SRPMs of Red Hat Enterprise Linux 3.0.

Rocks vs. Fermi Linux comparison

Both Fermi Linux and Rocks 3.0 are built on Red Hat 7.3.

Fermi Linux adds:

  • Workgroups

  • Yum

  • OpenAFS

  • Fermi Kerberos/OpenSSH

Rocks 3.0 adds:

  • Extended kickstart

  • HPC applications

  • MySQL database

Rocks architecture vs. Fermi application

Rocks architecture:

  • Expects all compute nodes on a private net behind a firewall

  • Reinstall node if any changes

  • All network services (DHCP, DNS, NIS) supplied by the frontend

Fermi application:

  • Nodes on public net

  • Users won’t allow downtime for frequent reinstall

  • Use yum and other Fermi Linux tools for security updates

  • Configure Rocks to use our external network services

Fermi extensions to Rocks

  • Fermi production farms currently have 752 nodes, all installed with Rocks

  • This Rocks cluster has the most CPUs registered of any cluster at rocksclusters.org

  • Added extra tables to the database for customizing kickstart configurations (we have 14 different disk configurations; see the sketch after this list)

  • Added Fermi Linux comps files so that all Fermi workgroups, and all added Fermi RPMs, are available in installs

  • Made slave frontends act as install servers during mass-reinstall phases; during normal operation one install server is enough.

  • Added logic to recreate Kerberos keytabs
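As an illustration of what the extra kickstart tables enable, here is a minimal sketch of a per-node disk-configuration lookup. The table and column names (fermi_disk_configs, partition_scheme) and the connection parameters are hypothetical, not the actual Fermi schema; only the general approach of keying a node’s partition layout off the Rocks MySQL database reflects the slides.

```python
# Sketch: select a node's partitioning scheme from a hypothetical table added
# alongside the standard Rocks MySQL schema.  Names are illustrative only.
import MySQLdb  # MySQL-python, a standard Python interface to a MySQL database

def disk_config_for(node_name):
    """Return the partition-scheme label for a node, or a default."""
    db = MySQLdb.connect(host="localhost", user="rocks", db="cluster")
    cur = db.cursor()
    # Hypothetical table: fermi_disk_configs(node_name, partition_scheme)
    cur.execute("SELECT partition_scheme FROM fermi_disk_configs"
                " WHERE node_name = %s", (node_name,))
    row = cur.fetchone()
    db.close()
    if row:
        return row[0]
    return "default"

# The returned label would then pick one of the ~14 partitioning blocks
# emitted into the generated kickstart file.
```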

S.M.A.R.T. monitoring

  • The “smartd” daemon from the smartmontools package gives early warning of disk failures

  • Disk failures are ~70% of all hardware failures in our farms over the last 5 years.

  • Run a short self-test on all disks every day (see the sketch below)
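A minimal sketch of the daily check, assuming the smartctl utility from smartmontools is installed; this is not the actual Fermi wrapper (in production smartd itself schedules the tests), and the device list is illustrative.

```python
# Sketch: run smartd-style checks by hand with smartctl (from smartmontools).
# In production smartd is configured to schedule the daily short self-test;
# this only illustrates the equivalent calls for a couple of disks.
import subprocess

DISKS = ["/dev/sda", "/dev/sdb"]   # illustrative; enumerate the real devices

for disk in DISKS:
    # Start the short, non-destructive self-test (takes a few minutes).
    subprocess.call(["smartctl", "-t", "short", disk])
    # Print the overall health assessment; a failing disk reports
    # "FAILED" here, which is the early warning smartd watches for.
    subprocess.call(["smartctl", "-H", disk])
```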

Temperature/power monitoring

  • Wrappers for lm_sensors feed NGOP and Ganglia.

  • Measure the average temperature of each node over a month

  • Alarm when a node is 5 C or 10 C above its average

  • Page when 50% of any group is 10 C above average (see the sketch after this list)

  • Automated shutdown script activates when any single node is over emergency temperature.

  • Building-wide signal will provide notice that we are on UPS power and have 5 minutes to shut down.

  • Automated OS shutdown and SNMP poweroff scripts
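A simplified sketch of the alarm logic described above: the 5 C / 10 C / 50% thresholds are the ones quoted on this slide, while the emergency cutoff value, the data layout, and everything else are illustrative; the NGOP/Ganglia feeds and the actual shutdown and SNMP power-off commands are omitted.

```python
# Sketch of the temperature alarm thresholds: compare each node's current
# reading against its monthly average and escalate per the slide.
WARN_DELTA = 5.0        # C above monthly average -> warning alarm
ALARM_DELTA = 10.0      # C above monthly average -> serious alarm
EMERGENCY_TEMP = 70.0   # illustrative absolute cutoff for automated shutdown

def check_node(current_c, monthly_avg_c):
    """Return the action for one node; the real wrappers feed NGOP and Ganglia."""
    if current_c >= EMERGENCY_TEMP:
        return "shutdown"            # automated OS shutdown, then SNMP poweroff
    delta = current_c - monthly_avg_c
    if delta >= ALARM_DELTA:
        return "alarm"
    if delta >= WARN_DELTA:
        return "warn"
    return "ok"

def page_group(readings):
    """Page if at least 50% of a group is ALARM_DELTA above its own average."""
    hot = [1 for cur, avg in readings.values() if cur - avg >= ALARM_DELTA]
    return len(hot) >= 0.5 * len(readings)
```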

Reliability is key

  • Can only successfully manage remote clusters if hardware is reliable

  • All new contracts are written with the vendor providing a 3-year parts-and-labor warranty; they only make money if they build good hardware

  • A 30-day acceptance test is critical to identify hardware problems and fix them before production begins.

  • With 750 nodes and 99% reliability, roughly 8 nodes (1% of 750) would still be down on any given day.

  • Historically reliability is closer to 96%, but the new Intel Xeon-based nodes are much better.
