Using Description-Based Methods for Provisioning Clusters – PowerPoint presentation transcript (37 slides)

Using Description-Based Methods for Provisioning Clusters

Philip M. Papadopoulos

San Diego Supercomputer Center

University of California, San Diego

Computing Clusters
  • How I spent my Weekend
  • Why should you care about software configuration
  • Description-based configuration
    • Taking the administrator out of cluster administration
  • XML-based assembly instructions
  • What’s next
A Story from this last Weekend
  • Attempt a TOP500 run on two fused 128-node PIII (1 GHz, 1 GB mem) clusters
    • 100 Mbit Ethernet, Gigabit to frontend
    • Myrinet 2000; 128-port switch on each cluster
  • Questions
    • What LINPACK performance could we get?
    • Would Rocks scale to 256 nodes?
    • Could we set up/teardown and run benchmarks in the allotted 48 hours?
      • Would we get any sleep?
  • Fri: Started 5:30pm. Built new frontend. Physical rewiring of myrinet, added ethernet switch.
  • Fri: Midnight. Solved some ethernet issues. Completely reinstalled all nodes.
  • Sat: 12:30a Went to sleep.
  • Sat: 6:30a. Woke up. Submitted first LINPACK runs (225 nodes)

[Diagram: new frontend joined to two 128-node clusters (120 on Myrinet each) by 8 Myrinet cross connects]

Mid-Experiment
  • Sat 10:30a. Fixed a failed disk and a bad SDRAM. Started runs on 240 nodes. Went shopping at 11:30a
  • 4:30p. Examined output. Started ~6 hrs of runs
  • 11:30p. Examined more results. Submitted final large runs. Went to sleep
  • 285 GFlops
  • 59.5% Peak
  • Over 22 hours of continuous computing

[Chart: 240 dual PIII (1 GHz, 1 GB) nodes on Myrinet]

Final Clean-up
  • Sun 9a. Submitted 256 node Ethernet HPL
  • Sun 11:00a. Added GigE to frontend. Started complete reinstall of all 256 nodes.
    • ~40 Minutes for Complete reinstall (return to original system state)
    • Did find some (fixable) issues at this scale
      • Too Many open database connections (fixing for next release)
      • DHCP client not resilient enough. Server was fine.
  • Sun 1:00p. Restored wiring, rebooted original frontends. Started reinstall tests. Some debug of installation issues uncovered at 256 nodes.
  • Sun 5:00p. Went home. Drank Beer.
Installation, Reboot, Performance

[Chart: 32-node re-install timeline, ending at "Start HPL"]

  • < 15 minutes to reinstall a 32-node subcluster (rebuilt Myrinet driver)
  • 2.3 min for a 128-node reboot
Key Software Components
  • HPL + ATLAS (Univ. of Tennessee)
  • The truckload of open source software that we all build on
  • Red Hat Linux distribution and installer
    • de facto standard
  • NPACI Rocks
    • Complete cluster-aware toolkit that extends a Redhat Linux distribution
  • (Inherent Belief that “simple and fast” can beat out “complex and fast”)
Why Should You Care about Cluster Configuration/Administration?
  • “We can barely make clusters work.” – Gordon Bell @ CCGCS02
  • “No Real Breakthroughs in System Administration in the last 20 years” – P. Beckman @ CCGCS02
  • System Administrators love to continuously twiddle knobs
    • Similar to a mechanic continually adjusting air/fuel mixture on your car
    • Or worse: Randomly unplugging/plugging spark plug wires
  • Turnkey/Automated systems remove the system admin from the equation
    • Sysadmin doesn’t like this.
    • This is good
Clusters aren’t homogeneous
  • Real clusters are more complex than a pile of identical computing nodes
    • Hardware divergence as cluster changes over time
    • Logical heterogeneity of specialized servers
      • IO servers, Job Submission, Specialized configs for apps, …
  • Garden-variety cluster owner shouldn’t have to fight the provisioning issues.
    • Get them to the point where they are fighting the hard application parallelization issues
    • They should be able to easily follow software/hardware trends
  • Moderate-sized (up to ~128 nodes) clusters are the “standard” grid endpoints
    • A non-uber administrator handles two 128 node Rocks clusters at SIO (260:1 admin to system ratio)
NPACI Rocks Toolkit – rocks.npaci.edu
  • Techniques and software for easy installation, management, monitoring and update of Linux clusters
    • A complete cluster-aware distribution and configuration system.
  • Installation
    • Bootable CD + floppy which contain all the packages and site configuration info to bring up an entire cluster
  • Management and update philosophies
    • Trivial to completely reinstall any (all) nodes.
    • Nodes are 100% automatically configured
    • RedHat Kickstart to define software/configuration of nodes
    • Software is packaged in a query-enabled format
    • Never try to figure out if node software is consistent
      • If you ever ask yourself this question, reinstall the node
    • Extensible, programmable infrastructure for all node types that make up a real cluster.
Tools Integrated
  • Standard cluster tools
    • MPICH, PVM, PBS, Maui (SSH, SSL -> Red Hat)
  • Rocks add ons
    • Myrinet support
      • GM device build (RPM), RPC-based port reservation (usher-patron)
      • MPI-launch (understands the port reservation system)
    • Rocks-dist – distribution workhorse
    • XML (programmable) Kickstart
    • eKV (console redirect to ethernet during install)
    • Automated mySQL database setup
    • Ganglia Monitoring (U.C. Berkeley and NPACI)
    • Stupid pet administration scripts
  • Other tools
    • PVFS
    • ATLAS BLAS, High Performance Linpack
Support for Myrinet
  • Myrinet device driver must be versioned to the exact kernel version (e.g., SMP, options) running on a node
    • Source (a source RPM) is compiled at reinstallation on every Myrinet node (adds ~2 minutes to installation)
    • Device module is then installed (insmod).
    • GM_mapper run (add node to the network)
  • Myrinet ports are limited and must be identified with a particular rank in a parallel program
    • RPC-based reservation system for Myrinet ports
    • Client requests port reservation from desired nodes
    • Rank mapping file (gm.conf) created on-the-fly
    • No centralized service needed to track port allocation
    • MPI-launch hides all the details of this
  • HPL (LINPACK) comes pre-packaged for Myrinet
  • Build your Rocks cluster, see where it sits on the Top500
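The port-reservation flow above can be sketched in miniature. The Python below is illustrative, not the Rocks code: `reserve_ports` stands in for the RPC to each node's reservation daemon, and the exact gm.conf line format is an assumption.

```python
# Sketch: build a gm.conf-style rank-mapping file on the fly, the way the
# RPC-based port reservation above does. Helper names and the file layout
# are illustrative assumptions, not the real Rocks implementation.

def reserve_ports(nodes, ports_per_node):
    """Pretend each node granted us its next free GM ports (in Rocks this
    would be an RPC to a per-node reservation service)."""
    return {node: list(range(2, 2 + ports_per_node)) for node in nodes}

def make_gm_conf(nodes, procs_per_node):
    """Emit one '<host> <port>' line per MPI rank, node-major order,
    with a header line giving the total rank count."""
    grants = reserve_ports(nodes, procs_per_node)
    lines = [str(len(nodes) * procs_per_node)]
    for node in nodes:
        for port in grants[node]:
            lines.append(f"{node} {port}")
    return "\n".join(lines) + "\n"

print(make_gm_conf(["compute-0-0", "compute-0-1"], 2))
```

Because the mapping file is regenerated per job, no central service has to track which ports are in use.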
Key Ideas
  • No difference between OS and application software
    • OS installation is completely disposable
    • Unique state that is kept only at a node is bad
      • Creating unique state at the node is even worse
  • Software bits (packages) are separated from configuration
    • Diametrically opposite from “golden image” methods
  • Description-based configuration rather than image-based
    • Installed OS is “compiled” from a graph.
  • Inheritance of software configurations
    • Distribution
    • Configuration
  • Single-step installation of updated software/OS
    • Security patches pre-applied to the distribution not post-applied on the node

Rocks extends installation as a basic way to manage software on a cluster

It becomes trivial to ensure software consistency across a cluster

Rocks Disentangles Software Bits (distributions) and Configuration

[Diagram: one collection of all possible software packages (AKA the distribution), plus descriptive information to configure a node, feed three node types: compute node, IO server, web server]

Managing Software Distributions

[Diagram as above: distribution plus per-node configuration descriptions]

Rocks-dist: A Repeatable Process for Creating Localized Distributions
  • # rocks-dist mirror
    • Rocks mirror
      • Rocks 2.2 release
      • Rocks 2.2 updates
  • # rocks-dist dist
    • Create distribution
      • Rocks 2.2 release
      • Rocks 2.2 updates
      • Local software
      • Contributed software
  • This is the same procedure NPACI Rocks uses.
    • Organizations can customize Rocks for their site.
  • Iterate, extend as needed
Description-based Configuration

[Diagram as above: distribution plus per-node configuration descriptions]

Description-based Configuration
  • Infrastructure that “describes“ the roles of cluster nodes
    • Nodes are assembled using Red Hat's kickstart
      • Components are defined in XML files that completely describe and configure software subsystems
      • Software bits are in standard RPMs.
  • Kickstart files are built on the fly, tailored for each node through database interaction
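The on-the-fly generation idea can be sketched as follows; the table layout, field names, and template below are illustrative assumptions, not the actual Rocks schema.

```python
# Sketch of building a per-node kickstart file from a node database.
# A dict stands in for the MySQL node table; the template and field
# names are invented for illustration.

NODES = {
    "compute-0-0": {"ip": "10.1.255.254", "appliance": "compute"},
    "compute-0-1": {"ip": "10.1.255.253", "appliance": "compute"},
}

TEMPLATE = """\
network --bootproto static --ip {ip} --hostname {name}
# appliance: {appliance}
%packages
@ {appliance}
"""

def kickstart_for(name):
    """Tailor one kickstart file for one node by database lookup."""
    row = NODES[name]
    return TEMPLATE.format(name=name, **row)

print(kickstart_for("compute-0-0"))
```

Since nothing node-specific is stored on the node itself, regenerating this file and reinstalling always returns the node to a known state.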


What is a Kickstart file?

Setup/Packages (20%):

zerombr yes
bootloader --location mbr --useLilo
auth --useshadow --enablemd5
clearpart --all
part /boot --size 128
part swap --size 128
part / --size 4096
part /export --size 1 --grow
lang en_US
langsupport --default en_US
keyboard us
mouse genericps/2
timezone --utc GMT
rootpw --iscrypted nrDG4Vb8OjjQ.

Package Configuration (80%):

cat > /etc/nsswitch.conf << 'EOF'
passwd: files
shadow: files
group: files
hosts: files dns
bootparams: files
ethers: files
EOF

cat > /etc/ntp.conf << 'EOF'
fudge stratum 10
authenticate no
driftfile /etc/ntp/drift
EOF

/bin/mkdir -p /etc/ntp
cat > /etc/ntp/step-tickers << 'EOF'
EOF

/sbin/hwclock --systohc

Portable (ASCII), not programmable, O(30 KB)

What are the Issues?
  • Kickstart file is ASCII
    • There is some structure
      • Pre-configuration
      • Package list
      • Post-configuration
  • Not a “programmable” format
    • Most complicated section is post-configuration
      • Usually this is handcrafted
    • Really want to be able to build sections of the kickstart file from pieces
      • Straightforward extension to new software, different OS
Focus on the Notion of “Appliances”

How do you define the configuration of nodes with special attributes/capabilities?

Assembly Graph of a Complete Cluster

- “Complete” Appliances (compute, NFS, frontend, desktop, …)

- Some key shared configuration nodes (slave-node, node, base)

Describing Appliances
  • Purple appliances all include “slave-node”
    • Or derived from slave-node
  • Small differences are readily apparent
    • Portal and NFS have “extra-nic”; Compute does not
    • Compute runs “pbs-mom”; NFS and Portal do not
  • Can compose some appliances
    • Compute-pvfs IsA compute and IsA pvfs-io
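The composition rules above can be mimicked with a small graph walk. The edges and package lists below are invented to mirror the examples on this slide (slave-node, extra-nic, pbs-mom, pvfs-io); the real Rocks graph lives in XML files.

```python
# Sketch: resolve an appliance to its package set by walking the
# configuration graph. Edges and per-module package lists are
# illustrative, not the actual Rocks graph.

GRAPH = {  # appliance/module -> modules it includes
    "compute":      ["slave-node", "pbs-mom"],
    "nfs":          ["slave-node", "extra-nic"],
    "compute-pvfs": ["compute", "pvfs-io"],   # composition: IsA both
    "slave-node":   ["node"],
    "node":         ["base"],
}

PACKAGES = {  # module -> RPMs it contributes (illustrative)
    "base": ["glibc"], "node": ["openssh"], "slave-node": ["autofs"],
    "pbs-mom": ["pbs-mom"], "extra-nic": [], "pvfs-io": ["pvfs"],
}

def resolve(appliance, seen=None):
    """Depth-first walk; each module contributes its packages once."""
    seen = set() if seen is None else seen
    if appliance in seen:
        return set()
    seen.add(appliance)
    pkgs = set(PACKAGES.get(appliance, []))
    for child in GRAPH.get(appliance, []):
        pkgs |= resolve(child, seen)
    return pkgs

print(sorted(resolve("compute-pvfs")))
```

Shared modules like slave-node are visited once no matter how many appliances include them, which is what makes the small differences between appliances easy to see.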
Architecture Dependencies
  • Focus only on the differences in architectures
    • Logically, an IA-64 compute node is identical to an IA-32 one
  • Architecture type is passed from the top of graph
  • Software bits (x86 vs. IA64) are managed in the distribution
XML Used to Describe Modules
  • Abstract package names, versions, architecture
  • Allow an administrator to encapsulate a logical subsystem
  • Node-specific configuration is retrieved from our database
    • IP address
    • Firewall policies
    • Remote access policies

<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@" [<!ENTITY ssh "openssh">]>
<kickstart>

<description> Enable SSH </description>

<package>&ssh;</package>
<package>&ssh;-clients</package>
<package>&ssh;-server</package>
<package>&ssh;-askpass</package>
<!-- include XFree86 packages for xauth -->

<post>
cat &gt; /etc/ssh/ssh_config &lt;&lt; 'EOF' <!-- default client setup -->
Host *
        CheckHostIP             no
        ForwardX11              yes
        ForwardAgent            yes
        StrictHostKeyChecking   no
        UsePrivilegedPort       no
        FallBackToRsh           no
        Protocol                1,2
EOF
</post>
</kickstart>
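A generator along these lines might flatten such a module into kickstart text. The sketch below uses a simplified module with no DTD: the `&ssh;` entity is expanded by plain string substitution as a stand-in for real entity handling, and the output format is abbreviated.

```python
# Sketch: turn an XML module (like the SSH one above, simplified) into
# a kickstart fragment. Entity expansion is faked with str.replace;
# the module content here is an illustrative reduction.
import xml.etree.ElementTree as ET

MODULE = """\
<kickstart>
  <description>Enable SSH</description>
  <package>&ssh;</package>
  <package>&ssh;-clients</package>
  <package>&ssh;-server</package>
  <post>chkconfig sshd on</post>
</kickstart>
"""

def to_kickstart(xml_text, entities):
    """Expand entities, then emit %packages and %post sections."""
    for name, value in entities.items():
        xml_text = xml_text.replace(f"&{name};", value)
    root = ET.fromstring(xml_text)
    pkgs = [p.text.strip() for p in root.findall("package")]
    post = root.findtext("post", default="").strip()
    return "%packages\n" + "\n".join(pkgs) + "\n%post\n" + post + "\n"

print(to_kickstart(MODULE, {"ssh": "openssh"}))
```

Swapping `openssh` for another implementation in one entity definition would change every package line at once, which is the point of abstracting package names.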

Space-Time and HTTP

[Diagram: space-time sequence between a node and its servers — IP + kickstart URL; kickstart request; generate file; request package; serve packages; install package; post config]

  • HTTP:
  • Kickstart URL (generator) can be anywhere
  • Package server can be (a different) anywhere


Subsystem Replacement is Easy
  • Binaries are in de facto standard package format (RPM)
  • XML module files (components) are very simple
  • Graph interconnection (global assembly instructions) is separate from configuration
  • Examples
    • Replace PBS with Sun Grid Engine
    • Upgrade version of OpenSSH or GCC
    • Turn on RSH (not recommended)
    • Purchase commercial compiler (recommended)
  • 100s of clusters have been built with Rocks on a wide variety of physical hardware
  • Installation/Customization is done in a straightforward programmatic way
    • Scaling is excellent
  • HTTP is used as a transport for reliability/performance
    • Configuration Server does not have to be in the cluster
    • Package Server does not have to be in the cluster
    • (Sounds grid-like)
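The transport claim — the kickstart generator and package server can each be anywhere a URL can point — is easy to demonstrate with a throwaway HTTP generator. Everything here (the path, the `url --url` line) is invented for illustration.

```python
# Sketch: a kickstart "file" served over HTTP is just a document
# generated per request, tailored to the requesting node (here, by
# client IP). Paths and the response content are invented.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class KickstartHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = f"# kickstart generated for {self.client_address[0]}\n"
        body += "url --url http://example-server/dist\n"
        data = body.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # keep the sketch quiet
        pass

server = HTTPServer(("127.0.0.1", 0), KickstartHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/install/kickstart.cgi"
with urllib.request.urlopen(url) as resp:
    ks = resp.read().decode()
server.shutdown()
print(ks)
```

Nothing ties the generator to the cluster's frontend: a node only needs a URL, which is why the configuration and package servers can sit outside the cluster entirely.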
What’s still missing?
  • Improved Monitoring
    • Monitoring Grids of Clusters
    • “Personal” cluster monitor
  • Straightforward Integration with Grid (Software)
    • Will use NMI (NSF Middleware Initiative) software as a basis (a Grid endpoint should be no harder to set up than a cluster)
  • Any sort of real IO story
    • PVFS is a toy (no reliability)
    • NFS doesn’t scale properly (no stability)
    • MPI-IO is only good for people willing to rework large sections of code (most users want open/close/read/write)
  • Real parallel job control
    • MPD looks promising
Integration with Grid Software
  • As part of NMI, SDSC is working on a package called gridconfig
    • Provide a higher level tool for creating and consistently managing configuration files for software components
      • Allows config data (e.g., path names, policy files) to be shared among unrelated components
      • Does NOT require changes to config file formats
      • Globus GT2 has over a dozen configuration files
        • Over 100 configurable parameters
        • Many files share similar information
  • Will be part of NMI-R2
Some Gridconfig Details

[Screenshots: gridconfig top-level components; Globus details (example)]

Accepting Configuration
  • Parameters from previous screens are stored in a simple schema (MySQL database backend)
  • From these parameters, native-format configuration files are written
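The parameters-to-native-files step can be sketched like this; a dict stands in for the MySQL schema, and both output formats are invented examples. The point is that unrelated components draw shared values from one place without changing their own config formats.

```python
# Sketch of the gridconfig idea: one schema of shared parameters,
# rendered into each component's native file format. Parameter names
# and file formats below are illustrative assumptions.

PARAMS = {  # single source of truth for values many components share
    "globus_location": "/opt/globus",
    "gridftp_port": "2811",
    "log_dir": "/var/log/grid",
}

def render_gridftp_conf(p):
    # hypothetical flat key/value format
    return f"port {p['gridftp_port']}\nlog {p['log_dir']}/gridftp.log\n"

def render_env_sh(p):
    # hypothetical shell-style format sharing the same values
    return f"export GLOBUS_LOCATION={p['globus_location']}\n"

files = {
    "gridftp.conf": render_gridftp_conf(PARAMS),
    "globus-env.sh": render_env_sh(PARAMS),
}
for name, body in files.items():
    print(f"--- {name} ---\n{body}")
```

Changing a value once in the schema and re-rendering keeps every dependent file consistent, which is hard to do by hand across a dozen Globus config files.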
Rocks Near Term
  • Version 2.3 in alpha now; beta by the end of the first week of October
    • Red Hat 7.3
    • Integration with Sun Grid Engine (SCS – Singapore)
    • Resource target for SCE (P. Uthayopas, KU, Thailand)
    • Bug fixes, additional software components
    • Some low-level engineering changes in interaction with the Red Hat installer
    • Working toward NMI-ready (point release in Dec./early Jan.)
Parting Thoughts
  • Why is grid software so hard to successfully deploy?
  • If large cluster configurations can really be thought of as disposable, shouldn’t virtual organizations be just as easily constructed?