
How Much Commodity is Enough? The Red Storm Architecture

William J. Camp, James L. Tomkins & Rob Leland

CCIM, Sandia National Laboratories

Albuquerque, NM

Sandia MPPs (since 1987)

  • 1987: 1024-processor nCUBE10 [512 Mflops]

  • 1990--1992: two 1024-processor nCUBE-2 machines [2 @ 2 Gflops]

  • 1988--1990: 16384-processor CM-200

  • 1991: 64-processor Intel IPSC-860

  • 1993--1996: ~3700-processor Intel Paragon [180 Gflops]

  • 1996--present: 9400-processor Intel TFLOPS (ASCI Red) [3.2 Tflops]

  • 1997--present: 400 --> 2800 processors in Cplant Linux Cluster [~3 Tflops]

  • 2003: 1280-processor IA32- Linux cluster [~7 Tflops]

  • 2004: Red Storm: ~11600 processor Opteron-based MPP [>40 Tflops]

Our rubric (since 1987)

  • Complex, mission-critical, engineering & science applications

  • Large systems (1000’s of PE’s) with a few processors per node

  • Message passing paradigm

  • Balanced architecture

  • Use commodity wherever possible

  • Efficient systems software

  • Emphasis on scalability & reliability in all aspects

  • Critical advances in parallel algorithms

  • Vertical integration of technologies

A partitioned, scalable computing architecture

[Diagram: partitioned system; labeled partitions include File I/O and Net I/O]

Computing domains at Sandia

  • Red Storm is targeting the highest-end market but has real advantages for the mid-range market (from 1 cabinet on up)

Red Storm Architecture

  • True MPP, designed to be a single system-- not a cluster

  • Distributed memory MIMD parallel supercomputer

  • Fully connected 3D mesh interconnect. Each compute node processor has a bi-directional connection to the primary communication network

  • 108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0--2.4 GHz)

  • ~10 or 20 TB of DDR memory @ 333MHz

  • Red/Black switching: ~1/4, ~1/2, ~1/4 (for data security)

  • 12 Service, Visualization, and I/O cabinets on each end (640 S,V & I processors for each color)

  • 240 TB of disk storage (120 TB per color) initially
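The headline figures above are mutually consistent; a quick sketch (all values taken from this slide) cross-checks them:

```python
# Cross-check the memory and storage totals quoted on this slide.
procs = 10_368                    # compute node processors

# "~10 or 20 TB of DDR" corresponds to 1 or 2 GB per processor.
mem_tb_1gb = procs * 1 / 1024     # ~10.1 TB
mem_tb_2gb = procs * 2 / 1024     # ~20.25 TB

disk_per_color_tb = 240 / 2       # 240 TB total -> 120 TB per color

print(mem_tb_1gb, mem_tb_2gb, disk_per_color_tb)
```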

Red Storm Architecture

  • Functional hardware partitioning: service and I/O nodes, compute nodes, Visualization nodes, and RAS nodes

  • Partitioned Operating System (OS): LINUX on Service, Visualization, and I/O nodes, LWK (Catamount) on compute nodes, LINUX on RAS nodes

  • Separate RAS and system management network (Ethernet)

  • Router table-based routing in the interconnect

  • Less than 2 MW total power and cooling

  • Less than 3,000 ft² of floor space

Usage Model

Unix (Linux) login node with Unix environment

[Diagram: usage model]
User sees a coherent, single system

Thor’s Hammer Topology

  • 3D-mesh Compute node topology:

    • 27 x 16 x 24 (x, y, z) – Red/Black split: 2,688 – 4,992 – 2,688

  • Service, Visualization, and I/O partitions

    • 3 cabinets on each end of each row

      • 384 full bandwidth links to Compute Node Mesh

      • Not all nodes have a processor-- all have routers

    • 256 PE’s in each Visualization Partition--2 per board

    • 256 PE’s in each I/O Partition-- 2 per board

    • 128 PE’s in each Service Partition-- 4 per board
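The mesh dimensions and the Red/Black split above are consistent; the sketch below checks them (the 7/13/7 X-plane split is inferred from the numbers, not stated on the slide):

```python
# Check the 3D mesh dimensions and Red/Black split quoted above.
x, y, z = 27, 16, 24
nodes = x * y * z                 # compute nodes in the mesh
plane = y * z                     # nodes in one X-plane (16 x 24 = 384)

split = (2_688, 4_992, 2_688)     # Red / switchable / Black sections
x_planes = [s // plane for s in split]

print(nodes, x_planes)            # each section is a whole number of X-planes
```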

3-D Mesh topology (Z direction is a torus)

TorusInterconnectin Z


640 Visualization Service & I/O Nodes

640 Visualization, Service& I/O Nodes

10,368ComputeNode Mesh



Thor’s Hammer Network Chips

  • 3D-mesh is created by SEASTAR ASIC:

    • Hyper-transport Interface and 6 network router ports on each chip

    • In compute partitions each processor has its own SEASTAR

    • In service partition, some boards are configured like compute partition (4 PE’s per board)

    • Others have only 2 PE's per board, but still have 4 SEASTARs

      • So, network topology is uniform

  • SEASTAR designed by Cray to our specs; fabricated by IBM

    • The only truly custom part in Red Storm-- complies with HT open standard

Node architecture

[Diagram: node architecture -- DRAM 1 (or 2) GB or more; six links to other nodes in X, Y, and Z. ASIC = Application-Specific Integrated Circuit, or a "custom chip"]

System Layout(27 x 16 x 24 mesh)






[Diagram: system layout; disconnect cabinets separate the Red/Black sections]

Thor's Hammer Cabinet Layout

[Diagram: compute node cabinet with CPU boards; footprint ~2 ft x 4 ft]

  • Compute Node Partition

    • 3 Card Cages per Cabinet

    • 8 Boards per Card Cage

    • 4 Processors per Board

    • 4 NIC/Router Chips per Board

    • N + 1 Power Supplies

    • Passive Backplane

  • Service, Viz, and I/O Node Partition

    • 2 (or 3) Card Cages per Cabinet

    • 8 Boards per Card Cage

    • 2 (or 4) Processors per Board

    • 4 NIC/Router Chips per Board

    • 2-PE I/O Boards have 4 PCI-X busses

    • N + 1 Power Supplies

    • Passive Backplane
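The cabinet counts above recover the system totals quoted earlier; a quick check (108 compute cabinets taken from the architecture slide):

```python
# Per-cabinet arithmetic for the compute partition listed above.
card_cages = 3
boards_per_cage = 8
procs_per_board = 4
routers_per_board = 4

procs_per_cabinet = card_cages * boards_per_cage * procs_per_board
routers_per_cabinet = card_cages * boards_per_cage * routers_per_board

# 108 compute cabinets recovers the 10,368-processor total.
total_procs = 108 * procs_per_cabinet

print(procs_per_cabinet, routers_per_cabinet, total_procs)
```

Note that service/I/O boards keep 4 NIC/router chips even with only 2 processors populated, which is what keeps the mesh topology uniform.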




  • Peak of 41.4 (46.6) TF based on 2 floating-point instruction issues per clock at 2.0 GHz.

  • We required a 7-fold speedup versus ASCI Red but, based on our benchmarks, expect performance 8--10 times faster than ASCI Red.

  • Expected MP-Linpack performance: ~30--35 TF

  • Aggregate system memory bandwidth: ~55 TB/s

  • Interconnect Performance:

    • Latency <2 µs (neighbor), <5 µs (full machine)

    • Link bandwidth ~ 6.0 GB/s bi-directional

    • Minimal XC bi-section bandwidth ~2.3 TB/s


  • I/O System Performance

    • Sustained file system bandwidth of 50 GB/s for each color

    • Sustained external network bandwidth of 25 GB/s for each color

  • Node memory system

    • Page miss latency to local memory is ~80 ns

    • Peak bandwidth of ~5.4 GB/s for each processor
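The figures above hang together as back-of-envelope arithmetic; the sketch below recomputes them from the per-node numbers (all inputs from these slides):

```python
# Back-of-envelope checks on the performance figures above.
procs = 10_368

# 2 FP issues per clock at 2.0 GHz across all processors:
peak_tf = procs * 2 * 2.0e9 / 1e12          # ~41.5 TF, matching "41.4"

# Link bandwidth per node peak: 6.0 GB/s against 4 GF/node at 2.0 GHz.
bytes_per_flop = 6.0 / (2 * 2.0)            # 1.5 B/flop at 2.0 GHz

# ~55 TB/s aggregate memory bandwidth spread over all processors:
mem_bw_gbs = 55e12 / procs / 1e9            # ~5.3 GB/s, matching "~5.4"

print(round(peak_tf, 1), bytes_per_flop, round(mem_bw_gbs, 1))
```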

Red Storm System Software

  • Operating Systems

    • LINUX on service and I/O nodes

    • Sandia’s LWK (Catamount) on compute nodes

    • LINUX on RAS nodes

  • Run-Time System

    • Logarithmic loader

    • Fast, efficient Node allocator

    • Batch system – PBS

    • Libraries – MPI, I/O, Math

  • File Systems being considered include

    • PVFS – interim file system

    • Lustre – Design Intent

    • Panasas-- possible alternative

Red Storm System Software

  • Tools

    • All IA32 Compilers, all AMD 64-bit Compilers – Fortran, C, C++

    • Debugger – Totalview (also examining alternatives)

    • Performance Tools (was going to be Vampir until Intel bought Pallas-- now?)

  • System Management and Administration

    • Accounting

    • RAS GUI Interface

Red Storm Project

  • 23 months, design to First Product Shipment!

  • System software is a joint project between Cray and Sandia

    • Sandia is supplying Catamount LWK and the service node run-time system

    • Cray is responsible for Linux, NIC software interface, RAS software, file system software, and Totalview port

    • Initial software development was done on a cluster of workstations with a commodity interconnect. Second stage involves an FPGA implementation of SEASTAR NIC/Router (Starfish). Final checkout on real SEASTAR-based system

  • System design is going on now

    • Cabinets-- exist

    • SEASTAR NIC/Router-- released to Fabrication at IBM earlier this month

  • Full system to be installed and turned over to Sandia in stages culminating in August--September 2004


Designing for scalable supercomputing

Challenges in:






SUREty for Very Large Parallel Computer Systems

SURE poses Computer System Requirements:

Scalability - Full System Hardware and System Software

Usability - Required Functionality Only

Reliability - Hardware and System Software

Expense minimization - use commodity, high-volume parts


  • SURE Architectural tradeoffs:

    • Processor and memory sub-system balance

    • Compute vs. interconnect balance

    • Topology choices

    • Software choices

    • RAS

    • Commodity vs. custom technology

    • Geometry and mechanical design


Sandia Strategies:

- build on commodity

- leverage Open Source (e.g., Linux)

- add to commodity selectively (in Red Storm there is basically one truly custom part!)

- leverage experience with previous scalable supercomputers


System Scalability Driven Requirements

Overall System Scalability - Complex scientific applications such as molecular dynamics, hydrodynamics & radiation transport should achieve scaled parallel efficiencies greater than 50% on the full system (~20,000 processors).
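The >50% scaled-efficiency target can be read through Gustafson's law for weak scaling; the serial fractions tried below are illustrative, not taken from the slide:

```python
# Weak-scaling (scaled) efficiency via Gustafson's law. The serial
# fractions below are illustrative; the slide states only the >50%
# target at ~20,000 processors.
def scaled_efficiency(n_procs, serial_frac):
    # Scaled speedup = s + (1 - s) * N; efficiency = speedup / N.
    speedup = serial_frac + (1 - serial_frac) * n_procs
    return speedup / n_procs

# At large N the efficiency approaches 1 - serial_frac, so the 50%
# target means keeping per-process serial/overhead work under ~50%.
for s in (0.01, 0.25, 0.5):
    print(s, round(scaled_efficiency(20_000, s), 4))
```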




System Software:

System Software Performance scales nearly perfectly with the number of processors to the full size of the computer (~30,000 processors). This means that System Software time (overhead) remains nearly constant with the size of the system or scales at most logarithmically with the system size.

- Full re-boot time scales logarithmically with the system size.

- Job loading is logarithmic with the number of processors.

- Parallel I/O performance is not sensitive to # of PEs doing I/O

- Communication Network software must be scalable.

- No connection-based protocols among compute nodes.

- Message buffer space independent of # of processors.

- Compute node OS gets out of the way of the application.
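The logarithmic-loading requirement above follows from a tree fan-out: every node that already holds the executable forwards it, so coverage doubles each step. A minimal sketch (the real loader's fan-out factor is not specified on the slide):

```python
# Why job loading can be logarithmic in the number of processors:
# with a binary tree fan-out, the set of loaded nodes doubles per step.
def fanout_steps(n_nodes):
    loaded, steps = 1, 0
    while loaded < n_nodes:
        loaded *= 2        # each loaded node sends one copy per step
        steps += 1
    return steps

print(fanout_steps(30_000))   # 15 steps, vs ~30,000 sequential sends
```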

Hardware scalability

  • Balance in the node hardware:

    • Memory BW must match CPU speed

      • Ideally 24 Bytes/flop (never yet done)

    • Communications speed must match CPU speed

    • I/O must match CPU speeds

  • Scalable System SW (OS and Libraries)

  • Scalable Applications



Application Code Support:

    • Software that supports scalability of the Computer System

      • Math Libraries

      • MPI Support for Full System Size

      • Parallel I/O Library

    • Tools that Scale to the Full Size of the Computer System

      • Performance Monitors

    • Full-featured LINUX OS support at the user interface


    • Light Weight Kernel (LWK) OS on compute partition

      • Much less code fails much less often

    • Monitoring of correctable errors

      • Fix soft errors before they become hard

    • Hot swapping of components

      • Overall system keeps running during maintenance

    • Redundant power supplies & memories

    • Completely independent RAS System monitors virtually every component in system


  • Economy

    • Use high-volume parts where possible

    • Minimize power requirements

      • Cuts operating costs

      • Reduces need for new capital investment

    • Minimize system volume

      • Reduces need for large new capital facilities

    • Use standard manufacturing processes where possible-- minimize customization

    • Maximize reliability and availability/dollar

    • Maximize scalability/dollar

    • Design for integrability

  • Economy

    • Red Storm leverages economies of scale

      • AMD Opteron microprocessor & standard memory

      • Air cooled

      • Electrical interconnect based on Infiniband physical devices

      • Linux operating system

    • Selected use of custom components

      • System chip ASIC

        • Critical for communication intensive applications

      • Light Weight Kernel

        • Truly custom, but we already have it (4th generation)

Cplant on a slide

[Diagram: service nodes, I/O nodes (file I/O), compute nodes, net I/O, system support, sys admin]

    • Goal: MPP “look and feel”

    • Start ~1997, upgrade ~1999--2001

    • Alpha & Myrinet, mesh topology

    • ~3000 procs (3Tf) in 7 systems

    • Configurable to ~1700 procs

    • Red/Black switching

    • Linux w/ custom runtime & mgmt.

    • Production operation for several yrs.


IA-32 Cplant on a slide

[Diagram: service nodes, I/O nodes (file I/O), compute nodes, net I/O, system support, sys admin]

    • Goal: Mid-range capacity

    • Started 2003, upgrade annually

    • Pentium-4 & Myrinet, Clos network

    • 1280 procs (~7 Tf) in 3 systems

    • Currently configurable to 512 procs

    • Linux w/ custom runtime & mgmt.

    • Production operation for several yrs.




    For most large scientific and engineering applications, performance is determined more by parallel scalability than by the speed of individual CPUs.

    There must be balance between processor, interconnect, and I/O performance to achieve overall performance.

    To date, only a few tightly-coupled, parallel computer systems have been able to demonstrate a high level of scalability on a broad set of scientific and engineering applications.

Let's Compare Balance In Parallel Systems

[Table: Node Speed Rating (MFlops), Network Link BW, and Communications Balance for ASCI RED**, Blue Mtn*, and Blue Pacific. Recoverable entries: Blue Mtn 1200 (9600*) and 0.02 (0.16*); Blue Pacific 300 (132) and 0.11 (0.05); remaining cells lost in transcription.]

Comparing Red Storm and BGL

                        Blue Gene Light**    Red Storm*
    Node Speed          5.6 GF               5.6 GF (1x)
    Node Memory         0.25--0.5 GB         2 (1--8) GB (4x nom.)
    Network latency     7 µsecs              2 µsecs (2/7 x)
    Network BW          0.28 GB/s            6.0 GB/s (22x)
    BW Bytes/Flops      0.05                 1.1 (22x)
    Bi-Section B/F      0.0016               0.038 (24x)
    #nodes/problem      40,000               10,000 (1/4 x)

    *100 TF version of Red Storm
    **360 TF version of BGL
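The ratio column follows from the raw figures in the table; recomputing it shows only small rounding differences from the slide's quoted multipliers:

```python
# Recompute the Red Storm / BGL ratios from the table's raw figures.
bgl = {"bw": 0.28, "bpf": 0.05, "bisect_bf": 0.0016, "nodes": 40_000}
rs  = {"bw": 6.0,  "bpf": 1.1,  "bisect_bf": 0.038,  "nodes": 10_000}

print(rs["bw"] / bgl["bw"])                # ~21.4x link BW (slide rounds to 22x)
print(rs["bpf"] / bgl["bpf"])              # 22x bytes/flop
print(rs["bisect_bf"] / bgl["bisect_bf"])  # ~23.8x bi-section B/F (slide: 24x)
print(rs["nodes"] / bgl["nodes"])          # 1/4 the nodes per problem
```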

Fixed problem performance

[Plot: fixed-size molecular dynamics problem (LJ liquid)]

Parallel Sn Neutronics (provided by LANL)

[Plot: scaling of scientific & engineering codes -- balance is critical to scalability]



Relating scalability and cost

[Plot: regions where a cluster vs. an MPP is more cost effective; efficiency ratio (value lost in transcription) vs. cost ratio = 1.8]

Average efficiency ratio over the five codes that consume >80% of Sandia's cycles
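The crossover logic behind these plots: an MPP is the more cost-effective platform when its efficiency advantage over a cluster exceeds its price premium. The cost ratio of 1.8 is from the slide; the efficiency values below are illustrative:

```python
# Cluster-vs-MPP cost effectiveness: compare cost per unit of useful
# work. Cost ratio 1.8 (MPP/cluster) is from the slide; efficiencies
# below are illustrative.
def mpp_more_cost_effective(mpp_eff, cluster_eff, cost_ratio=1.8):
    # MPP wins when cost_ratio / mpp_eff < 1 / cluster_eff,
    # i.e. when the efficiency ratio exceeds the cost ratio.
    return mpp_eff / cluster_eff > cost_ratio

print(mpp_more_cost_effective(0.90, 0.40))   # ratio 2.25 > 1.8 -> True
print(mpp_more_cost_effective(0.90, 0.60))   # ratio 1.5  < 1.8 -> False
```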

Scalability determines cost effectiveness

Sandia's top priority computing workload:

[Plot: regions where a cluster (55M node-hrs) vs. an MPP (380M node-hrs) is more cost effective]

Scalability also limits capability

[Plot annotation: ~3x processors]

    Commodity nearly everywhere-- Customization drives cost

    • Earth Simulator and Cray X-1 are fully custom Vector systems with good balance

      • This drives their high cost (and their high performance).

    • Clusters are nearly entirely high-volume with no truly custom parts

      • This drives their low cost (and their low scalability)

    • Red Storm uses custom parts only where they are critical to performance and reliability

      • High scalability at minimal cost/performance

    Scaling data for some key engineering codes

    Random variation at small proc. counts

    Large differential in efficiency at large proc. counts

    Scaling data for some key physics codes

    Los Alamos’ Radiation transport code