Achieving Strong Scaling On Blue Gene/L: Case Study with NAMD

Sameer Kumar, Gheorghe Almasi

Blue Gene System Software,

IBM T J Watson Research Center,

Yorktown Heights, NY


L. V. Kale, Chao Huang

Department of Computer Science,

University of Illinois at Urbana Champaign,

Urbana, IL


  • Background and motivation
  • NAMD and Charm++
  • Blue Gene optimizations
  • Performance results
  • Summary
Blue Gene/L
  • Slow embedded core at a clock speed of 700 MHz
    • 32 KB L1 cache
    • L2 is a small prefetch buffer
    • 4MB Embedded DRAM L3 cache
  • 3D Torus interconnect
    • Each processor is connected to six torus links with a throughput of 175 MB/s
  • System optimized for massive scaling and power


Blue Gene/L

  • System (64 racks, 64x32x32): 180/360 TF/s, 32 TB
  • Rack (32 node cards): 2.8/5.6 TF/s, 512 GB
  • Node card (32 chips, 4x4x2; 16 compute, 0-2 I/O cards): 90/180 GF/s, 16 GB
  • Compute card (2 chips, 1x2x1): 5.6/11.2 GF/s, 1.0 GB
  • Chip (2 processors): 2.8/5.6 GF/s, 4 MB

Has this slide been presented 65536 times?
Can we scale on Blue Gene/L ?
  • Several applications have demonstrated weak scaling
  • NAMD was one of the first applications to achieve strong scaling on Blue Gene/L
NAMD: A Production MD program


  • Fully featured program from University of Illinois
  • NIH-funded development
  • Distributed free of charge (thousands of downloads so far)
  • Binaries and source code
  • Installed at NSF centers
  • User training and support
  • Large published simulations (e.g., aquaporin simulation featured in keynote)
NAMD Benchmarks


  • 3K atoms
  • Estrogen Receptor: 36K atoms (1996)
  • ATP Synthase: 327K atoms
  • A recent NSF peta-scale proposal presents a 100-million-atom system
Molecular Dynamics in NAMD
  • Collection of [charged] atoms, with bonds
    • Newtonian mechanics
    • Thousands to even a million atoms
  • At each time-step
    • Calculate forces on each atom
      • Bonded interactions
      • Non-bonded: electrostatic and van der Waals
        • Short-range: every timestep
        • Long-range: using PME (3D FFT)
        • Multiple time stepping: PME every 4 timesteps
    • Calculate velocities and advance positions
  • Challenge: femtosecond time-steps, and millions of them are needed!
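The per-timestep structure above can be sketched in miniature (an illustrative 1D sketch with a truncated Coulomb pair force, not NAMD's actual integrator; all names are hypothetical):

```cpp
#include <cmath>
#include <vector>

struct Atom { double x, v, f, q, m; };  // position, velocity, force, charge, mass (1D for brevity)

// Illustrative short-range non-bonded force: cutoff-truncated Coulomb pair force.
void computeForces(std::vector<Atom>& atoms, double cutoff) {
    for (auto& a : atoms) a.f = 0.0;
    for (std::size_t i = 0; i < atoms.size(); ++i)
        for (std::size_t j = i + 1; j < atoms.size(); ++j) {
            double r = atoms[j].x - atoms[i].x;
            if (r == 0.0 || std::fabs(r) > cutoff) continue;
            double f = atoms[i].q * atoms[j].q / (r * std::fabs(r));
            atoms[i].f -= f;   // Newton's third law: equal and opposite
            atoms[j].f += f;
        }
}

// One femtosecond-scale step: compute forces, then advance velocities and positions.
void timestep(std::vector<Atom>& atoms, double dt, double cutoff) {
    computeForces(atoms, cutoff);
    for (auto& a : atoms) {
        a.v += a.f / a.m * dt;
        a.x += a.v * dt;
    }
}
```

Millions of such steps are needed per simulation, which is why the per-step wall-clock time dominates everything that follows.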

Spatial Decomposition
  • Atoms distributed to cubes ("patches") based on their location
    • Size of each cube: just a bit larger than the cut-off radius
  • Computation performed by movable computes
    • Typically 13 computes per patch
  • Communication-to-computation ratio: O(1)
  • However:
    • Load imbalance
    • Easily scales only to about 8 times the number of patches

[Figure: cells, cubes, or "patches"]
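The cube assignment can be sketched as follows (a hypothetical helper, assuming an orthorhombic box with a patch edge just above the cutoff radius):

```cpp
#include <array>
#include <cmath>

// Map an atom's position to the 3D index of its patch (cube).
// patchEdge is chosen just larger than the cutoff radius, so all atoms
// within the cutoff of a given atom lie in the same or an adjacent patch.
std::array<int, 3> patchIndex(double x, double y, double z, double patchEdge) {
    return { static_cast<int>(std::floor(x / patchEdge)),
             static_cast<int>(std::floor(y / patchEdge)),
             static_cast<int>(std::floor(z / patchEdge)) };
}
```

Because cutoff-range partners are confined to adjacent patches, each patch only needs pairwise computes with its 26 neighbors, which is where the "13 computes per patch" figure (26 unordered pairs / 2) comes from.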

NAMD Computation
  • Application data divided into data objects called patches
    • Sub-grids determined by cutoff
  • Computation performed by migratable computes
    • 13 computes per patch pair and hence much more parallelism
    • Computes can be further split to increase parallelism
Charm++ and Converse

[Figure: user view of virtual processors vs. the system implementation with send and receive message queues]

  • Charm++: application mapped to Virtual Processors (VPs)
    • Runtime maps VPs to physical processors
  • Converse: communication layer for Charm++
    • Send, recv, progress, at the node level

NAMD Parallelization using Charm++

[Figure: 108 VPs, 847 VPs, ..., 100,000 VPs]

These 100,000+ Virtual Processors (VPs) are mapped to real processors by the Charm++ runtime system

The Apo-lipo Protein A1
  • 92,000 atoms
  • Benchmark for testing NAMD performance on various architectures
F1 ATP Synthase
  • 327K atoms
  • Can we run it on Blue Gene/L in virtual node mode?
Lysozyme in 8M Urea Solution
  • Total ~40,000 atoms
  • Solvated in a 72.8 Å x 72.8 Å x 72.8 Å box
  • Lysozyme: 129 residues, 1934 atoms
  • Urea: 1811 molecules
  • Water: 7799 molecules
  • Water/Urea ratio: 4.31
  • Red: protein, Blue: urea; CPK: water

Ruhong Zhou, Maria Eleftheriou, Ajay Royyuru, Bruce Berne

HA Binding Simulation Setup
  • Homotrimer, each with 2 subunits (HA1 & HA2)
  • Protein: 1491 residues, and 23400 atoms
  • 3 Sialic acids, 6 NAGs (N-acetyl-D-Glucosamine)
  • Solvated in 91Å x 94Å x 156Å water box, with total 35,863 water molecules
  • 30 Na+ ions to neutralize the system
  • Total ~131,000 atoms
  • PME for long-range electrostatic interactions
  • NPT simulation at 300K and 1atm
NAMD 2.5 in May 2005

[Plot: APoA1 step time (ms) with PME in co-processor mode; initial serial time 17.6 s]

Parallel MD: Easy or Hard?

Easy:
  • Tiny working data
  • Spatial locality
  • Uniform atom density
  • Persistent repetition

Hard:
  • Sequential timesteps
  • Very short iteration time
  • Full electrostatics
  • Fixed problem size
  • Dynamic variations
NAMD on BGL
  • Disadvantages
    • Slow embedded CPU
    • Small memory per node
    • Low bisection bandwidth
    • Hard to scale full electrostatics
    • Hard to overlap communication with computation
  • Advantages
    • Both application and hardware are 3D grids
    • Large 4MB L3 cache
    • Higher bandwidth for short messages
    • Six outgoing links from each node
    • Static TLB
    • No OS Daemons
Single Processor Performance
  • Inner loops
    • Better software pipelining
    • Aliasing issues resolved through the use of

#pragma disjoint (*ptr1, *ptr2)

    • Cache optimizations
    • 440d (dual FPU) instructions to use more registers
    • Serial time down from 17.6s (May 2005) to 7s
    • Iteration time down from 80 cycles to 32 cycles
    • Full 440d optimization would require converting some data structures from 24 to 32 bytes
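#pragma disjoint is specific to IBM's XL compilers; the portable analogue is restrict-qualified pointers, which make the same no-alias promise so the compiler can software-pipeline the loop without reloading after every store. A minimal sketch (illustrative function name, using the GCC/Clang `__restrict__` spelling):

```cpp
// The no-alias promise lets the compiler keep f[i] in a register across
// iterations instead of assuming each store to f may change contrib.
void accumulateForces(double* __restrict__ f,
                      const double* __restrict__ contrib,
                      int n) {
    for (int i = 0; i < n; ++i)
        f[i] += contrib[i];
}
```

Without the qualifier (or the pragma), the compiler must assume f and contrib might overlap and serialize the loads and stores.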
Memory Performance
  • Memory overhead high due to many small memory allocations
    • Group small allocations into larger buffers
    • We can now run the ATPase system in virtual node mode
  • Other sources of memory pressure
    • Parts of the atom structure duplicated on all processors
    • Other duplication to support external clients like Tcl and VMD
    • These issues still need to be addressed
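Grouping many small allocations into one larger buffer can be sketched as a simple bump (arena) allocator; this is an illustrative pattern, not NAMD's actual memory manager:

```cpp
#include <cstddef>
#include <vector>

// Bump allocator: one big buffer replaces many small heap allocations,
// eliminating per-allocation header overhead and heap fragmentation.
class Arena {
    std::vector<char> buf_;
    std::size_t used_ = 0;
public:
    explicit Arena(std::size_t bytes) : buf_(bytes) {}

    // Hand out the next aligned slice of the buffer, or nullptr if exhausted.
    void* alloc(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t start = (used_ + align - 1) / align * align;
        if (start + n > buf_.size()) return nullptr;
        used_ = start + n;
        return buf_.data() + start;
    }

    std::size_t used() const { return used_; }
};
```

On a node with only 256-512 MB per core, cutting per-allocation overhead like this is what made the 327K-atom ATPase system fit in virtual node mode.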
BGL Parallelization
  • Topology driven problem mapping
    • Blue Gene has a 3D torus network
    • Near neighbor communication has better performance
  • Load-balancing schemes
    • Choice of correct grain size
  • Communication optimizations
    • Overlap of computation and communication
    • Messaging performance
Problem Mapping

[Figure sequence: the application data space, with its data objects and compute objects, is mapped onto the 3D processor grid]
Improving Grain Size: Two Away Computation
  • Patches based on the cutoff alone are too coarse-grained on BGL
  • Each patch can be split along a dimension
    • Patches now interact with neighbors of neighbors
    • Makes the application more fine-grained
      • Improves load balancing
    • Messages of smaller size sent to more processors
      • Improves torus bandwidth utilization
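The parallelism gain can be illustrated by counting pair computes per patch: with one-away (cutoff-sized) patches each patch sees 26 neighbors, i.e. 26/2 = 13 unordered pair computes, matching the "typically 13 computes per patch" figure; splitting patches in half along one axis (two-away in z) enlarges the neighborhood to 3x3x5. A counting sketch (illustrative, ignoring boundary effects):

```cpp
// Unordered pair computes per patch when interactions must reach
// `ax`/`ay`/`az` patch widths along each axis to cover the cutoff.
// (1,1,1) is the classic one-away case; (1,1,2) is two-away along z,
// i.e. each patch was split in half along that dimension.
double pairComputesPerPatch(int ax, int ay, int az) {
    int neighbors = (2 * ax + 1) * (2 * ay + 1) * (2 * az + 1) - 1;
    return neighbors / 2.0;  // each unordered patch pair is one compute
}
```

With a two-away split the per-patch compute count rises from 13 to 22 while the patch count doubles, so there are both more and smaller work units to balance across processors.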
Load Balancing Steps

[Diagram: regular timesteps interleaved with instrumented timesteps; a detailed, aggressive load balancing step first, followed by refinement load balancing]
Load-balancing Metrics
  • Balancing load
  • Minimizing communication hop-bytes
    • Place computes close to their patches
  • Minimizing the number of proxies
    • Affects the connectivity of each patch object
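Hop-bytes weight each message's size by the number of torus hops it travels, so placing computes near their patches directly reduces the metric. A small sketch (illustrative function names; torus wrap-around is taken into account):

```cpp
#include <algorithm>
#include <cstdlib>

// Shortest hop distance along one torus dimension of size `dim`
// (traffic can go either way around the ring).
int torusHops(int a, int b, int dim) {
    int d = std::abs(a - b);
    return std::min(d, dim - d);
}

// Hop-bytes contribution of one message between two nodes on an
// X x Y x Z torus: (total hops) x (message size in bytes).
long hopBytes(const int src[3], const int dst[3], const int dims[3], long bytes) {
    int hops = torusHops(src[0], dst[0], dims[0])
             + torusHops(src[1], dst[1], dims[1])
             + torusHops(src[2], dst[2], dims[2]);
    return static_cast<long>(hops) * bytes;
}
```

Summing this quantity over all messages in a timestep gives the load balancer a single number to minimize alongside per-processor load.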
Communication in NAMD
  • Three major communication phases
    • Coordinate multicast
      • Heavy communication
    • Force reduction
      • Messages trickle in
    • PME
      • Long-range calculations, which require FFTs and all-to-alls
Optimizing communication
  • Overlap of communication with computation
  • New messaging protocols
    • Adaptive eager
    • Active put
  • FIFO mapping schemes
Overlap of Computation and Communication
  • Each FIFO has 4 packet buffers
  • Progress engine should be called every 4000 cycles
  • Progress overhead of about 200 cycles
    • 5 % increase in computation
  • Remaining time can be used for computation
Network Progress Calls
  • NAMD makes progress engine calls from the compute loops
    • Typical frequency is 10,000 cycles, dynamically tunable

for (i = 0; i < (i_upper SELF(- 1)); ++i) {
    const CompAtom &p_i = p_0[i];
    // Compute pairlists
    CmiNetworkProgress();   // called periodically from the compute loop
    for (k = 0; k < npairi; ++k) {
        // Compute forces
    }
}

void CmiNetworkProgress() {
    new_time = rts_get_timebase();
    if (new_time > lastProgress + PERIOD) {
        lastProgress = new_time;
        // advance the network: pump the torus FIFOs
    }
}
Charm++ Runtime Scalability
  • Charm++ MPI driver
    • MPI_Iprobe-based implementation
    • Higher progress overhead from MPI_Test
    • Statically pinned FIFOs for point-to-point communication
  • BGX message layer (developed in collaboration with Gheorghe Almasi)
    • Lower progress overhead makes overlap feasible
    • Active messages
      • Easy to design complex communication protocols
    • The Charm++ BGX driver was developed by Chao Huang last summer
    • Dynamic FIFO mapping
Better Message Performance: Adaptive Eager
  • Messages sent without rendezvous but with adaptive routing
  • Impressive performance results for messages in the 1KB-32KB range
  • Good performance for small non-blocking all-to-all operations like PME
  • Can achieve about 4 links of throughput
Active Put
  • A put that fires a handler at the destination on completion
  • Persistent communication
  • Adaptive routing
  • Lower per message overheads
  • Better cache performance
  • Can optimize NAMD coordinate multicast
FIFO Mapping
  • pinFifo algorithms
    • Decide which of the 6 FIFOs to use when sending a message to {x,y,z,t}
    • Cones, chessboard schemes
  • Dynamic FIFO mapping
    • A special send queue whose messages can go out from whichever FIFO is not full
BGX Message layer vs MPI
  • The fully non-blocking version performed below par on MPI
    • Polling overhead is high for a list of posted receives
  • The BGX native communication layer works well with asynchronous communication

[Table: NAMD 2.6b1 co-processor mode performance (ms/step), October 2005]

NAMD Performance

[Plot: APoA1 step time (ms) with PME in co-processor mode; time per step down to about 4 ms, with scaling annotations of 2.5 and 4.5]

Virtual Node Mode

[Plot: APoA1 step time (ms) with PME, comparing VN mode with CO mode on twice as many chips]

Impact of Optimizations

[Plot: NAMD cutoff step time on the APoA1 system on 1024 processors]

Blocking Communication

(Projections timeline of a 1024-node run without aggressive network progress)

  • Network progress not aggressive enough: communication gaps result in a low utilization of 65%
Effect of Network Progress

(Projections timeline of a 1024-node run with aggressive network progress)

  • More frequent advance closes gaps: higher network utilization of about 75%
Impact on Science
  • Dr. Zhou ran the lysozyme system for 6.7 billion time steps over about two months on 8 racks of Blue Gene/L
Lysozyme Misfolding & Amyloids
  • Mechanism behind protein misfolding and amyloid formation – Alzheimer's disease
  • Amyloids can be formed not only from traditional β-amyloid peptides but also from almost any protein, such as lysozyme.
  • A single mutation in lysozyme (TRP62GLY) can make the protein less stable and cause it to misfold into possible amyloids.
  • More mysteriously, the single mutation site TRP62 is on the surface, not in the hydrophobic core.
  • Goal: study lysozyme misfolding and amyloid formation
  • 10 μs aggregate MD simulation

C. Dobson and coworkers, Science 295, 1719 (2002); C. Dobson and coworkers, Nature 424, 783 (2003)

Summary
  • The machine is capable of massive performance
    • We were able to scale ApoA1 in NAMD to 8K processors
    • The bigger ATPase system also scales to 8K processors
  • Applications benefit from native messaging APIs
  • Topology optimizations are a big winner
  • Overlap of computation and communication is possible
  • Lack of operating-system daemons leads to massive scaling
Future Plans
  • Improve application scaling
    • We still have some Amdahl bottlenecks
      • Splitting bonded work
      • 2D or 3D decompositions for PME
    • Reducing grain-size overhead
    • Improving load balancing

Towards Peta Scale Computing
  • Sequential performance has to improve from 0.7 flops/cycle to 1-1.5 flops/cycle
    • Explore new algorithms for the inner loop to reduce register and cache pressure
    • Effectively use the double hummer (dual FPU)
  • Reduce memory pressure to run very large problems
  • Fully distributed load balancer
Funding Agencies: NIH, NSF, DOE (ASCI center)

Students, Staff and Faculty:
  • Parallel Programming Laboratory: Chao Huang, Gengbin Zheng, David Kunzman, Chee Wai Lee, Prof. Kale
  • Theoretical Biophysics: Klaus Schulten, Jim Phillips
  • IBM Watson: Gheorghe Almasi, Hao Yu
  • IBM Toronto: Murray Malleschuk, Mark Mendell