Presentation Transcript


Massive Supercomputing Coping with Heterogeneity of Modern Accelerators

Toshio Endo and Satoshi Matsuoka

Tokyo Institute of Technology, Japan


Accelerators for High Performance Computing

  • In HPC systems, power consumption has been, and will remain, a major concern

  • SIMD accelerators are promising for their excellent Flops/Watt ratio

  • NVIDIA GeForce 8800 GTX: 375 GFlops (single precision), ~200 W

  • ClearSpeed X620 accelerator: 80 GFlops, 25 W


Heterogeneous Architectures (1)

HPC systems only with “special purpose” accelerators are infeasible

  • They don’t directly support existing compilers, applications, MPI, Linux…

    Heterogeneous architectures are attractive because they combine:

  • Generality from general-purpose CPUs

    • Typically x86/x86-64 CPUs

  • A higher Flops/Watt ratio from accelerators

    • ClearSpeed accelerators, GPGPUs, Cell BE…

  • Example: IBM Roadrunner, TokyoTech TSUBAME


Heterogeneous Architectures (2)


  • Running a large parallel application on heterogeneous systems


  • How can we utilize heterogeneous resources effectively?

  • Are they scalable up to supercomputing scale?

  • ClearSpeed X620 accelerator: 80 GFlops peak

  • AMD Opteron 880: 4.8 GFlops peak per core



Overview of Our Work

  • We take a tightly-coupled program, Linpack

  • Combined usage of 10,368 Opteron cores and 648 ClearSpeed SIMD accelerators

  • >60TFlops: The world’s highest Linpack performance on heterogeneous supercomputers


NEC/Sun/ClearSpeed/Voltaire TokyoTech TSUBAME Supercomputer (2006)

  • SunFire X4600: 16 Opteron cores/node × 655 nodes

  • Voltaire ISR9288 InfiniBand, 10 Gbps

  • ClearSpeed CSX600 SIMD accelerator × 360 PCI-X boards

  • 102 TFlops peak = Opteron 49.8 TF + ClearSpeed 52.2 TF

  • 16th supercomputer in the world, 2nd in Asia (Top500, Nov 2007)


Structure of TSUBAME Node with Heterogeneous Processors



[Node diagram: SunFire X4600 with 8 dual-core Opteron CPUs (16 cores); ClearSpeed accelerator board attached on PCI-X]

[TSUBAME installation overview]

  • 16 Opteron cores × 655 compute nodes

  • 1.6 PByte storage

  • 288-port 10 Gbps InfiniBand switches × 6

  • Cooling towers (~20 units)


ClearSpeed Accelerator

  • PCI-X accelerator boards

    • CSX600 SIMD processor x 2 + 1GB DRAM on board

    • 210 MHz × 2 FP ops × 96 SIMD PEs × 2 processors = 80.6 GFlops peak

      • Configurable up to 250MHz

    • Power: 25W/board

      Provided software:

    • Cn programming language

    • CSXL BLAS library ← used by this work

    • CSFFT library

DGEMM Performance of Opteron and ClearSpeed


[Graph: DGEMM performance of GOTO BLAS on Opteron (1 core) vs. CSXL BLAS 2.50 on ClearSpeed, for multiplies of (M×B) × (B×M) matrices]

  • An accelerator is equivalent to 14 CPU cores at peak

  • ClearSpeed performance is much more sensitive to matrix size!

- GOTO BLAS is by Kazushige Goto, U. Texas
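As a rough cross-check (not from the slides), assuming GOTO BLAS drives each Opteron core close to its 4.8 GFlops peak, the "14 CPU cores" equivalence implies a sustained DGEMM rate on the accelerator of roughly

    14 \times 4.8\,\mathrm{GFlops} \approx 67\,\mathrm{GFlops} \quad (\text{vs. } 80.6\,\mathrm{GFlops}\ \text{hardware peak})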


Linpack: Our Target Application Benchmark

  • Linpack is a numerical benchmark used in Top500

    • Solve N x N dense linear equations

  • HPL (High-performance Linpack) by A. Petitet

    • A well-known MPI parallel implementation

    • Matrix multiply (DGEMM) is the most time-consuming computation; O(N³) work in total (see the sketch below)
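For concreteness, the dominant operation is a rank-B update of the trailing matrix via DGEMM. The sketch below is a minimal illustration assuming a CBLAS interface; it is not HPL's actual code, and the function and argument names are illustrative.

    #include <cblas.h>

    /* Trailing-matrix update C := C - A * B, the O(N^3) kernel of HPL.
       A is m x b (the factored panel), B is b x n (a block row), C is m x n.
       Column-major storage with leading dimensions lda/ldb/ldc, as in HPL. */
    void trailing_update(int m, int n, int b,
                         const double *A, int lda,
                         const double *B, int ldb,
                         double *C, int ldc)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, b,
                    -1.0, A, lda,    /* alpha = -1: subtract the product  */
                    B, ldb,
                    1.0, C, ldc);    /* beta  =  1: accumulate into C     */
    }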

Data Decomposition in HPL

[Figure: matrix distribution over 6 (= 2×3) processes]

  • Matrix is uniformly distributed with a 2D block-cyclic distribution (sketched below)
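A minimal sketch of what 2D block-cyclic means in code: the block at block coordinates (I, J) lands on process (I mod P, J mod Q) of a P × Q grid. This helper is illustrative only, not HPL's internal indexing.

    /* Owner of the B x B block containing global element (gi, gj)
       under a 2D block-cyclic distribution on a P x Q process grid. */
    typedef struct { int prow; int pcol; } Owner;

    Owner block_owner(int gi, int gj, int B, int P, int Q)
    {
        Owner o;
        int I = gi / B;        /* global block-row index          */
        int J = gj / B;        /* global block-column index       */
        o.prow = I % P;        /* process row (cyclic over P)     */
        o.pcol = J % Q;        /* process column (cyclic over Q)  */
        return o;
    }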

Flow of HPL (simplified)

[Diagram: each iteration repeats panel factorization (“Panel fact, etc.”), MPI communication of the panel, and a matrix multiply on each process's own data]

  • Performance is bottlenecked by the slowest process

  • HPL is designed for uniform systems

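The simplified flow can also be read as a loop. The sketch below uses hypothetical stand-in helpers (panel_factorize, broadcast_panel, update_own_blocks); it mirrors the diagram above, not HPL's real control flow.

    /* One simplified outer iteration per block column k. */
    void panel_factorize(int k)   { (void)k; /* "Panel fact, etc."                 */ }
    void broadcast_panel(int k)   { (void)k; /* "MPI comm": share the panel        */ }
    void update_own_blocks(int k) { (void)k; /* DGEMM on each process's own data   */ }

    void hpl_flow(int nblocks)
    {
        for (int k = 0; k < nblocks; k++) {
            panel_factorize(k);     /* done by the processes owning panel k   */
            broadcast_panel(k);     /* communication step                     */
            update_own_blocks(k);   /* every process updates its local blocks */
        }
        /* Each iteration finishes only when the slowest process finishes. */
    }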


Requirements on Heterogeneous System

HPL is designed for homogeneous systems, but

  • Intra-node heterogeneity: A node has both general purpose CPUs and SIMD accelerator

  • Inter-node heterogeneity: (in the previous TSUBAME configuration) about half the nodes have accelerators, while the others do not

  • We want to keep modifications to the HPL source code small

    How can we run HPL efficiently on heterogeneous systems?

Three System Configurations

  • No heterogeneity (“CPU Only”)

  • Intra-node + inter-node heterogeneity (“Half Acc’d”)

  • Intra-node heterogeneity only (“Fully Acc’d”)


Our Basic Policy (1/2)

  • For intra-node heterogeneity, we ‘virtualize’ heterogeneous processors at the library layer (see the sketch below)

  • Processors are providers of DGEMM performance

  • We control mapping between processes and processors

[Figure: example of the mapping between processes and processors during DGEMM]
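A minimal sketch of the "virtualize at the library layer" idea: a single DGEMM entry point that splits the update between a CPU BLAS and an accelerator BLAS. The column-wise split, the acc_share parameter, and the backend function pointers are illustrative assumptions, not the authors' implementation (which sits on top of GOTO BLAS and CSXL BLAS).

    #include <stddef.h>

    /* Simplified DGEMM backend: column-major C := C - A*B with leading dimensions. */
    typedef void (*dgemm_fn)(int m, int n, int k,
                             const double *A, int lda,
                             const double *B, int ldb,
                             double *C, int ldc);

    /* Split the n columns of C between the accelerator and the CPU cores
       according to acc_share (fraction of work offloaded). In a real system
       the two parts would run concurrently on their respective processors. */
    void hetero_dgemm(dgemm_fn cpu_dgemm, dgemm_fn acc_dgemm, double acc_share,
                      int m, int n, int k,
                      const double *A, int lda,
                      const double *B, int ldb,
                      double *C, int ldc)
    {
        int n_acc = (int)(n * acc_share);   /* columns sent to the accelerator */
        int n_cpu = n - n_acc;              /* columns kept on the CPUs        */

        if (n_acc > 0)
            acc_dgemm(m, n_acc, k, A, lda, B, ldb, C, ldc);
        if (n_cpu > 0)
            cpu_dgemm(m, n_cpu, k, A, lda,
                      B + (size_t)n_acc * ldb, ldb,   /* skip the first n_acc columns of B */
                      C + (size_t)n_acc * ldc, ldc);  /* ...and of C                        */
    }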




Our Basic Policy (2/2)

  • For inter-node heterogeneity, we control the number of processes placed on each node

    • cf. CHARM++, AMPI from UIUC

  • We can keep the kernel workload of each process uniform (good for HPL), while accommodating heterogeneity (see the toy example below)
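A toy illustration of the inter-node policy (not the authors' actual tuning): if each process does a fixed amount of kernel work, a node should host a number of processes roughly proportional to its DGEMM throughput. The figures reuse the 4.8 GFlops/core and "accelerator ≈ 14 cores" numbers from earlier slides; the per-process rate is an arbitrary example.

    #include <stdio.h>

    /* Number of equally sized processes to place on a node, proportional
       to that node's aggregate DGEMM throughput. Illustrative policy only. */
    static int processes_for_node(double node_gflops, double gflops_per_process)
    {
        int p = (int)(node_gflops / gflops_per_process + 0.5);  /* round to nearest */
        return p > 0 ? p : 1;
    }

    int main(void)
    {
        const double per_core = 4.8;            /* Opteron core peak (GFlops)        */
        const double per_proc = 4 * per_core;   /* assumed work rate of one process  */

        double cpu_node = 16 * per_core;                  /* 16 cores, no accelerator     */
        double acc_node = 16 * per_core + 14 * per_core;  /* + accelerator ~ 14 cores     */

        printf("CPU-only node:    %d processes\n", processes_for_node(cpu_node, per_proc));
        printf("accelerated node: %d processes\n", processes_for_node(acc_node, per_proc));
        return 0;
    }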


Careful Tuning is Necessary for Performance

Since SIMD accelerators are sensitive to many HPL parameters, careful tuning is necessary

  • Process granularity

  • Process mapping

  • Block size

    We need different tuning for each system configuration


Tuning of Process Granularity


We can tune ‘process granularity’ via the number of BLAS threads per process

  • If processes are too coarse (each process uses many threads), it is harder to balance load among nodes

  • If too fine, HPL suffers from duplicated computation


Tuning of Block Size

  • When block size B is small, ClearSpeed performance is heavily degraded

  • When B is too large, HPL suffers from large overhead for panel factorization


Tuning on “CPU-only” Case

  • The focus is bringing out the BLAS performance of the CPUs

[Node diagram: 16 Opteron cores per node × 648 nodes]

  • Block size is 240, which is good for GOTO BLAS


Tuning on “Fully-Acc’d” Case

  • The focus is:

    • Process granularity / block size should be large enough for ClearSpeed BLAS

    • Balancing among processes, while utilizing both processors



[Node diagram: 16 Opteron cores per node × 648 nodes; some CPU resources are devoted to PCI-X communication with the accelerator]

  • Block size is 864, which is good for ClearSpeed BLAS


Tuning on “Half-Acc’d” Case

  • The focus is balancing load between accelerated and non-accelerated nodes

[Node mix: 288 nodes without ClearSpeed, 360 nodes with ClearSpeed]


  • Block size is 864



Experimental Setup

  • 648 SunFire X4600 nodes in TSUBAME

  • Modified HPL + Voltaire MPI + GOTO BLAS + CSXL BLAS

  • Three configurations:

    • CPU Only: Only Opteron CPUs are used

    • Half Acc’d: Only half the nodes are accelerated

    • Fully Acc’d: All the nodes are accelerated


Experimental Results

[Graph: relative speed, normalized to CPU Only = 1]

  • 38.18 TF in “CPU Only”

  • 48.88 TF in “Half Acc’d”

    • +28% over CPU Only

  • 63.51 TF in “Fully Acc’d”

    • Check the precise figures in the next Top500 list in June



Summary

  • Scalability of heterogeneous supercomputers with SIMD accelerators is demonstrated

  • >60TFlops Linpack performance is achieved

  • Our method works efficiently even when nodes are partially accelerated

    Future work:

  • From hand-tuning to automatic tuning

  • Other useful applications!
