Presentation Transcript


Massive Supercomputing Coping with Heterogeneity of Modern Accelerators

Toshio Endo and Satoshi Matsuoka

Tokyo Institute of Technology, Japan


Accelerators for High Performance Computing

  • In HPC systems, power consumption has been, and will remain, a major concern

  • SIMD accelerators are promising for their excellent Flops/Watt ratio

  • NVIDIA GeForce 8800 GTX: 375 GFlops (single precision), ~200 W

  • ClearSpeed X620 accelerator: 80 GFlops, 25 W


Heterogeneous Architectures (1)

HPC systems only with “special purpose” accelerators are infeasible

  • They don’t directly support existing compilers, applications, MPI, Linux…

    Heterogeneous architectures are attractive because they combine:

  • Generality from general-purpose CPUs

    • Typically x86/x86-64 CPUs

  • A higher Flops/Watt ratio from accelerators

    • ClearSpeed accelerators, GPGPUs, Cell BE…

  • Example: IBM Roadrunner, TokyoTech TSUBAME


Heterogeneous Architectures (2)


  • Running a large parallel application on heterogeneous systems


  • How can we utilize heterogeneous resources effectively?

  • Are they scalable up to supercomputing scale?

  • ClearSpeed X620 accelerator: 80 GFlops peak

  • AMD Opteron 880: 4.8 GFlops peak per core



Overview of Our Work

  • We take a tightly-coupled program, Linpack

  • Combined usage of 10,368 Opteron cores and 648 ClearSpeed SIMD accelerators

  • >60TFlops: The world’s highest Linpack performance on heterogeneous supercomputers


NEC/Sun/ClearSpeed/Voltaire TokyoTech TSUBAME Supercomputer (2006)

  • SunFire X4600: 16 Opteron cores/node × 655 nodes

  • Voltaire ISR9288 InfiniBand, 10 Gbps

  • ClearSpeed CSX600 SIMD accelerator × 360 PCI-X boards

  • 102 TFlops peak = Opteron 49.8 TF + ClearSpeed 52.2 TF

  • 16th supercomputer in the world, 2nd in Asia (Top500, Nov 2007)


Structure of TSUBAME Node with Heterogeneous Processors



[Node diagram: SunFire X4600 with 8 dual-core Opteron CPUs (16 cores); ClearSpeed accelerator board attached on PCI-X]

[TSUBAME installation overview]

  • 16 Opteron cores × 655 compute nodes

  • 1.6 PByte storage

  • 288-port 10 Gbps InfiniBand switches × 6

  • Cooling towers (~20 units)


ClearSpeed Accelerator

  • PCI-X accelerator boards

    • CSX600 SIMD processor x 2 + 1GB DRAM on board

    • 210 MHz × 2 FP ops × 96 SIMD PEs × 2 processors = 80.6 GFlops peak

      • Configurable up to 250MHz

    • Power: 25W/board

      Provided software:

    • Cn programming language

    • CSXL BLAS library ← used by this work

    • CSFFT library

DGEMM Performance of Opteron and ClearSpeed


[Graph: DGEMM performance of GOTO BLAS on Opteron (1 core) vs. CSXL BLAS 2.50 on ClearSpeed, for multiplies of (M×B) × (B×M) matrices]

  • An accelerator is equivalent to 14 CPU cores at peak

  • ClearSpeed performance is much more sensitive to matrix size!

- GOTO BLAS is by Kazushige Goto, U. Texas
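As a rough cross-check (not from the slides), assuming GOTO BLAS drives each Opteron core close to its 4.8 GFlops peak, the "14 CPU cores" equivalence implies a sustained DGEMM rate on the accelerator of roughly

    14 \times 4.8\,\mathrm{GFlops} \approx 67\,\mathrm{GFlops} \quad (\text{vs. } 80.6\,\mathrm{GFlops}\ \text{hardware peak})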


Linpack: Our Target Application Benchmark

  • Linpack is a numerical benchmark used in Top500

    • Solve N x N dense linear equations

  • HPL (High-performance Linpack) by A. Petitet

    • A well-known MPI parallel implementation

    • Matrix multiply (DGEMM) is the most time-consuming computation; O(N³) work in total (see the sketch below)
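For concreteness, the dominant operation is a rank-B update of the trailing matrix via DGEMM. The sketch below is a minimal illustration assuming a CBLAS interface; it is not HPL's actual code, and the function and argument names are illustrative.

    #include <cblas.h>

    /* Trailing-matrix update C := C - A * B, the O(N^3) kernel of HPL.
       A is m x b (the factored panel), B is b x n (a block row), C is m x n.
       Column-major storage with leading dimensions lda/ldb/ldc, as in HPL. */
    void trailing_update(int m, int n, int b,
                         const double *A, int lda,
                         const double *B, int ldb,
                         double *C, int ldc)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, b,
                    -1.0, A, lda,    /* alpha = -1: subtract the product  */
                    B, ldb,
                    1.0, C, ldc);    /* beta  =  1: accumulate into C     */
    }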

Data Decomposition in HPL

[Figure: matrix distribution over 6 (= 2×3) processes]

  • Matrix is uniformly distributed with a 2D block-cyclic distribution (sketched below)
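A minimal sketch of what 2D block-cyclic means in code: the block at block coordinates (I, J) lands on process (I mod P, J mod Q) of a P × Q grid. This helper is illustrative only, not HPL's internal indexing.

    /* Owner of the B x B block containing global element (gi, gj)
       under a 2D block-cyclic distribution on a P x Q process grid. */
    typedef struct { int prow; int pcol; } Owner;

    Owner block_owner(int gi, int gj, int B, int P, int Q)
    {
        Owner o;
        int I = gi / B;        /* global block-row index          */
        int J = gj / B;        /* global block-column index       */
        o.prow = I % P;        /* process row (cyclic over P)     */
        o.pcol = J % Q;        /* process column (cyclic over Q)  */
        return o;
    }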

Flow of HPL (simplified)

[Diagram: each iteration repeats panel factorization (“Panel fact, etc.”), MPI communication of the panel, and a matrix multiply on each process's own data]

  • Performance is bottlenecked by the slowest process

  • HPL is designed for uniform systems

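The simplified flow can also be read as a loop. The sketch below uses hypothetical stand-in helpers (panel_factorize, broadcast_panel, update_own_blocks); it mirrors the diagram above, not HPL's real control flow.

    /* One simplified outer iteration per block column k. */
    void panel_factorize(int k)   { (void)k; /* "Panel fact, etc."                 */ }
    void broadcast_panel(int k)   { (void)k; /* "MPI comm": share the panel        */ }
    void update_own_blocks(int k) { (void)k; /* DGEMM on each process's own data   */ }

    void hpl_flow(int nblocks)
    {
        for (int k = 0; k < nblocks; k++) {
            panel_factorize(k);     /* done by the processes owning panel k   */
            broadcast_panel(k);     /* communication step                     */
            update_own_blocks(k);   /* every process updates its local blocks */
        }
        /* Each iteration finishes only when the slowest process finishes. */
    }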


Requirements on Heterogeneous System

HPL is designed for homogeneous systems, but

  • Intra-node heterogeneity: A node has both general purpose CPUs and SIMD accelerator

  • Inter-node heterogeneity: (in the previous TSUBAME configuration) about half the nodes have accelerators, while the others do not

  • We want to keep modifications to the HPL source code small

    How can we run HPL efficiently on heterogeneous systems?

Three System Configurations

  • No heterogeneity (“CPU Only”)

  • Intra-node + inter-node heterogeneity (“Half Acc’d”)

  • Intra-node heterogeneity only (“Fully Acc’d”)


Our Basic Policy (1/2)

  • For intra-node heterogeneity, we ‘virtualize’ heterogeneous processors at the library layer (see the sketch below)

  • Processors are providers of DGEMM performance

  • We control mapping between processes and processors

[Figure: example of the mapping between processes and processors during DGEMM]
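A minimal sketch of the "virtualize at the library layer" idea: a single DGEMM entry point that splits the update between a CPU BLAS and an accelerator BLAS. The column-wise split, the acc_share parameter, and the backend function pointers are illustrative assumptions, not the authors' implementation (which sits on top of GOTO BLAS and CSXL BLAS).

    #include <stddef.h>

    /* Simplified DGEMM backend: column-major C := C - A*B with leading dimensions. */
    typedef void (*dgemm_fn)(int m, int n, int k,
                             const double *A, int lda,
                             const double *B, int ldb,
                             double *C, int ldc);

    /* Split the n columns of C between the accelerator and the CPU cores
       according to acc_share (fraction of work offloaded). In a real system
       the two parts would run concurrently on their respective processors. */
    void hetero_dgemm(dgemm_fn cpu_dgemm, dgemm_fn acc_dgemm, double acc_share,
                      int m, int n, int k,
                      const double *A, int lda,
                      const double *B, int ldb,
                      double *C, int ldc)
    {
        int n_acc = (int)(n * acc_share);   /* columns sent to the accelerator */
        int n_cpu = n - n_acc;              /* columns kept on the CPUs        */

        if (n_acc > 0)
            acc_dgemm(m, n_acc, k, A, lda, B, ldb, C, ldc);
        if (n_cpu > 0)
            cpu_dgemm(m, n_cpu, k, A, lda,
                      B + (size_t)n_acc * ldb, ldb,   /* skip the first n_acc columns of B */
                      C + (size_t)n_acc * ldc, ldc);  /* ...and of C                        */
    }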




Our Basic Policy (2/2)

  • For inter-node heterogeneity, we control the number of processes placed on each node

    • cf. CHARM++, AMPI from UIUC

  • We can keep the kernel workload of each process uniform (good for HPL), while accommodating heterogeneity (see the toy example below)
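A toy illustration of the inter-node policy (not the authors' actual tuning): if each process does a fixed amount of kernel work, a node should host a number of processes roughly proportional to its DGEMM throughput. The figures reuse the 4.8 GFlops/core and "accelerator ≈ 14 cores" numbers from earlier slides; the per-process rate is an arbitrary example.

    #include <stdio.h>

    /* Number of equally sized processes to place on a node, proportional
       to that node's aggregate DGEMM throughput. Illustrative policy only. */
    static int processes_for_node(double node_gflops, double gflops_per_process)
    {
        int p = (int)(node_gflops / gflops_per_process + 0.5);  /* round to nearest */
        return p > 0 ? p : 1;
    }

    int main(void)
    {
        const double per_core = 4.8;            /* Opteron core peak (GFlops)        */
        const double per_proc = 4 * per_core;   /* assumed work rate of one process  */

        double cpu_node = 16 * per_core;                  /* 16 cores, no accelerator     */
        double acc_node = 16 * per_core + 14 * per_core;  /* + accelerator ~ 14 cores     */

        printf("CPU-only node:    %d processes\n", processes_for_node(cpu_node, per_proc));
        printf("accelerated node: %d processes\n", processes_for_node(acc_node, per_proc));
        return 0;
    }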


Careful Tuning is Necessary for Performance

Since SIMD accelerators are sensitive to many HPL parameters, careful tuning is necessary

  • Process granularity

  • Process mapping

  • Block size

    We need different tuning for each system configuration


Tuning of Process Granularity


We can tune ‘process granularity’ via the number of BLAS threads per process

  • If processes are too coarse (each process uses many threads), it is harder to balance load among nodes

  • If too fine, HPL suffers from duplicated computation


Tuning of Block Size

  • When block size B is small, ClearSpeed performance is heavily degraded

  • When B is too large, HPL suffers from large overhead for panel factorization


Tuning on “CPU-only” Case

  • The focus is bringing out the BLAS performance of the CPUs

[Node diagram: 16 Opteron cores per node × 648 nodes]

  • Block size is 240, which is good for GOTO BLAS


Tuning on “Fully-Acc’d” Case

  • The focus is:

    • Process granularity / block size should be large enough for ClearSpeed BLAS

    • Balancing among processes, while utilizing both processors



[Node diagram: 16 Opteron cores per node × 648 nodes; some CPU resources are devoted to PCI-X communication with the accelerator]

  • Block size is 864, which is good for ClearSpeed BLAS


Tuning on “Half-Acc’d” Case

  • The focus is balancing load between accelerated and non-accelerated nodes

[Node mix: 288 nodes without ClearSpeed, 360 nodes with ClearSpeed]


  • Block size is 864



Experimental Setup

  • 648 SunFire X4600 nodes in TSUBAME

  • Modified HPL + Voltaire MPI + GOTO BLAS + CSXL BLAS

  • Three configurations:

    • CPU Only: Only Opteron CPUs are used

    • Half Acc’d: Only half the nodes are accelerated

    • Fully Acc’d: All the nodes are accelerated


Experimental Results

[Graph: relative speed, normalized to CPU Only = 1]

  • 38.18 TF in “CPU Only”

  • 48.88 TF in “Half Acc’d”

    • +28% over CPU Only

  • 63.51 TF in “Fully Acc’d”

    • Check the precise figures in the next Top500 list in June



Summary

  • Scalability of heterogeneous supercomputers with SIMD accelerators is demonstrated

  • >60TFlops Linpack performance is achieved

  • Our method works efficiently even when nodes are partially accelerated

    Future work:

  • From hand-tuning to automatic tuning

  • Other useful applications!
