
Massive Supercomputing Coping with Heterogeneity of Modern Accelerators






Presentation Transcript


  1. Massive Supercomputing Coping with Heterogeneity of Modern Accelerators
  Toshio Endo and Satoshi Matsuoka, Tokyo Institute of Technology, Japan

  2. Accelerators for High Performance Computing
  • In HPC systems, power consumption has been, and will remain, a major concern
  • SIMD accelerators are promising for their excellent Flops/Watt ratio:
    • NVIDIA GeForce 8800 GTX: 375 GFlops (single precision), ~200 W (roughly 1.9 GFlops/W)
    • ClearSpeed X620 accelerator: 80 GFlops, 25 W (roughly 3.2 GFlops/W)

  3. Heterogeneous Architectures (1)
  • HPC systems with only "special purpose" accelerators are infeasible
    • They do not directly support existing compilers, applications, MPI, Linux, ...
  • Heterogeneous architectures are attractive because they combine:
    • Generality from general-purpose CPUs (typically x86/x86-64)
    • A higher Flops/Watt ratio from accelerators (ClearSpeed accelerators, GPGPUs, Cell BE, ...)
  • Examples: IBM Roadrunner, Tokyo Tech TSUBAME

  4. Heterogeneous Architectures (2)
  • Objective: running a large parallel application on heterogeneous systems
  • Questions:
    • How can we utilize heterogeneous resources effectively?
    • Are they scalable up to supercomputing scale?
  (Figure: AMD Opteron 880, 4.8 GFlops peak per core, combined with a ClearSpeed X620 accelerator, 80 GFlops peak)

  5. Overview of Our Work
  • We take a tightly coupled program, Linpack
  • Combined usage of 10,368 Opteron cores and 648 ClearSpeed SIMD accelerators
  • >60 TFlops: the world's highest Linpack performance on heterogeneous supercomputers

  6. Tokyo Tech TSUBAME Supercomputer (NEC/Sun/ClearSpeed/Voltaire, 2006)
  • SunFire X4600: 16 Opteron cores/node x 655 nodes
  • ClearSpeed CSX600 SIMD accelerator: 648 PCI-X boards (360 in the earlier configuration)
  • Voltaire ISR9288 InfiniBand, 10 Gbps
  • Storage: 500 GB x 48 disks per storage unit
  • 102 TFlops peak = Opteron 49.8 TF + ClearSpeed 52.2 TF
  • 16th supercomputer in the world, 2nd in Asia (Top500, Nov 2007)

  7. Structure of a TSUBAME Node with Heterogeneous Processors
  • SunFire X4600 with 8 dual-core Opteron CPUs (16 cores)
  • ClearSpeed accelerator attached via PCI-X
  • InfiniBand interconnect

  8. (Photos of the TSUBAME installation)
  • 16 Opteron cores x 655 compute nodes
  • 1.6 PByte storage
  • 288-port 10 Gbps InfiniBand switches x 6
  • Cooling towers (~20 units)

  9. ClearSpeed Accelerator
  • PCI-X accelerator board: 2 x CSX600 SIMD processor + 1 GB DRAM on board
  • 210 MHz x 2 FP x 96 SIMD x 2 = 80.6 GFlops peak (configurable up to 250 MHz)
  • Power: 25 W/board
  • Provided software:
    • Cn programming language
    • CSXL BLAS library <= used by this work
    • CSFFT library

  10. DGEMM Performance of Opteron and ClearSpeed
  (Chart: GOTO BLAS on one Opteron core vs. CSXL BLAS 2.50 on ClearSpeed, multiplying an (M x B) matrix by a (B x M) matrix, performance plotted against matrix size)
  • An accelerator is equivalent to 14 CPU cores at peak
  • ClearSpeed performance is much more sensitive to matrix size!
  • GOTO BLAS is by Kazushige Goto, U. Texas
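The chart above measures a plain DGEMM call of this shape. Below is a minimal sketch of such a microbenchmark using the standard CBLAS interface; the matrix sizes and the CBLAS entry point are our illustrative assumptions, and presumably the original measurement linked the equivalent call against GOTO BLAS or CSXL BLAS to choose the processor.

```c
/* Minimal DGEMM microbenchmark sketch: C (M x M) += A (M x B) * Bmat (B x M).
 * Which processor runs it depends on the BLAS library linked in
 * (e.g. GOTO BLAS on a CPU core, or a BLAS that offloads to ClearSpeed). */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    const int M = 4608, B = 864;                 /* illustrative sizes */
    double *a = malloc((size_t)M * B * sizeof *a);
    double *b = malloc((size_t)B * M * sizeof *b);
    double *c = calloc((size_t)M * M, sizeof *c);
    for (size_t i = 0; i < (size_t)M * B; i++) { a[i] = 1.0; b[i] = 1.0; }

    /* C = 1.0 * A * Bmat + 1.0 * C, row-major, no transposes */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, M, B, 1.0, a, B, b, M, 1.0, c, M);

    printf("c[0] = %.1f (expected %d.0)\n", c[0], B);
    free(a); free(b); free(c);
    return 0;
}
```

A real benchmark would time repeated calls and sweep M, as in the chart.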

  11. Linpack: Our Target Application Benchmark
  • Linpack is the numerical benchmark used in Top500
  • It solves an N x N dense system of linear equations
  • HPL (High-Performance Linpack) by A. Petitet is a well-known MPI parallel implementation
  • Matrix multiply (DGEMM) is the most time-consuming computation; the total work is O(N^3) (the standard Top500 count is 2/3 N^3 + 2 N^2 floating-point operations)

  12. Data Decomposition in HPL
  • The matrix is split into B x B blocks and uniformly distributed with a 2D block-cyclic distribution
  (Figure: matrix distribution on 6 (= 2 x 3) processes, with N the matrix size and B the block size)
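As a worked illustration of block-cyclic ownership (our sketch, not HPL code): with block size B and a P x Q process grid, the block containing element (i, j) lands on process row (i / B) mod P and process column (j / B) mod Q.

```c
/* Sketch of 2D block-cyclic ownership: which process holds global
 * element (i, j) when an N x N matrix is split into B x B blocks
 * over a P x Q process grid (P=2, Q=3 gives the 6-process example). */
#include <stdio.h>

typedef struct { int prow, pcol; } Owner;

static Owner owner_of(long i, long j, int B, int P, int Q)
{
    Owner o;
    o.prow = (int)((i / B) % P);   /* block row cycles over process rows    */
    o.pcol = (int)((j / B) % Q);   /* block column cycles over process cols */
    return o;
}

int main(void)
{
    const int B = 864, P = 2, Q = 3;
    Owner o = owner_of(2000, 5000, B, P, Q);
    printf("element (2000,5000) -> process (%d,%d)\n", o.prow, o.pcol);
    return 0;
}
```

The block-cyclic layout keeps the shrinking trailing matrix spread evenly over all processes, which is exactly why HPL implicitly assumes processes of equal speed.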

  13. Flow of HPL (simplified)
  • Every process repeats the same steps: panel factorization etc., MPI communication, matrix multiply (DGEMM) on its own data, then back to the top of the loop
  • Performance is bottlenecked by the slowest process
  • HPL is designed for uniform systems
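The loop that the slide's diagram summarizes can be written as a runnable skeleton; the routine names and stubs below are illustrative, not HPL's actual functions.

```c
/* Simplified flow of HPL as seen by one MPI process (illustrative
 * skeleton with stubbed-out kernels, not HPL's real routines). */
#include <stdio.h>

#define N 8640
#define B 864

/* Stubs standing in for the real work; in HPL these involve BLAS and MPI. */
static int  i_own_panel(int k)           { (void)k; return 1; }
static void panel_factorize(int k)       { (void)k; }
static void broadcast_panel(int k)       { (void)k; }
static void update_trailing_dgemm(int k) { (void)k; }

int main(void)
{
    for (int k = 0; k < N / B; k++) {      /* one iteration per block column */
        if (i_own_panel(k))
            panel_factorize(k);            /* "Panel fact, etc."             */
        broadcast_panel(k);                /* MPI communication              */
        update_trailing_dgemm(k);          /* DGEMM on this process's data   */
    }                                      /* back to the top of the loop    */
    puts("done");
    return 0;
}
```

Because every process must finish its DGEMM before the next iteration's communication, the slowest process sets the pace of the whole run.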

  14. Requirements on a Heterogeneous System
  HPL is designed for homogeneous systems, but:
  • Intra-node heterogeneity: a node has both general-purpose CPUs and a SIMD accelerator
  • Inter-node heterogeneity: (in the previous TSUBAME configuration) about half the nodes have accelerators, while the others do not
  • We want to keep modifications to the HPL source code small
  How can we run HPL efficiently on heterogeneous systems?

  15. Three System Configurations
  • CPU-Only: no heterogeneity
  • Half-Acc'd: intra-node + inter-node heterogeneity
  • Fully-Acc'd: intra-node heterogeneity

  16. Our Basic Policy (1/2)
  • For intra-node heterogeneity, we 'virtualize' heterogeneous processors at the library layer
    • Processors are providers of DGEMM performance
    • We control the mapping between processes and processors
  (Figure: example of the mapping between processes and processors during DGEMM; see the sketch below)
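One way to picture this virtualization is a single DGEMM entry point that splits the row dimension between a CPU BLAS and the accelerator BLAS. The sketch below is our illustration of the idea; the naive backend, the 70/30 split, and the sequential (rather than concurrent) execution are simplifying assumptions, not the authors' library.

```c
/* Sketch of "virtualizing" a node's processors behind one DGEMM entry
 * point: rows of C are split between a CPU backend and an accelerator
 * backend. Backends and the 70/30 split are illustrative placeholders
 * (standing in for GOTO BLAS threads and CSXL BLAS). */
#include <stdio.h>
#include <stddef.h>

/* Naive row-major DGEMM used here as a stand-in for either backend. */
static void backend_dgemm(int m, int n, int k,
                          const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int p = 0; p < k; p++)
                s += A[(size_t)i * k + p] * B[(size_t)p * n + j];
            C[(size_t)i * n + j] += s;
        }
}

/* One "virtual" DGEMM provider: split rows between the two processors.
 * In a real library the two parts would run concurrently (CPU threads
 * plus an asynchronous PCI-X offload) with a measured split ratio. */
static void hetero_dgemm(int m, int n, int k,
                         const double *A, const double *B, double *C)
{
    int m_acc = (int)(m * 0.7);                       /* accelerator share */
    backend_dgemm(m_acc, n, k, A, B, C);              /* "accelerator" part */
    backend_dgemm(m - m_acc, n, k,
                  A + (size_t)m_acc * k, B,
                  C + (size_t)m_acc * n);             /* "CPU" part */
}

int main(void)
{
    enum { M = 4, N = 3, K = 2 };
    double A[M * K], B[K * N], C[M * N] = {0};
    for (int i = 0; i < M * K; i++) A[i] = 1.0;
    for (int i = 0; i < K * N; i++) B[i] = 1.0;
    hetero_dgemm(M, N, K, A, B, C);
    printf("C[0] = %g (expected %d)\n", C[0], K);
    return 0;
}
```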

  17. Our Basic Policy (2/2)
  • For inter-node heterogeneity, we control the number of processes among nodes (cf. CHARM++ and AMPI from UIUC)
  • This keeps the kernel workload of each process uniform (good for HPL) while coping with heterogeneity; a rough sketch follows below
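A rough illustration of the inter-node balancing idea (our sketch, not the authors' tool): pick the number of processes per node so that each process stands for roughly the same DGEMM throughput. The GFlops-per-process target and the use of peak figures are assumptions for the example; the real runs used measured performance and hand tuning.

```c
/* Illustrative helper: choose a per-node process count so that every
 * process represents roughly the same DGEMM throughput. Peak figures
 * are taken from the earlier slides; the per-process target is assumed. */
#include <stdio.h>

static int processes_for_node(double node_gflops, double gflops_per_process)
{
    int p = (int)(node_gflops / gflops_per_process + 0.5);  /* round */
    return p > 0 ? p : 1;
}

int main(void)
{
    double per_proc = 19.2;                 /* assumed target: ~4 Opteron cores' worth */
    double cpu_node = 16 * 4.8;             /* 16 Opteron cores x 4.8 GFlops            */
    double acc_node = cpu_node + 80.6;      /* plus one ClearSpeed board                */

    printf("node w/o accelerator: %d processes\n",
           processes_for_node(cpu_node, per_proc));   /* 4 */
    printf("node with accelerator: %d processes\n",
           processes_for_node(acc_node, per_proc));   /* 8 */
    return 0;
}
```

With equally powerful processes, HPL's uniform-process assumption is preserved; accelerated nodes simply host more of them.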

  18. Careful Tuning is Necessary for Performance
  Since SIMD accelerators are sensitive to many HPL parameters, careful tuning is necessary:
  • Process granularity
  • Process mapping
  • Block size
  We need different tuning for each system configuration.

  19. Tuning of Process Granularity
  • We can tune 'process granularity' as the number of BLAS threads per process (e.g., on a 16-core node, 2 processes x 8 threads is coarse and 16 processes x 1 thread is fine)
  • If processes are too coarse (each process uses many threads), it is more difficult to balance among nodes
  • If processes are too fine, HPL suffers from duplicated computation

  20. Tuning of Block Size
  • When the block size B is small, ClearSpeed performance is heavily degraded
  • When B is too large, HPL suffers from large overhead for panel factorization

  21. Tuning in the "CPU-Only" Case
  • Configuration: 648 nodes x 16 Opteron cores, no accelerators
  • Focus: bringing out the BLAS performance of the CPUs
  • Block size is 240, which is good for GOTO BLAS

  22. Tuning in the "Fully-Acc'd" Case
  • Configuration: 648 nodes x 16 Opteron cores, each node with a ClearSpeed board (accessed via PCI-X communication)
  • Focus:
    • Process granularity and block size should be large enough for ClearSpeed BLAS
    • Balancing among processes, while utilizing both kinds of processors
  • Block size is 864, which is good for ClearSpeed BLAS

  23. Tuning in the "Half-Acc'd" Case
  • Configuration: 288 nodes without ClearSpeed + 360 nodes with a ClearSpeed board on PCI-X
  • Focus: balance between accelerated and non-accelerated nodes
  • Block size is 864

  24. Experimentation
  • 648 SunFire X4600 nodes of TSUBAME
  • Modified HPL + Voltaire MPI + GOTO BLAS + CSXL BLAS
  • Three configurations:
    • CPU-Only: only Opteron CPUs are used
    • Half-Acc'd: only half the nodes are accelerated
    • Fully-Acc'd: all the nodes are accelerated

  25. Experimental Results
  (Chart: relative speed, with CPU-Only = 1)
  • 38.18 TF in "CPU-Only"
  • 48.88 TF in "Half-Acc'd": +28% over CPU-Only
  • 63.51 TF in "Fully-Acc'd": +66% over CPU-Only
  • Check the precise figures in the next Top500 in June!

  26. Summary
  • The scalability of heterogeneous supercomputers with SIMD accelerators is demonstrated
  • >60 TFlops Linpack performance is achieved
  • Our method works efficiently even when nodes are only partially accelerated
  Future work:
  • From hand-tuning to automatic tuning
  • Other useful applications!
