The FFTE Library and the HPC Challenge (HPCC) Benchmark Suite

Daisuke Takahashi

Center for Computational Sciences/

Graduate School of Systems and Information Engineering

University of Tsukuba

First French-Japanese PAAP Workshop


Outline

  • HPC Challenge (HPCC) Benchmark Suite

    • Overview

    • The Benchmark Tests

    • Example Results

  • FFTE: A High-Performance FFT Library

    • Background

    • Related Works

    • Block Six-Step/Nine-Step FFT Algorithm

    • Performance Results

    • Conclusion and Future Work

Overview of the HPC Challenge (HPCC) Benchmark Suite

  • HPC Challenge (HPCC) is a suite of tests that examine the performance of HPC architectures using kernels.

  • The suite provides benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g.,

    • Spatial locality

    • Temporal locality

The Benchmark Tests

  • The HPC Challenge benchmark currently consists of 7 performance tests:

    • HPL (High Performance Linpack)

    • DGEMM (matrix-matrix multiplication)

    • STREAM (sustainable memory bandwidth)

    • PTRANS (A=A+B^T, parallel matrix transpose)

    • RandomAccess (integer updates to random memory locations)

    • FFT (complex 1-D discrete Fourier transform)

    • b_eff (MPI latency/bandwidth test)
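The memory access patterns behind three of these tests can be sketched in a few lines of Python. This is a simplified illustration, not the HPCC reference code: the function names are hypothetical, and the pseudo-random generator in `random_access` is an ordinary 64-bit linear congruential generator chosen for brevity, which is an assumption of this sketch rather than the generator HPCC uses.

```python
# Simplified, illustrative versions of three HPCC kernels (not the reference code).

def stream_triad(b, c, scalar):
    """STREAM Triad: a[i] = b[i] + scalar * c[i], a unit-stride sweep."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]


def ptrans(a, b):
    """PTRANS: A = A + B^T for square matrices stored as lists of rows."""
    n = len(a)
    return [[a[i][j] + b[j][i] for j in range(n)] for i in range(n)]


def random_access(table, num_updates, seed=1):
    """RandomAccess-style kernel: XOR pseudo-random values into a table.

    The 64-bit LCG below is an assumption made for this sketch.
    The table length must be a power of two.
    """
    mask = len(table) - 1
    x = seed
    for _ in range(num_updates):
        x = (6364136223846793005 * x + 1442695040888963407) % (1 << 64)
        table[(x >> 32) & mask] ^= x   # update at an unpredictable location
    return table
```

`stream_triad` touches memory with unit stride (high spatial locality, no reuse), while `random_access` jumps to unpredictable locations (low spatial and temporal locality). Because XOR is its own inverse, applying the same update sequence twice restores the original table.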

Targeted Application Areas in the Memory Access Locality Space

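[Figure: benchmarks and application areas placed in a plane whose axes are spatial locality and temporal locality. Benchmarks shown: PTRANS, STREAM, HPL, DGEMM, RandomAccess, FFT. Application areas shown: CFD, Radar X-section, TSP, DSP.]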

HPCC Testing Scenarios

  • Local (S-STREAM, S-RandomAccess, S-DGEMM, S-FFTE)

    • Only a single MPI process computes.

  • Embarrassingly parallel (EP-STREAM, EP-RandomAccess, EP-DGEMM, EP-FFTE)

    • All processes compute and do not communicate (explicitly).

  • Global (G-HPL, G-PTRANS, G-RandomAccess, G-FFTE)

    • All processes compute and communicate.

  • Network only (RandomRing Bandwidth, etc.)

Sample results page: http://icl.cs.utk.edu/hpcc/hpcc_results.cgi

The Winners of the 2006 HPC Challenge Class 1 Awards

  • G-HPL: 259 TFlop/s

    • IBM Blue Gene/L (131072 Procs)

  • G-RandomAccess: 35 GUPS

    • IBM Blue Gene/L (131072 Procs)

  • G-FFTE: 2311 GFlop/s

    • IBM Blue Gene/L (131072 Procs)

  • EP-STREAM-Triad (system): 160 TB/s

    • IBM Blue Gene/L (131072 Procs)

FFTE: A High-Performance FFT Library

  • FFTE is a Fortran subroutine library for computing the Fast Fourier Transform (FFT) in one or more dimensions.

  • It includes complex, mixed-radix and parallel transforms.

    • Shared / Distributed memory parallel computers (OpenMP, MPI and OpenMP + MPI)

  • It also supports Intel’s SSE2/SSE3 instructions.

  • The FFTE library can be obtained from http://www.ffte.jp

Background

  • One goal for large FFTs is to minimize the number of cache misses.

  • Many FFT algorithms work well when data sets fit into a cache.

  • When a problem exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically.

  • The conventional six-step FFT algorithm requires

    • Two multicolumn FFTs.

    • Three data transpositions. → The chief bottlenecks in cache-based processors.

Related Works

  • FFTW [Frigo and Johnson (MIT)]

    • The recursive call is employed to access main memory hierarchically.

    • This technique is very effective when the total amount of data is not much greater than the cache size.

    • For parallel FFT, the conventional six-step FFT is used.

    • http://www.fftw.org

  • SPIRAL [Pueschel et al. (CMU)]

    • The goal of SPIRAL is to push the limits of automation in software and hardware development and optimization for digital signal processing (DSP) algorithms.

    • http://www.spiral.net

Approach

  • Some previously presented six-step FFT algorithms separate the multicolumn FFTs from the transpositions.

  • Taking the opposite approach, we combine the multicolumn FFTs and transpositions to reduce the number of cache misses.

  • We modify the conventional six-step FFT algorithm to reuse data in the cache memory. → We will call it a “block six-step FFT”.

Discrete Fourier Transform (DFT)

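  • The DFT is given by

    $$y(k) = \sum_{j=0}^{n-1} x(j)\, \omega_n^{jk}, \qquad 0 \le k \le n-1, \qquad \omega_n = e^{-2\pi i/n}.$$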

2-D Formulation

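  • If $n$ has factors $n_1$ and $n_2$ ($n = n_1 n_2$), then the indices $j$ and $k$ can be expressed as

    $$j = j_1 + j_2 n_1, \qquad k = k_2 + k_1 n_2,$$

    and the $n$-point DFT can be rewritten as

    $$y(k_2, k_1) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} x(j_1, j_2)\, \omega_{n_2}^{j_2 k_2}\, \omega_{n_1 n_2}^{j_1 k_2}\, \omega_{n_1}^{j_1 k_1}, \qquad \omega_m = e^{-2\pi i/m}.$$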

Six-Step FFT Algorithm

[Figure: data flow of the six-step FFT, showing three transposes interleaved with sets of individual fixed-length FFTs.]
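The six steps can be sketched in Python as follows. This is a minimal illustration, not FFTE's Fortran code: it assumes the usual index mapping j = j1 + j2*n1 and k = k2 + k1*n2, the function names are hypothetical, and a naive O(m^2) DFT stands in for each optimized short FFT.

```python
import cmath


def omega(m, e):
    """Return exp(-2*pi*i*e/m), the m-th root of unity raised to the power e."""
    return cmath.exp(-2j * cmath.pi * e / m)


def dft(v):
    """Naive O(m^2) DFT; a stand-in for an optimized m-point FFT."""
    m = len(v)
    return [sum(v[j] * omega(m, j * k) for j in range(m)) for k in range(m)]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def six_step_fft(x, n1, n2):
    """DFT of the list x (length n1*n2), with j = j1 + j2*n1 and k = k2 + k1*n2."""
    n = n1 * n2
    assert len(x) == n
    a = [x[j2 * n1:(j2 + 1) * n1] for j2 in range(n2)]   # a[j2][j1]
    a = transpose(a)                                     # step 1: a[j1][j2]
    a = [dft(row) for row in a]                          # step 2: n1 FFTs of length n2
    a = [[a[j1][k2] * omega(n, j1 * k2) for k2 in range(n2)]
         for j1 in range(n1)]                            # step 3: twiddle factors
    a = transpose(a)                                     # step 4: a[k2][j1]
    a = [dft(row) for row in a]                          # step 5: n2 FFTs of length n1
    a = transpose(a)                                     # step 6: a[k1][k2]
    return [v for row in a for v in row]                 # flat index k = k2 + k1*n2
```

For any factorization of the input length, for example `six_step_fft(x, 3, 4)` on a 12-element list, the result should agree with `dft(x)` up to rounding. Each `transpose` call stands for a full pass over main memory, which is what makes the transpositions costly on cache-based processors.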

Block Six-Step FFT Algorithm

[Figure: data flow of the block six-step FFT, showing partial transposes, sets of individual fixed-length FFTs, and a transpose.]
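The idea of fusing the transposes with the column FFTs can be sketched as below. This is a hedged illustration of the blocking pattern only, not FFTE's implementation: the block size `nb`, the function names, and the index mapping j = j1 + j2*n1, k = k2 + k1*n2 are assumptions of the sketch, a naive DFT stands in for the short FFTs, and Python will not show the cache benefit that a compiled implementation would.

```python
import cmath


def omega(m, e):
    """Return exp(-2*pi*i*e/m)."""
    return cmath.exp(-2j * cmath.pi * e / m)


def dft(v):
    """Naive O(m^2) DFT; a stand-in for an optimized m-point FFT."""
    m = len(v)
    return [sum(v[j] * omega(m, j * k) for j in range(m)) for k in range(m)]


def block_six_step_fft(x, n1, n2, nb=2):
    """DFT of x (length n1*n2), processing nb columns at a time."""
    n = n1 * n2
    assert len(x) == n
    mid = [[0j] * n1 for _ in range(n2)]          # mid[k2][j1]
    for start in range(0, n1, nb):
        cols = range(start, min(start + nb, n1))
        # Partial transpose: copy nb strided columns into a small work array.
        work = [[x[j1 + j2 * n1] for j2 in range(n2)] for j1 in cols]
        # n2-point FFTs on the block while it is (ideally) still in cache.
        work = [dft(col) for col in work]
        # Twiddle multiplication fused with the store in transposed order.
        for col, j1 in zip(work, cols):
            for k2 in range(n2):
                mid[k2][j1] = col[k2] * omega(n, j1 * k2)
    y = [0j] * n
    for k2 in range(n2):
        row = dft(mid[k2])                        # n1-point FFT over j1
        for k1 in range(n1):
            y[k2 + k1 * n2] = row[k1]             # final transpose on output
    return y
```

The result does not depend on `nb`; the block size only controls how many columns are gathered, transformed, and written back together, so it would be tuned so that the work array fits in cache.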

3-D Formulation

  • For very large FFTs, we should switch to a 3-D formulation.

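  • If $n$ has factors $n_1$, $n_2$ and $n_3$ ($n = n_1 n_2 n_3$), then the indices can be expressed as

    $$j = j_1 + j_2 n_1 + j_3 n_1 n_2, \qquad k = k_3 + k_2 n_3 + k_1 n_2 n_3,$$

    and the $n$-point DFT can be rewritten as

    $$y(k_3, k_2, k_1) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} \sum_{j_3=0}^{n_3-1} x(j_1, j_2, j_3)\, \omega_{n_3}^{j_3 k_3}\, \omega_{n_2 n_3}^{j_2 k_3}\, \omega_{n_2}^{j_2 k_2}\, \omega_{n}^{j_1 k_3}\, \omega_{n_1 n_2}^{j_1 k_2}\, \omega_{n_1}^{j_1 k_1}, \qquad \omega_m = e^{-2\pi i/m}.$$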
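A three-factor split of the twiddle factor can be checked numerically. The sketch below assumes the common index mapping j = j1 + j2*n1 + j3*n1*n2 and k = k3 + k2*n3 + k1*n2*n3, under which the n-th root of unity raised to j*k factors into six smaller twiddle factors; the function names are hypothetical.

```python
import cmath
from itertools import product


def omega(m, e):
    """Return exp(-2*pi*i*e/m)."""
    return cmath.exp(-2j * cmath.pi * e / m)


def max_twiddle_error(n1, n2, n3):
    """Largest gap between omega_n^(j*k) and its six-factor form."""
    n = n1 * n2 * n3
    worst = 0.0
    ranges = (range(n1), range(n2), range(n3)) * 2
    for j1, j2, j3, k1, k2, k3 in product(*ranges):
        j = j1 + j2 * n1 + j3 * n1 * n2
        k = k3 + k2 * n3 + k1 * n2 * n3
        direct = omega(n, j * k)
        split = (omega(n3, j3 * k3) * omega(n2 * n3, j2 * k3)
                 * omega(n2, j2 * k2) * omega(n, j1 * k3)
                 * omega(n1 * n2, j1 * k2) * omega(n1, j1 * k1))
        worst = max(worst, abs(direct - split))
    return worst
```

A call such as `max_twiddle_error(2, 3, 4)` should return a value at rounding-error level, confirming that the transform can be carried out as three sets of short FFTs with twiddle multiplications in between.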

Parallel Block Nine-Step FFT

[Figure: data flow of the parallel block nine-step FFT, showing partial transposes and an all-to-all communication step.]

Operation Counts for an $n$-point FFT

  • Conventional FFT algorithms (e.g., Cooley-Tukey FFT, Stockham FFT)

    • Arithmetic operations:

    • Main memory accesses:

  • Block Nine-Step FFT

    • Arithmetic operations:

    • Main memory accesses (ideal case):

Performance Results

  • To evaluate the implemented parallel FFTs, we compared

    • The implemented parallel FFT, named FFTE (ver. 4.0, supports SSE3, using MPI)

    • FFTW (ver. 2.1.5, no SSE3 support, using MPI)

  • Target parallel machine:

    • A 32-node dual PC SMP cluster (Irwindale 3 GHz, 1 GB DDR2-400 SDRAM / node, Linux 2.4.17-1smp).

    • Interconnected through a Gigabit Ethernet switch.

    • LAM/MPI 7.1.1 was used as the communication library.

    • The compilers used were gcc 4.0.2 and g77 3.2.3.

Discussion

  • For N = 2^29 and P = 32, FFTE runs about 1.72 times faster than FFTW.

    • The performance of the FFTE remains at a high level even for the larger problem size, owing to cache blocking.

    • Since the FFTW uses the conventional six-step FFT, each column FFT does not fit into the L1 data cache.

    • Moreover, the FFTE exploits the SSE3 instructions.

  • These are the three reasons why FFTE is more advantageous than FFTW.
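To put the N = 2^29, P = 32 case in perspective, the data volume and nominal operation count can be worked out directly. The sketch assumes double-precision complex data (16 bytes per element), treats P as the number of MPI processes, and uses the customary 5 N log2 N operation count for reporting FFT rates; the function name is hypothetical.

```python
import math


def fft_problem_stats(n, p, bytes_per_element=16):
    """Data volume and nominal flop count of one n-point complex FFT on p processes."""
    total_bytes = n * bytes_per_element
    return {
        "total_gib": total_bytes / 2**30,           # whole input array
        "per_process_mib": total_bytes / p / 2**20, # share held by each process
        "flops": 5 * n * math.log2(n),              # nominal 5 N log2 N count
    }


stats = fft_problem_stats(2**29, 32)
```

Under these assumptions the input array alone occupies 8 GiB in total, or 256 MiB per process, which is far beyond any cache level and a sizable share of the 1 GB available per node. That is the regime in which cache blocking is expected to matter most.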

Conclusion and Future Work

  • The block nine-step FFT algorithm is most advantageous with processors that have a considerable gap between the speed of the cache memory and that of the main memory.

  • Toward petascale computing systems:

    • Exploiting the multi-level parallelism:

      • SIMD or Vector accelerator

      • Multi-core

      • Multi-socket

      • Multi-node

    • Reducing the number of main memory accesses.

    • Improving the all-to-all communication performance.

      • In the G-FFTE, the all-to-all communication occurs three times.
