The FFTE Library and the HPC Challenge (HPCC) Benchmark Suite

Daisuke Takahashi

Center for Computational Sciences/

Graduate School of Systems and Information Engineering

University of Tsukuba

First French-Japanese PAAP Workshop


Outline

  • HPC Challenge (HPCC) Benchmark Suite

    • Overview

    • The Benchmark Tests

    • Example Results

  • FFTE: A High-Performance FFT Library

    • Background

    • Related Works

    • Block Six-Step/Nine-Step FFT Algorithm

    • Performance Results

    • Conclusion and Future Work

Overview of the HPC Challenge (HPCC) Benchmark Suite

  • HPC Challenge (HPCC) is a suite of tests that examine the performance of HPC architectures using kernels.

  • The suite provides benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g.,

    • Spatial locality

    • Temporal locality

The Benchmark Tests

  • The HPC Challenge benchmark currently consists of seven performance tests:

    • HPL (High Performance Linpack)

    • DGEMM (matrix-matrix multiplication)

    • STREAM (sustainable memory bandwidth)

    • PTRANS (A=A+B^T, parallel matrix transpose)

    • RandomAccess (integer updates to random memory locations)

    • FFT (complex 1-D discrete Fourier transform)

    • b_eff (MPI latency/bandwidth test)
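
To make the memory access patterns of some of these kernels concrete, here is a minimal pure-Python sketch of three of them. The function names are hypothetical and the RandomAccess update is simplified (the real benchmark generates its own pseudo-random stream); the actual HPCC codes are optimized C/Fortran programs run at far larger sizes.

```python
# Toy versions of three HPCC kernels (illustrative only, not the benchmark code).

def stream_triad(b, c, q):
    """STREAM Triad: a[i] = b[i] + q * c[i] (unit stride, almost no data reuse)."""
    return [bi + q * ci for bi, ci in zip(b, c)]


def ptrans(a, b):
    """PTRANS update A = A + B^T for square matrices stored as lists of rows."""
    n = len(a)
    return [[a[i][j] + b[j][i] for j in range(n)] for i in range(n)]


def random_access(table, updates):
    """RandomAccess-style update: XOR each value into a table slot chosen by its low bits.

    Assumes len(table) is a power of two so that masking selects a valid slot.
    """
    mask = len(table) - 1
    for value in updates:
        table[value & mask] ^= value
    return table


if __name__ == "__main__":
    print(stream_triad([1.0, 2.0], [3.0, 4.0], 2.0))
    print(ptrans([[1, 2], [3, 4]], [[10, 20], [30, 40]]))
    print(random_access([0] * 4, [5, 6, 5]))
```

STREAM walks memory contiguously but touches each element once, PTRANS mixes row-wise and column-wise accesses, and RandomAccess jumps to unpredictable locations, which is why the three stress different parts of the memory system.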

Targeted Application Areas in the Memory Access Locality Space

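[Figure: benchmarks and application areas plotted by spatial locality against temporal locality. Benchmarks shown: PTRANS, STREAM, HPL, DGEMM, RandomAccess, FFT. Application areas shown: CFD, Radar X-section, TSP, DSP.]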

HPCC Testing Scenarios

  • Local (S-STREAM, S-RandomAccess, S-DGEMM, S-FFTE)

    • Only a single MPI process computes.

  • Embarrassingly parallel (EP-STREAM, EP-RandomAccess, EP-DGEMM, EP-FFTE)

    • All processes compute and do not communicate (explicitly).

  • Global (G-HPL, G-PTRANS, G-RandomAccess, G-FFTE)

    • All processes compute and communicate.

  • Network only (RandomRing Bandwidth, etc.)

Sample results page: http://icl.cs.utk.edu/hpcc/hpcc_results.cgi

The winners of the 2006 HPC Challenge Class 1 Awards

  • G-HPL: 259 TFlop/s

    • IBM Blue Gene/L (131072 Procs)

  • G-RandomAccess: 35 GUPS

    • IBM Blue Gene/L (131072 Procs)

  • G-FFTE: 2311 GFlop/s

    • IBM Blue Gene/L (131072 Procs)

  • EP-STREAM-Triad (system): 160 TB/s

    • IBM Blue Gene/L (131072 Procs)

FFTE: A High-Performance FFT Library

  • FFTE is a Fortran subroutine library for computing the Fast Fourier Transform (FFT) in one or more dimensions.

  • It includes complex, mixed-radix and parallel transforms.

    • Shared / Distributed memory parallel computers (OpenMP, MPI and OpenMP + MPI)

  • It also supports Intel’s SSE2/SSE3 instructions.

  • The FFTE library can be obtained from http://www.ffte.jp

Background

  • One goal for large FFTs is to minimize the number of cache misses.

  • Many FFT algorithms work well when data sets fit into a cache.

  • When a problem exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically.

  • The conventional six-step FFT algorithm requires

    • Two multicolumn FFTs.

    • Three data transpositions. → The chief bottlenecks in cache-based processors.

Related Works

  • FFTW [Frigo and Johnson (MIT)]

    • Recursive calls are used to access main memory hierarchically.

    • This technique is very effective when the total amount of data is not much larger than the cache size.

    • For parallel FFT, the conventional six-step FFT is used.

    • http://www.fftw.org

  • SPIRAL [Pueschel et al. (CMU)]

    • The goal of SPIRAL is to push the limits of automation in software and hardware development and optimization for digital signal processing (DSP) algorithms.

    • http://www.spiral.net

Approach

  • Some previously presented six-step FFT algorithms separate the multicolumn FFTs from the transpositions.

  • Taking the opposite approach, we combine the multicolumn FFTs and transpositions to reduce the number of cache misses.

  • We modify the conventional six-step FFT algorithm to reuse data in the cache memory. → We will call it a “block six-step FFT”.

Discrete Fourier Transform (DFT)

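  • DFT is given by

    $$ y(k) = \sum_{j=0}^{N-1} x(j)\, \omega_N^{jk}, \qquad 0 \le k \le N-1, \qquad \omega_N = e^{-2\pi i/N} $$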
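
As a reference point for the faster algorithms, the definition can be evaluated directly. The sketch below assumes the common forward-transform convention with root of unity exp(-2πi/N); the function name `dft` is illustrative and is not part of FFTE.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of y(k) = sum_j x(j) * w^(j*k), with w = exp(-2*pi*i/N)."""
    n = len(x)
    return [
        # Reduce j*k modulo n before taking the exponential to limit rounding error.
        sum(x[j] * cmath.exp(-2j * cmath.pi * ((j * k) % n) / n) for j in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    # A unit impulse transforms to a constant sequence.
    print(dft([1, 0, 0, 0]))
```

This direct form costs on the order of N^2 operations and serves only as a correctness check for small N.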

2-D Formulation

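  • If N has factors N_1 and N_2 (N = N_1 N_2), then, writing j = j_1 + j_2 N_1 and k = k_2 + k_1 N_2, the DFT becomes

    $$ y(k_2, k_1) = \sum_{j_1=0}^{N_1-1} \sum_{j_2=0}^{N_2-1} x(j_1, j_2)\, \omega_{N_2}^{j_2 k_2}\, \omega_{N}^{j_1 k_2}\, \omega_{N_1}^{j_1 k_1}, \qquad \omega_m = e^{-2\pi i/m} $$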

Six-Step FFT Algorithm

[Figure: data flow of the six-step FFT, showing three transposes and the individual multicolumn FFTs.]
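
The six steps can be sketched in pure Python as follows. This is a minimal illustration, not FFTE's code: it assumes the index mapping j = j1 + j2*n1 and k = k2 + k1*n2, uses a direct DFT (hypothetical helper `dft`) as a stand-in for the short FFTs, and all function names are made up for the example.

```python
import cmath


def dft(v):
    """Direct DFT, used here as a stand-in for the short in-cache FFTs."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * ((j * k) % n) / n) for j in range(n))
            for k in range(n)]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def six_step_fft(x, n1, n2):
    """Six-step FFT of length n1*n2 with j = j1 + j2*n1 and k = k2 + k1*n2."""
    n = n1 * n2
    assert len(x) == n
    rows = [x[j2 * n1:(j2 + 1) * n1] for j2 in range(n2)]   # rows[j2][j1]
    a = transpose(rows)                                      # step 1: a[j1][j2]
    b = [dft(seq) for seq in a]                              # step 2: n1 FFTs of length n2
    b = [[b[j1][k2] * cmath.exp(-2j * cmath.pi * (j1 * k2) / n)
          for k2 in range(n2)] for j1 in range(n1)]          # step 3: twiddle factors
    c = transpose(b)                                         # step 4: c[k2][j1]
    d = [dft(seq) for seq in c]                              # step 5: n2 FFTs of length n1
    e = transpose(d)                                         # step 6: e[k1][k2]
    return [e[k1][k2] for k1 in range(n1) for k2 in range(n2)]


if __name__ == "__main__":
    data = [complex(i, (i * i) % 5) for i in range(12)]
    err = max(abs(p - q) for p, q in zip(six_step_fft(data, 4, 3), dft(data)))
    print("max deviation from direct DFT:", err)
```

Each transpose touches the whole array, which is the cost the blocked variant tries to reduce.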

Block Six-Step FFT Algorithm

[Figure: data flow of the block six-step FFT, showing two partial transposes, the individual multicolumn FFTs and one transpose.]
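
A rough sketch of the blocking idea follows, under stated assumptions: `nb` is a hypothetical block size, a direct DFT stands in for the in-cache FFT, and the data is kept in a flat list with x(j1, j2) at index j1 + j2*n1. Pure Python does not model caches, so this only illustrates the loop structure in which a small work array is transposed, transformed and written back while it would still be cache-resident.

```python
import cmath


def dft(v):
    """Direct DFT, a stand-in for an in-cache FFT routine."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * ((j * k) % n) / n) for j in range(n))
            for k in range(n)]


def block_six_step_fft(x, n1, n2, nb):
    """Blocked six-step FFT sketch: length n1*n2, processed nb sequences at a time."""
    n = n1 * n2
    assert len(x) == n and nb >= 1
    x = list(x)          # working copy; x(j1, j2) lives at x[j1 + j2*n1]
    y = [0j] * n         # result; y(k2, k1) lives at y[k2 + k1*n2]

    # Phase 1: for each block of nb values of j1, gather (partial transpose),
    # run the n2-point FFTs, apply twiddles, and scatter back (partial transpose).
    for ii in range(0, n1, nb):
        block = range(ii, min(ii + nb, n1))
        work = [[x[j1 + j2 * n1] for j2 in range(n2)] for j1 in block]
        work = [dft(seq) for seq in work]
        for r, j1 in enumerate(block):
            for k2 in range(n2):
                tw = cmath.exp(-2j * cmath.pi * (j1 * k2) / n)
                x[j1 + k2 * n1] = work[r][k2] * tw

    # Phase 2: n1-point FFTs over contiguous columns, nb columns at a time,
    # with the final transpose folded into the store to y.
    for jj in range(0, n2, nb):
        for k2 in range(jj, min(jj + nb, n2)):
            col = dft(x[k2 * n1:(k2 + 1) * n1])
            for k1 in range(n1):
                y[k2 + k1 * n2] = col[k1]
    return y


if __name__ == "__main__":
    data = [complex(i % 7, -i) for i in range(24)]
    err = max(abs(p - q) for p, q in zip(block_six_step_fft(data, 4, 6, 2), dft(data)))
    print("max deviation from direct DFT:", err)
```

The block size does not have to divide n1 or n2 in this sketch; a real implementation would choose it so that the work array fits in cache.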

3-D Formulation

  • For very large FFTs, we should switch to a 3-D formulation.

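  • If N has factors N_1, N_2 and N_3 (N = N_1 N_2 N_3), then, writing j = j_1 + j_2 N_1 + j_3 N_1 N_2 and k = k_3 + k_2 N_3 + k_1 N_2 N_3, the DFT becomes

    $$ y(k_3, k_2, k_1) = \sum_{j_1=0}^{N_1-1} \sum_{j_2=0}^{N_2-1} \sum_{j_3=0}^{N_3-1} x(j_1, j_2, j_3)\, \omega_{N_3}^{j_3 k_3}\, \omega_{N_2 N_3}^{j_2 k_3}\, \omega_{N_2}^{j_2 k_2}\, \omega_{N}^{j_1 k_3}\, \omega_{N_1 N_2}^{j_1 k_2}\, \omega_{N_1}^{j_1 k_1}, \qquad \omega_m = e^{-2\pi i/m} $$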
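
The three-factor decomposition can be checked numerically. The sketch below assumes the index mapping j = j1 + j2*n1 + j3*n1*n2 and k = k3 + k2*n3 + k1*n2*n3 and evaluates the transform in three stages of short DFTs with twiddle multiplications in between. It omits the transposes and communication of a real nine-step FFT, and all names are illustrative.

```python
import cmath


def w(m, e):
    """exp(-2*pi*i*e/m): the e-th power of the primitive m-th root of unity."""
    return cmath.exp(-2j * cmath.pi * (e % m) / m)


def dft(v):
    """Direct DFT, a stand-in for the short FFTs."""
    n = len(v)
    return [sum(v[j] * w(n, j * k) for j in range(n)) for k in range(n)]


def fft_3d_decomposition(x, n1, n2, n3):
    """Length n1*n2*n3 DFT evaluated through three stages of short transforms."""
    n = n1 * n2 * n3
    assert len(x) == n

    # Stage 1: n3-point transforms over j3, then twiddle w_{n2*n3}^(j2*k3).
    a = {}
    for j1 in range(n1):
        for j2 in range(n2):
            seq = dft([x[j1 + j2 * n1 + j3 * n1 * n2] for j3 in range(n3)])
            for k3 in range(n3):
                a[j1, j2, k3] = seq[k3] * w(n2 * n3, j2 * k3)

    # Stage 2: n2-point transforms over j2, then twiddles that depend on j1.
    b = {}
    for j1 in range(n1):
        for k3 in range(n3):
            seq = dft([a[j1, j2, k3] for j2 in range(n2)])
            for k2 in range(n2):
                b[j1, k2, k3] = seq[k2] * w(n, j1 * k3) * w(n1 * n2, j1 * k2)

    # Stage 3: n1-point transforms over j1, stored at k = k3 + k2*n3 + k1*n2*n3.
    y = [0j] * n
    for k2 in range(n2):
        for k3 in range(n3):
            seq = dft([b[j1, k2, k3] for j1 in range(n1)])
            for k1 in range(n1):
                y[k3 + k2 * n3 + k1 * n2 * n3] = seq[k1]
    return y


if __name__ == "__main__":
    data = [complex(i % 5, i % 3) for i in range(24)]
    err = max(abs(p - q) for p, q in zip(fft_3d_decomposition(data, 2, 3, 4), dft(data)))
    print("max deviation from direct DFT:", err)
```

With three factors, each short transform is roughly of length N^(1/3), so it can stay in cache for problem sizes where the two-factor split no longer does.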

Parallel Block Nine-Step FFT

[Figure: data flow of the parallel block nine-step FFT, showing partial transposes and all-to-all communication.]

Operation Counts for N-point FFT

  • Conventional FFT algorithms (e.g., Cooley-Tukey FFT, Stockham FFT)

    • Arithmetic operations:

    • Main memory accesses:

  • Block Nine-Step FFT

    • Arithmetic operations:

    • Main memory accesses (ideal case):

Performance Results

  • To evaluate the implemented parallel FFTs, we compared

    • The implemented parallel FFT, named FFTE (ver 4.0, supports SSE3, using MPI)

    • FFTW (ver. 2.1.5, no SSE3 support, using MPI)

  • Target parallel machine:

    • A 32-node dual PC SMP cluster (Irwindale 3 GHz, 1 GB DDR2-400 SDRAM / node, Linux 2.4.17-1smp).

    • Interconnected through a Gigabit Ethernet switch.

    • LAM/MPI 7.1.1 was used as the communication library.

    • The compilers used were gcc 4.0.2 and g77 3.2.3.

Discussion

  • For N = 2^29 and P = 32, FFTE runs about 1.72 times faster than FFTW.

    • The performance of FFTE remains high even for the larger problem size, owing to cache blocking.

    • Since FFTW uses the conventional six-step FFT, each column FFT does not fit into the L1 data cache.

    • Moreover, FFTE exploits the SSE3 instructions.

  • These are the three reasons why FFTE has an advantage over FFTW.

Conclusion and Future Work

  • The block nine-step FFT algorithm is most advantageous with processors that have a considerable gap between the speed of the cache memory and that of the main memory.

  • Towards Petascale computing systems,

    • Exploiting the multi-level parallelism:

      • SIMD or Vector accelerator

      • Multi-core

      • Multi-socket

      • Multi-node

    • Reducing the number of main memory accesses.

    • Improving the all-to-all communication performance.

      • In the G-FFTE, the all-to-all communication occurs three times.
