The FFTE Library and the HPC Challenge (HPCC) Benchmark Suite

Daisuke Takahashi

Center for Computational Sciences/

Graduate School of Systems and Information Engineering

University of Tsukuba

First French-Japanese PAAP Workshop

Outline
  • HPC Challenge (HPCC) Benchmark Suite
    • Overview
    • The Benchmark Tests
    • Example Results
  • FFTE: A High-Performance FFT Library
    • Background
    • Related Works
    • Block Six-Step/Nine-Step FFT Algorithm
    • Performance Results
    • Conclusion and Future Work

Overview of the HPC Challenge (HPCC) Benchmark Suite
  • HPC Challenge (HPCC) is a suite of tests that examine the performance of HPC architectures using kernels.
  • The suite provides benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g.,
    • Spatial locality
    • Temporal locality

The Benchmark Tests
  • The HPC Challenge benchmark consists at this time of 7 performance tests:
    • HPL (High Performance Linpack)
    • DGEMM (matrix-matrix multiplication)
    • STREAM (sustainable memory bandwidth)
    • PTRANS (A=A+B^T, parallel matrix transpose)
    • RandomAccess (integer updates to random memory locations)
    • FFT (complex 1-D discrete Fourier transform)
    • b_eff (MPI latency/bandwidth test)
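To make the memory access patterns of some of these tests concrete, here is a minimal Python sketch of three of the kernels (STREAM Triad, PTRANS and a RandomAccess-style update). It is an illustration only: the function names, the table initialization and the feedback constant in the pseudo-random generator are assumptions made for this sketch, and the real benchmark has its own sizes and timing rules.

```python
def stream_triad(b, c, q):
    """STREAM Triad pattern: a[i] = b[i] + q * c[i] (unit-stride streaming)."""
    return [bi + q * ci for bi, ci in zip(b, c)]


def ptrans(a, b):
    """PTRANS pattern: A = A + B^T for square matrices stored as lists of rows."""
    n = len(a)
    return [[a[i][j] + b[j][i] for j in range(n)] for i in range(n)]


def random_access(table_bits, updates, seed=1):
    """RandomAccess-style pattern: XOR pseudo-random values into a table."""
    size = 1 << table_bits
    table = list(range(size))
    ran = seed
    for _ in range(updates):
        # ASSUMPTION: a 64-bit shift register with feedback constant 0x7.
        ran = ((ran << 1) ^ (0x7 if ran >> 63 else 0)) & 0xFFFFFFFFFFFFFFFF
        # The low bits of the random value pick the table entry to update.
        table[ran & (size - 1)] ^= ran
    return table


triad = stream_triad([1.0, 2.0], [3.0, 4.0], 2.0)
summed = ptrans([[1, 2], [3, 4]], [[10, 20], [30, 40]])
table = random_access(table_bits=2, updates=3)
```

The triad walks its arrays with unit stride, while the RandomAccess-style loop touches table entries in an essentially unpredictable order.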

Targeted Application Areas in the Memory Access Locality Space

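[Figure: HPCC kernels and application areas placed in the memory access locality space, with spatial locality on one axis and temporal locality on the other. Labels: PTRANS, STREAM, HPL, DGEMM, CFD, Radar X-section, TSP, DSP, RandomAccess, FFT.]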

HPCC Testing Scenarios
  • Local (S-STREAM, S-RandomAccess, S-DGEMM, S-FFTE)
    • Only a single MPI process computes.
  • Embarrassingly parallel (EP-STREAM, EP-RandomAccess, EP-DGEMM, EP-FFTE)
    • All processes compute and do not communicate (explicitly).
  • Global (G-HPL, G-PTRANS, G-RandomAccess, G-FFTE)
    • All processes compute and communicate.
  • Network only (RandomRing Bandwidth, etc.)

Sample results page: http://icl.cs.utk.edu/hpcc/hpcc_results.cgi

The winners of the 2006 HPC Challenge Class 1 Awards
  • G-HPL: 259 TFlop/s
    • IBM Blue Gene/L (131072 Procs)
  • G-RandomAccess: 35 GUPS
    • IBM Blue Gene/L (131072 Procs)
  • G-FFTE: 2311 GFlop/s
    • IBM Blue Gene/L (131072 Procs)
  • EP-STREAM-Triad (system): 160 TB/s
    • IBM Blue Gene/L (131072 Procs)

FFTE: A High-Performance FFT Library
  • FFTE is a Fortran subroutine library for computing the Fast Fourier Transform (FFT) in one or more dimensions.
  • It includes complex, mixed-radix and parallel transforms.
    • Shared / Distributed memory parallel computers (OpenMP, MPI and OpenMP + MPI)
  • It also supports Intel’s SSE2/SSE3 instructions.
  • The FFTE library can be obtained from http://www.ffte.jp

Background
  • One goal for large FFTs is to minimize the number of cache misses.
  • Many FFT algorithms work well when data sets fit into a cache.
  • When a problem exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically.
  • The conventional six-step FFT algorithm requires
    • Two multicolumn FFTs.
    • Three data transpositions. → The chief bottlenecks in cache-based processors.

Related Works
  • FFTW [Frigo and Johnson (MIT)]
    • Recursive calls are used to access main memory hierarchically.
    • This technique is very effective when the total amount of data is not much larger than the cache size.
    • For parallel FFT, the conventional six-step FFT is used.
    • http://www.fftw.org
  • SPIRAL [Pueschel et al. (CMU)]
    • The goal of SPIRAL is to push the limits of automation in software and hardware development and optimization for digital signal processing (DSP) algorithms.
    • http://www.spiral.net

Approach
  • Some previously presented six-step FFT algorithms separate the multicolumn FFTs from the transpositions.
  • Taking the opposite approach, we combine the multicolumn FFTs and transpositions to reduce the number of cache misses.
  • We modify the conventional six-step FFT algorithm to reuse data in the cache memory. → We will call it a “block six-step FFT”.

Discrete Fourier Transform (DFT)
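  • The DFT is given by
    $$y(k) = \sum_{j=0}^{n-1} x(j)\, \omega_n^{jk}, \qquad 0 \le k \le n-1,$$
    where $\omega_n = e^{-2\pi i/n}$ and $i = \sqrt{-1}$.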
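As a reference point, the definition can be evaluated directly. The sketch below assumes the usual forward-transform convention, y(k) = sum over j of x(j) * w^(jk) with w = exp(-2*pi*i/n), and costs O(n^2) operations; the function name `dft` is chosen for this example and is not part of FFTE.

```python
import cmath


def dft(x):
    """Direct O(n^2) evaluation of y(k) = sum_j x(j) * w**(j*k), w = exp(-2*pi*i/n)."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[j] * w ** (j * k) for j in range(n)) for k in range(n)]


impulse = dft([1, 0, 0, 0])    # a unit impulse transforms to a constant
constant = dft([1, 1, 1, 1])   # a constant transforms to a single peak at k = 0
```

An FFT computes the same values in O(n log n) operations, so a routine like this is mainly useful for checking faster implementations on small inputs.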

2-D Formulation
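  • If $n$ has factors $n_1$ and $n_2$ ($n = n_1 n_2$), then the indices can be written as $j = j_1 + j_2 n_1$ and $k = k_2 + k_1 n_2$, and the DFT becomes
    $$y(k_2, k_1) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} x(j_1, j_2)\, \omega_{n_2}^{j_2 k_2}\, \omega_{n_1 n_2}^{j_1 k_2}\, \omega_{n_1}^{j_1 k_1},$$
    with $\omega_m = e^{-2\pi i/m}$.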
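The index splitting can be checked numerically. The sketch below assumes the mapping j = j1 + j2*n1 and k = k2 + k1*n2, performs the short transforms over j2, applies the twiddle factors w_n^(j1*k2), and then performs the short transforms over j1. The helper names (`omega`, `dft`, `dft_2d_split`) are invented for this example, and a direct DFT stands in for the short FFTs.

```python
import cmath


def omega(n, e):
    """Return w_n**e with w_n = exp(-2*pi*i/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)


def dft(x):
    """Direct DFT, used both as the reference and for the short transforms."""
    n = len(x)
    return [sum(x[j] * omega(n, j * k) for j in range(n)) for k in range(n)]


def dft_2d_split(x, n1, n2):
    """Evaluate an (n1*n2)-point DFT through the 2-D formulation."""
    n = n1 * n2
    # n1 transforms of length n2 over j2, one for each j1.
    inner = [dft([x[j1 + j2 * n1] for j2 in range(n2)]) for j1 in range(n1)]
    # Twiddle factors w_n**(j1*k2).
    for j1 in range(n1):
        for k2 in range(n2):
            inner[j1][k2] *= omega(n, j1 * k2)
    # n2 transforms of length n1 over j1, one for each k2.
    y = [0j] * n
    for k2 in range(n2):
        col = dft([inner[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            y[k2 + k1 * n2] = col[k1]
    return y


x = [complex(i % 5, (3 * i) % 7) for i in range(12)]
error = max(abs(a - b) for a, b in zip(dft(x), dft_2d_split(x, 3, 4)))
```

The two results agree up to rounding error, which confirms that the three exponent terms cover every contribution to the product jk.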

Six-Step FFT Algorithm

[Figure: data flow of the six-step FFT algorithm. Surviving labels: Transpose (three times) and the individual multicolumn FFTs.]
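A conventional six-step FFT can be sketched as follows, assuming the standard ordering (transpose, multicolumn FFTs, twiddle multiplication, transpose, multicolumn FFTs, transpose). Arrays are stored as lists of columns to mimic Fortran's column-major layout, and a direct DFT stands in for the in-cache FFT routine. All names are chosen for this example and are not FFTE routines.

```python
import cmath


def omega(n, e):
    """Return w_n**e with w_n = exp(-2*pi*i/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)


def dft(x):
    """Stand-in for an in-cache FFT routine (direct O(n^2) evaluation)."""
    n = len(x)
    return [sum(x[j] * omega(n, j * k) for j in range(n)) for k in range(n)]


def transpose(cols):
    """Transpose an array stored as a list of columns."""
    return [list(c) for c in zip(*cols)]


def six_step_fft(x, n1, n2):
    n = n1 * n2
    # View x as a column-major n1 x n2 array: column j2 holds x[j1 + j2*n1].
    xc = [x[j2 * n1:(j2 + 1) * n1] for j2 in range(n2)]
    a = transpose(xc)                      # step 1: transpose to n2 x n1
    a = [dft(col) for col in a]            # step 2: n1 FFTs of length n2
    for j1 in range(n1):                   # step 3: twiddle w_n**(j1*k2)
        for k2 in range(n2):
            a[j1][k2] *= omega(n, j1 * k2)
    b = transpose(a)                       # step 4: transpose to n1 x n2
    b = [dft(col) for col in b]            # step 5: n2 FFTs of length n1
    y = transpose(b)                       # step 6: transpose to n2 x n1
    return [v for col in y for v in col]   # flatten: y[k2 + k1*n2]


x = [complex(i % 5, (3 * i) % 7) for i in range(12)]
error = max(abs(a - b) for a, b in zip(dft(x), six_step_fft(x, 3, 4)))
```

Each transpose sweeps the whole array through main memory, which is why the transpositions dominate the cost once the data no longer fit in cache.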

Block Six-Step FFT Algorithm

[Figure: data flow of the block six-step FFT algorithm. Surviving labels: Partial Transpose (twice), the individual multicolumn FFTs, and Transpose.]
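The blocking idea can be sketched as below: instead of transposing the whole array first, a block of `nb` rows is gathered into a small work array (a partial transpose), transformed, multiplied by the twiddle factors and written back while it is still cache-resident. This is a structural illustration under assumptions: the block size `nb`, the function names and the use of a direct DFT in place of the in-cache FFT are choices made for this example, and Python does not show the cache effect itself.

```python
import cmath


def omega(n, e):
    """Return w_n**e with w_n = exp(-2*pi*i/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)


def dft(x):
    """Stand-in for an in-cache FFT routine (direct O(n^2) evaluation)."""
    n = len(x)
    return [sum(x[j] * omega(n, j * k) for j in range(n)) for k in range(n)]


def block_six_step_fft(x, n1, n2, nb):
    n = n1 * n2
    xs = list(x)  # column-major n1 x n2 array: X(i, j) is xs[i + j*n1]
    for ii in range(0, n1, nb):
        rows = range(ii, min(ii + nb, n1))
        # Partial transpose: gather a block of rows into the work array.
        work = [[xs[i + j * n1] for j in range(n2)] for i in rows]
        # n2-point FFTs on the block.
        work = [dft(w) for w in work]
        # Twiddle multiplication fused with the partial transpose back.
        for w, i in zip(work, rows):
            for j in range(n2):
                xs[i + j * n1] = w[j] * omega(n, i * j)
    # n1-point FFTs on the (contiguous) columns.
    for j in range(n2):
        xs[j * n1:(j + 1) * n1] = dft(xs[j * n1:(j + 1) * n1])
    # Final transpose: y[k2 + k1*n2] = X(k1, k2).
    return [xs[k1 + k2 * n1] for k1 in range(n1) for k2 in range(n2)]


x = [complex(i % 5, (3 * i) % 7) for i in range(24)]
error = max(abs(a - b) for a, b in zip(dft(x), block_six_step_fft(x, 4, 6, nb=3)))
```

The work array holds only `nb` rows at a time, so in a compiled implementation its size can be tuned to stay within the cache.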

3-D Formulation
  • For very large FFTs, we should switch to a 3-D formulation.
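  • If $n$ has factors $n_1$, $n_2$ and $n_3$ ($n = n_1 n_2 n_3$), then with $j = j_1 + j_2 n_1 + j_3 n_1 n_2$ and $k = k_3 + k_2 n_3 + k_1 n_2 n_3$ the DFT becomes
    $$y(k_3, k_2, k_1) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} \sum_{j_3=0}^{n_3-1} x(j_1, j_2, j_3)\, \omega_{n_3}^{j_3 k_3}\, \omega_{n_2 n_3}^{j_2 k_3}\, \omega_{n_2}^{j_2 k_2}\, \omega_{n}^{j_1 k_3}\, \omega_{n_1 n_2}^{j_1 k_2}\, \omega_{n_1}^{j_1 k_1},$$
    with $\omega_m = e^{-2\pi i/m}$.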
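A small numerical check of the three-factor splitting is sketched below. It assumes the mapping j = j1 + j2*n1 + j3*n1*n2 and k = k3 + k2*n3 + k1*n2*n3, and applies three passes of short transforms with twiddle factors in between. The function names are invented for this example, and a direct DFT stands in for the short FFTs.

```python
import cmath


def omega(n, e):
    """Return w_n**e with w_n = exp(-2*pi*i/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)


def dft(x):
    """Direct DFT, used as the reference and for the short transforms."""
    n = len(x)
    return [sum(x[j] * omega(n, j * k) for j in range(n)) for k in range(n)]


def dft_3d_split(x, n1, n2, n3):
    n = n1 * n2 * n3
    # Pass 1: n3-point transforms over j3, giving a[j1][j2][k3].
    a = [[dft([x[j1 + j2 * n1 + j3 * n1 * n2] for j3 in range(n3)])
          for j2 in range(n2)] for j1 in range(n1)]
    for j1 in range(n1):
        for j2 in range(n2):
            for k3 in range(n3):
                a[j1][j2][k3] *= omega(n2 * n3, j2 * k3)
    # Pass 2: n2-point transforms over j2, giving b[j1][k3][k2].
    b = [[dft([a[j1][j2][k3] for j2 in range(n2)]) for k3 in range(n3)]
         for j1 in range(n1)]
    for j1 in range(n1):
        for k3 in range(n3):
            for k2 in range(n2):
                b[j1][k3][k2] *= omega(n, j1 * k3) * omega(n1 * n2, j1 * k2)
    # Pass 3: n1-point transforms over j1, stored at k = k3 + k2*n3 + k1*n2*n3.
    y = [0j] * n
    for k3 in range(n3):
        for k2 in range(n2):
            col = dft([b[j1][k3][k2] for j1 in range(n1)])
            for k1 in range(n1):
                y[k3 + k2 * n3 + k1 * n2 * n3] = col[k1]
    return y


x = [complex(i % 5, (3 * i) % 7) for i in range(24)]
error = max(abs(a - b) for a, b in zip(dft(x), dft_3d_split(x, 2, 3, 4)))
```

With three factors each short transform has length about the cube root of n, so it can stay in cache for much larger problems than the two-factor form allows.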

Parallel Block Nine-Step FFT

[Figure: data flow of the parallel block nine-step FFT. Surviving labels: Partial Transpose (three times) and All-to-all comm.]
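The all-to-all communication step can be pictured as a distributed matrix transpose. The sketch below simulates it inside a single Python process, with no MPI involved: `alltoall` mimics the data movement of a collective such as `MPI_Alltoall`, and `distributed_transpose` is a hypothetical helper written for this example. It assumes a square matrix whose rows are dealt out in equal contiguous blocks, one block per rank.

```python
def alltoall(send):
    """Simulated all-to-all: send[p][q] is the block that rank p sends to rank q."""
    nprocs = len(send)
    return [[send[src][dst] for src in range(nprocs)] for dst in range(nprocs)]


def distributed_transpose(local, nprocs):
    """Transpose an n x n matrix; local[p] holds the rows owned by rank p."""
    r = len(local[0])  # rows held by each rank
    # Each rank cuts its rows into nprocs column blocks, one per destination.
    send = [[[row[q * r:(q + 1) * r] for row in local[p]]
             for q in range(nprocs)] for p in range(nprocs)]
    recv = alltoall(send)
    # Each rank transposes the received blocks locally to form its new rows.
    return [[[recv[q][p][a][c] for p in range(nprocs) for a in range(r)]
             for c in range(r)] for q in range(nprocs)]


matrix = [[10 * i + j for j in range(4)] for i in range(4)]
local = [matrix[0:2], matrix[2:4]]          # two ranks, two rows each
result = distributed_transpose(local, nprocs=2)
```

Every rank exchanges a block with every other rank, so the cost of this step grows with the network rather than with the cache, which is why it is treated separately from the partial transposes.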

Operation Counts for an $n$-point FFT
  • Conventional FFT algorithms (e.g., Cooley-Tukey FFT, Stockham FFT)
    • Arithmetic operations:
    • Main memory accesses:
  • Block Nine-Step FFT
    • Arithmetic operations:
    • Main memory accesses (ideal case):

Performance Results
  • To evaluate the implemented parallel FFTs, we compared
    • The implemented parallel FFT, named FFTE (ver. 4.0, supports SSE3, using MPI)
    • FFTW (ver. 2.1.5, no SSE3 support, using MPI)
  • Target parallel machine:
    • A 32-node dual PC SMP cluster (Irwindale 3 GHz, 1 GB DDR2-400 SDRAM per node, Linux 2.4.17-1smp).
    • Interconnected through a Gigabit Ethernet switch.
    • LAM/MPI 7.1.1 was used as the communication library.
    • The compilers used were gcc 4.0.2 and g77 3.2.3.

Discussion
  • For N = 2^29 and P = 32, FFTE runs about 1.72 times faster than FFTW.
    • The performance of FFTE remains high even for the larger problem size, owing to cache blocking.
    • Since FFTW uses the conventional six-step FFT, each column FFT does not fit into the L1 data cache.
    • Moreover, FFTE exploits the SSE3 instructions.
  • These are the three reasons why FFTE is more advantageous than FFTW.
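The cache argument can be made concrete with a rough size estimate. The sketch below rests on assumptions that are not stated on the slide: double-precision complex data (16 bytes per element), a roughly square split of N into two factors, and a 16 KiB L1 data cache. Treat the numbers as an order-of-magnitude illustration only.

```python
COMPLEX_BYTES = 16        # one double-precision complex value
L1_BYTES = 16 * 1024      # ASSUMPTION: 16 KiB L1 data cache


def column_bytes(log2_n):
    """Bytes in one column FFT when n = 2**log2_n is split roughly as n1 * n2."""
    log2_n2 = (log2_n + 1) // 2   # the longer of the two factors
    return COMPLEX_BYTES * (1 << log2_n2)


size = column_bytes(29)           # N = 2^29, as in the comparison above
fits_in_l1 = size <= L1_BYTES
```

Under these assumptions one column occupies 512 KiB, far more than the assumed L1 capacity, which is consistent with the point that unblocked column FFTs fall out of the L1 data cache.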

Conclusion and Future Work
  • The block nine-step FFT algorithm is most advantageous with processors that have a considerable gap between the speed of the cache memory and that of the main memory.
  • Towards Petascale computing systems,
    • Exploiting the multi-level parallelism:
      • SIMD or Vector accelerator
      • Multi-core
      • Multi-socket
      • Multi-node
    • Reducing the number of main memory accesses.
    • Improving the all-to-all communication performance.
      • In the G-FFTE, the all-to-all communication occurs three times.
