
Scientific Computing Beyond CPUs: FPGA implementations of common scientific kernels




Melissa C. Smith¹, Jeffrey S. Vetter², Sadaf R. Alam²

Sreesa Akella³, Luis Cordova³

¹Engineering Science and Technology Division, ORNL

²Computer Science and Mathematics Division, ORNL

³University of South Carolina

September 2005

- Introduction & Motivation
- Candidate Kernels/Apps & Implementation
- Results
- Function Library
- Lessons Learned
- Conclusions


Image courtesy of SRC

Traditional Computing

- Hardware development struggling to keep pace with analysis needs
- Reaching limits on computing speed due to I/O bandwidth and clock wall
- Managing heat dissipation becoming increasingly difficult

Reconfigurable Computing (RC) with FPGAs

- Faster execution & lower power consumption, all at slower clock speeds
- Exploit the inherent parallelism in algorithms
- Match computation to the application's data flow (i.e., data-flow graph theory)
- Hardware-like speed with software-like flexibility that can adapt to the needs of the application
- Gate densities now suitable for 64-bit floating-point


- Many scientific applications at ORNL and elsewhere depend on double precision operations
- Kernel selection and classification
- compute intensive
- common among many relevant applications
- candidate for hardware implementation

- Interface to legacy code (FORTRAN & C) extremely important
- The memory bottleneck in conventional memory hierarchies throttles the performance of scientific applications
With this knowledge:

- Can users harness reconfigurable hardware without (a) becoming hardware experts and (b) completely re-writing their code?

- Can we develop function libraries such as BLAS, VSIPL, or others without loss of generality?


- Initial studies
- Kernels
- Dense matrix operations (e.g. DGEMM)
- Sparse matrix operations

- Climate
- PSTSWM

- Bioinformatics
- BLAST
- Fragment assembly

- Molecular dynamics
- AMBER
- LAMMPS


We cannot cover all application studies today.


- BLAS routines: SGEMM & DGEMM perform the matrix-matrix operation:

C = αAB + βC

where α and β are scalars, and A, B, and C are matrices (A is m × k, B is k × n, and C is m × n)

- What makes them difficult and interesting:
- Memory communication bottleneck (limited bandwidth)
- Local storage limitation (for both sequential & parallel machines)
Answer: Exploit Data Reusability and Data Flow with FPGAs
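
For context, here is a minimal blocked matrix-multiply sketch in C of the update these routines perform (row-major storage; the block size BS and the name dgemm_blocked are illustrative, not from the SRC design). Blocking is the same data-reuse idea: once a block is resident in fast memory (cache here, on-board banks on the FPGA), it is reused across many multiply-accumulates before new data must be fetched.

```c
#include <stddef.h>

#define BS 64  /* illustrative block size; the FPGA design tiles similarly */

/* Blocked C = alpha*A*B + beta*C for row-major A (m x k), B (k x n),
 * C (m x n). Each block of A is reused across a full block row of the
 * result before being evicted. */
void dgemm_blocked(size_t m, size_t n, size_t k, double alpha,
                   const double *A, const double *B, double beta, double *C)
{
    for (size_t i = 0; i < m * n; i++)
        C[i] *= beta;                          /* scale C once up front */

    for (size_t ii = 0; ii < m; ii += BS)
        for (size_t kk = 0; kk < k; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS)
                /* multiply one pair of blocks into the C block */
                for (size_t i = ii; i < ii + BS && i < m; i++)
                    for (size_t kx = kk; kx < kk + BS && kx < k; kx++) {
                        double a = alpha * A[i * k + kx];
                        for (size_t j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[kx * n + j];
                    }
}
```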


[Figure: 6 × 6 block decomposition of matrix A, blocks A00 through A55]

- Fully utilize both user FPGAs (XC2V6000) of the SRC MAPstation
- DGEMM: 12 MAC units per FPGA (SGEMM: 25 MAC units per FPGA)
- Geared to handle arbitrary-size matrices up to 1024 × 1024
- Matrix operations occur in blocks
- How to count FLOPS?
- The FPGA algorithm performs more FLOPS than an efficient SW implementation
- Takes advantage of the data-flow architecture
- Referred to later as "alternate FLOPS"



[Figure: Stage one — OBM banks A–F hold blocks A00,A01; A10,A11; B00,B10; B01,B11; C00,C01; C10,C11 and feed FPGA0 and FPGA1 at 800 MB/s per bank]

- Calculations are conducted in two stages
- Two FPGAs exchange ownership of the matrix B blocks


[Figure: Stage two — same OBM bank/FPGA layout, with ownership of the matrix B blocks exchanged between FPGA0 and FPGA1]

- In stage two, the two FPGAs have exchanged ownership of the matrix B blocks


Data transfer time in/out of the hardware is significant and takes away from "time to solution" – hence the interest in other memory systems, such as those used in systems by Cray and SGI

Faster and/or denser FPGAs can significantly improve performance and ‘time to solution’

Performance and ‘time to solution’ could potentially be improved with ‘DMA streaming’ of data
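
To make the "DMA streaming" idea concrete, below is a minimal double-buffering sketch in C. The calls dma_start_read, dma_wait, and compute_block are hypothetical placeholders, not a real SRC or Cray API; the point is the overlap pattern, in which block b+1 is in flight while block b is being computed.

```c
#include <stddef.h>

#define BLK_MAX 4096   /* illustrative block capacity, in doubles */

/* Hypothetical platform hooks -- stand-ins for vendor DMA/compute calls. */
void dma_start_read(double *dst, const double *src, size_t n);
void dma_wait(const double *dst);           /* block until dst has landed */
void compute_block(const double *blk, size_t n);

/* Stream nblocks blocks from host memory, overlapping each transfer
 * with computation on the previously delivered block. */
void stream_blocks(const double *host_src, size_t nblocks, size_t blk_elems)
{
    static double buf[2][BLK_MAX];          /* two on-board buffers */

    if (nblocks == 0)
        return;
    dma_start_read(buf[0], host_src, blk_elems);      /* prime block 0 */

    for (size_t b = 0; b < nblocks; b++) {
        dma_wait(buf[b % 2]);               /* block b is now resident */
        if (b + 1 < nblocks)                /* start fetching block b+1 ... */
            dma_start_read(buf[(b + 1) % 2],
                           host_src + (b + 1) * blk_elems, blk_elems);
        compute_block(buf[b % 2], blk_elems); /* ... while computing on b */
    }
}
```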


Our results using SRC Carte v1.8:

- Dual Xilinx XC2V6000
- 12 64-bit MACs @ 100 MHz (or 25 32-bit MACs)
- 3.5 GFLOPS (5.3 GFLOPS counting alternate FLOPS)

Dou et al.'s results using a hardware description language:

- Xilinx XC2VP125-7
- 39 64-bit MACs @ 200 MHz
- 15.6 GFLOPS

Image courtesy of SRC

GFLOPS/MAC ratios: MAPstation = 0.44, Dou et al. = 0.40

- Parts available on the Cray XD1
- Xilinx XC2VP50-7 × 6 nodes
- Up to 200 MHz
- Conservative estimate: 18 64-bit MACs → 7.2 GFLOPS per node (18 MACs × 2 FLOPs/cycle × 200 MHz)
- Full utilization of all 6 nodes: potentially 43.2 GFLOPS


- Goal: To assemble a library of user-friendly, familiar, and pertinent scientific functions
- Initial functions identified:
- BLAS Level 2 and Level 3 (e.g. DGEMM/SGEMM)
- Sparse Matrix Operations
- FFT and 3D-FFT
- Bioinformatics query functions

[Figure: library stack — Climate, MD, and Bioinformatics applications built on FFT, BLAS, iterative solvers/SpMatVec, and query functions]


[Figure: array of parallel MAC units, unit i multiplying NZ[i] by IV[CO[i]] to accumulate OV[i]]

NZ – non-zero element vector, CO – column indices vector, IV – input vector, OV – output vector

- Used in iterative solvers for linear systems
- Not efficient on general-purpose microprocessor systems
- High cache miss rate due to poor data locality
- Low utilization of the floating-point unit due to a high ratio of load/store to floating-point operations

- RC advantage
- Avoid cache misses with high on-chip and off-chip memory bandwidth
- Local distributed memory banks
- High density FPGAs
- High speed host to FPGA communication

Investigating multiple storage formats (CSR, ELLPACK, and CSRPERM)
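
For reference, here is the same computation in plain C over the standard compressed sparse row (CSR) format, keeping the NZ/CO/IV/OV names from the figure (the row-pointer array rp is the usual CSR addition). This is a sketch of the arithmetic the MAC array parallelizes, not the FPGA pipeline itself.

```c
#include <stddef.h>

/* ov = A * iv, with A stored in CSR form:
 *   nz - non-zero values (NZ), co - column index of each non-zero (CO),
 *   rp - row pointers: row i's non-zeros are nz[rp[i] .. rp[i+1]-1]. */
void spmv_csr(size_t nrows, const double *nz, const size_t *co,
              const size_t *rp, const double *iv, double *ov)
{
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        /* the irregular gather iv[co[j]] is what causes the cache misses
         * noted above; the FPGA serves it from local memory banks */
        for (size_t j = rp[i]; j < rp[i + 1]; j++)
            sum += nz[j] * iv[co[j]];
        ov[i] = sum;
    }
}
```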


- Identified regions of the Amber8 application using detailed profiling and modeling of the code
- ew_direct.f / veclib.f: examining a strategy for mapping this routine onto SRC's two FPGAs
- ew_recip.f, ew_fft.f, pub_fft.f & passb2.f: also investigating acceleration of the FFTs using FPGAs
- O(N²) work split into smaller problems
- 3D FFT time worsens on parallel systems due to communication costs

[Figure: Amber8 call graph with call counts and profiled time shares (3.39%, 73.14%, 11.22%) — main → sander → runmd → force (×1000) → {fastwt_mp_quick3, shake, ewald_force (×1000, ew_force.f)}; ewald_force → {get_nb_energy (×1000) → short_ene (×23,558,000, ew_direct.f) → vdinvsqrt (×47,116,000, vec_lib.f); do_pmesh_kspace (ew_recip.f) → fft_setup / fft_backrc / fft_forwardrc / grad_sumrc → fft3d0rc → fft3dzxyrc → fft2drc (ew_fft.f) → cfftb1 / cfftf1 / cffti → passb4 / passf4 / passb2 (pub_fft.f, passb2.f); nb_adjust, adjust_imagcrds (ew_box.f)}]


[Figure: fftw_orchestrator structure of fft_3d() for an Nfast (1) × Nmid (2) × Nslow (3) grid, sign 1/−1 = forward/inverse. Each of the three passes performs remap_3d(data, copy, scratch, pre_plan) followed by fftw(plan, total_i/length_i, data, 1/−1, length_i, NULL, 0, 0), with "fly" units between input (I) and output (O); in the example, total_i/length_i = 3 transforms per pass]

Depending on the data size, the FPGA implementation of the fftw will resemble its software counterpart, with improved performance and data reuse.

The "fly" elements shown stand for FFT computation units of radix 2, 3, 4, and 5, each with a certain level of parallelism.
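
A compact C sketch of the three-pass structure described above; fft_1d and remap are hypothetical stand-ins for the fftw and remap_3d calls (declared but not implemented here), and the axis-rotation convention is one plausible choice, not the actual orchestrator.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical 1D FFT: 'count' contiguous transforms of 'length' points
 * each; sign = 1 forward, -1 inverse (as labeled in the diagram). */
void fft_1d(double complex *data, size_t count, size_t length, int sign);

/* Hypothetical remap: rotate the axes so the next transform dimension
 * becomes contiguous in memory (the role remap_3d plays above). */
void remap(double complex *data, size_t n1, size_t n2, size_t n3);

/* 3D FFT of an nfast x nmid x nslow grid as three 1D passes, each of
 * total/length transforms, mirroring the fftw_orchestrator structure. */
void fft_3d(double complex *data, size_t nfast, size_t nmid, size_t nslow,
            int sign)
{
    size_t total = nfast * nmid * nslow;

    fft_1d(data, total / nfast, nfast, sign);   /* pass 1: fast axis */
    remap(data, nfast, nmid, nslow);
    fft_1d(data, total / nmid, nmid, sign);     /* pass 2: mid axis */
    remap(data, nmid, nslow, nfast);
    fft_1d(data, total / nslow, nslow, sign);   /* pass 3: slow axis */
    remap(data, nslow, nfast, nmid);            /* restore original order */
}
```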

[Figure: single/multi-MAP mapping — OBM feeds banks of fftw units through BRAM planes on either side, coordinated via GCM]

Remap stages are replaced by intelligent data access and addressing.

The data will not necessarily fit on-chip, and there is a penalty for going off-chip.


- BLAST: Basic Local Alignment Search Tool
- Profiling of the NCBI source code to determine time-consuming functions that could be targeted to the FPGA is complete
- Currently investigating best problem structure and domain for given RC architecture and bandwidths (analysis of data streams, memory capacity, etc.)


- Effective use of an HLL (such as the Carte tool used here) to design for FPGAs still requires some hardware knowledge
- Memory limitations
- FPGA limitations
- ‘Tricks’ to take advantage of FPGA strengths
- ‘Tricks’ to take advantage of RC architecture

- Library development requires analysis to determine functions appropriate for FPGA implementation
- Breakout level of library functions may not always be appropriate for RC implementation – still under investigation
- Combine or fuse appropriate function calls to form larger functions with more computational weight (see the sketch below)
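
As a toy illustration of such fusion (the z = A·x + b combination and all names are illustrative, not taken from the library): rather than calling a matrix-vector product and then a vector add — two kernel invocations and two round trips of data to the FPGA — a fused kernel performs both in a single pass.

```c
#include <stddef.h>

/* Fused z = A*x + b for a row-major m x n matrix A. One invocation,
 * one pass over A, and one data movement, instead of separate
 * matvec and vector-add calls each paying their own transfer cost. */
void matvec_add_fused(size_t m, size_t n, const double *A,
                      const double *x, const double *b, double *z)
{
    for (size_t i = 0; i < m; i++) {
        double sum = b[i];                 /* fold in the additive term */
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];    /* row i of A dotted with x */
        z[i] = sum;
    }
}
```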


- Consider these caveats
- FPGA growth rates are exceeding those of general-purpose microprocessors
- These FPGA implementations demonstrate competitive performance with additional power and space savings vs. general-purpose processor implementations

- Restricted our evaluation to compiler-transformed high-level languages
- No manual VHDL coding
- Performance comparable with VHDL techniques (adjusting for FPGA size & clock frequency)

- New higher bandwidth RC architectures promise to dramatically reduce data transfer costs
- Efforts in 64b floating-point computation just beginning
- Cores not widely available

- No common tools exist that identify candidate codes or regions in the application for acceleration
- Must manually profile and model large, complex applications


We expect the performance advantages and applicability of these systems to only improve over the coming years.


- Ability to code in C or FORTRAN is a significant benefit for our users
- Progress on several application areas
- Initial studies completed with competitive performance
- Kernels (dense & sparse matrix), climate

- Actively studying other fruitful areas
- Molecular dynamics, Bioinformatics

- Future work will focus on
- Maximum utilization of FPGA resources
- Additional function/kernel library development
- Resource management for multi-paradigm platforms
- Evaluations of other RC platforms (Cray XD1 and SGI)


End