Performance and Overhead in a Hybrid Reconfigurable Computer

O. D. Fidanci (1), D. Poznanovic (2), K. Gaj (3), T. El-Ghazawi (1), N. Alexandridis (1)
(1) George Washington University, (2) SRC Computers Inc., (3) George Mason University
http://cpe02.gmu.edu/rcm/

Features of General-Purpose Reconfigurable Computers
• composed of traditional microprocessors and Field Programmable Gate Arrays (FPGAs) closely integrated with each other
• programming does not require knowledge of hardware design
• permit run-time reconfiguration of FPGAs

Hardware Architecture and Programming Model of SRC-6E

SRC Hardware Architecture
[Figure: block diagram. Labels: 2 Intel® microprocessors (two instances), SNAP (two instances), 800 MB/s (two links), MAP processor (two instances), chain ports, MAP module]

SRC Hardware Architecture – cont.

SRC Programming Model
Program in C or Fortran
[Figure: a main program with Function_1(a, d, e) and Function_2(d, e, f). Macro calls shown with Function_1: Macro_1(a, b, c), Macro_2(b, d), Macro_2(c, e). Macro calls shown with Function_2: Macro_3(s, t), Macro_1(n, b), Macro_4(t, k). A second panel, "FPGA contents after the Function_1 call", shows Macro_1 and two Macro_2 instances connected through a, b, c, d, e]

Compilation Process of SRC-6E
[Figure: compilation flow. Labels: Application sources (.c or .f files); Macro sources (.vhd or .v files); µP Compiler (Intel); MAP Compiler; HDL sources (.v files); Logic synthesis (Synplicity); Netlists (.ngo files); Place & Route (Xilinx); Configuration bitstreams (.bin files); Object files (.o files); Linker; Application executable]

Application Case Study 1: High-throughput Triple DES encryption

High-throughput encryption
[Figure: a stream of message blocks …, Mi+2, Mi+1, Mi enters the 3 DES unit with key K0 and leaves as ciphertext blocks …, Ci+2, Ci+1, Ci]

Fully pipelined architecture of Triple DES
[Figure: three DES macros in series, covering pipeline stages 1 to 17, 18 to 34, and 35 to 51]
• 51 pipeline stages
• New input & new output every clock cycle
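The two bullets above imply a simple cycle count for a fully pipelined design. The sketch below is illustrative only: the function names are hypothetical, the 64-bit block size is the standard DES block size, and the 100 MHz clock used in the example is an assumed value, not a figure from the slides.

```python
PIPELINE_STAGES = 51  # number of pipeline stages stated on the slide
DES_BLOCK_BITS = 64   # DES and Triple DES operate on 64-bit blocks


def cycles_to_encrypt(num_blocks: int, stages: int = PIPELINE_STAGES) -> int:
    """Clock cycles until the last ciphertext block leaves the pipeline."""
    if num_blocks <= 0:
        return 0
    # The first block needs `stages` cycles to traverse the pipeline;
    # every following block completes one cycle after its predecessor.
    return stages + (num_blocks - 1)


def steady_state_throughput_bps(clock_hz: float,
                                block_bits: int = DES_BLOCK_BITS) -> float:
    """Bits per second once the pipeline is full (one block per cycle)."""
    return clock_hz * block_bits


if __name__ == "__main__":
    assumed_clock_hz = 100e6  # assumption for illustration only
    print(cycles_to_encrypt(1024))
    print(steady_state_throughput_bps(assumed_clock_hz))
```

For large inputs the 51-cycle fill latency becomes negligible, so the cycle count approaches the number of blocks.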

Overhead of the data transfer
[Figure: SRC-6E block diagram. Labels: mP Board, Xeon mP, L2, MIOC, SNAP, PCI Slot, MAP Board, Control Chip, On-Board Memory (24 MB), User Chip, Private Memory, (6x)]

Timing Measurements
A three-level timing measurement scheme has been employed:
• end-to-end execution time (wall clock time, HLL level): includes the configuration, data transfer, and data processing times
• w/o configuration time (wall clock time, HLL level): excludes the configuration time but includes the data transfer and data processing times
• MAP time (clock counter, hardware level): includes only the data processing time
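Because each measurement level excludes one more component than the level above it, the individual components can be recovered by subtraction. The following is a minimal sketch of that bookkeeping. The names are hypothetical and the numbers in the example are made up for illustration; they are not measurements from the slides.

```python
from dataclasses import dataclass


@dataclass
class TimingBreakdown:
    configuration_ms: float
    data_transfer_ms: float
    computation_ms: float


def breakdown(end_to_end_ms: float,
              without_config_ms: float,
              map_time_ms: float) -> TimingBreakdown:
    """Split the three measured times into their three components."""
    if not (end_to_end_ms >= without_config_ms >= map_time_ms >= 0):
        raise ValueError("expected end-to-end >= w/o configuration >= MAP time >= 0")
    return TimingBreakdown(
        # end-to-end includes configuration; the second level does not
        configuration_ms=end_to_end_ms - without_config_ms,
        # the second level includes data transfer; MAP time does not
        data_transfer_ms=without_config_ms - map_time_ms,
        # MAP time covers data processing only
        computation_ms=map_time_ms,
    )


if __name__ == "__main__":
    # Made-up example values in milliseconds.
    print(breakdown(end_to_end_ms=150.0, without_config_ms=50.0, map_time_ms=10.0))
```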

Triple DES Encryption
[Figure: bar chart of execution time [ms] (axis 0 to 160) versus number of encrypted blocks (1024; 10,000; 25,000; 50,000; 100,000; 250,000; 500,000), split into configuration, data transfer, and computation]

Problems
• execution time dominated by
  - configuration of the MAP FPGA and
  - data transfer between the System Common Memory and On-Board Memory
• configuration time hiding techniques
  - preloading the configuration before execution
  - flip-flopping FPGAs during reconfiguration

Data transfer hiding techniques
• Data transfer can be hidden by overlapping DMA time with the data processing time
• Possible speed-up up to 33%
[Figure: timing diagram with Input DMA, Encryption, and Output DMA phases]
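The effect of overlapping can be estimated with a simple steady-state model. The sketch below assumes, purely for illustration, that input and output DMA share one channel and therefore run one after the other, while encryption of one chunk runs concurrently with the DMA of other chunks. The slides do not state how the 33% figure was derived; under this assumed model, three phases of equal length give a one-third reduction in time per chunk, which is one reading consistent with it.

```python
def sequential_time(t_in: float, t_comp: float, t_out: float) -> float:
    """Per-chunk time when input DMA, computation, and output DMA run in turn."""
    return t_in + t_comp + t_out


def overlapped_time(t_in: float, t_comp: float, t_out: float) -> float:
    """Per-chunk steady-state time when computation overlaps with DMA.

    Assumption: input and output DMA serialize on a shared channel, so the
    slower of (total DMA, computation) sets the pace.
    """
    return max(t_in + t_out, t_comp)


def time_saving(t_in: float, t_comp: float, t_out: float) -> float:
    """Fraction of the sequential per-chunk time removed by overlapping."""
    seq = sequential_time(t_in, t_comp, t_out)
    if seq <= 0:
        raise ValueError("phase times must sum to a positive value")
    return 1.0 - overlapped_time(t_in, t_comp, t_out) / seq


if __name__ == "__main__":
    # Equal phase lengths in arbitrary units, chosen only for illustration.
    print(f"{time_saving(1.0, 1.0, 1.0):.1%}")
```

The model ignores pipeline fill and drain, so it describes long streams rather than short transfers.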

Reference software implementations
Platform: Pentium 4, 1.8 GHz, 512 kB cache, 1 GB RAM
Software:
• Non-optimized: public domain code, C only, Intel C++ with -O3 optimization
• Optimized for encryption (but not for cipher breaking): Phil Karn’s DES code, C and assembly language with look-up table precomputations, GNU gcc v. 2.96 with -O4 optimization

Total execution time of Triple DES for Pentium 4 using optimized and non-optimized code
[Figure: chart with series labeled Optimized P4 code and Non-optimized P4 code, and an annotation reading 4]

Throughput results for SRC-6E and Pentium 4

SRC-6E vs. Pentium 4 speed-up

Application Case Study 2: DES cipher breaking

Secret-key breaking
[Figure: DES block with message M0, ciphertext C0, and candidate keys K1, K2, K3, …, KN generated by the DES breaker]
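The search pictured above can be sketched as a loop that encrypts M0 under each candidate key and compares the result with C0. The sketch below is a software illustration only: `toy_encrypt` is a hypothetical 16-bit stand-in for DES with no cryptographic value, used so the example runs in well under a second, and `break_key` is an illustrative name rather than anything from the SRC implementation.

```python
from typing import Callable, Iterable, Optional


def toy_encrypt(block: int, key: int) -> int:
    """Hypothetical 16-bit stand-in for DES. NOT a real cipher."""
    x = (block ^ key) & 0xFFFF
    x = ((x << 5) | (x >> 11)) & 0xFFFF  # rotate left by 5 bits
    return (x + key) & 0xFFFF


def break_key(m0: int,
              c0: int,
              candidate_keys: Iterable[int],
              encrypt: Callable[[int, int], int] = toy_encrypt) -> Optional[int]:
    """Return the first candidate key that maps m0 to c0, or None."""
    for key in candidate_keys:
        if encrypt(m0, key) == c0:
            return key
    return None


if __name__ == "__main__":
    m0 = 0xBEEF
    secret_key = 0x1234
    c0 = toy_encrypt(m0, secret_key)
    found = break_key(m0, c0, range(1 << 16))
    print(found is not None and toy_encrypt(m0, found) == c0)
```

With this toy cipher more than one key may map M0 to C0, so the returned key is only guaranteed to be consistent with the given pair. Only the match result leaves the loop, which fits the "minimal input/output" character of this case study.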

Keys generated in the User FPGA
[Figure: SRC-6E block diagram. Labels: mP Board, Xeon mP, L2, MIOC, SNAP, PCI Slot, MAP Board, Control Chip, On-Board Memory (24 MB), User Chip, Private Memory, (6x)]

DES breaking machine
[Figure: bar chart of execution time [ms] (axis 0 to 1,200) versus number of tested keys (128,000; 1,000,000; 100,000,000), split into configuration, data transfer, and computation]

SRC-6E vs. Pentium 4 speed-up

Conclusions
Two different classes of applications were developed and tested on the SRC-6E and a Pentium 4 PC:
- Triple DES encryption: real-time data streaming
- DES breaking: minimal input/output

Conclusions – cont.
Wall-clock speed-ups (3 DES Encryption, DES Breaking):
• vs. P4 C code (larger for real-time input sizes)
• 3.4 vs. P4 C code
• 12.5 vs. P4 assembly code
Speed-ups without reconfiguration (3 DES Encryption, DES Breaking):
• 11 vs. P4 C code
• 41 vs. P4 assembly code
• 1583 vs. P4 C code

Informal speed/cost comparison
Cost of the SRC machine / Cost of PC ≈ 100
Speed of the SRC machine / Speed of PC ≈ 1600*
* with only one out of four FPGAs used in computations
16x improved speed/cost ratio
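The 16x figure follows from dividing the speed ratio by the cost ratio. A minimal sketch of that arithmetic, using the two rounded ratios from the slide and a hypothetical function name:

```python
def speed_cost_improvement(speed_ratio: float, cost_ratio: float) -> float:
    """Improvement in speed per unit cost relative to the reference machine."""
    if cost_ratio <= 0:
        raise ValueError("cost ratio must be positive")
    return speed_ratio / cost_ratio


if __name__ == "__main__":
    # Rounded ratios from the slide: speed about 1600, cost about 100.
    print(speed_cost_improvement(1600, 100))  # 16.0
```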

Conclusions: Overheads
Reconfiguration time
Most affected applications: short execution time, large resource requirements, frequent reconfiguration
Minimization techniques:
• preloading configuration
• flip-flopping among multiple FPGAs
Data transfer time
Most affected applications: high speed real-time input/output
Minimization techniques:
• overlapping data transfer with computations
