
Performance and Overhead in a Hybrid Reconfigurable Computer

Performance and Overhead in a Hybrid Reconfigurable Computer. O. D. Fidanci 1 , D. Poznanovic 2 , K. Gaj 3 , T. El-Ghazawi 1 , N. Alexandridis 1 1 George Washington University, 2 SRC Computers Inc., 3 George Mason University. http://cpe02.gmu.edu/rcm/.


Presentation Transcript


  1. Performance and Overhead in a Hybrid Reconfigurable Computer. O. D. Fidanci (1), D. Poznanovic (2), K. Gaj (3), T. El-Ghazawi (1), N. Alexandridis (1). (1) George Washington University, (2) SRC Computers Inc., (3) George Mason University. http://cpe02.gmu.edu/rcm/

  2. Features of General-Purpose Reconfigurable Computers • composed of traditional microprocessors and Field Programmable Gate Arrays (FPGAs) closely integrated with each other • programming does not require knowledge of hardware design • permit run-time reconfiguration of FPGAs

  3. Hardware Architecture and Programming Model of SRC-6E

  4. SRC Hardware Architecture [Figure: block diagram. Labels: 2 Intel® microprocessors (twice), SNAP (twice), 800 MB/s links (twice), MAP processor (twice), chain ports, MAP module.]

  5. SRC Hardware Architecture – cont.

  6. SRC Programming Model [Figure: a main program written in C or Fortran calls Function_1(a, d, e) and Function_2(d, e, f). Function_1 is composed of Macro_1(a, b, c), Macro_2(b, d), and Macro_2(c, e); Function_2 is composed of Macro_3(s, t), Macro_1(n, b), and Macro_4(t, k). A second panel, "FPGA contents after the Function_1 call", shows Macro_1 taking a and passing b and c to two Macro_2 instances, which produce d and e.]

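  7. Compilation Process of SRC-6E [Figure: flow diagram. Application sources (.c or .f files) go to the Intel μP Compiler, which produces object files (.o files), and to the MAP Compiler, which produces HDL sources (.v files) and object files (.o files). Macro sources (.vhd or .v files) and the HDL sources go through Synplicity logic synthesis to netlists (.ngo files), then through Xilinx Place & Route to configuration bitstreams (.bin files). The Linker combines these into the application executable.]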

  8. Application Case Study 1 High-throughput Triple DES encryption

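  9. High-throughput encryption [Figure: a stream of message blocks M_i, M_i+1, M_i+2, ... enters the 3DES unit, keyed with K0, and leaves as a stream of ciphertext blocks C_i, C_i+1, C_i+2, ...]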

  10. Fully pipelined architecture of Triple DES [Figure: three DES macros in series, covering pipeline stages 1 to 17, 18 to 34, and 35 to 51.] • 51 pipeline stages • New input & new output every clock cycle
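The two properties on this slide (51 stages, one new input and output per clock cycle) fix the cycle count for any number of blocks. The sketch below works that out. The function names are hypothetical, and the clock frequency is left as a parameter because the slides do not state it; the 64-bit block size is that of DES.

```python
def pipeline_cycles(num_blocks: int, stages: int = 51) -> int:
    """Clock cycles needed to push num_blocks through a full pipeline.

    The first result appears after `stages` cycles; every further
    block completes exactly one cycle later.
    """
    if num_blocks <= 0:
        return 0
    return stages + num_blocks - 1


def throughput_bits_per_second(clock_hz: float, block_bits: int = 64) -> float:
    """Steady-state throughput when one block completes per clock cycle.

    clock_hz is an assumption supplied by the caller; the slides do not
    give the clock frequency of the design.
    """
    return clock_hz * block_bits


if __name__ == "__main__":
    # 1024 blocks: 51 cycles of fill latency plus 1023 further cycles.
    print(pipeline_cycles(1024))
    # Illustrative clock of 100 MHz (an assumed value, not from the slides).
    print(throughput_bits_per_second(100e6))
```

For large inputs the 51-cycle fill latency becomes negligible, so the computation time grows essentially linearly with the number of blocks.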

  11. Overhead of the data transfer [Figure: SRC-6E block diagram. Labels: μP Board (twice), Xeon μP (four times), L2 (four times), MIOC (twice), PCI Slot (twice), SNAP (twice), MAP Board, Control Chip (twice), On-Board Memory (24 MB) (twice), User Chip (four times), Private Memory (twice), "(6x)" multiplicity markers.]

  12. Timing Measurements A three-level timing measurement scheme has been employed: • end-to-end execution time (wall clock time, HLL level): includes the configuration, data transfer, and data processing times • w/o configuration time (wall clock time, HLL level): excludes the configuration time but includes the data transfer and data processing times • MAP time (clock counter, hardware level): includes only the data processing time
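Because each of the three measurements includes one component fewer than the previous one, the individual components follow by subtraction. A minimal sketch of that bookkeeping is shown below; the class and function names are hypothetical, and the numbers in the usage example are illustrative only, not measurements from the presentation.

```python
from dataclasses import dataclass


@dataclass
class TimingBreakdown:
    configuration_ms: float
    data_transfer_ms: float
    computation_ms: float


def decompose(end_to_end_ms: float,
              without_config_ms: float,
              map_time_ms: float) -> TimingBreakdown:
    """Split the three measured times into their components.

    end_to_end     = configuration + data transfer + computation
    without_config =                 data transfer + computation
    map_time       =                                 computation
    """
    if not (end_to_end_ms >= without_config_ms >= map_time_ms >= 0):
        raise ValueError("expected end_to_end >= without_config >= map_time >= 0")
    return TimingBreakdown(
        configuration_ms=end_to_end_ms - without_config_ms,
        data_transfer_ms=without_config_ms - map_time_ms,
        computation_ms=map_time_ms,
    )


if __name__ == "__main__":
    # Illustrative values only (not taken from the slides).
    print(decompose(end_to_end_ms=150.0, without_config_ms=50.0, map_time_ms=10.0))
```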

  13. Triple DES Encryption [Figure: stacked bar chart of execution time [ms] (axis from 0 to 160), split into configuration, data transfer, and computation, versus the number of encrypted blocks (1024; 10,000; 25,000; 50,000; 100,000; 250,000; 500,000).]

  14. Problems • execution time dominated by configuration of the MAP FPGA and by data transfer between the System Common Memory and the On-Board Memory • configuration time hiding techniques: preloading the configuration before execution; flip-flopping FPGAs during reconfiguration

  15. Data transfer hiding techniques • Data transfer can be hidden by overlapping DMA time with the data processing time • Possible speed-up up to 33% [Figure: timeline in which the input DMA, encryption, and output DMA phases of successive data chunks overlap.]
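The effect of overlapping can be estimated with a simple timing model. The sketch below is one possible model, not the authors' derivation: it assumes that input and output DMA share a single channel and so cannot overlap each other, while either can run concurrently with encryption. Under that assumption, three phases of equal duration shrink from three time units to two, a saving of one third, which is consistent with the figure on the slide; the slides do not say how the 33% was obtained. All function names are hypothetical.

```python
def sequential_time(t_in: float, t_compute: float, t_out: float) -> float:
    """Total time when input DMA, computation, and output DMA run back to back."""
    return t_in + t_compute + t_out


def overlapped_time(t_in: float, t_compute: float, t_out: float) -> float:
    """Ideal lower bound with overlap.

    Assumption: the two DMA directions are serialized on one channel,
    but DMA as a whole runs concurrently with computation.
    """
    return max(t_in + t_out, t_compute)


def time_saving(t_in: float, t_compute: float, t_out: float) -> float:
    """Fraction of the sequential time removed by overlapping."""
    seq = sequential_time(t_in, t_compute, t_out)
    if seq <= 0:
        raise ValueError("phase durations must sum to a positive time")
    return (seq - overlapped_time(t_in, t_compute, t_out)) / seq


if __name__ == "__main__":
    # Equal phase durations (illustrative): saving of one third.
    print(time_saving(1.0, 1.0, 1.0))
```

The model is a bound: a real implementation pays a start-up and drain cost for the first and last chunk, so the achieved saving would be somewhat smaller.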

  16. Reference software implementations Platform: Pentium 4, 1.8 GHz, 512 kB cache, 1 GB RAM. Software: • Non-optimized: public domain code, C only, Intel C++ with -O3 optimization • Optimized for encryption (but not for cipher breaking): Phil Karn's DES code, C and assembly language with look-up table precomputations, GNU gcc v. 2.96 with -O4 optimization

  17. Total execution time of Triple DES for Pentium 4 using optimized and non-optimized code [Figure: two curves, "Optimized P4 code" and "Non-optimized P4 code", annotated "≈ 4".]

  18. Throughput results for SRC-6E and Pentium 4

  19. SRC-6E vs. Pentium 4 speed-up

  20. Application Case Study 2 DES cipher breaking

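  21. Secret-key breaking [Figure: candidate keys K1, K2, K3, …, KN, generated by the DES breaker, are fed to a DES unit together with the message block M0 and the ciphertext block C0.]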
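The search that this slide depicts can be sketched as a loop that generates candidate keys, encrypts the known block M0 under each, and compares the result with C0. The code below is a minimal illustration under stated assumptions: `toy_encrypt` is a hypothetical 16-bit stand-in and is not DES, and `break_key` is a hypothetical name. Real DES uses 56-bit keys, so the same loop would cover 2^56 candidates.

```python
from typing import Callable, Optional


def toy_encrypt(key: int, block: int) -> int:
    """Hypothetical 16-bit stand-in for DES (NOT a real cipher).

    XOR with the key, rotate left by 3, XOR with a constant. For a fixed
    block, distinct keys give distinct outputs, so the search below has
    exactly one match.
    """
    x = (block ^ key) & 0xFFFF
    x = ((x << 3) | (x >> 13)) & 0xFFFF
    return x ^ 0x5A5A


def break_key(encrypt: Callable[[int, int], int],
              m0: int, c0: int,
              first_key: int, num_keys: int) -> Optional[int]:
    """Test num_keys consecutive keys starting at first_key.

    Keys are generated locally by counting, so the only inputs are the
    known pair (m0, c0) and the key range; the only output is the key.
    """
    for key in range(first_key, first_key + num_keys):
        if encrypt(key, m0) == c0:
            return key
    return None


if __name__ == "__main__":
    secret = 0x1234
    m0 = 0xBEEF
    c0 = toy_encrypt(secret, m0)
    print(hex(break_key(toy_encrypt, m0, c0, 0, 1 << 16)))
```

Because candidates are produced by a counter rather than read from memory, the amount of data moved in and out does not grow with the number of keys tested.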

  22. Keys generated in the User FPGA [Figure: SRC-6E block diagram. Labels: μP Board (twice), Xeon μP (four times), L2 (four times), MIOC (twice), PCI Slot (twice), SNAP (twice), MAP Board, Control Chip (twice), On-Board Memory (24 MB) (twice), User Chip (four times), Private Memory (twice), "(6x)" multiplicity markers.]

  23. DES breaking machine [Figure: stacked bar chart of execution time [ms] (axis from 0 to 1,200), split into configuration, data transfer, and computation, versus the number of tested keys (128,000; 1,000,000; 100,000,000).]

  24. SRC-6E vs. Pentium 4 speed-up

  25. Conclusions Two different classes of applications developed and tested for SRC-6E and Pentium 4 PC - Triple DES encryption: real-time data streaming - DES breaking: minimal input/output

  26. Conclusions – cont. Wall-clock speed-ups • 3DES encryption: 3.4 vs. P4 C code; 12.5 vs. P4 assembly code • DES breaking: … vs. P4 C code (larger for real-time input sizes). Speed-ups without reconfiguration • 3DES encryption: 11 vs. P4 C code; 41 vs. P4 assembly code • DES breaking: 1583 vs. P4 C code

  27. Informal speed/cost comparison • Cost of the SRC machine / Cost of PC ≈ 100 • Speed of the SRC machine / Speed of PC ≈ 1600* • 16× improved speed/cost ratio (* with only one out of four FPGAs used in computations)

  28. Conclusions: Overheads • Reconfiguration time. Most affected applications: short execution time, large resource requirements, frequent reconfiguration. Minimization techniques: preloading the configuration; flip-flopping among multiple FPGAs • Data transfer time. Most affected applications: high-speed real-time input/output. Minimization techniques: overlapping data transfer with computations
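Why short-running applications suffer most from reconfiguration can be seen from a small model in which the configuration time is a fixed cost and every other term scales with the workload. The sketch below assumes exactly that. The function name and all parameter values are hypothetical and chosen only for illustration; none of them are measurements from the presentation.

```python
def wall_clock_speedup(n_items: int,
                       sw_ms_per_item: float,
                       hw_ms_per_item: float,
                       config_ms: float,
                       transfer_ms_per_item: float = 0.0) -> float:
    """Software time divided by reconfigurable-hardware wall-clock time.

    Assumed model: configuration is a one-off fixed cost, while
    computation and data transfer grow linearly with n_items.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    software_ms = n_items * sw_ms_per_item
    hardware_ms = config_ms + n_items * (hw_ms_per_item + transfer_ms_per_item)
    if hardware_ms <= 0:
        raise ValueError("hardware time must be positive")
    return software_ms / hardware_ms


if __name__ == "__main__":
    # Illustrative parameters: hardware is 10x faster per item,
    # and configuration costs a fixed 100 ms.
    for n in (1_000, 100_000, 10_000_000):
        print(n, round(wall_clock_speedup(n, 0.01, 0.001, 100.0), 2))
```

With these illustrative parameters the small workload runs slower than software, while the large one approaches, but never reaches, the per-item ratio of 10. Preloading the configuration corresponds to removing `config_ms` from the measured time.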
