1 / 24

Reconfigurable Computing Aspects of the Cray XD1 Sandia National Laboratories / California

Reconfigurable Computing Aspects of the Cray XD1 Sandia National Laboratories / California. Craig Ulmer cdulmer@sandia.gov. Cray User Group (CUG 2005) May 16, 2005. Craig Ulmer, Ryan Hilles, and David Thompson SNL/CA Keith Underwood and K. Scott Hemmert SNL/NM.

cleo
Download Presentation

Reconfigurable Computing Aspects of the Cray XD1 Sandia National Laboratories / California

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Reconfigurable Computing Aspects of the Cray XD1Sandia National Laboratories / California Craig Ulmer cdulmer@sandia.gov Cray User Group (CUG 2005) May 16, 2005 Craig Ulmer, Ryan Hilles, and David Thompson SNL/CA Keith Underwood and K. Scott Hemmert SNL/NM Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

  2. MPI Bandwidth 1.6GB/s HT PCI-X 133 Bandwidth (Million Bytes/s) Message Size (Bytes) Overview • Cray XD1 • Dense multiprocessor • Commodity parts • Good HPC performance • FPGA Accelerators in XD1 • Close to host memory • Reconfigurable Computing • Our early experiences

  3. Outline • The Cray XD1 & Reconfigurable Computing • Four simple applications • DMA, MD5, Sort, and Floating Point • Observations • Conclusions

  4. Cray XD1 Architecture

  5. Cray XD1 • Dense multiprocessor system • 12 AMD Opterons on 6 blades • 6 Xilinx Virtex-II/Pro FPGAs • InfiniBand-like interconnect • 6 SATA hard drives • 4 PCI-X slots • 3U Rack

  6. 3.2 GB/s HT 1.6 GB/s HT FPGA SRAM NI SRAM SRAM 1+1 GB/s “4x IB” SRAM Expansion Module 1.6 GB/s “HT” XD1 Blade Architecture Main Memory Main Memory CPU CPU NI Rapid Array Fabric

  7. a[i+2] a[i] a[i+1] * + Z -1 + Reconfigurable Computing (RC) • Use reconfigurable hardware devices to implement key computations in hardware Software Hardware FPGA double doX( double *a, int n) { int i; double x; x=0; for(i=0;i<n;i+=3){ x+= a[i] * a[i+1] + a[i+2]; … } … return x; }

  8. Reconfigurable Computing and the XD1 • There have always been three challenges in RC: • System Integration • Capacity • Development Environment • XD1 is interesting because of (1) and possibly (2) • Describe our early experiences • Four simple applications to expose performance

  9. Application 1: Data Exchange

  10. Data Exchange • How fast can we move data into/out of FPGA? • Cray provides basic HyperTransport interface • Host/FPGA initiated transfers • HT-Compliance • 64B burst boundaries • 64b words • FPGA-Initiated DMA Engine • Read/Write large blocks of data • Based on memory interface HT DMA Op Host User’s Circuits FPGA FPGA-Initiated Transfers

  11. Data Transfer Performance

  12. FPGA Frequency Affects Performance

  13. Application 2: Data Hashing

  14. Data Hashing • Generate fingerprint for block of data • MD5 Message Digest function • 512b blocks of data • 64 computations per block • Computation: Boolean, add, and rotate • Two methods • Single-cycle computation (64) • Multi-cycle computation (320)

  15. MD5 Performance • MD5 is serial • Read/Modify fingerprint • Host does better • FPGA clock rates • Multi-cycle: 190 MHz • Single-cycle: 66 MHz • Multi-cycle has 5x stages but only 3x clock rate of single-cycle

  16. Application 3: Sorting

  17. > 9.4 17.5 31.3 81.3 31.3 9.4 81.3 17.5 Reg 11.5 32.9 88.7 94.1 88.7 94.1 32.9 11.5 34.5 39.2 54.2 79.4 54.2 39.2 34.5 79.4 3.0 14.8 19.7 91.4 91.4 19.7 3.0 14.8 Sorting Element 4.2 24.3 35.1 54.1 35.1 4.2 54.1 24.3 unsorted sorted DMA Engine Sorting Array Sorting Control Rd Req Address Wr Req Double-Buffered Outgoing Memory Sorting • Sort a matrix of 64b values • Hardware implementation • Sorting element • Array of elements • Data transfer engine • Exploits parallelism • Do n comparisons each clock • Requires 3n clock cycles • Limited to n values • Use as building block

  18. Sorting Performance • V2P50-7 FPGA • 128 elements • 110 MHz • Data transfer by FPGA • Read/write concurrency • Pinned memory • 2-4x Rate of Quicksort(128)

  19. Application 4: 64b Floating Point

  20. Floating Point Computations • Floating point has been weakness for FPGAs • Complex to implement, consume many gates • Keith Underwood and K. Scott Hemmert • Developed FP library at SNL/NM • FP Library • Highly-optimized and compact • Single/Double Precision • Add, Multiply, Divide, Square Root • V2P50: 10-20 DP cores, 130+ MHz

  21. Distance Computation • Compute Pythagorean Theorem: c=sqrt(a2 + b2) • Implemented as simple pipeline (88 stages) • Fetch two inputs, store one output each clock • Host pushes data in, FPGA pushes data out Incoming SRAM Outgoing SRAM DMA Engine From Host To Host Incoming SRAM Control

  22. Distance Performance • FPGA Design • 159 MHz clock rate • 39% of V2P50 • CPU and FPGA converge • 75 MiOperations/s • 1,200 MiB/s of input • For Square Root and Divide, FPGA can be competitive

  23. Observations • FPGAs have high-bandwidth access to host memory • FPGAs can get 1,300 MiB/s – 10x PCI • Still need planning for data transfers • Note: host’s HT links are 2x • Some applications work better than other • Need to exploit parallelism • V2P50 is approaching a capacity for interesting work • Small number of FP cores • Encourage use of larger FPGAs

  24. Conclusions • XD1 is an interesting platform for RC research • FPGAs require careful planning • High-bandwidth access to memory • Sign of interesting HT architectures • Acknowledgments • Steve Margerm of Cray Canada for XD1 Support • Keith Underwood and K. Scott Hemmert SNL/NM for FP library

More Related