Reconfigurable Computing: A First Look at the Cray-XD1

Craig Ulmer. Mitch Sukalski, David Thompson, Rob Armstrong, Curtis Janssen, and Matt Leininger. Orgs: 8961 & 8963. September 1, 2004.

Presentation Transcript


  1. Reconfigurable Computing: A First Look at the Cray-XD1. Craig Ulmer. Mitch Sukalski, David Thompson, Rob Armstrong, Curtis Janssen, and Matt Leininger. Orgs: 8961 & 8963. September 1, 2004

  2. Outline • Reconfigurable computing refresher • Progress update • Cray XD1 • Architecture • General message passing • Reconfigurable Computing and the XD1

  3. Reconfigurable Computing Update

  4. Reconfigurable Computing • Use reconfigurable hardware devices to implement key computations in hardware
     [Figure: datapath diagram. Labels: a[i], a[i+1], a[i+2], *, +, Z^-1, +]

     double doX(double *a, int n) {
       int i;
       double x;
       x = 0;
       for (i = 0; i < n; i += 3) {
         x += a[i] * a[i+1] + a[i+2];
         …
       }
       …
       return x;
     }
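The loop on slide 4 maps onto a multiplier, an adder, and an accumulator. A minimal software sketch of that datapath follows, with each loop iteration standing in for one pass through the circuit. The name `dox_model` and the bounds guard (which skips a trailing partial triple) are assumptions made for illustration; the parts of `doX` elided with "…" on the slide are not reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software model of the slide's multiply-add-accumulate
 * datapath. Not the authors' hardware design, only an illustration. */
double dox_model(const double *a, size_t n)
{
    assert(a != NULL || n == 0);

    double acc = 0.0;                      /* models the Z^-1 accumulator register */
    for (size_t i = 0; i + 2 < n; i += 3) {
        double product = a[i] * a[i + 1];  /* multiplier stage */
        double sum = product + a[i + 2];   /* first adder stage */
        acc = acc + sum;                   /* second adder feeding the register */
    }
    return acc;
}
```

For the input {1, 2, 3, 4, 5, 6} the model returns (1*2 + 3) + (4*5 + 6) = 31.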

  5. First Year Progress • Computation (Underwood SNL/NM) • Double-precision Floating Point Cores • Communication • Multi-gigabit Transceiver (MGT) interface • Gigabit Ethernet work • Early application experiments • Simplified isosurfacing • Networked pattern matching

  6. Peak Floating-Point Performance • [Figure] From Underwood’s “FPGAs vs. CPUs: Trends in Peak Floating-Point Performance,” in FPGA’04

  7. Connecting FPGAs to the Network Fabric • Modern FPGAs feature multi-gigabit transceivers • Experimented with GigE, Myrinet 2000, and IB • Implemented TCP Offload Engine (TOE) in hardware • Working on OpenTOE and OpenGigE cores
     [Figure: block diagram of the SNL_OpenTOE and SNL_OpenGigE cores. Labels: Socket I/F, TCP I/F, Outgoing Data Queue, Incoming Data Queue, IP Header, CRC, CRC Gen, MAC Framer, Pad, Tx, Rx, Align, Decode, Rocket I/O MGT, MGT Control, GT_Ethernet_2, Timeout Monitor, ACK Monitor, SEQ Gen, Ping, Ping Reply, ARP, ARP Cache, ARP Reply]

  8. Cray XD1 Overview

  9. NDA Notice • We do have an NDA with Cray Canada • The XD1 we have on loan is an early Beta system

  10. Cray XD1 Overview • Dense MP system • 12 AMD Opterons on 6 blades • 6 Xilinx Virtex-II/Pro FPGAs • InfiniBand-like interconnect • 6 SATA hard drives • 4 PCI-X slots • 3U Rack

  11. Individual Blade
     [Figure: blade block diagram. Components: two Opterons, each with DDR Memory; “Einstein” Chip; two RAP NIs; RapidArray Fabric (24 4x IB Ports). Link labels: HT: 6.4 GB/s; “HT”: 3.2 GB/s; HT: 3.2 GB/s; 4xIB: 2 GB/s]
     * All data rates are aggregates (i.e., 3.2 GB/s = 1.6 GB/s + 1.6 GB/s)

  12. Message Passing • MPICH 1.2.5 • Latency: 2.25 μs • Bandwidth: 1.3 GB/s (82% of HT-IB link) • RapidArray message layer • Open source • MP, RDMA • Global address space
     [Figure: MPI Bandwidth plot. x-axis: Message Size (Bytes); y-axis: Bandwidth (Million Bytes/s); labels: 1.6GB/s HT, PCI-X 133]
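To see how the latency and bandwidth figures on slide 12 interact, one common simplification is a linear cost model: transfer time = latency + bytes / peak bandwidth. The sketch below applies that model to the slide's numbers. The model itself and all function and macro names are assumptions for illustration, not something the slides state, and GB/s is read as 10^9 bytes/s to match the plot's "Million Bytes/s" axis.

```c
#include <assert.h>

/* Figures quoted on slide 12. */
#define XD1_MPI_LATENCY_S 2.25e-6  /* 2.25 microseconds */
#define XD1_MPI_PEAK_BW   1.3e9    /* 1.3 GB/s measured MPI bandwidth */
#define XD1_LINK_BW       1.6e9    /* 1.6 GB/s one-way HT link rate */

/* Assumed linear model: fixed start-up cost plus streaming time. */
double transfer_time_s(double bytes)
{
    assert(bytes >= 0.0);
    return XD1_MPI_LATENCY_S + bytes / XD1_MPI_PEAK_BW;
}

/* Bandwidth actually seen for a message of the given size. */
double effective_bw(double bytes)
{
    return bytes / transfer_time_s(bytes);
}

/* Message size at which the model reaches half of peak bandwidth. */
double half_bw_message_size(void)
{
    return XD1_MPI_LATENCY_S * XD1_MPI_PEAK_BW;
}

/* Fraction of the one-way link rate delivered at peak. */
double link_efficiency(void)
{
    return XD1_MPI_PEAK_BW / XD1_LINK_BW;
}
```

Under this model, messages need to be roughly 2.9 KB (2.25 μs × 1.3 GB/s = 2925 bytes) before they reach half of peak bandwidth. The efficiency comes out at 1.3 / 1.6 ≈ 81%, close to the 82% on the slide; the small gap is presumably rounding in the quoted 1.3 GB/s.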

  13. System Administration • Active manager • Synchronize each node’s OS • Partition blade functionality • Control access rights • Embedded processor • Monitors health (heartbeats) • Can restart nodes • Issues?

  14. Reconfigurable Computing and the Cray XD1

  15. Connecting to the “Einstein” Accelerator
     [Figure: FPGA block diagram. Labels: Host, HT, HT I/F, FPGA Port, User-defined Circuits, four QDR2 I/F blocks each with 2MB SRAM, Fabric Port, RAP NI, Net, IB, FPGA; two links labeled 1.6+1.6GB/s]

  16. Example: Random Number Generator • Monte Carlo app in need of good random numbers • Mersenne twister • Implemented in FPGA • FPGA pushes to host memory • 301 vs 101 Million Integers/s • ~1.2 GB/s
     [Figure: block diagram. Labels: Host Memory, CPU, RNG, NI, FPGA]
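For reference, a compact software sketch of the standard 32-bit Mersenne Twister (MT19937) is given below, with a helper that fills a buffer the way the FPGA core fills host memory. This is the textbook algorithm, not the authors' FPGA implementation, and the slides do not say which variant they used; the names `mt19937_t`, `mt_seed`, `mt_next`, `mt_fill`, and `stream_rate_bytes_per_s` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MT_N 624
#define MT_M 397

typedef struct {
    uint32_t state[MT_N];
    int index;
} mt19937_t;

/* Standard MT19937 initialization from a 32-bit seed. */
void mt_seed(mt19937_t *g, uint32_t seed)
{
    g->state[0] = seed;
    for (int i = 1; i < MT_N; i++) {
        uint32_t prev = g->state[i - 1];
        g->state[i] = 1812433253u * (prev ^ (prev >> 30)) + (uint32_t)i;
    }
    g->index = MT_N;  /* force a twist before the first output */
}

/* Regenerate all 624 state words. */
static void mt_twist(mt19937_t *g)
{
    for (int i = 0; i < MT_N; i++) {
        uint32_t y = (g->state[i] & 0x80000000u) |
                     (g->state[(i + 1) % MT_N] & 0x7fffffffu);
        uint32_t next = g->state[(i + MT_M) % MT_N] ^ (y >> 1);
        if (y & 1u)
            next ^= 0x9908b0dfu;
        g->state[i] = next;
    }
    g->index = 0;
}

/* Produce one tempered 32-bit output. */
uint32_t mt_next(mt19937_t *g)
{
    if (g->index >= MT_N)
        mt_twist(g);
    uint32_t y = g->state[g->index++];
    y ^= y >> 11;
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}

/* Stand-in for the FPGA pushing numbers into a host-memory buffer. */
void mt_fill(mt19937_t *g, uint32_t *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        buf[i] = mt_next(g);
}

/* Data rate implied by an integer rate, assuming fixed-width integers. */
double stream_rate_bytes_per_s(double ints_per_s, size_t bytes_per_int)
{
    return ints_per_s * (double)bytes_per_int;
}
```

Assuming 32-bit integers, 301 million integers/s works out to 301 × 10^6 × 4 bytes ≈ 1.2 GB/s, which matches the last bullet on the slide, and 301 / 101 is roughly a 3x difference between the two quoted rates.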

  17. General XD1 Comments (two columns: Good, Not-so-Good) • Reconfigurable computing • FPGA in memory • Fast local memory • Other accelerators • ClearSpeed • Global address space • Opteron limits (40b PA) • Vendor lock-in • Incompatible network • All-in-one box? • Current NI is a bottleneck • Density vs. Reliability • Value-added features

  18. Friendly Users? • We have a month left on evaluation • Could use feedback from other users • http://cdulmer.ran.sandia.gov/xd1 • cdulmer@sandia.gov
