
RC Device Characterizations & Tradeoff Analysis



  1. RC Device Characterizations & Tradeoff Analysis Jason Williams

  2. Introduction • Reconfigurable Computing (RC) is an emerging field that utilizes devices with a programmable fabric, allowing the hardware to be configured and adapted to solve changing problems • RC systems have typically been built using Field Programmable Gate Arrays (FPGAs), but other architectures, such as Field Programmable Object Arrays (FPOAs) and Field Programmable Compute Arrays (FPCAs, e.g. MONARCH), could also implement RC systems

  3. Subject & Purpose • Subject • To survey the landscape of various RC devices • Characterize these devices using various metrics (performance, price, power) • Create a comparison framework using the characterizations • Purpose • Will give the end user a quantitative framework to aid in the selection of an appropriate RC device to meet their application needs • Lays groundwork for understanding performance impacts of architectural components

  4. Problem Definition • RC devices differ from traditional microprocessors • Typically slower clock rates • Potential for massive parallelism • Different power consumption trends • Different on-die memory configurations • All of these differences make direct device comparisons difficult • Problems • RC devices can be vastly different from one another • Various architectural differences and very few standard/common parameters • Memory Example: Xilinx BRAM vs. Altera M-RAM/M4K/M512 vs. FPOA RF/IRAM vs. CPU cache

  5. Problem Background • Users have a variety of requirements/concerns – What key parameters do we need to compare? • Computational performance (integer/fixed point, floating point, fine grained/bit level) • On-chip memory performance (latency, bandwidth) • Off-chip communications and I/O • Power consumption • Price

  6. Scope Statement • Devices to be included in study • Xilinx Virtex 4 LX200, LX100, SX55 • Altera Stratix II S180 • Freescale PowerPC MPC7447 + AltiVec • MathStar Arrix FPOA (1 GHz) • Raytheon Monarch PCA • Sony/Toshiba/IBM Cell

  7. Methods • Literature review • Apply and extend characterizations and metrics to devices under study • Datasheet analysis • Experiments using vendor development tools/simulation environments • Example: Utilization and timing analysis results from post place and route for common ALU/FP structures • Combine characterization study results into a QFD style matrix

  8. FPGA Theoretical Floating Point Performance • Methodology • Adapted from Jeff Mason's (Xilinx) RSSI '07 presentation "FPGA HPC – The road beyond processors," with input from Dave Strenski (Cray). A similar methodology is reported in "An overview of FPGAs and FPGA programming; Initial experiences at Daresbury" by Richard Wain, Ian Bush, Martyn Guest, Miles Deegan, Igor Kozin, and Christine Kitchen (Distributed Computing Group at Daresbury Laboratory, November 2006). • Using datasheet information, the Altera and Xilinx Floating Point cores, ISE, and Quartus, estimate FP add and FP multiply performance.

  9. FPGA Floating Point Performance • Xilinx Example • Data from the Virtex 4 Family Overview (DS112) and Coregen Floating Point Operator v3.0 (DS335) • Assumptions: • 15% slice overhead (routing, I/O, etc.) • Use DSP resources first, then a logic-only implementation to fill the device. • Use the lower of the two clock speeds (DSP vs. logic-only) for all calculations. • Assume 2 storage elements (BRAM) per operation (operands, overwritten with the result). Limit the number of operations if there is not enough BRAM to support them. • Use speed optimized, highest effort for Synthesis, Map, and PAR.

  10. FPGA Floating Point Performance • Xilinx Example Continued (LX200 –10) • Double Precision Floating Point Multiply • 96 / 16 = 6 DSP Multipliers • 151449 – (774 * 6) = 146805 remaining LUTs for Logic-Only Multipliers • 146805 / 2457 = ~59 Logic-Only Multipliers • 65 total multipliers in 1 context @ 185 MHz = ~12 Gflop/s • Limit total number of multipliers to 85 due to BRAM limitation = ~11.1 Gflop/s • LX200 has 336 18Kb dual-port BRAMs. For 64-bit (DP), ((336 * 2) / 4) / 2 = 85 function units
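
The counting recipe above is easy to reproduce as a quick script. The following is a minimal Python sketch using only the figures quoted on this slide (96 DSP48s, 16 DSP48s and 774 LUTs per DSP-assisted multiplier, 151,449 usable LUTs after the 15% overhead, 2,457 LUTs per logic-only multiplier, 185 MHz); it reproduces the 65-multiplier, ~12 Gflop/s estimate but does not attempt the additional BRAM storage bookkeeping behind the ~11.1 Gflop/s figure.

```python
# Minimal sketch of the slide's counting recipe for one FP operator:
# fill the device with DSP-assisted units first, then logic-only units,
# and clock everything at the lower of the two achievable frequencies.

def peak_gflops(dsp_total, dsp_per_unit, luts_usable,
                luts_per_dsp_unit, luts_per_logic_unit, f_mhz):
    """Estimate unit count and peak GFLOP/s for one operator type."""
    dsp_units = dsp_total // dsp_per_unit
    luts_left = luts_usable - dsp_units * luts_per_dsp_unit
    logic_units = luts_left // luts_per_logic_unit
    units = dsp_units + logic_units
    return units, units * f_mhz / 1000.0

# Figures quoted on this slide for a DP multiplier on the LX200 (-10):
units, gflops = peak_gflops(dsp_total=96, dsp_per_unit=16,
                            luts_usable=151_449, luts_per_dsp_unit=774,
                            luts_per_logic_unit=2_457, f_mhz=185)
print(units, round(gflops, 1))  # 65 multipliers, ~12.0 GFLOP/s
```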

  11. Theoretical Floating Point Performance • Methodology • FPOA floating point performance is reported as 0. This device could have a floating point core designed for it, but its architecture (16 bit ALUs) would not implement FP efficiently. • PowerPC, AltiVec, MONARCH, and Cell floating point performance numbers are available/derivable from their respective datasheets

  12. Floating Point Performance Results

  13. Floating Point Performance Results

  14. Floating Point Performance Results • Chart: Theoretical Floating Point Performance (GFlops, BRAM Limitation) • Chart: Theoretical Floating Point Performance (GFlops, No BRAM Limitation)

  15. Floating Point Conclusions • For FPGAs, floating point performance is dependent on the FP core implementation, which drives both resource utilization and maximum achievable frequency. • For Xilinx devices, available on-chip memory also greatly impacts performance if we assume there has to be enough on-chip memory to buffer operands and results. The Stratix II S180 has more on-chip RAM (1.5x the V4LX200) and a more flexible memory hierarchy (a larger number of smaller blocks to support more individual registers and higher device memory bandwidth), so it does not have this issue. • Xilinx adder cores can use on-chip DSP resources; Altera adder cores do not. • MONARCH only supports single precision floating point. • Cell is the clear leader in theoretical floating point performance (using all processing elements).

  16. Theoretical Integer Performance • Utilize the same basic methodology as the Floating Point Performance Comparison • 15% slice overhead (routing, I/O, etc.). • Use DSP resources first, then a logic-only implementation to fill the device. • Use the lower of the two clock speeds (DSP vs. logic-only) for all calculations. • Use vendor software (Quartus, ISE) to find resource utilization for 1 functional unit, then calculate the number of parallel functional units that fit in 1 context using datasheet values. • Assume 2 storage elements (BRAM) per functional unit (operands, overwritten with the result). Limit the number of parallel functional units if there is not enough BRAM to support 2 storage elements per functional unit. • Use speed optimized, highest effort for Synthesis, Map, PAR. • Use standard integer widths (32 bit and 16 bit). • Analyze Addition and Multiplication operations separately.
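
As a companion to the floating-point sketch above, the BRAM constraint in this list reduces to a simple clamp. This is a hedged sketch: the two-buffers-per-unit assumption comes from the slide, while the 336-BRAM figure and the 200-unit input are illustrative values only.

```python
# Sketch of the BRAM cap described on this slide: each parallel functional
# unit is assumed to need two on-chip buffers (operands, overwritten with
# the result), so the unit count is clamped to what the BRAMs can feed.

def apply_bram_cap(candidate_units, brams_available,
                   buffers_per_unit=2, brams_per_buffer=1):
    max_units = brams_available // (buffers_per_unit * brams_per_buffer)
    return min(candidate_units, max_units)

# Illustrative use: 200 candidate adders on a device with 336 BRAMs.
print(apply_bram_cap(200, brams_available=336))  # -> 168
```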

  17. Theoretical Integer Performance • Methodology • FPOA 32 bit integer performance is reported as 0. This device could have a 32 bit ALU core designed for it, but it is natively a 16 bit device. • PowerPC, AltiVec, MONARCH, and Cell integer performance numbers are available/derivable from their respective datasheets

  18. Integer Performance Results

  19. Integer Performance Results

  20. Integer Performance Results • Chart: Theoretical Integer Performance (GOPs, BRAM Limitation) • Chart: Theoretical Integer Performance (GOPs, No BRAM Limitation)

  21. Integer Performance Conclusions • In some cases, the BRAM limitation is again an important performance limiter for Xilinx devices. The Stratix II S180 has more on-chip RAM (1.5x the V4LX200) and a more flexible memory hierarchy (a larger number of smaller blocks to support more individual registers and higher device memory bandwidth), so it does not have this issue. • Quartus II 6.0 typically reports a higher maximum achievable frequency in post place and route timing analysis than ISE 9.2. • Used speed grade –10 for Virtex 4 devices. • Used speed grade –3 for the Stratix II device. • 32 bit multiply example: Quartus reports 500 MHz for both DSP and Logic-Only implementations; ISE reports 421 MHz for DSP and 249 MHz for Logic-Only. • Xilinx adder cores can use on-chip DSP resources, which could improve add performance if there were enough memory support; Altera adder cores do not support DSP utilization and therefore suffer a performance hit compared to Xilinx devices. • Without the BRAM limitation, Xilinx devices show the highest performance for Integer Add operations. • With the BRAM limitation, the FPOA has the highest 16 bit integer performance. • Cell has the highest 32 bit integer performance (using all processing elements).

  22. Bit-level Computational Performance • Methodology • Based on DeHon's Computational Density calculations • Computational Density • Normalizes performance by die (or package) area and minimum feature size/process technology • Bit operations for FPGAs are the number of 4-input LUTs • Bit operations for GPPs and other "hybrid" devices are based on the number of cores, the number of issued instructions, and the width of the ALU/Functional Units
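
A small sketch of one way to compute such a density figure follows, assuming peak bit-operations per second are divided by die area expressed in units of lambda squared, with lambda taken as half the minimum feature size (the usual convention in DeHon's work); the device values in the example are placeholders, not data from this study.

```python
# Sketch of a DeHon-style computational density figure:
# peak bit-ops per second, normalized by die area in units of lambda^2,
# where lambda is taken as half the minimum feature size.

def computational_density(bit_ops_per_cycle, freq_mhz, die_area_mm2, feature_nm):
    lam_m = (feature_nm / 2.0) * 1e-9                  # lambda in meters
    area_lambda2 = (die_area_mm2 * 1e-6) / lam_m**2    # die area in lambda^2
    return bit_ops_per_cycle * freq_mhz * 1e6 / area_lambda2

# Placeholder FPGA-style inputs: bit ops per cycle = number of 4-input LUTs.
print(computational_density(bit_ops_per_cycle=178_176, freq_mhz=200,
                            die_area_mm2=500, feature_nm=90))
```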

  23. Bit-level Computation Performance • As expected, fine-grained FPGAs dominate performance in this metric

  24. External Memory Bandwidth Methodology • Methodology varies by platform due to available information and architecture differences. • In all cases, choose maximum throughput available based on vendor IP for memory controllers. • Saturated Case uses maximum amount of I/O for external memory interface, Balanced Case assumes a balance of I/O and memory interface. • Altera Stratix II • Influenced by speed grade, number of I/O • Used new high performance ALTMEMPHY core (vs. legacy memory interface core) • Support for 333 MHz DDR2 RAM • Number of controllers limited by the number of on-chip delay-locked loops (2)

  25. External Memory Bandwidth Methodology • Xilinx Virtex 4 • Influenced by speed grade, number of I/O • Memory Interface Generator v1.73 (Coregen) forces the use of the slower "Direct Clocking" scheme (vs. a SERDES strobe implementation) to support multiple banks; for the -10 speed grade, the maximum frequency is 220 – 240 MHz (depending on bus width) • Mathstar FPOA • Datasheet information for total external memory interface bandwidth (RLDRAM II) • Cell • External Memory Bandwidth (Rambus XDR DRAM) reported in the presentation "Introduction to the Cell Processor" from Dr. Michael Perrone (IBM) • MONARCH • External Memory Bandwidth (DDR2) reported in the presentation "World's First Polymorphic Computer – MONARCH" from K. Prager, L. Lewis, M. Vahey, G. Groves (Raytheon)
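
For rough sanity checks, a DDR-style interface's peak bandwidth can be estimated from the controller count, bus width, and clock. The sketch below is illustrative only; the two-controller, 64-bit, 333 MHz configuration is an assumption in the spirit of the setups above, not a result from the study.

```python
# Back-of-the-envelope DDR external-memory bandwidth:
# GB/s = controllers * (bus bits / 8) * clock * 2 transfers per clock (DDR).

def ddr_bandwidth_gbytes(controllers, bus_bits, clock_mhz, transfers_per_clock=2):
    return controllers * (bus_bits / 8) * clock_mhz * 1e6 * transfers_per_clock / 1e9

# Illustrative: two 64-bit DDR2 controllers at 333 MHz.
print(round(ddr_bandwidth_gbytes(2, 64, 333), 1))  # ~10.7 GB/s
```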

  26. External Memory Bandwidth Results

  27. External Memory Bandwidth Conclusions • External memory bandwidth is important to prevent a data bottleneck into the device. • For FPGAs, the type and speed of external memory supported depends on the family and speed grade of the device. • In this study, the non-FPGA devices have separate I/O and memory controllers/interfaces, so there is no distinction between the saturated and balanced cases. • The Stratix II S180 and Virtex 4 SX55 configurations support 2 simultaneous controllers, while the Virtex 4 LX100 and LX200 support 3 simultaneous controllers, which shows up in the performance difference for the saturated case. • Although the Stratix II controller supports faster DDR2 RAM (333 MHz vs. 220 MHz in this configuration), the Virtex 4 SX55 has higher bandwidth due to support for a wider bus. • Xilinx claims higher bandwidth on its website, but that figure assumes a wider bus than existing memories provide. • For the balanced case, Cell is the performance leader, primarily due to its specialized RAM format (XDR DRAM).

  28. I/O Bandwidth Methodology • Methodology varies by platform due to available information and architecture differences. • In all cases, choose the maximum throughput available at the protocol/signaling level. • The Saturated Case uses the maximum amount of I/O for the I/O interface; the Balanced Case assumes a balance of I/O and 1 memory interface. • Altera Stratix II • Datasheet information for concurrent receive pairs and transmit pairs @ 1.040 Gb/s per pair. • Xilinx Virtex 4 • Datasheet information for concurrent receive pairs and transmit pairs @ 1 Gb/s per pair. • Mathstar FPOA • Datasheet information for concurrent total transmit and receive bandwidth.
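
The per-pair arithmetic for the FPGA entries reduces to one line; in the sketch below, the pair counts are placeholders rather than the actual Stratix II or Virtex 4 figures.

```python
# Aggregate serial I/O bandwidth from differential pairs:
# GB/s = (receive pairs + transmit pairs) * Gb/s per pair / 8.

def io_bandwidth_gbytes(rx_pairs, tx_pairs, gbps_per_pair):
    return (rx_pairs + tx_pairs) * gbps_per_pair / 8.0

# Illustrative: 40 RX + 40 TX pairs at 1.040 Gb/s per pair.
print(io_bandwidth_gbytes(40, 40, 1.040))  # -> 10.4 GB/s
```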

  29. I/O Bandwidth Methodology • Cell • I/O Bandwidth reported in presentation “Introduction to the Cell Processor” from Dr. Michael Perrone (IBM) • MONARCH • I/O Bandwidth reported in presentation “World’s First Polymorphic Computer – MONARCH” from K. Prager, L. Lewis, M. Vahey, G. Groves (Raytheon)

  30. I/O Bandwidth Results

  31. I/O Bandwidth Conclusions • I/O bandwidth is important to prevent I/O and data bottlenecks. • In this study, the non-FPGA devices have separate I/O and memory controllers/interfaces, so there is no distinction between the saturated and balanced cases. • All devices except for the FPOA have at least 40 GB/s of throughput. • FPGAs are shown in both the fully utilized and balanced cases. • The Stratix II uses separate I/O for the single-ended memory interface and the differential pairs, so there is no distinction between its saturated and balanced cases. • Cell has the highest I/O performance for both cases.

  32. Internal Device Memory Bandwidth • Methodology • FPGAs • Xilinx – all BRAMs are the same; calculation = number of BRAMs * port width * number of ports * memory access frequency • Altera – 3 levels of internal memory hierarchy; the calculation is similar to the above for each level of the hierarchy • FPOA – similar to the above with 2 levels of memory hierarchy (Register File and Internal RAM) • GPP – bus width * frequency * ports
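
The Xilinx-style calculation spelled out above is straightforward to script; in the sketch below, the 36-bit port width, 2 ports, and 250 MHz access clock are illustrative assumptions, not the exact values behind the study's results.

```python
# Internal memory bandwidth per the formula on this slide:
# GB/s = blocks * port width (bits) * ports * access frequency / 8.

def internal_bw_gbytes(blocks, port_bits, ports, freq_mhz):
    return blocks * port_bits * ports * freq_mhz * 1e6 / 8 / 1e9

# Illustrative Xilinx-style inputs: 336 dual-port 36-bit BRAMs at 250 MHz.
print(round(internal_bw_gbytes(blocks=336, port_bits=36, ports=2, freq_mhz=250)))  # ~756 GB/s
```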

  33. Internal Memory Bandwidth • A large number of parallel accesses gives FPGAs the advantage in this metric

  34. Device Characterization Matrix • Goal: enable comparison of different devices on key parameters • Tie all device characterizations into a unifying framework • User weights allow adjustment to specific application needs • Scores quickly show comparison results based on input weights • Approach: Scale each characterization study from 1 to 10; generate a weighted average score for each device, taking into account the user weights • Justification: Significant architectural differences have historically made these devices difficult to compare • Single-Precision Floating-Point scaling example • Use min and max values to scale from 1 to 10
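
A minimal sketch of the scaling and weighting approach described here, assuming straightforward linear min/max scaling to the 1-10 range and a weighted average across metrics (a real matrix would invert the scale for metrics where lower is better, such as power and price); the metric names and numbers below are placeholders.

```python
# Sketch of the characterization matrix: scale each metric to 1-10 using its
# min/max across devices, then combine with user-supplied weights.

def scale_1_to_10(value, lo, hi):
    return 1.0 if hi == lo else 1.0 + 9.0 * (value - lo) / (hi - lo)

def device_scores(metrics, weights):
    """metrics: {metric: {device: value}}; weights: {metric: weight}."""
    scores = {}
    for metric, per_device in metrics.items():
        lo, hi = min(per_device.values()), max(per_device.values())
        for device, value in per_device.items():
            scores[device] = scores.get(device, 0.0) + \
                weights[metric] * scale_1_to_10(value, lo, hi)
    total = sum(weights.values())
    return {device: score / total for device, score in scores.items()}

# Placeholder data: two metrics across three hypothetical devices, equal weights.
metrics = {"sp_gflops": {"devA": 50, "devB": 200, "devC": 120},
           "power_w":   {"devA": 10, "devB": 80,  "devC": 30}}
print(device_scores(metrics, {"sp_gflops": 1.0, "power_w": 1.0}))
```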

  35. Device Characterization Matrix • Examples with other weights: • Power & cost (10), internal & external memory BW (5), 16-bit integer performance (7): • FPOA & V4SX55 lead • DP FP performance (5), power (10): • Stratix-II S180 and V4LX200 lead • External & I/O BW (10), power (10), cost (10): • MONARCH and Cell lead

  36. References • DeHon, A. "The Density Advantage of Configurable Computing." Computer, vol. 33, no. 4, pp. 41-49, April 2000. • DeHon, A. "Reconfigurable Architectures for General-Purpose Computing." A.I. Technical Report No. 1586, Massachusetts Institute of Technology, 1996. • Compton, K. and Hauck, S. "Reconfigurable Computing: A Survey of Systems and Software." ACM Computing Surveys, vol. 34, no. 2, June 2002, pp. 171-210. • "Memory Bandwidth." http://en.wikipedia.org/wiki/Memory_bandwidth. • Mason, J. "FPGA HPC – The Road Beyond Processors." Xilinx Corporation, RSSI 2007. • Wain, R., Bush, I., Guest, M., Deegan, M., Kozin, I., and Kitchen, C. "An Overview of FPGAs and FPGA Programming; Initial Experiences at Daresbury." Distributed Computing Group, Daresbury Laboratory, November 2006. • Bolsens, I. "Programming Modern FPGAs." Xilinx Corporation, MPSOC, August 2006. • Underwood, K. "FPGAs vs. CPUs: Trends in Peak Floating-Point Performance." In Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays (FPGA '04), Monterey, California, February 22-24, 2004. ACM Press, New York, NY, pp. 171-180. • HPEC Challenge Benchmarks. http://www.ll.mit.edu/HPECchallenge. • Xilinx Corporation. Virtex-4 Family Overview (DS112), January 23, 2007. • Xilinx Corporation. Floating-Point Operator v3.0 (DS335), September 28, 2006. • Perrone, M. "Introduction to the Cell Processor." IBM. • Prager, K., Lewis, L., Vahey, M., and Groves, G. "World's First Polymorphic Computer – MONARCH." Raytheon. • Strenski, D. "FPGA Floating Point Performance – A Pencil and Paper Evaluation." HPCwire. http://www.hpcwire.com/hpc/1195762.html. • Strenski, D. "Computational Bottlenecks and Hardware Decisions for FPGAs." FPGA and Structured ASIC Journal, 2006. • Altera Corporation. Stratix II Device Handbook, v4.3, May 2007. • Freescale Semiconductor Inc. MPC7450 RISC Microprocessor Family Reference Manual, Rev. 5, January 2005. • Freescale Semiconductor Inc. AltiVec Technology Programming Environments Manual, Rev. 3, April 2006. • MathStar Corporation. Arrix Family Product Brief, August 2006.
