Reconfigurable Computing

Computing Meeting EU, ESU, Brussels, May 18, 2006 Reconfigurable Computing Reiner Hartenstein

The Pervasiveness of RC
[Figure: number of Google hits for “FPGA and ….” queries by application area, ranging from 113,000 to 1,620,000 hits. Two groups are shown: the Math/SW-savvy scene (more recently: 2-3 years) and the ECE-savvy scene (mainstream for many years), each with many more areas.]

The dominance of Configware
• Most compute power is coming from Configware
• More MIPS migrated to Configware than running as Software

Reconfigurable Supercomputing (vHPC) going commercial
• Silicon Graphics RASC
• Cray XD1
• … and other vendors

>> Outline <<
• Reconfigurable Computing Paradox
• The Supercomputing Paradox
• We are using the wrong model
• Coarse-grained Reconfigurable Devices
• Super Pentium for Desktop Supercomputer
http://www.uni-kl.de

The Reconfigurable Computing Paradox
• poor FPGA technology: area-inefficient, slow, power-hungry, expensive
• poor tools: tools and languages unacceptable to most users; even most hardware experts (86%**) hate their tools
• poor education: RC education is extremely poor, if offered at all; it is ignored by CS curricula, and CS is taught as if for a 50-year-old mainframe …
**) DeHon ‘98

FPGA integration density: what paradox?
• the effective integration density of plain FPGAs is behind Moore’s law by more than 4 orders of magnitude
• However, brilliant results everywhere

Speed-up factors published
[Figure: relative performance (10^0 to 10^9) versus year (1980 to 2010). Reference curves: Memory 7%/yr; Microprocessor 50%/yr (labels: 8080, Pentium 4); x1.25 / yr (Moore); pre-FPGA era x2/yr. Published speed-up factors, from about 20 up to 15,000, are plotted for: Reed-Solomon decoding, real-time face detection, MAC, crypto, video-rate stereo vision, pattern recognition, Viterbi decoding, SPIHT wavelet-based image compression, Smith-Waterman pattern matching, molecular dynamics simulation, FFT, BLAST, protein identification, Los Alamos traffic simulation, GRAPE, astrophysics, Lee routing (by TU-KL), 2-D FIR filter [TU-KL], and grid-based DRC („fair comparison“; no FPGA: DPLA on the MoM Xputer architecture by TU-KL). Domain groups: DSP and wireless; image processing, pattern matching, multimedia; bioinformatics. Gaps are marked from >1 OoM up to <4 OoM.]
http://xputers.informatik.uni-kl.de/faq-pages/fqa.html

FPGAs: better area efficiency
DeHon‘s 1st Law (1996) was for plain FPGAs
Platform FPGA / DSP platform [courtesy Xilinx Corp.]:
• 500MHz PowerPC™ Processors (680DMIPS) with Auxiliary Processor Unit
• 500MHz multi-port Distributed 10 Mb SRAM
• 500MHz DCM Digital Clock Management
• 500MHz Flexible Soft Logic Architecture
• 200K Logic Cells
• 0.6-11.1Gbps Serial Transceivers
• 1Gbps Differential I/O
• 500MHz Programmable DSP Execution Units

pre-FPGA era: Why DPLA* was so good
• classical PLA layout, highly area-efficient: close to Moore’s law
• Large arrays of canonical boolean expressions
• Mid-’80s: at first only very tiny FPGAs were available: 1 DPLA replaced 256 of them
• ASM: Auto-Sequencing Memory
• Speed-up factor of 20 by reducing memory cycles, which is the key issue
• GAG (Generic Address Generator**): a generalization of the DMA**, to avoid address computation overhead
*) fabricated 1984 by the E.I.S. multi university project
**) for a survey by IMEC & TU-KL see: [M. Herz et al.: ICECS 2003, Dubrovnik]
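A minimal sketch of the address-generator idea, assuming a simple rectangular block scan. It is not the GAG design surveyed by Herz et al.; the function name `block_scan_addresses` and its parameters are hypothetical. The point it illustrates: once a few scan parameters are set, the whole address sequence follows, so the data path spends no cycles on address arithmetic.

```python
from typing import Iterator


def block_scan_addresses(base: int, row_stride: int,
                         rows: int, cols: int,
                         col_stride: int = 1) -> Iterator[int]:
    """Yield the addresses of a rows x cols block in row-major scan order.

    All parameters are illustrative assumptions, not taken from the slides.
    """
    for i in range(rows):
        for j in range(cols):
            yield base + i * row_stride + j * col_stride


# A 2 x 3 block starting at address 100 inside a memory with 16 words per row.
addrs = list(block_scan_addresses(base=100, row_stride=16, rows=2, cols=3))
print(addrs)  # [100, 101, 102, 116, 117, 118]
```

In hardware the two nested loops would be counters running beside the data path, which is why the address computation adds no overhead to the data stream.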

Consolidation?
[Figure: the chart of published speed-up factors (relative performance versus year, 1980 to 2010) shown again, annotated “even higher speed-up?”]
• taxonomy of algorithms, better tools and better education

New dimensions of low power
• Application migration [from supercomputer] resulting not only in massive speed-ups
• Electricity bills reduced by an order of magnitude and even more
• you may get for free …. up to millions of dollars per year
• „Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack“ [Herb Riley, R. Associates]
• Google, Amsterdam, NY (also a matter of national energy policy)
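A back-of-the-envelope check of the quoted figure, as a sketch only. It assumes the $10,000 saving is spread evenly over a year of 8760 hours; the function names are hypothetical.

```python
# Converts the quoted yearly saving into energy and average power.
# Assumption: constant load over a non-leap year (8760 hours).

HOURS_PER_YEAR = 24 * 365


def energy_saved_kwh(dollars_per_year: float, price_per_kwh: float) -> float:
    """Energy in kWh per year that corresponds to a yearly bill reduction."""
    return dollars_per_year / price_per_kwh


def average_power_kw(kwh_per_year: float) -> float:
    """Average continuous power in kW for a given yearly energy."""
    return kwh_per_year / HOURS_PER_YEAR


saved_kwh = energy_saved_kwh(10_000, 0.07)   # $10,000 at 7 cents per kWh
saved_kw = average_power_kw(saved_kwh)
print(f"{saved_kwh:,.0f} kWh/year, about {saved_kw:.1f} kW continuous")
# 142,857 kWh/year, about 16.3 kW continuous
```

Under these assumptions the quote corresponds to roughly 16 kW less continuous draw per 64-processor rack.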

As CTO for Linux Networx, Dr. Joshua Harr has the responsibility of laying the technical roadmap for the company and is leading the team developing cluster management tools. Josh's experience with parallel processing, distributed computing, large server farms, and Linux clustering began when he built an eight-node cluster system out of used components while in college. An industry expert, Josh has been called upon to consult with businesses and lecture in college classrooms. He earned a Ph.D. in computational chemistry and a bachelor's degree in molecular biology from BYU.
ISC2006 BoF Session
Title: Is Reconfigurable Computing the Next Generation Supercomputing?
Abstract: Advances in reconfigurable computing, particularly FPGA (field-programmable gate array) technology, have reached a performance level where they rival and exceed the performance of general purpose processors for the right applications. FPGAs have gotten cheaper thanks to smaller geometries, multimillion gate counts and volume market leverage from ASIC preproduction and other conventional uses. The potential benefit from the widespread incorporation of FPGA technology into high-performance applications is high, provided present day barriers to their incorporation can be overcome. This session will focus on defining the anticipated market changes, anticipated roles of FPGA technology in high-performance computing (from accelerators to hybrid architectures), characterizing present day barriers to the incorporation of FPGA technology (such as identifying the right applications), and partnering efforts required (tools, benchmarks, standards, etc.) to speed the adoption of reconfigurable technology in high-performance supercomputing.
Keywords: Reconfigurable computing, FPGA Accelerators, Supercomputing
Date and Time: This BoF session is part of the conference program and will take place within a 45 minute-slot on Wednesday, 28 June 2006, from 18:00 - 19:30.
BoF Organizers:
• John Abott, Chief Analyst, The 451 Group, USA
• Dr. Joshua Harr, CTO, Linux Networx, USA
• Dr. Eric Stahlberg, Organizing founder OpenFPGA, Ohio Supercomputer Center (OSC), USA

The Supercomputing Paradox
• promising technology
• COTS processor, decreasing cost
• Growing listed Teraflops
• Increasing number of processors running in parallel

HPC by classic supercomputing methodology: poor results
• Extreme shortage of affordable capacity
• More parallelism absorbs programmer productivity
• Program ready: hardware obsolete
• The law of More
• Not for high performance embedded computing
• Lack of scalability: progress only by innovation

Why traditional supercomputing / HPC failed
• CPU extremely unbalanced (stolen from Bob Colwell)
• because of the wrong multi-core interconnect architecture
• memory-cycle-hungry
• instruction-stream-based: the wrong way, how the data are moved around

Earth Simulator: moving data around inside the Crossbar
• weight: 220 t, 3000 km of thick cable

discarding the wrong road map: with a paradigm shift the same performance is feasible on a single 19” rack

Bringing together data and processor by Software
• Moving data to the processor: moving the grand piano

Key issues in very High Performance Computing (vHPC)
• reducing memory cycles is the key issue
• this needs a paradigm shift away from the dominance of instruction streams

Here is the common model: it’s not von Neumann
[Figure: software code (instruction-stream-based) runs on the CPU; configware code (data-stream-based) runs on the reconfigurable accelerator; a hardwired accelerator sits alongside; CPU and accelerators are symbiotic]
• the vN monopoly in our curricula is severely harmful
• we need dual paradigm education
• Von Neumann: the tail is wagging the dog
• very high performance & electricity bill issues
• legacy issues

The wrong basic mind set
• we need a dual paradigm approach
• our IT expert labor force lacks the right basic mind set
• this is a severe educational challenge

For high school and undergraduate education
• we need an archetype: a simple common model instead of a wide variety of sophisticated architectures
• this is a severe educational challenge

integration density
• the effective integration density of plain FPGAs is behind Moore’s law by more than 4 orders of magnitude
• the effective integration density of rDPAs* may come close to Moore’s law
*) reconfigurable DataPath Arrays (coarse-grained reconfigurability)

Coarse grain is about computing, not logic
[Figure: SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]; array size: 10 x 16 = 160 rDPUs; no CPU. Legend: rDPU = reconfigurable Data Path Unit, e. g. 32 bits wide; route-through only; not used; backbus connect]

SW-to-coarse-grained-CW migration example
[Figure: array of rDPUs; an adder (+) produces S]

Compare it to software solution on CPU
S = R + (if C then A else B endif);
[Figure: array of rDPUs; adders (+) produce S; Clock 200]

hypothetical branching example to illustrate software-to-configware migration
S = R + (if C then A else B endif);
[Figure: inputs R, B, A and C; a “=1” test on C selects between A and B; the selected value is added (+) to R, giving S]
• clock 200 MHz (5 nanosec)
• no memory cycles: speed-up factor = 100
*) if no intermediate storage in register file
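The branching example can be sketched in Python to show that the decision becomes a multiplexer sitting in the data path. This is an illustrative model only, not configware: the names `software_step`, `mux` and `configware_pipe` are hypothetical, and Python says nothing about the timing or the speed-up factor.

```python
def software_step(r, a, b, c):
    """Instruction-stream style: a branch decides which operand is used."""
    if c:
        t = a
    else:
        t = b
    return r + t


def mux(select, x, y):
    """2-to-1 multiplexer: both inputs arrive, the select line picks one."""
    return x if select else y


def configware_pipe(stream):
    """Data-stream style: each (R, A, B, C) tuple flows through mux, then adder."""
    for r, a, b, c in stream:
        yield r + mux(c, a, b)


inputs = [(10, 1, 2, True), (10, 1, 2, False), (5, 7, 9, True)]
results = list(configware_pipe(inputs))
print(results)  # [11, 12, 12]
```

Both versions compute the same S for every input tuple; the difference the slide points at is where the decision lives: in the instruction stream, or in a fixed pipe that the data simply passes through.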

Why the speed-up? What‘s the difference?
• moving the locality of operation into the route of the data stream by P&R
• instead of moving data by instruction streams

Bringing together data and processor by Configware
• Place the location of execution into the data pipe
• Move the stool

Data-stream-based execution
• should be transport-triggered instead of instruction-triggered
• transport should be done within compiled pipelines, not by move engines*
*) which are instruction-stream-based!

For high school and undergraduate education
• we should send CTOs and professors back to school
• this is a severe educational challenge

The wrong model
[Figure: SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]; array size: 10 x 16 = 160 rDPUs; no CPU. Legend: rDPU = reconfigurable Data Path Unit, e. g. 32 bits wide; route-through only; not used; backbus connect]
• upon this schematics … … question by a Japanese Corporate vVIP

The wrong mind set ....
[Figure: the branching example with inputs R, B, A and C, the “=1” test, the adder (+) and output S; clock 200 MHz (5 nanosec)]
• Question by a Japanese Corporate vVIP [RAW’99]: „but you can‘t implement decisions!“
• not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm
• We need Reconfigurable Computing Education

some Goals
• Universal HPC co-architecture for: embedded vHPC (nomadic, automotive, ...) and desktop vHPC (scientific computing ...)
• Application co-development environment for hardware non-experts, ....
• Acceptability by software-type users, ...
• Meet product lifetime >> embedded syst. life: FPGA emulation logistics from development down to maintenance and repair stations
• examples: automotive, aerospace, industrial, ..

Architecture: A potential Pentium successor
• Discard most caches
• have 64* cores, 0.5 - 1 GHz
• with clever interconnect for:
▪ concurrent processes and CPU mode
▪ multithreading
▪ Kung-Kress pipe network DPU mode
• The Desk-top Supercomputer!
*) CPU mode / DPU mode capability

“Super Pentium” configuration example: twin paradigm machine
[Figure: array of cores, some configured as CPU and the others as rDPU]

World TV & game console & multi media center
[Figure: SMeXPP rDPA connected to Games, Videos, Music, LCD DISPLAY, Camera, SD/MMC Cards, Baseband-Processor, Radio-Interface and Audio-Interface]
e. g.: ~ 8 x 8 rDPA: all feasible under 500 MHz
• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise Reduction and Artifact Removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable Displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness Enhancement
• Shadow Enhancement
• Differentiation
• Programmable de-interlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes
http://pactcorp.com

feasible under 500 MHz means low electricity cost and allows very high integration density

apropos compiled pipeline …
[Figure: pipeline]

Dual Paradigm Application Development Support
[Figure: a high level language source enters the software/configware co-compiler, which emits software code (instruction-stream-based) for the CPU and configware code (data-stream-based) for the reconfigurable accelerator, next to a hardwired accelerator]
• placement & routing in the compiler optimizes interconnect bandwidth by preferring nearest neighbor connect
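A toy illustration of what “preferring nearest neighbor connect” can mean as a placement objective. This is a sketch under stated assumptions, not the co-compiler's actual algorithm: the operator names, the 2 x 2 array and the exhaustive search are all hypothetical, and a real placer would use heuristics rather than trying every permutation.

```python
from itertools import permutations


def manhattan(p, q):
    """Grid distance between two array positions given as (row, col)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def wire_length(placement, edges):
    """Total grid distance over all producer/consumer connections."""
    return sum(manhattan(placement[u], placement[v]) for u, v in edges)


def best_placement(nodes, edges, slots):
    """Exhaustive search; only workable for very small arrays."""
    best = None
    for perm in permutations(slots, len(nodes)):
        candidate = dict(zip(nodes, perm))
        cost = wire_length(candidate, edges)
        if best is None or cost < best[0]:
            best = (cost, candidate)
    return best


# A four-stage pipe to be placed on a 2 x 2 array of data path units.
nodes = ["load", "mux", "add", "store"]
edges = [("load", "mux"), ("mux", "add"), ("add", "store")]
slots = [(r, c) for r in range(2) for c in range(2)]

naive = dict(zip(nodes, slots))            # row-major order, no optimization
naive_cost = wire_length(naive, edges)     # includes one diagonal hop
cost, placement = best_placement(nodes, edges, slots)
print(naive_cost, cost)  # 4 3
```

With the optimized placement every connection runs between adjacent units, so the pipe snakes through the array and no route-through or long wire is needed.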

Software / Configware Co-Compilation
Juergen Becker’s CoDe-X, 1996
[Figure: the C language source enters the Partitioner (Move the Locality of Operation), which is driven by Resource Parameters and supports different platforms; it feeds the SW compiler for the CPU and the CW compiler, whose Placement & Routing targets the rDPU array]

Software / Configware very high level Synthesis
[Figure: Math formula .... enters the term-rewriting-based vhl synthesis system [Arvind, or, Mauricio Ayala], which emits software code (instruction-stream-based) for the CPU and configware code (data-stream-based) for the reconfigurable accelerator, next to a hardwired accelerator]

>> Conclusions <<
• Reconfigurable Computing Paradox
• The Supercomputing Paradox
• We are using the wrong model
• Coarse-grained Reconfigurable Devices
• Super Pentium for Desktop Supercomputer
• Conclusions
http://www.uni-kl.de

Objectives
for every area which needs:
• cheap, compact vHPC
• rapid prototyping, field-patching, emulation
• avoiding specific silicon
• flexibility (for accelerators)

Conclusion (1)
Reconfigurable Computing opens many spectacular new horizons:
• Cheap vHPC without needing specific silicon, no mask ....
• Cheap embedded vHPC
• Cheap desktop supercomputer (a new market)
• Replacing expensive hardwired accelerators
• Fast and cheap prototyping
• Flexibility for systems with unstable multiple standards by dynamic reconfigurability
• Supporting fault tolerance, self-repair and self-organization
• Emulation logistics for very long term spare part provision and part type count reduction (automotive, aerospace …)
• Massive reduction of the electricity bill: locally and nationally

Conclusion (2)
• Needed: Universal vHPC co-architecture demonstrator
• For widely spreading its use successfully:
▪ The compilation tool problem to be solved
▪ Language selection problem to be solved
▪ Education backlog problems to be solved
• Use this to develop a very good high school and undergraduate lab course
• select killer applications for demo
• A motivator: preparing for the top 500 contest
