Reconfigurable Computing and the von Neumann Syndrome (presentation transcript)
Reconfigurable Computing and the von Neumann Syndrome
Reiner Hartenstein

Presentation Transcript
Questions?
  • Who is familiar with FPGAs? Is programming them easy?
  • Who is familiar with systolic arrays?
  • The duality: data streams vs. instruction streams?
  • Programming a multicore microprocessor: will it be easy?
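The data-stream vs. instruction-stream duality asked about above can be made concrete with a small sketch (hypothetical Python, not from the talk): the same multiply-accumulate kernel written once as an instruction-stream loop and once as a two-stage data-stream pipeline.

```python
# Hypothetical sketch: the same MAC (multiply-accumulate) kernel seen from
# both sides of the duality the slide asks about.

def mac_instruction_stream(a, b):
    """von Neumann style: a program counter steps through instructions,
    fetching and executing one operation per loop iteration."""
    acc = 0
    for i in range(len(a)):      # the instruction stream controls the order
        acc += a[i] * b[i]
    return acc

def mac_data_stream(a, b):
    """Anti-machine style: no instruction fetch at run time; the operands
    flow as streams through a fixed multiply stage and an accumulate stage."""
    products = (x * y for x, y in zip(a, b))   # stage 1: multiplier
    acc = 0
    for p in products:                         # stage 2: accumulator
        acc += p
    return acc
```

Both views compute the same result; only the control mechanism differs.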

Outline
  • The Pervasiveness of FPGAs
  • The Reconfigurable Computing Paradox
  • The Gordon Moore gap
  • The von Neumann syndrome
  • We need a dual paradigm approach
  • Conclusions

Pervasiveness of RC
http://hartenstein.de/pervasiveness.html
http://www.fpl.uni-kl.de/RCeducation08/pervasiveness.html

RCeducation 2008
The 3rd International Workshop on Reconfigurable Computing Education
April 10, 2008, Montpellier, France
http://www.fpl.uni-kl.de/RCeducation08/

Section notes (on repeated Outline slides):
  • the hardware / software chasm, the configware / software chasm
  • the instruction stream tunnel
  • the overhead-prone paradigm
  • instruction stream vs. data stream
  • bridging the chasm: an old hat
  • stubborn curriculum task forces

RC education
http://www.fpl.uni-kl.de/RCeducation/
http://www.fpl.uni-kl.de/RCeducation08/pervasiveness.html

Section notes: platform FPGAs, coarse-grained arrays, saving energy

FPGA with island architecture
(figure: reconfigurable logic boxes embedded in reconfigurable interconnect fabrics, with switch boxes and connect boxes)

Deficiencies of reconfigurable fabrics (FPGA, fine-grained)
(chart: transistors per microchip, 10^0 to 10^9, vs. year 1980 - 2010; the Gordon Moore curve on top, the microprocessor around 10^6, the routed FPGA far below)
  • density: physical vs. logical vs. routed FPGA differ by a factor >> 10,000
  • overhead: reconfigurability overhead, wiring overhead, routing congestion, immense area inefficiency
  • deficiency factor: >10,000 (DeHon's 1st Law [1996: Ph.D. thesis, MIT])
  • general-purpose "simple" FPGA: power guzzler, slow clock

Software-to-Configware (FPGA) Migration:
some published speed-up factors [2003 – 2005]

(chart: speed-up factor on a log scale, 10^0 to 10^6, vs. year 1980 - 2010, growing roughly x 2/yr)

Application domains: image processing / pattern matching / multimedia; DSP and wireless; bioinformatics. Published examples include Reed-Solomon decoding, Viterbi decoding, FFT, MAC, crypto, SPIHT wavelet-based image compression, Smith-Waterman pattern matching, pattern recognition, real-time face detection, video-rate stereo vision, oil and gas (17), GRAPE astrophysics, BLAST, molecular dynamics simulation, and protein identification, with speed-up factors ranging from 17 up to 6,000.

The RC paradox: a deficiency factor >10,000 combined with a speed-up factor of 6,000 means a total discrepancy of >60,000,000.
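The paradox reduces to a one-line calculation; a minimal sketch using only the factors stated on the slides:

```python
# Quick arithmetic behind the "RC paradox" figures quoted on the slides:
# FPGAs pay an area/wiring deficiency factor of >10,000 relative to
# hardwired logic, yet still deliver speed-ups of up to 6,000 over CPUs.
deficiency_factor = 10_000   # transistor-level inefficiency of the FPGA fabric
speedup_factor = 6_000       # best published software-to-FPGA speed-up

# The instruction-stream machine must therefore be less efficient than its
# raw transistor count suggests by the product of the two factors:
total_discrepancy = deficiency_factor * speedup_factor
print(total_discrepancy)     # 60000000, the ">60,000,000" on the slide
```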




Software-to-Configware (FPGA) Migration:
some published speed-up factors [2003 – 2005]
  • These examples worked fine with on-chip memory
  • There are other algorithms that are more difficult to accelerate …
  • … where d-caching might be useful (ASM)

How much on-chip embedded BRAM?
(table: on-chip LatticeCS series; fast on-chip block RAMs (BRAMs); coarse-grained DPUs; figures shown: 256 – 1704 BGA, 8 – 32, 56 – 424)

Coarse-grained Reconfigurable Array
SNN filter on a (supersystolic) KressArray: mainly a pipe network, no CPU.
  • rDPU: reconfigurable Data Path Unit, 32 bits wide
  • array size: 10 x 16 rDPUs (some cells serve as route-through only, some are not used; backbus connect)
  • compiled by Nageldinger's KressArray Xplorer, with Juergen Becker's CoDe-X inside
  • note: a software perspective without instruction streams: pipelining
  • question after the talk: "but you can't implement decisions!"

Much fewer deficiencies with coarse-grained arrays
(figure: CPU = program counter + DPU, vs. an rDPA where the logical and the physical rDPU arrays coincide)
(chart: transistors per microchip, 10^0 to 10^9, vs. year 1980 - 2010, with the Gordon Moore curve)
  • area efficiency very close to Moore's law: Hartenstein's Law [1996: ISIS, Austin, TX]
  • very compact configuration code: very fast reconfiguration

Software-to-Configware (FPGA) Migration: Oil and gas [2005]
(chart: speed-up factor, 10^0 to 10^6, vs. year 1980 - 2010, x 2/yr; oil and gas: 17)
Side effect: slashing the electricity bill by more than an order of magnitude.

An accidentally discovered side effect [Herb Riley, R. Associates]
  • Software-to-FPGA migration of an oil and gas application: speed-up factor of 17
  • electricity bill down to <10%; hardware cost down to <10%
  • saves > $10,000 in electricity bills per year (at 7¢ / kWh) per 64-processor 19" rack
  • all other publications reporting speed-ups did not report energy consumption. This will change.
What about higher speed-up factors? More dramatic electricity savings? ($70 in 2010?)
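The electricity figure can be sanity-checked with a short back-of-envelope calculation (my arithmetic, assuming only the slide's $10,000/year and 7¢/kWh numbers):

```python
# Sanity check of the claim that one 64-processor 19" rack burns
# more than $10,000 of electricity per year at 7 cents / kWh.
price_per_kwh = 0.07                 # $ / kWh, as quoted on the slide
annual_bill = 10_000                 # $ / year (the ">$10,000" figure)

hours_per_year = 365 * 24            # 8760 h
kwh_per_year = annual_bill / price_per_kwh
avg_power_kw = kwh_per_year / hours_per_year

print(round(kwh_per_year))           # ~142857 kWh/year
print(round(avg_power_kw, 1))        # ~16.3 kW sustained, i.e. roughly 255 W per processor
```

So the quoted bill corresponds to a rack drawing about 16 kW around the clock, which is a plausible order of magnitude.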

What’s Really Going On With Oil Prices? [BusinessWeek, January 29, 2007]
  • $52: price of delivery in February 2007 [New York Mercantile Exchange: Jan. 17]
  • $200: minimum oil price in 2010, in a bet by investment banker Matthew Simmons

Energy as a strategic issue
  • Google's annual electricity bill: $50,000,000
  • Amsterdam's electricity: 25% goes into server farms
  • NY city server farms: 1/4 km² of building floor area
  • predicted for the USA in 2020: 30 - 50% of the entire national electricity consumption goes into the cyber infrastructure [Mark P. Mills]
  • petaFLOPS supercomputer (by 2012?): extreme power consumption

Energy: an important motivation
*) feasible also on reconfigurable platforms

Section notes: the von Neumann syndrome & the multicore crisis

What is the reason for the paradox?
  • Moore's law is not applicable to all aspects of VLSI (the law of Gates)
  • The Gordon Moore curve does not indicate performance
  • The peak clock frequency does not indicate performance

Rapid Decline of Computational Density [BWRC, UC Berkeley, 2004] [stolen from Bob Colwell]
(chart: SPECfp2000 / MHz / billion transistors, 0 - 200, vs. year 1990 - 2005, for DEC Alpha, IBM, SUN, and HP CPUs; memory wall, caches, ...)
  • a dramatic demo of the von Neumann syndrome; primary design goal: avoiding a paradigm shift
  • Alpha: down by 100 in 6 years; IBM: down by 20 in 6 years

Monstrous Steam Engines of Computing
  • 5120 processors, 5000 pins each; crossbar weight: 220 t; 3000 km of thick cable
  • power measured in tens of megawatts; floor space measured in tens of thousands of square feet
  • larger than a battleship; ready 2003

Dead Supercomputer Society
Research 1985 – 1995 [Gordon Bell, keynote ISCA 2000]
ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, DAPP, Denelcor, Elexsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, ICL, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics

We are in a Computing Crisis
  • supercomputing crisis; microprocessor crisis
  • MPP parallelism does not scale
  • going multicore
*) feasible also with rDPA

The von Neumann Paradigm Trap [Burks, Goldstine, von Neumann; 1946]
  • RAM (memory cells have addresses ...)
  • program counter (auto-increment, jump, goto, branch)
  • datapath unit with ALU etc.
  • I/O unit, ...
CS education got stuck in this paradigm trap, which stems from the technology of the 1940s. CS education's right eye is blind, and its left eye suffers from tunnel view. We need a dual paradigm approach.

What is the reason for the paradox? The von Neumann syndrome
  • the Law of More: drastically declining programmer productivity
  • the result of decades of tunnel view in CS R&D and education: the basic mind set is completely wrong
  • "the CPU is the most flexible platform"? >1000 CPUs running in parallel are the most inflexible platform; FPGAs & rDPAs, however, are very flexible

Understanding the Paradox?
  • an executive summary doesn't help: we must first understand the nature of the paradigm
  • von Neumann chickens?

Von Neumann CPU (tunnel view with the left eye)
(figure: CPU = program counter + DPU, attached to RAM memory)
Program source: software; the world of software engineering.

von Neumann is not the common model
(figure: mainframe age: the von Neumann instruction-stream-based machine, a CPU with program counter and RAM memory behind the von Neumann bottleneck; microprocessor age: the CPU plus hardware accelerator co-processors with data-stream-based DPUs)
  • software: instruction-stream-based; hardware accelerators: data-stream-based

Here is the contemporary common model
(figure: as before, but now, in the configware age, the CPU is complemented by both hardwired and reconfigurable accelerators)

Machine models
  • von Neumann machine: CPU = program counter + DPU, with RAM memory
  • anti machine: "transport-triggered"; has no program counter, hence no instruction fetch at run time
(figure: a DPU or rDPU array fed by multiple RAMs, each with its own data counter)

Nick Tredennick's Paradigm Shifts
(slowly preparing to use both eyes, for a dual paradigm point of view)
  • early historic machines: resources fixed, algorithm fixed
  • von Neumann CPU: resources fixed, algorithm variable; 1 programming source needed (software)

Compilation: Software (von Neumann model)
Software engineering: a source program goes through the compiler into software code, a sequential instruction schedule (Befehls-Fahrplan, i.e. an instruction timetable).

Nick Tredennick's Paradigm Shifts (continued)
  • early historic machines: resources fixed, algorithm fixed
  • von Neumann CPU: resources fixed, algorithm variable; 1 programming source needed (software)
  • Reconfigurable Computing: resources variable, algorithm variable; 2 programming sources needed (configware and flowware)

Configware Compilation
Configware engineering: a source "program" (C, FORTRAN, MATLAB) enters the configware compiler, which is fundamentally different from a software compiler:
  • the mapper performs placement & routing and emits configware code for the rDPA (a pipe network)
  • the scheduler programs the data counters and emits flowware code for the data streams
  • the data streams are generated by ASMs (Auto-Sequencing Memories): RAM blocks, each with a GAG and a data counter

The first archetype machine model: "von Neumann"
The software industry's secret of success: a simple basic machine paradigm with an instruction-stream-based mind set; procedural personalization: compile or assemble; the personalization is RAM-based (mainframe / CPU).
But now we live in the Configware Age.

Synthesis Method?
  • the reductionist approach: of course algebraic (linear projection); only for applications with regular data dependencies; mathematicians caught by their own paradigm trap
  • 1995: Rainer Kress discarded their algebraic synthesis methods and replaced them with simulated annealing: the rDPA
  • the super-systolic array: a generalization of the systolic array

Having introduced Data Streams [H. T. Kung, ~1980]
(figure: input and output data streams, spread over time and port #, flowing through a DPA pipe network; execution is transport-triggered)
  • no memory wall
  • the road map to HPC: ignored for decades
  • systolic array research throughout the 80ies: the mathematicians' hobby

Who generates the Data Streams?
(figure: "systolic" data streams entering and leaving the array)
Mathematicians: "it's not our job" (it's not algebraic).

Without a sequencer ...
  • the reductionist approach ("it's not our job"): resources without a sequencer are not a machine
  • mathematicians missed inventing the new machine paradigm ... the anti machine

The counterpart of the von Neumann machine: the Kress/Kung Anti Machine
(figure: a coarse-grained (r)DPA surrounded by ASMs; ASM: Auto-Sequencing Memory, a RAM block with a GAG and a data counter)
  • data counters instead of a program counter
  • data counters are located at the memory, not at the data path

Acceleration Mechanisms by ASM-based MoMSW
  • parallelism by a multi-bank memory architecture
  • reconfigurable address computation (before run time)
  • avoiding multiple accesses to the same data
  • avoiding memory cycles for address computation
  • improved parallelism by storage scheme transformations
  • minimized data movement across chip boundaries
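As an illustration of the first mechanism (parallelism by a multi-bank memory architecture), here is a minimal sketch of low-order bank interleaving; the function names are hypothetical, not from the MoM design:

```python
# Hypothetical sketch of the multi-bank parallelism idea: interleave an
# array across N banks so that N consecutive elements can be fetched in
# one parallel access instead of N sequential memory cycles.
def bank_of(addr, n_banks=4):
    """Low-order interleaving: the bank is selected by the address LSBs."""
    return addr % n_banks

def conflict_free(addrs, n_banks=4):
    """True if all addresses hit distinct banks (one parallel access)."""
    banks = [bank_of(a, n_banks) for a in addrs]
    return len(set(banks)) == len(banks)

# A unit-stride burst of 4 words is conflict-free...
print(conflict_free([8, 9, 10, 11]))   # True
# ...whereas stride-4 accesses all collide in bank 0:
print(conflict_free([0, 4, 8, 12]))    # False
```

The second case is exactly what the "storage scheme transformations" bullet targets: re-mapping data so that the access pattern becomes conflict-free.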

FPGAs in Supercomputing
  • synergisms: coarse-grained parallelism through conventional parallel processing (CPUs with program counters and 32/64-bit DataPath Units),
  • and: fine-grained parallelism through direct configware execution on the FPGAs (millions of 1-bit reconfigurable logic boxes embedded in a reconfigurable interconnect fabric)

Anti machine
  • hardwired anti machine: resources and memory, with data counters as sequencers; algorithms are expressed in flowware
  • reconfigurable anti machine: the same, plus configware to program the variable resources

von Neumann machine
  • machine = resources + sequencer: memory with a program counter; algorithms are expressed in software

The clash of paradigms: the software / hardware chasm
  • microprocessor age: the programmer's mind set is procedural, instruction-stream-based (µprocessor); the hardware guy's is structural, a kind of data-stream-based mind set (accelerators)
  • a programmer does not understand function evaluation without machine mechanisms, i.e. without a program counter ...
  • we need a data-stream-based machine paradigm

Xputer Principles
(figure: Xputer = ASM with reconfigurable address generators + rALU with a reconfigurable data path, e.g. the DPLA, next to a CPU)
  • contemporary? 1984: the first FPGAs were very tiny & very expensive
  • we used the VAX-11/750 of my group

Outline
  • The von Neumann Paradigm
  • Accelerators and FPGAs
  • The Reconfigurable Computing Paradox
  • The new Paradigm
  • Coarse-grained
  • Bridging the Paradigm Chasm
  • Conclusions

FPGA Modes of Operation
(timeline: configuration phases (C ph) alternate with execution phases (E ph), with "off" in between)
  • simple, static reconfigurability: configware code is loaded from external flash memory, e.g. after power-on (~milliseconds)
  • dynamic reconfigurability requires new OS principles

Illustrating "dynamically reconfigurable"
(timeline: per module no., configuration phases (C ph) and execution phases (E ph) of configware macros X, Y, and Z overlap on a partially reconfigurable FPGA; module X configures module Y)
  • swapping and scheduling of relocatable configware code macros are managed by a configware operating system
  • a configware OS is fundamentally different from a software OS; an established R&D area
  • Reconfigurable Computing at Microsoft: a Microsoft ReconVista?

Outline
  • The von Neumann Paradigm
  • Accelerators and FPGAs
  • The Reconfigurable Computing Paradox
  • The new Paradigm
  • Coarse-grained
  • Bridging the Paradigm Chasm
  • Conclusions

Reconfigurable HPC
  • This area is almost 10 years old

Have to re-think basic assumptions
  • instead of physical limits, fundamental misconceptions of algorithmic complexity theory limit progress and will necessitate new breakthroughs
  • it is not processing that is costly, but moving data and messages
  • we have to re-think the basic assumptions behind computing

Illustrating the von Neumann paradigm trap: the watering pot model [Hartenstein]
  • the instruction-stream-based approach has a von Neumann bottleneck
  • the data-stream-based approach (many watering pots) has no von Neumann bottleneck


Outline
  • The (non-v-N) anti-machine (Xputer)
  • Speed-up by address generators
  • Data-procedural Programming Language
  • Generalization of the Systolic Array
  • Partitioning Compilation Techniques
  • Design Space Exploration
  • Bridging the Paradigm Chasm

More compute power by Configware than Software (a very cautious estimation)
  • 75% of all (micro)processors are embedded (4 : 1)
  • 25% of the embedded µprocessors are accelerated by FPGA(s) (1 : 4)
  • -> every 2nd µprocessor accelerated by FPGA(s) (-> 1 : 1)
  • average acceleration factor >2: -> rMIPS* : MIPS > 2
*) rMIPS: MIPS replaced by FPGA compute power
Conclusion: most compute power comes from configware (the difference is probably an order of magnitude)

Programming Language Paradigms: Principles of MoPL [1994]
  • very easy to learn
  • multiple GAGs

Avoiding the paradigm shift?
  • "It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?" (Tarek El-Ghazawi, panelist at SuperComputing 2006)
  • "A leap too far for the existing HPC community" (panelist Allan J. Cantle)
  • we need a bridge strategy: develop advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques
  • a shorter leap: coarse-grained platforms, which allow a software-like pipelining perspective

SuperComputing, Nov 11-17, 2006, Tampa, Florida: over 7,000 registered attendees and 274 exhibitors


We need a new machine paradigm
(figure: data streams flowing through an array: the data-stream-based mind set)
  • a programmer does not understand function evaluation without machine mechanisms, i.e. without a program counter ...
  • we urgently need a data-stream-based machine paradigm
  • it was prepared almost 30 years ago

Generic Address Generator (GAG)
A generalization of the DMA. Acceleration factors by:
  • address computation without memory cycles: avoids e.g. a 94% address computation overhead*
  • storage scheme optimization methodology, etc.
GAG & enabling technology published 1989; survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]; patented by TI 1995
*) software to Xputer migration
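What a GAG computes can be sketched in a few lines; this is an illustrative model under my own assumptions (names and parameters are hypothetical), not the published GAG design:

```python
# Hedged sketch of what a generic address generator (GAG) does: it emits a
# whole 2-D address stream from a handful of pre-configured parameters, so
# no instruction stream and no memory cycles are spent computing addresses
# at run time.
def gag_block_scan(base, width, x0, y0, dx, dy):
    """Yield linear addresses of a dx-by-dy rectangular scan starting at
    (x0, y0) in a row-major 2-D memory of the given width."""
    for y in range(y0, y0 + dy):
        for x in range(x0, x0 + dx):
            yield base + y * width + x

# A 3x2 scan window at (1,1) in an 8-word-wide memory:
print(list(gag_block_scan(0, 8, 1, 1, 3, 2)))   # [9, 10, 11, 17, 18, 19]
```

Once the six parameters are configured, the generator runs autonomously: this is the "reconfigurable address computation before run time" idea.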

The 2nd "archetype" machine model: "Kress-Kung"
The configware industry's secret of success: a simple basic machine paradigm with a data-stream-based mind set; structural personalization by compilation; the personalization is RAM-based (reconfigurable accelerator).


Symptom of the von Neumann Syndrome
SNN filter on a (supersystolic) KressArray: mainly a pipe network, no CPU.
  • array size: 10 x 16 = 160 rDPUs (reconfigurable Data Path Units, e.g. 32 bits wide; some route-through only, some not used; backbus connect)
  • note: a software perspective without instruction streams
  • question after the talk, from a high-level R&D manager of a large Japanese IT industry group: "but you can't implement decisions!": yielded by a single-paradigm mind set
  • in fact, an if clause turns into a multiplexer
  • an executive summary? Forget it! How about a microprocessor giant having >100 vice presidents?
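The "if clause turns into a multiplexer" remark can be demonstrated directly; a minimal sketch (hypothetical names) showing that computing both branches and then selecting gives the same result as branching:

```python
# The point behind "an if clause turns into a multiplexer": in a data-stream
# pipeline, BOTH branches are computed every cycle and a multiplexer selects
# one result, so no instruction-stream branching is needed.
def if_clause(a, b):
    """Instruction-stream view: control flow decides what to compute."""
    if a > b:
        return a - b
    else:
        return b - a

def mux_datapath(a, b):
    """Datapath view: compute both branches, then select (multiplex)."""
    branch_true = a - b
    branch_false = b - a
    sel = a > b                                  # 1-bit select signal
    return branch_true if sel else branch_false  # the multiplexer

# Both views agree on every input pair:
assert all(if_clause(a, b) == mux_datapath(a, b)
           for a in range(-3, 4) for b in range(-3, 4))
```

This is how a pipe network implements decisions without a program counter.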

Dual Paradigm Application Development [Juergen Becker's CoDe-X, 1996]
(figure: a C language source enters the partitioner; the SW compiler generates code for the CPU; the CW compiler performs automatic parallelization by loop transformations plus placement and routing, generating a pipe network on the rDPU array)

Hybrid Multi-Core example: a twin paradigm machine
(figure: 64 cores; each core can run in CPU mode or in rDPU mode)
  • How about a microprocessor giant having >100 vice presidents: disabled for the paradigm shift? Customers refusing the paradigm shift?

Compilation for Dual Paradigm Multicore [Juergen Becker's CoDe-X, 1996]
(figure: as before, but compiling to the hybrid multicore: the SW compiler targets the CPU cores; the CW compiler performs automatic parallelization by loop transformations plus placement and routing of the pipe network onto the rDPU cores)


Here is the common model
(figure: mainframe age: the von Neumann instruction-stream-based machine; microprocessor age: CPU plus hardwired, data-stream-based accelerator co-processors behind the von Neumann bottleneck; configware age: CPUs plus hardwired and reconfigurable accelerators, fed by a software/configware co-compiler that emits both software and configware)


Multi Core: Just more CPUs?
  • complexity and clock frequency scaling of single-core microprocessors have come to an end
  • multi-core microprocessor chips are emerging: soon 32 cores on an AMD chip, and 80 on an Intel
  • without a paradigm shift, just more CPUs on a chip lead down the dead roads known from supercomputing
  • multi-threading is not the silver bullet
  • we have to re-think the basic assumptions behind computing

Solution not expected from CS officers
  • progress of the joint task force on CS curriculum recommendations is extremely disillusioning; it's more like a lobby: "my area is the most important"
  • we need mutual efforts, like the EE/CS cooperation known from the Mead & Conway revolution
  • the personal supercomputer, by Reconfigurable Computing: a far-ranging, massive push of innovation in all areas of science and the economy
  • for RC, other motivations are similarly high-grade: the growing cost and looming shortage of energy

Computing Sciences are in a severe Crisis
  • we urgently need to shape the Reconfigurable Computing revolution, to enable the move toward incredibly promising new horizons of affordable highest-performance computing
  • this cannot be achieved with the classical software-based mind set: we need a new dual paradigm approach
  • supercomputing titans may be your enemies: watch out not to get screwed!

The Configware Age
  • attempts to avoid the paradigm shift will again create a disaster
  • the mainframe age and the microprocessor(-only) age are history
  • we are living in the configware age right now!


MoM Scan Window (MoMSW) Illustration
(figure: ASMs; ASM: Auto-Sequencing Memory)
MoM architectural primary features:
  • 2-dimensional (data) memory address space
  • multiple* vari-size reconfigurable MoMSW scan windows
  • MoMSW controlled by reconfigurable GAGs (generic address generators)
*) typically 3

Reconfigurable Generic Address Generator (GAG)
A generalization of the DMA. Acceleration factors by:
  • address computation without memory cycles: avoids e.g. a 94% address computation overhead
  • storage scheme optimization methodology, etc.
  • supporting scratch optimization strategies (smart d-caching)
GAG & enabling technology published 1989; survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]; patented by TI 1995

JPEG zigzag scan pattern

[Figure: 8-by-8 pixel map (x/y axes) with the numbered scan steps and data counter positions]

*> Declarations

EastScan is
  step by [1,0]
end EastScan;

SouthScan is
  step by [0,1]
end SouthScan;

NorthEastScan is
  loop 8 times until [*,1]
    step by [1,-1]
  endloop
end NorthEastScan;

SouthWestScan is
  loop 8 times until [1,*]
    step by [-1,1]
  endloop
end SouthWestScan;

HalfZigZag is
  EastScan
  loop 3 times
    SouthWestScan
    SouthScan
    NorthEastScan
    EastScan
  endloop
end HalfZigZag;

*> Scan program

goto PixMap[1,1]
HalfZigZag;
SouthWestScan
uturn (HalfZigZag)

116
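The full 8-by-8 zigzag sequence that such a scan program describes can be sketched in a few lines of plain Python (an illustration of the address stream a GAG would be configured to emit, not MoPL tooling):

```python
def zigzag(n=8):
    """Enumerate (row, col) positions of an n-by-n block in JPEG
    zigzag order: walk the anti-diagonals, alternating direction."""
    order = []
    for s in range(2 * n - 1):            # s = row + col indexes one anti-diagonal
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()                # even diagonals run bottom-left to top-right
        order.extend(diag)
    return order
```

Each of the 64 positions is visited exactly once; a data counter stepping through this sequence replaces 64 per-access address computations by the instruction stream.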

Significance of MoMSW Reconfigurable Scan Windows
  • MoMSW Scan windows have the potential to drastically reduce traffic to/from slow off-chip memory.
  • No instruction streams needed to implement scratch pad optimization strategies using fast on-chip memory
  • MoMSW Scan windows may contribute to speed-up by a factor of 10 and sometimes even much more
  • MoMSW Scan windows are the deterministic alternative („d-caching“) to (indeterministic and speculative) classical cache usage: performance can be well predicted
  • For data-stream-based computing scan windows are highly effective, whereas classical caches are entirely useless

117
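A back-of-the-envelope model of that traffic reduction (a toy calculation under our own simplifying assumptions, not MoM measurement data): sliding a k-by-k window over a w-by-h image, a scan window that reuses its overlapping columns fetches far fewer off-chip words than refetching every window position from scratch:

```python
def naive_fetches(w, h, k):
    """Off-chip words fetched if every window position reloads all k*k cells."""
    return (w - k + 1) * (h - k + 1) * k * k

def window_fetches(w, h, k):
    """Off-chip words fetched if the scan window keeps its overlap:
    k*k cells for the first position of each scan line, then only one
    new k-cell column per step to the right."""
    per_line = k * k + (w - k) * k
    return (h - k + 1) * per_line
```

For a 22-by-11 image with a 3-by-3 window this gives 1620 vs. 594 fetches, a factor of about 2.7 from column reuse alone; storage scheme optimization and buffering of whole scan lines on-chip push the factor further.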

Parallelized Merged Buffer Linear Filter Application

with example image of x=22 by y=11 pixels

Design refinement: initial design -> hardware-level access optimization -> after scan line unrolling -> after inner scan line loop unrolling -> final design

Speed-up factor >11, due to MoMSW-based d-caching & storage scheme optimization

118

Processing 4-by-4 Reference Patterns

PISA DRC accelerator [ICCAD 1984]: a reconfigurable accelerator and a forerunner of the MoM.

DPLA: fabricated by the E.I.S. Multi University Project:

  • Mead-&-Conway nMOS Design Rules: 256 4-by-4 reference patterns
  • Mead-&-Conway CMOS Design Rules: >800 4-by-4 reference patterns

Reference patterns automatically generated from the design rules.

vN software: some reference patterns can be skipped, depending on earlier patterns.

MoM: all reference patterns matched in a single clock cycle.

1984: 1 DPLA replaces 256 FPGAs

120

Outline
  • The (non-v-N) anti-machine (Xputer)
  • Speed-up by address generators
  • Data-procedural Programming Language
  • Generalization of the Systolic Array
  • Partitioning Compilation Techniques
  • Design Space Exploration
  • Bridging the Paradigm Chasm

124

Significance of Address Generators
  • Address generators have the potential to reduce computation time significantly.
  • In a grid-based design rule check a speed-up of more than 2000 has been achieved*
  • reconfigurable address generators contributed a factor of 10, avoiding memory cycles for address computation overhead

*) 15,000 if the same algorithm is used

125

hardware vs. software perspective (for software people)

*) with soft cores and/or on-chip microprocessor
**) without soft cores

126

Ingredients

[Figure: the ingredient platforms, all multi-core:
  • CPU with program counter and RAM: for running legacy software
  • simple FPGA: rLBs, soft CPUs, on-chip BRAM
  • coarse-grained array: rDPUs with ASMs (data counters): the anti machine (Xputer), i.e. the Kress/Kung machine
  • platform FPGA: CPU with reconfigurable instruction set extension, hardwired special functions]

127

perspective ? what expertise needed ? hardware ?

  • microprocessor (also multi-core): von Neumann, software perspective
  • simple FPGA (fine-grained): hardware perspective
  • platform FPGA (domain-specific core assortment, embedded in FPGA fabrics): mishmash model, a nightmare for undergraduate studies, but by far the best optimization potential
  • coarse-grained reconfigurable array: software perspective
  • reconfigurable instruction set processor: mishmash model (s. a.)

128


Objectives

for every area which needs:

cheap, compact vHPC

rapid prototyping, field-patching, emulation

avoiding specific silicon

flexibility (for accelerators)

129


Conclusion (1)

Reconfigurable Computing opens many spectacular new horizons:

Cheap vHPC without needing specific silicon, no mask ....

Cheap embedded vHPC

Cheap desktop supercomputer (a new market)

Replacing expensive hardwired accelerators

Fast and cheap prototyping

Flexibility for systems with unstable multiple standards by dynamic reconfigurability

Supporting fault tolerance, self-repair and self-organization

Emulation logistics for very long term sparepart provision and part type count reduction (automotive, aerospace …)

Massive reduction of the electricity bill, locally and nationally

130


Conclusion (2)

Needed:

Universal vHPC co-architecture demonstrator

For widely spreading its use successfully:

The compilation tool problem to be solved

Language selection problem to be solved

Education backlog problems to be solved

Use this to develop a very good high school and undergraduate lab course

select killer applications for demo

A motivator: preparing for the top 500 contest

131

More compute power by Configware than Software

(a very cautious estimation)

  • 75% of all (micro)processors are embedded (4 : 1)
  • 25% of embedded µProcessors accelerated by FPGA(s) (1 : 4) -> every 2nd µProcessor accelerated by FPGA(s) (-> 1 : 1)
  • average acceleration factor >2 -> rMIPS* : MIPS > 2

*) rMIPS: MIPS replaced by FPGA compute power

Conclusion: most compute power from Configware (the difference is probably an order of magnitude)

132


Conclusion (3)

Self-Repair and Self-Organization methodology

Embedded r-emulation logistics methodology

Universal vHPC co-architecture demonstrator

For widely spreading its use successfully:

select a killer application for demo

133

some Goals

Universal HPC co-architecture for:

  • embedded vHPC (nomadic, automotive, ...)
  • desktop vHPC (scientific computing ...)

Application co-development environment for hardware non-experts, ....

Acceptability by software-type users, ...

Meet product lifetime >> embedded system life: FPGA emulation logistics from development down to maintenance and repair stations (examples: automotive, aerospace, industrial, ...)

134


SuperComputing 06

SuperComputing, Nov 11-17, 2006, Tampa, Florida, over 7000 registered attendees, and 274 exhibitors

Panel

Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm?

Tarek El-Ghazawi, The George Washington University

-

Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm?

Dave Bennett, Xilinx, Inc

-

Reconfigurable Computing: The Future of HPC

Daniel S. Poznanovic, SRC Computers, Inc.

-

Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm?

Allan J. Cantle, Nallatech Ltd.

-

Challenges for Reconfigurable Computing in HPC

Keith D. Underwood, Sandia National Laboratories

-

Reconfigurable Computing - Are We There Yet?

Rob Pennington, National Center for Supercomputing Applications

-

Reconfigurable Computing: The Road Ahead

Duncan Buell, University of South Carolina

-

Opportunities and Challenges with Reconfigurable HPC

Alan D. George, University of Florida

135

outline13
Outline
  • The (non-v-N) anti-machine (Xputer)
  • Speed-up by address generators
  • Data-procedural Programming Language
  • Generalization of the Systolic Array
  • Partitioning Compilation Techniques
  • Design Space Exploration
  • Bridging the Paradigm Chasm

136


Acceleration Mechanisms by ASM-based MoMSW
  • parallelism by multi-bank memory architecture
  • reconfigurable address computation, set up before run time
  • avoiding multiple accesses to the same data
  • avoiding memory cycles for address computation
  • improving parallelism by storage scheme transformations
  • minimizing data movement across chip boundaries

138
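The first two mechanisms can be illustrated with a toy skewed storage scheme (our own minimal example, not the MoM's actual scheme): mapping cell (x, y) to bank (x + y) mod N makes any horizontal or vertical run of N cells conflict-free, so a scan window row can be read from N banks in parallel instead of serializing on one bank:

```python
def bank_of(x, y, nbanks=4):
    """Skewed storage scheme: cell (x, y) is stored in bank (x + y) mod nbanks."""
    return (x + y) % nbanks
```

Any 4 consecutive cells of a row or of a column then land in 4 distinct banks; a straight row-major scheme (bank = x mod 4) would put a whole column into a single bank and serialize vertical window accesses.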




C or FORTRAN ?

Gordon Bell: "Computer scientists haven't been interested in programming clusters. If putting the cluster on a chip is what excites them, fine. It will still have to run Fortran!"

Reiner Hartenstein (conclusion of this talk): ... or C (X-C): it's a shorter leap. Classical programming languages, but with a slightly different (data-procedural) semantics, are good candidates for parallel programming. Support tools* have been demonstrated by academia.

*) like CoDe-X

142

Newton's 1st Law

Newton's 1st Law à la Gordon Bell: scientists do not change their direction

[Figure: illustration]

143

Dual paradigm: an old hat

Software mind set: instruction-stream-based: flow chart -> control instructions (FSM: state transition)

Mapped into a hardware mind set: action box = flipflop, decision box = (de)multiplexer -> Register Transfer Modules (DEC, mid-1970s); a similar concept at Case Western Reserve Univ.

[Figure: token bit evoking a flipflop (FF)]

145
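A minimal sketch of that mapping (Python standing in for hardware, purely as illustration): the software decision box selects a branch in time via control flow, while the hardware version computes both branches in space and the decision box degenerates to a multiplexer:

```python
def mux(sel, a, b):
    """2-to-1 multiplexer: the hardware image of a decision box."""
    return a if sel else b

def decide_software(cond, x):
    # instruction-stream mind set: control flow executes one branch
    if cond:
        return x + 1
    return x - 1

def decide_hardware(cond, x):
    # space-domain mind set: both datapaths exist permanently;
    # the condition merely steers the multiplexer
    return mux(cond, x + 1, x - 1)
```

Both formulations compute the same function; only the binding to time (branching) vs. space (wiring plus selection) differs.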

Dual paradigm: an old hat (2)

Hardware Description Language scene ~1970: "It is so simple! Why did it take 25 years to find out?"

  • Because of the reductionists' tunnel view
  • Because of a lack of transdisciplinary thinking

146

Dual paradigm: an old hat (3)

Hardware Description Languages use the "procedure call" or function call notation: call Module-name (parameters);

Software: time domain
Hardware description: space domain
147
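The duality can be made concrete with a multiply-accumulate example (an illustrative Python sketch, not an HDL): in the time domain one unit is reused step by step under instruction control; in the space domain the same computation is elaborated into a chain of per-coefficient stages:

```python
def mac_time(xs, ws):
    """Time domain: one multiply-accumulate unit, reused each step."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = acc + x * w
    return acc

def mac_space(xs, ws):
    """Space domain: instantiate one stage per coefficient, then wire
    the stages into a chain (the loops here only elaborate and traverse
    the structure, like module instantiation in an HDL)."""
    stages = [lambda acc, x, w=w: acc + x * w for w in ws]
    acc = 0
    for stage, x in zip(stages, xs):
        acc = stage(acc, x)
    return acc
```

`mac_time([1, 2, 3], [4, 5, 6])` and `mac_space([1, 2, 3], [4, 5, 6])` both return 32; the space version is what a configware compiler would lay out as a pipeline.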

ASM

148

Apropos HiPEAC: Software / Configware Co-Compilation

[Figure, spanning three eras:
  • mainframe age: C language source -> software compiler -> CPU: the von Neumann instruction-stream-based machine (program counter, RAM memory, von Neumann bottleneck)
  • microprocessor age: automatic parallelization by loop transformations; hardwired accelerator co-processors (DPUs)
  • configware age: software/configware co-compiler (CoDe-X, 1996): partitioner -> SW compiler (software for the CPU) and CW compiler (configware for the data-stream-based reconfigurable accelerator: an rDPU array)]

152

Jürgen Becker's CoDe-X -1 Co-Compiler

[Figure: X-C source -> Partitioner (with Analyzer / Profiler) -> GNU C compiler for the CPU (computer machine paradigm, also running legacy software) and X-C compiler for the Xputer (anti-machine paradigm)]

X-C is C language extended by MoPL

rALU: => array size: 1-by-1

153

Jürgen Becker's CoDe-X -2 Co-Compiler

[Figure: X-C source -> Partitioner (with Analyzer / Profiler) -> GNU C compiler for the CPU (computer machine paradigm) and X-C compiler with DPSS (anti-machine paradigm), supporting the KressArray family via resource parameters (rDPU array)]

X-C is C language extended by MoPL

Pipelining: a shorter leap

154

Jürgen Becker's CoDe-X -2 Co-Compiler

[Figure: the co-compilation flow targeting a heterogeneous multi-core built from dual-mode cores: CPU mode vs. rDPU mode]

X-C is C language extended by MoPL

155

hardware vs. software perspective (for software people)

*) with soft cores and/or on-chip microprocessor
**) without soft cores

158

Data meeting the Processing Unit (PU)

... partly explaining the RC paradox. We have 2 choices for the placement of the execution locality:

  • by Software: routing the data by memory-cycle-hungry instruction streams through shared memory
  • by Configware: a pipe network generated by configware compilation

159


END

163


A shorter leap

A shorter leap by coarse-grained platforms which allow a software-like pipelining perspective.

Avoiding the paradigm shift?

"It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?" (Tarek El-Ghazawi, panelist at SuperComputing 2006)

"A leap too far for the existing HPC community" (panelist Allan J. Cantle)

We need a bridge strategy: developing advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques.

165


A shorter leap by coarse-grained platforms which allow a software-like pipelining perspective.

Avoiding the paradigm shift? "A leap too far for the existing HPC community"

  • ... the promise of almost unimagined computing power
  • Have the hardware developers raced too far ahead of many programmers' ability to create software?
  • Parallel computing has been an esoteric skill limited to people involved with high-performance supercomputing. That is changing now that desktop computers and even laptops are going multicore.
  • "High-performance computing experts have learned to deal with this, but they are a fraction of the programmers," Saied says.
  • "In the future you won't be able to get a computer that's not multicore. As multicore chips become ubiquitous, all programmers will have to learn new tricks."
  • Even in high-performance computing there are areas that aren't yet ready for the new multicore machines.
  • "In industry, much of their high-performance code is not parallel," Saied says. "These corporations have a lot of time and money invested in their software, and they are rightly worried about having to re-engineer that code base."

We need a bridge strategy: developing advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques.

166

Avoiding the paradigm shift?

  • "Moore's Gap"
  • Steve Kirsch, an engineering fellow at Raytheon Systems Co., says that multicore computing presents both the dream of infinite computing power and the nightmare of programming.
  • "The real lesson here is that the hardware and software industries have to pay attention to each other," Kirsch says. "Their futures are tied together in a way that they haven't been in recent memory, and that will change the way both businesses will operate."

In February, Intel released research details about a chip with 80 cores: a fingernail-sized chip with the same processing power that in 1996 required a supercomputer with a 2,000-square-foot footprint and 1,000 times the electrical power.

This is a problem for those who depend on previously written software that has been steadily improving and evolving over decades: "Our legacy software is a real concern to us."

Parallel programming for multicore computers may require new computer languages: "Today we program in sequential languages. Do we need to express our algorithms at a higher level of abstraction? Research into these areas is critical to our success."

167

Avoiding the paradigm shift?

  • "Our programming languages researchers are exploring new programming paradigms and models," Hambrusch says. "Our course on multicore architectures is also preparing students for future software development positions. Purdue is clearly playing a defining role in this critical technology."

"In five or six years, laptop computers will have the same capabilities, and face the same obstacles, as today's supercomputers," Saied says. "This challenge will face people who program for desktop computers, too. People who think they have nothing to do with supercomputers and parallel processing will find out that they need these skills, too."

Remote Direct Memory Access (RDMA) is a technology that allows computers in a network to exchange data in main memory without involving the processor, cache, or operating system of either computer. Like locally based Direct Memory Access (DMA), RDMA improves throughput and performance because it frees up resources. RDMA also facilitates a faster data transfer rate. RDMA implements a transport protocol in the network interface card (NIC) hardware.

168

Avoiding the paradigm shift?

Three Ways to Make Multicore Work. Number 1, Mathematics: do more computational work with less data motion
  • E.g., higher-order methods: trade memory motion for more operations per word, producing an accurate answer in less elapsed time than lower-order methods
  • Different problem decompositions (no stratified solvers): the mathematical equivalent of loop fusion, e.g., nonlinear Schwarz methods
  • Ensemble calculations: compute ensemble values directly
  • It is time (really past time) to rethink algorithms for memory locality and latency tolerance
  • I didn't say threads. See, e.g., Edward A. Lee, "The Problem with Threads," Computer, vol. 39, no. 5, pp. 33-42, May 2006; Robert O'Callahan, "Night of the Living Threads", http://weblogs.mozillazine.org/roc/archives/2005/12/night_of_the_living_threads.html, 2005; John Ousterhout, "Why Threads Are A Bad Idea (for most purposes)" (~2004); Allen Holub, "If I were king: A proposal for fixing the Java programming language's threading problems", http://www128.ibm.com/developerworks/library/j-king.html, 2000. (Allen Holub has been working in the computer industry since 1979, is widely published in magazines (Dr. Dobb's Journal, Programmers Journal, Byte, MSJ, among others), and writes the "Java Toolbox" column for the online magazine JavaWorld.)

Breaking the Assumptions:
  • Don't have any off-chip memory. Consequence: need algorithms, programming models, and software tools that work in more limited memory (a few GB)
  • Have off-chip memory, but manage it more effectively. Consequence: need to find a true, general-purpose hardware/software model
  • Overlap latency with split operations. Consequence: need to find massive amounts of concurrency, and to manage the programming challenges of split operations (these are hard for programmers to use correctly; this may be an opportunity for formal methods)

Multicore doesn't just stress bandwidth, it increases the need for perfectly parallel algorithms:
  • All systems will look like attached processors: high latency, low (relative) bandwidth to main memory
  • 128 cores? "When [a] request for data from Core 1 results in an L1 cache miss, the request is sent to the L2 cache. If this request hits a modified line in the L1 data cache of Core 2, certain internal conditions may cause incorrect data to be returned to Core 1."
  • Everything does not double: traveling from New York to Chicago took 3 weeks before 1830, 1.5 days in 1857, and 6 hours now: only a factor of 6

MPI on Multi-Core: 340 ns MPI ping/pong latency; improvement will require better SWE tools. Benchmarks:
  • Ping-pong latency: ring-based ping-pong exchange between all nodes
  • Nearest-neighbor ghost-area exchange: test code from Argonne used to evaluate one-sided and point-to-point operations
  • CPU availability: calculates the percentage of CPU available at the receiver by doing a fixed amount of work during message arrival

169

in Memoriam ...

in Memoriam Stamatis Vassiliadis (1951 - 2007)

in Memoriam Richard Newton (1951 - 2007)

170

KressArray DPSS

published at ASP-DAC 1995

[Figure: KressArray Xplorer (Platform Design Space Explorer). Components: ALE-X Compiler (expression tree, intermediate forms, ALE-X code); Architecture Estimator; HDL Generator (VHDL, Verilog); Simulator; User Interface (selection, suggestion, design rules); Architecture Editor; Mapping Editor; Improvement Proposal Generator; Datapath Generator; Mapper; Scheduler (data stream schedule, delay estimation); Power Estimator (power data, statistical data); Inference Engine (FOX); Analyzer. The DPSS is driven by KressArray family parameters and produces the Kress rDPU layout.]

171

KressArray Family generic Fabrics: a few examples

[Figure: select the nearest-neighbour (NN) interconnect (an example); select mode, number, and width of NN ports (e.g. 2, 4, 8, 16, 24, 32); select the function repertory (rout-through only, or rout-through and function); more NN ports yield rich rout resources; examples of 2nd-level interconnect, layouted over the rDPU cell: no separate routing areas!]

http://kressarray.de

172