
Reconfigurable Supercomputing means to brave the paradigm chasm

HiPEAC Workshop on Reconfigurable Computing, Ghent, Belgium, January 28, 2007. Reconfigurable Supercomputing means to brave the paradigm chasm. Reiner Hartenstein. The von Neumann Syndrome: CS people are blind in the right eye and have tunnel vision in the left eye. Treatment is urgently needed.


Presentation Transcript


  1. HiPEAC Workshop on Reconfigurable Computing, Ghent, Belgium, January 28, 2007. Reconfigurable Supercomputing means to brave the paradigm chasm. Reiner Hartenstein

  2. The von Neumann Syndrome • CS people: blind in the right eye • Tunnel vision in the left eye • Treatment is urgently needed

  3. Mainstream in Embedded Systems for a Decade. [Chart: # of hits by Google for “FPGA and ….” queries: 647,000; 1,490,000; 398,000; 1,620,000; 915,000; 272,000; FPGA: 10,000,000. Source: http://hartenstein.de/pervasiveness.html] • Hardware people: computer architects, embedded system designers. The embedded systems scene is not imprisoned by the von Neumann paradigm trap.

  4. Everywhere in Scientific Computing. [Chart: # of hits by Google for “FPGA and ….” queries: 647,000; 171,000; 194,000; 1,490,000; 398,000; 127,000; 1,620,000; 113,000; 158,000; 915,000; 162,000; 272,000] The math/SW-savvy scene: educational deficits; unqualified for RC? Help is needed from hardware experts.

  5. Reconfigurable Supercomputing Revolution: Silicon Graphics RASC; Reconfigurable Computing at Microsoft; Cray XT4; Chuck Thacker

  6. Outline • The von Neumann Paradigm • Accelerators and FPGAs • The Reconfigurable Computing Paradox • The new Paradigm • Coarse-grained • Bridging the Paradigm Chasm • Conclusions

  7. Software Industry’s Secret of Success: a simple basic machine paradigm. The first archetype machine model, “von Neumann”: mainframe CPU; instruction-stream-based mind set; personalization: RAM-based; procedural personalization by compile or assemble. But now we live in the Configware Age.

  8. The von Neumann Paradigm Trap [Burks, Goldstine, von Neumann; 1946] • RAM (memory cells have addresses ….) • Program counter (auto-increment, jump, goto, branch) • Datapath Unit with ALU etc. • I/O unit, …. CS education got stuck in this paradigm trap, which stems from the technology of the 1940s. CS education’s right eye is blind, and its left eye suffers from tunnel vision. We need a dual paradigm approach.

  9. Von Neumann CPU (tunnel vision with the left eye). [Diagram: CPU consisting of a DPU and a program counter, attached to RAM memory] Program source: software. World of Software Engineering.

  10. Nick Tredennick’s Paradigm Shifts (slowly preparing to use both eyes for a dual paradigm point of view) • Early historic machines: resources fixed, algorithm fixed • Von Neumann (CPU): resources fixed, algorithm variable; 1 programming source needed: software

  11. Compilation: Software (von Neumann model). Software Engineering: source program → software compiler → software code, a sequential instruction schedule.

  12. Monstrous Steam Engines of Computing: power measured in tens of megawatts, floor space measured in tens of thousands of square feet. 5120 processors with 5000 pins each; crossbar weight: 220 t; 3000 km of thick cable; larger than a battleship; ready 2003.

  13. We are in a Computing Crisis *) feasible also with rDPA

  14. Outline • The von Neumann Paradigm • Accelerators and FPGAs • The Reconfigurable Computing Paradox • The new Paradigm • Coarse-grained • Bridging the Paradigm Chasm • Conclusions

  15. Von Neumann is not the common model. Mainframe age: the von Neumann instruction-stream-based machine (software; CPU = DPU with program counter; RAM memory; von Neumann bottleneck). Microprocessor age: CPU plus accelerator co-processors (hardware, data-stream-based).

  16. The clash of paradigms. Microprocessor age: the software / hardware chasm. Programmer (µprocessor): procedural; the basic mind set is instruction-stream-based. Hardware guy (accelerators): structural; a kind of data-stream-based mind set. A programmer does not understand function evaluation without machine mechanisms, i.e. without a program counter … We need a data-stream-based machine paradigm.

  17. Here is the contemporary common model. Mainframe age: the von Neumann instruction-stream-based machine (software; CPU = DPU with program counter; RAM memory; von Neumann bottleneck). Microprocessor age: CPU plus hardwired accelerator co-processors (hardware, data-stream-based). Now we are in the configware age: CPU plus reconfigurable accelerator.

  18. FPGAs in Supercomputing • Synergisms: coarse-grained parallelism through conventional parallel processing (CPU: DataPath Units, 32 bit or 64 bit, with program counter), • and: fine-grained parallelism through direct configware execution on the FPGAs (millions of rLBs, reconfigurable logic boxes 1 bit wide, embedded in a reconfigurable interconnect fabric)

  19. FPGA Modes of Operation (requiring new OS principles). Simple, static reconfigurability: configware code is loaded from external flash memory, e.g. after power-on (~milliseconds). [Timing diagram over time with the labels off, C ph, E ph; legend: C ph = Configuration phase, E ph = Execution phase]

  20. Illustrating a dynamically reconfigurable, partially reconfigurable FPGA. Swapping and scheduling of relocatable configware code macros is managed by a configware operating system. A configware OS is fundamentally different from a software OS. Configware OS: an established R&D area. Reconfigurable Computing at Microsoft: Microsoft ReconVista? [Timing diagram: module no. (module X, module Y, module Z) versus time, with C ph and E ph intervals for configware macros X, Y and Z; X configures Y]

  21. Outline • The von Neumann Paradigm • Accelerators and FPGAs • The Reconfigurable Computing Paradox • The new Paradigm • Coarse-grained • Bridging the Paradigm Chasm • Conclusions

  22. Deficiencies of reconfigurable fabrics (FPGA, fine-grained). [Chart: transistors / microchip (10^0 to 10^9) versus year (1980 to 2010); curves: Gordon Moore curve (microprocessor), FPGA physical, FPGA logical, FPGA routed] • density: immense area inefficiency • overhead: wiring overhead, reconfigurability overhead, routing congestion; overhead >> 10,000 • 1st DeHon’s Law [1996: Ph.D. thesis, MIT]: deficiency factor >10,000 • general purpose “simple” FPGA: power guzzler, slow clock

  23. This extreme area inefficiency holds only for “simple FPGAs”. “Platform FPGAs”, however, are a predefined mixture of powerful, hardwired resources (microprocessors, memory blocks, multipliers, etc.) embedded in FPGA fabrics.

  24. Software-to-Configware (FPGA) Migration: some published speed-up factors. Areas of success: image processing, pattern matching, multimedia; from high-end systems on earth to mission-critical systems in space. [Chart: relative performance (10^0 to 10^9) versus year (1980 to 2010); the memory wall: Microprocessor 50%/yr, Memory 7%/yr, X 2/yr; markers: 8080, Pentium 4. Application labels: DSP and wireless, Reed-Solomon decoding, real-time face detection, MAC, crypto, video-rate stereo vision, pattern recognition, Viterbi decoding, SPIHT wavelet-based image compression, Smith-Waterman pattern matching, molecular dynamics simulation, FFT, oil and gas, protein identification, BLAST, bioinformatics, GRAPE, astrophysics. Speed-up values shown: 2400, 6000, 1000, 1000, 400, 730, 900, 288, 457, 88, 100, 17, 40, 52, 20] The RC paradox: deficiency factor >10,000; speed-up factor 6,000; total discrepancy >60,000,000.
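The arithmetic behind the RC paradox on slide 24 can be checked with a short sketch. It multiplies the two factors quoted on the slide and, purely as an added illustration, compounds the two growth rates from the memory-wall curves. The function name `gap_after` and the 10-year horizon are assumptions made for this example, not figures from the talk.

```python
# Factors quoted on the slide.
AREA_DEFICIENCY = 10_000   # FPGA area deficiency factor (a lower bound)
BEST_SPEEDUP = 6_000       # largest speed-up factor named in the paradox

# The slide multiplies the two factors to obtain the total discrepancy.
total_discrepancy = AREA_DEFICIENCY * BEST_SPEEDUP
print(f"total discrepancy: >{total_discrepancy:,}")  # >60,000,000


def gap_after(years: int, cpu_growth: float = 0.50, mem_growth: float = 0.07) -> float:
    """Processor/memory performance ratio after `years` of compound growth.

    Defaults are the 50%/yr and 7%/yr rates from the memory-wall curves.
    The compounding model itself is an assumption of this sketch.
    """
    return ((1.0 + cpu_growth) / (1.0 + mem_growth)) ** years


# Hypothetical horizon, chosen only to show how fast the gap opens.
print(f"gap after 10 years: about {gap_after(10):.0f}x")
```

Under these assumed rates the gap grows by roughly a factor of 1.4 per year, which is why the chart labels it a wall rather than a slope.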

  25. Reconfigurable HPC • This area is almost 10 years old

  26. Understanding the RC Paradox? An executive summary doesn’t help. We must first understand the nature of the paradigm. von Neumann chickens?

  27. What is the reason for the paradox? The von Neumann Syndrome, resulting from decades of tunnel vision in CS R&D and education: the basic mind set is completely wrong. Moore’s law is not applicable to all aspects of VLSI. The law of Gates. “CPU: most flexible platform”? But >1000 CPUs running in parallel are the most inflexible platform. The Law of More: drastically declining programmer productivity. However, FPGA & rDPA are very flexible.

  28. Rapid Decline of Computational Density [BWRC, UC Berkeley, 2004; stolen from Bob Colwell]. [Chart: SPECfp2000/MHz/Billion Transistors (0 to 200) versus year (1990 to 2005) for CPUs (DPU with program counter): DEC alpha, IBM, SUN, HP; annotation: memory wall, caches, ...] alpha: down by 100 in 6 yrs; IBM: down by 20 in 6 yrs. Primary design goal: avoiding a paradigm shift. A dramatic demo of the von Neumann Syndrome.

  29. Avoiding the paradigm shift? “It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?” (Tarek El-Ghazawi, panelist at SuperComputing 2006). “A leap too far for the existing HPC community” (panelist Allan J. Cantle). We need a bridge strategy: developing advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques. A shorter leap: coarse-grained platforms, which allow a software-like pipelining perspective. SuperComputing, Nov 11-17, 2006, Tampa, Florida: over 7000 registered attendees and 274 exhibitors.

  30. Outline • The von Neumann Paradigm • Accelerators and FPGAs • The Reconfigurable Computing Paradox • The new Paradigm • Coarse-grained • Bridging the Paradigm Chasm • Conclusions

  31. We need a new machine paradigm: a data-stream-based mind set. A programmer does not understand function evaluation without machine mechanisms, i.e. without a program counter … We urgently need a data-stream-based machine paradigm. It was prepared almost 30 years ago. [Diagram: data streams of x values flowing through an array]

  32. Having introduced data streams: H. T. Kung, ~1980. Systolic array research throughout the 80s: Mathematicians’ hobby. The road map to HPC: ignored for decades. Execution is transport-triggered; no memory wall. [Diagram: input data streams enter the DPA (pipe network) and output data streams leave it; each “data stream” is drawn as port # versus time]
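A minimal sketch of the transport-triggered idea on slide 32: a linear pipe network in which each stage fires only because a data item arrives, with no program counter selecting the next operation. The function `run_pipe_network`, the three example stages, and the use of `None` as a pipeline bubble are all assumptions made for illustration; this is a software simulation, not the DPA hardware from the talk, and it assumes at least one stage.

```python
from typing import Callable, List, Optional, Sequence

Stage = Callable[[int], int]


def run_pipe_network(stages: Sequence[Stage], stream: Sequence[int]) -> List[int]:
    """Clock a data stream through a linear pipeline, one item per time step."""
    # One register behind each stage; None marks an empty slot (a bubble).
    regs: List[Optional[int]] = [None] * len(stages)
    out: List[int] = []
    # Append bubbles so the last real items are flushed out of the pipe.
    feed: List[Optional[int]] = list(stream) + [None] * len(stages)
    for x in feed:
        # The value that finished the last stage leaves on the output port.
        if regs[-1] is not None:
            out.append(regs[-1])
        # Shift back to front so every stage reads its predecessor's old value.
        for i in range(len(stages) - 1, 0, -1):
            prev = regs[i - 1]
            regs[i] = stages[i](prev) if prev is not None else None
        # The first stage is triggered by the arriving input item.
        regs[0] = stages[0](x) if x is not None else None
    return out


# Hypothetical three-stage datapath: add 1, double, subtract 3.
stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
result = run_pipe_network(stages, [1, 2, 3])
print(result)  # [1, 3, 5]
```

Once the pipe is full, all three stages work on different stream items in the same time step, which is the software-like pipelining perspective the talk keeps returning to.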

  33. Who generates the data streams? Mathematicians: it’s not our job. “Systolic” (it’s not algebraic). [Diagram: data streams entering and leaving the array]

  34. Without a sequencer … it’s not a machine. Reductionist approach (it’s not our job): mathematicians failed to invent the new machine paradigm.

  35. Synthesis method? Of course algebraic (linear projection): a reductionist approach, usable only for applications with regular data dependencies. Mathematicians were caught by their own paradigm trap. The super-systolic array is a generalization of the systolic array: in 1995 Rainer Kress discarded their algebraic synthesis methods and replaced them by simulated annealing: rDPA.

  36. The counterpart of the von Neumann machine: the Kress/Kung Anti Machine. Coarse-grained; data counters instead of a program counter. The data counters are located at the memory (not at the data path). ASM: Auto-Sequencing Memory (RAM plus a data counter: GAG). [Diagram: an (r)DPA surrounded by ASM blocks that exchange data streams with it]

  37. Machine models. von Neumann: CPU (DPU with program counter) attached to RAM memory. Anti machine: an array of rDPUs fed by RAMs with data counters; no instruction fetch at run time. *) “transport-triggered” **) does not have a program counter

  38. Nick Tredennick’s Paradigm Shifts • Early historic machines: resources fixed, algorithm fixed • Von Neumann (CPU): resources fixed, algorithm variable; 1 programming source needed: software • Reconfigurable Computing: resources variable, algorithm variable; 2 programming sources needed: configware and flowware

  39. Configware Compilation: fundamentally different from software compilation. Configware Engineering: source “program” → configware compiler, consisting of a mapper (placement & routing) and a data scheduler. Outputs: configware code for the rDPA pipe network, and flowware code for programming the data counters (GAGs) of the ASMs (Auto-Sequencing Memories) that generate the data streams.

  40. Data meeting the Processing Unit (PU) ... partly explaining the RC paradox. We have 2 choices • by software: routing the data by memory-cycle-hungry instruction streams through shared memory • by configware: placement of the execution locality ... a pipe network generated by configware compilation

  41. How much on-chip embedded BRAM? On-chip, LatticeCS series: 256 – 1704 BGA; 8 – 32 DPU (coarse-grained); 56 – 424 fast on-chip block RAMs (BRAMs)

  42. Generic Address Generator (GAG): the data counter; a generalization of the DMA. Acceleration factors by: • address computation without memory cycles: avoids e.g. 94% address computation overhead *) • storage scheme optimization methodology, etc. GAG & enabling technology published 1989; survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]; patented by TI 1995. *) Software to Xputer migration
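To make the data counter on slide 42 concrete, here is a hedged sketch of an address generator that walks a rectangular window of a two-dimensional array stored row by row. The function `gag_scan` and all of its parameters are hypothetical names chosen for this example; the published GAG is a hardware unit with a richer parameter set, and this Python generator only models the address sequence it would emit.

```python
from typing import Iterator


def gag_scan(base: int, row_len: int, x0: int, y0: int,
             nx: int, ny: int, dx: int = 1, dy: int = 1) -> Iterator[int]:
    """Yield linear addresses for a scan over an nx-by-ny window.

    base     start address of the array in memory
    row_len  number of words per stored row
    x0, y0   upper-left corner of the window
    dx, dy   step widths in x and y direction
    """
    for j in range(ny):
        for i in range(nx):
            # In hardware this would be counters and adders next to the
            # memory, so the datapath spends no cycles on address arithmetic.
            yield base + (y0 + j * dy) * row_len + (x0 + i * dx)


# Hypothetical memory: 5 rows of 8 words holding the values 100..139.
memory = list(range(100, 140))
addresses = list(gag_scan(base=0, row_len=8, x0=2, y0=1, nx=3, ny=2))
window = [memory[a] for a in addresses]
print(addresses)  # [10, 11, 12, 18, 19, 20]
print(window)     # [110, 111, 112, 118, 119, 120]
```

A conventional DMA covers the special case of one contiguous block (`ny=1`, `dx=1`), which is the sense in which the slide calls the GAG a generalization of the DMA.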

  43. Configware Industry’s Secret of Success: a simple basic machine paradigm. The 2nd “archetype” machine model, “Kress-Kung”: reconfigurable accelerator; data-stream-based mind set; personalization: RAM-based; structural personalization by compile.

  44. Outline • The von Neumann Paradigm • Accelerators and FPGAs • The Reconfigurable Computing Paradox • The new Paradigm • Coarse-grained • Bridging the Paradigm Chasm • Conclusions

  45. Coarse-grained Reconfigurable Array. SNN filter on a (supersystolic) KressArray (mainly a pipe network). Array size: 10 x 16 rDPUs; rDPU: reconfigurable Data Path Unit, 32 bits wide; no CPU. Compiled by Nageldinger’s KressArray Xplorer with Juergen Becker’s CoDe-X inside. Legend: rDPU not used; route-through only; backbus connect. Note: a software perspective without instruction streams: pipelining. Question after the talk: “but you can’t implement decisions!”

  46. Symptom of the von Neumann Syndrome. Question after the talk, from a high-level R&D manager of a large Japanese IT industry group: “but you can’t implement decisions!” This is what a single-paradigm mind set yields; an if clause turns into a multiplexer. Executive summary? Forget it! How about a microprocessor giant having >100 vice presidents? Shown: SNN filter on a (supersystolic) KressArray (mainly a pipe network); array size: 10 x 16 = 160 rDPUs; rDPU: reconfigurable Data Path Unit, e.g. 32 bits wide; no CPU. Legend: rDPU not used; route-through only; backbus connect. Note: a software perspective without instruction streams.
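The remark that an if clause turns into a multiplexer can be illustrated with a small sketch. The clamp operation and the names `clamp_branch`, `mux` and `clamp_dataflow` are hypothetical examples, not taken from the talk. The first function decides by control flow, as an instruction stream would; the second computes a select signal and routes one of two available values through a multiplexer, as a pipe network would.

```python
def clamp_branch(x: int, limit: int) -> int:
    """Instruction-stream view: the decision changes the control flow."""
    if x > limit:
        return limit
    return x


def mux(sel: int, a: int, b: int) -> int:
    """2-to-1 multiplexer: passes `a` when sel is 1 and `b` when sel is 0."""
    return sel * a + (1 - sel) * b


def clamp_dataflow(x: int, limit: int) -> int:
    """Data-stream view: both candidates exist as wires, a select picks one."""
    sel = int(x > limit)        # comparator output, 0 or 1
    return mux(sel, limit, x)   # no branch, only data routing


samples = [-5, 0, 7, 10, 11, 42]
print([clamp_dataflow(s, 10) for s in samples])  # [-5, 0, 7, 10, 10, 10]
```

Both versions return the same values; the difference is that the dataflow form has a fixed structure, so it can be placed once and then fed a stream of inputs.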

  47. Far fewer deficiencies with coarse-grained arrays. [Chart: transistors / microchip (10^0 to 10^9) versus year (1980 to 2010); curves: Gordon Moore curve (CPU: DPU with program counter), rDPA physical, rDPA logical] Area efficiency very close to Moore’s law. Hartenstein’s Law [1996: ISIS, Austin, TX]. Very compact configuration code: very fast reconfiguration.

  48. Dual Paradigm Application Development: Juergen Becker’s CoDe-X, 1996. C language source → Partitioner → SW compiler (for the CPU) and CW compiler (placement and routing, generating a pipe network on the rDPU array). Automatic parallelization by loop transformations.

  49. Data meeting the Processing Unit, by configware: placement of the execution locality ... a pipe network generated by configware compilation

  50. Hybrid Multi Core example: a twin paradigm machine with 64 cores; each core can run in CPU mode or rDPU mode. [Diagram: array of cores, 8 labeled CPU and the rest labeled rDPU] How about the microprocessor industry? Disabled for the paradigm shift? Customers refuse the paradigm shift? The twin paradigm provides the flexibility.
