Reiner Hartenstein University of Kaiserslautern

November 21, 2001, Tampere, Finland. Enabling Technologies for Reconfigurable Computing. Part 2: Stream-based Computing for RC. Wednesday, November 21, 10.30 – 12.00 hrs.

Schedule 2

>> EDA revolution • EDA revolution • Dead Supercomputer • Stream-based Computing • Stream-based Memory Architecture • Design Space Explorers • KressArray Xplorer • Machine paradigms • Co-Compilation http://www.uni-kl.de 3

EDA: where Electronics begins [Richard Newton] • NASDAQ index • EDA index • Dataquest Initiative • New book 4

[Richard Newton] 5

The End is near: The end of Hypergrowth? [Plot: transistors/chip (log scale, 10^0 to 10^15) vs. year to market (1960 to 2040); growth x100/decade, x1.6/year] 6

Development of Hypergrowth Markets [Harper Business 1995] • Mainstream • Tornado • Paradigm Shift 7

The next EDA Industry Revolution: EDA industry paradigm switching every 7 years • 2006 [Hartenstein] • 1999 (Co-) Compilation, Stream-based DPU arrays [Keutzer / Newton] • 1992 Synthesis: Cadence, Synopsys ... • 1985 Schematics entry: Daisy, Mentor, Valid ... • 1978 Transistor entry: Applicon, Calma, CV ... • Makimoto’s 3rd wave 8

Biggest Mistake in History 9

Innovation Stalled ? [Richard Newton] What is next after VHDL ? 10

What is next after VHDL ? Motivations • HDL-savvy designers needed • New Business Model • Co-Design never ending • HDLs ? • Extended HDLs – how far ? • Automatic Partitioning 11

>> Dead Supercomputer 12

Dead Supercomputer Society • 37 university and corporate R&D projects: 2 or 3 successes… • All the rest failed to work or to be successful (Research 1985-1995) 13

Dead Supercomputer Society [Gordon Bell, keynote at ISCA 2000] • ACRI • Alliant • American Supercomputer • Ametek • Applied Dynamics • Astronautics • BBN • CDC • Convex • Cray Computer • Cray Research • Culler-Harris • Culler Scientific • Cydrome • Dana/Ardent/Stellar/Stardent • DAP (ICL) • Denelcor • Elexsi • ETA Systems • Evans and Sutherland Computer • Floating Point Systems • Galaxy YH-1 • Goodyear Aerospace MPP • Gould NPL • Guiltech • Intel Scientific Computers • International Parallel Machines • Kendall Square Research • Key Computer Laboratories • MasPar • Meiko • Multiflow • Myrias • Numerix • Prisma • Tera • Thinking Machines • Saxpy • Scientific Computer Systems (SCS) • Soviet Supercomputers • Supertek • Supercomputer Systems • Suprenum • Vitesse Electronics 14


>> Stream-based Computing 16

Coarse Grain Reconfigurable Arrays vs. Parallel Processes • Parallelism at Process Level: ALUs, each with its own instruction sequencer (I-Seq) • Parallelism at Datapath Level: hardwired or reconfigurable, no instruction sequencing! [Figure: an array of I-Seq/ALU pairs next to an array of rALUs with Data and Sequencer blocks] 17

Concurrent Computing: CPU extremely inefficient [Figure: several DPUs, each with its own instruction sequencer, connected by bus(es) or switch box] 18

Stream-based Computing • DPUs driven by data stream from/to memory or from/to peripheral interface • no instruction sequencer inside! • transport-triggered execution 19
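The idea of DPUs that execute only when data arrives can be illustrated in software. The following is a minimal sketch, not the KressArray implementation: each DPU is modelled as a Python generator whose operation is fixed before run time, so the arrival of a data item is the only thing that triggers execution. The names `source`, `dpu`, `run_pipeline` and the three operations are hypothetical, chosen for illustration only.

```python
from typing import Callable, Iterable, Iterator, List


def source(values: Iterable[int]) -> Iterator[int]:
    """Data stream from memory or a peripheral interface (modelled here as a list)."""
    for v in values:
        yield v


def dpu(op: Callable[[int], int], stream: Iterator[int]) -> Iterator[int]:
    """Hypothetical DPU: its operation is configured once, before run time.

    It fires once per arriving data item; no instruction is fetched
    or decoded per item.
    """
    for item in stream:
        yield op(item)


def run_pipeline(values: Iterable[int]) -> List[int]:
    # Configure three DPUs once, then let the data stream drive them.
    stage1 = dpu(lambda x: x * 3, source(values))
    stage2 = dpu(lambda x: x + 1, stage1)
    stage3 = dpu(lambda x: x >> 1, stage2)
    return list(stage3)


result = run_pipeline([1, 2, 3, 4])
print(result)  # [2, 3, 5, 6]
```

The generators are evaluated lazily, so each item travels through all three stages as it is pulled, which loosely mirrors a pipeline driven by its data stream rather than by an instruction sequencer.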

Stream-based Computing: (r)DPU array driven by data streams, for both reconfigurable and hardwired 20

• avoiding address computation overhead • avoiding instruction fetch and interpretation overhead • high parallelism, massively multiple deep pipelines • much less configuration memory • no routing areas to configure functions from CLBs >>> extremely high efficiency 21

Systolic Stream-based Computing System • Systolic Array [H. T. Kung, 1980]: an array of DPUs (Data Path Units), no routing! • The Mathematician’s Synthesis Method: equations, then linear projection or algebraic mapping, then placement • linear pipelines and uniform arrays only [Figure: DPU architecture computing y + a * x; placement of coefficients a11 ... a33 in a 3 x 3 array with input data streams x1, x2, x3 and output data streams y1, y2, y3, each y stream starting from (0)] 22
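The figure labels (coefficients a11 to a33, streams x1 to x3 and y1 to y3, a DPU computing y + a * x) suggest a matrix-vector product. Under that assumption, the following is a minimal cycle-by-cycle simulation of a linear array of such DPUs. The data flow chosen here (x values shifting through the array, each y accumulating in place) is only one of several possible systolic schemes and may differ from the one on the slide; the function name `systolic_matvec` is hypothetical.

```python
from typing import List, Optional


def systolic_matvec(a: List[List[int]], x: List[int]) -> List[int]:
    """Simulate a linear array of DPUs, each computing y <- y + a * x per step."""
    n, m = len(a), len(x)
    y = [0] * n                                   # every accumulator starts at (0)
    pipe: List[Optional[int]] = [None] * n        # x value currently held by each DPU
    for t in range(m + n - 1):                    # enough steps to drain the pipeline
        # Shift the x stream one DPU onward; feed the next x into DPU 0.
        pipe = [x[t] if t < m else None] + pipe[:-1]
        for i in range(n):
            j = t - i                             # index of the x value now in DPU i
            if pipe[i] is not None and 0 <= j < m:
                y[i] += a[i][j] * pipe[i]
    return y


a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(systolic_matvec(a, [1, 0, 2]))  # [7, 16, 25]
```

Each DPU only talks to its neighbour, which is why no routing is needed: the placement alone fixes the communication.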

Computing in space and time: this dichotomy is completely ignored by our CS curricula • computing in space • computing in time • systolic arrays etc. • migration by re-timing and other transformations [Figure: placement of coefficients a11 ... a33 with data streams x1, x2, x3 and y1, y2, y3, each y stream starting from (0)] 23

General Stream-based Computing System: heterogeneous array of DPUs (data path units) • Kress DPSS [1995] • Mapper: simultaneous placement & routing by simulated annealing • Scheduler: data streams • free form pipe network • The same mapper for both: reconfigurable or hardwired [Figure: expression tree with operators +, *, -, sh, xf mapped onto DPU architectures] 24
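To make the mapper's use of simulated annealing more concrete, here is a small generic sketch. It is not the DPSS algorithm: it handles placement only, uses total Manhattan wire length as a stand-in for routing cost, and applies a simple linear cooling schedule. All names (`wire_length`, `anneal_placement`, the node names) and parameters are assumptions made for this illustration.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

Pos = Tuple[int, int]
Edge = Tuple[str, str]


def wire_length(place: Dict[str, Pos], edges: List[Edge]) -> int:
    """Cost: total Manhattan distance between connected operators."""
    return sum(abs(place[u][0] - place[v][0]) + abs(place[u][1] - place[v][1])
               for u, v in edges)


def anneal_placement(nodes: List[str], edges: List[Edge], rows: int, cols: int,
                     steps: int = 4000, t_start: float = 2.0,
                     seed: int = 0) -> Tuple[Dict[str, Pos], int]:
    rng = random.Random(seed)
    cells: List[Pos] = [(r, c) for r in range(rows) for c in range(cols)]
    if len(nodes) > len(cells):
        raise ValueError("array too small for this expression")
    # Random initial placement: one operator per cell, remaining cells empty.
    grid: Dict[Pos, Optional[str]] = {cell: None for cell in cells}
    for node, cell in zip(nodes, rng.sample(cells, len(nodes))):
        grid[cell] = node

    def current() -> Dict[str, Pos]:
        return {node: cell for cell, node in grid.items() if node is not None}

    cost = wire_length(current(), edges)
    best, best_cost = current(), cost
    for step in range(steps):
        temp = t_start * (1.0 - step / steps) + 1e-9   # linear cooling
        a, b = rng.sample(cells, 2)                    # move: swap two cells
        grid[a], grid[b] = grid[b], grid[a]
        new_cost = wire_length(current(), edges)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = current(), cost
        else:
            grid[a], grid[b] = grid[b], grid[a]        # undo the swap
    return best, best_cost


nodes = ["x_in", "mul", "add", "y_out"]
edges = [("x_in", "mul"), ("mul", "add"), ("add", "y_out")]
placement, cost = anneal_placement(nodes, edges, rows=2, cols=3)
print(placement, cost)
```

Because the cost function knows nothing about whether the target array is reconfigurable or hardwired, the same search can serve both, which is in the spirit of the slide's claim about one mapper for both cases.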

Converging Design Flows • terms: DPU: datapath unit • DPA: data path array • rDPU: reconfigurable DPU • rDPA: reconfigurable DPA • the same synthesis method may be used for mapping an algorithm onto both: rDPA [Kress, 1995], and DPA [Brodersen, 2000] • this synthesis method is a generalization of systolic array synthesis: super systolic synthesis 25

Super Pipe Networks: The key is mapping, rather than architecture *) • *) KressArray [1995] 26

>> Stream-based Memory Architecture 27

Hot Research Topic: Memory Architectures • High Performance Embedded Memory Architectures • High Performance Memory Communication Architectures [Herz] • Custom Memory Management Methodology [Catthoor] • Data Reuse Transformations [Kougia et al.] • Data Reuse Exploration [Soudris, Wuytack] 28

Processor-Memory Performance Gap (grows 50% / year) [Plot: performance (log scale, 1 to 1000) vs. year (1980 to 2000); µProc (CPU): 60%/yr.; DRAM: 7%/yr.] 29
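The "grows 50% / year" figure follows from the two growth rates on the plot, since the gap is the ratio of processor to DRAM performance. A short check, using only the rates given on the slide (the variable names are illustrative):

```python
cpu_growth = 1.60    # µProc: 60% per year (from the slide)
dram_growth = 1.07   # DRAM: 7% per year (from the slide)

# The gap is a ratio, so its yearly growth factor is the ratio of the two factors.
gap_growth = cpu_growth / dram_growth
print(f"gap grows by {100 * (gap_growth - 1):.1f}% per year")

# Accumulated over the 20 years on the plot's time axis (1980 to 2000).
gap_after_20_years = gap_growth ** 20
print(f"gap factor after 20 years: {gap_after_20_years:.0f}")
```

The yearly factor comes out just under 1.5, which matches the roughly 50% per year stated on the slide.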

RAs: Cache does not help • super pipe networks, no parallel computers! • Stream-based arrays are a memory bandwidth problem • the memory bandwidth problem is often more dramatic than for microprocessors • classical caches do not help, since instruction sequencing is not used • interleaving is not practicable, since it is based on sequential instruction streams • the problem: throughput of parallel data streams, not instruction streams 30

Synthesizable Memory Communication • Efficient Memory Communication should be directly supported by the Mapper Tools • An example by Nageldinger’s KressArray Xplorer: Optimized Parallel Memory Controller [Figure legend: sequencers, memory ports, application, not used] http://kressarray.de 31

The Disk Farm? or a System On a Card? [Gordon Bell, Jim Gray, ISCA2000] • The 500GB disc card (14") • LOTS of bandwidth • A few disks replaced by >10s Gbytes RAM and a processor • MicroDrive: 1.7” x 1.4” x 0.2” • 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek • 2006: 9 GB, 50 MB/s ? (1.6X/yr capacity, 1.4X/yr BW) • Integrated IRAM processor • 2x height • Connected via crossbar switch • growing like Moore’s law • 16 Mbytes; 1.6 Gflops; 6.4 Gops • 10,000+ nodes in one rack! • 100/board = 1 TB; 0.16 Tflops 32
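The 2006 projection for the MicroDrive can be reproduced from the 1999 values and the yearly growth factors quoted on the slide. The sketch below does that arithmetic; the helper name `extrapolate` is hypothetical, and reading "100/board" as 100 nodes of 1.6 Gflops each is an assumption that happens to be consistent with the stated 0.16 Tflops.

```python
def extrapolate(value: float, factor_per_year: float, years: int) -> float:
    """Compound growth: value * factor^years."""
    return value * factor_per_year ** years


years = 2006 - 1999

capacity_mb = extrapolate(340, 1.6, years)       # 340 MB in 1999, 1.6X/yr capacity
bandwidth_mb_s = extrapolate(5, 1.4, years)      # 5 MB/s in 1999, 1.4X/yr BW

print(f"2006 capacity:  {capacity_mb / 1000:.1f} GB")
print(f"2006 bandwidth: {bandwidth_mb_s:.1f} MB/s")

# Assumed reading of "100/board": 100 nodes with 1.6 Gflops each.
board_tflops = 100 * 1.6 / 1000
print(f"board: {board_tflops:.2f} Tflops")
```

Both projections land close to the slide's "9 GB, 50 MB/s".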

Memory Communication Architecture • hot research topic in embedded systems • storage context transformations [Herz, others] • for low power • for high performance • startups provide memory IP or generators 33

Stream-based Soft Machine [Figure: Compiler and Scheduler; Memory (data memory) consisting of several memory banks; Sequencers (data stream generator); rDPA; “instructions”] 34
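One way to picture the role of the sequencers in this figure is as address generators that turn memory banks into parallel data streams for the rDPA. The following is a minimal sketch under that assumption, not a description of the actual hardware: the address pattern (base, stride, count), the function names, and the multiply-accumulate used as a stand-in for a configured rDPA are all hypothetical.

```python
from typing import Iterator, List


def data_sequencer(base: int, stride: int, count: int) -> Iterator[int]:
    """Hypothetical data stream generator: emits a fixed address sequence.

    The pattern is set up before run time, so the datapath itself
    performs no address computation.
    """
    for k in range(count):
        yield base + k * stride


def stream_from_bank(bank: List[int], addresses: Iterator[int]) -> Iterator[int]:
    """Read one memory bank in the order dictated by its sequencer."""
    for addr in addresses:
        yield bank[addr]


def rdpa(xs: Iterator[int], ys: Iterator[int]) -> Iterator[int]:
    """Stand-in for a configured rDPA: multiply-accumulate over two parallel streams."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y
        yield acc


bank_a = [1, 2, 3, 4, 5, 6, 7, 8]
bank_b = [10, 20, 30, 40, 50, 60, 70, 80]

# One sequencer per memory bank, each producing its own data stream.
stream_a = stream_from_bank(bank_a, data_sequencer(base=0, stride=2, count=4))
stream_b = stream_from_bank(bank_b, data_sequencer(base=1, stride=2, count=4))

results = list(rdpa(stream_a, stream_b))
print(results)  # [20, 140, 440, 1000]
```

With one sequencer per bank, the streams can be supplied in parallel, which is the throughput the memory architecture has to deliver.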

>> Design Space Explorers 35

general purpose is unrealistic • fully general purpose reconfigurable sometimes is .... an Illusion... • just as the general purpose massively parallel computer system • domain-specific Reconfigurable Platforms will be suitable to cope with the 2nd Design Crisis • KressArray Explorer... 36

Universal RAs: is it feasible? • Motivation: The General Purpose (coarse grain) Reconfigurable Array appears to be an Illusion ... • ... such as obviously also the Universal Massively Parallel Computer Architecture ... • counter-example: Application Domain of Image Processing 37

-> Design Space Exploration • Design Space Exploration • Design Space Explorers (DSEs) • Platform Space Explorers (PSEs) • Compiler / PSE symbiosis • Parallel computing vs. reconfigurable 38

Design Space Exploration Systems 39

DSEs: an overview • for VLSI design in general • for parallel Computer Systems • Xplorer: the only one for reconfigurable platforms (also MATRIX ?) 40

>> KressArray Xplorer 41

KressArray Xplorer (Platform Design Space Explorer) • KressArray DPSS published at ASP-DAC 1995 [Figure: block diagram with the components Application Set, ALE-X Compiler (ALE-X Code, expression tree, intermediate forms 2 and 3), Architecture Estimator, Architecture Editor, User Interface, Mapper, Mapping Editor, Scheduler (data stream schedule), Datapath Generator, Kress rDPU Layout, HDL Generator (VHDL, Verilog), Simulator, Delay Estimator, Power Estimator, Analyzer (statistical data, power data), Inference Engine (FOX), Improvement Proposal Generator (suggestion, selection by user), Design Rules, KressArray family parameters] 42

KressArray (Design Space) Platform Space Explorer [Figure: Application Set and Source Input feed the Xplorer with KressArray DPSS, Datastream Generator, Architecture & Mapping Editor, Datapath Generator, HDL Generator, Simulator, Delay & Power Estimator, Statistics, Improvement Proposal Generator, User] http://kressarray.de 43

Nageldinger’s KressArray Design Space Xplorer: Design Flow of Domain-specific Architecture Optimization • including a Fuzzy Logic Improvement Proposal Generator • accessible by internet: http://kressarray.de • runs best with Netscape 4.6.1 44

KressArray Design Space Xplorer [Figure: tool flow with User, User Interface, ALE-X Code (.alex), ALE-X Compiler, Intermediate Format (.krs), DPSS-N Data Path Synthesis System (Mapping: Technology Mapping, Placement & Routing, Data Sequencing Scheduler), Editor / Mapper, Architecture Estimation, Analyser, Statistical Data (.stat), Intermediate Format including Data (.map), configware code (.seq), HDL Generator, HDL Description (.v) to Synthesis Environment, Module Generator, Kress rDPU Layout, Kress Library, Kress IP, other IP] 45

>> Machine paradigms 46

Computer: the wrong Machine Paradigm (“von Neumann”) • Compiler, Memory, instructions, Sequencer with program counter, hardwired Datapath • tightly coupled by compact instruction code • does not support soft data paths • Xputer: The Soft Machine Paradigm • Compiler, Scheduler, Memory, “instructions”, multiple sequencer with data counter, reconfigurable Datapath Array, state register • loosely coupled by decision data bits only • also for hardwired 47

Soft Machine Paradigm • Xputer: Compiler, Scheduler, Memory, “instructions”, Sequencer with data counter, reconfigurable Datapath • Parallel Xputer: Compiler, Scheduler, multiple memories, “instructions”, multiple Sequencers with data counters, reconfigurable Datapath • Decision data only, i.e., loose coupling 48

Computer: the wrong Machine Paradigm (“von Neumann”) • Compiler, Memory, instructions, Decoder, Instruction Sequencer with program counter, hardwired Datapath • tightly coupled by a compact instruction code • does not support soft data paths • reconfigurable: at run time, no instruction fetch 49

Machine Paradigms 50
