Reconfigurable Computing and the von Neumann Syndrome (presentation transcript)
Reconfigurable Computing and the von Neumann Syndrome
Reiner Hartenstein

Presentation Transcript
Questions?
  • Who is familiar with FPGAs? Is programming them easy?
  • Who is familiar with systolic arrays?
  • The duality: data streams vs. instruction streams?
  • Programming a multicore microprocessor: will it be easy?
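The data-stream vs. instruction-stream duality asked about above can be made concrete with a small sketch (hypothetical Python, not from the talk): the same multiply-accumulate kernel written once as an instruction-stream loop and once as a two-stage data-stream pipeline.

```python
# Hypothetical sketch: the same MAC (multiply-accumulate) kernel seen from
# both sides of the duality the slide asks about.

def mac_instruction_stream(a, b):
    """von Neumann style: a program counter steps through instructions,
    fetching and executing one operation per loop iteration."""
    acc = 0
    for i in range(len(a)):      # the instruction stream controls the order
        acc += a[i] * b[i]
    return acc

def mac_data_stream(a, b):
    """Anti-machine style: no instruction fetch at run time; the operands
    flow as streams through a fixed multiply stage and an accumulate stage."""
    products = (x * y for x, y in zip(a, b))   # stage 1: multiplier
    acc = 0
    for p in products:                         # stage 2: accumulator
        acc += p
    return acc
```

Both views compute the same result; only the control mechanism differs.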

Outline
  • The Pervasiveness of FPGAs
  • The Reconfigurable Computing Paradox
  • The Gordon Moore gap
  • The von Neumann syndrome
  • We need a dual paradigm approach
  • Conclusions

Pervasiveness of RC
http://hartenstein.de/pervasiveness.html
http://www.fpl.uni-kl.de/RCeducation08/pervasiveness.html

RCeducation 2008
The 3rd International Workshop on Reconfigurable Computing Education
April 10, 2008, Montpellier, France
http://www.fpl.uni-kl.de/RCeducation08/

Section notes (on repeated Outline slides):
  • the hardware / software chasm, the configware / software chasm
  • the instruction stream tunnel
  • the overhead-prone paradigm
  • instruction stream vs. data stream
  • bridging the chasm: an old hat
  • stubborn curriculum task forces

RC education
http://www.fpl.uni-kl.de/RCeducation/
http://www.fpl.uni-kl.de/RCeducation08/pervasiveness.html

Section notes: platform FPGAs, coarse-grained arrays, saving energy

FPGA with island architecture
(figure: reconfigurable logic boxes embedded in reconfigurable interconnect fabrics, with switch boxes and connect boxes)

Deficiencies of reconfigurable fabrics (FPGA, fine-grained)
(chart: transistors per microchip, 10^0 to 10^9, vs. year 1980 - 2010; the Gordon Moore curve on top, the microprocessor around 10^6, the routed FPGA far below)
  • density: physical vs. logical vs. routed FPGA differ by a factor >> 10,000
  • overhead: reconfigurability overhead, wiring overhead, routing congestion, immense area inefficiency
  • deficiency factor: >10,000 (DeHon's 1st Law [1996: Ph.D. thesis, MIT])
  • general-purpose "simple" FPGA: power guzzler, slow clock

Software-to-Configware (FPGA) Migration:
some published speed-up factors [2003 – 2005]

(chart: speed-up factor on a log scale, 10^0 to 10^6, vs. year 1980 - 2010, growing roughly x 2/yr)

Application domains: image processing / pattern matching / multimedia; DSP and wireless; bioinformatics. Published examples include Reed-Solomon decoding, Viterbi decoding, FFT, MAC, crypto, SPIHT wavelet-based image compression, Smith-Waterman pattern matching, pattern recognition, real-time face detection, video-rate stereo vision, oil and gas (17), GRAPE astrophysics, BLAST, molecular dynamics simulation, and protein identification, with speed-up factors ranging from 17 up to 6,000.

The RC paradox: a deficiency factor >10,000 combined with a speed-up factor of 6,000 means a total discrepancy of >60,000,000.
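The paradox reduces to a one-line calculation; a minimal sketch using only the factors stated on the slides:

```python
# Quick arithmetic behind the "RC paradox" figures quoted on the slides:
# FPGAs pay an area/wiring deficiency factor of >10,000 relative to
# hardwired logic, yet still deliver speed-ups of up to 6,000 over CPUs.
deficiency_factor = 10_000   # transistor-level inefficiency of the FPGA fabric
speedup_factor = 6_000       # best published software-to-FPGA speed-up

# The instruction-stream machine must therefore be less efficient than its
# raw transistor count suggests by the product of the two factors:
total_discrepancy = deficiency_factor * speedup_factor
print(total_discrepancy)     # 60000000, the ">60,000,000" on the slide
```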




Software-to-Configware (FPGA) Migration:
some published speed-up factors [2003 – 2005]
  • These examples worked fine with on-chip memory
  • There are other algorithms that are more difficult to accelerate …
  • … where d-caching might be useful (ASM)

How much on-chip embedded BRAM?
(table: on-chip LatticeCS series; fast on-chip block RAMs (BRAMs); coarse-grained DPUs; figures shown: 256 – 1704 BGA, 8 – 32, 56 – 424)

Coarse-grained Reconfigurable Array
SNN filter on a (supersystolic) KressArray: mainly a pipe network, no CPU.
  • rDPU: reconfigurable Data Path Unit, 32 bits wide
  • array size: 10 x 16 rDPUs (some cells serve as route-through only, some are not used; backbus connect)
  • compiled by Nageldinger's KressArray Xplorer, with Juergen Becker's CoDe-X inside
  • note: a software perspective without instruction streams: pipelining
  • question after the talk: "but you can't implement decisions!"

Much fewer deficiencies with coarse-grained arrays
(figure: CPU = program counter + DPU, vs. an rDPA where the logical and the physical rDPU arrays coincide)
(chart: transistors per microchip, 10^0 to 10^9, vs. year 1980 - 2010, with the Gordon Moore curve)
  • area efficiency very close to Moore's law: Hartenstein's Law [1996: ISIS, Austin, TX]
  • very compact configuration code: very fast reconfiguration

Software-to-Configware (FPGA) Migration: Oil and gas [2005]
(chart: speed-up factor, 10^0 to 10^6, vs. year 1980 - 2010, x 2/yr; oil and gas: 17)
Side effect: slashing the electricity bill by more than an order of magnitude.

An accidentally discovered side effect [Herb Riley, R. Associates]
  • Software-to-FPGA migration of an oil and gas application: speed-up factor of 17
  • electricity bill down to <10%; hardware cost down to <10%
  • saves > $10,000 in electricity bills per year (at 7¢ / kWh) per 64-processor 19" rack
  • all other publications reporting speed-ups did not report energy consumption. This will change.
What about higher speed-up factors? More dramatic electricity savings? ($70 in 2010?)
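The electricity figure can be sanity-checked with a short back-of-envelope calculation (my arithmetic, assuming only the slide's $10,000/year and 7¢/kWh numbers):

```python
# Sanity check of the claim that one 64-processor 19" rack burns
# more than $10,000 of electricity per year at 7 cents / kWh.
price_per_kwh = 0.07                 # $ / kWh, as quoted on the slide
annual_bill = 10_000                 # $ / year (the ">$10,000" figure)

hours_per_year = 365 * 24            # 8760 h
kwh_per_year = annual_bill / price_per_kwh
avg_power_kw = kwh_per_year / hours_per_year

print(round(kwh_per_year))           # ~142857 kWh/year
print(round(avg_power_kw, 1))        # ~16.3 kW sustained, i.e. roughly 255 W per processor
```

So the quoted bill corresponds to a rack drawing about 16 kW around the clock, which is a plausible order of magnitude.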

What’s Really Going On With Oil Prices? [BusinessWeek, January 29, 2007]
  • $52: price of delivery in February 2007 [New York Mercantile Exchange: Jan. 17]
  • $200: minimum oil price in 2010, in a bet by investment banker Matthew Simmons

Energy as a strategic issue
  • Google's annual electricity bill: $50,000,000
  • Amsterdam's electricity: 25% goes into server farms
  • NY city server farms: 1/4 km² of building floor area
  • predicted for the USA in 2020: 30 - 50% of the entire national electricity consumption goes into the cyber infrastructure [Mark P. Mills]
  • petaFLOPS supercomputer (by 2012?): extreme power consumption

Energy: an important motivation
*) feasible also on reconfigurable platforms

Section notes: the von Neumann syndrome & the multicore crisis

What is the reason for the paradox?
  • Moore's law is not applicable to all aspects of VLSI (the law of Gates)
  • The Gordon Moore curve does not indicate performance
  • The peak clock frequency does not indicate performance

Rapid Decline of Computational Density [BWRC, UC Berkeley, 2004] [stolen from Bob Colwell]
(chart: SPECfp2000 / MHz / billion transistors, 0 - 200, vs. year 1990 - 2005, for DEC Alpha, IBM, SUN, and HP CPUs; memory wall, caches, ...)
  • a dramatic demo of the von Neumann syndrome; primary design goal: avoiding a paradigm shift
  • Alpha: down by 100 in 6 years; IBM: down by 20 in 6 years

Monstrous Steam Engines of Computing
  • 5120 processors, 5000 pins each; crossbar weight: 220 t; 3000 km of thick cable
  • power measured in tens of megawatts; floor space measured in tens of thousands of square feet
  • larger than a battleship; ready 2003

Dead Supercomputer Society
Research 1985 – 1995 [Gordon Bell, keynote ISCA 2000]
ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, DAPP, Denelcor, Elexsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, ICL, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics

We are in a Computing Crisis
  • supercomputing crisis; microprocessor crisis
  • MPP parallelism does not scale
  • going multicore
*) feasible also with rDPA

The von Neumann Paradigm Trap [Burks, Goldstine, von Neumann; 1946]
  • RAM (memory cells have addresses ...)
  • program counter (auto-increment, jump, goto, branch)
  • datapath unit with ALU etc.
  • I/O unit, ...
CS education got stuck in this paradigm trap, which stems from the technology of the 1940s. CS education's right eye is blind, and its left eye suffers from tunnel view. We need a dual paradigm approach.

What is the reason for the paradox? The von Neumann syndrome
  • the Law of More: drastically declining programmer productivity
  • the result of decades of tunnel view in CS R&D and education: the basic mind set is completely wrong
  • "the CPU is the most flexible platform"? >1000 CPUs running in parallel are the most inflexible platform; FPGAs & rDPAs, however, are very flexible

Understanding the Paradox?
  • an executive summary doesn't help: we must first understand the nature of the paradigm
  • von Neumann chickens?

Von Neumann CPU (tunnel view with the left eye)
(figure: CPU = program counter + DPU, attached to RAM memory)
Program source: software; the world of software engineering.

von Neumann is not the common model
(figure: mainframe age: the von Neumann instruction-stream-based machine, a CPU with program counter and RAM memory behind the von Neumann bottleneck; microprocessor age: the CPU plus hardware accelerator co-processors with data-stream-based DPUs)
  • software: instruction-stream-based; hardware accelerators: data-stream-based

Here is the contemporary common model
(figure: as before, but now, in the configware age, the CPU is complemented by both hardwired and reconfigurable accelerators)

Machine models
  • von Neumann machine: CPU = program counter + DPU, with RAM memory
  • anti machine: "transport-triggered"; has no program counter, hence no instruction fetch at run time
(figure: a DPU or rDPU array fed by multiple RAMs, each with its own data counter)

Nick Tredennick's Paradigm Shifts
(slowly preparing to use both eyes, for a dual paradigm point of view)
  • early historic machines: resources fixed, algorithm fixed
  • von Neumann CPU: resources fixed, algorithm variable; 1 programming source needed (software)

Compilation: Software (von Neumann model)
Software engineering: a source program goes through the compiler into software code, a sequential instruction schedule (Befehls-Fahrplan, i.e. an instruction timetable).

Nick Tredennick's Paradigm Shifts (continued)
  • early historic machines: resources fixed, algorithm fixed
  • von Neumann CPU: resources fixed, algorithm variable; 1 programming source needed (software)
  • Reconfigurable Computing: resources variable, algorithm variable; 2 programming sources needed (configware and flowware)

Configware Compilation
Configware engineering: a source "program" (C, FORTRAN, MATLAB) enters the configware compiler, which is fundamentally different from a software compiler:
  • the mapper performs placement & routing and emits configware code for the rDPA (a pipe network)
  • the scheduler programs the data counters and emits flowware code for the data streams
  • the data streams are generated by ASMs (Auto-Sequencing Memories): RAM blocks, each with a GAG and a data counter

The first archetype machine model: "von Neumann"
The software industry's secret of success: a simple basic machine paradigm with an instruction-stream-based mind set; procedural personalization: compile or assemble; the personalization is RAM-based (mainframe / CPU).
But now we live in the Configware Age.

Synthesis Method?
  • the reductionist approach: of course algebraic (linear projection); only for applications with regular data dependencies; mathematicians caught by their own paradigm trap
  • 1995: Rainer Kress discarded their algebraic synthesis methods and replaced them with simulated annealing: the rDPA
  • the super-systolic array: a generalization of the systolic array

Having introduced Data Streams [H. T. Kung, ~1980]
(figure: input and output data streams, spread over time and port #, flowing through a DPA pipe network; execution is transport-triggered)
  • no memory wall
  • the road map to HPC: ignored for decades
  • systolic array research throughout the 80ies: the mathematicians' hobby

Who generates the Data Streams?
(figure: "systolic" data streams entering and leaving the array)
Mathematicians: "it's not our job" (it's not algebraic).

Without a sequencer ...
  • the reductionist approach ("it's not our job"): resources without a sequencer are not a machine
  • mathematicians missed inventing the new machine paradigm ... the anti machine

The counterpart of the von Neumann machine: the Kress/Kung Anti Machine
(figure: a coarse-grained (r)DPA surrounded by ASMs; ASM: Auto-Sequencing Memory, a RAM block with a GAG and a data counter)
  • data counters instead of a program counter
  • data counters are located at the memory, not at the data path

Acceleration Mechanisms by ASM-based MoMSW
  • parallelism by a multi-bank memory architecture
  • reconfigurable address computation (before run time)
  • avoiding multiple accesses to the same data
  • avoiding memory cycles for address computation
  • improved parallelism by storage scheme transformations
  • minimized data movement across chip boundaries
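As an illustration of the first mechanism (parallelism by a multi-bank memory architecture), here is a minimal sketch of low-order bank interleaving; the function names are hypothetical, not from the MoM design:

```python
# Hypothetical sketch of the multi-bank parallelism idea: interleave an
# array across N banks so that N consecutive elements can be fetched in
# one parallel access instead of N sequential memory cycles.
def bank_of(addr, n_banks=4):
    """Low-order interleaving: the bank is selected by the address LSBs."""
    return addr % n_banks

def conflict_free(addrs, n_banks=4):
    """True if all addresses hit distinct banks (one parallel access)."""
    banks = [bank_of(a, n_banks) for a in addrs]
    return len(set(banks)) == len(banks)

# A unit-stride burst of 4 words is conflict-free...
print(conflict_free([8, 9, 10, 11]))   # True
# ...whereas stride-4 accesses all collide in bank 0:
print(conflict_free([0, 4, 8, 12]))    # False
```

The second case is exactly what the "storage scheme transformations" bullet targets: re-mapping data so that the access pattern becomes conflict-free.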

FPGAs in Supercomputing
  • synergisms: coarse-grained parallelism through conventional parallel processing (CPUs with program counters and 32/64-bit DataPath Units),
  • and: fine-grained parallelism through direct configware execution on the FPGAs (millions of 1-bit reconfigurable logic boxes embedded in a reconfigurable interconnect fabric)

Anti machine
  • hardwired anti machine: resources and memory, with data counters as sequencers; algorithms are expressed in flowware
  • reconfigurable anti machine: the same, plus configware to program the variable resources

von Neumann machine
  • machine = resources + sequencer: memory with a program counter; algorithms are expressed in software

The clash of paradigms: the software / hardware chasm
  • microprocessor age: the programmer's mind set is procedural, instruction-stream-based (µprocessor); the hardware guy's is structural, a kind of data-stream-based mind set (accelerators)
  • a programmer does not understand function evaluation without machine mechanisms, i.e. without a program counter ...
  • we need a data-stream-based machine paradigm

Xputer Principles
(figure: Xputer = ASM with reconfigurable address generators + rALU with a reconfigurable data path, e.g. the DPLA, next to a CPU)
  • contemporary? 1984: the first FPGAs were very tiny & very expensive
  • we used the VAX-11/750 of my group

Outline
  • The von Neumann Paradigm
  • Accelerators and FPGAs
  • The Reconfigurable Computing Paradox
  • The new Paradigm
  • Coarse-grained
  • Bridging the Paradigm Chasm
  • Conclusions

FPGA Modes of Operation
(timeline: configuration phases (C ph) alternate with execution phases (E ph), with "off" in between)
  • simple, static reconfigurability: configware code is loaded from external flash memory, e.g. after power-on (~milliseconds)
  • dynamic reconfigurability requires new OS principles

Illustrating "dynamically reconfigurable"
(timeline: per module no., configuration phases (C ph) and execution phases (E ph) of configware macros X, Y, and Z overlap on a partially reconfigurable FPGA; module X configures module Y)
  • swapping and scheduling of relocatable configware code macros are managed by a configware operating system
  • a configware OS is fundamentally different from a software OS; an established R&D area
  • Reconfigurable Computing at Microsoft: a Microsoft ReconVista?

Outline
  • The von Neumann Paradigm
  • Accelerators and FPGAs
  • The Reconfigurable Computing Paradox
  • The new Paradigm
  • Coarse-grained
  • Bridging the Paradigm Chasm
  • Conclusions

Reconfigurable HPC
  • This area is almost 10 years old

Have to re-think basic assumptions
  • instead of physical limits, fundamental misconceptions of algorithmic complexity theory limit progress and will necessitate new breakthroughs
  • it is not processing that is costly, but moving data and messages
  • we have to re-think the basic assumptions behind computing

Illustrating the von Neumann paradigm trap: the watering pot model [Hartenstein]
  • the instruction-stream-based approach has a von Neumann bottleneck
  • the data-stream-based approach (many watering pots) has no von Neumann bottleneck


Outline
  • The (non-v-N) anti-machine (Xputer)
  • Speed-up by address generators
  • Data-procedural Programming Language
  • Generalization of the Systolic Array
  • Partitioning Compilation Techniques
  • Design Space Exploration
  • Bridging the Paradigm Chasm

More compute power by Configware than Software (a very cautious estimation)
  • 75% of all (micro)processors are embedded (4 : 1)
  • 25% of the embedded µprocessors are accelerated by FPGA(s) (1 : 4)
  • -> every 2nd µprocessor accelerated by FPGA(s) (-> 1 : 1)
  • average acceleration factor >2: -> rMIPS* : MIPS > 2
*) rMIPS: MIPS replaced by FPGA compute power
Conclusion: most compute power comes from configware (the difference is probably an order of magnitude)

Programming Language Paradigms: Principles of MoPL [1994]
  • very easy to learn
  • multiple GAGs

Avoiding the paradigm shift?
  • "It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?" (Tarek El-Ghazawi, panelist at SuperComputing 2006)
  • "A leap too far for the existing HPC community" (panelist Allan J. Cantle)
  • we need a bridge strategy: develop advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques
  • a shorter leap: coarse-grained platforms, which allow a software-like pipelining perspective

SuperComputing, Nov 11-17, 2006, Tampa, Florida: over 7,000 registered attendees and 274 exhibitors


We need a new machine paradigm
(figure: data streams flowing through an array: the data-stream-based mind set)
  • a programmer does not understand function evaluation without machine mechanisms, i.e. without a program counter ...
  • we urgently need a data-stream-based machine paradigm
  • it was prepared almost 30 years ago

Generic Address Generator (GAG)
A generalization of the DMA. Acceleration factors by:
  • address computation without memory cycles: avoids e.g. a 94% address computation overhead*
  • storage scheme optimization methodology, etc.
GAG & enabling technology published 1989; survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]; patented by TI 1995
*) software to Xputer migration
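What a GAG computes can be sketched in a few lines; this is an illustrative model under my own assumptions (names and parameters are hypothetical), not the published GAG design:

```python
# Hedged sketch of what a generic address generator (GAG) does: it emits a
# whole 2-D address stream from a handful of pre-configured parameters, so
# no instruction stream and no memory cycles are spent computing addresses
# at run time.
def gag_block_scan(base, width, x0, y0, dx, dy):
    """Yield linear addresses of a dx-by-dy rectangular scan starting at
    (x0, y0) in a row-major 2-D memory of the given width."""
    for y in range(y0, y0 + dy):
        for x in range(x0, x0 + dx):
            yield base + y * width + x

# A 3x2 scan window at (1,1) in an 8-word-wide memory:
print(list(gag_block_scan(0, 8, 1, 1, 3, 2)))   # [9, 10, 11, 17, 18, 19]
```

Once the six parameters are configured, the generator runs autonomously: this is the "reconfigurable address computation before run time" idea.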

The 2nd "archetype" machine model: "Kress-Kung"
The configware industry's secret of success: a simple basic machine paradigm with a data-stream-based mind set; structural personalization by compilation; the personalization is RAM-based (reconfigurable accelerator).


Symptom of the von Neumann Syndrome
SNN filter on a (supersystolic) KressArray: mainly a pipe network, no CPU.
  • array size: 10 x 16 = 160 rDPUs (reconfigurable Data Path Units, e.g. 32 bits wide; some route-through only, some not used; backbus connect)
  • note: a software perspective without instruction streams
  • question after the talk, from a high-level R&D manager of a large Japanese IT industry group: "but you can't implement decisions!": yielded by a single-paradigm mind set
  • in fact, an if clause turns into a multiplexer
  • an executive summary? Forget it! How about a microprocessor giant having >100 vice presidents?
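The "if clause turns into a multiplexer" remark can be demonstrated directly; a minimal sketch (hypothetical names) showing that computing both branches and then selecting gives the same result as branching:

```python
# The point behind "an if clause turns into a multiplexer": in a data-stream
# pipeline, BOTH branches are computed every cycle and a multiplexer selects
# one result, so no instruction-stream branching is needed.
def if_clause(a, b):
    """Instruction-stream view: control flow decides what to compute."""
    if a > b:
        return a - b
    else:
        return b - a

def mux_datapath(a, b):
    """Datapath view: compute both branches, then select (multiplex)."""
    branch_true = a - b
    branch_false = b - a
    sel = a > b                                  # 1-bit select signal
    return branch_true if sel else branch_false  # the multiplexer

# Both views agree on every input pair:
assert all(if_clause(a, b) == mux_datapath(a, b)
           for a in range(-3, 4) for b in range(-3, 4))
```

This is how a pipe network implements decisions without a program counter.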

Dual Paradigm Application Development [Juergen Becker's CoDe-X, 1996]
(figure: a C language source enters the partitioner; the SW compiler generates code for the CPU; the CW compiler performs automatic parallelization by loop transformations plus placement and routing, generating a pipe network on the rDPU array)

Hybrid Multi-Core example: a twin paradigm machine
(figure: 64 cores; each core can run in CPU mode or in rDPU mode)
  • How about a microprocessor giant having >100 vice presidents: disabled for the paradigm shift? Customers refusing the paradigm shift?

Compilation for Dual Paradigm Multicore [Juergen Becker's CoDe-X, 1996]
(figure: as before, but compiling to the hybrid multicore: the SW compiler targets the CPU cores; the CW compiler performs automatic parallelization by loop transformations plus placement and routing of the pipe network onto the rDPU cores)


Here is the common model
(figure: mainframe age: the von Neumann instruction-stream-based machine; microprocessor age: CPU plus hardwired, data-stream-based accelerator co-processors behind the von Neumann bottleneck; configware age: CPUs plus hardwired and reconfigurable accelerators, fed by a software/configware co-compiler that emits both software and configware)


Multi Core: Just more CPUs?
  • complexity and clock frequency scaling of single-core microprocessors have come to an end
  • multi-core microprocessor chips are emerging: soon 32 cores on an AMD chip, and 80 on an Intel
  • without a paradigm shift, just more CPUs on a chip lead down the dead roads known from supercomputing
  • multi-threading is not the silver bullet
  • we have to re-think the basic assumptions behind computing

Solution not expected from CS officers
  • progress of the joint task force on CS curriculum recommendations is extremely disillusioning; it's more like a lobby: "my area is the most important"
  • we need mutual efforts, like the EE/CS cooperation known from the Mead & Conway revolution
  • the personal supercomputer, by Reconfigurable Computing: a far-ranging, massive push of innovation in all areas of science and the economy
  • for RC, other motivations are similarly high-grade: the growing cost and looming shortage of energy

Computing Sciences are in a severe Crisis
  • we urgently need to shape the Reconfigurable Computing revolution, to enable the move toward incredibly promising new horizons of affordable highest-performance computing
  • this cannot be achieved with the classical software-based mind set: we need a new dual paradigm approach
  • supercomputing titans may be your enemies: watch out not to get screwed!

The Configware Age
  • attempts to avoid the paradigm shift will again create a disaster
  • the mainframe age and the microprocessor(-only) age are history
  • we are living in the configware age right now!


MoM Scan Window (MoMSW) Illustration
(figure: ASMs; ASM: Auto-Sequencing Memory)
MoM architectural primary features:
  • 2-dimensional (data) memory address space
  • multiple* vari-size reconfigurable MoMSW scan windows
  • MoMSW controlled by reconfigurable GAGs (generic address generators)
*) typically 3

Reconfigurable Generic Address Generator (GAG)
A generalization of the DMA. Acceleration factors by:
  • address computation without memory cycles: avoids e.g. a 94% address computation overhead
  • storage scheme optimization methodology, etc.
  • supporting scratch optimization strategies (smart d-caching)
GAG & enabling technology published 1989; survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]; patented by TI 1995

JPEG zigzag scan pattern

[Figure: 8-by-8 pixel map (x/y axes) with the numbered scan steps and data counter positions]

*> Declarations

EastScan is
  step by [1,0]
end EastScan;

SouthScan is
  step by [0,1]
end SouthScan;

NorthEastScan is
  loop 8 times until [*,1]
    step by [1,-1]
  endloop
end NorthEastScan;

SouthWestScan is
  loop 8 times until [1,*]
    step by [-1,1]
  endloop
end SouthWestScan;

HalfZigZag is
  EastScan
  loop 3 times
    SouthWestScan
    SouthScan
    NorthEastScan
    EastScan
  endloop
end HalfZigZag;

*> Scan program

goto PixMap[1,1]
HalfZigZag;
SouthWestScan
uturn (HalfZigZag)

116
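The full 8-by-8 zigzag sequence that such a scan program describes can be sketched in a few lines of plain Python (an illustration of the address stream a GAG would be configured to emit, not MoPL tooling):

```python
def zigzag(n=8):
    """Enumerate (row, col) positions of an n-by-n block in JPEG
    zigzag order: walk the anti-diagonals, alternating direction."""
    order = []
    for s in range(2 * n - 1):            # s = row + col indexes one anti-diagonal
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()                # even diagonals run bottom-left to top-right
        order.extend(diag)
    return order
```

Each of the 64 positions is visited exactly once; a data counter stepping through this sequence replaces 64 per-access address computations by the instruction stream.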

Significance of MoMSW Reconfigurable Scan Windows
  • MoMSW Scan windows have the potential to drastically reduce traffic to/from slow off-chip memory.
  • No instruction streams needed to implement scratch pad optimization strategies using fast on-chip memory
  • MoMSW Scan windows may contribute to speed-up by a factor of 10 and sometimes even much more
  • MoMSW Scan windows are the deterministic alternative („d-caching“) to (indeterministic and speculative) classical cache usage: performance can be well predicted
  • For data-stream-based computing scan windows are highly effective, whereas classical caches are entirely useless

117
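A back-of-the-envelope model of that traffic reduction (a toy calculation under our own simplifying assumptions, not MoM measurement data): sliding a k-by-k window over a w-by-h image, a scan window that reuses its overlapping columns fetches far fewer off-chip words than refetching every window position from scratch:

```python
def naive_fetches(w, h, k):
    """Off-chip words fetched if every window position reloads all k*k cells."""
    return (w - k + 1) * (h - k + 1) * k * k

def window_fetches(w, h, k):
    """Off-chip words fetched if the scan window keeps its overlap:
    k*k cells for the first position of each scan line, then only one
    new k-cell column per step to the right."""
    per_line = k * k + (w - k) * k
    return (h - k + 1) * per_line
```

For a 22-by-11 image with a 3-by-3 window this gives 1620 vs. 594 fetches, a factor of about 2.7 from column reuse alone; storage scheme optimization and buffering of whole scan lines on-chip push the factor further.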

Parallelized Merged Buffer Linear Filter Application

with example image of x=22 by y=11 pixels

Design refinement: initial design -> hardware-level access optimization -> after scan line unrolling -> after inner scan line loop unrolling -> final design

Speed-up factor >11, due to MoMSW-based d-caching & storage scheme optimization

118

Processing 4-by-4 Reference Patterns

PISA DRC accelerator [ICCAD 1984]: a reconfigurable accelerator and a forerunner of the MoM.

DPLA: fabricated by the E.I.S. Multi University Project:

  • Mead-&-Conway nMOS Design Rules: 256 4-by-4 reference patterns
  • Mead-&-Conway CMOS Design Rules: >800 4-by-4 reference patterns

Reference patterns automatically generated from the design rules.

vN software: some reference patterns can be skipped, depending on earlier patterns.

MoM: all reference patterns matched in a single clock cycle.

1984: 1 DPLA replaces 256 FPGAs

120

Outline
  • The (non-v-N) anti-machine (Xputer)
  • Speed-up by address generators
  • Data-procedural Programming Language
  • Generalization of the Systolic Array
  • Partitioning Compilation Techniques
  • Design Space Exploration
  • Bridging the Paradigm Chasm

124

Significance of Address Generators
  • Address generators have the potential to reduce computation time significantly.
  • In a grid-based design rule check a speed-up of more than 2000 has been achieved*
  • reconfigurable address generators contributed a factor of 10, avoiding memory cycles for address computation overhead

*) 15,000 if the same algorithm is used

125

hardware vs. software perspective (for software people)

*) with soft cores and/or on-chip microprocessor
**) without soft cores

126

Ingredients

[Figure: the ingredient platforms, all multi-core:
  • CPU with program counter and RAM: for running legacy software
  • simple FPGA: rLBs, soft CPUs, on-chip BRAM
  • coarse-grained array: rDPUs with ASMs (data counters): the anti machine (Xputer), i.e. the Kress/Kung machine
  • platform FPGA: CPU with reconfigurable instruction set extension, hardwired special functions]

127

perspective ? what expertise needed ? hardware ?

  • microprocessor (also multi-core): von Neumann, software perspective
  • simple FPGA (fine-grained): hardware perspective
  • platform FPGA (domain-specific core assortment, embedded in FPGA fabrics): mishmash model, a nightmare for undergraduate studies, but by far the best optimization potential
  • coarse-grained reconfigurable array: software perspective
  • reconfigurable instruction set processor: mishmash model (s. a.)

128


Objectives

for every area which needs:

cheap, compact vHPC

rapid prototyping, field-patching, emulation

avoiding specific silicon

flexibility (for accelerators)

129


Conclusion (1)

Reconfigurable Computing opens many spectacular new horizons:

Cheap vHPC without needing specific silicon, no mask ....

Cheap embedded vHPC

Cheap desktop supercomputer (a new market)

Replacing expensive hardwired accelerators

Fast and cheap prototyping

Flexibility for systems with unstable multiple standards by dynamic reconfigurability

Supporting fault tolerance, self-repair and self-organization

Emulation logistics for very long term sparepart provision and part type count reduction (automotive, aerospace …)

Massive reduction of the electricity bill, locally and nationally

130


Conclusion (2)

Needed:

Universal vHPC co-architecture demonstrator

For widely spreading its use successfully:

The compilation tool problem to be solved

Language selection problem to be solved

Education backlog problems to be solved

Use this to develop a very good high school and undergraduate lab course

select killer applications for demo

A motivator: preparing for the top 500 contest

131

More compute power by Configware than Software

(a very cautious estimation)

  • 75% of all (micro)processors are embedded (4 : 1)
  • 25% of embedded µProcessors accelerated by FPGA(s) (1 : 4) -> every 2nd µProcessor accelerated by FPGA(s) (-> 1 : 1)
  • average acceleration factor >2 -> rMIPS* : MIPS > 2

*) rMIPS: MIPS replaced by FPGA compute power

Conclusion: most compute power from Configware (the difference is probably an order of magnitude)

132


Conclusion (3)

Self-Repair and Self-Organization methodology

Embedded r-emulation logistics methodology

Universal vHPC co-architecture demonstrator

For widely spreading its use successfully:

select a killer application for demo

133

some Goals

Universal HPC co-architecture for:

  • embedded vHPC (nomadic, automotive, ...)
  • desktop vHPC (scientific computing ...)

Application co-development environment for hardware non-experts, ....

Acceptability by software-type users, ...

Meet product lifetime >> embedded system life: FPGA emulation logistics from development down to maintenance and repair stations (examples: automotive, aerospace, industrial, ...)

134


SuperComputing 06

SuperComputing, Nov 11-17, 2006, Tampa, Florida, over 7000 registered attendees, and 274 exhibitors

Panel

Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm?

Tarek El-Ghazawi, The George Washington University

-

Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm?

Dave Bennett, Xilinx, Inc

-

Reconfigurable Computing: The Future of HPC

Daniel S. Poznanovic, SRC Computers, Inc.

-

Is High-Performance Reconfigurable Computing the Next Supercomputing Paradigm?

Allan J. Cantle, Nallatech Ltd.

-

Challenges for Reconfigurable Computing in HPC

Keith D. Underwood, Sandia National Laboratories

-

Reconfigurable Computing - Are We There Yet?

Rob Pennington, National Center for Supercomputing Applications

-

Reconfigurable Computing: The Road Ahead

Duncan Buell, University of South Carolina

-

Opportunities and Challenges with Reconfigurable HPC

Alan D. George, University of Florida

135

outline13
Outline
  • The (non-v-N) anti-machine (Xputer)
  • Speed-up by address generators
  • Data-procedural Programming Language
  • Generalization of the Systolic Array
  • Partitioning Compilation Techniques
  • Design Space Exploration
  • Bridging the Paradigm Chasm

136


Acceleration Mechanisms by ASM-based MoMSW
  • parallelism by multi-bank memory architecture
  • reconfigurable address computation, set up before run time
  • avoiding multiple accesses to the same data
  • avoiding memory cycles for address computation
  • improving parallelism by storage scheme transformations
  • minimizing data movement across chip boundaries

138
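The first two mechanisms can be illustrated with a toy skewed storage scheme (our own minimal example, not the MoM's actual scheme): mapping cell (x, y) to bank (x + y) mod N makes any horizontal or vertical run of N cells conflict-free, so a scan window row can be read from N banks in parallel instead of serializing on one bank:

```python
def bank_of(x, y, nbanks=4):
    """Skewed storage scheme: cell (x, y) is stored in bank (x + y) mod nbanks."""
    return (x + y) % nbanks
```

Any 4 consecutive cells of a row or of a column then land in 4 distinct banks; a straight row-major scheme (bank = x mod 4) would put a whole column into a single bank and serialize vertical window accesses.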




C or FORTRAN ?

Gordon Bell: "Computer scientists haven't been interested in programming clusters. If putting the cluster on a chip is what excites them, fine. It will still have to run Fortran!"

Reiner Hartenstein (conclusion of this talk): ... or C (X-C): it's a shorter leap. Classical programming languages, but with a slightly different (data-procedural) semantics, are good candidates for parallel programming. Support tools* have been demonstrated by academia.

*) like CoDe-X

142

Newton's 1st Law

Newton's 1st Law à la Gordon Bell: scientists do not change their direction

[Figure: illustration]

143

Dual paradigm: an old hat

Software mind set: instruction-stream-based: flow chart -> control instructions (FSM: state transition)

Mapped into a hardware mind set: action box = flipflop, decision box = (de)multiplexer -> Register Transfer Modules (DEC, mid-1970s); a similar concept at Case Western Reserve Univ.

[Figure: token bit evoking a flipflop (FF)]

145
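A minimal sketch of that mapping (Python standing in for hardware, purely as illustration): the software decision box selects a branch in time via control flow, while the hardware version computes both branches in space and the decision box degenerates to a multiplexer:

```python
def mux(sel, a, b):
    """2-to-1 multiplexer: the hardware image of a decision box."""
    return a if sel else b

def decide_software(cond, x):
    # instruction-stream mind set: control flow executes one branch
    if cond:
        return x + 1
    return x - 1

def decide_hardware(cond, x):
    # space-domain mind set: both datapaths exist permanently;
    # the condition merely steers the multiplexer
    return mux(cond, x + 1, x - 1)
```

Both formulations compute the same function; only the binding to time (branching) vs. space (wiring plus selection) differs.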

Dual paradigm: an old hat (2)

Hardware Description Language scene ~1970: "It is so simple! Why did it take 25 years to find out?"

  • Because of the reductionists' tunnel view
  • Because of a lack of transdisciplinary thinking

146

Dual paradigm: an old hat (3)

Hardware Description Languages use the "procedure call" or function call notation: call Module-name (parameters);

Software: time domain
Hardware description: space domain
147
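The duality can be made concrete with a multiply-accumulate example (an illustrative Python sketch, not an HDL): in the time domain one unit is reused step by step under instruction control; in the space domain the same computation is elaborated into a chain of per-coefficient stages:

```python
def mac_time(xs, ws):
    """Time domain: one multiply-accumulate unit, reused each step."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = acc + x * w
    return acc

def mac_space(xs, ws):
    """Space domain: instantiate one stage per coefficient, then wire
    the stages into a chain (the loops here only elaborate and traverse
    the structure, like module instantiation in an HDL)."""
    stages = [lambda acc, x, w=w: acc + x * w for w in ws]
    acc = 0
    for stage, x in zip(stages, xs):
        acc = stage(acc, x)
    return acc
```

`mac_time([1, 2, 3], [4, 5, 6])` and `mac_space([1, 2, 3], [4, 5, 6])` both return 32; the space version is what a configware compiler would lay out as a pipeline.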

ASM

148

Apropos HiPEAC: Software / Configware Co-Compilation

[Figure, spanning three eras:
  • mainframe age: C language source -> software compiler -> CPU: the von Neumann instruction-stream-based machine (program counter, RAM memory, von Neumann bottleneck)
  • microprocessor age: automatic parallelization by loop transformations; hardwired accelerator co-processors (DPUs)
  • configware age: software/configware co-compiler (CoDe-X, 1996): partitioner -> SW compiler (software for the CPU) and CW compiler (configware for the data-stream-based reconfigurable accelerator: an rDPU array)]

152

Jürgen Becker's CoDe-X -1 Co-Compiler

[Figure: X-C source -> Partitioner (with Analyzer / Profiler) -> GNU C compiler for the CPU (computer machine paradigm, also running legacy software) and X-C compiler for the Xputer (anti-machine paradigm)]

X-C is C language extended by MoPL

rALU: => array size: 1-by-1

153

Jürgen Becker's CoDe-X -2 Co-Compiler

[Figure: X-C source -> Partitioner (with Analyzer / Profiler) -> GNU C compiler for the CPU (computer machine paradigm) and X-C compiler with DPSS (anti-machine paradigm), supporting the KressArray family via resource parameters (rDPU array)]

X-C is C language extended by MoPL

Pipelining: a shorter leap

154

Jürgen Becker's CoDe-X -2 Co-Compiler

[Figure: the co-compilation flow targeting a heterogeneous multi-core built from dual-mode cores: CPU mode vs. rDPU mode]

X-C is C language extended by MoPL

155

hardware vs. software perspective (for software people)

*) with soft cores and/or on-chip microprocessor
**) without soft cores

158

Data meeting the Processing Unit (PU)

... partly explaining the RC paradox. We have 2 choices for the placement of the execution locality:

  • by Software: routing the data by memory-cycle-hungry instruction streams through shared memory
  • by Configware: a pipe network generated by configware compilation

159


END

163


A shorter leap

A shorter leap by coarse-grained platforms which allow a software-like pipelining perspective.

Avoiding the paradigm shift?

"It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?" (Tarek El-Ghazawi, panelist at SuperComputing 2006)

"A leap too far for the existing HPC community" (panelist Allan J. Cantle)

We need a bridge strategy: developing advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques.

165


A shorter leap by coarse-grained platforms which allow a software-like pipelining perspective.

Avoiding the paradigm shift? "A leap too far for the existing HPC community"

  • ... the promise of almost unimagined computing power
  • Have the hardware developers raced too far ahead of many programmers' ability to create software?
  • Parallel computing has been an esoteric skill limited to people involved with high-performance supercomputing. That is changing now that desktop computers and even laptops are going multicore.
  • "High-performance computing experts have learned to deal with this, but they are a fraction of the programmers," Saied says.
  • "In the future you won't be able to get a computer that's not multicore. As multicore chips become ubiquitous, all programmers will have to learn new tricks."
  • Even in high-performance computing there are areas that aren't yet ready for the new multicore machines.
  • "In industry, much of their high-performance code is not parallel," Saied says. "These corporations have a lot of time and money invested in their software, and they are rightly worried about having to re-engineer that code base."

We need a bridge strategy: developing advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques.

166

Avoiding the paradigm shift?

  • "Moore's Gap"
  • Steve Kirsch, an engineering fellow at Raytheon Systems Co., says that multicore computing presents both the dream of infinite computing power and the nightmare of programming.
  • "The real lesson here is that the hardware and software industries have to pay attention to each other," Kirsch says. "Their futures are tied together in a way that they haven't been in recent memory, and that will change the way both businesses will operate."

In February, Intel released research details about a chip with 80 cores: a fingernail-sized chip with the same processing power that in 1996 required a supercomputer with a 2,000-square-foot footprint and 1,000 times the electrical power.

This is a problem for those who depend on previously written software that has been steadily improving and evolving over decades: "Our legacy software is a real concern to us."

Parallel programming for multicore computers may require new computer languages: "Today we program in sequential languages. Do we need to express our algorithms at a higher level of abstraction? Research into these areas is critical to our success."

167

Avoiding the paradigm shift?

  • "Our programming languages researchers are exploring new programming paradigms and models," Hambrusch says. "Our course on multicore architectures is also preparing students for future software development positions. Purdue is clearly playing a defining role in this critical technology."

"In five or six years, laptop computers will have the same capabilities, and face the same obstacles, as today's supercomputers," Saied says. "This challenge will face people who program for desktop computers, too. People who think they have nothing to do with supercomputers and parallel processing will find out that they need these skills, too."

Remote Direct Memory Access (RDMA) is a technology that allows computers in a network to exchange data in main memory without involving the processor, cache, or operating system of either computer. Like locally based Direct Memory Access (DMA), RDMA improves throughput and performance because it frees up resources. RDMA also facilitates a faster data transfer rate. RDMA implements a transport protocol in the network interface card (NIC) hardware.

168

Avoiding the paradigm shift?

Three Ways to Make Multicore Work. Number 1, Mathematics: do more computational work with less data motion
  • E.g., higher-order methods: trade memory motion for more operations per word, producing an accurate answer in less elapsed time than lower-order methods
  • Different problem decompositions (no stratified solvers): the mathematical equivalent of loop fusion, e.g., nonlinear Schwarz methods
  • Ensemble calculations: compute ensemble values directly
  • It is time (really past time) to rethink algorithms for memory locality and latency tolerance
  • I didn't say threads. See, e.g., Edward A. Lee, "The Problem with Threads," Computer, vol. 39, no. 5, pp. 33-42, May 2006; Robert O'Callahan, "Night of the Living Threads", http://weblogs.mozillazine.org/roc/archives/2005/12/night_of_the_living_threads.html, 2005; John Ousterhout, "Why Threads Are A Bad Idea (for most purposes)" (~2004); Allen Holub, "If I were king: A proposal for fixing the Java programming language's threading problems", http://www128.ibm.com/developerworks/library/j-king.html, 2000. (Allen Holub has been working in the computer industry since 1979, is widely published in magazines (Dr. Dobb's Journal, Programmers Journal, Byte, MSJ, among others), and writes the "Java Toolbox" column for the online magazine JavaWorld.)

Breaking the Assumptions:
  • Don't have any off-chip memory. Consequence: need algorithms, programming models, and software tools that work in more limited memory (a few GB)
  • Have off-chip memory, but manage it more effectively. Consequence: need to find a true, general-purpose hardware/software model
  • Overlap latency with split operations. Consequence: need to find massive amounts of concurrency, and to manage the programming challenges of split operations (these are hard for programmers to use correctly; this may be an opportunity for formal methods)

Multicore doesn't just stress bandwidth, it increases the need for perfectly parallel algorithms:
  • All systems will look like attached processors: high latency, low (relative) bandwidth to main memory
  • 128 cores? "When [a] request for data from Core 1 results in an L1 cache miss, the request is sent to the L2 cache. If this request hits a modified line in the L1 data cache of Core 2, certain internal conditions may cause incorrect data to be returned to Core 1."
  • Everything does not double: traveling from New York to Chicago took 3 weeks before 1830, 1.5 days in 1857, and 6 hours now: only a factor of 6

MPI on Multi-Core: 340 ns MPI ping/pong latency; improvement will require better SWE tools. Benchmarks:
  • Ping-pong latency: ring-based ping-pong exchange between all nodes
  • Nearest-neighbor ghost-area exchange: test code from Argonne used to evaluate one-sided and point-to-point operations
  • CPU availability: calculates the percentage of CPU available at the receiver by doing a fixed amount of work during message arrival

169

in Memoriam ...

in Memoriam Stamatis Vassiliadis (1951 - 2007)

in Memoriam Richard Newton (1951 - 2007)

170

KressArray DPSS

published at ASP-DAC 1995

[Figure: KressArray Xplorer (Platform Design Space Explorer). Components: ALE-X Compiler (expression tree, intermediate forms, ALE-X code); Architecture Estimator; HDL Generator (VHDL, Verilog); Simulator; User Interface (selection, suggestion, design rules); Architecture Editor; Mapping Editor; Improvement Proposal Generator; Datapath Generator; Mapper; Scheduler (data stream schedule, delay estimation); Power Estimator (power data, statistical data); Inference Engine (FOX); Analyzer. The DPSS is driven by KressArray family parameters and produces the Kress rDPU layout.]

171

KressArray Family generic Fabrics: a few examples

[Figure: select the nearest-neighbour (NN) interconnect (an example); select mode, number, and width of NN ports (e.g. 2, 4, 8, 16, 24, 32); select the function repertory (rout-through only, or rout-through and function); more NN ports yield rich rout resources; examples of 2nd-level interconnect, layouted over the rDPU cell: no separate routing areas!]

http://kressarray.de

172