
Spatial Computation: Computing without General-Purpose Processors

Mihai Budiu

mihaib@cs.cmu.edu

Carnegie Mellon University

July 8, 2004

Spatial Computation

  • A computation model based on:
  • application-specific hardware
  • no interpretation
  • minimal resource sharing

The Engine Behind This Talk

main()
{
    signal(SIGINT, welcome);
    while (slides() && time()) {
        talk();
    }
}

Research Scope

Object: future architectures

Tool: compilers

Evaluation: simulators

Research Methodology

[Figure: constraint space with axes X (e.g., power) and Y (e.g., cost). The state-of-the-art lies within “reasonable limits”; incremental evolution is contrasted with new solutions.]

Outline

  • Introduction: problems of current architectures
  • Compiling Application-Specific Hardware
  • Pipelining
  • ASH Evaluation
  • Conclusions

[Figure: performance on a logarithmic scale (1 to 1000), plotted by year, 1980 to 2000.]

Resources

[Intel]

  • We do not worry about not having hardware resources
  • We worry about being able to use hardware resources
Design Complexity

[Figure: chip size (transistors) and designer productivity on a logarithmic scale from 10^4 to 10^10, plotted by year, 1981 to 2009.]

Communication vs. Computation

[Figure: wire delay compared with gate delay. Labels: 5 ps, 20 ps.]

Power consumption on wires is also dominant

Power Consumption

Toasted CPU: about 2 sec after removing cooler. (Tom’s Hardware Guide)

Clock Speed

[Figure labels: 3 GHz, 6 GHz, 10 GHz.]

Cannot rely on global signals (the clock is a global signal)

Instruction-Set Architecture

[Figure: the ISA as the layer between Software and Hardware.]

VERY rigid to changes (e.g., x86 vs. Itanium)

Our Proposal

[Figure labels: CPU (low-ILP computation + OS + VM), ASH (high-ILP computation), $, Memory.]

  • ASH addresses these problems
  • ASH is not a panacea
  • ASH is “complementary” to the CPU
Outline
  • Problems of current architectures
  • CASH: Compiling ASH
    • program representation
    • compiling C programs
  • Pipelining
  • ASH Evaluation
  • Conclusions
Application-Specific Hardware

[Figure: C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw. Other labels: SW, ISA, HW, Dataflow machine.]

Application-Specific Hardware

[Figure: C program → Compiler → Dataflow IR → SW backend → Machine code → CPU [predication]. Other label: Soft.]

Key: Intermediate Representation

Traditionally:
  • CFG
  • def-use
  • may-dep.
  • ...

Our IR:
  • SSA + predication + speculation
  • Uniform for scalars and memory
  • Explicitly encodes may-depend
  • Executable
  • Precise semantics
  • Dataflow IR
  • Close to asynchronous target

Computation = Dataflow

Programs:

x = a & 7;
...
y = x >> 2;

Circuits:

[Figure: inputs a and 7 feed an & unit; its output x and the constant 2 feed a >> unit.]

  • Operations ⇒ functional units
  • Variables ⇒ wires
  • No interpretation
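
The mapping above can be mimicked in C. This is a minimal sketch, assuming each operation gets its own dedicated function (a stand-in for a functional unit) and each variable is just a value handed from one unit to the next (a stand-in for a wire). The names and_unit, shr_unit, and circuit are hypothetical and are not part of CASH.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical functional units: one dedicated unit per operation. */
static uint32_t and_unit(uint32_t lhs, uint32_t rhs)
{
    return lhs & rhs;
}

static uint32_t shr_unit(uint32_t value, uint32_t amount)
{
    assert(amount < 32);   /* shifting by the full width is undefined in C */
    return value >> amount;
}

/* The "circuit" for:  x = a & 7;  y = x >> 2;
 * Each local variable plays the role of a wire between two units. */
uint32_t circuit(uint32_t a)
{
    uint32_t x = and_unit(a, 7u);   /* wire x: output of & feeds >> */
    uint32_t y = shr_unit(x, 2u);   /* wire y: output of the circuit */
    return y;
}
```

Calling circuit(13) pushes the value through both units in order: 13 & 7 is 5, and 5 >> 2 is 1. Nothing fetches or decodes an instruction; the structure of the code is the computation.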
Basic Computation

[Figure: an adder (+) followed by a latch, connected by data, valid, and ack signals.]

Asynchronous Computation

[Figure: animation in steps 1 to 8 of an adder (+) and its latch exchanging data, valid, and ack signals.]
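
The data/valid/ack signalling can be simulated in software. The sketch below assumes a four-phase (return-to-zero) handshake, which the slide does not specify; the type channel_t and the functions producer_step, consumer_step, and transfer are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* One channel between a producer (e.g., an adder) and a consumer latch. */
typedef struct {
    int  data;
    bool valid;   /* raised by the producer: data is stable */
    bool ack;     /* raised by the consumer: data has been latched */
} channel_t;

/* Producer side. Returns true when a new value was placed on the channel. */
bool producer_step(channel_t *ch, int value)
{
    if (ch->valid && ch->ack) {        /* phase 3: consumer has the data */
        ch->valid = false;
        return false;
    }
    if (!ch->valid && !ch->ack) {      /* phase 1: channel idle, send */
        ch->data = value;
        ch->valid = true;
        return true;
    }
    return false;                      /* otherwise wait for the consumer */
}

/* Consumer side. Returns true when a value was latched. */
bool consumer_step(channel_t *ch, int *latch)
{
    if (ch->valid && !ch->ack) {       /* phase 2: latch and acknowledge */
        *latch = ch->data;
        ch->ack = true;
        return true;
    }
    if (!ch->valid && ch->ack)         /* phase 4: return to idle */
        ch->ack = false;
    return false;
}

/* Moves n values across the channel; returns how many were latched. */
int transfer(const int *values, int n, int *received)
{
    channel_t ch = { 0, false, false };
    int sent = 0, got = 0, latch = 0;
    assert(n >= 0);
    while (got < n) {
        if (sent < n && producer_step(&ch, values[sent]))
            sent++;
        if (consumer_step(&ch, &latch))
            received[got++] = latch;
    }
    return got;
}
```

No clock appears anywhere in this model: each side advances only when the other side's signal changes, which is the property the slide illustrates.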

Distributed Control Logic

[Figure: instead of a global FSM, each operation (+, -) has its own asynchronous control, exchanging rdy and ack signals over short, local wires.]

Outline
  • Problems of current architectures
  • CASH: Compiling ASH
    • program representation
    • compiling C programs
  • Pipelining
  • ASH Evaluation
  • Conclusions
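MUX: Forward Branches

if (x > 0)
    y = -x;
else
    y = b*x;

[Figure: dataflow circuit with inputs b, x, and 0. The *, -, and > units all compute in parallel; the predicate from > and its negation (!) select the value of y at a mux. The critical path is marked.]

SSA = no arbitration

Conditionals ⇒ Speculation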
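
The conditional above can be rewritten in C in the same spirit: both arms are computed unconditionally and a mux picks one. This is a minimal sketch, assuming the arms are free of side effects and do not overflow; mux2 and forward_branch are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Decoded mux: every data input comes with its own predicate.
 * Exactly one predicate is expected to be true, so no arbitration is needed. */
static int mux2(bool p_true, int v_true, bool p_false, int v_false)
{
    assert(p_true != p_false);
    return p_true ? v_true : v_false;
}

/* if (x > 0) y = -x; else y = b*x;   with both arms speculated.
 * Assumes -x and b*x do not overflow int. */
int forward_branch(int x, int b)
{
    int  neg  = -x;        /* arm 1, computed whether or not it is needed */
    int  prod = b * x;     /* arm 2, computed whether or not it is needed */
    bool p    = x > 0;     /* predicate */
    return mux2(p, neg, !p, prod);
}
```

For example, forward_branch(3, 5) yields -3 and forward_branch(-2, 5) yields -10. The branch has disappeared; only the selection remains.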

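Control Flow ⇒ Data Flow

[Figure: the three node types that replace control flow.]
  • Split (branch): steers data according to predicate p or its negation (!)
  • Merge (label): joins data arriving from several sources
  • Gateway: passes data when its predicate holds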

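Loops

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;

[Figure: dataflow circuit for the loop. i and sum each start at 0 and circulate in their own loop; i feeds *, +1, and < 100; the product is added to sum; the negated predicate (!) releases sum to the ret node.]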
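
The loop circuit can be paraphrased in C by making the loop-carried values explicit. This is a minimal sketch, assuming one structure per trip around the loop holds the values of i and sum; loop_state_t, loop_iteration, and sum_of_squares are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Values carried around the loop on its back edges. */
typedef struct {
    int i;
    int sum;
} loop_state_t;

/* One trip through the loop body. Returns the loop predicate. */
static bool loop_iteration(const loop_state_t *in, loop_state_t *out, int bound)
{
    bool pred = in->i < bound;                 /* the "<" comparator */
    if (pred) {
        out->sum = in->sum + in->i * in->i;    /* "*" feeding "+" */
        out->i   = in->i + 1;                  /* the "+1" unit */
    } else {
        *out = *in;                            /* values leave the loop unchanged */
    }
    return pred;
}

int sum_of_squares(int bound)
{
    loop_state_t cur = { 0, 0 };   /* initial values entering the loop */
    loop_state_t next;
    assert(bound >= 0 && bound <= 1000);   /* keeps the sum within int */
    while (loop_iteration(&cur, &next, bound))
        cur = next;
    return cur.sum;                /* "ret" fires once the predicate is false */
}
```

With bound set to 100, as on the slide, the function returns 328350.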

Predication and Side-Effects

[Figure: a Load node with inputs addr, pred, and token, and outputs data and token; the access goes to memory.]

  • The predicate means no speculation for side-effects
  • Tokens provide sequencing of side-effects
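
A software analogue of the predicated, token-ordered memory operation is sketched below. It assumes that a false predicate suppresses the memory access while the token is still forwarded, and it models the token as a counter (seq) purely so the ordering is visible; token_t, load_result_t, predicated_load, and predicated_store are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token: counts the side effects performed so far. */
typedef struct {
    unsigned seq;
} token_t;

typedef struct {
    int     data;
    token_t token;
} load_result_t;

/* Store that touches memory only when pred is true. */
token_t predicated_store(int *memory, size_t size, size_t addr,
                         int value, bool pred, token_t in)
{
    token_t out = in;
    if (pred) {
        assert(addr < size);
        memory[addr] = value;
        out.seq = in.seq + 1;
    }
    return out;   /* the token is forwarded even when nothing happened */
}

/* Load that touches memory only when pred is true. */
load_result_t predicated_load(const int *memory, size_t size, size_t addr,
                              bool pred, token_t in)
{
    load_result_t r = { 0, in };
    if (pred) {
        assert(addr < size);
        r.data = memory[addr];
        r.token.seq = in.seq + 1;
    }
    return r;
}
```

Because the load takes the token returned by the store as an argument, it cannot run before the store in this model, which is the sequencing role the slide assigns to tokens.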

Memory Access

[Figure: load (LD) and store (ST) operations reach a monolithic memory through a pipelined, arbitrated network. Communication within the circuit is local; the network and the memory are global structures.]

Future work: fragment this!

CASH Optimizations
  • SSA-based optimizations
    • unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining
  • Memory optimizations
    • dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling
  • Boolean optimizations
    • Espresso CAD tool, bitwidth analysis
Outline
  • Problems of current architectures
  • Compiling ASH
  • Pipelining
  • Evaluation: CASH vs. clocked designs
  • Conclusions
Pipelining

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;

[Figure: animation in steps 1 to 7 of the loop's dataflow circuit. Nodes: i, + (with constant 1), <= (with constant 100), * (pipelined multiplier, 8 stages), sum, +. Steps 5 and 6 label the values i=0 and i=1. Step 7 labels i's loop, sum's loop, the predicate, and the long-latency pipe.]

[Figure: the same circuit with i's loop, sum's loop, and the critical path marked.]

The predicate ack edge is on the critical path.
Pipeline balancing

[Figure: the loop circuit at step 7 with a decoupling FIFO added between i's loop and sum's loop.]

[Figure: the same circuit with the decoupling FIFO and the new critical path marked.]
Outline
  • Problems of current architectures
  • Compiling ASH
  • Pipelining
  • Evaluation: CASH vs. clocked designs
  • Conclusions
Evaluating ASH

[Figure: evaluation tool flow.]
  • Input: Mediabench kernels (1 hot function/benchmark), in C
  • CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC
  • 180nm std. cell library, 2V (~1999 technology)
  • Mem
  • ModelSim (Verilog simulation) → performance numbers

ASH Area

[Figure: normalized area. Reference marks: minimal RISC core; P4: 217.]

Bottleneck: Memory Protocol

[Figure: LD and ST operations connected through the LSQ to Memory.]

  • Token release to dependents: requires round-trip to memory.
  • Limit study: round trip zero time ⇒ up to 6x speed-up.
  • Exploring protocol for in-order data delivery & fast token release.

Power

[Figure: power comparison. Reference values: Xeon [+cache] 67000; µP 4000; DSP 110.]

Energy Efficiency

[Figure: energy efficiency in operations/nJ on a logarithmic scale for microprocessors, general-purpose DSPs, FPGAs, asynchronous µPs, ASH media kernels, and dedicated hardware. Annotation: 1000x.]

Outline
  • Problems of current architectures
  • Compiling ASH
  • Pipelining
  • ASH Evaluation
  • Future/related work & conclusions
Related Work

Asynchronous circuits

Nanotechnology

Dataflow machines

Embedded systems

High-level synthesis

Reconfigurable computing

Computer architecture

Compilation

Future Work
  • Optimizations for area/speed/power
  • Memory partitioning
  • Concurrency
  • Compiler-guided layout
  • Explore extensible ISAs
  • Hybridization with superscalar mechanisms
  • Reconfigurable hardware support for ASH
  • Formal verification
Grand Vision: Certified Circuit Generation
  • Translation validation: input ≡ output
  • Preserve input properties
    • e.g., C programs cannot deadlock
    • e.g., type-safe programs cannot crash
  • Debug, test, verify only at source-level

How far can you go?

[Figure: HLL → IR → IRopt → Verilog → gates → layout, with the formally validated portion of the chain marked.]

Conclusions

Spatial computation strengths

Backup Slides
  • Reconfigurable hardware
  • Critical paths
  • Control logic
  • ASH vs ...
  • ASH weaknesses
  • Exceptions
  • Normalized area
  • Why C?
  • Splitting memory
  • More performance
  • Recursive calls
Reconfigurable Hardware

[Figure: universal gates and/or storage elements, connected by an interconnection network of programmable switches.]
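Main RH Ingredient: RAM Cell

[Figure: a RAM holding the contents 0, 0, 0, 1, addressed by inputs a0 and a1; its data output is labeled "a1 & a2".]

Universal gate = RAM

[Figure: a switch on a data line whose control input comes from a stored bit (0).]

Switch controlled by a 1-bit RAM cell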
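
The idea that a small RAM acts as a universal gate can be shown with a lookup table in C. This is a minimal sketch, assuming a 2-input table whose four stored bits are indexed by the input pair; lut2_t and lut2_eval are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* A 2-input lookup table: four RAM bits, one per input combination.
 * Bit number (a1 * 2 + a0) of `bits` holds the output for inputs (a1, a0). */
typedef struct {
    uint8_t bits;
} lut2_t;

int lut2_eval(lut2_t lut, int a0, int a1)
{
    assert((a0 == 0 || a0 == 1) && (a1 == 0 || a1 == 1));
    unsigned addr = ((unsigned)a1 << 1) | (unsigned)a0;   /* inputs form the address */
    return (int)((lut.bits >> addr) & 1u);                /* stored bit is the output */
}
```

Loading the contents 0, 0, 0, 1 (bits = 0x8) makes the table behave as an AND gate; loading 0, 1, 1, 0 (bits = 0x6) turns the same hardware into XOR. Reconfiguring means rewriting the stored bits, not changing the circuit.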

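Critical Paths

if (x > 0)
    y = -x;
else
    y = b*x;

[Figure: dataflow circuit with inputs b, x, and 0; the *, -, and > units feed a mux, controlled by the predicate and its negation (!), that produces y.]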

Lenient Operations

if (x > 0)
    y = -x;
else
    y = b*x;

[Figure: the same circuit, with the path through - and > to y highlighted, bypassing the slower * unit.]

Solves the problem of unbalanced paths
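
The difference between a strict and a lenient mux can be sketched in C. The sketch assumes that "lenient" means the output is produced as soon as the predicate and the selected input are available, without waiting for the unselected input; signal_t, strict_mux, and lenient_mux are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* A wire that may or may not carry a value yet. */
typedef struct {
    bool ready;
    int  value;
} signal_t;

/* Strict mux: waits until every input has arrived. */
signal_t strict_mux(signal_t pred, signal_t if_true, signal_t if_false)
{
    signal_t out = { false, 0 };
    if (pred.ready && if_true.ready && if_false.ready) {
        out.ready = true;
        out.value = pred.value ? if_true.value : if_false.value;
    }
    return out;
}

/* Lenient mux: needs only the predicate and the input it selects. */
signal_t lenient_mux(signal_t pred, signal_t if_true, signal_t if_false)
{
    signal_t out = { false, 0 };
    if (pred.ready)
        out = pred.value ? if_true : if_false;   /* ready iff the chosen input is */
    return out;
}
```

With x = 3 in the example above, the predicate and -x arrive early while b*x is still in the multiplier. The lenient mux can already emit -3; the strict mux stays blocked on a value it will discard.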

Asynchronous Control

[Figure: control circuit of one stage. Signals: rdyin, ackout, ackin, rdyout, datain, dataout. Components: C, D, =, Reg.]

HLL to HW

[Figure: three routes from high-level languages to hardware, marked as prior work or this research.]
  • High-level synthesis: behavioral HDL → synchronous hardware
  • Reconfigurable computing: C [subsets] → hardware configuration (spatial computation)
  • Asynchronous circuits: concurrent language → asynchronous hardware

CASH vs High-Level Synthesis
  • CASH: the only existing tool to translate complete ANSI C to hardware
  • CASH generates asynchronous circuits
  • CASH does not treat C as an HDL
    • no annotations required
    • no reactivity model
    • does not handle non-C, e.g., concurrency


ASH Weaknesses
  • Low efficiency for low-ILP code
  • Does not adapt at runtime
  • Monolithic memory
  • Resource waste
  • Not flexible
  • No support for exceptions
ASH Weaknesses (2)
  • Both branch and join not free
  • Static dataflow (no re-issue of same instr)
  • Memory is “far”
  • Fully static
    • No branch prediction
    • No dynamic unrolling
    • No register renaming
  • Calls/returns not lenient


Branch Prediction

for (i=0; i < N; i++) {
    ...
    if (exception) break;
}

[Figure: dataflow circuit for the loop (nodes: i, + with constant 1, <, exception, !, &), with the ASH critical path and the CPU critical path marked.]

  • Predicted not taken. Effectively a no-op for the CPU!
  • Predicted taken.
  • Result available before inputs

Exceptions
  • Strictly speaking, C has no exceptions
  • In practice, it is hard to accommodate exceptions in hardware implementations
  • An advantage of software flexibility: the PC is a single point of execution control

[Figure labels: CPU (low-ILP computation + OS + VM + exceptions), ASH (high-ILP computation), $$$, Memory.]

Why C
  • Huge installed base
  • Embedded specifications written in C
  • Small and simple language
    • Can leverage existing tools
    • Simpler compiler
  • Techniques generally applicable
  • Not a toy language


Normalized Area

Memory Partitioning
  • MIT RAW project: Babb FCCM '99, Barua HiPC '00, Lee ASPLOS '00
  • Stanford SpC: Semeria DAC '01, TVLSI '02
  • Berkeley CCured: Necula POPL '02
  • Illinois FlexRAM: Fraguela PPoPP '03
  • Hand annotations (#pragma)

Memory Complexity

[Figure: RAM and LSQ with addr and data connections.]

Recursion

[Figure: save live values → recursive call → restore live values, using a stack.]
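
The save/call/restore sequence can be made explicit in C by replacing the language's own call stack with an array. This is a minimal sketch using factorial as a stand-in for any recursive function; the function name and STACK_DEPTH are hypothetical, and the deck does not say how CASH lays out its stack.

```c
#include <assert.h>

#define STACK_DEPTH 32

/* Computes n! with an explicit stack instead of C recursion.
 * Restricted to n <= 12 so the result fits in 32 bits. */
unsigned factorial_explicit_stack(unsigned n)
{
    unsigned stack[STACK_DEPTH];
    int top = 0;
    unsigned result;

    assert(n <= 12);

    /* "Call" phase: save the live value n before each recursive call. */
    while (n > 1) {
        stack[top++] = n;
        n = n - 1;
    }

    result = 1;   /* base case reached */

    /* "Return" phase: restore each saved n and finish n * f(n - 1). */
    while (top > 0) {
        unsigned saved_n = stack[--top];
        result = saved_n * result;
    }
    return result;
}
```

Only one copy of the function body exists, as it would in a circuit; the stack holds whatever must survive across the call, here just n.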