Spatial Computation

Spatial Computation: Computing without General-Purpose Processors. Mihai Budiu, mihaib@cs.cmu.edu, Carnegie Mellon University, July 8, 2004

Spatial Computation A computation model based on: • application-specific hardware • no interpretation • minimal resource sharing

The Engine Behind This Talk
    main( ) {
        signal(SIGINT, welcome);
        while (slides( ) && time( )) {
            talk( );
        }
    }

Research Scope • Object: future architectures • Tool: compilers • Evaluation: simulators

Research Methodology [Figure: constraint space with axes X (e.g., power) and Y (e.g., cost). Labels: “reasonable limits”, state-of-the-art, incremental evolution, new solutions.]

Outline • Introduction: problems of current architectures • Compiling Application-Specific Hardware • Pipelining • ASH Evaluation • Conclusions [Figure: performance on a log scale from 1 to 1000, by year, 1980 to 2000.]

Resources [Intel] • We do not worry about not having hardware resources • We worry about being able to use hardware resources

Design Complexity [Figure: chip size in transistors (log scale, 10^4 to 10^10) and designer productivity, by year, 1981 to 2009.]

Communication vs. Computation [Figure labels: wire, gate; 5ps, 20ps.] Power consumption on wires is also dominant.

Power Consumption Toasted CPU: about 2 sec after removing cooler. (Tom’s Hardware Guide)

Energy Efficiency [Figure labels: ALUs, Pentium 4.]

Clock Speed [Figure labels: 3GHz, 6GHz, 10GHz.] Cannot rely on global signals (the clock is a global signal).

Instruction-Set Architecture [Figure: Software, ISA, Hardware.] VERY rigid to changes (e.g., x86 vs. Itanium).

Our Proposal • ASH addresses these problems • ASH is not a panacea • ASH is “complementary” to the CPU [Figure: CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation), with $ (cache) and memory.]

Outline • Problems of current architectures • CASH: Compiling ASH • program representation • compiling C programs • Pipelining • ASH Evaluation • Conclusions

Application-Specific Hardware [Figure: C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw. Other labels: SW, HW, ISA, Dataflow machine.]

Application-Specific Hardware [Figure: C program → Compiler → Dataflow IR → SW backend → Machine code → CPU [predication]. Other label: Soft.]

Key: Intermediate Representation Traditionally: CFG, def-use, may-dep., ... Our IR: • SSA + predication + speculation • Uniform for scalars and memory • Explicitly encodes may-depend • Executable • Precise semantics • Dataflow IR • Close to asynchronous target

Computation = Dataflow
Programs:
    x = a & 7;
    ...
    y = x >> 2;
Circuits: [Figure: a and 7 feed an & unit whose output is x; x and 2 feed a >> unit.]
• Operations ⇒ functional units • Variables ⇒ wires • No interpretation
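The mapping from operations to functional units can be mimicked in plain C. The sketch below is only an illustration, not CASH output: the names `and_unit`, `shr_unit`, and `circuit_y` are hypothetical, and each function stands in for one dedicated hardware unit, with the local variable playing the role of the wire named x.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical functional units: one dedicated unit per operation. */
static uint32_t and_unit(uint32_t lhs, uint32_t rhs) {
    return lhs & rhs;
}

static uint32_t shr_unit(uint32_t value, uint32_t amount) {
    assert(amount < 32); /* a 32-bit shift by 32 or more is undefined in C */
    return value >> amount;
}

/* The "circuit" for:  x = a & 7;  y = x >> 2;
 * No instruction is fetched or decoded; values simply flow along wires. */
uint32_t circuit_y(uint32_t a) {
    uint32_t wire_x = and_unit(a, 7u); /* wire x connects & to >> */
    return shr_unit(wire_x, 2u);       /* the output wire carries y */
}
```

For example, `circuit_y(13)` yields 1, since 13 & 7 is 5 and 5 >> 2 is 1.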

Basic Computation [Figure: a + unit followed by a latch, with data, valid, and ack signals.]

Asynchronous Computation [Figure: steps 1 to 8 of a chain of + units, each with a latch and data, valid, and ack signals.]
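The data/valid/ack signals on these slides suggest a handshake between a producer and its consumer. The following is a minimal software sketch of such a handshake, assuming a single-entry output latch; the type `channel_t` and the functions `channel_send` and `channel_receive` are hypothetical names, and the real circuit protocol may differ in its details.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One output latch of an operation (hypothetical model). */
typedef struct {
    int  data;  /* latched result */
    bool valid; /* producer signals that data is ready */
} channel_t;

/* Producer side: may write only after the previous value was acknowledged. */
bool channel_send(channel_t *ch, int value) {
    if (ch->valid) {
        return false; /* previous data not yet acked: the producer stalls */
    }
    ch->data = value;
    ch->valid = true;
    return true;
}

/* Consumer side: taking the data acts as the ack and frees the latch. */
bool channel_receive(channel_t *ch, int *out) {
    assert(out != NULL);
    if (!ch->valid) {
        return false; /* nothing to consume yet */
    }
    *out = ch->data;
    ch->valid = false; /* ack: the producer may send again */
    return true;
}
```

No clock appears anywhere in this model: progress depends only on the local valid and ack events between neighbours.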

Distributed Control Logic [Figure labels: global FSM; + and - units; rdy, ack.] • short, local wires • asynchronous control

Outline • Problems of current architectures • CASH: Compiling ASH • program representation • compiling C programs • Pipelining • ASH Evaluation • Conclusions

Conditionals ⇒ Speculation
    if (x > 0)
        y = -x;
    else
        y = b*x;
[Figure: dataflow circuit with inputs b, x, 0; units *, -, >, !; a mux producing y; the critical path is marked.]
• SSA = no arbitration • MUX: forward branches
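A rough C rendering of this transformation: both arms of the conditional are computed unconditionally, and the predicate only selects a result, as a mux would. This is a sketch under assumptions, not the compiler's actual output. The function names are hypothetical, and the arithmetic is widened to 64 bits so that the speculated `-x` cannot overflow in C, a concern the hardware version does not share.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reference version: ordinary control flow, as on the slide. */
int64_t y_branching(int32_t x, int32_t b) {
    int64_t y;
    if (x > 0)
        y = -(int64_t)x;
    else
        y = (int64_t)b * x;
    return y;
}

/* Speculated version: both arms execute; the predicate drives a mux. */
int64_t y_speculated(int32_t x, int32_t b) {
    bool    pred    = x > 0;          /* the > unit */
    int64_t negated = -(int64_t)x;    /* the - unit, always evaluated */
    int64_t product = (int64_t)b * x; /* the * unit, always evaluated */
    return pred ? negated : product;  /* the mux selects y */
}

/* Checks that both versions agree over a small range of inputs. */
void check_equivalence(void) {
    for (int32_t x = -50; x <= 50; x++) {
        for (int32_t b = -5; b <= 5; b++) {
            assert(y_speculated(x, b) == y_branching(x, b));
        }
    }
}
```

Speculation of this kind is safe here because neither arm has side effects; the result of the arm that was not selected is simply discarded.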

Control Flow ⇒ Data Flow [Figure: three node types. Split (branch): data and predicate p. Merge (label): data inputs. Gateway: data and predicate.]

Loops
    int sum=0, i;
    for (i=0; i < 100; i++)
        sum += i*i;
    return sum;
[Figure: dataflow circuit for the loop, with a loop for i (0, +1, < 100), a loop for sum (0, *, +), and ret.]
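One way to read the loop circuit is as two loop-carried values, i and sum, steered by a single predicate. The sketch below rewrites the slide's loop in that style. It is an illustration only: the function name and the `limit` parameter are hypothetical, and the bound on `limit` is an added assumption that keeps the sum within the range of `int`.

```c
#include <assert.h>
#include <stdbool.h>

/* Sum of i*i for i in [0, limit), written with an explicit loop predicate. */
int sum_squares_dataflow(int limit) {
    assert(limit >= 0 && limit <= 1000); /* assumed bound: avoids int overflow */
    int i = 0;   /* initial value entering i's loop */
    int sum = 0; /* initial value entering sum's loop */
    for (;;) {
        bool pred = i < limit; /* the < unit computes the loop predicate */
        if (!pred) {
            return sum;        /* predicate false: sum is steered to ret */
        }
        sum = sum + i * i;     /* predicate true: values go around the back edges */
        i = i + 1;
    }
}
```

With `limit` set to 100, as on the slide, the result is 328350.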

Predication and Side-Effects [Figure: a Load with addr, pred, and token inputs and data and token outputs, connected to memory.] • sequencing of side-effects • no speculation
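A possible software model of such a load is sketched below. It assumes that the token input orders the load after earlier side effects, that the predicate decides whether memory is touched at all, and that the token is passed on even when the load is squashed. These semantics, and the names `load_result_t` and `predicated_load`, are assumptions made for illustration rather than details stated on the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  data;      /* loaded value; meaningful only if executed is true */
    bool executed;  /* whether memory was actually accessed */
    bool token_out; /* token released to dependent memory operations */
} load_result_t;

/* Hypothetical predicated load over a small array standing in for memory. */
load_result_t predicated_load(const int *memory, size_t size, size_t addr,
                              bool pred, bool token_in) {
    load_result_t result = {0, false, false};
    if (!token_in) {
        return result; /* earlier side effects are not done: wait */
    }
    if (pred) {
        assert(addr < size); /* the access happens only under a true predicate */
        result.data = memory[addr];
        result.executed = true;
    }
    result.token_out = true; /* assumed: the token is passed on either way */
    return result;
}
```

Under a false predicate the address is never dereferenced, so an out-of-range `addr` is harmless in that case, which is the point of not speculating side-effecting operations.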

Memory Access [Figure: LD and ST operations reach a monolithic memory through a pipelined, arbitrated network. Labels: local communication, global structures, complexity, related work.] Future work: fragment this!

CASH Optimizations
• SSA-based optimizations: unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining
• Memory optimizations: dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling
• Boolean optimizations: Espresso CAD tool, bitwidth analysis

Outline • Problems of current architectures • Compiling ASH • Pipelining • Evaluation: CASH vs. clocked designs • Conclusions

Pipelining
    int sum=0, i;
    for (i=0; i < 100; i++)
        sum += i*i;
    return sum;
[Figure, step 1: the loop circuit with a pipelined multiplier (8 stages). Units: 1, + and 100, <= on i; *, + on sum.]

Pipelining [Figure, step 2: the same loop circuit.]

Pipelining [Figure, step 5: the same loop circuit, with values i=0 and i=1 in flight.]

Pipelining [Figure, step 6: the same loop circuit, with values i=0 and i=1 in flight.]

Pipelining [Figure, step 7: the same loop circuit. Labels: i’s loop, sum’s loop, long-latency pipe, predicate.]

Pipelining [Figure: the same loop circuit with the critical path marked. Labels: i’s loop, sum’s loop.] The predicate ack edge is on the critical path.

Pipeline balancing [Figure, step 7: the loop circuit with a decoupling FIFO added. Labels: i’s loop, sum’s loop.]

Pipeline balancing [Figure: the loop circuit with the decoupling FIFO and the critical path marked. Labels: i’s loop, sum’s loop.]
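A decoupling FIFO can be modelled as a small bounded queue: the fast side keeps producing until the queue fills, instead of waiting on every acknowledgement from the slow side. The sketch below is a generic ring buffer, not the circuit from the talk. The depth of 8 is an assumption, chosen only to echo the 8-stage multiplier mentioned earlier, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_DEPTH 8 /* assumed depth, for illustration only */

typedef struct {
    int    slots[FIFO_DEPTH];
    size_t head;  /* index of the oldest element */
    size_t count; /* number of elements currently stored */
} fifo_t;

/* Producer side: succeeds until the FIFO is full, then stalls. */
bool fifo_push(fifo_t *fifo, int value) {
    if (fifo->count == FIFO_DEPTH) {
        return false; /* full: only now does the producer have to wait */
    }
    fifo->slots[(fifo->head + fifo->count) % FIFO_DEPTH] = value;
    fifo->count++;
    return true;
}

/* Consumer side: removes the oldest element, if any. */
bool fifo_pop(fifo_t *fifo, int *out) {
    assert(out != NULL);
    if (fifo->count == 0) {
        return false; /* empty: the consumer waits */
    }
    *out = fifo->slots[fifo->head];
    fifo->head = (fifo->head + 1) % FIFO_DEPTH;
    fifo->count--;
    return true;
}
```

With this buffer in place, the producer can run up to `FIFO_DEPTH` values ahead of the consumer, which takes the slow acknowledgement off the producer's cycle.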

Outline • Problems of current architectures • Compiling ASH • Pipelining • Evaluation: CASH vs. clocked designs • Conclusions

Evaluating ASH [Figure: tool flow. Mediabench kernels (1 hot function/benchmark) in C → CASH core → Verilog back-end → Synopsys, Cadence P/R (180nm std. cell library, 2V, ~1999 technology) → ASIC; ModelSim (Verilog simulation) with Mem → performance numbers.]

ASH Area [Figure: normalized area. Reference labels: minimal RISC core; P4: 217.]

ASH vs 600MHz CPU [0.18 μm]

Bottleneck: Memory Protocol [Figure: LD and ST operations, the LSQ, and memory.] • Token release to dependents requires a round trip to memory. • Limit study: zero-time round trip ⇒ up to 6x speed-up. • Exploring a protocol for in-order data delivery & fast token release.

Power [Figure: reference values. Xeon [+cache]: 67000; μP: 4000; DSP: 110.]

Energy Efficiency [Figure: energy efficiency in operations/nJ on a log scale from 0.01 to 1000, covering microprocessors, general-purpose DSP, asynchronous μP, FPGAs, ASH media kernels, and dedicated hardware; a 1000x span is marked.]

Outline • Problems of current architectures • Compiling ASH • Pipelining • ASH Evaluation • Future/related work & conclusions

Related Work [Figure labels: Asynchronous circuits, Nanotechnology, Dataflow machines, Embedded systems, High-level synthesis, Reconfigurable computing, Computer architecture, Compilation.]

Future Work • Optimizations for area/speed/power • Memory partitioning • Concurrency • Compiler-guided layout • Explore extensible ISAs • Hybridization with superscalar mechanisms • Reconfigurable hardware support for ASH • Formal verification

Spatial Computation