Spatial Computation

Spatial Computation Mihai Budiu CMU CS Thesis committee: Seth Goldstein Peter Lee Todd Mowry Babak Falsafi Nevin Heintze Ph.D. Thesis defense, December 8, 2003 SCS

Spatial Computation A model of general-purpose computationbased on Application-Specific Hardware. Thesis committee: Seth Goldstein Peter Lee Todd Mowry Babak Falsafi Nevin Heintze Ph.D. Thesis defense, December 8, 2003 SCS

Thesis Statement Application-Specific Hardware (ASH): • can be synthesized by adapting software compilation for predicated architectures, • provides high-performance for programs withhigh ILP, with very low power consumption, • is a more scalable and efficient computation substrate than monolithic processors. not!

Outline • Introduction • Compiling for ASH • Media processing on ASH • ASH vs. superscalar processors • Conclusions

CPU Problems • Complexity • Power • Global Signals • Limited ILP

Design Complexity from Michael Flynn’s FCRC 2003 talk

Communication vs. Computation wire gate 5ps 20ps Power consumption on wires is also dominant

Our Approach: ASHApplication-Specific Hardware

Resource Binding Time 1. 1. Programs 2. 2. Programs CPU ASH

Hardware Interface software software ISA virtual ISA gates hardware hardware CPU ASH

Application-Specific Hardware C program Dataflow IR Compiler dataflow machine Reconfigurable/custom hw

systems theory Contributions Computerarchitecture Embeddedsystems Reconfigurablecomputing Compilation Asynchronouscircuits High-levelsynthesis Nanotechnology Dataflowmachines

Outline • Introduction • CASH: Compiling for ASH • Media processing on ASH • ASH vs. superscalar processors • Conclusions

Computation = Dataflow Programs Circuits a 7 x = a & 7; ... y = x >> 2; & 2 x >> • Operations ) functional units • Variables ) wires • No interpretation

Basic Operation + latch data ack valid

+ + + 2 3 4 + + + + latch 5 6 7 8 Asynchronous Computation + data ack valid 1

FSM Distributed Control Logic ack rdy + - short, local wires asynchronous control

Forward Branches b x 0 if (x > 0) y = -x; else y = b*x; * - > ! y critical path Conditionals ) Speculation

p ! Split (branch) Control Flow ) Data Flow data Merge (label) data data predicate Gateway

0 i * 0 +1 < 100 sum + return sum; ! ret Loops int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum;

sequencing of side-effects no speculation Predication and Side-Effects addr token to memory Load pred data token

Thesis Statement Application-Specific Hardware: • can be synthesized by adapting software compilation for predicated architectures, • provides high-performance for programs withhigh ILP, with very low power consumption, • is a more scalable and efficient computation substrate than monolithic processors. not!

Outline • Introduction • CASH: Compiling for ASH • An optimization on the SIDE • Media processing on ASH • ASH vs. superscalar processors • Conclusions skip to

Availability Dataflow Analysis y y = a*b; ... if (x) { ... ... = a*b; }

Dataflow Analysis Is Conservative if (x) { ... y = a*b; } ... ... = a*b; y?

Static Instantiation, Dynamic Evaluation flag = false; if (x) { ... y = a*b; flag = true; } ... ... = flag ? y : a*b;

SIDE Register Promotion Impact Loads % reduction Stores

Performance Evaluation Mem L2 1/4M ASH L1 8K LSQ limited BW CPU: 4-way OOO Assumption: all operations have the same latency.

Media Kernels, vs 4-way OOO

Media Kernels, IPC

Speed-up / IPC Correlation

Low-Level Evaluation C CASHcore Results shown so far. All results in thesis. Verilog back-end Synopsys,Cadence P/R 180nm std. cell library, 2V ~1999 technology Results in the next two slides. ASIC

Area Reference: P4 in 180nm has 217mm2

Power vs 4-way OOO superscalar, 600 Mhz, with clock gating (Wattch), ~ 6W

Outline • Introduction • CASH: Compiling for ASH • Media processing on ASH • dataflow pipelining • ASH vs. superscalar processors • Conclusions skip to

i 1 Pipelining + * 100 <= int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; pipelined multiplier (8 stages) sum + cycle=1

i 1 Pipelining + * 100 <= sum + cycle=2

i 1 Pipelining + i=1 100 <= i=0 sum + cycle=5 pipeline balancing

wrong! This Is Obvious! ASH runs at full dataflow speed, so CPU cannot do any better(if compilers equally good).

SpecInt95, ASH vs 4-way OOO

ASH crit path CPU crit path Predicted not taken Effectively a noop for CPU! result available before inputs Predicted taken. Branch Prediction i 1 + for (i=0; i < N; i++) { ... if (exception) break; } < exception ! &

SpecInt95, perfect prediction

ASH Problems • Both branch and join not free • Static dataflow(no re-issue of same instr) • Memory is “far” • Fully static • No branch prediction • No dynamic unrolling • No register renaming • Calls/returns not lenient • ...

Outline Introduction • CASH: Compiling for ASH • Media processing on ASH • ASH vs. superscalar processors • Conclusions

Spatial Computation