Scalar Operand Networks for Tiled Microprocessors

Michael Taylor
Raw Architecture Project
MIT CSAIL (now at UCSD)

Until 3 years ago, computer architects used the N-way superscalar (or VLIW) to encapsulate the ideal for a parallel processor: nearly “perfect” but not attainable, whether scheduling is done by a hardware scheduler or by the compiler.

mul $2,$3,$4
add $6,$5,$2
What’s great about superscalar microprocessors?
• It’s the networks!
• Fast, low-latency, tightly coupled networks (0-1 cycles of latency, no occupancy)
• For lack of a better name, let’s call them Scalar Operand Networks (SONs)
• Can we combine the benefits of superscalar communication with multicore scalability?
• Can we build Scalable Scalar Operand Networks?
(I agree with Jose: “We need low-latency tightly-coupled … network interfaces” – Jose Duato, OCIN, Dec 6, 2006)

The industry shift toward Multicore - attainable but hardly ideal

What we’d like – neither superscalar nor multicore
• Superscalars have fast networks and great usability
• Multicore has great scalability and efficiency

Why communication is expensive on multicore
[Figure: a message travelling from Multiprocessor Node 1 to Multiprocessor Node 2, labeled with send occupancy, send latency, send overhead, transport cost, receive latency, receive occupancy, and receive overhead]

Multiprocessor SON Operand Routing: Multiprocessor Node 1 (sender)
Contributors to send occupancy and send latency: destination node name, sequence number, value, launch sequence, commit latency, network injection

Multiprocessor SON Operand Routing: Multiprocessor Node 2 (receiver)
Contributors to receive occupancy and receive latency: receive sequence, demultiplexing, branch mispredictions, injection cost
Shared-memory multiprocessors have similar overheads: store instructions, commit latency, spin locks (plus attendant branch mispredictions)

Defining a figure of merit for scalar operand networks
5-tuple <SO, SL, NHL, RL, RO>:
• SO: Send Occupancy
• SL: Send Latency
• NHL: Network Hop Latency
• RL: Receive Latency
• RO: Receive Occupancy
Tip: Ordering follows timing of message from sender to receiver
We can use this metric to quantitatively differentiate SONs from existing multiprocessor networks…

Impact of Occupancy (“o” = so + ro)
if (o * “surface area” > “volume”): not worth it to offload; overhead too high (parallelism too fine-grained)
Impact of Latency
The lower the latency, the less work is needed to keep myself busy while waiting for the answer. If the latency is too high: not worth it to offload; could have done it myself faster (not enough parallelism to hide latency)
[Figure: timelines for Proc 0 and Proc 1, with a gap labeled “nothing to do”]
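The occupancy test on this slide can be written as a small check. This is a minimal sketch, assuming that “surface area” counts the operands exchanged with the other processor and “volume” counts the cycles of work offloaded; the function name and those units are illustrative choices, not definitions from the talk.

```python
def worth_offloading(send_occ, recv_occ, surface_area, volume):
    """Return True when offloading passes the slide's occupancy test.

    Assumptions (not from the talk): surface_area is the number of
    operands exchanged, volume is the number of cycles of offloaded work.
    """
    o = send_occ + recv_occ            # "o" = so + ro
    return o * surface_area <= volume  # otherwise the overhead is too high


# Occupancies taken from the deck's Power4 tuple <2, 14, 0, 14, 4>: so=2, ro=4
print(worth_offloading(2, 4, surface_area=3, volume=10))
# Zero-occupancy network, as in the superscalar tuple <0, 0, 0, 0, 0>
print(worth_offloading(0, 0, surface_area=3, volume=10))
```

With nonzero occupancy the same fine-grained task fails the test, which is the “parallelism too fine-grained” case.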

The interesting region
Power4 <2, 14, 0, 14, 4> (on-chip)
Superscalar <0, 0, 0, 0, 0> (not scalable)

Tiled Microprocessors (or “Tiled Multicore”) (w/ scalable SON)

Transforming from multicore or superscalar to tiled
• Superscalar to Tiled: add scalability
• CMP/multicore to Tiled: add scalable SON

The interesting region
Power4 <2, 14, 0, 14, 4> (on-chip)
Raw <0, 0, 1, 2, 0>
Tiled “Famous Brand 2” <0, 0, 1, 0, 0>
Superscalar <0, 0, 0, 0, 0> (not scalable)
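To compare these tuples numerically, one option is to add up the five components along the path of a message. The sketch below assumes a purely additive cost model, with the hop latency multiplied by the number of hops; that model and the names `SON` and `operand_cost` are assumptions made for illustration. Only the tuple values come from the slide.

```python
from typing import NamedTuple


class SON(NamedTuple):
    so: int   # send occupancy
    sl: int   # send latency
    nhl: int  # network hop latency (per hop)
    rl: int   # receive latency
    ro: int   # receive occupancy


def operand_cost(son, hops):
    """Cycles to deliver one operand, under an assumed additive model."""
    return son.so + son.sl + son.nhl * hops + son.rl + son.ro


# Tuple values as listed on the slide
NETWORKS = {
    "Power4 (on-chip)": SON(2, 14, 0, 14, 4),
    "Raw": SON(0, 0, 1, 2, 0),
    "Superscalar": SON(0, 0, 0, 0, 0),
}

for name, son in NETWORKS.items():
    print(name, operand_cost(son, hops=1))
```

Under this model the gap between the Power4 tuple and the Raw tuple comes almost entirely from the send and receive terms, not from the network hop itself.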

Scalability Problems in Wide Issue Microprocessors
[Figure: wide-issue processor with PC, Control, Wide Fetch (16 inst), RF, 16 ALUs, Bypass Net, and Unified Load/Store Queue]

Area and Frequency Scalability Problems
[Figure: N ALUs with RF and Bypass Net, annotated ~N² and ~N³. Ex: Itanium 2]
Without modification, freq decreases linearly or worse.

Operand Routing is Global
[Figure: 16 ALUs with RF and Bypass Net; a + instruction and a >> instruction are marked]

Idea: Make Operand Routing Local
[Figure: 16 ALUs with RF and Bypass Net]

Idea: Exploit Locality
[Figure: 16 ALUs with RF and Bypass Net]

Replace the crossbar with a point-to-point, pipelined, routed scalar operand network.
[Figure: 16 ALUs and RF connected by the routed network]

Operand Transport Scaling – Bandwidth and Area
For N ALUs and N½ bisection bandwidth:
[Figure annotations: “as in conventional superscalar”, “Scales as 2-D VLSI”]
We can route more operands per unit time if we are able to map communicating instructions nearby.

Operand Transport Scaling – Latency
Time for operand to travel between instructions mapped to different ALUs.
Latency bonus if we map communicating instructions nearby so communication is local.
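The latency bonus for nearby placement can be illustrated with a hop count. This is a minimal sketch, assuming the ALUs sit on a 2-D mesh, operands take a shortest path, and each hop costs a fixed number of cycles. The coordinates and function names are hypothetical, and the default of one cycle per hop is only an example value.

```python
def hops(a, b):
    """Manhattan distance between two (row, col) positions on a 2-D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def transport_latency(a, b, hop_latency=1):
    """Cycles spent in the network, assuming a fixed cost per hop."""
    return hops(a, b) * hop_latency


# Communicating instructions mapped to neighboring ALUs
print(transport_latency((0, 0), (0, 1)))
# The same pair mapped to opposite corners of a 4x4 grid
print(transport_latency((0, 0), (3, 3)))
```

In this model the corner-to-corner mapping costs six times as many network cycles as the neighboring one, which is why placement matters.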

Distribute the Register File
[Figure: 16 ALUs, each paired with its own RF]

[Figure: the ALUs with distributed RFs are marked SCALABLE; PC, Control, Wide Fetch (16 inst), and Unified Load/Store Queue are still shown]

More Scalability Problems
[Figure: PC, Wide Fetch (16 inst), Control, Unified Load/Store Queue]

Distribute the rest: Raw – a Fully-Tiled Microprocessor
[Figure: 16 copies each of PC, RF, I$, ALU, and D$, shown alongside the Wide Fetch (16 inst), Control, and Unified Load/Store Queue]

Tiles!
[Figure: 16 tiles, each containing a PC, RF, I$, ALU, and D$]

Tiled Microprocessors
• fast inter-tile communication through SON
• easy to scale (same reasons as multicore)

Outline
1. Scalar Operand Network and Tiled Microprocessor intro
2. Raw Architecture + SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network.

Raw Microprocessor
Tiled scalable microprocessor
Point-to-point pipelined networks
16 tiles, 16 issue
Each 4 mm x 4 mm tile:
• MIPS-style compute processor
 - single-issue 8-stage pipe
 - 32b FPU
 - 32K D Cache, I Cache
• 4 on-chip networks
 - two for operands
 - one for cache misses
 - one for message passing

Raw Microprocessor Components
[Figure labels: Compute Processor (Fetch Unit, Instruction Cache, Functional Units, Execution Core, Data Cache, Intra-tile SON); Switch Processor (Instruction Cache, Static Router, two Crossbars, Inter-tile SON); Generalized Transport Networks (Dynamic Router “MDN”, Trusted Core; Dynamic Router “GDN”, Untrusted Core); Inter-tile Network Links]

Raw Compute Processor Internals
Ex: fadd r24, r25, r26
[Figure: pipeline diagram with labels E, M1, M2, A, TL, RF, IF, D, F, P, U and registers r24, r25, r26, r27]

Tile-Tile Communication
add $25,$1,$2
Route $P->$E
Route $W->$P
sub $20,$1,$25
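The four instructions on this slide can be traced with a toy model. This is a sketch under stated assumptions, not a description of the Raw hardware: each queue stands for one port named on the slide, writing or reading $25 is assumed to mean sending to or receiving from the network, and the register values are made up for the example.

```python
from collections import deque

# Hypothetical model: one queue per port named in the route instructions.
sender_P = deque()    # sender processor to sender switch ($P)
link = deque()        # sender switch East port, receiver switch West port
receiver_P = deque()  # receiver switch to receiver processor ($P)


def route(src, dst):
    """One static-router instruction: move a single operand from src to dst."""
    dst.append(src.popleft())


# Sender tile: add $25,$1,$2 (assumed: writing $25 injects the result)
regs_a = {1: 7, 2: 5}            # made-up register contents
sender_P.append(regs_a[1] + regs_a[2])

route(sender_P, link)            # sender switch:   Route $P->$E
route(link, receiver_P)          # receiver switch: Route $W->$P

# Receiver tile: sub $20,$1,$25 (assumed: reading $25 consumes the operand)
regs_b = {1: 20}                 # made-up register contents
regs_b[20] = regs_b[1] - receiver_P.popleft()
print(regs_b[20])
```

No instruction in the trace is spent building or parsing a message; the operand moves with one route instruction per switch.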

Compilation
RawCC assigns instructions to the tiles, maximizing locality. It also generates the static router instructions that transfer operands between tiles.
Source:
tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
….
[Figure: the renamed instructions below, drawn twice as graph nodes]
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
v2.4=v2
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
v3.10=tmp3.6-v2.7
v3=v3.10
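To give a feel for what “assigning instructions to tiles, maximizing locality” involves, here is a small greedy placement. It is a hypothetical heuristic written for illustration and is not RawCC's algorithm. The `place` function, the per-tile capacity limit, and the two-tile grid are all assumptions, and the dependency list uses only two chains from the slide, with their dependence on seed.0 left out.

```python
def place(deps, tiles, capacity):
    """Greedy sketch: put each instruction on the free tile nearest its producers.

    deps maps each instruction to its producer instructions and must be
    listed in dependency order. tiles is a list of (row, col) coordinates.
    """
    where = {}
    load = {t: 0 for t in tiles}

    def cost(tile, instr):
        # Total Manhattan distance from this tile to the producers' tiles
        return sum(abs(tile[0] - where[p][0]) + abs(tile[1] - where[p][1])
                   for p in deps[instr])

    for instr in deps:
        free = [t for t in tiles if load[t] < capacity]
        # Prefer low communication cost, then the less loaded tile
        best = min(free, key=lambda t: (cost(t, instr), load[t], t))
        where[instr] = best
        load[best] += 1
    return where


# Two chains taken from the slide (simplified: seed.0 inputs omitted)
deps = {
    "pval5": [], "pval4": ["pval5"], "tmp3.6": ["pval4"],
    "pval1": [], "pval0": ["pval1"], "tmp0.1": ["pval0"],
}
placement = place(deps, tiles=[(0, 0), (0, 1)], capacity=3)
print(placement)
```

Each chain ends up on a single tile, so the operands passed along a chain never have to cross the network.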

One cycle in the life of a tiled micro
[Figure labels: mem (three times), Direct I/O stream into Scalar Operand Network, 2-thread MPI app, 4-way automatically parallelized C program, httpd, Zzz...]
An application uses only as many tiles as needed to exploit the parallelism intrinsic to that application…

One Streaming Application on Raw
very different traffic patterns than RawCC-style parallelization
[Figure: traffic among Tile 0 through Tile 15]

Auto-Parallelization Approach #2: StreamIt Language + Compiler
[Figure: stream graph built from Splitter, FIRFilter, Vec Mult, Magnitude Detector, and Joiner nodes, shown in two versions: “Original” and “After fusion”]

End Results – auto-parallelized by MIT StreamIt to 8 tiles.
[Figure: final stream graph of FIRFilter, Vec Mult, Magnitude Detector, and Joiner nodes]
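The Splitter, FIRFilter, and Joiner nodes in these graphs form a split-join pattern. The sketch below shows that pattern in plain Python and is not StreamIt syntax. The round-robin split and join policy, the two-tap filter, and all function names are assumptions chosen to keep the example small.

```python
def fir(samples, taps):
    """Direct-form FIR filter; output i uses samples[i-len(taps)+1 .. i]."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, t in enumerate(taps):
            if i - k >= 0:          # samples before the start count as zero
                acc += t * samples[i - k]
        out.append(acc)
    return out


def split_round_robin(items, ways):
    """Deal items out to `ways` branches, one item at a time."""
    return [items[w::ways] for w in range(ways)]


def join_round_robin(branches):
    """Interleave the branch outputs back into one stream."""
    out = []
    for group in zip(*branches):
        out.extend(group)
    return out


data = [1.0, 2.0, 3.0, 4.0]
branches = split_round_robin(data, 2)
filtered = [fir(b, [0.5, 0.5]) for b in branches]  # each branch could run on its own tile
result = join_round_robin(filtered)
print(result)
```

Each branch touches only its own data between the splitter and the joiner, which is what lets a compiler map branches to separate tiles.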

AsTrO Taxonomy: Classifying SON diversity
• Assignment (Static/Dynamic): Is instruction assignment to ALUs predetermined?
• Transport (Static/Dynamic): Are operand routes predetermined?
• Ordering (Static/Dynamic): Is the execution order of instructions assigned to a node predetermined?
[Figure: eight ALUs executing +, /, &, % and >> instructions]
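The three questions map naturally onto a three-field record. This is a minimal sketch of one way to encode a classification. The type names and the three-letter code are illustrative conventions, and the example value describes a generic design in which all three decisions are fixed ahead of time, not a specific processor.

```python
from enum import Enum
from typing import NamedTuple


class Mode(Enum):
    STATIC = "S"
    DYNAMIC = "D"


class AsTrO(NamedTuple):
    assignment: Mode  # is instruction-to-ALU assignment predetermined?
    transport: Mode   # are operand routes predetermined?
    ordering: Mode    # is per-node execution order predetermined?

    def code(self):
        """Three-letter shorthand, e.g. 'SDS' (a hypothetical convention)."""
        return "".join(m.value for m in self)


# A design where all three decisions are fixed ahead of time
fully_static = AsTrO(Mode.STATIC, Mode.STATIC, Mode.STATIC)
print(fully_static.code())
```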

Microprocessor SON diversity using AsTrO taxonomy
[Figure: tree branching Static/Dynamic at each of the Assignment, Transport, and Ordering levels, with TRIPS, WaveScalar, RawDyn, Raw, Scale, and ILDP as leaves]

Outline
1. Scalar Operand Network and Tiled Microprocessor intro
2. Raw Architecture + SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network.

Raw Chips October 02

Raw
• 16 tiles (16 issue)
• 180 nm ASIC (IBM SA-27E)
• ~100 million transistors
• 1 million gates
• 3-4 years of development
• 1.5 years of testing
• 200K lines of test code
• Core Frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V
• Frequency competitive with IBM-implemented PowerPCs in same process.
• 18W average power

Raw motherboard
Support chipset implemented in FPGA
