Scalar Operand Networks for Tiled Microprocessors

Presentation Transcript

  1. Scalar Operand Networks for Tiled Microprocessors. Michael Taylor, Raw Architecture Project, MIT CSAIL (now at UCSD)

  2. Until 3 years ago, computer architects used the N-way superscalar (or VLIW), scheduled by a hw scheduler or compiler, to encapsulate the ideal for a parallel processor: nearly “perfect” but not attainable.

  3. What’s great about superscalar microprocessors? It’s the networks! Example: mul $2,$3,$4 feeding add $6,$5,$2. • Fast, low-latency, tightly-coupled networks (0-1 cycles of latency, no occupancy) • For lack of a better name, let’s call them Scalar Operand Networks (SONs) • Can we incorporate the benefits of superscalar communication + multicore scalability? • Can we build Scalable Scalar Operand Networks? (I agree with Jose: “We need low-latency tightly-coupled … network interfaces”, Jose Duato, OCIN, Dec 6, 2006)

  4. The industry shift toward Multicore - attainable but hardly ideal

  5. What we’d like: neither superscalar nor multicore. Superscalars have fast networks and great usability. Multicore has great scalability and efficiency.

  6. Why communication is expensive on multicore. [Diagram labels: Multiprocessor Node 1, Multiprocessor Node 2; send occupancy, send latency, send overhead; Transport Cost; receive latency, receive occupancy, receive overhead.]

  7. Multiprocessor SON Operand Routing: Multiprocessor Node 1. [Diagram labels: send occupancy, send latency; Destination node name, Sequence number, Value; Launch sequence; Commit Latency; Network injection.]

  8. Multiprocessor SON Operand Routing: Multiprocessor Node 2. [Diagram labels: receive occupancy, receive latency; receive sequence; demultiplexing; branch mispredictions; injection cost.] Similar overheads apply for shared memory multiprocessors: store instr, commit latency, spin locks (+ attendant branch mispredicts).

  9. Defining a figure of merit for scalar operand networks. 5-tuple <SO, SL, NHL, RL, RO>: Send Occupancy, Send Latency, Network Hop Latency, Receive Latency, Receive Occupancy. Tip: Ordering follows timing of message from sender to receiver. We can use this metric to quantitatively differentiate SONs from existing multiprocessor networks…
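
As a rough illustration of how the 5-tuple could be used, the sketch below adds up the components for one operand travelling a given number of hops. The additive cost model, the `SonCost` class and the `end_to_end` method are assumptions made for this example, not something stated on the slide. The example tuples are the ones quoted on the "interesting region" slides of this transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SonCost:
    """5-tuple <SO, SL, NHL, RL, RO>, all in cycles (hypothetical helper)."""
    so: int   # send occupancy
    sl: int   # send latency
    nhl: int  # network hop latency, paid once per hop
    rl: int   # receive latency
    ro: int   # receive occupancy

    def end_to_end(self, hops: int) -> int:
        # Assumed model: components add up in message order, sender to receiver.
        return self.so + self.sl + self.nhl * hops + self.rl + self.ro


# Tuples as quoted elsewhere in this transcript.
POWER4 = SonCost(2, 14, 0, 14, 4)
RAW = SonCost(0, 0, 1, 2, 0)
SUPERSCALAR = SonCost(0, 0, 0, 0, 0)

for name, net in [("Power4", POWER4), ("Raw", RAW), ("Superscalar", SUPERSCALAR)]:
    print(name, net.end_to_end(hops=1))
```

Under this assumed model, a single-hop operand costs 3 cycles with the Raw tuple and 34 with the Power4 tuple, which is one way to read the gap the slides describe.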

  10. Impact of Occupancy (“o” = so+ro): if (o * “surface area” > “volume”) => not worth it to offload: overhead too high (parallelism too fine-grained). Impact of Latency: the lower the latency, the less work needed to keep myself busy waiting for answer. When latency is too high => not worth it to offload: could have done it myself faster (not enough parallelism to hide latency). [Diagram labels: Proc 0, Proc 1, nothing to do.]
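
The occupancy rule on this slide can be written as a one-line predicate. This is a minimal sketch: the function name `worth_offloading` is hypothetical, and reading "surface area" as the number of operands exchanged and "volume" as the cycles of work offloaded is an assumption, since the slide does not define either term.

```python
def worth_offloading(so: int, ro: int, surface_area: int, volume: int) -> bool:
    """Return False when o * surface_area > volume, as on the slide.

    Assumed interpretation (not from the slide):
      surface_area = number of operands that must be communicated
      volume       = cycles of computation handed to the other processor
    """
    o = so + ro  # "o" = so + ro
    return o * surface_area <= volume


# Occupancy of 6 cycles per operand, 3 operands, only 10 cycles of work.
print(worth_offloading(so=2, ro=4, surface_area=3, volume=10))
# Zero occupancy: any amount of work passes the occupancy test.
print(worth_offloading(so=0, ro=0, surface_area=3, volume=10))
```

The latency condition on the slide is separate and is not modelled here.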

  11. The interesting region. Power4 <2, 14, 0, 14, 4> (on-chip); Superscalar <0, 0, 0, 0, 0> (not scalable).

  12. Tiled Microprocessors (or “Tiled Multicore”) (w/ scalable SON)

  13. Tiled Microprocessors (or “Tiled Multicore”)

  14. Transforming from multicore or superscalar to tiled. Superscalar: add scalability => Tiled. CMP/multicore: add scalable SON => Tiled.

  15. The interesting region. Power4 <2, 14, 0, 14, 4> (on-chip); Raw <0, 0, 1, 2, 0>; Tiled “Famous Brand 2” <0, 0, 1, 0, 0>; Superscalar <0, 0, 0, 0, 0> (not scalable).

  16. Scalability Problems in Wide Issue Microprocessors. [Diagram: PC, Control, Wide Fetch (16 inst), RF, 16 ALUs, Bypass Net, Unified Load/Store Queue.]

  17. Area and Frequency Scalability Problems. [Diagram: N ALUs (16 shown), RF, Bypass Net; annotations ~N² and ~N³.] Ex: Itanium 2. Without modification, freq decreases linearly or worse.

  18. Operand Routing is Global. [Diagram: 16 ALUs, RF, Bypass Net; example operations + and >>.]

  19. Idea: Make Operand Routing Local. [Diagram: 16 ALUs, RF, Bypass Net.]

  20. Idea: Exploit Locality. [Diagram: 16 ALUs, RF, Bypass Net.]

  21. Replace the crossbar with a point-to-point, pipelined, routed scalar operand network. [Diagram: 16 ALUs, RF.]

  22. Replace the crossbar with a point-to-point, pipelined, routed scalar operand network. [Diagram: 16 ALUs, RF; example operations + and >>.]

  23. Operand Transport Scaling: Bandwidth and Area. For N ALUs and N^(1/2) bisection bandwidth: as in conventional superscalar; scales as 2-D VLSI. We can route more operands per unit time if we are able to map communicating instructions nearby.

  24. Operand Transport Scaling: Latency. Time for operand to travel between instructions mapped to different ALUs. Latency bonus if we map communicating instructions nearby so communication is local.
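
To make the "latency bonus" concrete, the sketch below counts hops between two tiles on a 2-D mesh and multiplies by a per-hop latency. The row-major tile numbering, the shortest-path (Manhattan) routing, the 4-wide mesh and the helper names `hops` and `transport_cycles` are all assumptions for illustration. The default of one cycle per hop matches the NHL entry of the Raw tuple quoted in this transcript.

```python
def hops(src: int, dst: int, width: int = 4) -> int:
    """Manhattan distance between two tiles numbered row-major on a mesh."""
    src_row, src_col = divmod(src, width)
    dst_row, dst_col = divmod(dst, width)
    return abs(src_row - dst_row) + abs(src_col - dst_col)


def transport_cycles(src: int, dst: int, hop_latency: int = 1, width: int = 4) -> int:
    """Cycles spent in the network only (send/receive costs not included)."""
    return hop_latency * hops(src, dst, width)


# Neighbouring tiles versus opposite corners of an assumed 4x4 mesh.
print(transport_cycles(0, 1))
print(transport_cycles(0, 15))
```

Mapping a producer and its consumer to adjacent tiles costs one hop here, while placing them in opposite corners costs six, which is the locality effect the slide points at.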

  25. Distribute the Register File. [Diagram: 16 ALUs with distributed RFs.]

  26. [Diagram: PC, Control, Wide Fetch (16 inst), 16 ALUs with distributed RFs (marked SCALABLE), Unified Load/Store Queue.]

  27. More Scalability Problems. [Diagram: PC, Wide Fetch (16 inst), Control, Unified Load/Store Queue.]

  28. Distribute the rest: Raw, a Fully-Tiled Microprocessor. [Diagram: 16 tiles, each with PC, RF, I$, ALU, D$; the Wide Fetch (16 inst), Control and Unified Load/Store Queue blocks are also shown.]

  29. Tiles! [Diagram: 16 tiles, each with PC, RF, I$, ALU, D$.]

  30. Tiles!

  31. Tiled Microprocessors • fast inter-tile communication through SON • easy to scale (same reasons as multicore)

  32. Outline 1. Scalar Operand Network and Tiled Microprocessor intro 2. Raw Architecture + SON 3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network.

  33. Raw Microprocessor: tiled scalable microprocessor; point-to-point pipelined networks; 16 tiles, 16 issue. Each 4 mm x 4 mm tile: MIPS-style compute processor (single-issue 8-stage pipe, 32b FPU, 32K D Cache, I Cache). 4 on-chip networks: two for operands, one for cache misses, one for message passing.

  34. Raw Microprocessor Components. [Diagram labels: Compute Processor, Fetch Unit, Instruction Cache, Functional Units, Execution Core, Data Cache; Intra-tile SON, Crossbar; Switch Processor, Instruction Cache, Static Router; Inter-tile SON, Crossbar; Generalized Transport Networks, Dynamic Router “MDN”, Trusted Core, Dynamic Router “GDN”, Untrusted Core; Inter-tile Network Links.]

  35. Raw Compute Processor Internals. Ex: fadd r24, r25, r26. [Pipeline diagram labels: IF, D, RF, E, M1, M2, A, TL, F, P, U; registers r24, r25, r26, r27, each shown twice.]

  36. Tile-Tile Communication add $25,$1,$2

  37. Tile-Tile Communication: add $25,$1,$2; Route $P->$E

  38. Tile-Tile Communication: add $25,$1,$2; Route $P->$E; Route $W->$P

  39. Tile-Tile Communication: add $25,$1,$2; Route $P->$E; Route $W->$P; sub $20,$1,$25
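
The four steps built up on slides 36 to 39 can be traced with a toy model. This is only a sketch under stated assumptions: the `Tile` class, its queue names and `run_example` are hypothetical, the two tiles are assumed to be east/west neighbours, and `$25` is assumed to behave as a network-mapped register (writing it sends an operand, reading it receives one). None of this models Raw's actual timing or hardware.

```python
from collections import deque


class Tile:
    """Toy tile: a register file plus FIFOs to and from its switch."""

    def __init__(self):
        self.regs = {}
        self.proc_out = deque()  # processor -> switch (the switch's $P input)
        self.proc_in = deque()   # switch -> processor
        self.west_in = deque()   # link arriving from the west neighbour


def run_example(a: int, b: int, c: int) -> int:
    left, right = Tile(), Tile()
    left.regs.update({1: a, 2: b})
    right.regs[1] = c

    # Left tile:    add $25,$1,$2  (assumed: writing $25 injects the result)
    left.proc_out.append(left.regs[1] + left.regs[2])
    # Left switch:  route $P->$E   (processor output leaves on the east port)
    right.west_in.append(left.proc_out.popleft())
    # Right switch: route $W->$P   (west input is handed to the processor)
    right.proc_in.append(right.west_in.popleft())
    # Right tile:   sub $20,$1,$25 (assumed: reading $25 consumes the operand)
    right.regs[20] = right.regs[1] - right.proc_in.popleft()
    return right.regs[20]


print(run_example(5, 7, 20))  # 20 - (5 + 7)
```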

  40. Compilation. RawCC assigns instructions to the tiles, maximizing locality. It also generates the static router instructions that transfer operands between tiles. Source: tmp3 = (seed*6+2)/3; v2 = (tmp1 - tmp3)*5; v1 = (tmp1 + tmp2)*3; v0 = tmp0 - v1; …. [Figure: renamed operations, each appearing twice in the figure: seed.0=seed; pval5=seed.0*6.0; pval1=seed.0*3.0; pval4=pval5+2.0; pval0=pval1+2.0; tmp3.6=pval4/3.0; tmp3=tmp3.6; tmp0.1=pval0/2.0; tmp0=tmp0.1; v1.2=v1; v2.4=v2; pval2=seed.0*v1.2; pval3=seed.0*v2.4; tmp1.3=pval2+2.0; tmp2.5=pval3+2.0; tmp1=tmp1.3; tmp2=tmp2.5; pval7=tmp1.3+tmp2.5; pval6=tmp1.3-tmp2.5; v1.8=pval7*3.0; v2.7=pval6*5.0; v0.9=tmp0.1-v1.8; v3.10=tmp3.6-v2.7; v0=v0.9; v1=v1.8; v2=v2.7; v3=v3.10.]
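
One way to see what "maximizing locality" buys is to count how many operands have to cross between tiles for a given assignment. The sketch below does that for six of the operations listed on this slide (with the `.N` rename suffixes dropped). It is not RawCC's algorithm: the `cross_tile_operands` helper and both two-tile assignments are made up for illustration, and `seed` is treated as a live-in available on every tile.

```python
# dest -> source operands, taken from the slide's operation list.
OPS = {
    "pval5": ["seed"],   # pval5 = seed * 6.0
    "pval4": ["pval5"],  # pval4 = pval5 + 2.0
    "tmp3": ["pval4"],   # tmp3  = pval4 / 3.0
    "pval1": ["seed"],   # pval1 = seed * 3.0
    "pval0": ["pval1"],  # pval0 = pval1 + 2.0
    "tmp0": ["pval0"],   # tmp0  = pval0 / 2.0
}


def cross_tile_operands(ops, assignment):
    """Count producer->consumer edges whose endpoints sit on different tiles."""
    count = 0
    for dest, sources in ops.items():
        for src in sources:
            # Live-ins such as `seed` have no producing tile and are skipped.
            if src in assignment and assignment[src] != assignment[dest]:
                count += 1
    return count


# Hypothetical assignment A: keep each dependence chain on one tile.
by_chain = {"pval5": 0, "pval4": 0, "tmp3": 0, "pval1": 1, "pval0": 1, "tmp0": 1}
# Hypothetical assignment B: alternate tiles along each chain.
interleaved = {"pval5": 0, "pval4": 1, "tmp3": 0, "pval1": 1, "pval0": 0, "tmp0": 1}

print(cross_tile_operands(OPS, by_chain))
print(cross_tile_operands(OPS, interleaved))
```

Every counted edge would need router instructions on the tiles involved, so the chain-preserving assignment needs none here while the interleaved one needs a transfer for each of its four dependences.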

  41. One cycle in the life of a tiled micro. [Diagram labels: mem, mem, mem; Direct I/O stream into Scalar Operand Network; 2-thread MPI app; 4-way automatically parallelized C program; httpd; Zzz...] An application uses only as many tiles as needed to exploit the parallelism intrinsic to that application…

  42. One Streaming Application on Raw: very different traffic patterns than RawCC-style parallelization. [Diagram: Tile 0 through Tile 15.]

  43. Auto-Parallelization Approach #2: Streamit Language + Compiler. [Diagram: stream graph shown as Original and After fusion; node types: Splitter, FIRFilter, Joiner, Vec Mult, Magnitude Detector.]

  44. End Results: auto-parallelized by MIT Streamit to 8 tiles. [Diagram: Vec Mult, FIRFilter, Magnitude Detector and Joiner nodes.]

  45. AsTrO Taxonomy: Classifying SON diversity. Assignment (Static/Dynamic): Is instruction assignment to ALUs predetermined? Transport (Static/Dynamic): Are operand routes predetermined? Ordering (Static/Dynamic): Is the execution order of instructions assigned to a node predetermined? [Diagram: 8 ALUs; example operations +, /, &, %, >>.]
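
The three static/dynamic questions form a small product type, which the sketch below encodes. The `Mode` and `AsTrO` names and the three-letter `code()` shorthand are assumptions for illustration. The example value is arbitrary and is not meant to classify any particular architecture.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    STATIC = "S"
    DYNAMIC = "D"


@dataclass(frozen=True)
class AsTrO:
    assignment: Mode  # is instruction assignment to ALUs predetermined?
    transport: Mode   # are operand routes predetermined?
    ordering: Mode    # is per-node execution order predetermined?

    def code(self) -> str:
        # Hypothetical shorthand: one letter per axis, in AsTrO order.
        return self.assignment.value + self.transport.value + self.ordering.value


example = AsTrO(Mode.STATIC, Mode.DYNAMIC, Mode.STATIC)
print(example.code())  # SDS
```

With two choices on each of three axes, the taxonomy has eight possible classes.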

  46. Microprocessor SON diversity using AsTrO taxonomy. [Diagram: decision tree with levels Assignment, Transport, Ordering, each branching Static/Dynamic; architectures placed at the leaves: TRIPS, WaveScalar, RawDyn, Raw, Scale, ILDP.]

  47. Outline 1. Scalar Operand Network and Tiled Microprocessor intro 2. Raw Architecture + SON 3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network.

  48. Raw Chips October 02

  49. Raw: 16 tiles (16 issue); 180 nm ASIC (IBM SA-27E); ~100 million transistors; 1 million gates; 3-4 years of development; 1.5 years of testing; 200K lines of test code. Core Frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V. Frequency competitive with IBM-implemented PowerPCs in same process. 18W average power.

  50. Raw motherboard. Support chipset implemented in FPGA.