
Advanced Microarchitecture



Presentation Transcript


  1. Advanced Microarchitecture Lecture 2: Pipelining and Superscalar Review

  2. Pipelined Design
  • Motivation: increase throughput with little increase in cost (hardware, power, complexity, etc.)
  • Bandwidth or throughput = performance
  • BW = num. tasks / unit time
  • For a system that operates on one task at a time: BW = 1 / latency
  • Pipelining can increase BW if there are many repetitions of the same operation/task
  • Latency per task remains the same or increases

  3. Pipelining Illustrated
  • Unpipelined: one block of combinational logic with N gate delays → BW ≈ 1/N
  • Two stages: two blocks of N/2 gate delays each → BW ≈ 2/N
  • Three stages: three blocks of N/3 gate delays each → BW ≈ 3/N

  4. Performance Model
  • Starting from an unpipelined design with propagation delay T and BW = 1/T
  • Perf_pipe = BW_pipe = 1 / (T/k + S), where k = number of stages and S = latch delay
  • Each of the k stages contributes T/k of logic delay plus latch overhead S
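A quick sketch of the formula (the numbers are illustrative; a later slide uses T = 400, S = 22) shows why speedup falls short of k once latch overhead S is included:

```python
def bw_unpipelined(T):
    # One task at a time: BW = 1 / latency = 1 / T
    return 1.0 / T

def bw_pipelined(T, k, S):
    # Perf_pipe = BW_pipe = 1 / (T/k + S), for k stages with latch delay S
    return 1.0 / (T / k + S)

T, S = 400.0, 22.0  # illustrative values
for k in (2, 4, 8, 16):
    speedup = bw_pipelined(T, k, S) / bw_unpipelined(T)
    print(f"k={k:2d}  speedup={speedup:.2f}  (ideal: {k})")
```

As k grows, the T/k term shrinks but the fixed S term does not, so the speedup curve flattens well below the ideal k.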

  5. Hardware Cost Model
  • Starting from an unpipelined design with hardware cost G
  • Cost_pipe = G + kL, where k = number of stages and L = latch cost (incl. control)
  • Each of the k stages has logic cost G/k plus latch cost L

  6. Cost/Performance Tradeoff
  • Cost/Performance: C/P = (Lk + G) / [1 / (T/k + S)] = (Lk + G)(T/k + S) = LT + GS + LSk + GT/k
  • Optimal cost/performance: find the k that minimizes C/P
  • d(C/P)/dk = LS - GT/k² = 0  ⟹  k_opt = √(GT / (LS))
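Differentiating C/P and solving gives the closed form; a sketch that checks it against the parameter sets quoted on the next slide:

```python
import math

def cost_perf(k, G, L, T, S):
    # C/P = (Lk + G)(T/k + S) = LT + GS + LSk + GT/k
    return (L * k + G) * (T / k + S)

def k_opt(G, L, T, S):
    # d(C/P)/dk = LS - GT/k^2 = 0  ->  k_opt = sqrt(GT / (LS))
    return math.sqrt(G * T / (L * S))

# Parameter sets from the next slide
for G, L, T, S in [(175, 41, 400, 22), (175, 21, 400, 11)]:
    k = k_opt(G, L, T, S)
    print(f"G={G} L={L} T={T} S={S}  ->  k_opt = {k:.1f}")
```

Cheaper, faster latches (smaller L and S) move the optimum toward deeper pipelines, which is exactly what the two curves on the next slide show.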

  7. “Optimal” Pipeline Depth: k_opt
  • [Figure: cost/performance ratio C/P (×10⁴) vs. pipeline depth k, plotted for G=175, L=41, T=400, S=22 and for G=175, L=21, T=400, S=11]

  8. Cost?
  • “Hardware cost”
  • Transistor/gate count
  • Should include the additional logic to control the pipeline
  • Area (related to gate count)
  • Power!
  • More gates → more switching
  • More gates → more leakage
  • Many metrics to optimize
  • Very difficult to determine what really is “optimal”

  9. Pipelining Idealism
  • Uniform suboperations: the operation to be pipelined can be evenly partitioned into uniform-latency suboperations
  • Repetition of identical operations: the same operation is performed repeatedly on a large number of different inputs
  • Repetition of independent operations: all repetitions of the same operation are mutually independent, i.e., no data dependences and no resource conflicts
  • Good examples: automobile assembly line, floating-point multiplier, instruction pipeline (?)

  10. Instruction Pipeline Design
  • Uniform suboperations … NOT!
  • Balance pipeline stages: stage quantization to yield balanced stages; minimize internal fragmentation (some waiting stages)
  • Identical operations … NOT!
  • Unify instruction types: coalesce instruction types into one “multi-function” pipe; minimize external fragmentation (some idling stages)
  • Independent operations … NOT!
  • Resolve data and resource hazards: inter-instruction dependency detection and resolution; minimize performance loss

  11. The Generic Instruction Cycle
  • The “computation” to be pipelined:
  • Instruction Fetch (IF)
  • Instruction Decode (ID)
  • Operand(s) Fetch (OF)
  • Instruction Execution (EX)
  • Operand Store (OS), a.k.a. writeback (WB)
  • Update Program Counter (PC)

  12. The Generic Instruction Pipeline
  Based on the obvious subcomputations:
  • IF: Instruction Fetch
  • ID: Instruction Decode
  • OF/RF: Operand Fetch
  • EX: Instruction Execute
  • OS/WB: Operand Store

  13. Balancing Pipeline Stages
  • Stage latencies: TIF = 6, TID = 2, TOF = 9, TEX = 5, TOS = 9 (units)
  • Without pipelining: Tcyc = TIF + TID + TOF + TEX + TOS = 31
  • Pipelined: Tcyc = max{TIF, TID, TOF, TEX, TOS} = 9
  • Speedup = 31 / 9 ≈ 3.4
  • Can we do better in terms of either performance or efficiency?
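The 31-vs-9 arithmetic above can be reproduced in a few lines (stage latencies taken from the slide):

```python
# Stage latencies from the slide, in arbitrary time units
stage_latency = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

t_unpipelined = sum(stage_latency.values())  # one long combinational path
t_pipelined = max(stage_latency.values())    # clock is set by the slowest stage
speedup = t_unpipelined / t_pipelined
print(t_unpipelined, t_pipelined, round(speedup, 2))  # 31 9 3.44
```

The max() term is why unbalanced stages hurt: the 2-unit ID stage idles for 7 units every cycle while OF and OS set the clock.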

  14. Balancing Pipeline Stages
  • Two methods for stage quantization:
  • Merging multiple subcomputations into one
  • Subdividing a subcomputation into multiple smaller ones
  • Recent/current trends:
  • Deeper pipelines (more and more stages), up to a certain point: then the cost function takes over
  • Multiple different pipelines/subpipelines
  • Pipelining of memory accesses (tricky)

  15. Granularity of Pipeline Stages
  • Stage latencies (TIF/TID/TOF/TEX/TOS) = 6/2/9/5/9 units
  • Finer-grained machine cycle: Tcyc = 3 units → 11 machine cycles/instruction (IF takes 2 cycles, ID 1, OF 3, EX 2, OS 3)
  • Coarser-grained machine cycle: merge IF and ID into one stage (TIF&ID = 8 units) → 4 stages (IF&ID, OF, EX, OS), 4 machine cycles/instruction

  16. Hardware Requirements
  • Logic needed for each pipeline stage
  • Register-file ports needed to support all (relevant) stages
  • Memory-access ports needed to support all (relevant) stages
  • [Figure: the multi-stage pipeline, annotated with the register-file and memory ports each stage requires]

  17. Pipeline Examples
  • AMDAHL 470V/7 (12 stages): PC GEN, Cache Read, Cache Read (IF); Decode (ID); Read REG, Add GEN (OF); Cache Read, Cache Read, EX 1, EX 2 (EX); Check Result, Write Result (OS)
  • MIPS R2000/R3000 (5 stages): IF, RD (decode + register read), ALU, MEM, WB

  18. Instruction Dependencies
  • Data dependence:
  • True dependence (RAW): an instruction must wait for all of its required input operands
  • Anti-dependence (WAR): a later write must not clobber a still-pending earlier read
  • Output dependence (WAW): an earlier write must not clobber an already-finished later write
  • Control dependence (a.k.a. procedural dependence):
  • Conditional branches cause uncertainty in instruction sequencing
  • Instructions following a conditional or computed branch depend on the execution of the branch instruction
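The three register-dependence classes can be detected mechanically from each instruction's read and write sets; a minimal sketch (the tuple encoding is my own, not from the lecture):

```python
def classify_dependences(earlier, later):
    """Classify register dependences from `earlier` to `later`.
    Each instruction is (set_of_regs_read, set_of_regs_written)."""
    reads1, writes1 = earlier
    reads2, writes2 = later
    deps = set()
    if writes1 & reads2:
        deps.add("RAW")  # true dependence: later reads what earlier writes
    if reads1 & writes2:
        deps.add("WAR")  # anti-dependence: later write vs. earlier read
    if writes1 & writes2:
        deps.add("WAW")  # output dependence: both write the same register
    return deps

# r1 <- r2 + 1  followed by  r3 <- r1 / 17  : RAW through r1
print(classify_dependences(({"r2"}, {"r1"}), ({"r1"}, {"r3"})))  # {'RAW'}
```

Real hardware does the same set intersections with register-specifier comparators rather than Python sets.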

  19. Example: Quick Sort on MIPS
  # for (; (j < high) && (array[j] < array[low]); ++j);
  # $10 = j; $9 = high; $6 = array; $8 = low
  bge $10, $9, $36
  mul $15, $10, 4
  addu $24, $6, $15
  lw $25, 0($24)
  mul $13, $8, 4
  addu $14, $6, $13
  lw $15, 0($14)
  bge $25, $15, $36
  $35:
  addu $10, $10, 1
  . . .
  $36:
  addu $11, $11, -1
  . . .

  20. Hardware Dependency Analysis
  • Processor must handle:
  • Register data dependencies: RAW, WAW, WAR
  • Memory data dependencies: RAW, WAW, WAR
  • Control dependencies

  21. Terminology
  • Pipeline hazards: potential violations of program dependencies; must ensure program dependencies are not violated
  • Hazard resolution:
  • Static method: performed at compile time in software
  • Dynamic method: performed at runtime using hardware (stall, flush, or forward)
  • Pipeline interlock: hardware mechanism for dynamic hazard resolution; must detect and enforce dependencies at runtime

  22. Pipeline: Steady State
          t0  t1  t2  t3  t4  t5  ...
  Instj   IF  ID  RD  ALU MEM WB
  Instj+1     IF  ID  RD  ALU MEM WB
  Instj+2         IF  ID  RD  ALU MEM WB
  Instj+3             IF  ID  RD  ALU MEM WB
  Instj+4                 IF  ID  RD  ALU MEM WB
  In steady state, every stage is busy and one instruction completes per cycle.

  23. Pipeline: Data Hazard
  Same steady-state diagram, but now a younger instruction reads in RD a register that Instj does not write until WB: without intervention it reads the stale value, violating the RAW dependence.
  [Figure: pipeline diagram with the dependence arrow from Instj's WB back to a younger instruction's RD]

  24. Pipeline: Stall on Data Hazard
          t0  t1  t2  t3  t4  t5  ...
  Instj   IF  ID  RD  ALU MEM WB
  Instj+1     IF  ID  (stalled in RD)  RD  ALU MEM WB
  Instj+2         IF  (stalled in ID)  ID  RD  ALU ...
  Instj+3             (stalled in IF)  IF  ID  RD  ...
  Instj+4                              IF  ID  ...
  Dependent instructions stall until the producer writes back, then resume in order.

  25. Different View
  • [Figure only: an alternative view of the same stall; the diagram was not captured in the transcript]

  26. Pipeline: Forwarding Paths
  • Same diagram, now with many possible forwarding paths: a result can be sent from a later stage (ALU output, MEM) directly back to a younger instruction's operand stage
  • Some cases (e.g., forwarding from MEM back to a dependent ALU operation) still require stalling even with forwarding paths

  27. ALU Forwarding Paths
  • The register file is read in ID; comparators check whether src1/src2 of the incoming instruction match the dest of the instructions still in ALU or MEM, and select the forwarded value instead of the register-file value
  • A deeper pipeline may require additional forwarding paths
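One way to sketch the mux-select logic implied by those comparators; a simplified model in which each in-flight stage is just a (dest, value) pair (real hardware compares register specifiers in parallel):

```python
def select_operand(src, regfile, alu_stage, mem_stage):
    """Pick the value of register `src` for an instruction about to execute.
    alu_stage / mem_stage are (dest_reg, value) for in-flight producers,
    or None. The youngest producer (ALU stage) wins over the MEM stage."""
    if alu_stage is not None and alu_stage[0] == src:
        return alu_stage[1]   # forward from the ALU output
    if mem_stage is not None and mem_stage[0] == src:
        return mem_stage[1]   # forward from the MEM stage
    return regfile[src]       # no in-flight producer: use the register file

regfile = {"r1": 10, "r2": 20}
# An instruction in ALU is producing r1 = 99; its value bypasses the register file.
print(select_operand("r1", regfile, ("r1", 99), None))  # 99
```

The priority order matters: if both ALU and MEM hold a pending write to the same register, the ALU stage holds the program-order-younger value.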

  28. Pipeline: Control Hazard
  Same diagram: a branch (Insti) does not resolve until late in the pipe, yet Insti+1 through Insti+4 begin fetching on the very next cycles, and their fetch addresses depend on the branch outcome.

  29. Pipeline: Stall on Control Hazard
          t0  t1  t2  t3  t4  t5  ...
  Insti   IF  ID  RD  ALU MEM WB
  Insti+1     (stalled in IF)  IF  ID  RD  ALU MEM
  Insti+2                          IF  ID  RD  ALU
  Insti+3                              IF  ID  RD
  Insti+4                                  IF  ID
  Fetch stalls until the branch resolves; the lost fetch slots reduce throughput.

  30. Pipeline: Prediction for Control Hazards
  • Fetch continues speculatively past the branch instead of stalling
  • On a misprediction, the speculative state is cleared: the wrong-path Insti+1 … Insti+4 are turned into nops as they drain
  • Fetch is resteered to the correct target, and new Insti+2, Insti+3, Insti+4 enter the pipe

  31. Going Beyond Scalar
  • A simple pipeline is limited to CPI ≥ 1.0
  • “Superscalar” can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)
  • Superscalar means executing more than one scalar instruction in parallel (e.g., add + xor + mul)
  • Contrast with vector, which also executes multiple operations in parallel, but they must all be the same operation (e.g., four parallel additions)

  32. Architectures for Instruction Parallelism
  • Scalar pipeline (baseline) with D stages:
  • Instruction overlap parallelism = D (D different instructions overlapped)
  • Operation latency = 1
  • Peak IPC = 1
  • [Figure: successive instructions vs. time in cycles (1-12), one new instruction entering per cycle]

  33. Superscalar Machine
  • Superscalar (pipelined) execution of width N:
  • Instruction parallelism = D × N (D × N different instructions overlapped)
  • Operation latency = 1
  • Peak IPC = N per cycle
  • [Figure: N instructions enter the D-stage pipeline each cycle]

  34. Ex. Original Pentium
  • Prefetch: 4 × 32-byte buffers
  • Decode1: decode up to 2 instructions
  • Decode2 (×2): read operands, address computation
  • Execute (×2): asymmetric pipes
  • Both pipes: mov, lea, simple ALU, push/pop, test/cmp
  • u-pipe only: shift, rotate, some FP
  • v-pipe only: jmp, jcc, call, fxch
  • Writeback (×2)

  35. Pentium Hazards, Stalls
  • “Pairing rules” (when can/can't two instructions execute at the same time?)
  • Read/flow dependence: mov eax, 8 ; mov [ebp], eax
  • Output dependence: mov eax, 8 ; mov eax, [ebp]
  • Partial register stalls: mov al, 1 ; mov ah, 0
  • Function unit rules: some instructions can never be paired: MUL, DIV, PUSHA, MOVS, some FP

  36. Limitations of In-Order Pipelines
  • CPI of in-order pipelines degrades very sharply if machine parallelism is increased beyond a certain point, i.e., when N approaches the average distance between dependent instructions
  • Forwarding is no longer effective; the machine must stall more often
  • The pipeline may never be full due to the frequency of dependency stalls

  37. N Instruction Limit
  • Ex.: superscalar degree N = 4: a dependent instruction must be at least N = 4 instructions away, and any dependency among instructions issued together causes a stall
  • On average, the parent-child separation is only about 5 instructions (Franklin and Sohi '92)
  • An average of 5 means there are many cases where the separation is < 4; each of these limits parallelism
  • Pentium: superscalar degree N = 2 is reasonable; going much further encounters rapidly diminishing returns

  38. In Search of Parallelism
  • “Trivial” parallelism is limited
  • What is trivial parallelism?
  • In-order: sequential instructions that do not have dependencies
  • In all previous examples, instructions executed either at the same time as or after earlier instructions
  • Previous slides show that superscalar execution quickly hits a ceiling
  • So what is “non-trivial” parallelism? …

  39. What is Parallelism?
  • Work T1: time to complete the computation on a sequential system
  • Critical path T∞: time to complete the same computation on an infinitely parallel system
  • Average parallelism: Pavg = T1 / T∞
  • For a p-wide system: Tp ≥ max{T1/p, T∞}; if Pavg >> p, then Tp ≈ T1/p
  • Example: x = a + b; y = b * 2; z = (x - y) * (x + y)
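The work and critical-path definitions can be checked on the slide's example with a tiny dataflow graph (the node names are mine):

```python
from functools import lru_cache

# Dataflow graph for: x = a + b; y = b * 2; z = (x - y) * (x + y)
# Each node is one unit-latency operation; edges point to the values it consumes.
deps = {
    "x":   [],              # a + b   (inputs a, b are free)
    "y":   [],              # b * 2
    "sub": ["x", "y"],      # x - y
    "add": ["x", "y"],      # x + y
    "z":   ["sub", "add"],  # final multiply
}

@lru_cache(maxsize=None)
def depth(node):
    # Length of the longest dependence chain ending at `node`
    return 1 + max((depth(p) for p in deps[node]), default=0)

T1 = len(deps)                        # work: 5 operations
T_inf = max(depth(n) for n in deps)   # critical path: x/y -> sub/add -> z
print(T1, T_inf, round(T1 / T_inf, 2))  # 5 3 1.67
```

So Pavg = 5/3: even with infinite hardware, this snippet cannot run faster than 3 unit-latency steps.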

  40. ILP: Instruction-Level Parallelism
  • ILP is a measure of the inter-instruction dependencies in a program
  • Average ILP = (number of instructions) / (length of longest dependence path)
  • code1: r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3
  • ILP = 1 (must execute serially): T1 = 3, T∞ = 3
  • code2: r1 ← r2 + 1; r3 ← r9 / 17; r4 ← r0 - r10
  • ILP = 3 (all three can execute at the same time): T1 = 3, T∞ = 1
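Average ILP as defined above (instructions divided by the longest dependence chain, assuming unit latency) can be computed with a short sketch; the operand encoding is illustrative:

```python
def avg_ilp(instructions):
    """instructions: list of (dest, [sources]) in program order.
    Average ILP = num instructions / longest dependence path (unit latency)."""
    depth_of = {}   # register -> dataflow depth of its latest producer
    longest = 0
    for dest, srcs in instructions:
        d = 1 + max((depth_of.get(s, 0) for s in srcs), default=0)
        depth_of[dest] = d            # later writes shadow earlier ones
        longest = max(longest, d)
    return len(instructions) / longest

code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]   # serial chain
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]  # independent
print(avg_ilp(code1), avg_ilp(code2))  # 1.0 3.0
```

This matches the slide: code1 forms one chain of length 3 (ILP = 1), while code2 has three independent instructions (ILP = 3).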

  41. ILP != IPC
  • ILP usually assumes infinite resources, perfect fetch, and unit latency for all instructions; ILP is more a property of the program's dataflow
  • IPC is the “real” observed metric of exactly how many instructions are executed per machine cycle, which includes all of the limitations of a real machine
  • The ILP of a program is an upper bound on the attainable IPC

  42. Scope of ILP Analysis
  • Block 1: r1 ← r2 + 1; r3 ← r1 / 17; r4 ← r0 - r3 → ILP = 1
  • Block 2: r11 ← r12 + 1; r13 ← r19 / 17; r14 ← r0 - r20 → ILP = 3
  • Both blocks analyzed together as one window: 6 instructions, longest path 3 → ILP = 2
  • The measured ILP depends on the scope (window) over which the analysis is done

  43. DFG Analysis
  A: R1 = R2 + R3
  B: R4 = R5 + R6
  C: R1 = R1 * R4
  D: R7 = LD 0[R1]
  E: BEQZ R7, +32
  F: R4 = R7 - 3
  G: R1 = R1 + 1
  H: R4 → ST 0[R1]
  J: R1 = R1 - 1
  K: R3 → ST 0[R1]

  44. In-Order Issue, Out-of-Order Completion
  • In-order instruction stream: execution begins in order, but instructions complete out of order
  • Function units: INT, Fadd1-Fadd2, Fmul1-Fmul3, Ld/St (with differing latencies)
  • Issue = send an instruction to execution
  • The issue stage needs to check:
  1. Structural dependence
  2. RAW hazard
  3. WAW hazard
  4. WAR hazard

  45. Example
  The code from the DFG slide (A-K), issued in order with out-of-order completion:
  Cycle 1: A, B
  Cycle 2: C
  Cycle 3: D
  Cycles 4-5: (waiting for the load D to complete)
  Cycle 6: E, F, G
  Cycle 7: H, J
  Cycle 8: K
  IPC = 10/8 = 1.25

  46. Example (2)
  Same code, but D, H, and J rewritten to use R9, removing false dependences through R1:
  A: R1 = R2 + R3
  B: R4 = R5 + R6
  C: R1 = R1 * R4
  D: R9 = LD 0[R1]
  E: BEQZ R7, +32
  F: R4 = R7 - 3
  G: R1 = R1 + 1
  H: R4 → ST 0[R9]
  J: R1 = R9 - 1
  K: R3 → ST 0[R1]
  Cycle 1: A, B
  Cycle 2: C
  Cycle 3: D, E, F, G
  Cycles 4-5: (waiting for the load D to complete)
  Cycle 6: H, J
  Cycle 7: K
  IPC = 10/7 ≈ 1.43

  47. Track with Simple Scoreboarding
  • Scoreboard: a bit array, one bit per GPR
  • If the bit is not set: the register has valid data
  • If the bit is set: the register has stale data, i.e., some outstanding instruction is going to change it
  • Issue in order: RD ← Fn(RS, RT)
  • If SB[RS] or SB[RT] is set → RAW, stall
  • If SB[RD] is set → WAW, stall
  • Else, dispatch to FU (Fn) and set SB[RD]
  • Complete out of order: update GPR[RD], clear SB[RD]
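The issue and completion rules above map almost directly to code; a minimal sketch with FU dispatch and timing abstracted away:

```python
class Scoreboard:
    """One pending-write bit per GPR; issue-in-order rules from the slide."""
    def __init__(self, nregs=32):
        self.pending = [False] * nregs

    def can_issue(self, rd, rs, rt):
        if self.pending[rs] or self.pending[rt]:
            return False          # RAW: a source operand is still being produced
        if self.pending[rd]:
            return False          # WAW: an older write to rd is outstanding
        return True

    def issue(self, rd, rs, rt):
        assert self.can_issue(rd, rs, rt)
        self.pending[rd] = True   # dispatch to the FU and mark rd stale

    def complete(self, rd):
        self.pending[rd] = False  # update GPR[rd], clear SB[rd]

sb = Scoreboard()
sb.issue(1, 2, 3)                 # r1 = Fn(r2, r3)
print(sb.can_issue(4, 1, 5))      # False: RAW on r1
sb.complete(1)
print(sb.can_issue(4, 1, 5))      # True
```

Note what a single bit per register cannot express: it stalls on WAW even when the older write could safely be ignored, which previews the scoreboarding limitations discussed two slides later.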

  48. Out-of-Order Issue
  • In-order instruction stream enters an extra stage with buffers for dependency resolution (DR)
  • Instructions then execute out of program order on the function units (INT, Fadd1-Fadd2, Fmul1-Fmul3, Ld/St)
  • Out-of-order completion

  49. OOO Scoreboarding
  • Similar to in-order scoreboarding, but needs new tables to track the status of individual instructions and function units
  • Still enforces dependencies:
  • Stall dispatch on WAW
  • Stall issue on RAW
  • Stall completion on WAR
  • Limitations of scoreboarding? Hints:
  • Assume no structural hazards
  • One can always write a RAW-free code sequence: Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …
  • Think about the x86 ISA with only 8 registers: the finite number of registers in any ISA forces register names to be reused at some point → WAR, WAW → stalls

  50. Lessons thus Far
  • More out-of-orderness → more ILP exposed, but more hazards
  • Stalling is a generic technique to ensure sequencing
  • RAW stall is a fundamental requirement (?)
  • Compiler analysis and scheduling can help (not covered in this course)
