
September 15, 2000 Prof. John Kubiatowicz

CS252 Graduate Computer Architecture Lecture 5 Software Scheduling around Hazards Hardware* Out-of-order Scheduling. September 15, 2000 Prof. John Kubiatowicz. Techniques to Increase ILP. Forwarding Branch Prediction Superpipelining Superscalar with Static Multiple Issue VLIW


Presentation Transcript


  1. CS252 Graduate Computer Architecture, Lecture 5: Software Scheduling around Hazards; Hardware* Out-of-order Scheduling. September 15, 2000. Prof. John Kubiatowicz

  2. Lecture 19 - Pipelining 3 Techniques to Increase ILP • Forwarding • Branch Prediction • Superpipelining • Superscalar with Static Multiple Issue VLIW • Superscalar with Dynamic Multiple Issue • Superscalar with Speculation • Superscalar with Simultaneous Multithreading (SMT)

  3. Static Multiple Issue (Fig. 6.44, old 6.57)
     • Key idea: issue (decode & execute) multiple instructions in each clock cycle
     • Example: issue load/store and ALU/branch in MIPS
     • Figure: pipeline diagram with columns "Instruction type" and "Pipe stages". Four pairs are shown, each pair consisting of one ALU-or-branch instruction and one load/store instruction, and every instruction passes through IF ID EX MEM WB.

  4. Example - A Static Multiple Issue MIPS (Fig. 6.45, old 6.58). Figure labels: "Executes ALU/Branch Instructions", "Executes Load/Store Instructions".

  5. VLIW / EPIC Processors
     • VLIW - Very Long Instruction Word
       • Functional units exposed in instruction word
       • Static scheduling by compiler
       • Pipeline is exposed; compiler must schedule delays to get the right result
       • Examples: Philips Trimedia, Texas Instruments C6000
     • Explicitly Parallel Instruction Computing (EPIC)
       • Three 41-bit instructions in each instruction packet
       • Compiler determines parallelism
       • Hardware checks dependencies and forwards/stalls
       • Examples: Intel Itanium, Itanium 2

  6. Itanium Block Diagram (figure). Source: Extreme Tech, www.extremetech.com

  7. Software Manipulation to Increase ILP
     • Software transformations can increase ILP
       • Code reordering to reduce stalls
       • Loop unrolling
     • Example (p. 438)
         Loop: lw   $t0, 0($s1)        # $t0=array element
               addu $t0, $t0, $s2      # add scalar in $s2
               sw   $t0, 0($s1)        # store result
               addi $s1, $s1, -4       # decrement ptr
               bne  $s1, $zero, Loop
     • Goal: reorder to speed superscalar execution

  8. Software Manipulation - Reordering Code
     | Clock | ALU or branch instruction | Data transfer instruction |
     | 1     | Loop:                     | lw   $t0, 0($s1)          |
     | 2     | addi $s1, $s1, -4         |                           |
     | 3     | addu $t0, $t0, $s2        |                           |
     | 4     | bne  $s1, $zero, Loop     | sw   $t0, 4($s1)          |
     • Note sparse utilization of superscalar pipeline!
     • End result:
       • 5 instructions in 4 clocks
       • CPI = 0.8
       • IPC = 1.25

  9. Software Manipulation - Loop Unrolling
     • Assume loop count a multiple of 4 & unroll
     | Clock | ALU or branch instruction | Data transfer instruction |
     | 1     | Loop: addi $s1, $s1, -16  | lw   $t0, 0($s1)          |
     | 2     |                           | lw   $t1, 12($s1)         |
     | 3     | addu $t0, $t0, $s2        | lw   $t2, 8($s1)          |
     | 4     | addu $t1, $t1, $s2        | lw   $t3, 4($s1)          |
     | 5     | addu $t2, $t2, $s2        | sw   $t0, 16($s1)         |
     | 6     | addu $t3, $t3, $s2        | sw   $t1, 12($s1)         |
     | 7     |                           | sw   $t2, 8($s1)          |
     | 8     | bne  $s1, $zero, Loop     | sw   $t3, 4($s1)          |
     • End result:
       • 4 loop iterations in 8 clocks
       • IPC = 1.75 (14 instructions in 8 clocks)
       • 2 clocks / iteration!
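The IPC and CPI figures quoted on slides 8 and 9 follow directly from the instruction and clock counts. The short sketch below redoes that arithmetic; the helper name `ipc_and_cpi` is an illustrative choice and is not taken from the lecture.

```python
from fractions import Fraction

def ipc_and_cpi(instructions: int, clocks: int) -> tuple[Fraction, Fraction]:
    """Return (IPC, CPI) for a schedule that completes `instructions` in `clocks` cycles."""
    ipc = Fraction(instructions, clocks)
    return ipc, 1 / ipc

# Reordered loop (slide 8): 5 instructions in 4 clocks.
reordered = ipc_and_cpi(5, 4)

# Unrolled loop (slide 9): 4 lw + 4 addu + 4 sw + 1 addi + 1 bne = 14 instructions in 8 clocks.
unrolled = ipc_and_cpi(14, 8)

print(float(reordered[0]), float(reordered[1]))  # 1.25 0.8
print(float(unrolled[0]))                        # 1.75
```

Unrolling raises IPC from 1.25 to 1.75 because the single `addi` and `bne` are shared by four iterations and the extra loads and stores fill otherwise empty data-transfer slots.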

  10. Techniques to Increase ILP • Forwarding • Branch Prediction • Superpipelining • Superscalar with Static Multiple Issue VLIW • Superscalar with Dynamic Multiple Issue • Superscalar with Speculation • Superscalar with Simultaneous Multithreading (SMT)

  11. Dynamic Multiple Issue
      • Key ideas:
        • "Look past" stalls for instructions that can execute
            lw   $t0, 20($t2)
            addu $t1, $t0, $s2      # addu stalls until $t0 available
            sub  $s4, $s4, $s3      # sub is ready to execute but blocked by stall!
            slti $t5, $s4, 20
        • Execute instructions out of order
        • Use multiple functional units for parallel execution
        • Forward results between functional units when necessary
        • Update registers (in original order of execution)
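A minimal sketch of "looking past" a stall, using the four-instruction example from slide 11. It assumes a simplified model that is not part of the lecture: only RAW dependences are checked, results produced in a cycle are not visible to later instructions in that same cycle, and WAR/WAW hazards are ignored. The names `Instr` and `ready_this_cycle` are hypothetical.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    text: str
    dest: str
    srcs: tuple[str, ...]

def ready_this_cycle(window: list[Instr], pending: set[str]) -> list[str]:
    """Return the instructions in `window` whose source registers are all available.

    `pending` holds registers still being produced by earlier, in-flight
    instructions. Simplifying assumption: RAW hazards only.
    """
    unavailable = set(pending)
    ready = []
    for ins in window:                      # scan in program order
        if not unavailable.intersection(ins.srcs):
            ready.append(ins.text)
        # Whether it starts now or waits, this instruction's result
        # is not available to younger instructions in this cycle.
        unavailable.add(ins.dest)
    return ready

# The lw is in flight, so $t0 is not yet available.
window = [
    Instr("addu $t1, $t0, $s2", "$t1", ("$t0", "$s2")),
    Instr("sub $s4, $s4, $s3", "$s4", ("$s4", "$s3")),
    Instr("slti $t5, $s4, 20", "$t5", ("$s4",)),
]
ready = ready_this_cycle(window, pending={"$t0"})
print(ready)  # ['sub $s4, $s4, $s3']
```

An in-order pipeline would hold `sub` behind the stalled `addu`; here `sub` is selected, while `slti` still waits because it needs the `$s4` that `sub` produces.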

  12. Speculation
      • Guess about the outcome of an instruction (e.g., branch or load)
      • Based on guess, start executing instructions
      • Cancel started instructions if guess is incorrect
      • Complicating factors
        • Must buffer instruction results until outcome known
        • Exceptions in speculated instructions - how can you have an exception in an instruction that didn’t execute?

  13. Superscalar Dynamic Pipelining (Fig. 6.49, old 6.61). Figure labels: instruction fetch and decode unit (in-order issue); four reservation stations feeding the functional units Integer, Integer, Floating point, Load/Store (out-of-order execute); commit unit (in-order commit).

  14. Can HW reduce CPI to 1- or IPC to 1+?
      • Why in HW/at run time?
        • Works when can’t know real dependence at compile time
        • Compiler simpler
        • Code for one machine runs well on another
      • Key idea #1: Allow instructions behind stall to proceed
          DIVD F0,F2,F4
          ADDD F10,F0,F8
          SUBD F12,F8,F14
        Out-of-order execution => out-of-order completion?
      • Key idea #2: Register Renaming
          Original             Renamed
          DIVD F0,F2,F4        DIVD F0,F2,F4
          ADDD F10,F0,F8       ADDD F10,F0,F8
          SUBD F0,F8,F14       SUBD F100,F8,F14
          MULD F6,F10,F0       MULD F6,F10,F100
        Totally removes WAR and WAW hazards.
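The renaming idea on slide 14 can be sketched in a few lines. This version is an assumption-laden illustration, not the lecture's mechanism: it gives every destination a fresh name (`P0`, `P1`, ...) rather than renaming only the conflicting write as the slide does with `F100`, and it assumes an unlimited supply of names. The function `rename` is hypothetical.

```python
def rename(program):
    """Rename every destination register to a fresh name.

    `program` is a list of (op, dest, src1, src2) tuples. Sources are looked up
    in the current mapping first, so true (RAW) dependences are preserved.
    Assumption: unlimited physical registers named P0, P1, ...
    """
    mapping = {}          # architectural register -> latest physical name
    renamed = []
    for n, (op, dest, src1, src2) in enumerate(program):
        s1 = mapping.get(src1, src1)    # read sources before updating dest
        s2 = mapping.get(src2, src2)
        phys = f"P{n}"
        mapping[dest] = phys
        renamed.append((op, phys, s1, s2))
    return renamed

program = [
    ("DIVD", "F0", "F2", "F4"),
    ("ADDD", "F10", "F0", "F8"),
    ("SUBD", "F0", "F8", "F14"),   # WAW with DIVD, WAR with ADDD on F0
    ("MULD", "F6", "F10", "F0"),
]
renamed = rename(program)
for ins in renamed:
    print(ins)
```

After renaming, `SUBD` writes `P2` instead of `F0`, so it no longer conflicts with `DIVD`'s write or `ADDD`'s read, while `MULD` still reads the value `SUBD` produces.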

  15. Moving beyond the five-stage pipeline (figure labels: RAW, WAR)
      • Why limit performance for slow/less frequent ops?
      • Variable latencies -> out-of-order execution desirable
      • How do we prevent WAR and WAW hazards?
      • How do we deal with variable latency?
      • Forwarding for RAW hazards will be harder.

  16. Scoreboard Architecture (CDC 6600). Figure labels: Registers; Functional Units (FP Mult, FP Mult, FP Divide, FP Add, Integer); SCOREBOARD; Memory.

  17. Basic Pipelined MIPS

  18. Four Stages of Scoreboard Control
      • Issue: decode instructions & check for structural hazards (ID1)
        • Instructions issued in program order (for hazard checking)
        • Don’t issue if structural hazard
        • Don’t issue if instruction is output dependent on any previously issued but uncompleted instruction (WAW hazards)
      • Read operands: wait until no data hazards, then read operands (ID2)
        • All real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data.
        • No forwarding of data in this model!

  19. Four Stages of Scoreboard Control
      • Execution: operate on operands (EX)
        • The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
      • Write result: finish execution (WB)
        • Stall until no WAR hazards with previous instructions. Example:
            DIVD F0,F2,F4
            ADDD F10,F0,F8
            SUBD F8,F8,F14
          CDC 6600 scoreboard would stall SUBD until ADDD reads operands

  20. Three Parts of the Scoreboard
      • Instruction status: which of 4 steps the instruction is in
      • Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
        • Busy: Indicates whether the unit is busy or not
        • Op: Operation to perform in the unit (e.g., + or –)
        • Fi: Destination register
        • Fj, Fk: Source-register numbers
        • Qj, Qk: Functional units producing source registers Fj, Fk
        • Rj, Rk: Flags indicating when Fj, Fk are ready
      • Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register
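The functional-unit status and register-result status described on slide 20 map naturally onto a small record type and a dictionary. The sketch below is one possible encoding, with an issue check added to show how the fields are filled in; the names `FUStatus` and `try_issue`, the unit names, and the use of `None` for "blank" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FUStatus:
    """The 9 per-unit fields of the scoreboard."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None    # destination register
    fj: Optional[str] = None    # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None    # units producing Fj, Fk (None = no producer pending)
    qk: Optional[str] = None
    rj: bool = False            # Fj, Fk ready and not yet read
    rk: bool = False

def try_issue(units, result, fu, op, dest, s1, s2) -> bool:
    """Issue to unit `fu` unless there is a structural or WAW hazard."""
    if units[fu].busy or result.get(dest) is not None:
        return False
    qj, qk = result.get(s1), result.get(s2)
    units[fu] = FUStatus(True, op, dest, s1, s2, qj, qk, qj is None, qk is None)
    result[dest] = fu           # register result status: who will write `dest`
    return True

units = {"Mult": FUStatus(), "Add": FUStatus(), "Divide": FUStatus()}
result: dict[str, str] = {}     # register -> unit that will write it

ok_div = try_issue(units, result, "Divide", "DIVD", "F0", "F2", "F4")
ok_add = try_issue(units, result, "Add", "ADDD", "F10", "F0", "F8")
ok_sub = try_issue(units, result, "Add", "SUBD", "F8", "F8", "F14")   # Add unit busy
# Hypothetical extra instruction (not from the slides) to show the WAW check:
ok_waw = try_issue(units, result, "Mult", "MULTD", "F0", "F2", "F4")

print(ok_div, ok_add, ok_sub, ok_waw)                      # True True False False
print(units["Add"].qj, units["Add"].rj, units["Add"].rk)   # Divide False True
```

With a single adder in this sketch, `SUBD` is held at issue by a structural hazard; the `ADDD` entry records that its `F0` operand is still owed by the Divide unit.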

  21. Scoreboard Example

  22. Scoreboard Example: Cycle 1

  23. Detailed Scoreboard Pipeline Control
      • Issue
        • Wait until: not Busy(FU) and not Result(D)
        • Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← 'D'; Fj(FU) ← 'S1'; Fk(FU) ← 'S2'; Qj ← Result('S1'); Qk ← Result('S2'); Rj ← not Qj; Rk ← not Qk; Result('D') ← FU
      • Read operands
        • Wait until: Rj and Rk
        • Bookkeeping: Rj ← No; Rk ← No
      • Execution complete
        • Wait until: functional unit done
      • Write result
        • Wait until: ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
        • Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
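The write-result condition in the control table is the WAR check: a unit may write its destination only when no other unit still holds that register as a ready-but-unread source. The sketch below applies it to the DIVD/ADDD/SUBD example from slide 19. The `FU` record and `can_write_result` are hypothetical names, and the sketch assumes two separate adder units so that ADDD and SUBD can be in flight together.

```python
from dataclasses import dataclass

@dataclass
class FU:
    name: str
    fi: str      # destination register
    fj: str      # source registers
    fk: str
    rj: bool     # True = operand ready but not yet read
    rk: bool

def can_write_result(fu: FU, units: list[FU]) -> bool:
    """For all f: (Fj(f) != Fi(FU) or Rj(f) = No) and (Fk(f) != Fi(FU) or Rk(f) = No)."""
    return all(
        (f.fj != fu.fi or not f.rj) and (f.fk != fu.fi or not f.rk)
        for f in units
    )

# DIVD F0,F2,F4 is executing (operands already read).
divd = FU("Divide", "F0", "F2", "F4", rj=False, rk=False)
# ADDD F10,F0,F8 waits for F0 from DIVD; F8 is ready but not yet read.
addd = FU("Add1", "F10", "F0", "F8", rj=False, rk=True)
# SUBD F8,F8,F14 has read its operands and finished executing.
subd = FU("Add2", "F8", "F8", "F14", rj=False, rk=False)
units = [divd, addd, subd]

before = can_write_result(subd, units)   # ADDD has not read F8 yet
addd.rj = addd.rk = False                # ADDD reads operands: Rj <- No; Rk <- No
after = can_write_result(subd, units)

print(before, after)  # False True
```

This reproduces the behaviour stated on slide 19: SUBD is held at write result until ADDD has read its operands.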

  24. Scoreboard Example: Cycle 2 • Can we enter Issue for 2nd LD?

  25. Scoreboard Example: Cycle 3 • Issue MULT (in order)?

  26. Scoreboard Example: Cycle 4

  27. Scoreboard Example: Cycle 5

  28. Scoreboard Example: Cycle 6

  29. Scoreboard Example: Cycle 7 • Read multiply operands?

  30. Scoreboard Example: Cycle 8a (First half of clock cycle)

  31. Scoreboard Example: Cycle 8b (Second half of clock cycle)

  32. Scoreboard Example: Cycle 9 • Note Remaining • Read operands for MULT & SUB? Issue ADDD?

  33. Scoreboard Example: Cycle 10

  34. Scoreboard Example: Cycle 11

  35. Scoreboard Example: Cycle 12 • Read operands for DIVD?

  36. Scoreboard Example: Cycle 13

  37. Scoreboard Example: Cycle 14

  38. Scoreboard Example: Cycle 15

  39. Scoreboard Example: Cycle 16

  40. Scoreboard Example: Cycle 17 • WAR Hazard! • Why not write result of ADD?

  41. Scoreboard Example: Cycle 18

  42. Scoreboard Example: Cycle 19

  43. Scoreboard Example: Cycle 20

  44. Scoreboard Example: Cycle 21 • WAR Hazard is now gone...

  45. Scoreboard Example: Cycle 22

  46. (skip a few cycles)

  47. Scoreboard Example: Cycle 61

  48. Scoreboard Example: Cycle 62

  49. Review: Scoreboard Example: Cycle 62 • In-order issue; out-of-order execute & commit

  50. Detailed Scoreboard Pipeline Control (same table as slide 23)
