1 / 53

CS252 Graduate Computer Architecture Lecture 7 Dynamic Scheduling 2: Precise Interrupts

CS252 Graduate Computer Architecture Lecture 7 Dynamic Scheduling 2: Precise Interrupts. John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252. Review: Tomasulo Organization. FP Registers. From Mem.

dschroeder
Download Presentation

CS252 Graduate Computer Architecture Lecture 7 Dynamic Scheduling 2: Precise Interrupts

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS252Graduate Computer ArchitectureLecture 7 Dynamic Scheduling 2: Precise Interrupts John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252

  2. Review: Tomasulo Organization FP Registers From Mem FP Op Queue Load Buffers Load1 Load2 Load3 Load4 Load5 Load6 Store Buffers Add1 Add2 Add3 Mult1 Mult2 Reservation Stations To Mem FP adders FP multipliers Common Data Bus (CDB) CS252-S09, lecture 7

  3. Review: Three Stages of Tomasulo Algorithm 1. Issue—get instruction from FP Op Queue If reservation station free (no structural hazard), control issues instr & sends operands (renames registers). 2. Execution—operate on operands (EX) When both operands ready then execute; if not ready, watch Common Data Bus for result 3. Write result—finish execution (WB) Write on Common Data Bus to all awaiting units; mark reservation station available • Normal data bus: data + destination (“go to” bus) • Common data bus: data + source (“come from” bus) • 64 bits of data + 4 bits of Functional Unit source address • Write if matches expected Functional Unit (produces result) • Does the broadcast CS252-S09, lecture 7

  4. Review: Renaming • Purpose of Renaming: removing “Anti-dependencies” • Get rid of WAR and WAW hazards, since these are not “real” dependencies • Implicit Renaming: i.e. Tomasulo • Registers changed into values or response tags • We call this “implicit” because space in register file may or may not be used by results! • Explicit Renaming: more physical registers than needed by ISA. • Rename table: tracks current association between architectural registers and physical registers • Uses a translation table to perform compiler-like transformation on the fly • With Explicit Renaming: • All registers concentrated in single register file • Can utilize bypass network that looks more like 5-stage pipeline • Introduces a register-allocation problem • Need to handle branch misprediction and precise exceptions differently, but ultimately makes things simpler CS252-S09, lecture 7

  5. 2 1 3 4 5 6 How important is renaming?Consider execution without it latency 1 LD F2, 34(R2) 1 2 LD F4, 45(R3) long 3 MULTD F6, F4, F2 3 4 SUBD F8, F2, F2 1 5 DIVD F4, F2, F8 4 6 ADDD F10, F6, F4 1 In-order: 1 (2,1) . . . . . . 2 3 4 43 5 . . . 5 6 6 Out-of-order: 1 (2,1) 4 4 . . . . 2 3 . . 3 5 . . . 5 6 6 Out-of-order execution did not allow any significant improvement! CS252-S09, lecture 7

  6. Which features of an ISA limit the number of instructions in the pipeline? Which features of a program limit the number of instructions in the pipeline? How many instructions can be in the pipeline? Number of Registers Control transfers Out-of-order dispatch by itself does not provide any significant performance improvement ! CS252-S09, lecture 7

  7. Little’s Law Throughput (T) = Number in Flight (N) / Latency (L) WB Execution Issue • Example: • 4 floating point registers • 8 cycles per floating point operation • maximum of ½ issue per cycle without renaming! CS252-S09, lecture 7

  8. 2 1 3 4 5 6 Instruction-level Parallelism via Renaming latency 1 LD F2, 34(R2) 1 2 LD F4, 45(R3) long 3 MULTD F6, F4, F2 3 4 SUBD F8, F2, F2 1 5 DIVD F4’, F2, F8 4 6 ADDD F10, F6, F4’1 X In-order: 1 (2,1) . . . . . . 2 3 4 43 5 . . . 5 6 6 Out-of-order: 1 (2,1) 4 4 5 . . . 2 (3,5) 3 6 6 • Any antidependence can be eliminated by renaming. • (renaming  additional storage) • Can be done either in Software or Hardware CS252-S09, lecture 7

  9. Review: Loop Example Cycle 9 • Dataflow graph constructed completely in hardware • Renaming detaches early iterations from registers CS252-S09, lecture 7

  10. B A + * / + X(0) Y X Data-Flow Architectures • Basic Idea: Hardware respresents direct encoding of compiler dataflow graphs: • Data flows along arcs in“Tokens”. • When two tokens arrive atcompute box, box “fires” andproduces new token. • Split operations produce copiesof tokens Input: a,b y:= (a+b)/x x:= (a*(a+b))+b output: y,x CS252-S09, lecture 7

  11. Operation Unit 0 Instruction Cell Operation Unit m-1 Instruction Instruction Cell 0 Operation Packet Operand 1 Data Packets Instruction Cell 1 Operand 2 Memory Instruction Cell n-1 Paper by Dennis and Misunas “Reservation Station?” CS252-S09, lecture 7

  12. Administrative • No class on Monday (President’s Day) • Midterm I: Wednesday 3/18Location: 310 Soda HallTIME: 6:00—9:00 • Can have 1 sheet of 8½x11 handwritten notes – both sides • No microfiche of the book! • This info is on the Lecture page (has been) • Meet at LaVal’s afterwards for Pizza and Beverages • Great way for me to get to know you better • I’ll Buy! CS252-S09, lecture 7

  13. Stream of Instructions To Execute Instruction Fetch with Branch Prediction Out-Of-Order Execution Unit Correctness Feedback On Branch Results Problem: “Fetch” unit • Instruction fetch decoupled from execution • Often issue logic (+ rename) included with Fetch CS252-S09, lecture 7

  14. Branches must be resolved quickly for loop overlap! • In our loop-unrolling example, we relied on the fact that branches were under control of “fast” integer unit in order to get overlap! Loop: LD F0 0 R1 MULTD F4 F0 F2 SD F4 0 R1 SUBI R1 R1 #8 BNEZ R1 Loop • What happens if branch depends on result of multd?? • We completely lose all of our advantages! • Need to be able to “predict” branch outcome. • If we were to predict that branch was taken, this would be right most of the time. • Problem much worse for superscalar machines! CS252-S09, lecture 7

  15. Prediction: Branches, Dependencies, Data • Prediction has become essential to getting good performance from scalar instruction streams. • Next time, we will discuss predicting branches • However, architects are now predicting everything: data dependencies, actual data, and results of groups of instructions: • At what point does computation become a probabilistic operation + verification? • We are pretty close with control hazards already… • Why does prediction work? • Underlying algorithm has regularities. • Data that is being operated on has regularities. • Instruction sequence has redundancies that are artifacts of way that humans/compilers think about problems. • Prediction  Compressible information streams? • Next few sets of readings are about prediction! CS252-S09, lecture 7

  16. What about Precise Exceptions/Interrupts? • Both Scoreboard and Tomasulo have: • In-order issue, out-of-order execution, out-of-order completion • Recall: An interrupt or exception is precise if there is a single instruction for which: • All instructions before that have committed their state • No following instructions (including the interrupting instruction) have modified any state. • Need way to resynchronize execution with instruction stream (I.e. with issue-order) • Easiest way is with in-order completion (i.e. reorder buffer) • Other Techniques (Smith paper): Future File, History Buffer CS252-S09, lecture 7

  17. Commit Point Inst. Mem Decode Data Mem + Exc D PC D PC D Exc E E W PC E Exc M PC M M Illegal Opcode Data Addr Except Kill Writeback Overflow Select Handler PC PC Address Exceptions Cause EPC Kill F Stage Kill D Stage Kill E Stage Asynchronous Interrupts Exception Handling(In-Order Five-Stage Pipeline) • Hold exception flags in pipeline until commit point (M stage) • Exceptions in earlier pipe stages override later exceptions • Inject external interrupts at commit point (override others) • If exception at commit: update Cause and EPC registers, kill • all stages, inject handler PC into fetch stage CS252-S09, lecture 7

  18. Fetch: Instruction bits retrieved from cache. Phases of Instruction Execution PC I-cache Fetch Buffer Decode: Instructions placed in appropriate issue (aka “dispatch”) stage buffer Issue Buffer Execute: Instructions and operands sent to execution units . When execution completes, all results and exception flags are available. Func. Units Result Buffer Commit: Instruction irrevocably updates architectural state (aka “graduation” or “completion”). Arch. State CS252-S09, lecture 7

  19. W PC D X2 W Xn X2 Xn X3 X1 X1 X2 X3 X2 X3 X3 Xn Xn Complex In-Order Pipeline: Precise Exceptions • Delay writeback so all operations have same latencyto W stage • Write ports never oversubscribed (one inst. in & one inst. out every cycle) • Instructions commit in order, simplifies precise exception implementation • How to prevent increase latency for single-cycle ops? • Bypassing • However: can be very expensive • Other downside: no out-of-order execution Data Mem Inst. Mem Decode GPRs + FPRs Fadd Commit Point Fmul FDiv Unpipelined divider CS252-S09, lecture 7

  20. In-Order Commit for Precise Exceptions In-order Out-of-order In-order Commit Fetch Decode Reorder Buffer Kill Kill Kill Exception? Execute Inject handler PC • Instructions fetched and decoded into instruction • reorder buffer in-order • Execution is out-of-order (  out-of-order completion) • Commit (write-back to architectural state, i.e., regfile & • memory) is in-order Temporary storage needed to hold results before commit (shadow registers and store buffers) CS252-S09, lecture 7

  21. Reorder Buffer • Idea: • record instruction issue order • Allow them to execute out of order • Reorder them so that they commit in-order • On issue: • Reserve slot at tail of ROB • Record dest reg, PC • Tag u-op with ROB slot • Done execute • Deposit result in ROB slot • Mark exception state • WB head of ROB • Check exception, handle • Write register value, or • Commit the store IFetch RF Opfetch/Dcd Write Back CS252-S09, lecture 7

  22. Reorder Buffer + Forwarding • Idea: • Forward uncommitted results to later uncommitted operations • Trap • Discard remainder of ROB • Opfetch / Exec • Match source reg against all dest regs in ROB • Forward last (once available) IFetch Reg Opfetch/Dcd Write Back CS252-S09, lecture 7

  23. Reorder Buffer + Forwarding + Speculation • Idea: • Issue branch into ROB • Mark with prediction • Fetch and issue predicted instructions speculatively • Branch must resolve before leaving ROB • Resolve correct • Commit following instr • Resolve incorrect • Mark following instr in ROB as invalid • Let them clear IFetch Reg Opfetch/Dcd Write Back CS252-S09, lecture 7

  24. History File: a hardware undo log • Maintain issue order, like ROB • Each entry records dest reg and old value of dest. Register • What if old value not available when instruction issues? • FUs write results into register file • Forward into correct entry in history file • When exception reaches head • Restore architected registers from tail to head IFetch Reg Opfetch/Dcd Write Back CS252-S09, lecture 7

  25. Future file • Idea • Arch registers reflect state at commit point • Future register reflect whatever instructions have completed • On WB update future • On commit update arch • On exception • Discard future • Replace with arch • Dest w/I ROB IFetch Future Opfetch/Dcd Reg Write Back CS252-S09, lecture 7

  26. Reorder Buffer FP Op Queue Compar network Program Counter Exceptions? Dest Reg FP Regs Result Valid Reorder Table Res Stations Res Stations FP Adder FP Adder What are the hardware complexities with reorder buffer (ROB)? • How do you find the latest version of a register? • As specified by Smith paper, need associative comparison network • Could use future file or just use the register result status buffer to track which specific reorder buffer has received the value • Need as many ports on ROB as register file CS252-S09, lecture 7

  27. Four Steps of Speculative Tomasulo 1. Issue—get instruction from FP Op Queue If reservation station and reorder buffer slot free, issue instr & send operands &reorder buffer no. for destination (this stage sometimes called “dispatch”) 2. Execution—operate on operands (EX) When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”) 3. Write result—finish execution (WB) Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available. 4. Commit—update register with reorder result When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”) CS252-S09, lecture 7

  28. 1 10+R2 Tomasulo With Reorder buffer: Done? FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 Newest Reorder Buffer Oldest F0 LD F0,10(R2) N Registers To Memory Dest from Memory Dest Dest Reservation Stations FP adders FP multipliers CS252-S09, lecture 7

  29. F10 ADDD F10,F4,F0 N F0 LD F0,10(R2) N 1 10+R2 Tomasulo With Reorder buffer: Done? FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 Newest Reorder Buffer Oldest Registers To Memory Dest from Memory Dest 2 ADDD R(F4),ROB1 Dest Reservation Stations FP adders FP multipliers CS252-S09, lecture 7

  30. F2 DIVD F2,F10,F6 N F10 ADDD F10,F4,F0 N F0 LD F0,10(R2) N 3 DIVD ROB2,R(F6) 1 10+R2 Tomasulo With Reorder buffer: Done? FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 Newest Reorder Buffer Oldest Registers To Memory Dest from Memory Dest 2 ADDD R(F4),ROB1 Dest Reservation Stations FP adders FP multipliers CS252-S09, lecture 7

  31. F0 ADDD F0,F4,F6 N F4 LD F4,0(R3) N -- BNE F2,<…> N F2 DIVD F2,F10,F6 N F10 ADDD F10,F4,F0 N F0 LD F0,10(R2) N 3 DIVD ROB2,R(F6) Tomasulo With Reorder buffer: Done? FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 Newest Reorder Buffer Oldest Registers To Memory Dest from Memory Dest 2 ADDD R(F4),ROB1 6 ADDD ROB5, R(F6) Dest Reservation Stations 1 10+R2 6 0+R3 FP adders FP multipliers CS252-S09, lecture 7

  32. -- ROB5 ST 0(R3),F4 N F0 ADDD F0,F4,F6 N F4 LD F4,0(R3) N -- BNE F2,<…> N F2 DIVD F2,F10,F6 N F10 ADDD F10,F4,F0 N F0 LD F0,10(R2) N 3 DIVD ROB2,R(F6) Tomasulo With Reorder buffer: Done? FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 Newest Reorder Buffer Oldest Registers To Memory Dest from Memory Dest 2 ADDD R(F4),ROB1 6 ADDD ROB5, R(F6) Dest Reservation Stations 1 10+R2 6 0+R3 FP adders FP multipliers CS252-S09, lecture 7

  33. -- M[10] ST 0(R3),F4 Y F0 ADDD F0,F4,F6 N F4 M[10] LD F4,0(R3) Y -- BNE F2,<…> N F2 DIVD F2,F10,F6 N F10 ADDD F10,F4,F0 N F0 LD F0,10(R2) N 3 DIVD ROB2,R(F6) 1 10+R2 Tomasulo With Reorder buffer: Done? FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 Newest Reorder Buffer Oldest Registers To Memory Dest from Memory Dest 2 ADDD R(F4),ROB1 6 ADDD M[10],R(F6) Dest Reservation Stations FP adders FP multipliers CS252-S09, lecture 7

  34. -- M[10] ST 0(R3),F4 Y F0 <val2> ADDD F0,F4,F6 Ex F4 M[10] LD F4,0(R3) Y -- BNE F2,<…> N F2 DIVD F2,F10,F6 N F10 ADDD F10,F4,F0 N F0 LD F0,10(R2) N 3 DIVD ROB2,R(F6) 1 10+R2 Tomasulo With Reorder buffer: Done? FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 Newest Reorder Buffer Oldest Registers To Memory Dest from Memory Dest 2 ADDD R(F4),ROB1 Dest Reservation Stations FP adders FP multipliers CS252-S09, lecture 7

  35. -- M[10] ST 0(R3),F4 Y F0 <val2> ADDD F0,F4,F6 Ex F4 M[10] LD F4,0(R3) Y -- BNE F2,<…> N What about memory hazards??? 3 DIVD ROB2,R(F6) 1 10+R2 Tomasulo With Reorder buffer: Done? FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 Newest Reorder Buffer F2 DIVD F2,F10,F6 N F10 ADDD F10,F4,F0 N Oldest F0 LD F0,10(R2) N Registers To Memory Dest from Memory Dest 2 ADDD R(F4),ROB1 Dest Reservation Stations FP adders FP multipliers CS252-S09, lecture 7

  36. Memory Disambiguation:Sorting out RAW Hazards in memory • Question: Given a load that follows a store in program order, are the two related? • (Alternatively: is there a RAW hazard between the store and the load)? Eg: st 0(R2),R5 ld R6,0(R3) • Can we go ahead and start the load early? • Store address could be delayed for a long time by some calculation that leads to R2 (divide?). • We might want to issue/begin execution of both operations in same cycle. • Today: Answer is that we are not allowed to start load until we know that address 0(R2)  0(R3) • Next Week: We might guess at whether or not they are dependent (called “dependence speculation”) and use reorder buffer to fixup if we are wrong. CS252-S09, lecture 7

  37. Hardware Support for Memory Disambiguation • Need buffer to keep track of all outstanding stores to memory, in program order. • Keep track of address (when becomes available) and value (when becomes available) • FIFO ordering: will retire stores from this buffer in program order • When issuing a load, record current head of store queue (know which stores are ahead of you). • When have address for load, check store queue: • If any store prior to load is waiting for its address, stall load. • If load address matches earlier store address (associative lookup), then we have a memory-induced RAW hazard: • store value available  return value • store value not available  return ROB number of source • Otherwise, send out request to memory • Actual stores commit in order, so no worry about WAR/WAW hazards through memory. CS252-S09, lecture 7

  38. Memory Disambiguation: Done? FP Op Queue ROB7 ROB6 ROB5 ROB4 ROB3 ROB2 ROB1 Newest Reorder Buffer -- LD F4, 10(R3) N F2 R[F5] ST 10(R3), F5 N F0 LD F0,32(R2) N Oldest -- <val 1> ST 0(R3), F4 Y Registers To Memory Dest from Memory Dest Dest Reservation Stations 2 32+R2 4ROB3 FP adders FP multipliers CS252-S09, lecture 7

  39. Relationship between precise interrupts and speculation: • Speculation is a form of guessing • Branch prediction, data prediction • If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly • This is exactly same as precise exceptions! • Branch prediction is a very important! • Need to “take our best shot” at predicting branch direction. • If we issue multiple instructions per cycle, lose lots of potential instructions otherwise: • Consider 4 instructions per cycle • If take single cycle to decide on branch, waste from 4 - 7 instruction slots! • Technique for both precise interrupts/exceptions and speculation: in-order completion or commit • This is why reorder buffers in all new processors CS252-S09, lecture 7

  40. Quick Recap: Explicit Register Renaming • Make use of a physical register file that is larger than number of registers specified by ISA • Keep a translation table: • ISA register => physical register mapping • When register is written, replace table entry with new register from freelist. • Physical register becomes free when not being used by any instructions in progress. Fetch Decode/ Rename Execute Rename Table CS252-S09, lecture 7

  41. P0 P2 P4 F6 F8 P10 P12 P14 P16 P18 P20 P22 P24 p26 P28 P30 Newest Done? Oldest  P32 P34 P36 P38 P60 P62 Explicit register renaming:R10000 Freelist Management • Physical register file larger than ISA register file • On issue, each instruction that modifies a register is allocated new physical register from freelist • Used on: R10000, Alpha 21264, HP PA8000 Current Map Table Freelist CS252-S09, lecture 7

  42. P32 P2 P4 F6 F8 P10 P12 P14 P16 P18 P20 P22 P24 p26 P28 P30 Newest Oldest  P34 P36 P38 P40 P60 P62 Explicit register renaming:R10000 Freelist Management • Note that physical register P0 is “dead” (or not “live”) past the point of this load. • When we go to commit the load, we free up Done? Current Map Table Freelist F0 P0 LD P32,10(R2) N CS252-S09, lecture 7

  43. P32 P2 P4 F6 F8 P34 P12 P14 P16 P18 P20 P22 P24 p26 P28 P30 Newest Oldest  P36 P38 P40 P42 P60 P62 Explicit register renaming:R10000 Freelist Management Done? Current Map Table F10 P10 ADDD P34,P4,P32 N Freelist F0 P0 LD P32,10(R2) N CS252-S09, lecture 7

  44. P32 P32 P36 P36 P4 P4 F6 F6 F8 F8 P34 P34 P12 P12 P14 P14 P16 P16 P18 P18 P20 P20 P22 P22 P24 P24 p26 p26 P28 P28 P30 P30 Newest Done? -- Oldest -- BNE P36,<…> N F2 P2 DIVD P36,P34,P6 N F10 P10 ADDD P34,P4,P32 N  P38 P38 P40 P40 P44 P44 P48 P48 P60 P62 F0 P0 LD P32,10(R2) N Explicit register renaming:R10000 Freelist Management Current Map Table Freelist  Checkpoint at BNE instruction P60 P62 CS252-S09, lecture 7

  45. P40 P32 P36 P36 P4 P38 F6 F6 F8 F8 P34 P34 P12 P12 P14 P14 P16 P16 P18 P18 P20 P20 P22 P22 P24 P24 p26 p26 P28 P28 P30 P30 Newest Oldest  P42 P38 P40 P44 P44 P48 P50 P48 P0 P10 Explicit register renaming:R10000 Freelist Management Done? Current Map Table -- ST 0(R3),P40 Y F0 P32 ADDD P40,P38,P6 Y F4 P4 LD P38,0(R3) Y -- BNE P36,<…> N F2 P2 DIVD P36,P34,P6 N F10 P10 ADDD P34,P4,P32 y Freelist F0 P0 LD P32,10(R2) y  Checkpoint at BNE instruction P60 P62 CS252-S09, lecture 7

  46. P32 P32 P36 P36 P4 P4 F6 F6 F8 F8 P34 P34 P12 P12 P14 P14 P16 P16 P18 P18 P20 P20 P22 P22 P24 P24 p26 p26 P28 P28 P30 P30 Newest Oldest  P38 P38 P40 P40 P44 P44 P48 P48 P0 P10 Explicit register renaming:R10000 Freelist Management Done? Current Map Table F2 P2 DIVD P36,P34,P6 N F10 P10 ADDD P34,P4,P32 y Freelist F0 P0 LD P32,10(R2) y Error fixed by restoring map table and merging freelist  Checkpoint at BNE instruction P60 P62 CS252-S09, lecture 7

  47. Advantages of Explicit Renaming • Decouples renamingfrom scheduling: • Pipeline can be exactly like “standard” DLX pipeline (perhaps with multiple operations issued per cycle) • Or, pipeline could be tomasulo-like or a scoreboard, etc. • Standard forwarding or bypassing could be used • Allows data to be fetched from single register file • No need to bypass values from reorder buffer • This can be important for balancing pipeline • Many processors use a variant of this technique: • R10000, Alpha 21264, HP PA8000 • Another way to get precise interrupt points: • All that needs to be “undone” for precise break pointis to undo the table mappings • Provides an interesting mix between reorder buffer and future file • Results are written immediately back to register file • Registers names are “freed” in program order (by ROB) CS252-S09, lecture 7

  48. Dynamic Scheduling for Out-Of-Order Superscalar • How to issue two instructions and keep in-order instruction issue for Tomasulo? • Assume 1 integer + 1 floating point • 1 Tomasulo control for integer, 1 for floating point • Issue 2X Clock Rate, so that issue remains in order • Only FP loads might cause dependency between integer and FP issue: • Replace load reservation station with a load queue; operands must be read in the order they are fetched • Load checks addresses in Store Queue to avoid RAW violation • Store checks addresses in Load Queue to avoid WAR,WAW • Called “decoupled architecture” CS252-S09, lecture 7

  49. Multiple Issue Challenges • While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with: • Exactly 50% FP operations • No hazards • If more instructions issue at same time, greater difficulty of decode and issue: • Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue • Multiported rename logic: must be able to rename same register multiple times in one cycle! • Rename logic one of key complexities in the way of multiple issue! • VLIW: tradeoff instruction space for simple decoding • The long instruction word has room for many operations • By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel • E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch • 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide • Need compiling technique that schedules across several branches CS252-S09, lecture 7

  50. Loop Unrolling in VLIW Memory Memory FP FP Int. op/ Clockreference 1 reference 2 operation 1 op. 2 branch LD F0,0(R1) LD F6,-8(R1) 1 LD F10,-16(R1) LD F14,-24(R1) 2 LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD F8,F6,F2 3 LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 4 ADDD F20,F18,F2 ADDD F24,F22,F2 5 SD 0(R1),F4 SD -8(R1),F8 ADDD F28,F26,F2 6 SD -16(R1),F12 SD -24(R1),F16 7 SD -32(R1),F20 SD -40(R1),F24 SUBI R1,R1,#48 8 SD -0(R1),F28 BNEZ R1,LOOP 9 Unrolled 7 times to avoid delays 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X) Average: 2.5 ops per clock, 50% efficiency Note: Need more registers in VLIW (15 vs. 6 in SS) CS252-S09, lecture 7

More Related