Single issue machine with multiple pipes

This article discusses the use of multiple pipelines in a single-issue machine, allowing for different operations to be performed simultaneously. Hazards such as RAW, WAW, and out-of-order completion are also addressed.

gilliand
Presentation Transcript


  1. Single issue machine with multiple pipes
     • Motivation: single issue but different pipes for integer and f-p operations
     • We still fetch only 1 instruction/cycle
     • We still decode only 1 instruction/cycle
     • But we might have several pipelines and units which are not pipelined
     • At decode stage, decision on which pipe to use
     • When a unit is pipelined, an operation can be initiated every cycle; if not pipelined, must wait for the latency
     CSE 471 Multiple pipes

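  2. [Figure: pipeline diagram. IF and ID are followed by parallel execution paths (integer EX; f-p multiply M1 ... M7; f-p add A1 ... A4; Div), then Me and WB.]
     • Two sets of registers: integer and f-p; but loads/stores of f-p registers go through the integer pipe, hence conflicts in the WB stage
     • Figure annotation: "both needed at beg of cycle & ready at end of cycle"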

  3. Unit latencies
     • Pipelines might have an EXE stage that takes multiple cycles, for example:
       • EXE integer: latency 0 (pipelined)
       • FP adder: latency 3 (pipelined)
       • FP multiply (and integer multiply): latency 6 (pipelined)
       • FP divide (and integer divide): latency 25 (not pipelined)
     • Result of instruction I can be forwarded to instruction I + 1 + latency
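The forwarding rule on slide 3 can be tried out with a minimal sketch. The latencies are copied from the slide; the dictionary keys and function names are hypothetical, and the model ignores every other source of stalls (structural hazards, WB conflicts).

```python
# Latencies from slide 3. Key names are illustrative, not from the slides.
LATENCY = {
    "int": 0,      # pipelined
    "fp_add": 3,   # pipelined
    "fp_mul": 6,   # pipelined
    "fp_div": 25,  # not pipelined
}


def first_consumer_without_stall(producer_index: int, unit: str) -> int:
    """Index of the earliest later instruction that can receive the
    forwarded result of instruction `producer_index` without stalling."""
    return producer_index + 1 + LATENCY[unit]


def raw_stalls(distance: int, unit: str) -> int:
    """Stall cycles for a consumer `distance` instructions after the
    producer (distance >= 1), counting only this one RAW dependence."""
    return max(0, LATENCY[unit] - (distance - 1))
```

Under these assumptions, an instruction immediately following an f-p multiply it depends on (`raw_stalls(1, "fp_mul")`) waits for the full latency, while the same consumer placed seven instructions later does not stall.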

  4. Hazards in example multiple cycle pipeline
     • Structural: yes
       • Divide unit is not pipelined. Any divides separated by less than 25 cycles will stall the pipe
     • RAW: yes
       • Essentially handled as in integer pipe but with higher frequency of stalls and more forwarding paths
       • Several writes might be "ready" at the same time
     • WAW: yes (see in a few slides)
     • Out-of-order completion: yes (see in a few slides)
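The structural hazard on the divide unit can be sketched as follows. This is a simplified model, not the course's own: it assumes the unit is busy for exactly 25 cycles per divide (the figure from the slides) and it does not model how a stalled divide delays the instructions behind it. The function name is hypothetical.

```python
DIV_LATENCY = 25  # from the slides; the divide unit is not pipelined


def divide_start_cycles(arrival_cycles):
    """Given the cycles at which successive divides would like to start
    (in ascending order), return the cycles at which each one actually
    starts on a single non-pipelined divide unit."""
    starts = []
    free_at = 0  # first cycle at which the unit is available
    for cycle in arrival_cycles:
        start = max(cycle, free_at)  # wait if the unit is still busy
        starts.append(start)
        free_at = start + DIV_LATENCY
    return starts
```

For example, with arrivals at cycles 0, 10, and 60, the second divide is held until cycle 25 while the third starts on time.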

  5. RAW Example (st = stall; one row per instruction, stages in order)
     F4 <- Load      IF ID EX Me WB
     F0 <- F4 * F6   IF ID st M1 M2 M3 M4 M5 M6 M7 Me WB
     F2 <- F0 + F8   IF ID st st st st st A1 A2 A3 A4 Me WB
     Store <- F2     IF ID EX st st st st st st st Me WB

  6. Conflict in using the WB stage
     • Several instructions might want to use the WB stage at the same time
       • E.g., a multd issued at time t and an addd issued at time t + 3
     • Solution: reserve the WB stage at the ID stage (scheme already used in the CRAY-1, a supercomputer built in 1976)
       • Keep track of WB stage usage in a shift register
       • Reserve the right slot. If busy, stall for a cycle and repeat
       • Shift every clock cycle
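The shift-register reservation scheme can be sketched like this. The class and function names are hypothetical, and the distances used in the example (WB is 9 cycles after ID for a multiply, 6 cycles after ID for an add) are assumptions derived from the stage names M1 ... M7, A1 ... A4, Me, WB that appear in the slides.

```python
class WBReservation:
    """Shift register of future WB usage: slots[i] is True when the WB
    stage is already reserved i cycles from now."""

    def __init__(self, depth: int = 32):
        self.slots = [False] * depth

    def try_reserve(self, cycles_until_wb: int) -> bool:
        """Reserve the slot if it is free; report whether it succeeded."""
        if self.slots[cycles_until_wb]:
            return False
        self.slots[cycles_until_wb] = True
        return True

    def tick(self) -> None:
        """Advance one clock cycle: every reservation moves one slot closer."""
        self.slots = self.slots[1:] + [False]


def issue(reg: WBReservation, cycles_until_wb: int) -> int:
    """Stall in ID, one cycle at a time, until the WB slot is free.
    Returns the number of stall cycles."""
    stalls = 0
    while not reg.try_reserve(cycles_until_wb):
        reg.tick()
        stalls += 1
    return stalls


# The slide's example: multd at time t, addd at time t + 3.
reg = WBReservation()
mul_stalls = issue(reg, 9)   # assumed: M1..M7, Me, then WB
for _ in range(3):
    reg.tick()               # three cycles pass
add_stalls = issue(reg, 6)   # assumed: A1..A4, Me, then WB
```

With these assumed distances both instructions would reach WB in the same cycle, so the add is held in ID for one cycle and writes back right after the multiply.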

  7. WAW Hazards
     • Instruction I writes f-p register Fx at time t
       Instruction I + k writes f-p register Fx at time t - m
       And no instruction I + 1, I + 2, ..., I + k uses Fx (otherwise there would be a stall)
     • Seems unlikely, but it can occur as a result of optimizations, and it will happen when we look at OOO execution
     • Only requirement is that I + k's result not be overwritten
     • Solutions (besides register renaming, which we'll see later):
       • Squash I: difficult to know where it is in the pipe
       • At ID stage, check that the result register is not a result register in any subsequent stage of the other units. If it is, stall appropriately.
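The ID-stage check in the second solution can be sketched as below. This is an illustration under stated assumptions, not the course's implementation: each in-flight instruction is represented by a hypothetical pair (destination register, cycles remaining until it writes), and the new instruction must write strictly after every earlier pending write to the same register.

```python
def waw_stall_cycles(dest, my_write_delay, in_flight):
    """Stall cycles needed in ID so that the new instruction's write to
    `dest` lands after all earlier pending writes to `dest`.

    my_write_delay: cycles from leaving ID until this instruction writes.
    in_flight: list of (dest_register, cycles_until_write) for earlier
               instructions still in the pipes.
    """
    stalls = 0
    for reg, remaining in in_flight:
        if reg == dest and my_write_delay <= remaining:
            # After k stall cycles the earlier write is `remaining - k`
            # away; we need my_write_delay > remaining - k.
            stalls = max(stalls, remaining - my_write_delay + 1)
    return stalls
```

For instance, if an earlier divide will write F2 in 20 cycles and a new add would write F2 in 6, the add waits 15 cycles in this model; a real design would more likely rely on renaming, as the slide notes.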

  8. Out-of-order completion
     • Problem with exceptions/interrupts
     • Instruction I finishes at time t
       Instruction I + k finishes at time t - m
       No hazard etc.
     • What happens if instruction I causes an exception at a time in [t-m+1, t] and instruction I + k has written its result?

  9. Exception handling
     • Solutions
       • Do nothing (imprecise exceptions; bad with virtual memory)
       • Have a precise (by use of testing instructions) and an imprecise mode; restricts concurrency of f-p operations
       • Buffer results until previous (in order) instructions have completed; can be costly when there are large differences in latencies, but the same technique is used for OOO execution
       • Restrict concurrency of f-p operations and, on an exception, "simulate in software" the instructions in between the faulting and the finished one
       • Flag early those operations that might result in an exception and stall accordingly
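The "buffer results until previous instructions have completed" option can be sketched as follows. It is a minimal model with a hypothetical function name, and it assumes any number of buffered results may be written in the same cycle, which real hardware would limit.

```python
def commit_cycles(finish_cycles):
    """finish_cycles[i] is the cycle at which instruction i (in program
    order) finishes executing. Returns, for each instruction, the earliest
    cycle at which its result may update architectural state if results
    are held until every earlier instruction has finished."""
    commits = []
    latest = 0
    for finish in finish_cycles:
        latest = max(latest, finish)  # cannot commit before any predecessor
        commits.append(latest)
    return commits
```

With finish times 30, 8, 12, 40, the second and third results sit in the buffer until cycle 30, which illustrates why the slide calls the scheme costly when latencies differ widely.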

  10. MIPS pipelines (R4000)
     • R4000 (about 1993; first 64-bit architecture)
     • 8-stage integer pipe
       • Load delay 2 cycles
       • Branch delay: 1 delay slot + 2 cycles, no branch prediction hardware (default prediction of branch not taken)
     • 8-stage f-p pipe
       • 3 functional units: adder, multiplier, and divider
       • Stages can be used in any order, multiple times
       • Thus potential conflicts between independent instructions (structural hazards)
       • There exists a whole theory on how to deal with this (reservation tables)

  11. Alpha pipelines
     • Alpha 21064 (2-way superscalar @1993) and Alpha 21164 (4-way superscalar @1995)
     • Fastest clock-wise at the time of introduction
     • 21064
       • Ibox (Ifetch and decode: 4 cycles) common to:
         • Ebox (integer execution unit: 3 stages)
         • Fbox (floating-point execution unit: 6 stages)
         • Abox (load-store unit: 3 stages)
       • Stalls can occur only in the first 4 stages

  12. Alpha 21164
     • Two integer pipelines (1 of them also used for load-store)
     • Two floating-point pipelines
     • Still 7 stages for integer and memory pipelines but:
       • Load delay only 1 cycle instead of 2 in 21064 (faster check for TLB)
       • Mispredict penalty 5 cycles instead of 4 (better branch prediction though)

  13. MIPS pipelines (R10000)
     • MIPS R10000 (4-way out-of-order issue)
     • 5 pipelines
       • Common first 2 stages (IF, ID)
       • 2 integer ALUs with 3 more stages (one ALU used for compares; apparently 3 cycles branch-taken penalty but ... less because of the resume buffer)
       • 1 load-store with 4 more stages (1 cycle load delay)
       • 2 FP units with 5 more stages (1 for Add, 1 for Mpy and long-latency ops such as Div and Sqrt)

  14. [Figure: R10000 pipelines. Common IF and ID stages, followed by:]
     • 2 int ALUs: RF EX WB (mispred. penalty: 3 cycles)
     • 1 load/store: RF Addr Mem WB (load delay: 1 cycle)
     • 1 FP add, 1 FP mpy: RF EX1 EX2 EX3 WB

  15. Power PC
     • 601 -- 2-way issue; slower than Alpha 21064 but OOO
       • Branch unit (based on condition codes), integer/load/store, f-p
     • 620 -- 4-way issue; OOO (@1995)
       • "Traditional" 5-stage pipeline
       • 2 integer ALUs + 1 mult/div
       • 1 load/store unit
       • 1 FPU
       • 1 branch unit (sophisticated) and branching based on CCs
         • If the instruction setting the CCs and the branch are separated by at least 2 cycles, prediction is always correct
         • 2K-entry BPT (called BHT) and 256-entry BTB (called BTAC, branch target address cache)
         • Misprediction penalty 2 or 3 cycles

  16. Pentium
     • 2-way superscalar (@1992)
     • 2 integer ALUs of the 5-stage variety (not quite), since more stages are needed for fetch/align and decode (2 1/2 stages)
     • First 2 stages common to both pipes
     • F-P unit has 8 stages (including the common 2); latency of 3 cycles
     • Branch penalty: if correct prediction in BTB or branch not taken, no delay; otherwise 3 or 4 cycles

  17. Pentium Pro
     • OOO issue and completion (@1995)
     • Separation between:
       • Fetch/decode unit
       • Functional units
       • Retire unit
     • CISC instructions are transformed into RISC uops
