
COMP541 Pipelined MIPS




Presentation Transcript


  1. COMP541 Pipelined MIPS. Montek Singh, Mar 30, 2010

  2. Topics • Pipelining: can think of as • A way to parallelize, or • A way to make better use of the hardware • Goal: use all hardware every cycle • Section 7.5 of text

  3. Parallelism • Two types of parallelism: • Spatial parallelism • duplicate hardware performs multiple tasks at once • Temporal parallelism • task is broken into multiple stages • also called pipelining • for example, an assembly line

  4. Parallelism Definitions • Some definitions: • Token: A group of inputs processed to produce a group of outputs • Latency: Time for one token to pass from start to end • Throughput: The number of tokens that can be produced per unit time • Parallelism increases throughput • Often sacrificing latency

  5. Parallelism Example • Ben is baking cookies • It takes 5 minutes to roll the cookies and 15 minutes to bake them. • After finishing one batch he immediately starts the next batch. What is the latency and throughput if Ben doesn’t use parallelism? Latency = 5 + 15 = 20 minutes = 1/3 hour Throughput = 1 tray per 1/3 hour = 3 trays/hour

  6. Parallelism Example • What is the latency and throughput if Ben uses parallelism? • Spatial parallelism: Ben asks Allysa to help, using her own oven • Temporal parallelism: Ben breaks the task into two stages: roll and baking. He uses two trays. While the first batch is baking he rolls the second batch, and so on.

  7. Spatial Parallelism Latency = ? Throughput = ?

  8. Spatial Parallelism Latency = 5 + 15 = 20 minutes = 1/3 hour (same) Throughput = 2 trays per 1/3 hour = 6 trays/hour (doubled)

  9. Temporal Parallelism Latency = ? Throughput = ?

  10. Temporal Parallelism Latency = 5 + 15 = 20 minutes = 1/3 hour Throughput = 1 tray per 1/4 hour (limited by the slower, 15-minute stage) = 4 trays/hour Using both techniques, the throughput would be 8 trays/hour
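The arithmetic on the last few slides can be checked in a few lines (a sketch; the variable names are ours, the numbers are the slides' example):

```python
# Cookie-baking example: 5 min to roll, 15 min to bake (from the slides).
ROLL_MIN, BAKE_MIN = 5, 15

# No parallelism: one tray every (roll + bake) minutes.
latency_min = ROLL_MIN + BAKE_MIN              # 20 minutes
serial_tph = 60 / latency_min                  # 3 trays/hour

# Spatial parallelism: a second baker with her own oven doubles the rate.
spatial_tph = 2 * serial_tph                   # 6 trays/hour

# Temporal parallelism: a new tray enters every max(stage-time) minutes,
# so throughput is limited by the slower (15-minute) stage.
pipelined_tph = 60 / max(ROLL_MIN, BAKE_MIN)   # 4 trays/hour

# Both techniques combined.
both_tph = 2 * pipelined_tph                   # 8 trays/hour
print(serial_tph, spatial_tph, pipelined_tph, both_tph)  # 3.0 6.0 4.0 8.0
```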

  11. Pipelined MIPS • Temporal parallelism • Divide single-cycle processor into 5 stages: • Fetch • Decode • Execute • Memory • Writeback • Add pipeline registers between stages
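To see how the five stages overlap, here is a tiny helper (our illustration, not from the slides) that reports which stage an instruction occupies each cycle, assuming one instruction issues per cycle and no hazards:

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]

def stage_of(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based) during the
    given cycle (0-based), or None if it is not in the pipeline."""
    s = cycle - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None

# In cycle 2 (the third cycle), instruction 0 is executing while
# instruction 1 decodes and instruction 2 is being fetched.
```

The pipeline registers between stages hold each instruction's intermediate values, so each instruction effectively carries its own state from one column of this table to the next.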

  12. Single-Cycle vs. Pipelined Performance

  13. Pipelining Abstraction

  14. Single-Cycle and Pipelined Datapath

  15. Multi-Cycle and Pipelined Datapath

  16. Corrected Pipelined Datapath • WriteReg must arrive at the same time as Result

  17. Pipelined Control • Same control unit as single-cycle processor • Control signals delayed to the proper pipeline stage

  18. Pipeline Hazard • Occurs when an instruction depends on results from a previous instruction that hasn’t completed yet. • Types of hazards: • Data hazard: register value not yet written back to the register file • Control hazard: next instruction not decided yet (caused by branches)

  19. Data Hazard

  20. Handling Data Hazards • Static • Insert nops in code at compile time • Rearrange code at compile time • Dynamic • Forward data at run time • Stall the processor at run time

  21. Compile-Time Hazard Elimination • Insert enough nops for result to be ready • Or move independent useful instructions forward

  22. Data Forwarding • Also known as bypassing

  23. Data Forwarding

  24. Data Forwarding • Forward to Execute stage from either: • Memory stage, or • Writeback stage • Forwarding logic for ForwardAE:
      if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then
          ForwardAE = 10
      else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then
          ForwardAE = 01
      else
          ForwardAE = 00
  • Forwarding logic for ForwardBE is the same, but with rsE replaced by rtE

  25. Data Forwarding
      if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then
          ForwardAE = 10
      else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then
          ForwardAE = 01
      else
          ForwardAE = 00
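Transcribed into Python (a behavioral sketch, not synthesizable HDL; signal names follow the slide), the ForwardAE priority logic is:

```python
def forward_ae(rsE, WriteRegM, RegWriteM, WriteRegW, RegWriteW):
    """Select the Execute-stage ALU input A: forward from Memory if it
    holds a matching result, else from Writeback, else no forwarding."""
    if rsE != 0 and rsE == WriteRegM and RegWriteM:
        return 0b10   # forward from the Memory stage (most recent result)
    if rsE != 0 and rsE == WriteRegW and RegWriteW:
        return 0b01   # forward from the Writeback stage
    return 0b00       # no hazard: use the value read from the register file
```

Checking Memory before Writeback matters: if both stages are about to write the same register, the Memory stage holds the more recent value.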

  26. Forwarding can fail… lw has a 2-cycle latency!

  27. Stalling

  28. Stalling Hardware

  29. Stalling Hardware • Stalling logic:
      lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
      StallF = StallD = FlushE = lwstall

  30. Stalling Control
      lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
      StallF = StallD = FlushE = lwstall
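The same stall condition as a Python predicate (a sketch; signal names follow the slide):

```python
def lw_stall(rsD, rtD, rtE, MemtoRegE):
    """True when the instruction in Decode reads the register that a lw
    currently in Execute will load; Fetch and Decode must stall for one
    cycle and the Execute stage is flushed (turned into a bubble)."""
    return ((rsD == rtE) or (rtD == rtE)) and bool(MemtoRegE)
```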

  31. Control Hazards • beq: • branch is not determined until the fourth stage of the pipeline • Instructions after the branch are fetched before branch occurs • These instructions must be flushed if the branch happens

  32. Effect & Solutions • Could stall when branch decoded • Expensive: 3 cycles lost per branch! • Could predict and flush if wrong • Branch misprediction penalty • Instructions flushed when branch is taken • May be reduced by determining branch earlier

  33. Control Hazards: Flushing

  34. Control Hazards: Original Pipeline (for comparison)

  35. Control Hazards: Early Branch Resolution Introduced another data hazard in Decode stage (fix a few slides away)

  36. Control Hazards with Early Branch Resolution Penalty now only one lost cycle

  37. Aside: Delayed Branch • MIPS always executes the instruction following a branch • So the branch takes effect one instruction late (the “branch delay slot”) • This allows us to avoid killing that instruction • Compilers move an instruction that has no conflict with the branch into the delay slot

  38. Example • This sequence:
      add $4, $5, $6
      beq $1, $2, 40
  • is reordered to this:
      beq $1, $2, 40
      add $4, $5, $6

  39. Handling the New Hazards

  40. Control Forwarding and Stalling Hardware • Forwarding logic:
      ForwardAD = (rsD != 0) AND (rsD == WriteRegM) AND RegWriteM
      ForwardBD = (rtD != 0) AND (rtD == WriteRegM) AND RegWriteM
  • Stalling logic:
      branchstall = (BranchD AND RegWriteE AND (WriteRegE == rsD OR WriteRegE == rtD))
                 OR (BranchD AND MemtoRegM AND (WriteRegM == rsD OR WriteRegM == rtD))
      StallF = StallD = FlushE = lwstall OR branchstall
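These conditions, transcribed into Python (a sketch; signal names follow the slide):

```python
def forward_ad(rsD, WriteRegM, RegWriteM):
    """Forward the Memory-stage ALU result to the Decode-stage comparator."""
    return rsD != 0 and rsD == WriteRegM and bool(RegWriteM)

def branch_stall(BranchD, rsD, rtD,
                 RegWriteE, WriteRegE, MemtoRegM, WriteRegM):
    """Stall a branch in Decode when a source it compares is an ALU result
    still in Execute, or a loaded value still in Memory."""
    alu_use = BranchD and RegWriteE and (WriteRegE == rsD or WriteRegE == rtD)
    lw_use = BranchD and MemtoRegM and (WriteRegM == rsD or WriteRegM == rtD)
    return bool(alu_use or lw_use)
```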

  41. Branch Prediction • Especially important if branch penalty > 1 cycle • Guess whether branch will be taken • Backward branches are usually taken (loops) • Perhaps consider history of whether branch was previously taken to improve the guess • Good prediction reduces the fraction of branches requiring a flush
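One very simple history-based scheme in the spirit of the slide (our illustration, not from the slides): predict that each branch repeats its last outcome, and guess “taken” for backward branches seen for the first time.

```python
class LastOutcomePredictor:
    """1-bit predictor: remember each branch's last outcome."""
    def __init__(self):
        self.history = {}   # branch PC -> last outcome (True = taken)

    def predict(self, pc, backward):
        # Unseen branch: backward branches (loops) are usually taken.
        return self.history.get(pc, backward)

    def update(self, pc, taken):
        self.history[pc] = taken
```

Real processors refine this with saturating counters so that a single anomalous outcome does not flip the prediction, but the idea is the same.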

  42. Pipelined Performance Example • Ideally CPI = 1 • But higher in practice due to stalls (caused by loads and branches) • SPECINT2000 benchmark: • 25% loads • 10% stores • 11% branches • 2% jumps • 52% R-type • Suppose: • 40% of loads used by next instruction • 25% of branches mispredicted • All jumps flush next instruction • What is the average CPI?

  43. Pipelined Performance Example • SPECINT2000 benchmark: • 25% loads • 10% stores • 11% branches • 2% jumps • 52% R-type • Suppose: • 40% of loads used by next instruction • 25% of branches mispredicted • All jumps flush next instruction • What is the average CPI? • Load/Branch CPI = 1 when no stalling, 2 when stalling. Thus, • CPIlw = 1(0.6) + 2(0.4) = 1.4 • CPIbeq = 1(0.75) + 2(0.25) = 1.25 • Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) = 1.15
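The slide's arithmetic, spelled out (all numbers are from the slide):

```python
# Loads and branches cost 1 cycle normally, 2 when stalling/flushing.
cpi_lw  = 1 * 0.60 + 2 * 0.40   # 40% of loads stall -> 1.4
cpi_beq = 1 * 0.75 + 2 * 0.25   # 25% of branches mispredict -> 1.25

avg_cpi = (0.25 * cpi_lw        # loads
         + 0.10 * 1             # stores
         + 0.11 * cpi_beq       # branches
         + 0.02 * 2             # jumps always flush the next instruction
         + 0.52 * 1)            # R-type
print(round(avg_cpi, 4))        # 1.1475, i.e. ~1.15
```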

  44. Pipelined Performance • Pipelined processor critical path (the slowest of the five stages):
      Tc = max {
          tpcq + tmem + tsetup                                (Fetch)
          2(tRFread + tmux + teq + tAND + tmux + tsetup)      (Decode)
          tpcq + tmux + tmux + tALU + tsetup                  (Execute)
          tpcq + tmemwrite + tsetup                           (Memory)
          2(tpcq + tmux + tRFwrite)                           (Writeback)
      }

  45. Pipelined Performance Example
      Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup)
         = 2(150 + 25 + 40 + 15 + 25 + 20) ps = 550 ps

  46. Pipelined Performance Example • For a program with 100 billion instructions executing on a pipelined MIPS processor, • CPI = 1.15 • Tc = 550 ps
      Execution Time = (# instructions) × CPI × Tc = (100 × 10⁹)(1.15)(550 × 10⁻¹²) = 63 seconds
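And the execution-time product from the slide:

```python
instructions = 100e9   # 100 billion instructions
cpi = 1.15             # average CPI from the previous slide
tc = 550e-12           # cycle time: 550 ps in seconds

exec_time = instructions * cpi * tc
print(round(exec_time, 2))   # 63.25 s; the slide rounds to 63 seconds
```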

  47. Summary • Pipelining attempts to use hardware more efficiently • Throughput increases at the cost of latency • Hazards ensue • Modern processors are pipelined

  48. Next Time • I/O • Joysticks • Keyboard (and mouse?)
