html5-img
1 / 30

Pipelined Implementation CSCi 2021: Computer Architecture and Organization

Pipelined Implementation CSCi 2021: Computer Architecture and Organization. Chapter 4.4, 4.5 (through 4.5.7). Today. Pipelining Performance Hazards Deviate a bit from the slides (defer branch prediction). Last Time. Sequential processor implementation Instruction execution. Idea

chaela
Download Presentation

Pipelined Implementation CSCi 2021: Computer Architecture and Organization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Pipelined ImplementationCSCi 2021: Computer Architecture and Organization Chapter 4.4, 4.5 (through 4.5.7)

  2. Today • Pipelining • Performance • Hazards • Deviate a bit from the slides (defer branch prediction)

  3. Last Time • Sequential processor implementation • Instruction execution

  4. Idea Divide process into independent stages Move objects through stages in sequence At any given times, multiple objects being processed Parallel Sequential Pipelined Real-World Pipelines: Car Washes

  5. 300 ps 20 ps Combinational logic R e g Delay = 320 ps Throughput = 3.12 GOPS gigainstructions/ops per sec Clock Computational Example • System • Computation requires total of 300 picoseconds • Additional 20 picoseconds to save result in register

  6. 100 ps 20 ps 100 ps 20 ps 100 ps 20 ps Comb. logic A R e g Comb. logic B R e g Comb. logic C R e g Delay = 360 ps Throughput = 8.33 GOPS Clock 3-Way Pipelined Version • System • Divide combinational logic into 3 blocks of 100 ps each • Can begin new operation as soon as previous one passes through stage A • Begin new operation every 120 ps • Overall latency increases • 360 ps from start to finish

  7. 50 ps 20 ps 150 ps 20 ps 100 ps 20 ps Comb. logic A R e g Comb. logic B R e g Comb. logic C R e g Delay = 510 ps Throughput = 5.88 GOPS Clock OP1 A A A B B B C C C OP2 OP3 Time Limitations: Nonuniform Delays • Throughput limited by slowest stage: clock ~ this value (worst case) • Other stages sit idle for much of the time • Challenging to partition system into balanced stages

  8. Delay = 420 ps, Throughput = 14.29 GOPS 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps Comb. logic R e g Comb. logic R e g Comb. logic R e g Comb. logic R e g Comb. logic R e g Comb. logic R e g Clock Limitations: Register Overhead • As try to deepen pipeline, overhead of loading registers becomes more significant • Percentage of clock cycle spent loading register: • 3-stage pipeline: 16.67% (60/360) • 6-stage pipeline: 28.57% (?)

  9. Example

  10. R e g Combinational logic Clock OP1 OP2 OP3 Time Data Dependencies • System • Each operation may depend on result from preceding one • Not a car wash!

  11. Data Dependencies in Processors • Result from one instruction used as operand for another • Read-after-write (RAW) dependency • Very common in actual programs • Must make sure our pipeline handles these properly • Get correct results • Minimize performance impact 1irmovl $50, %eax 2addl %eax , %ebx 3mrmovl 100( %ebx ), %edx

  12. A A A A B B B B C C C C Comb. logic A R e g Comb. logic B R e g Comb. logic C R e g OP1 OP2 OP3 OP4 Time Clock Data Hazards • Result does not feed back around in time for next operation • Pipelining has changed behavior of system

  13. Control Dependency loop: sub1 %edx, %ebx jnetarg irmovl $10, %edx jmp loop targ: Problem? take wrong path => control hazard

  14. Pipeline Stages • Fetch • Select current PC • Read instruction • Compute incremented PC • Decode • Read program registers • Execute • Operate ALU • Memory • Read or write data memory • Write Back • Update register file

  15. PIPE- Hardware • Pipeline registers hold intermediate values from instruction execution • Forward (Upward) Paths • Values passed from one stage to next

  16. 1 2 3 4 5 6 7 8 9 irmovl $1,%eax #I1 F D E M W irmovl $2,%ecx #I2 F D E M W F E W M D irmovl $3,%edx #I3 I4 I3 I2 I1 I5 irmovl $4,%ebx #I4 F D E M W halt #I5 F D E M W F D E M W Cycle 5 Pipeline Example

  17. 1 2 3 4 5 6 7 8 F D E M W 0x000: irmovl $10,% edx F D E M W 0x006: irmovl $3,% eax F D E M W 0x00c: addl % edx ,% eax F D E M W 0x00e: halt Cycle 4 M M_ valE = 10 M_ dstE = % edx E f e_ valE 0 + 3 = 3 E_ dstE = % eax D D Error f f valA valA R[ R[ ] ] = = 0 0 % % edx edx f f valB valB R[ R[ ] ] = = 0 0 % % eax eax Data Dependencies

  18. 1 2 3 4 5 6 7 8 9 F D E M W 0x000: irmovl $10,% edx F D E M W 0x006: irmovl $3,% eax F F D D E E M M W W 0x00c: nop F F D D E E M M W W 0x00d: addl % edx ,% eax F F D D E E M M W W 0x00f: halt Cycle 5 W W f f R[ R[ ] ] 10 10 % % edx edx M M_ valE = 3 M_ dstE = % eax • • • D D Error f f valA valA R[ R[ ] ] = = 0 0 % % edx edx f f valB valB R[ R[ ] ] = = 0 0 % % eax eax Data Dependencies: 1 Nop

  19. 1 2 3 4 5 6 7 8 9 10 F F D D E E M M W W 0x000: irmovl $10,% edx F F D D E E M M W W 0x006: irmovl $3,% eax F F D D E E M M W W 0x00c: nop F F D D E E M M W W 0x00d: nop F F D D E E M M W W 0x00e: addl % edx ,% eax F F D D E E M M W W 0x010: halt Cycle 6 W W W f f f R[ R[ R[ ] ] ] 3 3 3 % % % eax eax eax • • • • • • D D D f f f valA valA valA R[ R[ R[ ] ] ] = = = 10 10 10 Error % % % edx edx edx f f f valB valB valB R[ R[ R[ ] ] ] = = = 0 0 0 % % % eax eax eax Data Dependencies: 2 Nop’s

  20. 1 2 3 4 5 6 7 8 9 10 11 F F D D E E M M W W 0x000: irmovl $10,% edx F F D D E E M M W W 0x006: irmovl $3,% eax F F D D E E M M W W 0x00c: nop F F D D E E M M W W 0x00d: nop F F D D E E M M W W 0x00e: nop F F D D E E M M W W 0x00f: addl % edx ,% eax F F D D E E M M W W 0x011: halt Cycle 6 W W f f R[ R[ ] ] 3 3 % % eax eax Cycle 7 D D f f valA valA R[ R[ ] ] = = 10 10 % % edx edx f f valB valB R[ R[ ] ] = = 3 3 % % eax eax Data Dependencies: 3 Nop’s

  21. E M W F F D E M W Stalling for Data Dependencies 1 2 3 4 5 6 7 8 9 10 11 • If instruction follows too closely after one that writes register, slow it down • Hold instruction in decode • Dynamically inject nop(“bubble”)into execute stage 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax F D E M W 0x00c: nop F D E M W 0x00d: nop F D E M W bubble 0x00e: addl %edx,%eax D D E M W 0x010: halt F F D E M W

  22. E M W F F D E M W Cycle 6 W W_dstE = %eax W_valE = 3 • • • D srcA = %edx srcB = %eax Detecting Stall Condition 1 2 3 4 5 6 7 8 9 10 11 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax F D E M W 0x00c: nop F D E M W 0x00d: nop F D E M W bubble 0x00e: addl %edx,%eax D D E M W 0x010: halt F F D E M W

  23. Data Forwarding • Simple Pipeline • Register isn’t written until completion of write-back stage • Source operands read from register file in decode stage • Needs to be in register file at start of stage • Trick • Pass value directly from generating instruction to decode stage • Why is this preferred over bubbles?

  24. E M W F F D E M W Cycle 6 W W_dstE = %eax W_valE = 3 • • • D srcA = %edx srcB = %eax Forwarding 1 2 3 4 5 6 7 8 9 10 11 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax F D E M W 0x00c: nop F D E M W 0x00d: nop F D E M W bubble 0x00e: addl %edx,%eax D D E M W 0x010: halt F F D E M W add control logic and lines

  25. Pipeline Summary • Concept • Break instruction execution into 5 stages • Run instructions through in pipelined mode • Limitations • Data dependencies/hazards • One instruction writes register, later one reads it • Control dependency/hazards • Instruction sets PC in way that pipeline did not predict correctly • Mispredictedbranch, return, etc • Fixing the Pipeline • bubbles, forwarding

  26. Next Time • Optimizing Performance • Chapter 5 • Have a great weekend!

  27. Predicting the PC • Start fetch of new instruction after current one has completed fetch stage • Not enough time to reliably determine next instruction • Guess which instruction will follow • Recover if prediction was incorrect

  28. Our Prediction Strategy • Instructions that Don’t Transfer Control • Predict next PC to be valP • Always reliable • Call and Unconditional Jumps • Predict next PC to be valC (destination) • Always reliable • Conditional Jumps • Predict next PC to be valC (destination) • Only correct if branch is taken • Typically right 60% of time • Return Instruction • Don’t try to predict

  29. 1 2 3 4 5 6 7 8 9 F D E M W 0x000: xorl % eax ,% eax F D E M W 0x002: jne t # Not taken … F D E M W 0x011: t: irmovl $3, % edx # Target F D E M W 0x017: irmovl $4, % ecx # Target+1 F F D D E E M M W W 0x007: irmovl $1, % eax # Fall Through Cycle 5 M M_Bch = 0 M_ valA = 0x007 E E f f valE valE 3 3 dstE dstE = = % % edx edx D D valC valC = = 4 4 dstE dstE = = % % ecx ecx F F f f valC valC 1 1 f f rB rB % % eax eax Branch Misprediction Trace …… • Incorrectly executes instructions at branch target • Solution: instruction cancellation or squashing before E stage • generate bubbles

More Related