1 / 13

Pipelined Design

Pipelined Design. תרגול 9. Latency and Throughput. Latency - Total time to perform a single operation from start to end. Throughput – operations / latency ratio or GOPS, giga-operations per second. The clock cycle is always bounded by the slowest operation. Picoseconds - 10^-12

lotte
Download Presentation

Pipelined Design

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Pipelined Design תרגול 9

  2. Latency and Throughput • Latency - Total time to perform a single operation from start to end. • Throughput – operations / latency ratio or GOPS, giga-operations per second. • The clock cycle is always bounded by the slowest operation. • Picoseconds - 10^-12 • Nanoseconds - 10^-9

  3. Latency and Throughput 80ps 30ps 60ps 50ps 70ps 10ps 20ps A B C D E F R E G • How to maximize throughput using an additional register? • Put it between C and D • How to maximize throughput using 2 additional registers? • AB, CD, EF • How to maximize throughput using 3 additional registers? • A, BC, D, EF • What is the cycle time, latency and throughput in that case? • 110ps, 440ps, (1/110) * 1000 = 9.09 • How to get max throughput with min stages? • 5 stages, since we cannot go lower than 100ps for a clock cycle

  4. Pipeline registers • Select A, Since only jxx and call need valP at further stages (E and M resp.) • Predicted PC

  5. Control Hazards • PC prediction (Fetch stage) • jxx , ret → ? • call, jmp → ValC • Other → ValP • Branch prediction strategies • Always-taken • Never-taken • Backward taken, forward not taken • Why are forward branches less common ? • How can we deal with stack prediction (ret) ? 5

  6. Data Hazards • Data dependencies in Processors • Result from one instruction is being used as an operand for another (nop example) • Stalling / forwarding / load-use hazard 6

  7. Data Dependencies: 2 Nop’s 1 2 3 4 5 6 7 8 9 10 # demo-h2.ys F F D D E E M M W W 0x000: irmovl $10,%edx F F D D E E M M W W 0x006: irmovl $3,%eax F F D D E E M M W W 0x00c: nop F F D D E E M M W W 0x00d: nop F F D D E E M M W W 0x00e: addl %edx,%eax F F D D E E M M W W 0x010: halt Cycle 6 W W W  ← R[ ]  R[ R[ 3 ]  ]  3 3 %eax %eax %eax • • • • • • D D D ←   valA valA valA R[ R[ R[ ] = 10 ] = 10 ] = 10 Error %edx %edx %edx   valB valB valB R[ R[ R[ ] = 0 ] = 0 ] = 0 %eax %eax %eax ←

  8. Stalling solution 1 2 3 4 5 6 7 8 9 10 # demo-h2.ys F D E M W 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W 0x00c: nop F D E M W 0x00d: nop E M W bubble F D D E M W 0x00e: addl %edx,%eax F F D E M W 0x010: halt If instruction follows too closely after one that writes register,  slow it down Hold instruction in decode  Dynamically inject nop into execute stage 

  9. Data Forwarding Solution 1 2 3 4 5 6 7 8 9 # demo-h2.ys F F D D E E M M W W 0x000: irmovl $10,%edx F F D D E E M M W W 0x006: irmovl $3,%eax F F D D E E M M W W 0x00c: nop F F D D E E M M W W 0x00d: nop F F D D E E M M W W 0x00e: addl%edx,%eax F F D D E E M M W W 0x010: halt Cycle 6 in write- irmovl  W back stage ← R[ ]  3 W_dstE=  %eax %eax Destination value in W_valE = 3  W pipeline register • • • Forward as valB for  D decode stage ← valA R[ ] = 10 srcA=  %edx %edx ← srcB=  %eax valB W_valE= 3

  10. Limitation of Forwarding # demo-luh.ys 1 2 3 4 5 6 7 8 9 10 11 F D E M W 0x000: irmovl $128,%edx F D E M W 0x006: irmovl $3,%ecx F D E M W 0x00c: rmmovl %ecx, 0(%edx) F D E M W 0x012: irmovl $10,%ebx F F D D E E M M W W 0x018: mrmovl 0(%edx),%eax # Load %eax F D E M W 0x01e: addl%ebx,%eax # Use %eax F D E M W 0x020: halt Load-use dependency Cycle 7 Cycle 8 Value needed by end of  M M decode stage in cycle 7 M_dstE=  M_dstM=  %ebx %eax ← M_valE= 10 m_valM M[128] = 3 Value read from memory in  memory stage of cycle 8 • • • D D Error ← valA valA M_valE= 10 M_valE= 10  ← valB valB R[ R[ ] = 0 ] = 0 %eax %eax 

  11. Avoiding Load/Use Hazard # demo-luh.ys 1 2 3 4 5 6 7 8 9 10 11 F F D D E E M M W W 0x000: irmovl $128,%edx F F D D E E M M W W 0x006: irmovl $3,%ecx F F D D E E M M W W 0x00c: rmmovl %ecx, 0(%edx) F F D D E E M M W W 0x012: irmovl $10,%ebx F F F D D D E E E M M M W W W 0x018: mrmovl 0(%edx),%eax# Load %eax E M W bubble F D D E M W 0x01e:addl%ebx,%eax# Use %eax F F D E M W 0x020: halt Stall using instruction for  Cycle 8 one cycle W W Can then pick up loaded  W_dstE=  W_dstE=  %ebx %ebx value by forwarding from W_valE= 10 W_valE= 10 memory stage M M M_dstM=  M_dstM=  %eax %eax ← m_valM0M[128] = 3 m_valM M[128] = 3 • D D • valA valA W_valE= 10 W_valE= 10 ←  valB valB m_valM= 3 m_valM= 3 ← 

  12. Control Logic Processing “ret” Must stall until instruction reaches write back Load/use hazard Must stall between read memory and use Mis-predicted branch removing instructions from the pipe 12

  13. Implementing the forwarding logic • Note: 5 forwarding sources

More Related