1 / 19

Pipelining I CS429: Computer Organization, Architecture & Programming

Pipelining I CS429: Computer Organization, Architecture & Programming . Haran Boral. Overview. What’s wrong with the sequential (SEQ) Y86? It’s slow! Each piece of hardware is used only a small fraction of time

chipo
Download Presentation

Pipelining I CS429: Computer Organization, Architecture & Programming

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Pipelining I CS429: Computer Organization, Architecture & Programming Haran Boral

  2. Overview • What’s wrong with the sequential (SEQ) Y86? • It’s slow! • Each piece of hardware is used only a small fraction of time • We would like to find a way to get more performance with only a little more hardware • General Principles of Pipelining • Express task as a collection of stages • Move object through stages • Process several objects at any given moment • Creating a Pipelined Y86 Processor • Rearranging SEQ • Inserting pipeline registers • Problems with data and control hazards

  3. Laundry example • Four people, each with a load of clothes to wash, dry,fold and stash away • Washer takes 30 minutes • Dryer takes 30 minutes • “Folder” takes 30 minutes • “Stasher” takes 30 minutesto put clothes into drawers A B C D Slide courtesy of D. Patterson

  4. A B C D Sequential Laundry 2 AM 12 6 PM 1 8 7 11 10 9 • Sequential laundry takes 8 hours for 4 loads • If they learned pipelining, how long would laundry take? 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 T a s k O r d e r Time Slide courtesy of D. Patterson

  5. 30 30 30 30 30 30 30 A B C D Pipelined Laundry: Start ASAP 2 AM 12 6 PM 1 8 7 11 10 9 • Pipelined laundry takes 3.5 hours for 4 loads! Time T a s k O r d e r Slide courtesy of D. Patterson

  6. A B C D Pipelining Lessons 6 PM 7 8 9 • Pipelining doesn’t help latency of single task, it helps throughput of entire workload • Multiple tasks operating simultaneously using different resources • Potential speedup = Number pipe stages • Pipeline rate limited by slowest pipeline stage • Unbalanced lengths of pipe stages reduces speedup • Time to “fill” pipeline and time to “drain” it reduces speedup • Stall for Dependences Time T a s k O r d e r 30 30 30 30 30 30 30 Slide courtesy of D. Patterson

  7. 300 ps 20 ps Combinational logic R e g Delay = 320 ps Throughput = 3.12 GOPS Clock Computational Example • System • Computation requires total of 300 picoseconds • Additional 20 picoseconds to save result in register • Must have clock cycle of at least 320 ps

  8. 100 ps 20 ps 100 ps 20 ps 100 ps 20 ps Comb. logic A R e g Comb. logic B R e g Comb. logic C R e g Delay = 360 ps Throughput = 8.33 GOPS Clock 3-Way Pipelined Version • System • Divide combinational logic into 3 blocks of 100 ps each • Can begin new operation as soon as previous one passes through stage A. • Begin new operation every 120 ps • Overall latency increases • 360 ps from start to finish

  9. OP1 A A A B B B C C C OP2 OP3 OP1 Time OP2 Time OP3 Pipeline Diagrams • Unpipelined • Cannot start new operation until previous one completes • 3-Way Pipelined • Up to 3 operations in process simultaneously

  10. 241 359 300 239 100 ps 20 ps 100 ps 20 ps 100 ps 20 ps Comb. logic A R e g Comb. logic B R e g Comb. logic C R e g A A A B B B C C C Clock OP1 100 ps 100 ps 100 ps 20 ps 20 ps 20 ps 100 ps 100 ps 100 ps 20 ps 20 ps 20 ps 100 ps 100 ps 100 ps 20 ps 20 ps 20 ps OP2 OP3 Clock Comb. logic A R e g Comb. logic B R e g Comb. logic C Comb. logic A Comb. logic A R e g R e g Comb. logic B Comb. logic B R e g R e g Comb. logic C Comb. logic C R e g R e g R e g 0 120 240 360 480 640 Time Clock Clock Clock Operating a Pipeline

  11. 50 ps 20 ps 150 ps 20 ps 100 ps 20 ps Comb. logic A R e g Comb. logic B R e g Comb. logic C R e g Delay = 510 ps Throughput = 5.88 GOPS Clock OP1 A A A B B B C C C OP2 OP3 Time Limitations: Non-uniform Delays • Throughput limited by slowest stage • Other stages sit idle for much of the time • Challenging to partition system into balanced stages

  12. Delay = 420 ps, Throughput = 14.29 GOPS 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps 50 ps 20 ps Comb. logic R e g Comb. logic R e g Comb. logic R e g Comb. logic R e g Comb. logic R e g Comb. logic R e g Clock Limitations: Register Overhead • As try to deepen pipeline, overhead of loading registers becomes more significant • Percentage of clock cycle spent loading register: • 1-stage pipeline: 6.25% • 3-stage pipeline: 16.67% • 6-stage pipeline: 28.57% • High speeds of modern processor designs obtained through very deep pipelining

  13. The Performance Equation • Clock Cycle Time • Improves by factor of almost N for N-deep pipeline • Not quite factor of N due to pipeline overheads • Cycles Per Instruction • In ideal world, CPI would stay the same • An individual instruction takes N cycles • But we have N instructions in flight at a time • So - average CPIpipe = CPIno_pipe * N/N • Thus performance can improve by up to factor of N

  14. R e g Combinational logic Clock OP1 OP2 OP3 Time Data Dependencies • System • Each operation depends on result from preceding one

  15. A A A A B B B B C C C C Comb. logic A R e g Comb. logic B R e g Comb. logic C R e g OP1 OP2 OP3 OP4 Time Clock Data Hazards • Result does not feed back around in time for next operation • Pipelining has changed behavior of system

  16. Data Dependencies in Processors • Result from one instruction used as operand for another • Read-after-write (RAW) dependency • Very common in actual programs • Must make sure our pipeline handles these properly • Get correct results • Minimize performance impact 1irmovl $50, %eax 2addl %eax, %ebx 3mrmovl 100( %ebx ), %edx

  17. SEQ Hardware • Stages occur in sequence • One operation in process at a time • One stage for each logical pipeline operation • Fetch (get next instruction from memory) • Decode (figure out what instruction does and get values from regfile) • Execute (compute) • Memory (access data memory if necessary) • Write back (write any instruction result to regfile)

  18. SEQ+ Hardware • Still sequential implementation • Reorder PC stage to put at beginning • PC Stage • Task is to select PC for current instruction • Based on results computed by previous instruction • Processor State • PC is no longer stored in register • But, can determine PC based on other stored information

  19. W _ i c o d e , W _ v a l M W _ v a l E , W _ v a l M , W _ d s t E , W _ d s t M W v a l M D a t a D a t a M _ i c o d e , M e m o r y m e m o r y m e m o r y M _ B c h , M _ v a l A A d d r , D a t a M B c h v a l E C C C C E x e c u t e A L U A L U a l u A , a l u B E v a l A , v a l B d _ s r c A , D e c o d e A A B B d _ s r c B M M R R e e g g i i s s t t e e r r R R e e g g i i s s t t e e r r f f i i l l e e f f i i l l e e E E W r i t e b a c k D v a l P i c o d e , i f u n , r A , r B , v a l C v a l P I n s t r u c t i o n P C F e t c h I n s t r u c t i o n P C m e m o r y i n c r e m e n t m e m o r y i n c r e m e n t p r e d P C P C f _ P C F Adding Pipeline Registers

More Related