
Instruction Level Parallelism





  1. Instruction Level Parallelism • Scalar processors: the model so far • SuperScalar: multiple execution units in parallel • VLIW: multiple instructions read in parallel

  2. Scalar Processors • T = Nq * CPI * Ct • The time to perform a task • Nq = number of instructions, CPI = cycles per instruction, Ct = cycle time • Pipeline • CPI = 1 • Ct determined by the critical path • But: • Floating-point operations are slow in software • Even in hardware (FPU) they take several cycles. WHY NOT USE SEVERAL FLOATING-POINT UNITS?
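As a quick sanity check of the formula, a sketch in Python; the instruction count and cycle time below are illustrative assumptions, not values from the lecture:

```python
# Execution time model from the slide: T = Nq * CPI * Ct
# Nq = number of instructions, CPI = cycles per instruction,
# Ct = cycle time. The numbers below are made-up examples.

def exec_time(nq, cpi, ct):
    """Return total execution time (same time unit as ct)."""
    return nq * cpi * ct

# A pipelined scalar machine: CPI = 1, 500 MHz clock (Ct = 2 ns)
t_scalar = exec_time(nq=1_000_000, cpi=1.0, ct=2e-9)
print(t_scalar)  # about 0.002 s
```

The same formula also shows why a slow FPU hurts: raising the CPI of even a fraction of the instructions raises T directly.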

  3. SuperScalar Processors [Figure: pipeline with IF and DE feeding n parallel functional units (PFU 1 ... PFU n) alongside the ALU, then DM and WB. Issue and completion take 1 cycle; each functional unit may take several (m) cycles to finish.]

  4. Instruction vs. Machine Parallelism • Instruction Parallelism • Average number of instructions that can be executed in parallel • Depends on: • "true dependencies" • Branches in relation to other instructions • Machine Parallelism • The ability of the hardware to utilize instruction parallelism • Depends on: • Number of instructions that can be fetched and executed each cycle • Instruction memory bandwidth and instruction buffer (window) • Available resources • The ability to spot instruction parallelism

  5. Example 1
  1) add  $t0 $t1 $t2
  2) addi $t0 $t0 1   (dependent on 1)
  3) sub  $t3 $t1 $t2 (independent of 1 and 2)
  4) subi $t3 $t3 1   (dependent on 3)
  With lookahead, or "prefetch", the independent instructions are found, so 1) and 3) can be concurrently executed, followed by 2) and 4).
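The lookahead step can be sketched in Python. The tuple encoding (dest, src1, src2) is an assumption made for illustration; immediates are encoded as None:

```python
# Sketch: mark which instructions in a lookahead window are
# independent of every earlier instruction in the window.
# Encoding (dest, src1, src2) is assumed for illustration.

def raw_depends(later, earlier):
    """True if `later` reads a register that `earlier` writes (RAW)."""
    dest, _, _ = earlier
    _, s1, s2 = later
    return dest in (s1, s2)

window = [
    ("$t0", "$t1", "$t2"),  # 1) add  $t0 $t1 $t2
    ("$t0", "$t0", None),   # 2) addi $t0 $t0 1
    ("$t3", "$t1", "$t2"),  # 3) sub  $t3 $t1 $t2
    ("$t3", "$t3", None),   # 4) subi $t3 $t3 1
]

for i, ins in enumerate(window):
    free = not any(raw_depends(ins, window[j]) for j in range(i))
    print(i + 1, "independent" if free else "dependent")
```

Running this marks 1) and 3) independent and 2) and 4) dependent, matching the pairing on the slide.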

  6. Issue & Completion • Out-of-order issue (starts "out of order") • must still respect RAW hazards • introduces WAR hazards (write after read, antidependence) • Out-of-order completion (finishes "out of order") • introduces WAW hazards (write after write, output dependence: a result may be overwritten)
  1) add $t0 $t1 $t2  2) addi $t0 $t0 1  3) sub $t3 $t1 $t2  4) subi $t3 $t3 1
  With 2 parallel execution units and a 4-stage pipeline: issue 1) and 3) first, then 2) and 4); they complete in the same pairs.
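The three hazard classes can be sketched as a small Python check over an ordered pair of instructions; the (dest, source-set) encoding is an assumption for illustration:

```python
# Sketch: classify the hazards between two instructions i -> j,
# where i precedes j in program order.
# Encoding: (destination register, set of source registers).

def hazards(i, j):
    di, si = i
    dj, sj = j
    found = set()
    if di in sj:
        found.add("RAW")  # j reads what i writes (true dependence)
    if dj in si:
        found.add("WAR")  # j writes what i reads (antidependence)
    if di == dj:
        found.add("WAW")  # both write the same register (output dependence)
    return found

# 1) add $t0 $t1 $t2   2) addi $t0 $t0 1
print(hazards(("$t0", {"$t1", "$t2"}), ("$t0", {"$t0"})))  # RAW and WAW
```

RAW constrains issue order; WAR and WAW only appear once issue or completion goes out of order, which is exactly what the slide lists.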

  7. Tomasulo’s Algorithm
  Program: mul $r1 2 3 | mul $r2 $r1 4 | mul $r2 5 6
  Pipeline: IF, DE, three execution stations A, B, C, then DM, WB.
  Initial state: A IDLE, B IDLE, C IDLE; no register is tagged.

  8. Instruction Issue
  mul $r1 2 3 is issued to station A with operands 2 and 3.
  Stations: A BUSY (2, 3), B IDLE, C IDLE
  Register $r1 is tagged A: station A will produce its value.

  9. Instruction Issue
  mul $r2 $r1 4 is issued to station B. $r1 is not ready yet, so B stores the tag A together with the operand 4 and waits.
  Stations: A BUSY (2, 3), B WAIT (A, 4), C IDLE
  Register $r2 is tagged B.

  10. Instruction Issue
  mul $r2 5 6 is issued to station C with operands 5 and 6.
  Stations: A BUSY (2, 3), B WAIT (A, 4), C BUSY (5, 6)
  Register $r2 is retagged C: reg $r2 gets the newer value.

  11. Clock until A and C finish
  A broadcasts its result 6: B captures it and becomes BUSY (6, 4), and $r1 = 6.
  C broadcasts 30: the tag of $r2 matches C, so $r2 = 30.
  Stations: A IDLE, B BUSY (6, 4), C IDLE

  12. Clock until B finishes
  B broadcasts its result 24, but $r2 is no longer tagged B, so the register is NOT CHANGED! ($r2 keeps the value 30.)
  Stations: A IDLE, B IDLE, C IDLE
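The register-tagging idea that makes the last step work (the stale result that leaves $r2 unchanged) can be sketched in Python; the function and structure names are my own, not Tomasulo's original terminology:

```python
# Sketch of Tomasulo-style register tagging. Each register holds a
# value and, while a producing station is in flight, the tag of that
# station. A broadcast updates a register only if the tag still matches.

regs = {"$r1": {"tag": None, "val": 0}, "$r2": {"tag": None, "val": 0}}

def issue(dest, station):
    regs[dest]["tag"] = station          # the newest producer wins

def broadcast(station, result):
    for r in regs.values():
        if r["tag"] == station:
            r["val"], r["tag"] = result, None

issue("$r1", "A")        # mul $r1 2 3   -> station A
issue("$r2", "B")        # mul $r2 $r1 4 -> station B
issue("$r2", "C")        # mul $r2 5 6   -> C overwrites $r2's tag

broadcast("A", 2 * 3)    # $r1 = 6
broadcast("C", 5 * 6)    # $r2 = 30
broadcast("B", 6 * 4)    # tag is gone: $r2 is NOT changed
print(regs["$r1"]["val"], regs["$r2"]["val"])  # 6 30
```

Because $r2 was retagged at issue time, B's late result 24 finds no matching tag and is simply dropped, which is the whole point of slide 12.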

  13. SuperScalar Designs • 3-8 times faster than scalar designs, depending on • Instruction parallelism (the upper bound) • Machine parallelism • Pros • Backward compatible (optimization is done at run time) • Cons • Complex hardware implementation • Not scalable (limited by instruction parallelism)

  14. VLIW Why not let the compiler do the work? • Use a Very Long Instruction Word (VLIW) • consisting of many instructions in parallel • Each time we read one VLIW instruction we actually issue all instructions contained in it
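How a compiler might fill the slots of a VLIW word can be sketched as a naive in-order bundler in Python; the 4-slot width and the (dest, source-set) instruction encoding are assumptions made for illustration:

```python
# Sketch: a greedy compile-time scheduler that packs consecutive
# independent instructions into VLIW bundles of up to 4 slots.
# Encoding: (destination register, set of source registers).

WIDTH = 4

def packs_with(ins, bundle):
    """True if `ins` has no RAW/WAR/WAW conflict with the bundle."""
    dest, srcs = ins
    for d, s in bundle:
        if d in srcs or dest in s or dest == d:
            return False
    return True

def schedule(program):
    bundles = []
    for ins in program:
        if bundles and len(bundles[-1]) < WIDTH and packs_with(ins, bundles[-1]):
            bundles[-1].append(ins)
        else:
            bundles.append([ins])
    return bundles

prog = [
    ("$t0", {"$t1", "$t2"}),  # add  $t0 $t1 $t2
    ("$t0", {"$t0"}),         # addi $t0 $t0 1
    ("$t3", {"$t1", "$t2"}),  # sub  $t3 $t1 $t2
    ("$t3", {"$t3"}),         # subi $t3 $t3 1
]
print([len(b) for b in schedule(prog)])  # [1, 2, 1]
```

This naive packer never reorders, so it needs three bundles for the example; a real VLIW compiler would also reorder independent instructions (pairing 1 with 3 and 2 with 4) to fill the slots better.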

  15. VLIW [Figure: one 128-bit VLIW instruction word feeds four parallel 32-bit pipelines (IF DE EX DM WB each); the path to instruction memory is usually the bottleneck.]

  16. VLIW • Let the compiler do the instruction issuing • Let it take its time: we do this only once, so it can be ADVANCED • What if we change the architecture? • Recompile the code • Could be done the first time you load a program • Only recompile when the architecture has changed • We could also let the compiler know about the • Cache configuration • Number of levels, line size, number of lines, replacement strategy, write-back/write-through etc. Hot Research Area!

  17. VLIW • Pros • High bandwidth to instruction memory • Cheap compared to SuperScalar: not much extra hardware needed • More parallelism • We spot parallelism at a higher level (C, MODULA, JAVA?) • We can use advanced algorithms for optimization • New architectures can be utilized by recompilation • Cons • Software compatibility • It has not “HIT THE MARKET” (yet).

  18. 4-State Branch Prediction
  Example: loop: A ... bne loop (taken 100 times) ... B ... j loop
  Two of the four states predict BRA, the other two predict NO BRA. In the inner loop we always predict BRA from state (1). When the loop exits we fail once and move to state (2). Next time we still predict BRA from (2) and, when the branch is taken, return to (1). Only two failures in a row switch the prediction to NO BRA.
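The 4-state predictor can be sketched as a 2-bit saturating counter in Python; the state numbering and function names are my own:

```python
# Sketch of the 4-state (2-bit saturating) branch predictor:
# states 0 and 1 predict taken, states 2 and 3 predict not taken.
# A single misprediction moves the counter one step, so one loop
# exit does not flip the prediction for the next run of the loop.

def simulate(outcomes, state=0):
    """Return the number of mispredictions over a branch trace."""
    misses = 0
    for taken in outcomes:
        predict_taken = state < 2
        if predict_taken != taken:
            misses += 1
        # move one step toward the actual outcome, saturating at 0/3
        state = max(state - 1, 0) if taken else min(state + 1, 3)
    return misses

# inner loop taken 100 times, exits once, then runs again
trace = [True] * 100 + [False] + [True] * 100 + [False]
print(simulate(trace))  # 2: exactly one miss per loop exit
```

With a single-bit predictor the same trace would miss twice per loop exit (once on the exit and once on re-entry), which is what the extra two states avoid.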

  19. Branch Prediction • The 4 states are stored in 2 bits in the instruction cache, together with the conditional branch instruction • We predict the branch • We prefetch the predicted instructions • We issue these before we know whether the branch is taken! • When the prediction fails we abort the issued instructions

  20. Branch Prediction
  loop: 1) 2) 3) ... bne $r1 loop (predict branch taken) ... 4) 5) 6)
  Instructions 1), 2) and 3) are prefetched and may already be issued when we learn the value of $r1, since $r1 might be waiting for some unit to finish. In case of a prediction failure we have to abort the issued instructions and start fetching 4), 5) and 6).

  21. Multiple Branch Targets
  loop: 1) 2) 3) ... bne $r1 loop ... 4) 5) 6)
  Instructions 1), 2), 3), 4), 5) and 6), i.e. both branch targets, are prefetched and may already be issued when we learn the value of $r1, since $r1 might be waiting for some unit to finish. As soon as we know $r1 we abort the redundant instructions. VERY COMPLEX!!!
