Lecture 20: Instruction Level Parallelism

Presentation Transcript


  1. Lecture 20: Instruction Level Parallelism. Computer Engineering 585, Fall 2001

  2. Tomasulo Example Cycle 0

  3. Tomasulo Bookkeeping

     Inst. status: Issue
     Wait until: Station or buffer empty
     Action or bookkeeping:
       if (Register[S1].Qi ≠ 0) {RS[r].Qj ← Register[S1].Qi}
       else {RS[r].Vj ← S1; RS[r].Qj ← 0};
       if (Register[S2].Qi ≠ 0) {RS[r].Qk ← Register[S2].Qi}
       else {RS[r].Vk ← S2; RS[r].Qk ← 0};
       RS[r].Busy ← yes; Register[D].Qi ← r;

     Inst. status: Execute
     Wait until: (RS[r].Qj = 0) and (RS[r].Qk = 0)
     Action or bookkeeping: None; operands are in Vj and Vk

     Inst. status: Write result
     Wait until: Execution completed at r and CDB available
     Action or bookkeeping:
       ∀x (if (Register[x].Qi = r) {Fx ← result; Register[x].Qi ← 0});
       ∀x (if (RS[x].Qj = r) {RS[x].Vj ← result; RS[x].Qj ← 0});
       ∀x (if (RS[x].Qk = r) {RS[x].Vk ← result; RS[x].Qk ← 0});
       ∀x (if (Store[x].Qi = r) {Store[x].V ← result; Store[x].Qi ← 0});
       RS[r].Busy ← No
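
The bookkeeping rules on this slide can be sketched in Python. This is a minimal illustration, not the 360/91 hardware: the class and method names (`Station`, `Tomasulo`, `issue`, `ready`, `write_result`) are assumptions made for this example, store buffers are omitted, and station tags start at 1 so that tag 0 can mean "no pending producer", as in the slide's `Qj = 0` test.

```python
from dataclasses import dataclass


@dataclass
class Station:
    """One reservation station RS[r] (hypothetical layout for this sketch)."""
    busy: bool = False
    op: str = ""
    vj: float = 0.0   # value of first operand, valid when qj == 0
    vk: float = 0.0   # value of second operand, valid when qk == 0
    qj: int = 0       # tag of the station producing the first operand, 0 = none
    qk: int = 0       # tag of the station producing the second operand, 0 = none


class Tomasulo:
    def __init__(self, num_stations, regs):
        # Tags run from 1..num_stations; 0 is reserved for "value is ready".
        self.rs = {r: Station() for r in range(1, num_stations + 1)}
        self.value = dict(regs)               # register contents (Fx)
        self.qi = {name: 0 for name in regs}  # Register[x].Qi

    def issue(self, op, d, s1, s2):
        """Issue step. Returns the station tag, or None if no station is free."""
        free = [r for r, st in self.rs.items() if not st.busy]
        if not free:
            return None  # structural hazard: the instruction does not issue
        r = free[0]
        st = self.rs[r]
        if self.qi[s1] != 0:
            st.qj = self.qi[s1]              # wait for the producing station
        else:
            st.vj, st.qj = self.value[s1], 0
        if self.qi[s2] != 0:
            st.qk = self.qi[s2]
        else:
            st.vk, st.qk = self.value[s2], 0
        st.busy, st.op = True, op
        self.qi[d] = r                       # destination now renamed to tag r
        return r

    def ready(self, r):
        """Execute condition: both operands are present in Vj and Vk."""
        st = self.rs[r]
        return st.busy and st.qj == 0 and st.qk == 0

    def write_result(self, r, result):
        """Write-result step: broadcast (tag, result) on the common data bus."""
        for x in self.qi:
            if self.qi[x] == r:
                self.value[x] = result
                self.qi[x] = 0
        for st in self.rs.values():
            if st.qj == r:
                st.vj, st.qj = result, 0
            if st.qk == r:
                st.vk, st.qk = result, 0
        self.rs[r].busy = False
```

For example, issuing `ADDD F6, F0, F2` followed by `MULTD F0, F6, F4` leaves the multiply waiting on the add's tag in `qj`; calling `write_result` for the add fills in the multiply's `vj` and makes it ready.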

  4. Tomasulo Example Cycle 1

  5. Tomasulo Example Cycle 2 Note: Unlike 6600, can have multiple loads outstanding

  6. Tomasulo Example Cycle 3 • Note: register names are removed (“renamed”) in reservation stations; MULT issued vs. scoreboard • Load1 completing; what is waiting for Load1?

  7. Tomasulo Example Cycle 4 • Load2 completing; what is waiting for it?

  8. Tomasulo Example Cycle 5

  9. Tomasulo Example Cycle 6 • Issue ADDD here vs. scoreboard?

  10. Tomasulo Example Cycle 7 • Add1 completing; what is waiting for it?

  11. Tomasulo Example Cycle 8

  12. Tomasulo Example Cycle 9

  13. Tomasulo Example Cycle 10 • Add2 completing; what is waiting for it?

  14. Tomasulo Example Cycle 11 • Write result of ADDD here vs. scoreboard?

  15. Tomasulo Example Cycle 12 • Note: all quick instructions complete already

  16. Tomasulo Example Cycle 13

  17. Tomasulo Example Cycle 14

  18. Tomasulo Example Cycle 15 • Mult1 completing; what is waiting for it?

  19. Tomasulo Example Cycle 16 • Note: Just waiting for divide

  20. Tomasulo Example Cycle 55

  21. Tomasulo Example Cycle 56 • Mult2 completing; what is waiting for it?

  22. Tomasulo Example Cycle 57 • Again: in-order issue, out-of-order execution and completion

  23. Compare to Scoreboard Cycle 62 • Why does it take longer on the Scoreboard/6600?

  24. Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)

      Tomasulo (IBM 360/91)             Scoreboard (CDC 6600)
      --------------------------------  --------------------------------
      Pipelined functional units        Multiple functional units
      (6 load, 3 store, 3 +, 2 x/÷)     (1 load/store, 1 +, 2 x, 1 ÷)
      Window size: ≤ 14 instructions    ≤ 5 instructions
      No issue on structural hazard     Same
      WAR: renaming avoids              Stall completion
      WAW: renaming avoids              Stall issue
      Broadcast results from FU         Write/read registers
      Control: reservation stations     Central scoreboard
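
The WAR and WAW rows can be illustrated with a small sketch. It assumes an idealised renamer that gives every instruction a fresh tag (`RS1`, `RS2`, ...), whereas real reservation stations are a finite pool that gets reused; the helper names `hazards` and `rename` and the four-instruction program are made up for this example. The hazard check is purely name-based: it compares register names pairwise and ignores intervening writes.

```python
def hazards(program):
    """Return name-based hazards as a set of (kind, earlier_index, later_index).

    Each instruction is a tuple (opcode, destination, (source, source)).
    """
    found = set()
    for j, (_, dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            _, dest_i, srcs_i = program[i]
            if dest_i in srcs_j:
                found.add(("RAW", i, j))   # later reads what earlier writes
            if dest_j in srcs_i:
                found.add(("WAR", i, j))   # later writes what earlier reads
            if dest_j == dest_i:
                found.add(("WAW", i, j))   # both write the same name
    return found


def rename(program):
    """Replace each destination with a fresh tag and point later readers at it."""
    latest = {}   # architectural register -> tag of its most recent producer
    renamed = []
    for n, (op, dest, srcs) in enumerate(program, start=1):
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        tag = f"RS{n}"            # hypothetical: one fresh tag per instruction
        latest[dest] = tag
        renamed.append((op, tag, new_srcs))
    return renamed


program = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F6", ("F0", "F8")),
    ("SUBD", "F8", ("F10", "F14")),   # WAR on F8 against ADDD
    ("MULD", "F6", ("F10", "F8")),    # WAW on F6 against ADDD
]

before = hazards(program)
after = hazards(rename(program))
```

With this input, `before` contains one WAR and one WAW hazard in addition to two RAW hazards, while `after` contains only the two RAW hazards: the true data dependences survive renaming and the name conflicts do not.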

  25. Tomasulo Drawbacks
      • Complexity:
        • Delays of 360/91, MIPS 10000, IBM 620?
      • Many associative stores (CDB) at high speed.
      • Performance limited by Common Data Bus:
        • Multiple CDBs => more FU logic for parallel associative stores.

  26. Tomasulo Loop Example

      Loop: LD     F0  0   R1
            MULTD  F4  F0  F2
            SD     F4  0   R1
            SUBI   R1  R1  #8
            BNEZ   R1  Loop

      • Assume multiply takes 4 clock cycles.
      • Assume the first load takes 8 clocks (cache miss?) and the second load takes 4 clocks (hit).
      • For clarity, the example shows clocks for SUBI and BNEZ.
      • In reality, the integer instructions would run ahead.
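
As a reading aid, here is a sketch of what the loop computes, written as sequential Python. It assumes memory can be modelled as a dict keyed by byte address and that `LD`/`SD` access the address in R1 with offset 0; the function name `scale_in_place` is made up for this example. It says nothing about timing, only about the values produced.

```python
def scale_in_place(mem, r1, f2):
    """Sequential meaning of the loop: multiply each 8-byte element by F2.

    mem maps byte addresses to values, r1 is the starting address,
    and f2 is the scale factor held in register F2.
    """
    while True:
        f0 = mem[r1]        # LD    F0, 0(R1)
        f4 = f0 * f2        # MULTD F4, F0, F2
        mem[r1] = f4        # SD    F4, 0(R1)
        r1 -= 8             # SUBI  R1, R1, #8
        if r1 == 0:         # BNEZ  R1, Loop (falls through when R1 is 0)
            break
    return mem
```

Each iteration reuses the names F0 and F4, which is why the walkthrough pays attention to what the reservation stations hold instead of what the registers hold.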

  27. Loop Example Cycle 0

  28. Loop Example Cycle 1

  29. Loop Example Cycle 2

  30. Loop Example Cycle 3 • Note: MULT1 has no register names in RS

  31. Loop Example Cycle 4

  32. Loop Example Cycle 5

  33. Loop Example Cycle 6 • Note: F0 never sees Load1 result
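
The note on this slide can be checked with a small sketch. It models only the register status entries and the multiply stations for two overlapped iterations; the function name `run_two_iterations`, the dict layout, and the tag strings are assumptions for this example, and the store buffers and integer instructions are left out.

```python
def run_two_iterations(load1_value, load2_value, f2=2.0):
    """Issue LD/MULTD for two iterations, then broadcast both load results."""
    qi = {"F0": None, "F2": None, "F4": None}   # pending producer tag per register
    regs = {"F0": 0.0, "F2": f2, "F4": 0.0}
    mult = {}                                   # multiply stations, keyed by tag
    log = []                                    # who captured each broadcast

    def issue_load(tag):
        qi["F0"] = tag                          # LD F0: F0 now waits on this load

    def issue_mult(tag):
        # MULTD F4, F0, F2: capture F0's producer tag if a write is pending.
        pending = qi["F0"]
        mult[tag] = {
            "qj": pending,
            "vj": None if pending else regs["F0"],
            "vk": regs["F2"],
        }
        qi["F4"] = tag

    def broadcast(tag, value):
        for reg, pending in qi.items():
            if pending == tag:
                regs[reg], qi[reg] = value, None
                log.append(f"{tag} -> {reg}")
        for name, st in mult.items():
            if st["qj"] == tag:
                st["vj"], st["qj"] = value, None
                log.append(f"{tag} -> {name}")

    issue_load("Load1")
    issue_mult("Mult1")
    issue_load("Load2")     # second iteration overwrites F0's pending tag
    issue_mult("Mult2")
    broadcast("Load1", load1_value)
    broadcast("Load2", load2_value)
    return regs, mult, log
```

In this sketch the Load1 result is captured by Mult1 only. By the time it is broadcast, F0 is already waiting on Load2, so the register file is written once, by Load2.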

  34. Loop Example Cycle 7 • Note: MULT2 has no register names in RS

  35. Loop Example Cycle 8

  36. Loop Example Cycle 9 • Load1 completing; what is waiting for it?

  37. Loop Example Cycle 10 • Load2 completing; what is waiting for it?

  38. Loop Example Cycle 11

  39. Loop Example Cycle 12

  40. Loop Example Cycle 13

  41. Loop Example Cycle 14 • Mult1 completing; what is waiting for it?

  42. Loop Example Cycle 15 • Mult2 completing; what is waiting for it?

  43. Loop Example Cycle 16

  44. Loop Example Cycle 17

  45. Loop Example Cycle 18

  46. Loop Example Cycle 19

  47. Loop Example Cycle 20

  48. Loop Example Cycle 21

  49. Tomasulo Bookkeeping (same table as slide 3)

  50. Tomasulo Summary
      • Reservation stations: renaming to a larger set of registers + buffering of source operands
        • Prevents registers from becoming a bottleneck.
        • Avoids the WAR and WAW hazards of the scoreboard.
        • Allows loop unrolling in HW.
      • Not limited to basic blocks (integer units get ahead, beyond branches).
      • Helps with cache misses as well.
      • Lasting contributions:
        • Dynamic scheduling
        • Register renaming
        • Load/store disambiguation
      • 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
