
Dynamic Scheduling to Minimize Stalls: Tomasulo Algorithm. To: Dr. TenEyck. Submitted by Yanyu Liu


Presentation Transcript


  1. Dynamic Scheduling to Minimize Stalls: Tomasulo Algorithm. To: Dr. TenEyck. Submitted by Yanyu Liu. Teammate: Monaco

  2. Pipelining vs. Out-of-Order Execution • Pipelining • A simple pipeline executes instructions in order: each instruction enters the execute stage in program order. For example, consider the following code sequence: • I1: DIVD R1, R2, R3 • I2: MULT R4, R1, R1 • I3: ADDD R1, R8, R9 • Instruction I1 blocks the execute stage because the division functional unit has a long latency. Instruction I2 has to stall before it begins execution, both because the execute stage is blocked by I1 and because it requires the result of I1 (a data dependence).

  3. Out-of-Order Execution Data dependencies and the different latencies of the functional units can cause additional delays that reduce performance. To eliminate these delays, we use out-of-order execution. Consider the execution of I1 to I3 on an out-of-order CPU. Instruction I3 can now enter the execution stage without waiting for the preceding instructions, since I3 does not depend on any of their results. It can even finish before I1, which causes a write-after-write (WAW) hazard on R1. Furthermore, I2 tries to read R1 before I1 writes it, so there is a read-after-write (RAW) hazard. Since I3 may write R1 before I2 reads it, there is also a write-after-read (WAR) hazard.
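The three hazards named on this slide can be checked mechanically. The following is a minimal sketch, not part of the original presentation: the `Instr` tuple and the `hazards` and `all_hazards` helpers are hypothetical names, and the pairwise check ignores intervening writes, which is sufficient for a three-instruction sequence.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str
    dest: str      # register written
    srcs: tuple    # registers read


def hazards(earlier: Instr, later: Instr) -> list:
    """Classify the hazards between two instructions in program order."""
    found = []
    if earlier.dest in later.srcs:
        found.append(("RAW", earlier.dest))   # later reads what earlier writes
    if later.dest == earlier.dest:
        found.append(("WAW", earlier.dest))   # both write the same register
    if later.dest in earlier.srcs:
        found.append(("WAR", later.dest))     # later overwrites what earlier reads
    return found


def all_hazards(prog: list) -> dict:
    """Map each (earlier, later) pair that has a hazard to its hazard list."""
    result = {}
    for i, earlier in enumerate(prog):
        for later in prog[i + 1:]:
            found = hazards(earlier, later)
            if found:
                result[(earlier.name, later.name)] = found
    return result


# The code sequence from slide 2.
program = [
    Instr("I1", "R1", ("R2", "R3")),   # DIVD R1, R2, R3
    Instr("I2", "R4", ("R1", "R1")),   # MULT R4, R1, R1
    Instr("I3", "R1", ("R8", "R9")),   # ADDD R1, R8, R9
]

for pair, found in all_hazards(program).items():
    print(pair, found)
# ('I1', 'I2') [('RAW', 'R1')]
# ('I1', 'I3') [('WAW', 'R1')]
# ('I2', 'I3') [('WAR', 'R1')]
```

Only the RAW hazard reflects a real flow of data; the WAW and WAR hazards arise purely because R1 is reused as a name.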

  4. Static Scheduling vs. Dynamic Scheduling • Compiler-based static scheduling separates dependent instructions in the scheduled code to minimize actual hazards and stalls. Dynamic scheduling uses a hardware-based mechanism to rearrange the instruction execution order at runtime to reduce stalls. Two points characterize it: • 1. It handles some cases where dependencies are unknown at compile time. • 2. It cannot remove true data dependencies, but it tries to avoid stalling on them. • There are two dynamic scheduling methods: the Tomasulo algorithm and the scoreboard. Here, we discuss only the Tomasulo algorithm.

  5. Tomasulo's Algorithm • This scheme was invented by Robert Tomasulo and was first used in the IBM 360/91. • It uses register renaming to eliminate output dependencies and anti-dependencies, i.e. WAW and WAR hazards. • Output dependencies and anti-dependencies are only name dependencies; there is no actual data dependence. • Tomasulo's algorithm implements register renaming through reservation stations. • Reservation stations are buffers that fetch and store instruction operands as soon as they are available.
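To see why renaming removes WAW and WAR hazards but leaves RAW dependences intact, here is a small sketch. It is an illustration only: the `rename` function and the tags `T0`, `T1`, `T2` are hypothetical stand-ins for reservation-station names, and the tuple encoding of instructions is an assumption.

```python
def rename(program):
    """Give every destination a fresh tag; sources read the latest tag."""
    latest = {}      # architectural register -> tag of its most recent producer
    renamed = []
    for n, (op, dest, srcs) in enumerate(program):
        # Sources are looked up before the destination is renamed, so an
        # instruction never reads its own result.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        tag = f"T{n}"
        latest[dest] = tag
        renamed.append((op, tag, new_srcs))
    return renamed


# The code sequence from slide 2, written as (op, dest, sources).
program = [
    ("DIVD", "R1", ("R2", "R3")),
    ("MULT", "R4", ("R1", "R1")),
    ("ADDD", "R1", ("R8", "R9")),
]

renamed = rename(program)
for instr in renamed:
    print(instr)
# ('DIVD', 'T0', ('R2', 'R3'))
# ('MULT', 'T1', ('T0', 'T0'))
# ('ADDD', 'T2', ('R8', 'R9'))
```

After renaming, DIVD and ADDD write different names (no WAW), ADDD no longer overwrites the name MULT reads (no WAR), and MULT still waits for `T0`, the true RAW dependence.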

  6. Reservation Stations • Each reservation station holds exactly one instruction and its operands and has the following components: • Op: operation to perform in the unit (e.g., + or –). • Vj, Vk: values of the source operands. Store buffers have a single V field holding the result to be stored. • Qj, Qk: reservation stations that will produce the source operands (the values to be written); an empty Q field means the operand is already available in Vj or Vk. • Busy: indicates that the reservation station or functional unit (FU) is busy. • Register result status: indicates which functional unit will write each register, if one exists. • The load and store buffers each require a busy field.
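The fields listed on this slide map naturally onto a record type. The sketch below is one possible encoding, not the presentation's own: the class name, the use of `None` for an empty field, and the `ready` helper are assumptions. The sample values reproduce the Mult1 entry from the Cycle 3 table later in the transcript.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None    # operation to perform
    vj: Optional[str] = None    # value of first source operand, if available
    vk: Optional[str] = None    # value of second source operand, if available
    qj: Optional[str] = None    # station producing the first operand, if pending
    qk: Optional[str] = None    # station producing the second operand, if pending

    def ready(self) -> bool:
        """An instruction may execute once no operand is still pending."""
        return self.busy and self.qj is None and self.qk is None


# Mult1 after MULTD F0, F2, F4 issues in cycle 3: F4 is available, F2 is not.
mult1 = ReservationStation("Mult1", busy=True, op="MULTD", vk="R(F4)", qj="Load2")

# Register result status at the same point: register -> producing unit.
register_status = {"F0": "Mult1", "F2": "Load2", "F6": "Load1"}

print(mult1.ready())            # False: still waiting on Load2
mult1.vj, mult1.qj = "M(45+R3)", None   # Load2 broadcasts its result
print(mult1.ready())            # True
```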

  7. Three Steps in Tomasulo's Algorithm • 1. Issue: get the next instruction from the pending instruction queue. • The instruction is issued only if a reservation station (RS) is free (no structural hazard). • The selected RS is marked busy. • Control sends the available instruction operands to the assigned RS; operands that are not yet available are recorded by the name of the RS that will produce them (register renaming). • 2. Execution (EX): operate on the operands. • When both operands are ready, start executing on the assigned FU. • If any operand is not ready, watch the Common Data Bus (CDB) for the needed result. • 3. Write result (WB): finish execution. • Write the result on the Common Data Bus to all waiting units. • Mark the reservation station as available. • The Common Data Bus (CDB) also serves for forwarding.
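The three steps can be sketched as a small functional model. This is a simplified illustration under stated assumptions, not the hardware design: it uses a single pool of stations with hypothetical names `RS1` to `RS3`, it ignores timing and functional-unit types, and the `Tomasulo` class and its method names are invented for this example.

```python
import operator

OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MUL": operator.mul, "DIV": operator.truediv}


class Tomasulo:
    def __init__(self, station_names, registers):
        self.regs = dict(registers)                     # register -> value
        self.status = {}                                # register -> producing station
        self.rs = {name: None for name in station_names}  # None means free

    def issue(self, op, dest, src_j, src_k):
        """Step 1: claim a free station; capture values or producer tags."""
        tag = next((n for n, e in self.rs.items() if e is None), None)
        if tag is None:
            return None                                 # structural hazard: stall
        entry = {"op": op}
        for slot, src in (("j", src_j), ("k", src_k)):
            producer = self.status.get(src)
            entry["Q" + slot] = producer
            entry["V" + slot] = None if producer else self.regs[src]
        self.rs[tag] = entry
        self.status[dest] = tag                         # rename dest to this station
        return tag

    def ready(self, tag):
        e = self.rs[tag]
        return e is not None and e["Qj"] is None and e["Qk"] is None

    def execute(self, tag):
        """Step 2: operate on the captured operand values."""
        e = self.rs[tag]
        return OPS[e["op"]](e["Vj"], e["Vk"])

    def write_result(self, tag, value):
        """Step 3: broadcast (tag, value) on the CDB and free the station."""
        for e in self.rs.values():
            if e is None:
                continue
            for slot in "jk":
                if e["Q" + slot] == tag:
                    e["Q" + slot], e["V" + slot] = None, value
        # Only registers still renamed to this station accept the value.
        for reg in [r for r, t in self.status.items() if t == tag]:
            self.regs[reg] = value
            del self.status[reg]
        self.rs[tag] = None


m = Tomasulo(["RS1", "RS2", "RS3"],
             {"R1": 0.0, "R2": 8.0, "R3": 2.0, "R4": 0.0, "R8": 1.0, "R9": 2.0})
i1 = m.issue("DIV", "R1", "R2", "R3")   # I1
i2 = m.issue("MUL", "R4", "R1", "R1")   # I2 waits on I1's station (RAW)
i3 = m.issue("ADD", "R1", "R8", "R9")   # I3 takes over the name R1
m.write_result(i3, m.execute(i3))       # I3 finishes first
m.write_result(i1, m.execute(i1))       # I1's late result goes only to I2
m.write_result(i2, m.execute(i2))
print(m.regs["R1"], m.regs["R4"])       # 3.0 16.0
```

Even though the ADD completes before the DIV, R1 ends with the ADD's value and the MUL sees the DIV's value, which matches in-order semantics for the sequence on slide 2.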

  8. Example of the Tomasulo Algorithm
      Consider how the Tomasulo approach handles the following code, run on the DLX.

      Functional unit                  # of RSs   EX latency
      Integer                          1          0 cycles
      Floating-point multiply/divide   2          10/40 cycles
      Floating-point add               3          2 cycles

      LD    F6, 34(R2)
      LD    F2, 45(R3)
      MULTD F0, F2, F4
      SUBD  F8, F6, F2
      DIVD  F10, F0, F6
      ADDD  F6, F8, F2

  9. Dynamic Scheduling: The Tomasulo Approach

  10. Tomasulo Example Cycle 1

      Instruction status
      Instruction   j     k     Issue   Execution complete   Write result
      LD    F6      34+   R2    1
      LD    F2      45+   R3
      MULTD F0      F2    F4
      SUBD  F8      F6    F2
      DIVD  F10     F0    F6
      ADDD  F6      F8    F2

      Load buffers
      Name    Busy   Address
      Load1   Yes    34+R2
      Load2   No
      Load3   No

      Reservation stations
                                    S1          S2          RS for j   RS for k
      Time   Name    Busy   Op      Vj          Vk          Qj         Qk
      0      Add1    No
      0      Add2    No
             Add3    No
      0      Mult1   No
      0      Mult2   No

      Register result status (Clock 1)
      Register   F0         F2         F4         F6         F8         F10        F12   ...   F30
      FU                                          Load1

  11. Cycle 2

      Instruction status
      Instruction   j     k     Issue   Execution complete   Write result
      LD    F6      34+   R2    1
      LD    F2      45+   R3    2
      MULTD F0      F2    F4
      SUBD  F8      F6    F2
      DIVD  F10     F0    F6
      ADDD  F6      F8    F2

      Load buffers
      Name    Busy   Address
      Load1   Yes    34+R2
      Load2   Yes    45+R3
      Load3   No

      Reservation stations
                                    S1          S2          RS for j   RS for k
      Time   Name    Busy   Op      Vj          Vk          Qj         Qk
      0      Add1    No
      0      Add2    No
             Add3    No
      0      Mult1   No
      0      Mult2   No

      Register result status (Clock 2)
      Register   F0         F2         F4         F6         F8         F10        F12   ...   F30
      FU                    Load2                 Load1

  12. Cycle 3

      Instruction status
      Instruction   j     k     Issue   Execution complete   Write result
      LD    F6      34+   R2    1       3
      LD    F2      45+   R3    2
      MULTD F0      F2    F4    3
      SUBD  F8      F6    F2
      DIVD  F10     F0    F6
      ADDD  F6      F8    F2

      Load buffers
      Name    Busy   Address
      Load1   Yes    34+R2
      Load2   Yes    45+R3
      Load3   No

      Reservation stations
                                    S1          S2          RS for j   RS for k
      Time   Name    Busy   Op      Vj          Vk          Qj         Qk
      0      Add1    No
      0      Add2    No
             Add3    No
      0      Mult1   Yes    MULTD               R(F4)       Load2
      0      Mult2   No

      Register result status (Clock 3)
      Register   F0         F2         F4         F6         F8         F10        F12   ...   F30
      FU         Mult1      Load2                 Load1

  13. Cycle 4 • Load2 completing

  14. Cycle 5

  15. Cycle 6

  16. Cycle 7

      Instruction status
      Instruction   j     k     Issue   Execution complete   Write result
      LD    F6      34+   R2    1       3                    4
      LD    F2      45+   R3    2       4                    5
      MULTD F0      F2    F4    3
      SUBD  F8      F6    F2    4       7
      DIVD  F10     F0    F6    5
      ADDD  F6      F8    F2    6

      Load buffers
      Name    Busy   Address
      Load1   No
      Load2   No
      Load3   No

      Reservation stations
                                    S1          S2          RS for j   RS for k
      Time   Name    Busy   Op      Vj          Vk          Qj         Qk
      0      Add1    Yes    SUBD    M(34+R2)    M(45+R3)
      0      Add2    Yes    ADDD                M(45+R3)    Add1
             Add3    No
      8      Mult1   Yes    MULTD   M(45+R3)    R(F4)
      0      Mult2   Yes    DIVD                M(34+R2)    Mult1

      Register result status (Clock 7)
      Register   F0         F2         F4         F6         F8         F10        F12   ...   F30
      FU         Mult1      M(45+R3)              Add2       Add1       Mult2

      • RS Add1 completing

  17. Cycle 10

      Instruction status
      Instruction   j     k     Issue   Execution complete   Write result
      LD    F6      34+   R2    1       3                    4
      LD    F2      45+   R3    2       4                    5
      MULTD F0      F2    F4    3
      SUBD  F8      F6    F2    4       7                    8
      DIVD  F10     F0    F6    5
      ADDD  F6      F8    F2    6       10

      Load buffers
      Name    Busy   Address
      Load1   No
      Load2   No
      Load3   No

      Reservation stations
                                    S1          S2          RS for j   RS for k
      Time   Name    Busy   Op      Vj          Vk          Qj         Qk
      0      Add1    No
      0      Add2    Yes    ADDD    M()–M()     M(45+R3)
      0      Add3    No
      5      Mult1   Yes    MULTD   M(45+R3)    R(F4)
      0      Mult2   Yes    DIVD                M(34+R2)    Mult1

      Register result status (Clock 10)
      Register   F0         F2         F4         F6         F8         F10        F12   ...   F30
      FU         Mult1      M(45+R3)              Add2       M()–M()    Mult2

      • RS Add2 completing

  18. Cycle 11

      Instruction status
      Instruction   j     k     Issue   Execution complete   Write result
      LD    F6      34+   R2    1       3                    4
      LD    F2      45+   R3    2       4                    5
      MULTD F0      F2    F4    3
      SUBD  F8      F6    F2    4       7                    8
      DIVD  F10     F0    F6    5
      ADDD  F6      F8    F2    6       10                   11

      Load buffers
      Name    Busy   Address
      Load1   No
      Load2   No
      Load3   No

      Reservation stations
                                    S1          S2          RS for j   RS for k
      Time   Name    Busy   Op      Vj          Vk          Qj         Qk
      0      Add1    No
      0      Add2    No
      0      Add3    No
      4      Mult1   Yes    MULTD   M(45+R3)    R(F4)
      0      Mult2   Yes    DIVD                M(34+R2)    Mult1

      Register result status (Clock 11)
      Register   F0         F2         F4         F6         F8         F10        F12   ...   F30
      FU         Mult1      M(45+R3)              (M-M)+M()  M()–M()    Mult2

      • Write back result of ADDD in this cycle

  19. Cycle 15 • Mult1 completing

  20. Cycle 16 Only Divide instruction remains

  21. Cycle 57 Instruction block done • Again we have: • In-order issue • Out-of-order execution and completion
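The cycle numbers in the tables can be reproduced with a simple timing rule. The sketch below is a back-of-the-envelope model rather than a full simulator, and it rests on assumptions: one instruction issues per cycle with no structural stalls, execution starts in the cycle after the later of issue and the last operand's write-result cycle, the write follows one cycle after execution completes, and CDB contention is ignored. The load latency of 2 cycles is inferred from the cycle tables (issue in cycle 1, execution complete in cycle 3), not taken from the latency table on slide 8.

```python
# Execution latencies in cycles; the LD value is inferred from the cycle tables.
LATENCY = {"LD": 2, "MULTD": 10, "DIVD": 40, "ADDD": 2, "SUBD": 2}

# (op, destination, floating-point source registers), in program order.
program = [
    ("LD",    "F6",  ()),
    ("LD",    "F2",  ()),
    ("MULTD", "F0",  ("F2", "F4")),
    ("SUBD",  "F8",  ("F6", "F2")),
    ("DIVD",  "F10", ("F0", "F6")),
    ("ADDD",  "F6",  ("F8", "F2")),
]


def schedule(prog):
    """Return (op, dest, issue, execution complete, write result) per instruction."""
    producer_write = {}   # register -> write-result cycle of its latest producer
    rows = []
    for issue, (op, dest, srcs) in enumerate(prog, start=1):
        # Sources bind to the producers known at issue time, so DIVD reads F6
        # from the first load and not from the later ADDD.
        operands_ready = max((producer_write.get(s, 0) for s in srcs), default=0)
        start = max(issue, operands_ready) + 1
        complete = start + LATENCY[op] - 1
        write = complete + 1
        rows.append((op, dest, issue, complete, write))
        producer_write[dest] = write
    return rows


for row in schedule(program):
    print(row)
# ('LD', 'F6', 1, 3, 4)
# ('LD', 'F2', 2, 4, 5)
# ('MULTD', 'F0', 3, 15, 16)
# ('SUBD', 'F8', 4, 7, 8)
# ('DIVD', 'F10', 5, 56, 57)
# ('ADDD', 'F6', 6, 10, 11)
```

Under these assumptions the model agrees with the slides: Mult1 completes in cycle 15, ADDD writes back in cycle 11 ahead of the earlier MULTD and DIVD, and the block finishes when DIVD writes its result in cycle 57.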

  22. Tomasulo Approach Example: Reservation Stations and Register Tags.

  23. Tomasulo Approach Example: Multiply and divide are the only instructions not finished.

