
Computer Structure: Out-Of-Order Execution

Computer Structure: Out-Of-Order Execution. Lihu Rappoport and Adi Yoaz. What’s Next. Remember our goal: minimize CPU Time. CPU Time = clock cycle × CPI × IC. So far we have learned: minimize clock cycle → add more pipe stages; minimize CPI → use pipeline; minimize IC → architecture.


Presentation Transcript


  1. Computer Structure: Out-Of-Order Execution. Lihu Rappoport and Adi Yoaz

  2. What’s Next
     • Remember our goal: minimize CPU Time
       CPU Time = clock cycle × CPI × IC
     • So far we have learned
       • Minimize clock cycle → add more pipe stages
       • Minimize CPI → use pipeline
       • Minimize IC → architecture
     • In a pipelined CPU:
       • CPI w/o hazards is 1
       • CPI with hazards is > 1
     • Adding more pipe stages reduces clock cycle but increases CPI
       • Higher penalty due to control hazards
       • More data hazards
     • Beyond some point adding more pipe stages does not help
     • What can we do? Further reduce the CPI!
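
The trade-off between clock cycle and CPI can be checked numerically. The following is a minimal sketch: the function name `cpu_time` and all three design points (cycle times, CPI values, instruction count) are made-up numbers for illustration, not figures from the slides.

```python
def cpu_time(clock_cycle_ns: float, cpi: float, instruction_count: int) -> float:
    """CPU Time = clock cycle x CPI x IC, returned in nanoseconds."""
    return clock_cycle_ns * cpi * instruction_count


IC = 1_000_000  # hypothetical instruction count

# Hypothetical design points: a deeper pipeline shortens the clock cycle
# but raises the CPI, because hazards cost more cycles.
shallow = cpu_time(clock_cycle_ns=1.0, cpi=1.2, instruction_count=IC)
deeper = cpu_time(clock_cycle_ns=0.5, cpi=1.5, instruction_count=IC)
too_deep = cpu_time(clock_cycle_ns=0.45, cpi=2.0, instruction_count=IC)

print(f"shallow:  {shallow:.0f} ns")
print(f"deeper:   {deeper:.0f} ns")
print(f"too deep: {too_deep:.0f} ns")
```

With these assumed numbers the deeper pipeline wins over the shallow one, while the deepest one loses again because the CPI grows faster than the clock cycle shrinks.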

  3. A Superscalar CPU
     • Duplicating HW in one pipe stage won’t help
       • e.g., have 2 ALUs
       • the bottleneck moves to other stages
     • Getting IPC > 1 requires fetching, decoding, executing, and retiring more than 1 instruction per clock:
       [Figure: two parallel pipelines, each IF → ID → EXE → MEM → WB]

  4. The Pentium Processor
     [Figure: IF → ID → pairing, feeding the U-pipe and the V-pipe]
     • Fetches and decodes 2 instructions per cycle
     • Before register file read, decide on pairing
       • can the two instructions be executed in parallel?
     • Pairing decision is based on
       • Data dependencies: instructions must be independent
       • Resources:
         • Some instructions use resources from the 2 pipes
         • The second pipe can only execute part of the instructions

  5. Misprediction Penalty in a Superscalar CPU
     • MPI: miss-per-instruction:
       MPI = (# incorrectly predicted branches) / (total # of instructions) = MPR × (# predicted branches) / (total # of instructions)
     • MPI correlates well with performance. E.g., assume:
       • MPR = 5%, %branches = 20% → MPI = 1%
       • Without hazards IPC = 2 (2 instructions per cycle)
       • Flush penalty of 5 cycles
     • We get:
       • MPI = 1% → flush in every 100 instructions
       • IPC = 2 → flush every 100/2 = 50 cycles
       • 5 cycles flush penalty every 50 cycles → 10% loss in performance
     • For IPC = 1 we would get
       • 5 cycles flush penalty per 100 cycles → 5% loss in performance
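
The arithmetic of this slide can be replayed step by step. This is a small sketch; the function name `flush_overhead` and its parameter names are illustrative, and the result is the slide's ratio of flush cycles to useful cycles, not a full performance model.

```python
def flush_overhead(mpr: float, branch_fraction: float, ipc: float,
                   flush_penalty: int) -> float:
    """Flush cycles paid per useful cycle, following the slide's reasoning."""
    mpi = mpr * branch_fraction               # mispredictions per instruction
    instructions_per_flush = 1 / mpi          # one flush every N instructions
    cycles_per_flush = instructions_per_flush / ipc
    return flush_penalty / cycles_per_flush


# Numbers taken from the slide: MPR = 5%, 20% branches, 5-cycle flush penalty.
superscalar = flush_overhead(mpr=0.05, branch_fraction=0.20, ipc=2, flush_penalty=5)
scalar = flush_overhead(mpr=0.05, branch_fraction=0.20, ipc=1, flush_penalty=5)

print(f"IPC=2: {superscalar:.1%} overhead")
print(f"IPC=1: {scalar:.1%} overhead")
```

The same misprediction rate costs twice as much at IPC = 2, because the flush penalty in cycles is unchanged while the useful work between flushes takes half as many cycles.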

  6. Is Superscalar Good Enough?
     • A superscalar processor can fetch, decode, execute and retire 2 instructions in parallel
     • Can execute only independent instructions in parallel
     • But … adjacent instructions are usually dependent
       • The utilization of the second pipe is usually low
       • There are algorithms in which both pipes are highly utilized
     • Solution: out-of-order execution
       • Execute instructions based on “data flow” rather than program order
       • Still need to keep the semantics of the original program

  7. Out Of Order Execution
     • Look ahead in a window of instructions
     • Find instructions that are ready to execute
       • Do not depend on data from previous instructions that have not yet executed
       • Have the required execution resources available
     • Start executing an instruction before a previous instruction has executed
     • Advantages
       • Help exploit Instruction Level Parallelism (ILP)
       • Help cover latencies (e.g., L1 data cache miss, divide)
     • Can Compilers do the Work?
       • Compilers can statically reschedule instructions
       • Compilers do not have run time information
         • Conditional branch direction → limited to basic blocks
         • Data values, which may affect calculation time and control
         • Cache miss / hit

  8. Data Flow Analysis
     • Example:
       (1) r1 ← r4 / r7    ; assume divide takes 20 cycles
       (2) r8 ← r1 + r2
       (3) r5 ← r5 + 1
       (4) r6 ← r6 - r3
       (5) r4 ← r5 + r6
       (6) r7 ← r8 * r4
     [Figure: data flow graph of instructions 1 to 6 (edges labeled r1, r8, r5, r6, r4), with in-order and out-of-order execution timelines]
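
The data flow of this example can be computed mechanically. The sketch below is an illustration under stated assumptions: only the 20-cycle divide comes from the slide, every other operation is assumed to take 1 cycle, execution resources are assumed unlimited, and only true (RAW) dependencies are tracked. The function names are made up for this sketch.

```python
# number -> (destination, sources, latency in cycles)
# Only the divide latency (20) is from the slide; the 1-cycle latencies
# of the other operations are an assumption.
program = {
    1: ("r1", ["r4", "r7"], 20),
    2: ("r8", ["r1", "r2"], 1),
    3: ("r5", ["r5"], 1),
    4: ("r6", ["r6", "r3"], 1),
    5: ("r4", ["r5", "r6"], 1),
    6: ("r7", ["r8", "r4"], 1),
}


def raw_producers(program):
    """For each instruction, find the earlier instructions whose results it reads."""
    last_writer = {}  # register -> number of the latest instruction writing it
    producers = {}
    for num in sorted(program):
        dest, sources, _ = program[num]
        producers[num] = sorted({last_writer[s] for s in sources if s in last_writer})
        last_writer[dest] = num
    return producers


def dataflow_finish_times(program):
    """Earliest finish cycle of each instruction, limited only by RAW dependencies."""
    producers = raw_producers(program)
    finish = {}
    for num in sorted(program):
        start = max((finish[p] for p in producers[num]), default=0)
        finish[num] = start + program[num][2]
    return finish


producers = raw_producers(program)
finish = dataflow_finish_times(program)
serial_cycles = sum(latency for _, _, latency in program.values())

print("RAW producers:", producers)
print("data-flow finish times:", finish)
print("one-at-a-time execution:", serial_cycles, "cycles")
```

Under these assumptions, instructions 3, 4 and 5 complete while the divide is still running, and the whole sequence is bounded by the chain 1 → 2 → 6.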

  9. OOOE – General Scheme
     [Figure: Fetch & Decode (in-order) → Instruction pool → Execute (out-of-order) → Retire (commit) (in-order)]
     • Fetch & decode instructions in parallel but in order, to fill inst. pool
     • Execute ready instructions from the instruction pool
       • All the data required for the instruction is ready
       • Execution resources are available
     • Once an instruction is executed
       • signal all dependent instructions that data is ready
     • Commit instructions in parallel but in-order
       • Instruction committed only after all preceding instructions (in program order) have committed

  10. Out Of Order Execution – Example
      [Figure: diagram of instructions 1 to 5]
      • Assume that executing a divide operation takes 20 cycles
        (1) r1 ← r5 / r4
        (2) r3 ← r1 + r8
        (3) r8 ← r5 + 1
        (4) r3 ← r7 - 2
        (5) r6 ← r6 + r7
      • Inst2 has a RAW dependency on r1 with Inst1
        • It cannot be executed in parallel with Inst1
      • Can successive instructions pass Inst2?
        • Inst3 cannot since Inst2 must read r8 before Inst3 writes to it
        • Inst4 cannot since it must write to r3 after Inst2
        • Inst5 can
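
The dependency checks in this example can be written down as a pairwise test. This is a minimal sketch: the `hazards` helper and the tuple encoding of instructions are illustrative, and the check only compares two instructions directly, without considering writers in between.

```python
def hazards(earlier, later):
    """Return the dependency types that `later` has on `earlier`.

    Each instruction is (destination_register, [source_registers]).
    """
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    found = set()
    if e_dest in l_srcs:
        found.add("RAW")  # later reads what earlier writes
    if l_dest in e_srcs:
        found.add("WAR")  # later overwrites what earlier still has to read
    if l_dest == e_dest:
        found.add("WAW")  # both write the same register
    return found


# The five instructions of the example (immediates are not registers).
program = {
    1: ("r1", ["r5", "r4"]),
    2: ("r3", ["r1", "r8"]),
    3: ("r8", ["r5"]),
    4: ("r3", ["r7"]),
    5: ("r6", ["r6", "r7"]),
}

print("Inst2 on Inst1:", hazards(program[1], program[2]))
for n in (3, 4, 5):
    found = hazards(program[2], program[n])
    print(f"Inst{n} on Inst2:", found if found else "none, can pass Inst2")
```

The result matches the slide: Inst3 is held back by a WAR dependency on r8, Inst4 by a WAW dependency on r3, and Inst5 has no dependency on Inst2.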

  11. False Dependencies
      • OOOE creates new dependencies
        • WAR: write to a register which is read by an earlier inst.
          (1) r3 ← r2 + r1
          (2) r2 ← r4 + 3
        • WAW: write to a register which is written by an earlier inst.
          (1) r3 ← r1 + r2
          (2) r3 ← r4 + 3
      • These are false dependencies
        • There is no missing data
        • Still prevent executing instructions out-of-order
      • Solution: Register Renaming

  12. Register Renaming
      • Hold a pool of physical registers
      • Renaming map maps architectural registers into physical registers
      • Map architectural registers into physical registers
        • Before an instruction is sent for execution
        • Map the arch destination register to a newly allocated physical register
        • Replace each of the arch source registers with the mapped physical registers
      • When an instruction needs data from a source register for execution
        • Read data from the physical register allocated to the latest inst which writes to the same arch register, and precedes the current inst
        • If no such instruction exists, read directly from the arch. register
      • When an instruction produces a result following its execution
        • Write the result value to the physical register
      • When an instruction commits
        • Move the value from the physical register to the arch register it is mapped to

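  13. OOOE with Register Renaming: Example
      Instruction      Alloc/Rename   Renamed instruction   EXE (cyc 1 or cyc 2)
      r1 ← mem1        r1: pr1        pr1 ← mem1            x
      r2 ← r2 + r1     r2: pr2        pr2 ← r2 + pr1        x
      r1 ← mem2        r1: pr3        pr3 ← mem2            x
      r3 ← r3 + r1     r3: pr4        pr4 ← r3 + pr3        x
      r1 ← mem3        r1: pr5        pr5 ← mem3            x
      r4 ← r5 + r1     r4: pr6        pr6 ← r5 + pr5        x
      r5 ← 2           r5: pr7        pr7 ← 2               x
      r6 ← r5 + 2      r6: pr8        pr8 ← pr7 + 2         x
      [Figure annotations: WAW, WAW and WAR dependencies marked between the original instructions; the Alloc/Rename column is labeled "Rename map"]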
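
The renaming in this example can be reproduced with a few lines of code. The sketch below is a simplified illustration, not a hardware description: the `rename` function and the tuple encoding are made up for this purpose, the pool of physical registers is assumed to be unlimited, and registers are never freed.

```python
from itertools import count


def rename(program):
    """Rename destinations to fresh physical registers pr1, pr2, ...

    Each instruction is (dest, [sources]). Operands that are not mapped
    (memory locations, immediates, arch registers without a pending writer)
    are passed through unchanged.
    """
    rename_map = {}   # arch register -> latest physical register
    fresh = count(1)  # assumption: unlimited supply of physical registers
    renamed = []
    for dest, sources in program:
        # Sources read the latest mapping, or the arch register if none exists.
        new_sources = [rename_map.get(s, s) for s in sources]
        # The destination always gets a newly allocated physical register.
        rename_map[dest] = f"pr{next(fresh)}"
        renamed.append((rename_map[dest], new_sources))
    return renamed, rename_map


program = [
    ("r1", ["mem1"]),
    ("r2", ["r2", "r1"]),
    ("r1", ["mem2"]),
    ("r3", ["r3", "r1"]),
    ("r1", ["mem3"]),
    ("r4", ["r5", "r1"]),
    ("r5", ["2"]),
    ("r6", ["r5", "2"]),
]

renamed, final_map = rename(program)
for (dest, sources), (new_dest, new_sources) in zip(program, renamed):
    print(f"{dest} <- {sources}   becomes   {new_dest} <- {new_sources}")
print("final rename map:", final_map)
```

After renaming, the three writes to r1 target pr1, pr3 and pr5, so the WAW and WAR dependencies disappear and only the RAW chains (pr1 → pr2, pr3 → pr4, pr5 → pr6, pr7 → pr8) remain.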

  14. Executing Beyond Branches
      • So far we do not look for instructions ready to execute beyond a branch
        • Limited to the parallelism within a basic-block
        • A basic-block is ~5 instructions long
          (1) r1 ← r4 / r7
          (2) r2 ← r2 + r1
          (3) r3 ← r2 - 5
          (4) beq r3,0,300
          (5) r8 ← r8 + 1
          If the beq is predicted NT, Inst 5 can be speculatively executed
      • We would like to look beyond branches
        • But what if we execute an instruction beyond a branch and then it turns out that we predicted the wrong path?
      • Solution: Speculative Execution

  15. Speculative Execution
      • Fetch instructions into the pool from a predicted path
        • Instructions for which all operands are ready can be executed
      • Instructions commit in-order
        • Instructions which follow a branch commit only after the branch commits
      • Branch resolved to be correctly predicted → the instructions are safe
      • Branch resolved to be wrongly predicted → flush the instructions
        • Reset the renaming map, so all registers are mapped to arch. registers
        • Redirect the instruction fetch to the correct address
      • Branch is resolved at EXE → no need to wait until it commits
        • Start the misprediction recovery right after the branch executes
        • Start fetching instructions from correct path
        • Block renaming of new instructions until branch commits. Once branch commits, reset the renaming map, and resume renaming
        • Better: recover the renaming map to its state right after the branch. Once renaming map is recovered, resume renaming

  16. OOOE with Register Renaming: Example
      Instruction      Alloc/Rename   Renamed instruction   EXE (cyc 1 or cyc 2)
      r1 ← mem1        r1: pr1        pr1 ← mem1            x
      r2 ← r2 + r1     r2: pr2        pr2 ← r2 + pr1        x
      r1 ← mem2        r1: pr3        pr3 ← mem2            x
      r3 ← r3 + r1     r3: pr4        pr4 ← r3 + pr3        x
      jcc L2           (predicted taken to L2)
      L2:
      r1 ← mem3        r1: pr5        pr5 ← mem3            x
      r4 ← r5 + r1     r4: pr6        pr6 ← r5 + pr5        x
      r5 ← 2           r5: pr7        pr7 ← 2               x
      r6 ← r5 + 2      r6: pr8        pr8 ← pr7 + 2         x
      [Figure annotations: WAW, WAW and WAR dependencies marked between the original instructions; the Alloc/Rename column is labeled "Rename map"]
      • If branch mispredicts, there are 2 options
        • When branch commits: reset rename map, and then rename correct instructions
        • When branch executes: recover rename map, and then rename correct instructions
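
The two recovery options can be contrasted in a toy model. This is a sketch under assumptions: the `RenameMap` class, its method names and the per-branch snapshot are illustrative choices (the slides do not say how the map is recovered), and reclaiming of physical registers is not modeled.

```python
class RenameMap:
    """Toy rename map with per-branch snapshots (hypothetical design)."""

    def __init__(self):
        self.mapping = {}      # arch register -> physical register
        self.next_phys = 1
        self.checkpoints = {}  # branch id -> copy of the mapping

    def rename_dest(self, arch_reg):
        phys = f"pr{self.next_phys}"
        self.next_phys += 1
        self.mapping[arch_reg] = phys
        return phys

    def checkpoint(self, branch_id):
        # Remember the map as it is right after the branch is renamed.
        self.checkpoints[branch_id] = dict(self.mapping)

    def recover(self, branch_id):
        # Option 2: when the branch executes, restore the snapshot.
        self.mapping = dict(self.checkpoints.pop(branch_id))

    def reset(self):
        # Option 1: when the branch commits, all older instructions have
        # committed too, so every register is read from the arch registers.
        self.mapping = {}


rmap = RenameMap()
for reg in ("r1", "r2", "r1", "r3"):   # instructions before the jcc
    rmap.rename_dest(reg)
rmap.checkpoint("jcc")
for reg in ("r1", "r4", "r5", "r6"):   # predicted (wrong) path at L2
    rmap.rename_dest(reg)
wrong_path_map = dict(rmap.mapping)

rmap.recover("jcc")                    # misprediction detected at EXE
recovered_map = dict(rmap.mapping)
print("wrong path:", wrong_path_map)
print("recovered: ", recovered_map)
```

Recovering from the snapshot lets renaming of the correct path resume as soon as the branch executes, whereas the reset option has to wait until the branch commits.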

  17. OOO Requires Accurate Branch Predictor
      • Accurate branch predictor increases effective scheduling window size
      • Speculate across multiple branches (a branch every 5 – 10 instructions)
      [Figure: instruction pool divided by branches, ranging from high chances to commit to low chances to commit]
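
A rough way to see why accuracy matters: an instruction that sits behind several unresolved branches commits only if all of them were predicted correctly. The sketch below assumes that predictions are independent and equally accurate, which is a simplification not stated in the slides; the accuracy values are hypothetical.

```python
def commit_probability(accuracy: float, branches_ahead: int) -> float:
    """Chance that an instruction behind `branches_ahead` unresolved branches
    is on the correct path, assuming independent predictions."""
    return accuracy ** branches_ahead


# Hypothetical predictor accuracies, compared at increasing speculation depth.
for accuracy in (0.95, 0.99):
    chances = [commit_probability(accuracy, n) for n in (1, 5, 10)]
    print(accuracy, [f"{c:.2f}" for c in chances])
```

With one branch every 5 to 10 instructions, a deep instruction pool spans many branches, so a small gain in accuracy keeps a much larger part of the window useful.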

  18. Interrupts and Faults Handling
      • Complications for pipelined and OOO execution
        • Interrupts occur in the middle of an instruction
        • A speculative instruction can get a fault (divide by 0, page fault)
      • Faults are served in program order, at retirement only
        • Mark an instruction that takes a fault at execution
        • Instructions older than the faulting instruction are retired
        • Only when the faulting instruction retires, handle the fault
          • Flush subsequent instructions
          • Initiate the fault handling code according to the fault type
          • Restart faulting and/or subsequent instructions
      • Interrupts are served when the next instruction retires
        • Let the instruction in the current cycle retire
        • Flush subsequent instructions and initiate the interrupt service code
        • Fetch the subsequent instructions
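
Fault handling at retirement can be sketched with a small in-order retire loop. This is an illustration only: the `retire` function, the dictionary fields and the instruction names are made up, and a real retirement unit handles several instructions per cycle.

```python
from collections import deque


def retire(rob):
    """Retire instructions in program order from a reorder buffer.

    Each entry is a dict with 'name', 'done' and 'fault' (None or a string).
    Returns (retired_names, fault, flushed_names).
    """
    rob = deque(rob)
    retired = []
    while rob and rob[0]["done"]:
        entry = rob.popleft()
        if entry["fault"] is not None:
            # The fault was only marked at execution. It is handled now, when
            # the faulting instruction reaches the head of the buffer, and
            # everything younger is flushed.
            flushed = [e["name"] for e in rob]
            return retired, (entry["name"], entry["fault"]), flushed
        retired.append(entry["name"])
    return retired, None, []


rob = [
    {"name": "i1", "done": True, "fault": None},
    {"name": "i2", "done": True, "fault": "divide by 0"},
    {"name": "i3", "done": True, "fault": None},   # executed speculatively
    {"name": "i4", "done": False, "fault": None},
]
retired, fault, flushed = retire(rob)
print("retired:", retired)
print("fault:", fault)
print("flushed:", flushed)

# If an older instruction has not finished, the marked fault must wait.
waiting = [dict(rob[0], done=False)] + rob[1:]
print(retire(waiting))
```

Because the fault is raised only when the faulting instruction reaches retirement, a fault on a wrong-path instruction is never served: that instruction is flushed before it gets there.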

  19. Out Of Order Execution – Summary
      • Advantages
        • Help exploit Instruction Level Parallelism (ILP)
        • Help cover latencies (e.g., cache miss, divide)
        • Superior/complementary to compiler scheduler
          • Dynamic instruction window
          • Reg Renaming: can use more registers than the number of architectural registers
      • Complex micro-architecture
        • Complex scheduler
        • Requires reordering mechanism (retirement) in the back-end for:
          • Precise interrupt resolution
          • Misprediction/speculation recovery
          • Memory ordering
      • Speculative Execution
        • Advantage: larger scheduling window → reveals more ILP
        • Issues: misprediction cost and misprediction recovery
