
Reducing Branch Penalties



  1. Reducing Branch Penalties • We saw in chapter 3 static approaches to dealing with branch penalties (assuming branches taken or not taken, using the branch delay slot) • Here, we will examine dynamic, hardware-oriented approaches: • branch-prediction buffer • two-bit prediction scheme • branch-target buffer • branch folding

  2. Branch Prediction Buffer • Consider that if we know a particular branch is usually taken, then we can predict it will be taken this time • The buffer is a small cache indexed by the low-order bits of the address of the branch instruction • This buffer contains a bit indicating whether the branch was taken last time (a one-outcome history) • Upon fetching the instruction, we also fetch the prediction bit (from this special cache) and use it to decide whether to start the branch process or ignore the branch

  3. Accuracy • Consider a loop which iterates 10 times, with the prediction bit initially set to false (not taken) • First time: we predict not taken, but the branch is taken - pay the branch penalty and set the bit to predict taken; the prediction is then correct for the next 8 iterations; on the last iteration we predict taken, but the branch is not taken • Even though the branch is taken 9 out of 10 times, this method mispredicts twice, giving an accuracy of 80% • In general, loop branches are very predictable; they are either skipped or repeated many times • The best accuracy any predictor can reach on such a loop is (number of iterations - 1)/number of iterations, since the final not-taken outcome is always mispredicted

  4. Better Approach • Rather than using a single bit of history for the last branching decision, use a 2-bit history • There are four states: • Predict not taken and last time was not taken • Predict not taken and last time was taken • Predict taken and last time was taken • Predict taken and last time was not taken • See figure 4.13, page 264 • Advantage - a longer history gives more accuracy: a prediction must be wrong twice in a row before it flips, so we avoid the double misprediction seen in the previous example

  5. N-bit Approach • We can generalize the 2-bit approach to an n-bit approach, where each branch has an n-bit history • A shift register is used: every time the branch is reached, its predictor is fetched, then updated by shifting it one bit and inserting a new bit based on the branch's latest outcome • In practice, a 2-bit approach is accurate enough that the n-bit approach is rarely needed • See pages 263-271 for more details

  6. DLX and these Approaches • For this approach to be useful, we need to know the branch target address early in the pipeline • In DLX, the branch target address is computed at the same time as the branch condition -- that is, we know whether we are going to branch at the same time we know where to branch to • So, there is no point in predicting whether we are going to branch in DLX, since we will know the outcome at the same time as the target address of the branch • So, these approaches will only be useful for longer pipelines, where the branch target is known before the branch condition

  7. Branch-Target Buffers • To reduce the branch penalty in DLX, we need to know the branch address by the end of the IF stage -- that is, we need to know where we are branching to before we have even decoded the instruction as a branch! • Doesn't seem possible • A branch-prediction buffer stores only whether the branch is taken or not • A branch-target buffer can instead store not only a prediction for the branch but, if taken, the predicted target address of the branch

  8. How it Works • The branch-target buffer is accessed during the IF stage using the address of the instruction being fetched • If we get a hit (the instruction has a corresponding entry in the cache), then we retrieve the predicted PC of the branch; we also retrieve whether the branch is predicted as taken or not • If predicted taken, update the PC to be the predicted branch address; otherwise increment the PC as usual • See figure 4.22, page 273

  9. Consequence of this approach • Notice that if the cache access is a hit and the prediction is accurate, we have a 0 clock cycle branch penalty! • What if it is a miss, or we are wrong about the prediction? • If found but the prediction is wrong (it either predicts taken when the branch is not taken, or predicts not taken when it is), then we have a two clock cycle penalty • If the cache misses, then we just use the normal branch mechanism in DLX, which has a 2 cycle penalty • See figure 4.23, page 275

  10. Example • Consider a prediction accuracy of 90% and a hit rate of 90%; if 60% of all branches are actually taken, what is the average branch penalty using this scheme? • Branch penalty = hit rate * percent incorrect predictions * 2 cycles + (1 - hit rate) * taken branches * 2 cycles = (90% * 10% * 2) + (10% * 60% * 2) = 0.18 + 0.12 = 0.3 cycles • Using delayed branches (as seen in chapter 3.5), we had an average branch penalty of 0.5 cycles, so we improve with this approach

  11. A Variation - Branch Folding • Notice that in this approach, we fetch the new PC value (or, more likely, an offset for the PC) from the buffer, but then we still need to update the PC and fetch that instruction • Instead, why not just fetch the target instruction directly? • Branch folding has the buffer store the predicted target instruction rather than the predicted PC of that instruction! • We can use this scheme for unconditional branches so that we have a 0 cycle penalty (this approach won't work well for conditional branches)

  12. More ILP • Previous approaches all attempt to limit hazard stalls and lower the average CPI to the ideal CPI of 1 • Can we decrease CPI to under 1? How? Issue and execute more than 1 instruction at a time • Multiple-issue processors come in two kinds: • Superscalars - use static and/or dynamic scheduling mechanisms and multiple functional units to issue more than 1 instruction at a time • VLIW - very long instruction word - use instructions which are themselves multiple instructions, scheduled by a compiler - to be executed in parallel

  13. Superscalar for DLX • Hardware may issue from 1 to 8 instructions per clock cycle -- these instructions must be independent and satisfy other constraints • To avoid structural hazards, they must use different functional units, with at most one memory reference per cycle • Scheduling of instructions can be done statically by a compiler or dynamically by hardware (we will concentrate on compiler scheduling for now) • We will also concentrate on a 2-instruction superscalar for DLX, where one instruction will be an integer op and the other a floating point op

  14. How it works • Fetch now gets two simultaneous instructions (consecutive) - 64 bits worth of instruction • First instruction will be an integer operation (including all loads, stores and branches) • Second instruction, if available, will be a floating point operation (arithmetic) • See fig 4.26, page 279 for pipelined instruction stream • To make this worthwhile (since every pair of instructions will include a floating point operation), we need to pipeline floating point operations or else the floating point unit will become a bottleneck, stalling both sides of the pipeline: integer and floating point

  15. Additional Hardware • Same hardware to detect data hazards and forward • Floating point and integer operations use their own separate register sets • Enforcing structural hazards remains much the same (with a couple of new problems): • if the integer instruction is a floating point load, store, or move -- this could create contention for the floating point register file • it may also create a new RAW hazard between the two simultaneous instructions • We can solve these problems by issuing FP loads/stores by themselves in the instruction stream

  16. Limitations of Superscalars • The main problem with the superscalar approach is that superscalars can only take advantage of as much ILP as exists in the code • Consider the following loop:
      Loop: LD   F0, 0(R1)
            ADDD F4, F0, F2
            SD   0(R1), F4
            BNEZ R1, Loop
  • The LD and ADDD may be coupled together, but the ADDD requires the LD's result, and so the ADDD will stall, stalling the entire pipeline

  17. Combining Approaches • To truly take advantage of the superscalar, we will need to use other mechanisms in conjunction, such as loop unrolling • For the previous example, loop unrolling will allow us to reschedule the code so that there are no RAW hazards that stall the superscalar, and fewer branch delays • We will unroll the loop 5 times and reschedule the code so that all LD's appear first, ADDD's overlapped, followed by SD's and the loop mechanisms

  18. New loop • The rescheduled code, with integer instructions in the left column and floating point in the right:
      Loop: LD   F0, 0(R1)
            LD   F6, -8(R1)
            LD   F10, -16(R1)    ADDD F4, F0, F2
            LD   F14, -24(R1)    ADDD F8, F6, F2
            LD   F18, -32(R1)    ADDD F12, F10, F2
            SD   0(R1), F4       ADDD F16, F14, F2
            SD   -8(R1), F8      ADDD F20, F18, F2
            SD   -16(R1), F12
            SUBI R1, R1, #40
            SD   16(R1), F16
            BNEZ R1, Loop
            SD   8(R1), F20
  • Code has been improved by a factor of 2.5 over loop unrolling alone, since the ADDDs are all done in parallel

  19. Dynamic Scheduling • One possible approach is to double the speed of the instruction fetch stage so that two instructions are fetched in the time it takes to work on one, and let the EX stage issue two at a time • Or, use a decoupled architecture -- compiler scheduling first ensures that consecutive instructions have no dependences on each other, and then a variation of Tomasulo's approach is used • See pages 282-283 for details on both

  20. VLIW Approach • Each instruction is a group of instructions that can be executed in parallel -- determined by a compiler • Instruction fetch gets one VLIW (perhaps 112-168 bits in length) • The hardware will require multiple functional units • A VLIW might have two integer operations, two floating point operations, two memory references and a branch together • To fill out an entire VLIW instruction, other compiler techniques are needed like loop unrolling, scheduling across blocks and global techniques • See fig 4.29, p. 286 for VLIW version of previous ex.

  21. Limitations to Multiple-Issue • Inherent limitations in the ILP of a program • How many instructions are independent of each other? • How much distance is available between loading an operand and using it? • We can unroll loops to achieve some ILP, but how much? • Also, multicycle latency for certain types of operations will cause inconsistencies in the amount of issuing that can be done simultaneously (e.g. load versus integer add)

  22. Limitations to Multiple-Issue • Difficulties in building the underlying hardware • Need multiple functional units (although this in itself is not very expensive, and the cost grows only linearly) • Need an increase (possibly very large) in memory bandwidth and register-file bandwidth -- this takes up a significant area on the chip itself and may require larger bus sizes, which translates into more pins • Complexity of multiple fetches means a more complex memory system, possibly with independent banks for parallel accesses • Hardware degradation due to the above defeats the point of multiple-issue approaches

  23. Limitations to Multiple-Issue • Limitations specific to Superscalar or VLIW implementations • Hardware oriented scheduling requires mechanisms like the scoreboard or reservation stations • VLIW approach requires little or no additional hardware other than multiple functional units, but requires substantial compiler revisions -- and code compiled using the VLIW approach is not executable on machines without the multiple FUs • Any machine in between will also have limitations due to the difficulties in implementation

  24. Additional Compiler Support • We already discussed forms of loop unrolling and scheduling • There are other compiler techniques available that can also promote some amount of ILP: • Detecting and eliminating data dependencies • Software pipelining (symbolic loop unrolling) • Trace scheduling

  25. Finding Dependencies • There are problems identifying data dependencies in source code: • variables may be aliased • variables may be pointed to • variables may use indirect referencing through arrays • Detecting dependencies is done through matching symbolic names, and the above situations will often have the same variable (datum) being referenced with different names

  26. Loop carried dependences • Here, we examine dependencies across loop iterations • These usually come in the form of a recurrence: • for (I=2;I<=100;I=I+1) Y[I]=Y[I-1]+Y[I]; • each iteration depends on the result of the previous iteration • Another form of loop carried dependence is when the recurrence is over some distance (called the dependence distance), as in: • for (I=6;I<=100;I=I+1) Y[I]=Y[I-5]+Y[I]; • The dependence distance here is 5

  27. Detecting LC Dependences • Affine - array indices that follow the pattern a*i+b, where a and b are constants and i is the loop index -- this can be extended to multi-dimensional arrays • Almost all loop carried dependence algorithms rely on the array accesses being affine • Given two array accesses a*i+b and c*i+d: if d-b is not divisible by GCD(a, c), then there is no loop carried dependence

  28. Example • Does the following loop have loop carried dependences? • for (I=1;I<=100;I=I+1) X[2*I+3]=X[2*I] * 5.0; • a=2, b=3, c=2, d=0 • GCD(a,c)=2, d-b=-3 • Since -3 is not divisible by 2, the two accesses have no dependence • This is provable by considering that the first array access always yields odd array indices and the second always yields even array indices

  29. More on Detecting Dependences • Notice that the GCD test does not prove there is a dependence if d-b is divisible by GCD(a,c); it only proves there is no dependence when it is not divisible • Proving that a dependence exists is in general an NP-complete problem; however, many accurate and efficient tests exist for restricted situations and can be applied as a hierarchy of tests -- in this way, a compiler can detect the lack of dependences in many situations and use this for rescheduling

  30. Another Example • What are the types of dependences in the following code? Also rewrite the loop to remove any output and antidependences
      for (I=1; I<=100; I=I+1) {
         Y[I] = X[I] / c;   /* S1 */
         X[I] = X[I] + c;   /* S2 */
         Z[I] = Y[I] + c;   /* S3 */
         Y[I] = c - Y[I];   /* S4 */
      }
  • True dependences from S1 to S3 and S1 to S4 (Y[I]) • Antidependences from S1 to S2 (X[I]) and from S3 to S4 (Y[I]) • Output dependence from S1 to S4 (Y[I])

  31. Example Continued • Since Y[I] is written in both S1 and S4 and read in S3 and S4, we can rename the value written by S1 to T[I] • Since X[I] is written in S2 and read in S1 and S2, we can rename the value written by S2 to X1[I]
      for (I=1; I<=100; I=I+1) {
         T[I]  = X[I] / c;
         X1[I] = X[I] + c;
         Z[I]  = T[I] + c;
         Y[I]  = c - T[I];
      }
  • Note that we will have to replace X by X1 at a later point if X is referenced after this loop

  32. Software Pipelining • Another compiler technique is to symbolically unroll a loop and form a new loop out of different iterations of the original loop • In this way, the new loop interleaves execution of the loop's iterations • Consider a loop that has three parts A, B, C: then for iteration I, the new loop contains iteration I+2 of A, iteration I+1 of B and iteration I of C • This requires manipulating the loop maintenance mechanisms and adding pre- and post-loop instructions

  33. Example • Consider the following loop:
      Loop: LD   F0, 0(R1)
            ADDD F4, F0, F2
            SD   0(R1), F4
            SUBI R1, R1, #8
            BNEZ R1, Loop
  • This loop will have stalls: after the LD to load the operand into F0, after the ADDD to complete the floating point add, after the SUBI, plus the branch penalty • If we execute the I+2 iteration of LD, then the ADDD is not dependent on it, nor is the SD dependent on the I+1 iteration of ADDD during iteration I

  34. Continued • We select the I+2 version of LD, the I+1 version of ADDD and the I version of SD
      Loop: SD   16(R1), F4
            ADDD F4, F0, F2
            LD   F0, 0(R1)
            SUBI R1, R1, #8
            BNEZ R1, Loop
  • We precede the loop with the 1st and 2nd iterations of LD and the 1st iteration of ADDD, and follow the loop with the last iteration of ADDD and the last two iterations of SD • This new loop runs two fewer iterations and has no stalls for data hazards (only the branch penalty)

  35. More on Software Pipelining • In addition to this form of symbolic loop unrolling, we can perform ordinary loop unrolling as well • Between the two approaches, we reduce or eliminate data hazards and branch penalties • We reduce the loop overhead maintenance through loop unrolling • See the comparison in figure 4.31, p. 296

  36. Trace Scheduling • Another compiler technique is to select a likely path through the code based on branch predictions, moving code across branches • The compiler then generates a straight-line sequence of code without branches that incurs fewer control hazards and can be scheduled to reduce data hazards • Since this relies on branch predictions, the straight-line code may turn out to be wrong, so the approach must include mechanisms to recover from failed predictions

  37. More on Trace Scheduling • This is a form of global optimization since it schedules code across basic blocks, not just within blocks • It must preserve data and control dependences, which makes it tricky • Data dependences are overcome through unrolling and dependence analysis • Control dependences are partially reduced through unrolling • If the predictions are accurate, some amount of speedup will occur; if not, there is usually no slowdown

  38. Example • Consider the following if-then-else:
      A[I] = A[I] + B[I];
      if (A[I] == 0) {B[I] = X;}
      else {B[I] = Y;}
      C[I] = Z;
  • If we have a profile or knowledge that says the condition (A[I] == 0) is mostly true, then we can move the then clause and the following assignment statement to precede the branch • In this way, there is no branch penalty most of the time; only when the condition is false will there be a branch penalty

  39. Hardware Support for Parallelism • The software techniques are limited in the parallelism that they can exploit and therefore the performance speedup is limited • Aside from the scheduling approaches in 4.2 (scoreboard, Tomasulo’s approach), we can have further hardware support to produce more parallelism. We will look at: • Conditional Instructions • Speculation

  40. Conditional Instructions • In a conditional statement (such as an if-then), there is an explicit branch • If it is known that the condition will mostly be true, we can assume it is true, execute the then clause automatically, and skip the branch • If, once the then clause is being executed, we detect that the condition is false, we simply turn the then statement into a no-op • This is difficult unless the then clause is a simple statement, such as a simple assignment

  41. Example • Consider the simple statement • if (A==0) {S=T;} • Use register R1 to store A, R2 for S and R3 for T; also assume a new instruction CMOVZ A, B, C, a conditional move that moves B to A if C is 0 • We change the code from:
            BNEZ R1, L
            MOV  R2, R3
      L:
  • to become: CMOVZ R2, R3, R1 • With this single instruction there is no branch penalty, and once R1 is evaluated, the move of R3 into R2 can be turned into a no-op if R1 is not zero

  42. More elaborate example • Consider a two-issue superscalar:
            LW   R1, 40(R2)
            ADDD R3, R4, R5
            ADDD R6, R3, R7
            BEQZ R10, L
            LW   R8, 20(R10)
            LW   R9, 0(R8)
  • This code will incur a data dependence stall (LW R9 needs the R8 loaded just before it) if the branch is not taken • If we assume the branch is most often not taken, we can change LW R8, 20(R10) into the conditional load LWC R8, 20(R10), R10 and move it before the BEQZ into the vacant instruction slot

  43. Cond. Instruction and Exceptions • One problem with the use of a conditional instruction is that it must not cause an exception -- since the instruction may never wind up doing anything anyway (if the condition is false) • In the previous example, imagine that the LWC R8, 20(R10), R10 causes an exception because R10 is 0 (if R10=0, then 20(0) = 20, a memory fetch from an area probably reserved for the OS!) • Therefore, the compiler must be careful when moving an instruction and creating a conditional instruction out of it!

  44. Cond. Instruction Limitations • Instructions that are annulled (turned into no-ops) still take execution time • Most useful when the condition can be evaluated early -- if the condition and branch cannot be separated, then the conditional instruction is less useful • Use is limited if the control flow involves more than a simple alternative sequence • May cause a slowdown compared to unconditional instructions, requiring either a slower clock rate or a greater number of cycles • Most computers have only conditional moves

  45. Speculation w/ Hardware Support • We can combine compiler speculation on branches with hardware scheduling techniques • Consider Tomasulo's approach -- but enhanced with a fourth stage, the commit stage • Essentially the same hardware, except that we add a new buffer called the reorder buffer • Its job is to hold speculative instructions (i.e. those executed after a speculated branch or condition) • As instructions complete execution, they are not committed (i.e. results are not written); instead, they are placed into the reorder buffer until the speculation is known to be correct or incorrect

  46. Four Stages • Issue - get instruction from operation queue, issue if there is an empty reservation station AND an empty slot in the reorder buffer, update control entries. If no empty stations or buffer full, stall • Execute - if one or more operands not yet available, monitor CDB for them. Also check for RAW hazards. When both operands become available, execute operation. • Write result - when result available, write to CDB and from CDB into reorder buffer as well as other reservation stations waiting for result

  47. Commit - when an instruction, other than a branch with an incorrect prediction, reaches the head of the reorder buffer with its result, commit that instruction by writing its result to the correct register (or to memory if a store instruction) • If a branch with an incorrect prediction reaches the head, then the speculation for this branch was wrong, and all following instructions in the reorder buffer are invalid: flush the buffer and restart execution at the correct successor of the branch

  48. Example • Using the same example from Tomasulo (see pages 256-258), we want to see the status tables when the MULTD is ready to commit • In this case (like Tomasulo's example) both LD's have completed and written to registers (because they have reached the Commit stage) • However, unlike Tomasulo's example, we see that the SUBD and ADDD have also completed AND written their results (they had not written their results in the other example) • But since the MULTD has not yet committed, the SUBD and ADDD, which follow it, cannot have committed - they appear later in the reorder buffer

  49. Example 2 • A more interesting example is one with a loop • If we speculate that the branch will be taken, then the code in the loop can be repeated under this speculation • Only if the branch is not taken will we have to abort these instructions • This example is shown on pages 314-315 • At the point shown in figure 4.36, the first two instructions of the first iteration have committed. The branch is assumed to have been taken, and the second iteration is in progress along with part of the first • However, entry 5 (BNEZ) was speculated incorrectly: the branch is not taken, and therefore entries 6-10 are all flushed

  50. Adv. of Hardware-based Speculation • Hardware speculation already has operands stored in registers or reservation stations, and so gets past the problems compiler speculation has with aliases and pointers • Improves performance when branch prediction is not available at compile time • Maintains precise exceptions • Does not require compensation or bookkeeping code, as compiled code might if the compiler incorrectly speculates on branches • Does not require compiled code tuned to the machine
