Problems: Odd-Fetch - PowerPoint PPT Presentation

Presentation Transcript
Problems: Odd-Fetch

(nop padding placing the branch target L1 at word 1 of the I-cache line)

    nop   L1: cmp   add   add   bg L1   sub   nop   nop
    0     1         2     3     4       5     6     7

or

(nop padding placing the branch target L1 at word 3 of the I-cache line)

    nop   nop   nop   L1: cmp   add   add   bg L1   sub
    0     1     2     3         4     5     6       7

#ifdef FAST
int init (int i) { return (10*(i*1000)); }
#endif

#ifdef SLOW
int init (int i) { return (10*(i*1000 + 100)); }
#endif

int i, n, j, k;

int main()
{
  n = init(10000);
  j = 100000;
  k = 1342890;
  for (i = n; i > 0; i--) {
    j += (((j&0)+1));
    k += (((k&0)+1));
  }
  return 0;
}
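A note on the two variants: init(10000) returns 10*(10000*1000) = 100,000,000 iterations under FAST and 10*(10000*1000 + 100) = 100,001,000 under SLOW, so the loop does essentially the same work in both builds; the large timing difference reported below presumably comes from how the SLOW variant shifts the alignment of the code that follows (the point of this slide), not from the extra 1,000 iterations. One way to build and time both variants (the file name odd_fetch.c is illustrative, not from the original slides):

    cc -DFAST -O odd_fetch.c -o fast ; time ./fast
    cc -DSLOW -O odd_fetch.c -o slow ; time ./slow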

L1:
    cmp     %g1, 0
    add     %g2, 1, %g2
    add     %g5, 1, %g5
    bg,a,pt %icc, L1
    sub     %g1, 1, %g1

Problems: Odd-Fetch

“When the target of a branch is word 1 or word 3 of an I-cache line and the fourth instruction to be fetched is a branch, the branch prediction bits from the wrong pair of instructions is used.” [UltraSPARC-I User’s Manual]
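Each SPARC instruction is 4 bytes, so a 32-byte I-cache line holds 8 instructions (words 0-7). In the two nop paddings sketched at the top of this slide, the branch target L1 (the cmp of the loop shown above) sits at word 1 and word 3 respectively, and counting from the target, the fourth instruction fetched is the bg — exactly the condition described in the errata quoted above.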

Slow Alignments:

Performance:
                    Fast                   Slow
C code:             1.2 seconds            3.6 - 4.2 seconds
Assembly code:      2 cycles/iteration     6 - 7 cycles/iteration
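As a rough cross-check (assuming a clock of about 166 MHz, the figure suggested by the UltraSPARC-Enterprise 166 mentioned later in the deck): 2 cycles/iteration × 100,000,000 iterations is about 1.2 seconds, and 6-7 cycles/iteration gives about 3.6-4.2 seconds, which lines up with the measured C-code times above.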


Problems: Delay Slot

“…if the address of the instructions in an [I-cache] group cross a 32-byte boundary, an implicit branch is “forced” between instructions at address 31 and 32 (low order bits). That rule has a performance impact only if a branch is in that specific group. Care should be taken not to place a branch in a group that crosses this boundary…. A group containing instructions I0 (branch), I1, I2, and I3 will be broken, because an artificial branch is forced after address 31 and there is already a branch in the group.” [UltraSPARC User’s Manual]

...
A1:  Label1:
A2:      nop
A3:      br   Label3
A4:      nop
B1:  Label2:
B2:      cmp  %l0, 0
B3:      bg   Label1
B4:      add  %l0, -1, %l0
C1:  Label3:
C2:      nop
C3:      br   Label2
C4:      nop
...

For each group that crosses a 32-byte boundary (i.e., crosses an I-cache line boundary), an additional cycle of runtime is added per iteration.

Therefore:

# groups crossing a boundary    Time (secs)    Cycles/iteration
0                               1.8            3
1                               2.4            4
2                               3.0            5
3                               3.6            6
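Where the range 0-3 comes from: each group (A, B, or C) is four 4-byte instructions, i.e. 16 bytes, so whether it straddles a 32-byte line boundary depends only on where it starts; by shifting where the three groups are placed, anywhere from 0 to all 3 of them can be made to cross a boundary, and each crossing group adds one cycle to the 3-cycle base case in the first row.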

Cause:

Each of the groups shown above already has a branch in it, and when the group crosses a 32-byte boundary an “implicit branch” is added (as stated in the manual). But only 1 branch can be executed each cycle, so each I-cache group that crosses a 32-byte boundary and contains a branch takes 2 cycles to execute instead of 1.
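One quick way to see where a hot loop actually lands relative to 32-byte lines is to print the low bits of its address. This is only a sketch, assuming GCC's label-address extension (&&label); it is not from the original slides:

#include <stdio.h>

int main(void)
{
    volatile int i = 100;

loop:                               /* the loop whose placement we want to inspect */
    i = i - 1;
    if (i > 0) goto loop;

    /* The low 5 bits of the label's address give the loop's byte offset
       within a 32-byte-aligned block (one I-cache line on this machine);
       an offset greater than 16 means a 4-instruction group starting here
       would straddle a line boundary. */
    printf("loop starts at byte offset %lu within its 32-byte line\n",
           (unsigned long)&&loop & 31UL);
    return 0;
}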


Problems: Fetching Limitation

...
A1:  Label1:
A2:      nop
A3:      br   Label3
A4:      nop
B1:  Label2:
B2:      cmp  %l0, 0
B3:      bg   Label1
B4:      add  %l0, -1, %l0
C1:  Label3:
C2:      nop
C3:      br   Label2
C4:      nop
...

For each group that crosses a 32-byte boundary (i.e., crosses an I-cache line boundary), an additional cycle of runtime is added per iteration.

Therefore:

# groups crossing a boundary    Time (secs)    Cycles/iteration
0                               1.8            3
1                               2.4            4
2                               3.0            5
3                               3.6            6

Problems: Fetching Limitation

Basic Problem:

Only instructions from a SINGLE I-cache line can be fetched each cycle

- For straight-line code, this is not an issue

- For code with many CTIs (control transfer instructions), this can cause a slowdown even if all of the branches are predicted correctly

Cause:

Every time a group of instructions crosses a 32-byte boundary, two I-cache accesses must be used to fetch the whole group. In order for the processor to sustain a high rate of instruction dispatch, a high rate of instruction fetch must be occurring from the I-cache to the I-buffer. In the worst case here, though, only 1-2 instructions out of every group fetched are actually being executed, causing the total execution rate to be cut in half.


Problems: Grouping

    fmovs  %f1, %f1
    nop
    nop
    fmuls  %f2, %f2, %f2
    fmovs  %f3, %f3
    nop
    nop
    fmuls  %f4, %f4, %f4
    fmovs  %f5, %f5
    cmp    %l0, 0
    bg     .LL7
    add    %l0, -1, %l0

Problems: Grouping

“UltraSPARC-I can execute up to four instructions per cycle. The first three instructions in a group occupy slots that are interchangeable with respect to resources…The fourth slot can only be used for PC-based branches or for floating-point instructions.” [UltraSPARC User’s Manual]

What the UltraSPARC Manual calls the “fourth slot” is actually the first sequential instruction in the group. So if the first sequential instruction in the group is not a floating-point operation or a CTI, then not all four instructions can be dispatched in one group.

In the example to the left, the first sequential instruction in each group is an fmovs instruction, so the loop can be executed in 3 cycles per iteration, and the running time for 100,000,000 iterations on an UltraSPARC-Enterprise 166 is 1.8 secs.
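As a cross-check of that figure (again assuming a clock of about 166 MHz): the loop above is 12 instructions; dispatched as 3 groups of 4, that is 3 cycles per iteration, and 3 × 100,000,000 cycles at 166 MHz comes to roughly 1.8 seconds.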

    nop
    fmovs  %f1, %f1
    nop
    fmuls  %f2, %f2, %f2
    fmovs  %f3, %f3
    nop
    nop
    fmuls  %f4, %f4, %f4
    fmovs  %f5, %f5
    cmp    %l0, 0
    bg     .LL7
    add    %l0, -1, %l0

In the second example, there is no way to align the groups in this loop such that the first instruction in each group is either a branch or a floating-point operation. So, on average, 3 out of every 7 cycles are allowed to execute 4 instructions (instead of 3), because a floating-point operation comes first in the group. Thus the execution of this loop takes 3.5 cycles per iteration on average, and the running time
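Working out the quoted rates (and again assuming the roughly 166 MHz clock): 3 cycles of 4 instructions plus 4 cycles of 3 instructions is 24 instructions every 7 cycles, i.e. two 12-instruction iterations per 7 cycles, or 3.5 cycles per iteration; at 166 MHz, 3.5 × 100,000,000 cycles would predict a running time of roughly 2.1 seconds.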

