Problems: Odd-Fetch - PowerPoint PPT Presentation

Presentation Transcript
Problems: Odd-Fetch

(nop padding placing the branch target L1 at word 1 of the I-cache line)

    nop   L1: cmp   add   add   bg L1   sub   nop   nop
    0     1         2     3     4       5     6     7

or

(nop padding placing the branch target L1 at word 3 of the I-cache line)

    nop   nop   nop   L1: cmp   add   add   bg L1   sub
    0     1     2     3         4     5     6       7

#ifdef FAST
int init (int i) { return (10*(i*1000)); }
#endif

#ifdef SLOW
int init (int i) { return (10*(i*1000 + 100)); }
#endif

int i, n, j, k;

int main()
{
  n = init(10000);
  j = 100000;
  k = 1342890;
  for (i = n; i > 0; i--) {
    j += (((j&0)+1));
    k += (((k&0)+1));
  }
  return 0;
}
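A note on the two variants: init(10000) returns 10*(10000*1000) = 100,000,000 iterations under FAST and 10*(10000*1000 + 100) = 100,001,000 under SLOW, so the loop does essentially the same work in both builds; the large timing difference reported below presumably comes from how the SLOW variant shifts the alignment of the code that follows (the point of this slide), not from the extra 1,000 iterations. One way to build and time both variants (the file name odd_fetch.c is illustrative, not from the original slides):

    cc -DFAST -O odd_fetch.c -o fast ; time ./fast
    cc -DSLOW -O odd_fetch.c -o slow ; time ./slow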

L1:
    cmp     %g1, 0
    add     %g2, 1, %g2
    add     %g5, 1, %g5
    bg,a,pt %icc, L1
    sub     %g1, 1, %g1

Problems: Odd-Fetch

“When the target of a branch is word 1 or word 3 of an I-cache line and the fourth instruction to be fetched is a branch, the branch prediction bits from the wrong pair of instructions is used.” [UltraSPARC-I User’s Manual]
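Each SPARC instruction is 4 bytes, so a 32-byte I-cache line holds 8 instructions (words 0-7). In the two nop paddings sketched at the top of this slide, the branch target L1 (the cmp of the loop shown above) sits at word 1 and word 3 respectively, and counting from the target, the fourth instruction fetched is the bg — exactly the condition described in the errata quoted above.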

Slow Alignments:

Performance:
                    Fast                   Slow
C code:             1.2 seconds            3.6 - 4.2 seconds
Assembly code:      2 cycles/iteration     6 - 7 cycles/iteration
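As a rough cross-check (assuming a clock of about 166 MHz, the figure suggested by the UltraSPARC-Enterprise 166 mentioned later in the deck): 2 cycles/iteration × 100,000,000 iterations is about 1.2 seconds, and 6-7 cycles/iteration gives about 3.6-4.2 seconds, which lines up with the measured C-code times above.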


Problems: Delay Slot

“…if the address of the instructions in an [I-cache] group cross a 32-byte boundary, an implicit branch is “forced” between instructions at address 31 and 32 (low order bits). That rule has a performance impact only if a branch is in that specific group. Care should be taken not to place a branch in a group that crosses this boundary…. A group containing instructions I0 (branch), I1, I2, and I3 will be broken, because an artificial branch is forced after address 31 and there is already a branch in the group.” [UltraSPARC User’s Manual]

...
A1:  Label1:
A2:      nop
A3:      br   Label3
A4:      nop
B1:  Label2:
B2:      cmp  %l0, 0
B3:      bg   Label1
B4:      add  %l0, -1, %l0
C1:  Label3:
C2:      nop
C3:      br   Label2
C4:      nop
...

For each group that crosses a 32-byte boundary (i.e., crosses an I-cache line boundary), an additional cycle of runtime is added per iteration.

Therefore:

# groups crossing a boundary    Time (secs)    Cycles/iteration
0                               1.8            3
1                               2.4            4
2                               3.0            5
3                               3.6            6
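Where the range 0-3 comes from: each group (A, B, or C) is four 4-byte instructions, i.e. 16 bytes, so whether it straddles a 32-byte line boundary depends only on where it starts; by shifting where the three groups are placed, anywhere from 0 to all 3 of them can be made to cross a boundary, and each crossing group adds one cycle to the 3-cycle base case in the first row.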

Cause:

Each of the groups shown above already has a branch in it, and when the group crosses a 32-byte boundary an “implicit branch” is added (as stated in the manual). But only 1 branch can be executed each cycle, so each I-cache group that crosses a 32-byte boundary and contains a branch takes 2 cycles to execute instead of 1.
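One quick way to see where a hot loop actually lands relative to 32-byte lines is to print the low bits of its address. This is only a sketch, assuming GCC's label-address extension (&&label); it is not from the original slides:

#include <stdio.h>

int main(void)
{
    volatile int i = 100;

loop:                               /* the loop whose placement we want to inspect */
    i = i - 1;
    if (i > 0) goto loop;

    /* The low 5 bits of the label's address give the loop's byte offset
       within a 32-byte-aligned block (one I-cache line on this machine);
       an offset greater than 16 means a 4-instruction group starting here
       would straddle a line boundary. */
    printf("loop starts at byte offset %lu within its 32-byte line\n",
           (unsigned long)&&loop & 31UL);
    return 0;
}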


Problems: Fetching Limitation

...
A1:  Label1:
A2:      nop
A3:      br   Label3
A4:      nop
B1:  Label2:
B2:      cmp  %l0, 0
B3:      bg   Label1
B4:      add  %l0, -1, %l0
C1:  Label3:
C2:      nop
C3:      br   Label2
C4:      nop
...

For each group that crosses a 32-byte boundary (i.e., crosses an I-cache line boundary), an additional cycle of runtime is added per iteration.

Therefore:

# groups crossing a boundary    Time (secs)    Cycles/iteration
0                               1.8            3
1                               2.4            4
2                               3.0            5
3                               3.6            6

Problems: Fetching Limitation

Basic Problem:

Only instructions from a SINGLE I-cache line can be fetched each cycle

- For straight-line code, this is not an issue

- For code with many CTIs (control transfer instructions), this can cause a slowdown even if all of the branches are predicted correctly

Cause:

Every time a group of instructions crosses a 32-byte boundary, two I-cache accesses must be used to fetch the whole group. In order for the processor to sustain a high rate of instruction dispatch, a high rate of instruction fetch must be occurring from the I-cache to the I-buffer. In the worst case here, though, only 1-2 instructions out of every group fetched are actually being executed, causing the total execution rate to be cut in half.


Problems: Grouping

    fmovs  %f1, %f1
    nop
    nop
    fmuls  %f2, %f2, %f2
    fmovs  %f3, %f3
    nop
    nop
    fmuls  %f4, %f4, %f4
    fmovs  %f5, %f5
    cmp    %l0, 0
    bg     .LL7
    add    %l0, -1, %l0

Problems: Grouping

“UltraSPARC-I can execute up to four instructions per cycle. The first three instructions in a group occupy slots that are interchangeable with respect to resources…The fourth slot can only be used for PC-based branches or for floating-point instructions.” [UltraSPARC User’s Manual]

What the UltraSPARC Manual calls the “fourth slot” is actually the first sequential instruction in the group. So if the first sequential instruction in the group is not a floating-point operation or a CTI, then not all four instructions can be dispatched in one group.

In the example to the left, the first sequential instruction in each group is an fmovs instruction, so the loop can be executed in 3 cycles per iteration, and the running time for 100,000,000 iterations on an UltraSPARC-Enterprise 166 is 1.8 secs.
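As a cross-check of that figure (again assuming a clock of about 166 MHz): the loop above is 12 instructions; dispatched as 3 groups of 4, that is 3 cycles per iteration, and 3 × 100,000,000 cycles at 166 MHz comes to roughly 1.8 seconds.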

    nop
    fmovs  %f1, %f1
    nop
    fmuls  %f2, %f2, %f2
    fmovs  %f3, %f3
    nop
    nop
    fmuls  %f4, %f4, %f4
    fmovs  %f5, %f5
    cmp    %l0, 0
    bg     .LL7
    add    %l0, -1, %l0

In the second example, there is no way to align the groups in this loop such that the first instruction in each group is either a branch or a floating-point operation. So, on average, 3 out of every 7 cycles are allowed to execute 4 instructions (instead of 3), because a floating-point operation comes first in the group. Thus the execution of this loop takes 3.5 cycles per iteration on average, and the running time
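Working out the quoted rates (and again assuming the roughly 166 MHz clock): 3 cycles of 4 instructions plus 4 cycles of 3 instructions is 24 instructions every 7 cycles, i.e. two 12-instruction iterations per 7 cycles, or 3.5 cycles per iteration; at 166 MHz, 3.5 × 100,000,000 cycles would predict a running time of roughly 2.1 seconds.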

