
Finding the Limits of Hardware Optimization through Software De-optimization

De-optimizations ATTACK!!!

Presented by: Derek Kern, Roqyah Alalqam, Ahmed Mehzer, Mohammed Mohammed

Outline
  • Flashback
    • Project Structure
    • Judging de-optimizations
    • What does a de-op look like?
    • General Areas of Focus
      • Instruction Fetching and Decoding
      • Instruction Scheduling
      • Instruction Type Usage (e.g. Integer vs. FP)
      • Branch Prediction
    • Idiosyncrasies
Outline (continued)
  • Our Methods
    • Measuring clock cycles
    • Eliminating noise
  • Something about the de-ops that didn’t work
  • Lots and lots of de-ops
Flashback

During the research project

  • We studied de-optimizations
  • We studied the Opteron

For the implementation project

  • We have chosen de-optimizations to implement
  • We have chosen algorithms that may best reflect our de-optimizations
  • We have implemented the de-optimizations
  • …And, we’re here to report the results
Flashback

Judging de-optimizations (de-ops)

  • Whether a de-op affects scheduling, caching, branching, etc., its impact will be felt in the clock cycles needed to execute an algorithm
  • So, our metric of choice will be CPU clock cycles

What does a de-op look like?

  • A de-op is a change to an optimal implementation of an algorithm that increases the clock cycles needed to execute the algorithm and that demonstrates some interesting fact about the CPU in question
Our Methods
  • The CPUs
    • AMD Opteron (Hydra)
    • Intel Nehalem (Derek’s Laptop)
  • Our primary focus was the Opteron
  • The de-optimizations were designed to affect the Opteron
  • We also tested them on the Intel in order to give you an idea of how universal a de-optimization is
  • When we know why something does or doesn’t affect the Intel, we will try to let you know
Our Methods
  • The code
    • Most of the de-optimizations are written in C (GCC)
    • Some of them have a wrapper that is written in C, while the code being de-optimized is written in NASM (assembly)
    • E.g.
        • Mod_ten_counter
        • Factorial_over_array
    • Typically, if a de-op is written in NASM, then the C wrapper does all of the grunt work prior to calling the de-optimized NASM module
Our Methods
  • Problem: How do we measure clock cycles?
  • An obvious answer
    • CodeAnalyst
    • Actually, we were getting strange results from CodeAnalyst
    • …And, it is hard to separate important code sections from unimportant code sections
    • …And, it is cumbersome to work with
Our Methods
  • A better answer
    • Embed code that measures clock cycles for important sections
    • OK… but how?

Answer: Read the CPU Timestamp Counter

#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo ) | ( ((unsigned long long)hi) << 32 );
}
#endif

Our Methods
  • CPU Timestamp Counter
    • In all x86 CPUs since the Pentium
    • Counts the number of clock cycles since the last reset
    • It’s a little tricky in multi-core environments
    • Care must be taken to control the cores that do the relevant processing
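For illustration, below is a minimal sketch of pinning a process to a single core programmatically before reading the timestamp counter. This is our own example of the idea; the project itself restricts cores with taskset/bpsh and start /affinity (next slide), and the choice of core 3 here is arbitrary.

/* Sketch (Linux, GCC): pin the current process to one core so that successive
 * rdtsc readings come from the same core's timestamp counter.
 * Illustrative only; not part of the project code. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);                              /* run only on CPU 3 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    unsigned long long start = rdtsc();
    /* ... section being measured ... */
    printf("Cycles=%llu\n", rdtsc() - start);
    return 0;
}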
Our Methods
  • CPU Timestamp Counter

Windows:

Runs the executable on core 3 (of 1 – 4)

start /realtime /affinity 4 /b <exe name> <arguments>

Linux (Hydra):

Runs the executable on node 11, CPU 3 (of 0 – 11)

bpsh 11 taskset 0x000000008 <exe name> <arguments>

So, by restricting our runs to specific CPUs, we can rely on the CPU timestamp values

Our Methods
  • CPU Timestamp Counter
    • Wrapping code so that clock cycles can be counted

//
// Send the array off to be counted by the assembly code
//
unsigned long long start = rdtsc();

#ifdef _WIN64
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif _WIN32
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif __linux__
    _mod_ten_counter( counts, mod_ten_array, size_of_array );
#endif

printf( "Cycles=%llu\n", ( rdtsc() - start ) );

The important section is wrapped and the number of clock cycles will be the difference between the start and the finish

Our Methods
  • Eliminating noisy results
    • Even with our precautions, there can be some noise in the clock cycles
    • So, we need lots of iterations that we can use to generate a good average
    • But, this can be very, very time consuming
    • How, oh how?

Answer: The Version Tester

Our Methods
  • Eliminating noisy results – The Version Tester
    • Used to iteratively test executables
    • Expects each executable to report the number of cycles that need to be counted
    • Remember this?

//
// Send the array off to be counted by the assembly code
//
unsigned long long start = rdtsc();

#ifdef _WIN64
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif _WIN32
    mod_ten_counter( counts, mod_ten_array, size_of_array );
#elif __linux__
    _mod_ten_counter( counts, mod_ten_array, size_of_array );
#endif

printf( "Cycles=%llu\n", ( rdtsc() - start ) );

Our Methods
  • Eliminating noisy results – The Version Tester
    • Runs each executable for a specified number of iterations and then averages the number of cycles

Example run on Hydra:

> bpsh 10 taskset 0x000000004 version_tester mtc.hydra-core3.config
Running Optimized for 1000 for 200 iterations
Done running Optimized for 1000 with an average of 19058 cycles
Running De-optimized #1 for 1000 for 200 iterations
Done running De-optimized #1 for 1000 with an average of 21039 cycles
Running Optimized for 10000 for 200 iterations
Done running Optimized for 10000 with an average of 187296 cycles
Running De-optimized #1 for 10000 for 200 iterations
Done running De-optimized #1 for 10000 with an average of 206060 cycles

This runs version_tester on CPU 2 and mod_ten_counter on CPU 3

Our Methods
  • Eliminating noisy results – The Version Tester
    • Running

Command Format

version_tester <tester_configuration>

Configuration File (for Hydra)

ITERATIONS=200
__EXECUTABLES__
Optimized for 1000=taskset 0x000000008 ./mod_ten_counter_op 1000
De-optimized #1 for 1000=taskset 0x000000008 ./mod_ten_counter_deop 1000
Optimized for 10000=taskset 0x000000008 ./mod_ten_counter_op 10000
De-optimized #1 for 10000=taskset 0x000000008 ./mod_ten_counter_deop 10000
Optimized for 100000=taskset 0x000000008 ./mod_ten_counter_op 100000
De-optimized #1 for 100000=taskset 0x000000008 ./mod_ten_counter_deop 100000
Optimized for 1000000=taskset 0x000000008 ./mod_ten_counter_op 1000000
De-optimized #1 for 1000000=taskset 0x000000008 ./mod_ten_counter_deop 1000000

Our Methods
  • Eliminating noisy results – The Version Tester
    • Running

Configuration File (for Windows):

ITERATIONS=200
__EXECUTABLES__
Optimized for 10=.\mod_ten_counter\mod_ten_counter_op 10
De-optimized #1 for 10=.\mod_ten_counter\mod_ten_counter_deop 10
Optimized for 100=.\mod_ten_counter\mod_ten_counter_op 100
De-optimized #1 for 100=.\mod_ten_counter\mod_ten_counter_deop 100
Optimized for 1000=.\mod_ten_counter\mod_ten_counter_op 1000
De-optimized #1 for 1000=.\mod_ten_counter\mod_ten_counter_deop 1000
Optimized for 10000=.\mod_ten_counter\mod_ten_counter_op 10000
De-optimized #1 for 10000=.\mod_ten_counter\mod_ten_counter_deop 10000
Optimized for 100000=.\mod_ten_counter\mod_ten_counter_op 100000
De-optimized #1 for 100000=.\mod_ten_counter\mod_ten_counter_deop 100000
Optimized for 1000000=.\mod_ten_counter\mod_ten_counter_op 1000000
De-optimized #1 for 1000000=.\mod_ten_counter\mod_ten_counter_deop 1000000
Optimized for 10000000=.\mod_ten_counter\mod_ten_counter_op 10000000
De-optimized #1 for 10000000=.\mod_ten_counter\mod_ten_counter_deop 10000000

Our Methods
  • Eliminating noisy results – The Version Tester
    • Therefore, using the Version Tester, we can iterate hundreds or thousands of times in order to obtain a solid average number of cycles (a sketch of the idea follows below)
  • So, we believe our results fairly represent the CPUs in question
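Below is a minimal sketch of the iterate-and-average idea behind the Version Tester. It is our own illustration, not the project's actual version_tester source; it assumes, as shown earlier, that each tested executable prints a line of the form Cycles=<n>.

/* Sketch only: run a configured command repeatedly and average the reported cycles. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static unsigned long long average_cycles(const char *command, int iterations)
{
    unsigned long long total = 0;
    char line[256];

    for (int i = 0; i < iterations; i++) {
        FILE *p = popen(command, "r");               /* launch the executable */
        if (!p) { perror("popen"); exit(1); }
        while (fgets(line, sizeof(line), p)) {
            if (strncmp(line, "Cycles=", 7) == 0)    /* parse its cycle report */
                total += strtoull(line + 7, NULL, 10);
        }
        pclose(p);
    }
    return total / iterations;
}

int main(void)
{
    /* Command and iteration count mirror the Hydra configuration shown above */
    printf("Average: %llu cycles\n",
           average_cycles("taskset 0x000000008 ./mod_ten_counter_op 1000", 200));
    return 0;
}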
In what follows…
  • You are going to see the various de-optimizations that we implemented and the corresponding results
  • These de-optimizations were tested using the Version Tester and were executed while restricting the execution to a single core (CPU)
But first…
  • …something about the de-optimizations that were less than successful
  • Branch Patterns
    • Remember: We wanted to challenge the CPU with branching patterns that could force misses
    • This turned out to be very difficult to do
    • Random data caused a significant slowdown, but random data will break any branch prediction mechanism
    • The branch prediction mechanism on the Opteron is very, very good
But first…
  • Unpredictable Instructions - Recursion
    • Remember: Writing recursive functions that call other functions near their return
    • This was supposed to overload the return address buffer and cause mispredictions
    • It turned out to be very difficult to implement
    • We never really showed any performance degradation
    • So, don’t worry about this one
So, without further ado...

The results of De-optimization

Dependency Chain

De-Optimization Results

Area: Instruction Scheduling

Dependency Chain: Flashback
  • Description
    • As we have seen in this class, data dependences have an impact on ILP
    • Dynamic scheduling, as we saw, can eliminate WAW and WAR dependences
    • However, past a certain point dynamic scheduling can be overwhelmed, which hurts performance, as we will see next
  • The Opteron
    • The Opteron, like all other architectures, is highly affected by data hazards
    • The point of this de-optimization is to show the impact of a data dependency chain on performance
Dependency Chain
  • dependency_chain.exe
    • We implemented two versions of a program called ‘dependency_chain’
    • The program takes an array size as an argument
    • It then generates an array of the specified size in which each element is populated with an integer x where 0 <= x <= 20
    • The array’s elements are summed, and the output is the number of cycles taken by the program
Dependency Chain
  • dependency_chain.exe
    • The optimized version adds the elements of the array by striding through it in four-element chunks, accumulating into four different temporary variables
    • The four temporary variables are then added together
    • The advantage is that this creates four shorter dependency chains instead of one massive one
    • In the de-optimized version, every element of the array is summed into a single variable
    • This creates a massive dependency chain that quickly exhausts the scheduling resources of the dynamic scheduler
Dependency Chain
  • dependency_chain.exe

Source

Optimized

for ( i = 0; i < size_of_array; i += 4 ) {
    sum1 += test_array[i];
    sum2 += test_array[i + 1];
    sum3 += test_array[i + 2];
    sum4 += test_array[i + 3];
}
sum = sum1 + sum2 + sum3 + sum4;

De-Optimized

for ( i = 0; i < size_of_array; i++ ) {
    sum += test_array[i];
}
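One caveat (our note, not from the slides): the deck does not state the compiler flags used. Because integer addition is associative, an aggressive GCC build can unroll or vectorize the single-accumulator loop and erase the source-level difference, so experiments like this are usually built with optimization dialed down or checked against the generated assembly. For example:

# Our assumption about how such a build might be done; the presentation does not show its flags,
# and the file name dependency_chain_deop.c is hypothetical
gcc -O0 -o dependency_chain_deop dependency_chain_deop.c       # keep the written loop structure
gcc -O2 -S dependency_chain_deop.c -o dependency_chain_deop.s  # or inspect the generated code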

Dependency Chain: Results
  • dependency_chain.exe
    • The chart below shows that not breaking up a dependency chain can be extraordinarily costly. On the Opteron, it caused roughly a 150% slowdown for all array sizes
    • The scheduling resources of the Opteron become overwhelmed, essentially causing the program to run sequentially, i.e. with no ILP
    • The Nehalem was impacted by this de-optimization too; given the lesser impact, one can only imagine that it has more scheduling resources

[Chart: Difference between Optimized and De-Optimized Versions, in clock cycles]

Dependency Chain: Upshot
  • Lessons
    • The code for the de-optimization is so natural that it is a little scary. It is elegant and parsimonious
    • However, this elegance and parsimony may come at a very high cost
    • If you don’t get the performance that you expect from a program, then it is definitely worth looking for these types of dependency chains
    • Break these chains up to give dynamic schedulers more scheduling options
High Instruction Latency

De-Optimization Results

Area: Instruction Fetching and Decoding

High Instruction Latency: Flashback
  • Description
    • CPUs often have instructions that can perform almost the same operation
    • Yet, in spite of their seeming similarity, they have very different latencies. By choosing the high-latency version when the low-latency version would suffice, code can be de-optimized
  • The Opteron
    • The LOOP instruction on the Opteron has a latency of 8 cycles, while a test (like DEC) plus a jump (like JNZ) has a latency of less than 4 cycles
    • Therefore, substituting LOOP instructions for DEC/JNZ combinations is a de-optimization
High Instruction Latency
  • fib.exe
    • We implemented a program called ‘fib’
    • It takes an array size as an argument
    • A Fibonacci number is calculated for each element in the array
High Instruction Latency
  • fib.exe
    • The Fibonacci numbers are calculated in assembly code
    • The optimized version uses DEC and JNZ instructions, which together take up to 4 cycles
    • The de-optimized version uses the LOOP instruction, which takes 8 cycles
High Instruction Latency
  • fib.exe

Source

Optimized

calculate:
    mov edx, eax
    add ebx, edx
    mov eax, ebx
    mov dword [edi], ebx
    add edi, 4
    dec ecx
    jnz calculate

De-Optimized

calculate:
    mov edx, eax
    add ebx, edx
    mov eax, ebx
    mov dword [edi], ebx
    add edi, 4
    loop calculate

High Instruction Latency
  • fib.exe

Compiled

Optimized

08048481 <calculate>:
 8048481: 89 c2                mov    %eax,%edx
 8048483: 01 d3                add    %edx,%ebx
 8048485: 89 d8                mov    %ebx,%eax
 8048487: 89 1f                mov    %ebx,(%edi)
 8048489: 81 c7 04 00 00 00    add    $0x4,%edi
 804848f: 49                   dec    %ecx
 8048490: 75 ef                jne    8048481 <calculate>

De-Optimized

08048481 <calculate>:
 8048481: 89 c2                mov    %eax,%edx
 8048483: 01 d3                add    %edx,%ebx
 8048485: 89 d8                mov    %ebx,%eax
 8048487: 89 1f                mov    %ebx,(%edi)
 8048489: 81 c7 04 00 00 00    add    $0x4,%edi
 804848f: e2 f0                loop   8048481 <calculate>

High Instruction Latency: Results
  • fib.exe
    • In the chart below, we can see that the optimized version significantly outperforms the de-optimized version. The results on the Nehalem are even more impressive

[Chart: Difference between Optimized and De-Optimized Versions, in clock cycles]

High Instruction Latency: Upshot
  • Lessons
    • As we have seen, seemingly interchangeable instructions can hurt your program if you do not choose between them carefully
    • It is important to know which instructions take more cycles so that they can be avoided where possible
Costly Instructions

De-Optimization Results

Area: Instruction Type Usage

Costly Instructions: Flashback
  • Description
    • Some instructions can do the same job as others, but at a higher cost in cycles
  • The Opteron
    • Integer division on the Opteron costs 22–47 cycles signed and 17–41 cycles unsigned
    • Multiplication, signed or unsigned, takes only 3–8 cycles
Costly Instructions
  • mult_vs_div_deop_1.exe & mult_vs_div_op.exe
    • We implemented two programs: an optimized and a de-optimized version
    • They take an array size as an argument; the array is initialized randomly with powers of 2 (less than or equal to 2^12)
    • The de-optimized version divides each element by 2.0. The optimized version multiplies each element by 0.5.
    • The two versions are functionally equivalent
Costly Instructions
  • mult_vs_div_deop_1.exe & mult_vs_div_op.exe

Optimized

for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] * 0.5;
}

De-optimized

for ( i = 0; i < size_of_array; i++ ) {
    test_array[i] = test_array[i] / 2.0;
}
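The initialization itself is not shown in the deck. A minimal sketch of what populating the array with random powers of two up to 2^12 could look like (the element type double and the helper name are our assumptions, not the project's code):

/* Sketch only: fill the array with random powers of 2, each <= 2^12 (4096).
 * The presentation does not show its initialization; names and types are ours. */
#include <stdlib.h>

void init_powers_of_two(double *test_array, int size_of_array)
{
    for (int i = 0; i < size_of_array; i++)
        test_array[i] = (double)(1 << (rand() % 13));   /* 2^0 .. 2^12 */
}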

Costly Instructions: Results
  • mult_vs_div_deop_1.exe & mult_vs_div_op.exe
    • Looking at the chart below, you can see that this de-optimization has a clear impact on the Opteron, about 23% on average. It also affects the Nehalem, although not as strongly as the Opteron

[Chart: Difference between Optimized and De-Optimized Versions, in clock cycles]

Costly Instructions: Upshot
  • Lessons
    • Small changes in your code can have a real impact on performance
    • It is important to know the cost differences between instructions
    • Seek out the cheaper instruction when possible
Costly Instructions

De-Optimization Results

Area: Instruction Type Usage

Costly Instructions: Flashback
  • Description
    • Some instructions can do the same job as others, but at a higher cost in cycles

Example:

float f1, f2;
if (f1 < f2)

This is a common pattern for programmers, and it can be treated as a de-optimization technique

  • The Opteron
    • Branches based on floating-point comparisons are often slow
Costly Instructions
  • Compare_two_floats.exe
    • We implemented a program called ‘Compare_two_floats’
    • It takes a number of iterations as an argument
    • The program repeatedly compares two floating-point numbers
Costly Instructions
  • Compare_two_floats_deop.exe & Compare_two_floats_op.exe
    • In the de-optimized version, we compare the two floats in the usual way, as shown on the next slide
    • In the optimized version, we instead reinterpret the float as an integer and branch on that
    • The condition was deliberately chosen so that it is not taken every time
Costly Instructions
  • Compare_two_floats_deop.exe & Compare_two_floats_op.exe

Optimized

for (j = 0; j < numberof_iteration; j++)
{
    if (FLOAT2INTCAST(t) <= 0)
    {
        Count_numbers(j);
        count++;
    }
    else
        count++;
}

De-Optimized

for (i = 0; i < numberof_iteration; i++)
{
    if (f1 <= f2)
    {
        Count_numbers(i);
        count++;
    }
    else
        count++;
}
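FLOAT2INTCAST is not defined on the slide. It is the usual trick of reinterpreting a float's bit pattern as a signed integer so that the sign test can be done with an integer compare instead of a floating-point compare; a sketch of how such a macro is commonly defined (the memcpy form is the portable, strict-aliasing-safe variant):

/* Reinterpret a float's bits as a signed int so a test against zero becomes an
 * integer comparison. The slide does not show this definition; sketched here. */
#define FLOAT2INTCAST(f)  (*((int *)(&(f))))

/* Portable alternative that avoids strict-aliasing problems: */
#include <string.h>
static inline int float_as_int(float f)
{
    int i;
    memcpy(&i, &f, sizeof i);
    return i;
}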

Costly Instructions: Results
  • Compare_two_floats.exe
    • The chart below shows only a small impact on the Opteron; however, the results for the Nehalem were surprising, even though the de-optimization was designed primarily for the Opteron

[Chart: Difference between Optimized and De-Optimized Versions, in clock cycles]

Costly Instructions: Upshot
  • Lessons
    • Float comparisons are usually more expensive than integer comparisons in terms of cycles
    • Even though the Opteron handled this test well, that does not mean your CPU will do the same!
    • Float comparisons still have a big impact on the Nehalem
    • Again, great care must be taken when a program performs many float comparisons
Loop Re-rolling

De-Optimization Results

Area: Instruction Scheduling

Loop Re-rolling: Flashback
  • Description
      • Loops not only affect branch prediction. They can also affect dynamic scheduling. How?
      • Suppose instructions 1 and 2 sit within loops A and B, respectively, but could have been part of a single unified loop. If they were, they could be scheduled together; kept separate, they cannot be
  • The Opteron
      • Given that the Opteron is 3-way superscalar, this de-optimization could significantly reduce IPC
      • The de-optimization is two consecutive loops, each containing one or more instructions, such that the loops could have been combined
Loop Re-rolling
  • Loop_re_rolling_deop.exe & loop_re_rolling_op.exe
    • We implemented two programs: optimized and de-optimized versions
    • They take an array size as an argument and initialize the array randomly
    • The cube and the square of each element in the array are calculated
    • In the de-optimized version, the cubic and quadratic calculations are placed in two consecutive loops; in the optimized version, they are combined into the same loop
    • Both versions are functionally equivalent
Loop Re-rolling
  • Loop_re_rolling_deop.exe & loop_re_rolling_op.exe
    • We want to show whether removing some of the flexibility available to the dynamic scheduler affects the number of cycles
    • The two calculations are not expected to be scheduled at the same time in the de-optimized version; the de-optimization should prevent this
Loop Re-rolling
  • Loop_re_rolling_deop.exe & loop_re_rolling_op.exe

Optimized

for ( i = 0; i < size_of_array; i++ )
{
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}

De-optimized

for ( i = 0; i < size_of_array; i++ )
{
    quadratic_array[i] = load_store_array[i] * load_store_array[i];
}
for ( i = 0; i < size_of_array; i++ )
{
    cubic_array[i] = load_store_array[i] * load_store_array[i] * load_store_array[i];
}
Loop Re-rolling: Results
  • Loop_re_rolling_deop.exe & loop_re_rolling_op.exe
    • The slowdown is definitely large for the Opteron: almost 50% on average
    • It is large for the Nehalem as well
    • These results show how much difference it makes to give the dynamic scheduler something to work with

[Chart: Difference between Optimized and De-Optimized Versions, in clock cycles]

Loop Re-rolling: Upshot
  • Lessons
    • Dynamic scheduling is absolutely important
    • Loops should be used carefully; combine them when it is possible
    • Instructions that do not depend on each other (no true dependences) give the dynamic scheduler more options, which enhances performance, especially when those instructions are executed frequently
Store-to-load dependency

De-Optimization Results

Area: Instruction Type Usage

Store-to-load dependency: Flashback
  • Description
    • A store-to-load dependency takes place when stored data needs to be loaded again shortly afterwards
    • This type of dependency increases pressure on the load/store unit and can cause the CPU to stall, especially when it occurs frequently
      • In other words, the code loads data that was stored only a few instructions earlier
  • The Opteron
Store-to-load dependency
  • dependecy_deop.exe & dependency_op.exe
    • We implemented two versions of the dependency program, one optimized and the other de-optimized
    • They take an array size as an argument; the array is initialized randomly
    • Both versions perform a prefix sum over the array. Thus, the final array element will contain the sum of itself and all previous elements in the array
Store-to-load dependency
  • dependecy_deop.exe & dependency_op.exe
    • In the de-optimized version, the same array elements are stored and then loaded again almost immediately
    • In the optimized version, we use temporary variables to avoid this type of dependency
    • The optimized code has more instructions, and one would normally expect the extra instructions to cost more cycles than the shorter version
Store-to-load dependency
  • dependecy_deop.exe & dependency_op.exe

Optimized

for ( i = 3; i < size_of_array; i += 3 )
{
    temp2 = test_array[i - 2] + temp_prev;
    temp1 = test_array[i - 1] + temp2;
    test_array[i - 2] = temp2;
    test_array[i - 1] = temp1;
    test_array[i] = temp_prev = test_array[i] + temp1;
}

De-optimized

for ( i = 1; i < size_of_array; i++ )
{
    test_array[i] = test_array[i] + test_array[i - 1];
}

Store-to-load dependency: Results
  • dependecy_deop.exe & dependency_op.exe
    • In the chart below, we can see that this de-optimization causes an average slowdown of about 60% on the Opteron, which is a huge difference. The Nehalem was affected by this code as well

[Chart: Difference between Optimized and De-Optimized Versions, in clock cycles]

Store-to-load dependency: Upshot
  • Lessons
    • Store-to-load dependencies are something you should be aware of
    • Writing more instructions does not always mean your program will run slower
    • Avoiding this common store-then-load pattern can have a real, positive impact on your program
Costly Behavior

De-Optimization Results

Area: Instruction Type Usage

Costly common usage: Flashback
  • Description
    • Conditional statements are an active player in almost everyone’s code. Would you believe that a minor change to one could have a real effect on your program?
    • The order in which the conditions are checked is SO IMPORTANT
    • Most platforms check the conditions in the same (written) order, regardless of the programming language you use
Costly Instructions
  • IF_Condition.exe
    • We implemented two versions of a program called ‘IF_Condition’
    • It takes a number of iterations as an argument and initializes an array randomly with floats between 0.5 and 11.0
    • For each element in the array, we add one to a dummy variable if its index is equal to 0 (mod 2) and its value is greater than 1.5
Costly Instructions
  • IF_deop.exe & IF_op.exe
    • The if statement holds true only if both conditions are true
    • In the de-optimized version, the condition that is more likely to be false comes second; in the optimized version it comes first, so short-circuit evaluation can usually skip the other check
Costly Instructions
  • IF_deop.exe & IF_op.exe

Optimized

for ( i = 0; i < size_of_array; i++ )
{
    mod = ( i % 2 );
    if ( mod == 0 && test_array[i] > 1.5 )
        dummy++;
    else
        dummy--;
}

De-Optimized

for ( i = 0; i < size_of_array; i++ )
{
    mod = ( i % 2 );
    if ( test_array[i] > 1.5 && mod == 0 )
        dummy++;
    else
        dummy--;
}

Costly Instructions: Results
  • IF_Condition.exe
    • The chart below shows that the optimized version outperforms the de-optimized version on both the Opteron and the Nehalem

[Chart: Difference between Optimized and De-Optimized Versions, in clock cycles]

Costly Instructions: Upshot
  • Lessons
    • Conditional statements can have a negative impact if their ordering is ignored
    • We implemented one case (&&); other cases would be equivalent in terms of added cycles
    • If you can tell which condition is more likely to be false, putting it in the right position will save cycles
Branch Density

De-Optimization Results

Area: Branch Prediction

Branch Density: Flashback
  • Description
    • This de-optimization attempts to overwhelm the CPU’s ability to predict branches by packing branches as tightly as possible
    • Whether or not a bubble is created is dependent upon the hardware
    • However, at some point, the hardware can only predict so much and pre-load so much code
  • The Opteron
    • The Opteron's BTB (Branch Target Buffer) can only maintain 3 (used) branch entries per (aligned) 16 bytes of code [AMD05]
    • Thus, the Opteron cannot successfully maintain predictions for all of the branches within the following sequence of instructions
Branch Density: Flashback

  401399: 8b 44 24 10           mov    0x10(%esp),%eax
  40139d: 48                    dec    %eax
  40139e: 74 7a                 je     40141a <_mod_ten_counter+0x8a>
  4013a0: 8b 0f                 mov    (%edi),%ecx
  4013a2: 74 1b                 je     4013bf <_mod_ten_counter+0x2f>
  4013a4: 49                    dec    %ecx
  4013a5: 74 1f                 je     4013c6 <_mod_ten_counter+0x36>
  4013a7: 49                    dec    %ecx
  4013a8: 74 25                 je     4013cf <_mod_ten_counter+0x3f>
  4013aa: 49                    dec    %ecx
  4013ab: 74 2b                 je     4013d8 <_mod_ten_counter+0x48>
  4013ad: 49                    dec    %ecx
  4013ae: 74 31                 je     4013e1 <_mod_ten_counter+0x51>
  4013b0: 49                    dec    %ecx
  4013b1: 74 37                 je     4013ea <_mod_ten_counter+0x5a>
  4013b3: 49                    dec    %ecx
  4013b4: 74 3d                 je     4013f3 <_mod_ten_counter+0x63>
  4013b6: 49                    dec    %ecx
  4013b7: 74 43                 je     4013fc <_mod_ten_counter+0x6c>
  4013b9: 49                    dec    %ecx
  4013ba: 74 49                 je     401405 <_mod_ten_counter+0x75>
  4013bc: 49                    dec    %ecx
  4013bd: 74 4f                 je     40140e <_mod_ten_counter+0x7e>

Branch Density
  • mod_ten_counter.exe
    • We implemented a program called ‘mod_ten_counter’
    • It takes an array size as an argument
    • The array is populated with a repeating pattern of consecutive integers from zero to nine
        • Like: 012345678901234567890123456789…
        • In other words, the contents are not random
    • Very simply, it counts the number of times that each integer (0 – 9) appears within the array
Branch Density
  • mod_ten_counter.exe
    • The optimized version maintains proper spacing between branch instructions
    • The de-optimized version (seen on the previous slide) has densely packed branches
    • Notes:
      • The spacing in the optimized version is achieved with NOP instructions
      • The optimized version has one extra NOP per branch, so it executes roughly 5 more instructions per iteration than the de-optimized version
      • Thus, if the optimized version outperforms the de-optimized version, the difference is even more impressive
Branch Density
  • mod_ten_counter.exe

Source

Optimized

cmp ecx, 0
je mark_0      ; We have a 0
nop
dec ecx
je mark_1      ; We have a 1
nop
dec ecx
je mark_2      ; We have a 2
nop
dec ecx
je mark_3
.
.
.

De-Optimized

cmp ecx, 0
je mark_0      ; We have a 0
dec ecx
je mark_1      ; We have a 1
dec ecx
je mark_2      ; We have a 2
dec ecx
je mark_3
.
.
.

Branch Density: Results
  • mod_ten_counter.exe
    • As you can see from the chart below, in spite of its handicap, the optimized version significantly outperforms the de-optimized version
    • Interestingly, this de-optimization is more impressive on the Intel, even though it was designed with the Opteron in mind

[Chart: Difference between Optimized and De-Optimized Versions, in clock cycles]

Branch Density: Results
  • So, what’s up with the Nehalem?
    • The Nehalem performs well generally, but is very susceptible to this de-optimization. Why?
    • There isn’t great information on this facet of the Nehalem
    • But…
      • The Nehalem can handle 4 active branches per 16 bytes
      • The misprediction penalty is ~17 cycles so the Nehalem has a long pipeline
      • Therefore, missing the BTB is probably very costly as well
Branch Density: Upshot
  • Lessons
    • Branch density can adversely affect performance and make otherwise predictable branches unpredictable
    • Great care must be taken when designing branches, if-then-else structures and case-switch structures
Unpredictable Instructions

De-Optimization Results

Area: Branch Prediction

Unpredictable Instructions: Flashback
  • Description
    • Some CPUs can only track a limited number of branch instructions within a certain number of bytes
    • If this limit is exceeded, or if branch instructions are not aligned properly, then branches cannot be predicted
  • The Opteron
    • The return (RET) instruction may take up only one byte
    • If a branch instruction immediately precedes a one-byte RET instruction, then the RET cannot be predicted
    • A one-byte RET instruction can therefore cause a misprediction even when there is only one branch instruction per 16 bytes
    • Alignment: there are 9 branch indicators, associated with byte addresses 0, 1, 3, 5, 7, 9, 11, 13, and 15 within each 16-byte segment
Unpredictable Instructions
  • factorial_over_array.exe
    • We implemented a program called ‘factorial_over_array’
    • It takes an array size as an argument
    • The array is populated with random integers between 1 and 12

e.g. { 3, 7, 4, 10, 9, 1, 5, 2, 12 }

    • Factorial is calculated for each element in the array
Unpredictable Instructions
  • factorial_over_array.exe
    • The factorial is calculated in assembly code
    • In the optimized version, the RET instruction is aligned using a NOP so that it is not immediately next to another branch and so that it falls on an odd byte address within the 16-byte segment
    • In the de-optimized version, the RET instruction sits immediately next to a branch instruction and falls on an even byte address within the 16-byte segment
Unpredictable Instructions
  • factorial_over_array.exe

Source

Optimized

global _factorial
section .text
_factorial:
    nop
    mov eax, [esp+4]
    cmp eax, 1
    jne calculate
    nop
    ret
calculate:
    dec eax
    push eax
    call _factorial
    add esp, 4
    imul eax, [esp+4]
    ret

De-Optimized

global _factorial
section .text
_factorial:
    nop
    mov eax, [esp+4]
    cmp eax, 1
    jne calculate
    ret
calculate:
    dec eax
    push eax
    call _factorial
    add esp, 4
    imul eax, [esp+4]
    ret

Unpredictable Instructions
  • factorial_over_array.exe

Compiled

Optimized

 0: 90                nop
 1: 8b 44 24 04       mov    0x4(%esp),%eax
 5: 83 f8 01          cmp    $0x1,%eax
 8: 75 02             jne    c <_factorial+0xc>
 a: 90                nop
 b: c3                ret
 c: 48                dec    %eax
 d: 50                push   %eax
 e: e8 ed ff ff ff    call   0 <_factorial>
13: 83 c4 04          add    $0x4,%esp
16: 0f af 44 24 04    imul   0x4(%esp),%eax
1b: c3                ret

De-Optimized

 0: 90                nop
 1: 8b 44 24 04       mov    0x4(%esp),%eax
 5: 83 f8 01          cmp    $0x1,%eax
 8: 75 01             jne    b <_factorial+0xb>
 a: c3                ret
 b: 48                dec    %eax
 c: 50                push   %eax
 d: e8 ee ff ff ff    call   0 <_factorial>
12: 83 c4 04          add    $0x4,%esp
15: 0f af 44 24 04    imul   0x4(%esp),%eax
1a: c3                ret

Unpredictable Instructions: Results
  • factorial_over_array.exe
    • As you can see from the chart below, the optimized version significantly outperforms the de-optimized version
    • Interestingly, this de-optimization has an inconclusive effect on the Nehalem

[Chart: Difference between Optimized and De-Optimized Versions, in clock cycles]

Unpredictable Instructions: Upshot
  • Lessons
    • Alignment is one of many ways that instructions can become unpredictable
    • These constant misses can be very costly
    • Again, great care must be taken. Brevity, at times, can create inefficiencies
Conclusion
  • We’ve shown you lots of de-optimizations
  • Most of them were successful
  • So, now, you know some of the costs associated with ignoring CPU architecture when writing code
  • If you are like us, then you must be reconsidering how you write software
  • As you’ve seen, some of the simple habits that you’ve accumulated may be causing your code to run more slowly than it would have otherwise
References

[AMD05] AMD64 Technology. Software Optimization Guide for AMD64 Processors, 2005.

[AMD11] AMD64 Technology. AMD64 Architecture Programmer's Manual, Volume 1: Application Programming, 2011.

[AMD11] AMD64 Technology. AMD64 Architecture Programmer's Manual, Volume 2: System Programming, 2011.

[AMD11] AMD64 Technology. AMD64 Architecture Programmer's Manual, Volume 3: General-Purpose and System Instructions, 2011.

Questions?

Thank You