Presentation Transcript

Tips and Tricks: Visual C++ 2005 Optimization Best Practices

Kang Su Gatlin

TLNL04

Program Manager

Visual C++

Microsoft Corporation


6 Tips/Best Practices To Help Any C++ Dev Write Faster Code

  • Managed + Unmanaged
    • Pick the right level of optimization
    • Add instant parallelism
  • Unmanaged
    • Disambiguate memory
    • Use intrinsics
  • Managed
    • Avoid double thunks
    • Speed app startup time


1. Pick the Right Level of Optimization

  • Builds from the Lab

    • If at all possible use Profile-Guided Optimization

      • Only available unmanaged

      • More on this next slide

    • If not, use Whole Program Optimization (/GL)

      • Available managed and unmanaged

    • After that we recommend

      • /O2 (optimize for speed) for hot functions/files

      • /O1 (optimize for size) for the rest

  • Other switches to use for maximum speed

    • /Gy

    • /OPT:REF,ICF (good size win on 64bit)

    • /fp:fast

    • /arch:SSE2 (will not work on downlevel architectures)

  • Debug Symbols Are NOT Only for Debug Builds

    • Executable size and codegen are NOT affected by this

      • It’s all in the PDB file

    • Always building debug symbols will make life easier

    • Make sure you use /OPT:REF,ICF, don't use /ZI, and use /INCREMENTAL:NO (an example build line follows this list)
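Putting the switch guidance together, a minimal sketch of a release build line for one hot file (app.cpp is a hypothetical name; in a real project /O2 vs. /O1 would be set per file or per project, and /arch:SSE2 would be dropped for pre-SSE2 targets):

    cl /O2 /GL /Gy /fp:fast /arch:SSE2 /Zi app.cpp ^
       /link /LTCG /OPT:REF,ICF /INCREMENTAL:NO /DEBUG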


Next-Gen Optimizations Today: Profile-Guided Optimization

  • The next level beyond Whole Program Optimization

  • Static compilers can’t answer everything

  • We get 20-50% improvement on large server applications that we ship

  • Current support is unmanaged only

if(a < b)
    foo();              // Should we inline foo()?
else
    baz();

for(i = 0; i < count; ++i)
    bar();              // Should we unroll this loop?


Profile-Guided Optimization

The PGO build flow (diagram on the slide):

  1. Compile the source with /GL to produce object files.
  2. Link the object files with /LTCG:PGI to produce an instrumented image plus a PGD file.
  3. Run the instrumented image against representative scenarios; the runs produce profile data.
  4. Link the object files again with /LTCG:PGO, which consumes the profile data and produces the optimized image.

There is throughput impact. (A command-line sketch of these steps follows.)
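A rough sketch of those steps on the command line (module and input names are hypothetical):

    cl /c /GL /O2 server.cpp
    link server.obj /LTCG:PGI /OUT:server.exe
    rem run representative scenarios against the instrumented image
    server.exe training-input.dat
    rem re-link the same objects, consuming the collected profile data
    link server.obj /LTCG:PGO /OUT:server.exe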


What PGO Does And Does Not Do

  • PGO does

    • Optimizations galore

      • Speed/Size Determination

      • Switch expansion

      • Better inlining decisions

      • Function/basic block layout

      • Virtual call speculation (illustrated in the sketch after this list)

      • Partial inlining

    • Optimize within a single image

    • Merging and weighting of multiple scenarios

  • PGO does not

    • No probing assembly language (inline or otherwise)

    • No optimizations across DLLs

    • No data layout optimization
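To make virtual call speculation concrete, here is a hand-written illustration of the transformation; it is not compiler output, and Shape, Circle, and Render are made-up names. When profile data shows a call site almost always dispatches to one target, PGO guards a direct, inlinable call and keeps normal virtual dispatch only as the rare fallback.

#include <typeinfo>

struct Shape  { virtual void Draw() = 0; virtual ~Shape() {} };
struct Circle : Shape { virtual void Draw() { /* ... */ } };

void Render(Shape *p) {
    // Profile data says p is almost always exactly a Circle, so speculate on
    // that target. (The compiler's real guard is a cheap vtable comparison,
    // not a typeid test.)
    if (typeid(*p) == typeid(Circle))
        static_cast<Circle *>(p)->Circle::Draw();   // direct call, can be inlined
    else
        p->Draw();                                  // rare fallback: virtual dispatch
}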



2. Add Instant Parallelism: Just Add OpenMP Pragmas!

  • OpenMP is a popular API for multithreaded programs

    • Born from the HPC community

  • It consists of a set of simple #pragmas and runtime routines

  • Most value comes from parallelizing large loops with no loop-carried dependencies

  • Visual C++ 2005 implements the full OpenMP 2.5 standard

    • Full unmanaged and /clr managed support

    • See the PDC issue of MSDN magazine for an article on OpenMP


OpenMP Parallelization

void test(int first, int last) {
    for (int i = first; i <= last; ++i) {
        a[i] = b[i] * c[i];
    }
}

Each iteration is independent, and the assignments to 'a', 'b', and 'c' do not interfere with each other; the order of execution does not matter. The loop can therefore be parallelized by adding a single pragma inside the function:

void test(int first, int last) {
#pragma omp parallel for
    for (int i = first; i <= last; ++i) {
        a[i] = b[i] * c[i];
    }
}

The same idea applies to independent statements. In the code below, the assignments to 'a', 'b', and 'c' are independent:

if(x < 0)
    a = foo(x);
else
    a = x + 5;
b = bat(y);
c = baz(x + y);
j = a*b+c;

so each one can run in its own section:

#pragma omp parallel sections
{
    #pragma omp section
    if(x < 0)
        a = foo(x);
    else
        a = x + 5;
    #pragma omp section
    b = bat(y);
    #pragma omp section
    c = baz(x + y);
}
j = a*b+c;
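One practical note when trying these examples: OpenMP support is off by default, so the pragmas are ignored unless the compiler switch is passed, roughly (file name hypothetical):

    cl /openmp /O2 test.cpp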


OpenMP Case Study: Panorama Factory by Smoky City Design

  • Top-rated image stitching application

  • Added multithreading with OpenMP in Visual C++ 2005 Beta2

  • Used 102 instances of #pragma omp *

  • Extremely impressive Results…

    • Stitching together several large images

    • Dual processor, dual core x64 machine


3. Disambiguate Memory

  • Programmer knows a and b never overlap

void copy8(int * a,
           int * b) {
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[0];
    a[3] = b[1];
    a[4] = b[0];
    a[5] = b[1];
    a[6] = b[0];
    a[7] = b[1];
}

Generated code (ecx = a, eax = b): b[0] and b[1] are reloaded before every store, because the writes through 'a' might alias 'b':

    mov edx, DWORD PTR [eax]
    mov DWORD PTR [ecx], edx
    mov edx, DWORD PTR [eax+4]
    mov DWORD PTR [ecx+4], edx
    mov edx, DWORD PTR [eax]
    mov DWORD PTR [ecx+8], edx
    mov edx, DWORD PTR [eax+4]
    mov DWORD PTR [ecx+12], edx
    mov edx, DWORD PTR [eax]
    mov DWORD PTR [ecx+16], edx
    mov edx, DWORD PTR [eax+4]
    mov DWORD PTR [ecx+20], edx
    mov edx, DWORD PTR [eax]
    mov DWORD PTR [ecx+24], edx
    mov eax, DWORD PTR [eax+4]
    mov DWORD PTR [ecx+28], eax


Aliasing And Memory Disambiguation

  • Aliasing is when one object can be used as an alias to another object

  • If the compiler can NOT prove that an object does not alias another, then it MUST assume it can

  • How can we address some of these problems?

  • Avoid taking address of an object.

  • Avoid taking address of a function.

  • Avoid using global variables. Statics are preferable.

  • Use __restrict, __declspec(noalias), and __declspec(restrict) when possible.
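A small illustration of the "prefer locals over globals" advice; gSum and SumInto are made-up names, and this is a sketch of the idea rather than a measured case:

int gSum;   // global: the compiler must assume any int* it sees might point at it

void SumInto(int *dst, const int *src, int n) {
    // Accumulating straight into gSum would generally force a load and store of
    // gSum on every iteration, because src might point at it. A local whose
    // address is never taken cannot alias anything, so it can live in a register.
    int local = 0;
    for (int i = 0; i < n; ++i)
        local += src[i];
    *dst = local;   // single store at the end
}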


__restrict – A Compiler Hint

  • Programmer knows a and b don’t overlap

void copy8(int * __restrict a,
           int * b) {
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[0];
    a[3] = b[1];
    a[4] = b[0];
    a[5] = b[1];
    a[6] = b[0];
    a[7] = b[1];
}

Generated code (eax = a, edx = b): b[0] and b[1] are loaded once and reused for every store:

    mov ecx, DWORD PTR [edx]
    mov edx, DWORD PTR [edx+4]
    mov DWORD PTR [eax], ecx
    mov DWORD PTR [eax+4], edx
    mov DWORD PTR [eax+8], ecx
    mov DWORD PTR [eax+12], edx
    mov DWORD PTR [eax+16], ecx
    mov DWORD PTR [eax+20], edx
    mov DWORD PTR [eax+24], ecx
    mov DWORD PTR [eax+28], edx


__declspec(restrict)

  • Tells the compiler that the function returns an unaliased pointer

    • Only applicable to functions

  • This is a promise the programmer makes to the compiler

  • If this promise is violated the compiler may generate bad code

  • The CRT uses this decoration, e.g., malloc, calloc, etc…

__declspec(restrict) void *malloc(size_t size);
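A minimal sketch of using the decoration on your own allocator-style function (AllocFloats is a made-up name); the promise is that the returned pointer does not alias anything else the caller can already see:

#include <stdlib.h>

__declspec(restrict) float *AllocFloats(size_t n) {
    // Freshly allocated memory cannot be reached through any pointer the caller
    // already holds, so the "unaliased result" promise is easy to keep here.
    return (float *)malloc(n * sizeof(float));
}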


__declspec(noalias)

  • Tells the compiler that the function is a semi-pure function

    • Only references locals, arguments, and first-level indirections of arguments

  • This is a promise the programmer makes to the compiler

  • If this promise is violated the compiler may generate bad code

__declspec(noalias) void isElement(Tree *t, Element e);
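As a hedged example of a function that keeps this promise (Saxpy is a made-up name): it touches only its arguments and the memory they point to directly, and no global state:

__declspec(noalias) void Saxpy(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   // only first-level indirections of the arguments
}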


4. Use Intrinsics

  • Simply represented as functions to the programmer

    • _mm_load_pd(double const*);

  • Compilers understand these as primitives

  • Allows the user to get right at the hardware w/o using asm

  • Almost anything you can do in assembly

    • interlock, memory fences, cache control, SIMD

    • The key to things such as vectorization and lock-free programming

  • You can use intrinsics in a file compiled /clr, but the function(s) will be compiled as unmanaged

  • Intrinsics are consumed by PGO and our optimizer

    • Inline asm is not

  • Documentation for intrinsics is much better in Visual C++ 2005

  • [Visual Studio 8]\VC\include\intrin.h


Matrix Addition With Intrinsics

// Scalar version
void MatMatAdd(Matrix &a, Matrix &b, Matrix &c) {
    for(int i = 0; i < a.m_rows; ++i)
        for(int j = 0; j < a.m_cols; j++)
            c[i][j] = a[i][j] + b[i][j];
}

#include <intrin.h>

// Vectorized version: the SSE intrinsics add four floats per iteration
// (assumes the rows are 16-byte aligned and m_cols is a multiple of 4)
void MatMatAddVect(Matrix &a, Matrix &b, Matrix &c) {
    __m128 aSIMD, bSIMD, cSIMD;
    for(int i = 0; i < a.m_rows; ++i)
        for(int j = 0; j < a.m_cols; j += 4)
        {
            aSIMD = _mm_load_ps(&a[i][j]);
            bSIMD = _mm_load_ps(&b[i][j]);
            cSIMD = _mm_add_ps(aSIMD, bSIMD);
            _mm_store_ps(&c[i][j], cSIMD);
        }
}
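One caveat worth noting for the vectorized version: _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses. Allocating the rows with _aligned_malloc is one way to satisfy that, and _mm_loadu_ps/_mm_storeu_ps are the unaligned (slower) alternatives. For example (cols is a hypothetical variable):

#include <malloc.h>

float *row = (float *)_aligned_malloc(cols * sizeof(float), 16);   // 16-byte-aligned storage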


Spin-Lock With Intrinsics

#include <intrin.h>
#include <windows.h>

void EnterSpinLock(volatile long &lock) {
    while(_InterlockedCompareExchange(&lock, 1, 0) != 0)
        Sleep(0);
}

void ExitSpinLock(volatile long &lock) {
    lock = 0;
}
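For context, a minimal usage sketch (g_lock and g_counter are made-up names); the interlocked intrinsic compiles down to an inline lock cmpxchg rather than a call into a library routine:

volatile long g_lock = 0;
long g_counter = 0;

void Increment() {
    EnterSpinLock(g_lock);
    ++g_counter;          // protected by the spin lock
    ExitSpinLock(g_lock);
}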


5. Avoid Double-Thunks

  • Thunks are functions used to transition from managed to unmanaged (and vice-versa)

(Diagram) Managed code calls UnmanagedFunc(); the call passes through a managed-to-unmanaged thunk to reach the unmanaged definition, UnmanagedFunc() { … }.

Thunks are a part of life… but sometimes we can have double thunks…


Double Thunking

  • From managed to managed only

    • Indirect calls

      • Function pointers and virtual functions

      • Is the callee a managed or an unmanaged entry point?

    • __declspec(dllexport)

      • No current mechanism to export functions as managed entry points

(Diagram) Managed code calls ManagedFunc(), whose definition is also managed code, yet the call passes through a managed-to-unmanaged thunk and then an unmanaged-to-managed thunk: a double thunk.


How To Fix Double Thunking

  • Indirect Functions (including Virtual Funcs)

    • Compile with /clr:pure

    • Use __clrcall

  • __declspec(dllexport)

    • Wrap functions in a managed class, and then #using the object file


Using __clrcall To Improve Performance
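The deck demonstrated this live; as a minimal sketch of the idea (Step, ManagedFn, and RunAll are made-up names, compiled with /clr): marking an indirect call target __clrcall tells the compiler the target is always a managed entry point, so the indirect call avoids the managed-to-unmanaged-to-managed round trip.

void __clrcall Step(int i) { /* managed-only work */ }

typedef void (__clrcall *ManagedFn)(int);

void RunAll() {
    ManagedFn fn = &Step;        // the pointer type records "managed entry point"
    for (int i = 0; i < 10; ++i)
        fn(i);                   // direct managed call, no double thunk
}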


6. Speed App Startup Time

  • No one likes to wait for an app to start up

  • There is still some time associated with loading CLR

  • In some apps you may have non-CLR paths

  • Only load the CLR when you need to

  • Use the delay-load technology in the linker (/DELAYLOAD)

    • If the EXE is compiled /clr then we will always load the CLR
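One rough sketch of applying that advice (module names are hypothetical, and delay-loading a mixed /clr DLL carries caveats this slide does not cover): keep the EXE native, move the managed code into its own /clr DLL, and delay-load that DLL so the CLR is loaded only when a managed code path actually runs:

    cl /c NativeHost.cpp
    cl /clr /LD ManagedFeatures.cpp
    link NativeHost.obj ManagedFeatures.lib delayimp.lib /DELAYLOAD:ManagedFeatures.dll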



Summary of Best Practices

Large and ongoing investment in managed and unmanaged C++ code

  • Managed + Unmanaged
    • Use PGO for unmanaged and WPO for managed…
    • OpenMP can ease multithreaded development.
  • Unmanaged
    • Make it easier for the compiler to track pointers.
    • Intrinsics give the ability to get to the metal.
  • Managed
    • Know where your double thunks are and fix them.
    • Delay load the CLR to improve startup.


Resources

  • Visual C++ Dev Center

    • http://msdn.microsoft.com/visualc

    • This is the place to go for all our news and whitepapers

  • Myself

  • Must See Talks

    • TLN309  C++: Future Directions in Language Innovation with Herb Sutter (Friday 10:30am)


© 2005 Microsoft Corporation. All rights reserved.

This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

