Presentation Transcript

Tips and Tricks: Visual C++ 2005 Optimization Best Practices

Kang Su Gatlin

TLNL04

Program Manager

Visual C++

Microsoft Corporation


6 Tips/Best Practices To Help Any C++ Dev Write Faster Code

  • Managed + Unmanaged
    • Pick the right level of optimization
    • Add instant parallelism
  • Unmanaged
    • Disambiguate memory
    • Use intrinsics
  • Managed
    • Avoid double thunks
    • Speed app startup time


1. Pick the Right Level of Optimization

  • Builds from the Lab

    • If at all possible use Profile-Guided Optimization

      • Only available unmanaged

      • More on this next slide

    • If not, use Whole Program Optimization (/GL)

      • Available managed and unmanaged

    • After that we recommend

      • /O2 (optimize for speed) for hot functions/files

      • /O1 (optimize for size) for the rest

  • Other switches to use for maximum speed

    • /Gy

    • /OPT:REF,ICF (good size win on 64bit)

    • /fp:fast

    • /arch:SSE2 (will not work on downlevel architectures)

  • Debug Symbols Are NOT Only for Debug Builds

    • Executable size and codegen are NOT affected by this

      • It’s all in the PDB file

    • Always building debug symbols will make life easier

    • Make sure you use /OPT:REF,ICF, don't use /ZI, and use /INCREMENTAL:NO (an example build line follows this list)
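Putting the switch guidance together, a minimal sketch of a release build line for one hot file (app.cpp is a hypothetical name; in a real project /O2 vs. /O1 would be set per file or per project, and /arch:SSE2 would be dropped for pre-SSE2 targets):

    cl /O2 /GL /Gy /fp:fast /arch:SSE2 /Zi app.cpp ^
       /link /LTCG /OPT:REF,ICF /INCREMENTAL:NO /DEBUG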


Next-Gen Optimizations Today: Profile-Guided Optimization

  • The next level beyond Whole Program Optimization

  • Static compilers can’t answer everything

  • We get 20-50% improvement on large server applications that we ship

  • Current support is unmanaged only

if(a < b)
    foo();              // Should we inline foo()?
else
    baz();

for(i = 0; i < count; ++i)
    bar();              // Should we unroll this loop?


Profile-Guided Optimization

The PGO build flow (diagram on the slide):

  1. Compile the source with /GL to produce object files.
  2. Link the object files with /LTCG:PGI to produce an instrumented image plus a PGD file.
  3. Run the instrumented image against representative scenarios; the runs produce profile data.
  4. Link the object files again with /LTCG:PGO, which consumes the profile data and produces the optimized image.

There is throughput impact. (A command-line sketch of these steps follows.)
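A rough sketch of those steps on the command line (module and input names are hypothetical):

    cl /c /GL /O2 server.cpp
    link server.obj /LTCG:PGI /OUT:server.exe
    rem run representative scenarios against the instrumented image
    server.exe training-input.dat
    rem re-link the same objects, consuming the collected profile data
    link server.obj /LTCG:PGO /OUT:server.exe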


What PGO Does And Does Not Do

  • PGO does

    • Optimizations galore

      • Speed/Size Determination

      • Switch expansion

      • Better inlining decisions

      • Function/basic block layout

      • Virtual call speculation (illustrated in the sketch after this list)

      • Partial inlining

    • Optimize within a single image

    • Merging and weighting of multiple scenarios

  • PGO does not

    • No probing assembly language (inline or otherwise)

    • No optimizations across DLLs

    • No data layout optimization
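To make virtual call speculation concrete, here is a hand-written illustration of the transformation; it is not compiler output, and Shape, Circle, and Render are made-up names. When profile data shows a call site almost always dispatches to one target, PGO guards a direct, inlinable call and keeps normal virtual dispatch only as the rare fallback.

#include <typeinfo>

struct Shape  { virtual void Draw() = 0; virtual ~Shape() {} };
struct Circle : Shape { virtual void Draw() { /* ... */ } };

void Render(Shape *p) {
    // Profile data says p is almost always exactly a Circle, so speculate on
    // that target. (The compiler's real guard is a cheap vtable comparison,
    // not a typeid test.)
    if (typeid(*p) == typeid(Circle))
        static_cast<Circle *>(p)->Circle::Draw();   // direct call, can be inlined
    else
        p->Draw();                                  // rare fallback: virtual dispatch
}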



2. Add Instant Parallelism: Just Add OpenMP Pragmas!

  • OpenMP is a popular API for multithreaded programs

    • Born from the HPC community

  • It consists of a set of simple #pragmas and runtime routines

  • Most value comes from parallelizing large loops with no loop-carried dependencies

  • Visual C++ 2005 implements the full OpenMP 2.5 standard

    • Full unmanaged and /clr managed support

    • See the PDC issue of MSDN magazine for an article on OpenMP


OpenMP Parallelization

void test(int first, int last) {
    for (int i = first; i <= last; ++i) {
        a[i] = b[i] * c[i];
    }
}

Each iteration is independent, and the assignments to 'a', 'b', and 'c' do not interfere with each other; the order of execution does not matter. The loop can therefore be parallelized by adding a single pragma inside the function:

void test(int first, int last) {
#pragma omp parallel for
    for (int i = first; i <= last; ++i) {
        a[i] = b[i] * c[i];
    }
}

The same idea applies to independent statements. In the code below, the assignments to 'a', 'b', and 'c' are independent:

if(x < 0)
    a = foo(x);
else
    a = x + 5;
b = bat(y);
c = baz(x + y);
j = a*b+c;

so each one can run in its own section:

#pragma omp parallel sections
{
    #pragma omp section
    if(x < 0)
        a = foo(x);
    else
        a = x + 5;
    #pragma omp section
    b = bat(y);
    #pragma omp section
    c = baz(x + y);
}
j = a*b+c;
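One practical note when trying these examples: OpenMP support is off by default, so the pragmas are ignored unless the compiler switch is passed, roughly (file name hypothetical):

    cl /openmp /O2 test.cpp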


OpenMP Case Study: Panorama Factory by Smoky City Design

  • Top-rated image stitching application

  • Added multithreading with OpenMP in Visual C++ 2005 Beta2

  • Used 102 instances of #pragma omp *

  • Extremely impressive Results…

    • Stitching together several large images

    • Dual processor, dual core x64 machine


3. Disambiguate Memory

  • Programmer knows a and b never overlap

void copy8(int * a,
           int * b) {
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[0];
    a[3] = b[1];
    a[4] = b[0];
    a[5] = b[1];
    a[6] = b[0];
    a[7] = b[1];
}

Generated code (ecx = a, eax = b): b[0] and b[1] are reloaded before every store, because the writes through 'a' might alias 'b':

    mov edx, DWORD PTR [eax]
    mov DWORD PTR [ecx], edx
    mov edx, DWORD PTR [eax+4]
    mov DWORD PTR [ecx+4], edx
    mov edx, DWORD PTR [eax]
    mov DWORD PTR [ecx+8], edx
    mov edx, DWORD PTR [eax+4]
    mov DWORD PTR [ecx+12], edx
    mov edx, DWORD PTR [eax]
    mov DWORD PTR [ecx+16], edx
    mov edx, DWORD PTR [eax+4]
    mov DWORD PTR [ecx+20], edx
    mov edx, DWORD PTR [eax]
    mov DWORD PTR [ecx+24], edx
    mov eax, DWORD PTR [eax+4]
    mov DWORD PTR [ecx+28], eax


Aliasing And Memory Disambiguation

  • Aliasing is when one object can be used as an alias to another object

  • If the compiler can NOT prove that an object does not alias another, then it MUST assume it can

  • How can we address some of these problems?

  • Avoid taking address of an object.

  • Avoid taking address of a function.

  • Avoid using global variables. Statics are preferable.

  • Use __restrict, __declspec(noalias), and __declspec(restrict) when possible.
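A small illustration of the "prefer locals over globals" advice; gSum and SumInto are made-up names, and this is a sketch of the idea rather than a measured case:

int gSum;   // global: the compiler must assume any int* it sees might point at it

void SumInto(int *dst, const int *src, int n) {
    // Accumulating straight into gSum would generally force a load and store of
    // gSum on every iteration, because src might point at it. A local whose
    // address is never taken cannot alias anything, so it can live in a register.
    int local = 0;
    for (int i = 0; i < n; ++i)
        local += src[i];
    *dst = local;   // single store at the end
}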


__restrict – A Compiler Hint

  • Programmer knows a and b don’t overlap

void copy8(int * __restrict a,
           int * b) {
    a[0] = b[0];
    a[1] = b[1];
    a[2] = b[0];
    a[3] = b[1];
    a[4] = b[0];
    a[5] = b[1];
    a[6] = b[0];
    a[7] = b[1];
}

Generated code (eax = a, edx = b): b[0] and b[1] are loaded once and reused for every store:

    mov ecx, DWORD PTR [edx]
    mov edx, DWORD PTR [edx+4]
    mov DWORD PTR [eax], ecx
    mov DWORD PTR [eax+4], edx
    mov DWORD PTR [eax+8], ecx
    mov DWORD PTR [eax+12], edx
    mov DWORD PTR [eax+16], ecx
    mov DWORD PTR [eax+20], edx
    mov DWORD PTR [eax+24], ecx
    mov DWORD PTR [eax+28], edx


__declspec(restrict)

  • Tells the compiler that the function returns an unaliased pointer

    • Only applicable to functions

  • This is a promise the programmer makes to the compiler

  • If this promise is violated the compiler may generate bad code

  • The CRT uses this decoration, e.g., malloc, calloc, etc…

__declspec(restrict) void *malloc(size_t size);
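A minimal sketch of using the decoration on your own allocator-style function (AllocFloats is a made-up name); the promise is that the returned pointer does not alias anything else the caller can already see:

#include <stdlib.h>

__declspec(restrict) float *AllocFloats(size_t n) {
    // Freshly allocated memory cannot be reached through any pointer the caller
    // already holds, so the "unaliased result" promise is easy to keep here.
    return (float *)malloc(n * sizeof(float));
}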


__declspec(noalias)

  • Tells the compiler that the function is a semi-pure function

    • Only references locals, arguments, and first-level indirections of arguments

  • This is a promise the programmer makes to the compiler

  • If this promise is violated the compiler may generate bad code

__declspec(noalias) void isElement(Tree *t, Element e);
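As a hedged example of a function that keeps this promise (Saxpy is a made-up name): it touches only its arguments and the memory they point to directly, and no global state:

__declspec(noalias) void Saxpy(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   // only first-level indirections of the arguments
}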


4. Use Intrinsics

  • Simply represented as functions to the programmer

    • _mm_load_pd(double const*);

  • Compilers understand these as primitives

  • Allows the user to get right at the hardware w/o using asm

  • Almost anything you can do in assembly

    • interlock, memory fences, cache control, SIMD

    • The key to things such as vectorization and lock-free programming

  • You can use intrinsics in a file compiled /clr, but the function(s) will be compiled as unmanaged

  • Intrinsics are consumed by PGO and our optimizer

    • Inline asm is not

  • Documentation for intrinsics is much better in Visual C++ 2005

  • [Visual Studio 8]\VC\include\intrin.h


Matrix Addition With Intrinsics

// Scalar version
void MatMatAdd(Matrix &a, Matrix &b, Matrix &c) {
    for(int i = 0; i < a.m_rows; ++i)
        for(int j = 0; j < a.m_cols; j++)
            c[i][j] = a[i][j] + b[i][j];
}

#include <intrin.h>

// Vectorized version: the SSE intrinsics add four floats per iteration
// (assumes the rows are 16-byte aligned and m_cols is a multiple of 4)
void MatMatAddVect(Matrix &a, Matrix &b, Matrix &c) {
    __m128 aSIMD, bSIMD, cSIMD;
    for(int i = 0; i < a.m_rows; ++i)
        for(int j = 0; j < a.m_cols; j += 4)
        {
            aSIMD = _mm_load_ps(&a[i][j]);
            bSIMD = _mm_load_ps(&b[i][j]);
            cSIMD = _mm_add_ps(aSIMD, bSIMD);
            _mm_store_ps(&c[i][j], cSIMD);
        }
}
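One caveat worth noting for the vectorized version: _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses. Allocating the rows with _aligned_malloc is one way to satisfy that, and _mm_loadu_ps/_mm_storeu_ps are the unaligned (slower) alternatives. For example (cols is a hypothetical variable):

#include <malloc.h>

float *row = (float *)_aligned_malloc(cols * sizeof(float), 16);   // 16-byte-aligned storage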


Spin-Lock With Intrinsics

#include <intrin.h>
#include <windows.h>

void EnterSpinLock(volatile long &lock) {
    while(_InterlockedCompareExchange(&lock, 1, 0) != 0)
        Sleep(0);
}

void ExitSpinLock(volatile long &lock) {
    lock = 0;
}
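For context, a minimal usage sketch (g_lock and g_counter are made-up names); the interlocked intrinsic compiles down to an inline lock cmpxchg rather than a call into a library routine:

volatile long g_lock = 0;
long g_counter = 0;

void Increment() {
    EnterSpinLock(g_lock);
    ++g_counter;          // protected by the spin lock
    ExitSpinLock(g_lock);
}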


5. Avoid Double-Thunks

  • Thunks are functions used to transition from managed to unmanaged (and vice-versa)

(Diagram) Managed code calls UnmanagedFunc(); the call passes through a managed-to-unmanaged thunk to reach the unmanaged definition, UnmanagedFunc() { … }.

Thunks are a part of life… but sometimes we can have double thunks…


Double Thunking

  • From managed to managed only

    • Indirect calls

      • Function pointers and virtual functions

      • Is the callee a managed or an unmanaged entry point?

    • __declspec(dllexport)

      • No current mechanism to export functions as managed entry points

(Diagram) Managed code calls ManagedFunc(), whose definition is also managed code, yet the call passes through a managed-to-unmanaged thunk and then an unmanaged-to-managed thunk: a double thunk.


How To Fix Double Thunking

  • Indirect Functions (including Virtual Funcs)

    • Compile with /clr:pure

    • Use __clrcall

  • __declspec(dllexport)

    • Wrap functions in a managed class, and then #using the object file


Using __clrcall To Improve Performance
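The deck demonstrated this live; as a minimal sketch of the idea (Step, ManagedFn, and RunAll are made-up names, compiled with /clr): marking an indirect call target __clrcall tells the compiler the target is always a managed entry point, so the indirect call avoids the managed-to-unmanaged-to-managed round trip.

void __clrcall Step(int i) { /* managed-only work */ }

typedef void (__clrcall *ManagedFn)(int);

void RunAll() {
    ManagedFn fn = &Step;        // the pointer type records "managed entry point"
    for (int i = 0; i < 10; ++i)
        fn(i);                   // direct managed call, no double thunk
}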


6. Speed App Startup Time

  • No one likes to wait for an app to start up

  • There is still some time associated with loading CLR

  • In some apps you may have non-CLR paths

  • Only load the CLR when you need to

  • Use the delay-load technology in the linker (/DELAYLOAD)

    • If the EXE is compiled /clr then we will always load the CLR
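One rough sketch of applying that advice (module names are hypothetical, and delay-loading a mixed /clr DLL carries caveats this slide does not cover): keep the EXE native, move the managed code into its own /clr DLL, and delay-load that DLL so the CLR is loaded only when a managed code path actually runs:

    cl /c NativeHost.cpp
    cl /clr /LD ManagedFeatures.cpp
    link NativeHost.obj ManagedFeatures.lib delayimp.lib /DELAYLOAD:ManagedFeatures.dll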



Summary of Best Practices

Large and ongoing investment in managed and unmanaged C++ code

  • Managed + Unmanaged
    • Use PGO for unmanaged and WPO for managed…
    • OpenMP can ease multithreaded development.
  • Unmanaged
    • Make it easier for the compiler to track pointers.
    • Intrinsics give the ability to get to the metal.
  • Managed
    • Know where your double thunks are and fix them.
    • Delay load the CLR to improve startup.


Resources

  • Visual C++ Dev Center

    • http://msdn.microsoft.com/visualc

    • This is the place to go for all our news and whitepapers

  • Myself

  • Must See Talks

    • TLN309  C++: Future Directions in Language Innovation with Herb Sutter (Friday 10:30am)


© 2005 Microsoft Corporation. All rights reserved.

This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

