370 likes | 572 Views
Multi-core Programming. Tools. Topics. General Ideas Compiler Switches Dual Core Vectorization. General Ideas - Optimization. Exploiting Architectural Power requires Sophisticated Compilers Optimal use of Registers & functional units Dual-Core/Multi-processor SSE instructions
E N D
Multi-core Programming Tools
Topics • General Ideas • Compiler Switches • Dual Core • Vectorization Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
General Ideas - Optimization • Exploiting Architectural Power requires Sophisticated Compilers • Optimal use of • Registers & functional units • Dual-Core/Multi-processor • SSE instructions • Cache architecture
General Ideas – Compatibility Compatibility is a key concern for software development. Always read manuals to ensure: • Tool compatibility with hardware revisions. • Tool compatibility with software revisions. • Tool compatibility with IDE. • Tool compatibility with native (development) and target (deployment) operating system.
General Ideas – Intel C++ Compatibility with Microsoft • Source & binary compatible with VC2003 with /Qvc71, • Source & binary compatible with w/ VC 2005 under /Qvc8. • Microsoft* & Intel OpenMP binaries are not compatible. • Use the one compiler for all modules compiled with OpenMP • For more information, refer to the User’s Guide
General Ideas – Tools Never ignore or take tools for granted. A key part of system development is initial specification and qualification of tools required to get the job done. The wrong tool alone can destroy project success chances.
General Ideas - Use Intel Compiler in Microsoft IDE C++ Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Topics • General Ideas • Compiler Switches • Dual Core • Vectorization Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Compiler Switches - General Optimizations Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Compiler Switches - Multi-pass Optimization Interprocedural Optimizations (IPO) ip: Enables interprocedural optimizations for single file compilation ipo: Enables interprocedural optimizations across files Can inline functions in separate files Enhances optimization when used in combination with other compiler features Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Pass 1 virtual .o Pass 2 executable Compiler Switches - Multi-pass Optimization - IPOUsage: Two-Step Process Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Compiler Switches - Profile Guided Optimizations (PGO) • Use execution-time feedback to guide many other compiler optimizations • Helps I-cache, paging, branch-prediction • Enabled optimizations: • Basic block ordering • Better register allocation • Better decision of functions to inline • Function ordering • Switch-statement optimization • Better vectorization decisions Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Compiler Switches - Multi-pass OptimizationPGO: Three-Step Process Step 1 Instrumented Compilation (Mac*/Linux*) icc-prof_gen[x]prog.c(Windows*) icl-Qprof_gen[x]prog.c Instrumented executable Step 2 Instrumented Execution Run program on a typical dataset DYN file containing dynamic info: .dyn Step 3 Merged DYN summary file: .dpi Delete old dyn files if you do not want the info included Feedback Compilation (Mac/Linux) icc -prof_use prog.c(Windows) icl -Qprof_use prog.c Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Topics • General Ideas • Compiler Switches • Dual Core • Auto Parallelization • OpenMP • Threading Diagnostics • Vectorization Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Auto-parallelization • Auto-parallelization: Automatic threading of loops without having to manually insert OpenMP* directives. • Compiler can identify “easy” candidates for parallelization, but large applications are difficult to analyze. Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
OpenMP* Threading Technology • Pragma based approach to parallelism • Usage: • OpenMP switches: -openmp : /Qopenmp • OpenMP reports: -openmp-report : /Qopenmp-report #pragmaomp parallel for for (i=0;i<MAX;i++) A[i]= c*A[i] + B[i]; Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
OpenMP: Workqueueing Extension Example • Intel Compiler’s Workqueuing extension • Create Queue of tasks…Works on… • Recursive functions • Linked lists, etc. #pragmaintelomp parallel taskq shared(p) { while (p != NULL) { #pragmaintelomp task captureprivate(p) do_work1(p); p = p->next; } } Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Source Instrumentation for Intel Thread Checker Allows thread checker to diagnose threading correctness bugs To use tcheck/Qtcheck you must have Intel Thread Checker installed Parallel Diagnostics Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Topics • General Ideas • Compiler Switches • Dual Core • Vectorization • SSE & Vectorization • Vectorization Reports • Explanations of a few specific vectorization inhibitors Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
SIMD – SSE, SSE2, SSE3 Support 2x doubles 4x floats 1x dqword SSE2 SSE3 16x bytes SSE 8x words MMX* 4x dwords 2x qwords * MMX actually used the x87 Floating Point Registers - SSE, SSE2, and SSE3 use the new SSE registers
FP to integer conversions Complex arithmetic Video encoding SIMD FP using AOS format* Thread Synchronization SSE3 Instructions FISTTP ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP LDDQU * Also benefits Complex and Vectorization HADDPD, HSUBPD HADDPS, HSUBPS MONITOR, MWAIT
Using SSE3 - Your Task: Convert This… for (i=0;i<=MAX;i++) c[i]=a[i]+b[i]; A[0] A[1] not used not used not used + + + + B[0] B[1] not used not used not used 128-bit Registers C[1] C[0] not used not used not used Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
… Into This … for (i=0;i<=MAX;i++) c[i]=a[i]+b[i]; A[3] A[1] A[2] A[0] + + + + 128-bit Registers B[3] B[1] B[2] B[0] C[3] C[1] C[2] C[0] Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Compiler Based VectorizationProcessor Specific Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Compiler Based Vectorization Automatic Processor Dispatch – ax[?] • Single executable • Optimized for Intel® Core Duo processors and generic code that runs on all IA32 processors. • For each target processor it uses: • Processor-specific instructions • Vectorization • Low overhead • Some increase in code size Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Why Loops Don’t Vectorize • Independence • Loop Iterations generally must be independent • Some relevant qualifiers: • Some dependent loops can be vectorized. • Most function calls cannot be vectorized. • Some conditional branches prevent vectorization. • Loops must be countable. • Outer loop of nest cannot be vectorized. • Mixed data types cannot be vectorized. Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Why Didn’t My Loop Vectorize? Windows* Linux* Macintosh* -Qvec_reportn -vec_reportn -vec_reportn Set diagnostic level dumped to stdout • n=0: No diagnostic information • n=1: (Default) Loops successfully vectorized • n=2: Loops not vectorized – and the reason why not • n=3: Adds dependency Information • n=4: Reports only non-vectorized loops • n=5: Reports only non-vectorized loops and adds dependency info Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Why Loops Don’t Vectorize • “Existence of vector dependence” • “Nonunit stride used” • “Mixed Data Types” • “Unsupported Loop Structure” • “Contains unvectorizable statement at line XX” • There are more reasons loops don’t vectorize but we will disucss the reasons above Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
“Existence of Vector Dependency” • Usually, indicates a real dependency between iterations of the loop, as shown here: for (i = 0; i < 100; i++) x[i] = A * x[i + 1]; Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Defining Loop Independence Iteration Y of a loop is independent of when (or whether) iteration X occurs. int a[MAX], b[MAX]; for (j=0;j<MAX;j++) { a[j] = b[j]; } Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
“Nonunit stride used” Memory for (I=0;I<=MAX;I++) for (J=0;J<=MAX;J++) { c[I][J]+=1; // Unit Stride c[J][I]+=1; // Non-Unit A[J*J]+=1; // Non-unit A[B[J]]+=1; // Non-Unit if (A[MAX-J])=1 last1=J;}// Non-Unit End Result: Loading Vector may take more cycles than executing operation sequentially. Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
“Mixed Data Types” An example: inthowmany_close(double *x, double *y) { intwithinborder=0; double dist; for(inti=0;i<MAX;i++) { dist=sqrtf(x[i]*x[i] + y[i]*y[i]); if (dist<5) withinborder++; } } Mixed data types are possible – but complicate things i.e.: 2 doubles vs 4 ints per SIMD register Some operations with specific data types won’t work Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
“Unsupported Loop Structure” Example: struct _xx { int data; int bound; } ; doit1(int *a, struct _xx *x) { for (inti=0; i<x->bound; i++) a[i] = 0; An unsupported loop structure means the loop is not countable, or the compiler for whatever reason can’t construct a run-time expression for the trip count. Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
A[3] A[1] A[2] A[0] func func func func 128-bit Registers B[3] B[1] B[2] B[0] “Contains unvectorizable statement” for (i=1;i<nx;i++) { B[i] = func(A[i]); } Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version
Reference • White papers and technical notes • www.intel.com/ids • www.intel.com/software/products • Product support resources • www.intel.com/software/products/support Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version