1 / 35

Multi-core Programming

Multi-core Programming. Tools. Topics. General Ideas Compiler Switches Dual Core Vectorization. General Ideas - Optimization. Exploiting Architectural Power requires Sophisticated Compilers Optimal use of Registers & functional units Dual-Core/Multi-processor SSE instructions

vondra
Download Presentation

Multi-core Programming

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Multi-core Programming Tools

  2. Topics • General Ideas • Compiler Switches • Dual Core • Vectorization Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  3. General Ideas - Optimization • Exploiting Architectural Power requires Sophisticated Compilers • Optimal use of • Registers & functional units • Dual-Core/Multi-processor • SSE instructions • Cache architecture

  4. General Ideas – Compatibility Compatibility is a key concern for software development. Always read manuals to ensure: • Tool compatibility with hardware revisions. • Tool compatibility with software revisions. • Tool compatibility with IDE. • Tool compatibility with native (development) and target (deployment) operating system.

  5. General Ideas – Intel C++ Compatibility with Microsoft • Source & binary compatible with VC2003 with /Qvc71, • Source & binary compatible with w/ VC 2005 under /Qvc8. • Microsoft* & Intel OpenMP binaries are not compatible. • Use the one compiler for all modules compiled with OpenMP • For more information, refer to the User’s Guide

  6. General Ideas – Tools Never ignore or take tools for granted. A key part of system development is initial specification and qualification of tools required to get the job done. The wrong tool alone can destroy project success chances.

  7. General Ideas - Use Intel Compiler in Microsoft IDE C++ Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  8. Topics • General Ideas • Compiler Switches • Dual Core • Vectorization Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  9. Compiler Switches - General Optimizations Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  10. Compiler Switches - Multi-pass Optimization Interprocedural Optimizations (IPO) ip: Enables interprocedural optimizations for single file compilation ipo: Enables interprocedural optimizations across files Can inline functions in separate files Enhances optimization when used in combination with other compiler features Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  11. Pass 1 virtual .o Pass 2 executable Compiler Switches - Multi-pass Optimization - IPOUsage: Two-Step Process Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  12. Compiler Switches - Profile Guided Optimizations (PGO) • Use execution-time feedback to guide many other compiler optimizations • Helps I-cache, paging, branch-prediction • Enabled optimizations: • Basic block ordering • Better register allocation • Better decision of functions to inline • Function ordering • Switch-statement optimization • Better vectorization decisions Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  13. Compiler Switches - Multi-pass OptimizationPGO: Three-Step Process Step 1 Instrumented Compilation (Mac*/Linux*) icc-prof_gen[x]prog.c(Windows*) icl-Qprof_gen[x]prog.c Instrumented executable Step 2 Instrumented Execution Run program on a typical dataset DYN file containing dynamic info: .dyn Step 3 Merged DYN summary file: .dpi Delete old dyn files if you do not want the info included Feedback Compilation (Mac/Linux) icc -prof_use prog.c(Windows) icl -Qprof_use prog.c Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  14. Topics • General Ideas • Compiler Switches • Dual Core • Auto Parallelization • OpenMP • Threading Diagnostics • Vectorization Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  15. Auto-parallelization • Auto-parallelization: Automatic threading of loops without having to manually insert OpenMP* directives. • Compiler can identify “easy” candidates for parallelization, but large applications are difficult to analyze. Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  16. OpenMP* Threading Technology • Pragma based approach to parallelism • Usage: • OpenMP switches: -openmp : /Qopenmp • OpenMP reports: -openmp-report : /Qopenmp-report #pragmaomp parallel for for (i=0;i<MAX;i++) A[i]= c*A[i] + B[i]; Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  17. OpenMP: Workqueueing Extension Example • Intel Compiler’s Workqueuing extension • Create Queue of tasks…Works on… • Recursive functions • Linked lists, etc. #pragmaintelomp parallel taskq shared(p) { while (p != NULL) { #pragmaintelomp task captureprivate(p) do_work1(p);      p = p->next; } } Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  18. Source Instrumentation for Intel Thread Checker Allows thread checker to diagnose threading correctness bugs To use tcheck/Qtcheck you must have Intel Thread Checker installed Parallel Diagnostics Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  19. Topics • General Ideas • Compiler Switches • Dual Core • Vectorization • SSE & Vectorization • Vectorization Reports • Explanations of a few specific vectorization inhibitors Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  20. SIMD – SSE, SSE2, SSE3 Support 2x doubles 4x floats 1x dqword SSE2 SSE3 16x bytes SSE 8x words MMX* 4x dwords 2x qwords * MMX actually used the x87 Floating Point Registers - SSE, SSE2, and SSE3 use the new SSE registers

  21. FP to integer conversions Complex arithmetic Video encoding SIMD FP using AOS format* Thread Synchronization SSE3 Instructions FISTTP ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP LDDQU * Also benefits Complex and Vectorization HADDPD, HSUBPD HADDPS, HSUBPS MONITOR, MWAIT

  22. Using SSE3 - Your Task: Convert This… for (i=0;i<=MAX;i++) c[i]=a[i]+b[i]; A[0] A[1] not used not used not used + + + + B[0] B[1] not used not used not used 128-bit Registers C[1] C[0] not used not used not used Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  23. … Into This … for (i=0;i<=MAX;i++) c[i]=a[i]+b[i]; A[3] A[1] A[2] A[0] + + + + 128-bit Registers B[3] B[1] B[2] B[0] C[3] C[1] C[2] C[0] Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  24. Compiler Based VectorizationProcessor Specific Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  25. Compiler Based Vectorization Automatic Processor Dispatch – ax[?] • Single executable • Optimized for Intel® Core Duo processors and generic code that runs on all IA32 processors. • For each target processor it uses: • Processor-specific instructions • Vectorization • Low overhead • Some increase in code size Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  26. Why Loops Don’t Vectorize • Independence • Loop Iterations generally must be independent • Some relevant qualifiers: • Some dependent loops can be vectorized. • Most function calls cannot be vectorized. • Some conditional branches prevent vectorization. • Loops must be countable. • Outer loop of nest cannot be vectorized. • Mixed data types cannot be vectorized. Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  27. Why Didn’t My Loop Vectorize? Windows* Linux* Macintosh* -Qvec_reportn -vec_reportn -vec_reportn Set diagnostic level dumped to stdout • n=0: No diagnostic information • n=1: (Default) Loops successfully vectorized • n=2: Loops not vectorized – and the reason why not • n=3: Adds dependency Information • n=4: Reports only non-vectorized loops • n=5: Reports only non-vectorized loops and adds dependency info Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  28. Why Loops Don’t Vectorize • “Existence of vector dependence” • “Nonunit stride used” • “Mixed Data Types” • “Unsupported Loop Structure” • “Contains unvectorizable statement at line XX” • There are more reasons loops don’t vectorize but we will disucss the reasons above Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  29. “Existence of Vector Dependency” • Usually, indicates a real dependency between iterations of the loop, as shown here: for (i = 0; i < 100; i++) x[i] = A * x[i + 1]; Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  30. Defining Loop Independence Iteration Y of a loop is independent of when (or whether) iteration X occurs. int a[MAX], b[MAX]; for (j=0;j<MAX;j++) { a[j] = b[j]; } Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  31. “Nonunit stride used” Memory for (I=0;I<=MAX;I++) for (J=0;J<=MAX;J++) { c[I][J]+=1; // Unit Stride c[J][I]+=1; // Non-Unit A[J*J]+=1; // Non-unit A[B[J]]+=1; // Non-Unit if (A[MAX-J])=1 last1=J;}// Non-Unit End Result: Loading Vector may take more cycles than executing operation sequentially. Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  32. “Mixed Data Types” An example: inthowmany_close(double *x, double *y) { intwithinborder=0; double dist; for(inti=0;i<MAX;i++) { dist=sqrtf(x[i]*x[i] + y[i]*y[i]); if (dist<5) withinborder++; } } Mixed data types are possible – but complicate things i.e.: 2 doubles vs 4 ints per SIMD register Some operations with specific data types won’t work Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  33. “Unsupported Loop Structure” Example: struct _xx { int data; int bound; } ; doit1(int *a, struct _xx *x) {  for (inti=0; i<x->bound; i++) a[i] = 0; An unsupported loop structure means the loop is not countable, or the compiler for whatever reason can’t construct a run-time expression for the trip count. Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  34. A[3] A[1] A[2] A[0] func func func func 128-bit Registers B[3] B[1] B[2] B[0] “Contains unvectorizable statement” for (i=1;i<nx;i++) { B[i] = func(A[i]); } Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

  35. Reference • White papers and technical notes • www.intel.com/ids • www.intel.com/software/products • Product support resources • www.intel.com/software/products/support Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version

More Related