Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors



  1. Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors
  Shuo Li
  Financial Services Engineering, Software and Services Group, Intel Corporation

  2. Agenda
  • A Tale of Two Architectures
  • Transcendental Functions – Step 5 Lab 1/3
  • Thread Affinity – Step 5 Lab 2/3
  • Prefetch and Streaming Stores
  • Data Blocking – Step 5 Lab 3/3
  • Summary

  3. A Tale of Two Architectures

  4. A Tale of Two Architectures: Multicore

  5. Transcendental Functions

  6. Extended Math Unit
  • Fast approximations of transcendental functions in single precision, using a hardware lookup table
  • Minimax quadratic polynomial approximation
  • Full effective 23-bit mantissa accuracy
  • 1–2 cycle throughput for the four elementary functions
  • Benefits other functions that build directly on them

  7. Use EMU Functions in Finance Algorithms
  • Challenges in using exp2() and log2()
    • exp() and log() are widely used in finance algorithms
    • Their base is e (≈2.718…), not 2
  • The change-of-base formula
    • M_LOG2E and M_LN2 are defined in math.h
    • Cost: one multiplication, and its effect on result accuracy
    • In general, it works for any base b:
      expb(x) = exp2(x * log2(b))    logb(x) = log2(x) * logb(2)
  • Manage the cost of conversion
    • Absorb the multiply into other constant calculations outside the loop
    • Always convert from other bases to base 2 (a minimal C sketch follows):
      exp(x) = exp2(x * M_LOG2E)    log(x) = log2(x) * M_LN2
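
  Below is a minimal sketch of the change-of-base trick in C; the helper names are illustrative and not part of the original code. M_LOG2E and M_LN2 come from math.h as noted above (some compilers need a feature macro such as _USE_MATH_DEFINES to expose them).

    #include <math.h>    /* M_LOG2E = log2(e), M_LN2 = ln(2) */

    /* exp(x) through the base-2 EMU path: exp(x) = exp2(x * log2(e)) */
    static inline float exp_via_exp2(float x)
    {
        return exp2f(x * (float)M_LOG2E);
    }

    /* log(x) through the base-2 EMU path: log(x) = log2(x) * ln(2) */
    static inline float log_via_log2(float x)
    {
        return log2f(x) * (float)M_LN2;
    }

  In a hot loop the extra multiplication is normally absorbed into a constant computed once outside the loop, which is exactly what the Black-Scholes rewrite on the next slide does.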

  8. Black-Scholes Can Use EMU Functions

  EMU (base-2) version:

    const float LN2_V  = M_LN2 * (1/V);
    const float RLOG2E = -R * M_LOG2E;
    for (int opt = 0; opt < OPT_N; opt++)
    {
        float T = OptionYears[opt];
        float X = OptionStrike[opt];
        float S = StockPrice[opt];
        float rsqrtT = 1/sqrtf(T);
        float sqrtT  = 1/rsqrtT;
        float d1 = log2f(S/X)*LN2_V*rsqrtT + RVV*sqrtT;
        float d2 = d1 - V * sqrtT;
        CNDD1 = CNDF(d1);
        CNDD2 = CNDF(d2);
        float expRT   = X * exp2f(RLOG2E * T);
        float CallVal = S * CNDD1 - expRT * CNDD2;
        CallResult[opt] = CallVal;
        PutResult[opt]  = CallVal + expRT - S;
    }

  Original (base-e) version:

    const float R   = 0.02f;
    const float V   = 0.30f;
    const float RVV = (R + 0.5f * V * V)/V;
    for (int opt = 0; opt < OPT_N; opt++)
    {
        float T = OptionYears[opt];
        float X = OptionStrike[opt];
        float S = StockPrice[opt];
        float sqrtT = sqrtf(T);
        float d1 = logf(S/X)/(V*sqrtT) + RVV*sqrtT;
        float d2 = d1 - V * sqrtT;
        CNDD1 = CNDF(d1);
        CNDD2 = CNDF(d2);
        float expRT   = X * expf(-R * T);
        float CallVal = S * CNDD1 - expRT * CNDD2;
        CallResult[opt] = CallVal;
        PutResult[opt]  = CallVal + expRT - S;
    }

  9. Transcendental Functions – Step 5 Lab 1/3

  10. Using Intel® Xeon Phi™ Coprocessors
  • Use an Intel® Xeon® E5 2670 platform with an Intel® Xeon Phi™ Coprocessor
  • Make sure you can invoke the Intel C/C++ Compiler
    • source /opt/intel/Compiler/2013.2.146/composerxe/pkg_bin/compilervars.sh intel64
    • Try icpc -V to print the banner of the Intel® C/C++ compiler
  • Build the native Intel® Xeon Phi™ Coprocessor application
    • Change the Makefile and use -mmic in lieu of -xAVX
  • Copy the program from host to coprocessor
    • Find out the host name using hostname (suppose it returns esg014)
    • Copy the executable using scp ./MonteCarlo esg014-mic0:
  • Establish an execution environment: ssh esg014-mic0
    • Set the environment variable: export LD_LIBRARY_PATH=.
    • Optionally: export KMP_AFFINITY="compact,granularity=fine"
  • Invoke the program: ./MonteCarlo

  11. Step 5: Transcendental Functions
  • Use the transcendental functions in the EMU
  • The inner loop calls expf(MuByT + VBySqrtT * random[pos]);
  • Call exp2f instead and adjust the argument by a factor of M_LOG2E
  • Combine that multiplication with the MuByT and VBySqrtT constants (a fuller before/after sketch follows):
    float VBySqrtT = VLOG2E   * sqrtf(T[opt]);
    float MuByT    = RVVLOG2E * T[opt];
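
  A small self-contained sketch of this lab step, assuming VLOG2E and RVVLOG2E are the original V and RVV constants pre-multiplied by M_LOG2E; the constant values and the variable r (one Gaussian random number) are made up for illustration.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Illustrative per-option constants (values are made up). */
        const float V = 0.30f, RVV = 0.065f, T = 1.5f, r = 0.37f;

        /* Original form: base-e exponential inside the hot loop. */
        float MuByT    = RVV * T;
        float VBySqrtT = V * sqrtf(T);
        float e1 = expf(MuByT + VBySqrtT * r);

        /* Step 5 form: pre-multiply the constants by M_LOG2E once,
           then call the base-2 EMU function inside the loop. */
        float MuByT2    = RVV * (float)M_LOG2E * T;         /* RVVLOG2E * T[opt]        */
        float VBySqrtT2 = V   * (float)M_LOG2E * sqrtf(T);  /* VLOG2E * sqrtf(T[opt])   */
        float e2 = exp2f(MuByT2 + VBySqrtT2 * r);

        printf("expf: %f  exp2f: %f\n", e1, e2);   /* the two should agree closely */
        return 0;
    }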

  12. Thread Affinity

  13. More on Thread Affinity
  • Bind the worker threads to specific processor cores/hardware threads
    • Minimizes thread migration and context switches
    • Improves data locality; reduces coherency traffic
  • Two components to the problem:
    • How many worker threads to create?
    • How to bind worker threads to cores/hardware threads?
  • Two ways to specify thread affinity
    • Environment variables: OMP_NUM_THREADS, KMP_AFFINITY
    • C/C++ API (see the sketch below):
      kmp_set_defaults("KMP_AFFINITY=compact");
      omp_set_num_threads(244);
  • Add #include <omp.h> to your source file
  • Compile with -openmp
  • Uses libiomp5.so
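
  A minimal sketch of the C/C++ API route described above. kmp_set_defaults() is an Intel OpenMP extension declared in Intel's omp.h and must run before the first parallel region; the thread count 244 assumes a 61-core coprocessor with 4 hardware threads per core.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Set affinity and thread count before the OpenMP runtime starts. */
        kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
        omp_set_num_threads(244);   /* 61 cores x 4 hardware threads (assumed) */

        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)
                printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }

  Build with icpc -openmp so the program links against libiomp5.so.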

  14. The Optimal Thread Number
  • The Intel MIC architecture maintains 4 hardware contexts per core
    • Round-robin execution policy
    • Requires 2 threads per core for decent performance
    • Financial algorithms take all 4 threads per core to reach peak
  • The Intel Xeon processor optionally uses Hyper-Threading
    • Execute-until-stall execution policy
    • Truly compute-intensive kernels peak at 1 thread per core
    • Finance algorithms like Hyper-Threading: 2 threads per core
  • For an OpenMP application on a platform with ncore cores (see the sketch below):
    • Host only: 2 x ncore (or 1 x ncore if Hyper-Threading is disabled)
    • MIC native: 4 x ncore
    • Offload: 4 x (ncore - 1); the OpenMP runtime avoids the core the OS runs on
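
  One way to express those counts in code, as a sketch only: NCORE is a hypothetical compile-time constant standing in for the number of physical cores on the target; in practice you would query the system or simply set OMP_NUM_THREADS.

    #include <omp.h>

    #define NCORE 16                   /* hypothetical: physical cores on the target */

    #ifdef __MIC__
      #define NTHREADS (4 * NCORE)     /* native coprocessor run: 4 threads per core */
    #else
      #define NTHREADS (2 * NCORE)     /* host run with Hyper-Threading enabled      */
    #endif

    static void set_suggested_thread_count(void)
    {
        omp_set_num_threads(NTHREADS);
    }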

  15. Thread Affinity Choices
  • Intel® OpenMP* supports the following affinity types:
    • Compact: assigns threads to consecutive hardware contexts on the same physical core, to benefit from the shared cache
    • Scatter: assigns consecutive threads to different physical cores, to maximize aggregate memory bandwidth
    • Balanced: a blend of compact and scatter (currently only available for the Intel® MIC Architecture)
  • You can also specify affinity modifiers
    • Explicit setting: KMP_AFFINITY set to granularity=fine,proclist="1-240",explicit
  • Affinity is particularly important if not all available threads are used
  • Affinity type matters less under full thread subscription

  16. Thread Affinity in Monte Carlo
  • Monte Carlo can use all the hardware threads available on a core
    • Enable Hyper-Threading for the Intel® Xeon® processor
    • Set -opt-threads-per-core=4 for the coprocessor build
  • Affinity type matters less when all cores are used
    • Argument for compact: maximizes sharing of the random numbers in cache
    • Argument for scatter: maximizes the bandwidth to memory
  • However, you still have to set the thread affinity type
  • API call:
    #ifdef _OPENMP
    kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
    #endif
  • Environment variables:
    ~ $ export LD_LIBRARY_PATH=.
    ~ $ export KMP_AFFINITY="scatter,granularity=fine"
    ~ $ ./MonteCarlo

  17. Thread Affinity – Step 5 Lab 2/3

  18. Step 5: Thread Affinity
  • Add #include <omp.h>
  • Add the following lines before the very first #pragma omp:
    #ifdef _OPENMP
    kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
    #endif
  • Add -opt-threads-per-core=4 to the Makefile

  19. Prefetch and Streaming Stores

  20. Prefetch on Intel Multicore and Manycore Platforms
  • Objective: move data from memory into the L1 or L2 cache in anticipation of CPU loads/stores
    • More important on the in-order Intel Xeon Phi coprocessor
    • Less important on the out-of-order Intel Xeon processor
  • Compiler prefetching is on by default for Intel® Xeon Phi™ coprocessors at -O2 and above
  • Compiler prefetching is not enabled by default on Intel® Xeon® processors
    • Use the option -opt-prefetch[=n], n = 1..4
  • Use the compiler reporting options to see detailed per-loop prefetch diagnostics
    • Use -opt-report-phase hlo -opt-report 3

  21. Automatic Prefetches
  • Loop prefetch
    • Compiler-generated prefetches target memory accesses in a future iteration of the loop
    • They target regular, predictable array and pointer accesses
  • Interaction with the hardware prefetcher
    • The Intel® Xeon Phi™ coprocessor has a hardware L2 prefetcher
    • If software prefetches are doing a good job, hardware prefetching does not kick in
    • References not prefetched by the compiler may still get prefetched by the hardware prefetcher

  22. Explicit Prefetch
  • Use intrinsics
    • _mm_prefetch((char *) &a[i], hint); see xmmintrin.h for the possible hints (L1, L2, non-temporal, …)
    • But you have to specify the prefetch distance yourself
    • There are also gather/scatter prefetch intrinsics; see zmmintrin.h and the compiler user guide, e.g. _mm512_prefetch_i32gather_ps
  • Use a pragma / directive (easier):
    • #pragma prefetch a [:hint[:distance]]
    • You specify what to prefetch, but can let the compiler figure out how far ahead to do it
  • Use compiler switches:
    • -opt-prefetch-distance=n1[,n2] specifies the prefetch distance (in loop iterations) for compiler-generated prefetches inside loops; n1 is the distance from memory to L2, n2 from L2 to L1
    • BlackScholes uses -opt-prefetch-distance=8,2
  • A sketch of the intrinsic and pragma forms follows.
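
  A short sketch of the two explicit forms; the array a, the loops, and the distance PF_DIST are illustrative, and the best distance is workload-dependent.

    #include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */

    #define PF_DIST 64       /* illustrative prefetch distance, in elements */

    void scale_intrinsic(float *a, int n, float s)
    {
        for (int i = 0; i < n; i++) {
            /* Intrinsic form: you choose the distance and the cache-level hint. */
            _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_T1);
            a[i] *= s;
        }
    }

    void scale_pragma(float *a, int n, float s)
    {
        /* Pragma form: name what to prefetch; the compiler picks the distance. */
        #pragma prefetch a
        for (int i = 0; i < n; i++)
            a[i] *= s;
    }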

  23. Streaming Store
  • Avoids read-for-ownership for certain memory write operations
  • Bypasses the prefetch associated with the memory read
  • Use #pragma vector nontemporal (v1, v2, …) to drop a hint to the compiler
  • Without streaming stores: 448 bytes read/written per iteration
  • With streaming stores: 320 bytes read/written per iteration
    • Relieves bandwidth pressure; improves cache utilization

    for (int chunkBase = 0; chunkBase < OptPerThread; chunkBase += CHUNKSIZE)
    {
    #pragma simd vectorlength(CHUNKSIZE)
    #pragma vector aligned
    #pragma vector nontemporal (CallResult, PutResult)
        for (int opt = chunkBase; opt < (chunkBase+CHUNKSIZE); opt++)
        {
            float CNDD1;
            float CNDD2;
            float CallVal = 0.0f, PutVal = 0.0f;
            float T = OptionYears[opt];
            float X = OptionStrike[opt];
            float S = StockPrice[opt];
            ……
            CallVal = S * CNDD1 - XexpRT * CNDD2;
            PutVal  = CallVal + XexpRT - S;
            CallResult[opt] = CallVal;
            PutResult[opt]  = PutVal;
        }
    }

  • -vec-report6 displays the compiler action:
    bs_test_sp.c(215): (col. 4) remark: vectorization support: streaming store was generated for CallResult.
    bs_test_sp.c(216): (col. 4) remark: vectorization support: streaming store was generated for PutResult.

  24. Data Blocking

  25. Data Blocking
  • Partition data into small blocks that fit in the L2 cache
  • Exploit data reuse in the application
    • Ensure the data remains in the cache across multiple uses
    • Using the data in cache removes the need to go to memory
    • A bandwidth-limited program may then execute at its FLOPS limit
  • Simple 1D case
    • A data set of size DATA_N is used WORK_N times by hundreds of threads
    • Each thread handles one piece of work and has to traverse all the data
  • Without blocking
    • Hundreds of threads pound on different areas of the DATA_N data set
    • The memory interconnect limits performance

    #pragma omp parallel for
    for (int wrk = 0; wrk < WORK_N; wrk++)
    {
        initialize_the_work(wrk);
        for (int ind = 0; ind < DATA_N; ind++)
        {
            dataptr datavalue = read_data(ind);
            result = compute(datavalue);
            aggregate = combine(aggregate, result);
        }
        postprocess_work(aggregate);
    }

  • With blocking
    • A cacheable block of BSIZE data is processed by all threads at a time
    • Each block is read once and reused until all threads are done with it

    for (int BBase = 0; BBase < DATA_N; BBase += BSIZE)
    {
        #pragma omp parallel for
        for (int wrk = 0; wrk < WORK_N; wrk++)
        {
            initialize_the_work(wrk);
            for (int ind = BBase; ind < BBase+BSIZE; ind++)
            {
                dataptr datavalue = read_data(ind);
                result = compute(datavalue);
                aggregate[wrk] = combine(aggregate[wrk], result);
            }
            postprocess_work(aggregate[wrk]);
        }
    }

  26. Blocking in Monte Carlo European Options
  • Each thread runs Monte Carlo using all the random numbers
    • The random numbers are too big to fit in each thread's share of L2
    • Random number size: RAND_N * sizeof(float) = 1 MB, i.e. 256K floats
    • Each thread's share of L2 is 512 KB / 4 = 128 KB; effectively 64 KB, or 16K floats
  • Without blocking:
    • Each thread makes an independent pass over the RAND_N data for all the options it runs
    • The interconnect is busy satisfying read requests from different threads at different points
    • Prefetched data from different points also saturates the memory bandwidth
  • With blocking:
    • The random numbers are partitioned into cacheable blocks
    • A block is brought into cache once the previous block has been processed by all threads
    • It remains in the caches until all threads have finished their passes over it
    • Each thread repeatedly reuses the data in cache for all the options it runs

    const int nblocks = RAND_N/BLOCKSIZE;
    for (int block = 0; block < nblocks; ++block)
    {
        vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, Randomstream,
                      BLOCKSIZE, random, 0.0f, 1.0f);
        #pragma omp parallel for
        for (int opt = 0; opt < OPT_N; opt++)
        {
            float VBySqrtT = VLOG2E * sqrtf(T[opt]);
            float MuByT    = RVVLOG2E * T[opt];
            float Sval = S[opt];
            float Xval = X[opt];
            float val = 0.0, val2 = 0.0;
            #pragma vector aligned
            #pragma simd reduction(+:val) reduction(+:val2)
            #pragma unroll(4)
            for (int pos = 0; pos < BLOCKSIZE; pos++)
            {
                … … …
                val  += callValue;
                val2 += callValue * callValue;
            }
            h_CallResult[opt]     += val;
            h_CallConfidence[opt] += val2;
        }
    }
    #pragma omp parallel for
    for (int opt = 0; opt < OPT_N; opt++)
    {
        const float val   = h_CallResult[opt];
        const float val2  = h_CallConfidence[opt];
        const float exprt = exp2f(-RLOG2E*T[opt]);
        h_CallResult[opt] = exprt * val * INV_RAND_N;
        … … …
        h_CallConfidence[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM);
    }

  27. Data Blocking – Step 5 Lab 3/3

  28. Step 5: Data Blocking
  • 1D data blocking applied to the random numbers
    • L2: 512 KB per core, 128 KB per thread; your application can use about 64 KB
    • BLOCKSIZE = 16*1024
  • Add an initialization loop before the triple-nested loops:

    #pragma omp parallel for
    for (int opt = 0; opt < OPT_N; opt++)
    {
        h_CallResult[opt]     = 0.0f;
        h_CallConfidence[opt] = 0.0f;
    }

  • Wrap the existing code in a loop over blocks and change the inner loop bound from RAND_N to BLOCKSIZE:

    const int nblocks = RAND_N/BLOCKSIZE;
    for (block = 0; block < nblocks; ++block)
    {
        vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, Randomstream,
                      BLOCKSIZE, random, 0.0f, 1.0f);
        <<<Existing Code>>>
        // Save the intermediate results here
        h_CallResult[opt]     += val;
        h_CallConfidence[opt] += val2;
    }

  • Move the loop that calculates the per-option results from the middle loop to outside the loop:

    #pragma omp parallel for
    for (int opt = 0; opt < OPT_N; opt++)
    {
        const float val   = h_CallResult[opt];
        const float val2  = h_CallConfidence[opt];
        const float exprt = exp2f(-RLOG2E*T[opt]);
        h_CallResult[opt] = exprt * val * INV_RAND_N;
        const float stdDev = sqrtf((F_RAND_N * val2 - val * val) * STDDEV_DENOM);
        h_CallConfidence[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM);
    }

  29. Run Your Program on the Intel® Xeon Phi™ Coprocessor
  • Build the native Intel® Xeon Phi™ Coprocessor application
    • Change the Makefile and use -mmic in lieu of -xAVX
  • Copy the program from host to coprocessor
    • Find out the host name using hostname (suppose it returns esg014)
    • Copy the executable using scp MonteCarlo esg014-mic0:
  • Establish an execution environment: ssh esg014-mic0
    • Set the environment variable: export LD_LIBRARY_PATH=.
  • Invoke the program: ./MonteCarlo

  30. Summary

  31. Summary
  • Base-2 exponential and logarithm functions are always faster than other bases on Intel multicore and manycore platforms
  • Set thread affinity when you use OpenMP*
  • Let the hardware prefetcher work for you; fine-tune your loops with prefetch pragmas/directives
  • Optimize your data access and rearrange your computation around the data already in cache
