
Jacobi solver status




  1. Jacobi solver status Lucian Anton, Saif Mulla, Stef Salvini CCP_ASEARCH meeting, October 8, 2013, Daresbury

  2. Outline • Code structure • Front end • Numerical kernels • Data collection • Performance data • Intel SB • Xeon Phi • BlueGeneQ • GPU Jacobi test program

  3. Code structure • Read input from command line: grid sizes, length of iteration block, number of iteration blocks, …; algorithm to use; output format (header, test iterations, …) • Initialize grid with an eigenmode of the Jacobi smoother • Run several iteration blocks • Collect min, max, average times.

  4. Build model Uses a generic Makefile + platform/*.inc files:
F90 := source /opt/intel/composerxe/bin/compilervars.sh intel64 && \
       source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh && mpiifort
CC  := source /opt/intel/composerxe/bin/compilervars.sh intel64 && \
       source /opt/intel/impi/4.1.0/intel64/bin/mpivars.sh && icc
LANG = C
ifdef USE_MIC
  FMIC = -mmic
endif
ifdef USE_MPI
  FMPI = -DUSE_MPI
endif
ifdef USE_DOUBLE_PRECISION
  DOUBLE = -DUSE_DOUBLE_PRECISION
endif
ifdef USE_VEC1D
  VEC1D = -DUSE_VEC1D
endif
#FC = module add intel/comp intel/mpi && mpiifort

  5. Command line parameters
arcmport01:~/Projects/HOMB> ./homb_c_gcc_debug_gpu.exe -help
Usage: [-ng <grid-size-x> <grid-size-y> <grid-size-z>]
       [-nb <block-size-x> <block-size-y> <block-size-z>]
       [-np <num-proc-x> <num-proc-y> <num-proc-z>]
       [-niter <num-iterations>] [-biter <iterations-block-size>]
       [-malign <memory-alignment>] [-v] [-t] [-pc]
       [-model <model_name> [num-waves] [threads-per-column]]
       [-nh] [-help]
arcmport01:~/Projects/HOMB> ./homb_c_gcc_debug_gpu.exe -model help
possible values for model parameter:
  baseline
  baseline-opt
  blocked
  wave num-waves threads-per-column
  basegpu
  optgpu
Note for wave model: if threads-per-column == 0 the diagonal wave kernel is used.

  6. README file Full explanations of the command line options are provided in README:
The following flags can be used to set the grid sizes and other run parameters:
-ng <nx> <ny> <nz>  set the global grid sizes
-nb <bx> <by> <bz>  set the computational block size, relevant only for the blocked model
Notes: 1) no sanity checks are done, you are on your own. 2) for the blocked model the OpenMP parallelism is done over computational blocks; one must ensure that there is enough work for all threads by setting suitable block sizes.

  7. Correctness check The -t flag checks that the norm ratios are close to the Jacobi smoother eigenvalue:
arcmport01:~/Projects/HOMB> ./homb_c_gcc_debug_gpu.exe -t -niter 7
Correctness check
iteration, norm ratio, deviation from eigenvalue
0 6.36918e+01 6.26966e+01
1 9.95185e-01 2.55054e-08
2 9.95185e-01 1.50473e-08
3 9.95185e-01 2.57243e-08
4 9.95185e-01 3.27436e-08
5 9.95185e-01 1.96427e-08
6 9.95185e-01 3.17978e-08
# Last norm 6.187368259733268e+01
#=========================================================#
# NThs  Nx  Ny  Nz  NITER  minTime    meanTime   maxTime
#=========================================================#
  8     33  33  33  1      1.299e-04  1.487e-04  1.690e-04

  8. Algorithms • Basic three-loop iteration over the grid • OpenMP parallelism applied to the external loop • If condition eliminated from the inner loop • Blocked iterations • Wave iterations

  9. Algorithms: wave details [Figure: wave scheme in the Z–Y plane, showing which planes hold new vs. old values as the wave sweeps the grid]

  10. Algorithms: helping vectorisation The inner loop can be replaced with an easier-to-vectorize function:
// 1D loop that helps the compiler to vectorize
static void vec_oneD_loop(const int n, const Real uNorth[], const Real uSouth[],
                          const Real uWest[], const Real uEast[],
                          const Real uBottom[], const Real uTop[], Real w[]) {
  int i;
#ifdef __INTEL_COMPILER
#pragma ivdep
#endif
#ifdef __IBMC__
#pragma ibm independent_loop
#endif
  for (i = 0; i < n; ++i)
    w[i] = sixth * (uNorth[i] + uSouth[i] + uWest[i] +
                    uEast[i] + uBottom[i] + uTop[i]);
}

  11. Algorithms: CUDA • Base laplace3D (from Mike’s lecture notes) • Shared memory in XY plane • … more to come

  12. Data collection With such a large parameter space we have a big-ish data problem. Bash script + gnuplot:
index=0
for exe in $exe_list
do
  for model in $model_list
  do
    for nth in $threads_list
    do
      export OMP_NUM_THREADS=$nth
      for ((linsize=10; linsize <= max_linsize; linsize += step))
      do
        biter=$(((10*max_linsize)/linsize))
        niter=5
        if [ "$model" = wave ]
        then
          nwave="$biter $((nth<biter?nth:biter))"
          echo "model $model $nwave"
        else
          nwave=""
        fi
        if [ "$blk_x" -eq 0 ] ; then blk_xt=$linsize ; else blk_xt=$blk_x ; fi
        if [ "$blk_y" -eq 0 ] ; then blk_yt=$linsize ; else blk_yt=$blk_y ; fi
        if [ "$blk_z" -eq 0 ] ; then blk_zt=$linsize ; else blk_zt=$blk_z ; fi
        echo "./"$exe" -ng $linsize $linsize $linsize -nb $blk_xt $blk_yt $blk_zt -model $model $nwave

  13. SandyBridge baseline

  14. SB: blocked and wave

  15. BGQ

  16. Xeon Phi vs SandyBridge

  17. Fermi data

  18. Conclusions & To do • We have an integrated set of Jacobi smoother algorithms • OpenMP, CUDA, MPI (almost) • Flexible build system • Run parameters can be selected from the command line and preprocessor flags • Correctness check • Scripted data collection • README file • Tested on several systems (iDataPlex, BGQ, Emerald, …, macOS laptop) • GPU needs further improvements • ….
