
Optimization Methodology on a Xeon and Xeon Phi Hybrid Cluster for POP (Parallel Ocean Program)

This case study explores an optimization methodology for the Parallel Ocean Program (POP) on a Xeon and Xeon Phi hybrid cluster. It covers load balance, fine-grained OpenMP parallelization, the CA-PCG algorithm, stencil optimization, and vectorization techniques.



Presentation Transcript


  1. Case Study --- Optimization Methodology on a Xeon and Xeon Phi Hybrid Cluster for POP (Parallel Ocean Program) Xiao Bin (肖斌), First Institute of Oceanography, State Oceanic Administration; Lin Junmin (林隽民), Intel Corporation (Beijing)

  2. Contents • Introduction to POP • Optimization Steps • Load balance • Fine-grained OpenMP parallelization • CA-PCG algorithm • Stencil opt • Vectorization • Conclusions

  3. Introduction to POP • Ocean circulation simulation • Climate simulation • Short-term climate prediction • [Figure: schematic of FIO-ESM]

  4. Introduction to POP • The ocean component of CESM1.0 is the Parallel Ocean Program version 2 (POP2). This model is based on POP version 2.1 from Los Alamos National Laboratory; it includes many physical and software developments incorporated by members of the Ocean Model Working Group.

  5. Introduction to POP • POP vs. other ocean models (MOM5, NEMO) • [Diagram: POP uses a block loop that works on one block at a time through all phases, keeping data in cache and RAM and exploiting SIMD; MOM runs separate whole-domain loops (barotropic, vertical mixing, baroclinic, surface flux), each fetching memory, i.e. 4 memory sweeps per step. Disadvantage noted in the comparison: increasing MPI communication.]

  6. Optimization Steps: Analysis • Bench01d analysis: • Use Intel VTune to identify hotspots and microarchitecture behavior. • Use Intel ITAC to analyze MPI performance. • Workload: 3600 x 2400 x 40, a state-of-the-art scale. • Platform: Xeon + Xeon Phi. • Optimization techniques: • Load balance • Fine-grained OpenMP parallelization • CA-PCG algorithm • Stencil optimization • Vectorization • [Cluster diagram: Node1, Node2, Node3, Node4, Node5, Node6… connected by the network, with a login and task-assignment service; each node contains cache, RAM, and SIMD units.]

  7. Optimization Steps: Step 1 Load balance • Motivation • Load imbalance between Xeon MPI ranks: caused by the elimination of land blocks. • Load imbalance between Xeon and Xeon Phi ranks: caused by the even distribution policy. • Goal: make POP work in Xeon + Xeon Phi symmetric mode. • Approach • Distribute effective (non-land) blocks evenly across Xeon MPI ranks. • Distribute fewer effective blocks to each Xeon Phi MPI rank, in proportion to its computational power and subject to the memory capacity of the Xeon Phi. • Other considerations: minimize the circumference-to-area ratio, and surround each Xeon Phi block with Xeon blocks on the same node. (A minimal sketch of the distribution idea follows below.)
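A minimal sketch of the distribution idea described above, under simplifying assumptions (the block list, rank counts, and the phi_ratio throughput factor are illustrative; the memory-capacity and circumference/area considerations are not modeled): non-land blocks are counted, a smaller share is given to the Xeon Phi rank in proportion to its assumed computational power, and the remainder is distributed round-robin across the Xeon ranks.

    ! Sketch only: not the actual POP block-distribution code.
    program block_distribution_sketch
       implicit none
       integer, parameter :: nblocks = 24        ! total blocks (hypothetical)
       integer, parameter :: nranks_xeon = 4     ! Xeon MPI ranks (hypothetical)
       logical :: is_land(nblocks)
       integer :: owner(nblocks)                 ! owning rank per block; 0 = Xeon Phi, -1 = land
       real    :: phi_ratio = 0.3                ! assumed Phi/Xeon throughput ratio
       integer :: ib, n_ocean, n_phi, assigned

       ! Mark every 6th block as land for demonstration.
       is_land = .false.
       is_land(6::6) = .true.

       n_ocean = count(.not. is_land)
       ! Give the Xeon Phi rank a share proportional to its assumed speed.
       n_phi   = int(phi_ratio * n_ocean / (nranks_xeon + phi_ratio))

       assigned = 0
       owner = -1
       do ib = 1, nblocks
          if (is_land(ib)) cycle                 ! land blocks are never assigned
          assigned = assigned + 1
          if (assigned <= n_phi) then
             owner(ib) = 0                       ! first n_phi ocean blocks go to Xeon Phi
          else
             owner(ib) = 1 + mod(assigned - n_phi - 1, nranks_xeon)  ! round-robin on Xeon
          end if
       end do

       print '(a, i0, a, i0)', 'ocean blocks: ', n_ocean, ', assigned to Xeon Phi: ', n_phi
       print '(a, 24i3)', 'owner map: ', owner
    end program block_distribution_sketch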

  8. Intel Trace Analyzer • POP with bench01 before load balance tuning

  9. A New Block Distribution Algorithm • Demonstration (assume L = 12) • Legend: • Green: land block • Blue: unallocated ocean block • Orange: ocean block allocated to Xeon • Red: ocean block allocated to Xeon Phi • Number: the local index of the block within its rank • [Figure: the domain is scanned strip by strip during allocation]

  10. Intel Trace Analyzer • POP with bench01 after load balance tuning

  11. Optimization Steps: Step 2 Fine-grained OpenMP parallelization on Xeon Phi • Motivation • Memory-bound behavior that depends on the block size. • Poor temporal locality with the default block size when using inter-block (coarse-grained) OpenMP parallelization, especially for baroclinic. • Approach • Distribute a normal-size block to each Xeon Phi process, then divide it into many tiny blocks and thread over them (inter-tiny-block, i.e. fine-grained, OpenMP); a minimal sketch follows below. • Computation: as normal. • Communication: aggregate tiny blocks so that they participate in the normal-size communication. • Implementation – modules to modify: • Block creation and block distribution • Communication interfaces: global reduction, gather and scatter, boundary (halo) update
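A minimal sketch of the inter-tiny-block threading idea, not the actual POP implementation (block and tiny-block sizes are illustrative, and the stand-in update replaces the real baroclinic computation): a normal-size block owned by a Xeon Phi process is split into tiny blocks, and OpenMP threads iterate over tiny blocks rather than over rows of one large block, which keeps each thread's working set small.

    ! Sketch only: illustrates fine-grained (inter-tiny-block) OpenMP threading.
    program fine_grained_omp_sketch
       implicit none
       integer, parameter :: nx = 240, ny = 240          ! normal block size (hypothetical)
       integer, parameter :: tx = 30,  ty = 30           ! tiny-block size (hypothetical)
       integer, parameter :: ntix = nx/tx, ntiy = ny/ty  ! tiny blocks per direction
       real(kind=8) :: field(nx, ny), work(nx, ny)
       integer :: tb, tbi, tbj, i, j, i0, j0

       call random_number(field)
       work = 0.0d0

       ! Each OpenMP thread processes whole tiny blocks, which keeps its working
       ! set small and improves temporal locality on Xeon Phi.
       !$omp parallel do private(tbi, tbj, i0, j0, i, j) schedule(dynamic)
       do tb = 1, ntix*ntiy
          tbi = mod(tb-1, ntix)                 ! tiny-block coordinates within the block
          tbj = (tb-1)/ntix
          i0  = tbi*tx
          j0  = tbj*ty
          do j = j0+1, j0+ty
             do i = i0+1, i0+tx
                work(i,j) = 0.5d0*field(i,j)    ! stand-in for the baroclinic work
             end do
          end do
       end do
       !$omp end parallel do

       print *, 'checksum:', sum(work)
    end program fine_grained_omp_sketch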

  12. Results of fine-grained OpenMP parallelization • Performance summary • Baroclinic improvement: the time spent on baroclinic is reduced from ~140 seconds to ~60 seconds. • The performance gap between a single Xeon Phi process and a Xeon process is narrowed: Xeon/Xeon Phi < 3x (was ~7x).

  13. Results of fine-grained OpenMP parallelization • Load balance analysis by Intel Trace Analyzer • POP with bench01 on Xeon + Xeon Phi; Xeon: 28 MPI ranks, Xeon Phi: 28 MPI ranks x 8 OpenMP threads per node; 4 nodes. Average blocks per rank: Xeon = 3, Xeon Phi = 1. • [Trace view panels: Xeon, Xeon Phi]

  14. Optimization Steps: Step 3 CA-PCG algorithm • Motivation • The barotropic integration shows intense global MPI communication, caused by the frequent global reduction operations in the traditional PCG solver. This can become a bottleneck when POP runs on large numbers of nodes. • Approach • Mathematically, the CA-PCG solver makes it possible to reduce global communication; in theory, the new algorithm also offers more opportunity to improve data locality.

  15. Introduction to CA-PCG • Communication-Avoiding Krylov Subspace Methods (CA-KSMs) were developed by Prof. Demmel's group at UC Berkeley • KSMs rely on sparse matrix-vector multiplies and vector inner products: • communication-bound → communicate once every s steps • cyclic data dependence → break it up • Communication-Avoiding Preconditioned Conjugate Gradient (CA-PCG): • Proposed in [1], but somewhat complicated; • CA-CG derived with new insight in [2], but no CA-PCG described; • Our work: derive CA-PCG following the approach in [2] (a PCG sketch highlighting the global reductions that CA-PCG avoids follows below); • Competitor: P-CSI ([4])
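As a reference point for what CA-PCG avoids, below is a minimal serial PCG sketch (not the POP solver code), assuming a Jacobi (diagonal) preconditioner and an illustrative tridiagonal test matrix: each dot_product marks an inner product that becomes an MPI_Allreduce in the parallel barotropic solver, and CA-PCG reformulates the recurrences so that s iterations can share a single global reduction.

    ! Sketch only: standard PCG with a Jacobi preconditioner on a toy SPD system.
    program pcg_sketch
       implicit none
       integer, parameter :: n = 100
       real(kind=8) :: A(n,n), b(n), x(n), r(n), z(n), p(n), q(n)
       real(kind=8) :: rz, rz_old, alpha, beta
       integer :: i, iter

       ! Simple SPD tridiagonal system: 4 on the diagonal, -1 off-diagonal.
       A = 0.0d0
       do i = 1, n
          A(i,i) = 4.0d0
          if (i > 1) A(i,i-1) = -1.0d0
          if (i < n) A(i,i+1) = -1.0d0
       end do
       b = 1.0d0
       x = 0.0d0

       r  = b - matmul(A, x)
       z  = r / 4.0d0                       ! apply M^{-1} (Jacobi: diagonal = 4)
       p  = z
       rz = dot_product(r, z)               ! global reduction in parallel PCG

       do iter = 1, 200
          q     = matmul(A, p)              ! sparse matvec; needs a halo update in parallel
          alpha = rz / dot_product(p, q)    ! global reduction in parallel PCG
          x     = x + alpha*p
          r     = r - alpha*q
          z     = r / 4.0d0
          rz_old = rz
          rz    = dot_product(r, z)         ! global reduction in parallel PCG
          if (sqrt(4.0d0*rz) < 1.0d-10) exit   ! ||r|| = sqrt(4*r.z) since M = 4I
          beta  = rz / rz_old
          p     = z + beta*p
       end do

       print '(a,i0,a,es12.4)', 'iterations: ', iter, '   residual: ', sqrt(4.0d0*rz)
    end program pcg_sketch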

  16. References • [1] M. Hoemmen. Communication-Avoiding Krylov Subspace Methods. PhD thesis, University of California, Berkeley, 2010. • [2] E. Carson. Communication-Avoiding Krylov Subspace Methods in Theory and Practice. PhD thesis, University of California, Berkeley, 2015. • [3] E. Carson and J. Demmel. A Residual Replacement Strategy for Improving the Maximum Attainable Accuracy of s-step Krylov Subspace Methods. SIAM J. Matrix Anal. Appl., 35(1), 2014, pp. 22-43. • [4] Y. Hu, X. Huang, et al. Improving the Scalability of the Ocean Barotropic Solver in the Community Earth System Model. SC 2015.

  17. PCG and CA-PCG Algorithms • [Figure: side-by-side listings of the PCG and CA-PCG algorithms]

  18. Sample codes • Standalone code samples of PCG and CA-PCG were built; this required reconstructing the data environment, infrastructure, dependence relationships, and parallel I/O. In total, 22 related files are used.

  19. Communication trace by ITAC • MPI communication in sample-code runs; Xeon: 28 MPI ranks, 2 nodes

  20. Performance results • PCG: 20.97 s, CA-PCG: 13.76 s (≈1.5x speedup)

  21. Optimization Steps: Step 4 Stencil opt • Stencil codes are a class of iterative kernels that update array elements according to a fixed pattern, called a stencil. They are most commonly found in computer simulation codes, e.g. computational fluid dynamics in scientific and engineering applications. Other notable examples include solving partial differential equations, the Jacobi kernel, the Gauss–Seidel method, image processing, and cellular automata. The regular structure of the arrays sets stencil codes apart from other modeling methods such as the finite element method. [Figure: data dependencies of a selected cell in the 2D array]

  22. Approach • Two-dimensional to one-dimensional index transformation of the stencil code to avoid gather/scatter vector instructions.

    (before)
    do j=this_block%jb,this_block%je
       do i=this_block%ib,this_block%ie
          AX(i,j,bid) = A0 (i  ,j  ,bid)*X(i  ,j  ,bid) + &
                        AN (i  ,j  ,bid)*X(i  ,j+1,bid) + &
                        AN (i  ,j-1,bid)*X(i  ,j-1,bid) + &
                        AE (i  ,j  ,bid)*X(i+1,j  ,bid) + &
                        AE (i-1,j  ,bid)*X(i-1,j  ,bid) + &
                        ANE(i  ,j  ,bid)*X(i+1,j+1,bid) + &
                        ANE(i  ,j-1,bid)*X(i+1,j-1,bid) + &
                        ANE(i-1,j  ,bid)*X(i-1,j+1,bid) + &
                        ANE(i-1,j-1,bid)*X(i-1,j-1,bid)
       end do
    end do

    (after)
    do j=this_block%jb,this_block%je
       do i=this_block%ib,this_block%ie
          AX(i+(j-1)*nx_block+(bid-1)*nx_block*ny_block) = &
             A0(i+(j-1)*nx_block+(bid-1)*nx_block*ny_block) * &
             X (i+(j-1)*nx_block+(bid-1)*nx_block*ny_block) + &
             ...
       end do
    end do

  23. Intel Trace Analyzer • Results of the stencil optimization sample code on Xeon Phi • Improves performance by ~5%

  24. Intel Trace Analyzer • Results in POP • Improves overall performance by ~6%

  25. Optimization Steps: Step 5 Vectorization in the KPP vertical mixing scheme • POP with benchX1/01 (on Xeon Phi) • Vectorization: in the (i,j,k) loop nest, the innermost k loop has a strong loop-carried dependence (k+1 depends on k in one sweep, k depends on k+1 in the other), and its loop counter is not constant, which prevents vectorization (threading is still possible). • Source code: functions impvmixt, impvmixt_correct, impvmixu. • Approach: vectorize over the (i:i+VLEN-1, j, k) points, here with VLEN = 8.

  26. Vtune Trace Data Analysis POP with bench01 on Xeon Phi (in Xeon+Xeon Phi mode)

  27. Code demonstration • POP with benchX1/01 (on Xeon Phi)

    (before)
    do j=this_block%jb,this_block%je
       do i=this_block%ib,this_block%ie
          ! Initialization for k loop
          do k=2,KMT(i,j,bid)
             ...
          enddo
          do k=KMT(i,j,bid)-1,1,-1
             ...
          enddo
       enddo !i
    enddo !j

    (after)
    do j=this_block%jb,this_block%je
       do i=this_block%ib,this_block%ie-VLEN+1,VLEN
          k_min = minval(KMT(i:i+VLEN-1,j,bid))
          k_max = maxval(KMT(i:i+VLEN-1,j,bid))
          ! Initialization for k loop
          do k=2,k_max (or k_min)
             ! vectorization on i1=i,i+VLEN-1
             ...
          enddo
          do k=k_max (or k_min),1,-1
             ! vectorization on i1=i,i+VLEN-1
             ...
          enddo
       enddo !i
    enddo !j

Note: the iterations between k_min and k_max can be executed either sequentially or as masked vector operations (a sketch follows below). • Performance improvement: overall 4.9% with benchX1.
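Below is a hypothetical sketch of the masked-vectorization option (the array names, sizes, and the stand-in recurrence are illustrative, not the actual impvmixt code): VLEN columns are advanced together in k, and a merge mask keeps columns whose depth KMT is shallower than the current level unchanged, so the compiler can vectorize over i despite the k recurrence.

    ! Sketch only: masked vectorization over i of a k-recurrent update.
    program kpp_vector_sketch
       implicit none
       integer, parameter :: nx = 64, nk = 40, VLEN = 8
       integer :: KMT(nx)                  ! number of active vertical levels per column
       real(kind=8) :: phi(nx, nk)         ! stand-in for the tracer being mixed
       real(kind=8) :: upd(VLEN)
       integer :: i0, k, k_max

       KMT = 25
       KMT(1:nx:3) = 40                    ! uneven depths for demonstration
       call random_number(phi)

       do i0 = 1, nx - VLEN + 1, VLEN
          k_max = maxval(KMT(i0:i0+VLEN-1))
          do k = 2, k_max
             ! Vector update over VLEN columns; the recurrence stays in k.
             upd = 0.5d0*(phi(i0:i0+VLEN-1, k) + phi(i0:i0+VLEN-1, k-1))
             ! Mask: keep the old value where this column has no level k.
             phi(i0:i0+VLEN-1, k) = merge(upd, phi(i0:i0+VLEN-1, k), &
                                          k <= KMT(i0:i0+VLEN-1))
          end do
       end do

       print *, 'checksum:', sum(phi)
    end program kpp_vector_sketch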

  28. Vtune Trace Data Analysis • Results of KPP vectorization in sample code

  29. Vtune Trace Data Analysis

  30. Vtune Trace Data Analysis

  31. Conclusions 1. The load balance of POP is improved by the newly designed distribution algorithm; as a result, the efficiency of POP on the Xeon and Xeon Phi platform is improved. 2. Fine-grained optimization is achieved by introducing tiny blocks on MIC processes; the tiny-block design significantly improves POP's efficiency on MIC by exposing more parallelism. 3. Mathematical algorithm optimization is achieved by adopting the recently published CA-PCG algorithm in the barotropic solver; this state-of-the-art algorithm significantly reduces global MPI communication and provides more opportunities for data-locality optimization. 4. Microarchitecture optimizations, including the stencil and vectorization optimizations, contribute substantially to the total improvement, as shown in both POP runs and sample-code runs.
