Presentation Transcript


  1. Supporting Applications Involving Irregular Accesses and Recursive Control Flow on Emerging Parallel Environments • Xin Huo • Department of Computer Science and Engineering • The Ohio State University • Advisor: Prof. Gagan Agrawal

  2. Motivation - Parallel Architectures • Many-core architectures: GPU vs. Intel Xeon Phi • Programming Model • GPU: CUDA/OpenCL • Xeon Phi: all x86-based programming models • Running Method • GPU: offload method • Xeon Phi: offload or native method • Memory Configuration • GPU: configurable shared memory + L2 cache • Xeon Phi: coherent L2 cache • SIMT vs. SIMD • SIMT: supports non-contiguous and non-aligned accesses and control dependencies automatically • SIMD: no direct support for non-contiguous, non-aligned accesses or control/data dependencies

  3. Motivation - Parallel Architectures • Heterogeneous Architecture • CPU with a decoupled GPU: separate host memory and GPU memory (distinct physical memories), connected over the PCIe bus • CPU with a coupled GPU (Fusion APU): x86 CPU cores and GPU engine arrays share one physical memory through a common memory controller

  4. Motivation - Application Patterns • Irregular / Unstructured Reduction • A dwarf in the Berkeley view on parallel computing (Molecular Dynamics and Euler) • Challenges in parallelism: heavy data dependencies • Challenges in memory performance: indirect memory accesses result in poor data locality • Recursive Control Flow • Conflict between SIMD architecture and control dependencies • Recursion support: SSE (no), OpenCL (no), CUDA (yes) • Generalized Reduction and Stencil Computation • Need to be reconsidered when harnessing thousands of threads

  5. Thesis Work • Parallel Architectures: GPU, CPU + GPU, APU, Xeon Phi • Application Patterns: Generalized Reduction, Irregular Reduction, Stencil Computation, Recursive Application

  6. Thesis Work • Parallel Architectures: GPU, CPU + GPU, APU, Xeon Phi • Application Patterns: Generalized Reduction, Irregular Reduction, Stencil Computation, Recursive Application • Approaches for Parallelizing Reductions on GPUs (HiPC 2010) • An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs (ICS 2011) • Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations (HiPC 2011) • Runtime Support for Accelerating Applications on an Integrated CPU-GPU Architecture (SC 2012) • Efficient Scheduling of Recursive Control Flow on GPUs (ICS 2013) • A Programming System for Xeon Phis with Runtime SIMD Parallelization (ICS 2014)

  7. Outline • Latest work • A Programming System for Xeon Phis with Runtime SIMD Parallelization • Other work • Strategy and runtime support for irregular reductions on GPUs • Improve parallelism of SIMD for recursive applications on GPUs • Different strategies for generalized reduction on GPU • Task scheduling frameworks • Decoupled GPU + CPU • Coupled GPU + CPU • Conclusion

  8. Intel Xeon Phi Architecture

  9. SIMD - Single Instruction Multiple Data
  Scalar loop:
      for(i = 0; i < n; ++i)
          A[i] = A[i] + B[i];
  SIMD loop (vector width 4):
      for(i = 0; i < n; i += 4)
          A[i:(i+4)] = A[i:(i+4)] + B[i:(i+4)];
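  For concreteness, here is a minimal sketch (not from the slides) of how the SIMD loop above maps to 512-bit vector intrinsics; it assumes AVX-512-style intrinsics and that n is a multiple of 16:

      #include <immintrin.h>

      // Element-wise add, 16 floats per 512-bit register (assumes n % 16 == 0).
      void vec_add(float *A, const float *B, int n) {
          for (int i = 0; i < n; i += 16) {
              __m512 va = _mm512_loadu_ps(&A[i]);                // load 16 floats of A
              __m512 vb = _mm512_loadu_ps(&B[i]);                // load 16 floats of B
              _mm512_storeu_ps(&A[i], _mm512_add_ps(va, vb));    // lane-wise add and store
          }
      }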

  10. MMX, SSE, AVX and IMCI Instruction Sets

  11. GPU vs. Xeon Phi • [Comparison figure; the slide cites the figures 244 hardware threads and 3904 total SIMD lanes]

  12. Contributions • MIMD and SIMD Programming System • MIMD Parallelism • Pattern-knowledge-based parallel framework • Task partitioning and scheduling strategies • Data reorganization to help SIMD parallelism • Automatic MIMD API • SIMD Parallelism • Automatic SIMD API • Control-flow dependency resolution • Data dependency resolution

  13. MIMD and SIMD Challenges for Different Patterns

  14. MIMD Parallel System • Task Definition: Task Type, Parameter Type, Kernel Function • MIMD Parallel Framework: Configuration, Data Reorganization, Task Partitioner, Job Scheduler • The Task Partitioner divides a task into jobs, which the Job Scheduler dispatches for execution

  15. MIMD Parallel System • Task Partitioner • Computation Space Partitioning • Achieves good load balance • Needs data replication or locking to resolve data dependencies • Used for Generalized Reduction and Stencil Computation • Reduction Space Partitioning • No data dependencies between partitions • Introduces computation redundancy and load imbalance, which can be reduced by appropriate partitioning and reordering methods • Used for Irregular Reduction • Job Scheduler • Static, dynamic, or user-defined strategies • API Interface • Easy interface • run(Task): register a Task with the MIMD framework and start execution • join(): block and wait for execution to finish
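  As a rough illustration of this run()/join() interface, here is a self-contained sketch assuming a simple static partitioning of the computation space across std::thread workers; only the run() and join() names come from the slide, while the class name, parameters, and lambda kernel are illustrative:

      #include <algorithm>
      #include <functional>
      #include <thread>
      #include <vector>

      // Illustrative MIMD framework skeleton (not the dissertation's implementation).
      class MIMDFramework {
          std::vector<std::thread> workers;
      public:
          // run(): partition the iteration space [0, n) into per-thread jobs and start them
          void run(int n, int num_threads, const std::function<void(int, int)> &kernel) {
              int chunk = (n + num_threads - 1) / num_threads;
              for (int t = 0; t < num_threads; ++t) {
                  int begin = t * chunk, end = std::min(n, begin + chunk);
                  if (begin < end)
                      workers.emplace_back(kernel, begin, end);
              }
          }
          // join(): block until every job has finished
          void join() {
              for (auto &w : workers) w.join();
              workers.clear();
          }
      };

      // Usage: an element-wise add as the task kernel
      int main() {
          std::vector<float> A(1024, 1.0f), B(1024, 2.0f);
          MIMDFramework fw;
          fw.run(1024, 4, [&](int begin, int end) {
              for (int i = begin; i < end; ++i) A[i] += B[i];
          });
          fw.join();
          return 0;
      }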

  16. SIMD Data Type

  17. SIMD Data Type • vint v; - declares a vector of integers, one value per SIMD lane • vint array[2]; - arrays of vector types (figure: lanes v0-0..v0-3 and v1-0..v1-3) • vint v = s; - broadcasts the scalar s to all lanes • vint v = v1; - lane-wise copy of another vector • int s; and int s = s1; - ordinary scalar declaration and assignment, shown for contrast
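  A possible implementation of such a vint type, sketched with operator overloading over the 512-bit registers (assumed, not the dissertation's actual class):

      #include <immintrin.h>

      // "vint v = s;" broadcasts a scalar to all 16 lanes, "vint v = v1;" copies
      // lane-wise, and "v1 + v2" adds lane by lane.
      struct vint {
          __m512i reg;                                               // 16 x 32-bit integer lanes
          vint() : reg(_mm512_setzero_si512()) {}
          vint(int s) : reg(_mm512_set1_epi32(s)) {}                 // scalar broadcast
          vint(__m512i r) : reg(r) {}
          vint &operator=(int s) { reg = _mm512_set1_epi32(s); return *this; }
          vint operator+(const vint &o) const { return vint(_mm512_add_epi32(reg, o.reg)); }
      };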

  18. SIMD API • Continuous memory access • Non-continuous memory access
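  To illustrate the difference the API hides, a small sketch using AVX-512-style intrinsics (an assumed example, not the slide's code): contiguous data can use a plain vector load, while indirect accesses need a gather:

      #include <immintrin.h>

      __m512 load_contiguous(const float *data, int i) {
          return _mm512_loadu_ps(&data[i]);                       // 16 consecutive floats
      }

      __m512 load_indirect(const float *data, const int *idx, int i) {
          __m512i vidx = _mm512_loadu_si512(&idx[i]);             // 16 arbitrary indices
          return _mm512_i32gather_ps(vidx, data, sizeof(float));  // one float per index
      }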

  19. Sample Kernels - Kmeans
  Sequential code:
      Input: float *data; int i; // the index of one node
      // Iterate over the k clusters to compute the distance to each cluster
      for(int j = 0; j < k; ++j){
          float dis = 0.0;
          for(int m = 0; m < 3; ++m){
              dis += (data[i*3+m]-cluster[j*3+m]) *
                     (data[i*3+m]-cluster[j*3+m]);
          }
          dis = sqrt(dis);
      }
  SIMD API version (data reorganized to SOA; n is the number of nodes):
      Input: vfloat *data; int i; // the index of one node
      // Iterate over the k clusters to compute the distance to each cluster
      for(int j = 0; j < k; ++j){
          vfloat dis = 0.0;
          for(int m = 0; m < 3; ++m){
              dis += (data[i+m*n]-cluster[j*3+m]) *
                     (data[i+m*n]-cluster[j*3+m]);
          }
          dis = sqrt(dis);
      }

  20. Sample Kernels - Sobel
  Sequential code:
      Input: (i, j) the index of one node, float b[N][N]
      float Dx = 0.0;
      float Dy = 0.0;
      // Compute the weight for a node in a 3x3 area
      for(int p = -1; p <= 1; p++){
          for(int q = -1; q <= 1; q++){
              Dx += wH[p+1][q+1]*b[i+p][j+q];
              Dy += wV[p+1][q+1]*b[i+p][j+q];
          }
      }
      float z = sqrt(Dx*Dx + Dy*Dy);
      a[i][j] = z;
  SIMD API version (X() and Y() map indices to the transformed data layout):
      Input: (i, j) the index of one node, vfloat b[N][N]
      vfloat Dx = 0.0;
      vfloat Dy = 0.0;
      // Compute the weight for a node in a 3x3 area
      for(int p = -1; p <= 1; p++){
          for(int q = -1; q <= 1; q++){
              Dx += wH[p+1][q+1]*b[X(i,p,q)][Y(j,p,q)];
              Dy += wV[p+1][q+1]*b[X(i,p,q)][Y(j,p,q)];
          }
      }
      vfloat z = sqrt(Dx*Dx + Dy*Dy);
      z.store(&a[i][j]);

  21. Sample Kernels - Sobel
  Sequential code:
      Input: (i, j) the index of one node, float b[N][N]
      float Dx = 0.0;
      float Dy = 0.0;
      // Compute the weight for a node in a 3x3 area
      for(int p = -1; p <= 1; p++){
          for(int q = -1; q <= 1; q++){
              Dx += wH[p+1][q+1]*b[i+p][j+q];
              Dy += wV[p+1][q+1]*b[i+p][j+q];
          }
      }
      float z = sqrt(Dx*Dx + Dy*Dy);
      a[i][j] = z;
  Manual vectorization (IMCI intrinsics; wH and wV hold pre-broadcast vector coefficients):
      Input: (i, j) the index of one node, float b[N][N]
      __m512 Dx = _mm512_set1_ps(0.0);
      __m512 Dy = _mm512_set1_ps(0.0);
      // Compute the weight for a node in a 3x3 area
      for(int p = -1; p <= 1; ++p){
          for(int q = -1; q <= 1; ++q){
              __m512 *tmp = (__m512*)&b[i+q][j+p*vec_width];
              __m512 tmpx = _mm512_mul_ps(*tmp, wH[p+1][q+1]);
              Dx = _mm512_add_ps(Dx, tmpx);
              __m512 tmpy = _mm512_mul_ps(*tmp, wV[p+1][q+1]);
              Dy = _mm512_add_ps(Dy, tmpy);
          }
      }
      __m512 sqDx = _mm512_mul_ps(Dx, Dx);
      __m512 sqDy = _mm512_mul_ps(Dy, Dy);
      __m512 ret = _mm512_add_ps(sqDx, sqDy);
      ret = _mm512_sqrt_ps(ret);
      _mm512_store_ps(&a[i][j], ret);

  22. Data Layout Reorganization • Why data reorganization? • Vectorization only supports contiguous and aligned memory accesses • Gather/scatter APIs can load and store non-contiguous and unaligned data • But they add more than 60% overhead and push extra programming effort onto users

  23. Data Layout Reorganization • Non-unit-stride memory access • AOS (Array of Structures) to SOA (Structure of Arrays) • struct {x, y, z} • [x0, y0, z0, x1, y1, z1, ...] => [x0, x1, ..., y0, y1, ..., z0, z1, ...] • Non-aligned memory access • Non-linear data layout transformation proposed in "Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures" • Provides functions to obtain the indices of the adjacent nodes after the transformation • Irregular/indirect memory access • Edge reordering method proposed by Bin Ren
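  A minimal sketch of the AOS-to-SOA step described above (the function name and signature are illustrative):

      #include <vector>

      // [x0,y0,z0, x1,y1,z1, ...]  becomes  [x0,x1,..., y0,y1,..., z0,z1,...],
      // so that lane i of a vector load reads field x of node i with unit stride.
      std::vector<float> aos_to_soa(const std::vector<float> &aos, int n /* nodes */, int dims /* 3 */) {
          std::vector<float> soa(aos.size());
          for (int i = 0; i < n; ++i)
              for (int d = 0; d < dims; ++d)
                  soa[d * n + i] = aos[i * dims + d];   // field-major instead of node-major
          return soa;
      }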

  24. Control Flow Resolution
  Scalar code:
      if(v < 3)
          v = 0;
      else
          v = 1;
  SIMD version with a mask:
      vint v = [0, 2, 4, 6];
      mask = v < 3;   // mask = [1, 1, 0, 0]: the first two lanes are active, the last two inactive
      old = 1;        // default value for the inactive lanes
      v = 0;          // the assignment applies only to the active lanes
  • IMCI first introduces operations with mask parameters • mask: a bitset (1: active lane, 0: inactive lane) • old value: the default value for inactive lanes • Question: how do we pass the mask and the old value to our operator-overloading API?

  25. Mask Operation Implementation • IMCI implementation • Mask operations have a different API from unmasked operations • _mm512_add_ps(v1, v2) vs. _mm512_mask_add_ps(old, mask, v1, v2) • Our considerations • Operator-overload functions do not support extra parameters • Minimize the complexity of the API • Mask status is one kind of execution state • It exists during the whole thread life cycle (static) • It is private to each thread (thread-local)

  26. Mask Operation Implementation
  Explicit mask and old value:
      vint v = [0, 2, 4, 6];
      mask = v < 3;
      old = 1;
      v = 0;
  With a thread-local mask status (MS):
      vint v = [0, 2, 4, 6];
      MS::set_mask(v < 3, 1);
      v = 0;
  Scalar equivalent:
      if(v < 3)
          v = 0;
      else
          v = 1;
  • IMCI implementation • Mask operations have a different API from unmasked operations • _mm512_add_ps(v1, v2) vs. _mm512_mask_add_ps(old, mask, v1, v2) • Our considerations • Operator-overload functions do not support extra parameters • Minimize the complexity of the API • Mask status is one kind of execution state • It exists during the whole thread life cycle (static) • It is private to each thread (thread-local)

  27. Mask Operation Implementation • When should our API use mask operations? • Option 1: the API always uses mask operations • No effort for users • Introduces extra performance overhead • Option 2: the user decides when to use mask operations • Two vector types: an unmasked vector type and a masked vector type • Switch from the unmasked to the masked type by calling the mask() function

  28. Mask Operation Implementation
  Explicit mask and old value:
      vint v = [0, 2, 4, 6];
      mask = v < 3;
      old = 1;
      v = 0;
  Always-masked API:
      vint v = [0, 2, 4, 6];
      MS::set_mask(v < 3, 1);
      v = 0;
  User-selected masking via mask():
      vint v = [0, 2, 4, 6];
      MS::set_mask(v < 3, 1);
      v.mask() = 0;
  Scalar equivalent:
      if(v < 3)
          v = 0;
      else
          v = 1;
  • When should our API use mask operations? • Option 1: the API always uses mask operations - no effort for users, but extra performance overhead • Option 2: the user decides when to use mask operations - two vector types (unmasked and masked); switch from the unmasked to the masked type by calling the mask() function
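  One possible way to realize the thread-local mask status and the mask()-selected assignment, extending the earlier vint sketch; only the MS::set_mask() and mask() names come from the slides, everything else is assumed and uses AVX-512-style intrinsics:

      #include <immintrin.h>

      // Illustrative sketch, not the dissertation's implementation.
      struct MS {
          static thread_local __mmask16 mask;   // 1 = active lane, 0 = inactive lane
          static thread_local __m512i   old;    // value used for inactive lanes
          static void set_mask(__mmask16 m, int old_val) {
              mask = m;
              old  = _mm512_set1_epi32(old_val);
          }
      };
      thread_local __mmask16 MS::mask = 0xFFFF;
      thread_local __m512i   MS::old  = _mm512_setzero_si512();

      struct vint {
          __m512i reg;
          bool    use_mask = false;             // set by mask(), consumed by the next assignment
          vint(int s) : reg(_mm512_set1_epi32(s)) {}
          vint &mask() { use_mask = true; return *this; }
          // Masked assignment: active lanes take the new value, inactive lanes take MS::old.
          vint &operator=(int s) {
              __m512i v = _mm512_set1_epi32(s);
              reg = use_mask ? _mm512_mask_mov_epi32(MS::old, MS::mask, v) : v;
              use_mask = false;
              return *this;
          }
          __mmask16 operator<(int s) const {
              return _mm512_cmplt_epi32_mask(reg, _mm512_set1_epi32(s));
          }
      };

      // Usage mirroring the slide: vint v(2); MS::set_mask(v < 3, 1); v.mask() = 0;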

  29. Data Dependency Resolution • Write conflicts in reductions • Behavior is undefined when different SIMD lanes update the same location • Serialized reduction (pseudocode from the slide; index[i] denotes lane i of the index vector):
      template<class ReducFunc = func>
      void reduction(int *update, int scale, int offset, vint *index, type value) {
          for(int i = 0; i < simd_width; ++i) {
              ReducFunc()(update[index[i]*scale + offset], value[i]);
          }
      }
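  A self-contained toy version of this serialized reduction, with scalar arrays standing in for the vector lanes (illustrative only, not the dissertation's actual API):

      #include <cstdio>

      const int simd_width = 4;

      // Serialize across the "lanes" so that two lanes hitting the same location
      // never race; this is the undefined behavior the slide refers to.
      template<class ReducFunc>
      void reduction(int *update, int scale, int offset, const int *index, const int *value) {
          for (int i = 0; i < simd_width; ++i)
              ReducFunc()(update[index[i] * scale + offset], value[i]);
      }

      struct Add { void operator()(int &dst, int v) const { dst += v; } };

      int main() {
          int hist[3] = {0, 0, 0};
          int idx[simd_width]  = {0, 2, 2, 1};          // two lanes target bin 2
          int ones[simd_width] = {1, 1, 1, 1};
          reduction<Add>(hist, 1, 0, idx, ones);
          std::printf("%d %d %d\n", hist[0], hist[1], hist[2]);   // prints: 1 1 2
          return 0;
      }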

  30. Data Dependency Resolution • Reorder reduction (figure: updates regrouped so that those targeting the same element, e.g. A A A B B B C D, become contiguous) • Group reduction (figure: updates to the same element combined within the vector before writing) • Still need gather and scatter for non-continuous memory accesses

  31. Experiment Evaluations • Environment configuration • Xeon Phi SE10P with 61 cores at 1.1 GHz • 8 GB GDDR5 memory • Running in native mode • Intel ICC 13.1.0 with the -O3 option • Benchmarks • Generalized reduction: Kmeans, NBC • Stencil computation: Sobel, Heat3D • Compared versions • SIMD-API • SIMD-Manual • Pthread-vec: -vec option • Pthread-novec: -no-vec option • OpenMP-vec: -vec option with #pragma vector always • OpenMP-novec: -no-vec option

  32. Speedup - Generalized Reduction • SIMD-API achieves better performance than Pthread-novec and Pthread-vec • SIMD-API introduces very little overhead compared to SIMD-Manual • Kmeans: the performance of compiler vectorization depends on K • NBC: a large number of divergences limits the performance of compiler vectorization

  33. Speedup - Stencil Computation • Unaligned access is the major challenge for vectorization • The compiler can also auto-vectorize stencil computations under some conditions • Sobel: compiler vectorization fails due to the complicated nested loops • Heat3D: compiler vectorization achieves the best performance, very close to our API version

  34. Scalability

  35. Comparison with OpenMP • MIMD + SIMD Parallelism • More than 3x speedup for most applications • MIMD Parallelism • Gains better performance due to the pattern-knowledge-based task partitioning and scheduling strategies

  36. Outline • Latest work • A Programming System for Xeon Phis with Runtime SIMD Parallelization • Other work • Strategy and runtime support for irregular reductions on GPUs • Improve parallelism of SIMD for recursive applications on GPUs • Different strategies for generalized reduction on GPU • Task scheduling frameworks • Decoupled GPU + CPU • Coupled GPU + CPU • Conclusion

  37. Irregular Reduction • Irregular Applications • Unstructured grid pattern • Random and indirect memory accesses • Molecular Dynamics • Indirection Array -> Edges (Interactions) • Reduction Objects -> Molecules (Attributes) • Computation Space -> Interactions between molecules • Reduction Space -> Attributes of molecules
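  A skeleton of such an irregular reduction (illustrative, not from the slides): the edge list IA plays the role of the indirection array, and force[] is the reduction space.

      struct Edge { int u, v; };                    // interaction between molecules u and v

      void irregular_reduction(const Edge *IA, int num_edges,
                               const float *pos, float *force) {
          for (int e = 0; e < num_edges; ++e) {     // computation space: edges
              int u = IA[e].u, v = IA[e].v;         // indirect accesses through the edge list
              float f = pos[u] - pos[v];            // simplified pairwise interaction
              force[u] += f;                        // reduction space: molecule attributes
              force[v] -= f;                        // the same element may be hit by many edges
          }
      }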

  38. Main Issues • Traditional strategies are not effective • Full Replication (private copy per thread) • Large memory overhead • Requires both intra-block and inter-block combination • Shared memory usage is unlikely • Locking Scheme (private copy per block) • Heavy conflicts within a block • Avoids intra-block combination, but not inter-block combination • Shared memory is only available for small data sets • Need to choose a partitioning strategy • Make sure the data fits into shared memory • Choice of partitioning space (computation vs. reduction) • Trade-off between partitioning overhead and execution efficiency

  39. Contributions • A novel Partitioning-Based Locking strategy • Efficient shared memory utilization • Eliminates both intra- and inter-block combination • Optimized runtime support • Multi-dimensional partitioning method • Reordering and updating components for correctness and memory performance • Significant performance improvements • Exhaustive evaluation • Up to 3.3x improvement over traditional strategies

  40. Choice of Partitioning Space • Two partitioning choices: • Computation Space • Partition on edges • Reduction Space • Partition on nodes

  41. Computation Space Partitioning • Partitioning on the iterations of the computation loop (figure: a 16-node mesh split into Partitions 1-4 by edges) • Pros: • Load balance on computation • Cons: • Unequal reduction size in each partition • Replicated reduction elements (4 out of the 16 nodes are replicated) • Combination cost • Shared memory is infeasible

  42. Reduction Space Partitioning • Partitioning on the reduction elements (figure: the 16 nodes split into Partitions 1-4) • Pros: • Balanced reduction space • Partitions are independent of each other • No combination cost • Shared memory is feasible • Cons: • Imbalance on the computation space • Replicated work caused by the crossing edges
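  To make the replicated-work point concrete, a small sketch continuing the edge/force skeleton from slide 37 (illustrative only): each partition owns a range of nodes, processes every edge that touches one of its nodes, but updates only the nodes it owns, so crossing edges are computed by both partitions that share them.

      struct Edge { int u, v; };   // same Edge type as in the earlier skeleton

      void process_partition(const Edge *edges, int num_edges, const float *pos,
                             float *force, int node_begin, int node_end) {
          for (int e = 0; e < num_edges; ++e) {
              int u = edges[e].u, v = edges[e].v;
              bool u_owned = (u >= node_begin && u < node_end);
              bool v_owned = (v >= node_begin && v < node_end);
              if (!u_owned && !v_owned) continue;   // edge belongs entirely to other partitions
              float f = pos[u] - pos[v];            // crossing edges are recomputed per partition
              if (u_owned) force[u] += f;           // update only owned nodes, so no
              if (v_owned) force[v] -= f;           // inter-partition combination is needed
          }
      }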

  43. Reduction Space Partitioning - Challenges • Unbalanced and replicated computation • The partitioning method must balance cost and efficiency • Cost: execution time of the partitioning method • Efficiency: reducing the number of crossing edges (replicated work) • Maintaining correctness on the GPU • Reorder the reduction space • Update/reorder the computation space

  44. Runtime Partitioning Approaches • Metis Partitioning (multi-level k-way partitioning) • Executes sequentially on the CPU • Minimizes crossing edges (high efficiency) • Cons: large overhead for data initialization (high cost) • GPU-based (Trivial) Partitioning • Parallel execution on the GPU • Minimizes partitioning time (low cost) • Cons: large number of crossing edges among partitions (low efficiency) • Multi-dimensional Partitioning (coordinate information) • Executes sequentially on the CPU • Balances cost and efficiency

  45. Performance Gains • Euler: comparison between Partitioning-Based Locking (PBL), Locking, Full Replication, and sequential CPU time • Molecular Dynamics: comparison between Partitioning-Based Locking (PBL), Locking, Full Replication, and sequential CPU time

  46. Outline • Latest work • A Programming System for Xeon Phis with Runtime SIMD Parallelization • Other work • Strategy and runtime support for irregular reductions on GPUs • Improve parallelism of SIMD for recursive applications on GPUs • Different strategies for generalized reduction on GPU • Task scheduling frameworks • Decoupled GPU + CPU • Coupled GPU + CPU • Conclusion

  47. Current Recursion Support on Modern GPUs • Problem definition: intra-warp thread scheduling • Each thread in a warp executes an independent recursive function call • SIMT (Single Instruction Multiple Thread): different from the traditional SSE extensions • Each thread owns a stack for function calls • Branch control: different branches execute serially • Recursion support on current GPUs • AMD GPUs and the OpenCL programming model: no recursion support • NVIDIA GPUs: recursion supported since compute capability 2.0 and SDK 3.1 • Performance is limited by the reconvergence method: immediate post-dominator reconvergence

  48. Immediate Post-Dominator Reconvergence in Recursion
      /* General branch */
      Fib(n)
      {
          if(n < 2) {
              return 1;
          } else {
              x = Fib(n-1);
              y = Fib(n-2);
              return x+y;
          }
      }
  • Two threads T0 and T1 in the same warp call Fib with different arguments • Reconvergence can only happen at the same recursion level • Threads with the shorter branch cannot return until the threads with the longer branch come back to the reconvergence point

  49. Contributions • Dynamic reconvergence method for recursion • Reconvergence can happen before or after the immediate post-dominator • Two implementations • Dynamic greedy reconvergence • Dynamic majority reconvergence • Significant performance improvements • Exhaustive evaluation on six recursion benchmarks with different characteristics • Up to 6x improvement over the immediate post-dominator method

  50. Dynamic Reconvergence Mechanisms • [Figure: two threads T0 and T1 diverge at a branch]
