
Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations



  1. Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations
  Xin Huo, Vignesh T. Ravi, Gagan Agrawal
  Department of Computer Science and Engineering, The Ohio State University

  2. Irregular Reduction - Context
  • A dwarf in the Berkeley view on parallel computing
    • Unstructured grid pattern
    • More random and irregular accesses
    • Indirect memory references
  • Previous efforts on porting to different architectures
    • Distributed memory machines
    • Distributed shared memory machines
    • Shared memory machines
    • Cache performance improvement on uniprocessors
    • Many-core architectures - GPGPU (our work in ICS'11)
  • No systematic study yet on heterogeneous configurations (CPU + GPU)

  3. Why CPU + GPU? - A Glimpse
  • Dominant in different areas, yet tightly coupled
    • GPU: computation-intensive problems; large numbers of threads executing in SIMD
    • CPU: data-intensive problems; branch-heavy, high-precision, and complicated computation
    • The GPU depends on the CPU for scheduling and data
  • One of the most popular heterogeneous architectures
    • 3 of the 5 fastest supercomputers are based on a CPU + GPU architecture (Top500 list, 11/2011)
    • AMD Fusion, Intel Sandy Bridge, and NVIDIA Project Denver
    • GPU compute instances in Amazon's cloud

  4. Outline
  • Background
    • Irregular reduction structure
    • Partitioning-based locking scheme
    • Main issues
  • Contributions
  • Multi-level partitioning framework
  • Runtime support
    • Pipelining scheme
    • Task scheduling
  • Experimental results
  • Conclusions

  5. Irregular Reduction Structure
  • Robj: reduction object
  • e: iteration of the computation loop
  • IA(e, x): indirection array
    • IA iterates over e (computation space)
    • Robj is accessed through the indirection array (reduction space)

  /* Outer sequence loop */
  while( ) {
    /* Reduction loop */
    Foreach(element e) {
      (IA(e,0), val1) = Process(IA(e,0));
      (IA(e,1), val2) = Process(IA(e,1));
      Robj(IA(e,0)) = Reduce(Robj(IA(e,0)), val1);
      Robj(IA(e,1)) = Reduce(Robj(IA(e,1)), val2);
    }
    /* Global reduction to combine Robj */
  }
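
  For concreteness, a minimal sequential sketch of this loop, with hypothetical names not taken from the paper: edges plays the role of the indirection array IA, robj is the reduction space, and the += / -= updates stand in for an associative, commutative Reduce().

  #include <stddef.h>

  typedef struct { int n0, n1; } Edge;        /* IA(e,0), IA(e,1) */

  void irregular_reduction(const Edge *edges, size_t num_edges,
                           const double *edge_val, /* Process() results per edge */
                           double *robj)           /* reduction objects */
  {
      for (size_t e = 0; e < num_edges; ++e) {
          /* Indirect references: which two reduction objects this
             iteration touches is known only at runtime. */
          robj[edges[e].n0] += edge_val[e];   /* Reduce(Robj(IA(e,0)), val1) */
          robj[edges[e].n1] -= edge_val[e];   /* Reduce(Robj(IA(e,1)), val2) */
      }
  }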

  6. Application Context: Molecular Dynamics
  • Indirection array -> edges (interactions)
  • Reduction objects -> molecules (attributes)
  • Computation space -> interactions between molecules
  • Reduction space -> attributes of molecules
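
  As an illustration (field names are hypothetical, not from the paper), the MD mapping can be pictured as a data layout in which the pair list is the indirection array and the per-molecule force array is the reduction space:

  typedef struct { float x, y, z; } Vec3;

  typedef struct {
      int   num_molecules;
      int   num_interactions;
      Vec3 *position;     /* molecule attributes read by Process()         */
      Vec3 *force;        /* reduction objects, accumulated per molecule   */
      int (*pairs)[2];    /* indirection array: interacting molecule pairs */
  } MDSystem;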

  7. Partitioning-based Locking Strategy
  • Reduction space partitioning
    • Efficient shared memory utilization
    • Eliminates intra- and inter-block combination
  • Multi-dimensional partitioning method
    • Balances minimum edge cut against partitioning time
  Huo et al., 25th International Conference on Supercomputing (ICS'11)
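
  A minimal CUDA sketch of this scheme (not the authors' exact kernel; it substitutes shared-memory atomics for the paper's locks, and all names are illustrative). It assumes edges have been grouped by partition, per-edge values precomputed, and node indices renumbered to be partition-local, so each thread block stages its partition's reduction objects in shared memory and needs no inter-block combination.

  #define NODES_PER_PART 1024   // assumed max reduction objects per partition

  __global__ void reduce_partition(const int *local_n0, const int *local_n1,
                                   const float *edge_val,   // Process() results
                                   const int *edge_start, const int *edge_end,
                                   const int *node_base, const int *node_count,
                                   float *global_robj)
  {
      __shared__ float robj[NODES_PER_PART];
      int p = blockIdx.x;                               // one reduction partition per block
      int base = node_base[p], n = node_count[p];

      for (int i = threadIdx.x; i < n; i += blockDim.x)
          robj[i] = global_robj[base + i];              // stage partition in shared memory
      __syncthreads();

      for (int e = edge_start[p] + threadIdx.x; e < edge_end[p]; e += blockDim.x) {
          atomicAdd(&robj[local_n0[e]],  edge_val[e]);  // conflicts stay inside the block
          atomicAdd(&robj[local_n1[e]], -edge_val[e]);
      }
      __syncthreads();

      for (int i = threadIdx.x; i < n; i += blockDim.x)
          global_robj[base + i] = robj[i];              // write back; no global combine
  }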

  8. Main Issues
  • Device memory limitation on the GPU
  • Partitioning overhead
    • Partitioning cost increases with data volume
    • The GPU sits idle waiting for partitioning results
  • Low CPU utilization
    • The CPU only performs partitioning
    • The CPU sits idle while the GPU computes

  9. Contributions
  • A novel multi-level partitioning framework
    • Parallelizes irregular reductions on a heterogeneous architecture (CPU + GPU)
    • Eliminates the device memory limitation on the GPU
  • Runtime support
    • Pipelining scheme
    • Work-stealing based scheduling strategy
  • Significant performance improvements
    • Exhaustive evaluations
    • 11% and 22% improvements for Euler and Molecular Dynamics, respectively

  10. Multi-level Partitioning Framework

  11. Computation Space Partitioning
  • Partitioning over the iterations of the computation loop
  • Pros
    • Load balance on computation
  • Cons
    • Unequal reduction size in each partition
    • Replicated reduction elements (4 of the 16 nodes in the example)
    • Combination cost
      • Between CPU and GPU (first level)
      • Between different thread blocks (second level)
  [Figure: an example 16-node mesh cut into 4 partitions by computation space; nodes shared by partitions are replicated]
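
  A host-side sketch of this scheme under simplifying assumptions (the illustrative Edge type from the earlier sketch, per-edge values precomputed): iterations are split into equal chunks, each chunk accumulates into a private copy of the reduction space, and the private copies are folded together afterwards, which is exactly the combination cost listed above.

  #include <stdlib.h>

  typedef struct { int n0, n1; } Edge;

  void compute_space_reduction(const Edge *edges, size_t num_edges,
                               const double *edge_val,
                               double *robj, size_t num_nodes, int nparts)
  {
      /* one private copy of the reduction space per partition */
      double *priv = (double *)calloc((size_t)nparts * num_nodes, sizeof *priv);
      size_t chunk = (num_edges + nparts - 1) / nparts;

      for (int p = 0; p < nparts; ++p) {        /* balanced iteration counts */
          size_t lo = (size_t)p * chunk;
          size_t hi = lo + chunk < num_edges ? lo + chunk : num_edges;
          double *mine = priv + (size_t)p * num_nodes;
          for (size_t e = lo; e < hi; ++e) {
              mine[edges[e].n0] += edge_val[e];
              mine[edges[e].n1] -= edge_val[e];
          }
      }
      /* combination step: the cost paid between CPU/GPU or thread blocks */
      for (int p = 0; p < nparts; ++p)
          for (size_t i = 0; i < num_nodes; ++i)
              robj[i] += priv[(size_t)p * num_nodes + i];
      free(priv);
  }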

  12. Reduction Space Partitioning
  • Partitioning over the reduction elements
  • Pros
    • Balanced reduction space
    • Shared memory becomes feasible on the GPU
    • Each pair of partitions is independent
      • No communication between CPU and GPU (first level)
      • No communication between thread blocks (second level)
    • Avoids combination cost
      • Between different thread blocks
      • Between CPU and GPU
  • Cons
    • Imbalance in the computation space
    • Replicated work caused by crossing edges
  [Figure: the same 16-node mesh cut into 4 partitions by reduction space; crossing edges are processed by both end partitions]
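
  A sketch of how edges could be distributed under reduction-space partitioning (illustrative; the paper uses a multi-dimensional partitioner rather than the precomputed node-to-partition map assumed here). Each edge goes to the partition of each of its endpoints, so a crossing edge appears in two edge lists: that duplication is the replicated work listed under Cons.

  #include <stddef.h>

  typedef struct { int n0, n1; } Edge;

  /* part[v] gives the partition that owns reduction object v; lists[p]
     must be pre-allocated large enough to hold partition p's edges.   */
  void build_partition_edges(const Edge *edges, size_t num_edges,
                             const int *part, int nparts,
                             Edge **lists, size_t *counts)
  {
      for (int p = 0; p < nparts; ++p) counts[p] = 0;
      for (size_t e = 0; e < num_edges; ++e) {
          int p0 = part[edges[e].n0];
          int p1 = part[edges[e].n1];
          lists[p0][counts[p0]++] = edges[e];       /* owner of one endpoint */
          if (p1 != p0)                             /* crossing edge: also   */
              lists[p1][counts[p1]++] = edges[e];   /* replicated to the other owner */
      }
  }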

  13. Task Scheduling Framework

  14. Task Scheduling Framework
  • Pipelining scheme
    • K blocks are assigned to the GPU in one global load
    • Partitioning and computation are pipelined within a global load
  • Work-stealing scheduling
    • Scheduling granularity (large for the GPU, small for the CPU)
      • Too large: better pipelining effect, but worse load balance
      • Too small: better load balance, but a shorter pipeline
    • Work stealing achieves both maximum pipeline length and good load balance
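
  A host-side sketch of how such a scheduler might look (all helper functions are stubs I have assumed, not the authors' API): a shared atomic counter hands out partitions; the GPU worker claims K at a time and the CPU worker claims one, so partitioning of later blocks overlaps GPU computation of earlier ones while both devices stay busy.

  #include <algorithm>
  #include <atomic>
  #include <thread>

  static const int total_blocks = 64;  // illustrative partition count
  static const int K = 8;              // GPU grant size per steal

  std::atomic<int> next_block{0};      // shared work-queue index

  // Stubs standing in for the real runtime pieces (assumed, not the paper's API).
  void wait_until_partitioned(int b) { /* block until the partitioner emits block b */ }
  void gpu_compute(int b)            { /* launch the reduction kernel for block b  */ }
  void cpu_compute(int b)            { /* reduce block b on idle CPU cores         */ }

  void gpu_worker() {
      for (int b; (b = next_block.fetch_add(K)) < total_blocks; ) {
          int end = std::min(b + K, total_blocks);  // coarse grant: long pipeline
          for (int i = b; i < end; ++i) {
              wait_until_partitioned(i);
              gpu_compute(i);   // overlaps partitioning of blocks > i
          }
      }
  }

  void cpu_worker() {
      for (int b; (b = next_block.fetch_add(1)) < total_blocks; ) {
          wait_until_partitioned(b);
          cpu_compute(b);       // fine grant: good load balance
      }
  }

  int main() {
      std::thread g(gpu_worker), c(cpu_worker);
      g.join();
      c.join();
      return 0;
  }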

  15. Experiment Setup
  • Platform
    • GPU: NVIDIA Tesla C2050 "Fermi" (14 x 32 = 448 cores), 2.68 GB device memory, 64 KB configurable shared memory
    • CPU: quad-core Intel Xeon E5520 at 2.27 GHz, 48 GB memory
    • x16 PCI Express 2.0 link between them
  • Applications
    • Euler (computational fluid dynamics)
    • MD (Molecular Dynamics)

  16. Scalability - Molecular Dynamics
  • Scalability of MD on the multi-core CPU and the GPU across different datasets

  17. Pipelining - Euler
  • Effect of pipelining CPU partitioning with GPU computation (EU)
    • Partitioning overhead is hidden for all but the first partition
    • Performance improves as the pipeline grows deeper

  18. Heterogeneous Performance - EU and MD
  • Benefits from dividing computation between CPU and GPU
    • GPU + CPU outperforms the GPU-only version by 11% for EU and 22% for MD

  19. Work Stealing - Euler
  • Comparison of fine-grained, coarse-grained, and work-stealing strategies
    • Granularity = 1 (fine-grained): good load balance, poor pipelining effect
    • Granularity = 5 (coarse-grained): good pipelining effect, poor load balance
    • Work stealing: good pipelining effect and good load balance

  20. Conclusions
  • A multi-level partitioning framework ports irregular reductions onto heterogeneous architectures
  • The pipelining scheme overlaps partitioning on the CPU with computation on the GPU
  • Work-stealing scheduling achieves the best combination of pipelining effect and load balance

  21. Thank you! Questions?
  Contacts:
  Xin Huo - huox@cse.ohio-state.edu
  Vignesh T. Ravi - raviv@cse.ohio-state.edu
  Gagan Agrawal - agrawal@cse.ohio-state.edu
