
Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations



  1. Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations
  Xin Huo, Vignesh T. Ravi, Gagan Agrawal
  Department of Computer Science and Engineering, The Ohio State University

  2. Irregular Reduction - Context
  • A dwarf in the Berkeley view on parallel computing
    • Unstructured grid pattern
    • More random and irregular accesses
    • Indirect memory references
  • Previous efforts on porting to different architectures
    • Distributed memory machines
    • Distributed shared memory machines
    • Shared memory machines
    • Cache performance improvement on uniprocessors
    • Many-core architectures - GPGPU (our work in ICS'11)
  • No systematic study yet on heterogeneous configurations (CPU + GPU)

  3. Why CPU + GPU? - A Glimpse
  • Dominant in different areas, yet tightly coupled
    • GPU: computation-intensive problems; large numbers of threads executing in SIMD
    • CPU: data-intensive problems; branch-heavy, high-precision, and complicated computation
    • The GPU depends on the CPU for scheduling and data
  • One of the most popular heterogeneous architectures
    • 3 of the 5 fastest supercomputers are based on a CPU + GPU architecture (Top500 list, 11/2011)
    • AMD Fusion, Intel Sandy Bridge, and NVIDIA Project Denver
    • GPU compute instances in Amazon's cloud

  4. Outline
  • Background
    • Irregular reduction structure
    • Partitioning-based locking scheme
    • Main issues
  • Contributions
  • Multi-level partitioning framework
  • Runtime support
    • Pipelining scheme
    • Task scheduling
  • Experimental results
  • Conclusions

  5. Irregular Reduction Structure
  • Robj: reduction object
  • e: iteration of the computation loop
  • IA(e, x): indirection array
    • IA iterates over e (computation space)
    • Robj is accessed through the indirection array (reduction space)

  /* Outer sequence loop */
  while( ) {
    /* Reduction loop */
    Foreach(element e) {
      (IA(e,0), val1) = Process(IA(e,0));
      (IA(e,1), val2) = Process(IA(e,1));
      Robj(IA(e,0)) = Reduce(Robj(IA(e,0)), val1);
      Robj(IA(e,1)) = Reduce(Robj(IA(e,1)), val2);
    }
    /* Global reduction to combine Robj */
  }
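
  For concreteness, a minimal sequential sketch of this loop, with hypothetical names not taken from the paper: edges plays the role of the indirection array IA, robj is the reduction space, and the += / -= updates stand in for an associative, commutative Reduce().

  #include <stddef.h>

  typedef struct { int n0, n1; } Edge;        /* IA(e,0), IA(e,1) */

  void irregular_reduction(const Edge *edges, size_t num_edges,
                           const double *edge_val, /* Process() results per edge */
                           double *robj)           /* reduction objects */
  {
      for (size_t e = 0; e < num_edges; ++e) {
          /* Indirect references: which two reduction objects this
             iteration touches is known only at runtime. */
          robj[edges[e].n0] += edge_val[e];   /* Reduce(Robj(IA(e,0)), val1) */
          robj[edges[e].n1] -= edge_val[e];   /* Reduce(Robj(IA(e,1)), val2) */
      }
  }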

  6. Application Context: Molecular Dynamics
  • Indirection array -> edges (interactions)
  • Reduction objects -> molecules (attributes)
  • Computation space -> interactions between molecules
  • Reduction space -> attributes of molecules
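
  As an illustration (field names are hypothetical, not from the paper), the MD mapping can be pictured as a data layout in which the pair list is the indirection array and the per-molecule force array is the reduction space:

  typedef struct { float x, y, z; } Vec3;

  typedef struct {
      int   num_molecules;
      int   num_interactions;
      Vec3 *position;     /* molecule attributes read by Process()         */
      Vec3 *force;        /* reduction objects, accumulated per molecule   */
      int (*pairs)[2];    /* indirection array: interacting molecule pairs */
  } MDSystem;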

  7. Partitioning-based Locking Strategy
  • Reduction space partitioning
    • Efficient shared memory utilization
    • Eliminates intra- and inter-block combination
  • Multi-dimensional partitioning method
    • Balances minimum edge cut against partitioning time
  Huo et al., 25th International Conference on Supercomputing (ICS'11)
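
  A minimal CUDA sketch of this scheme (not the authors' exact kernel; it substitutes shared-memory atomics for the paper's locks, and all names are illustrative). It assumes edges have been grouped by partition, per-edge values precomputed, and node indices renumbered to be partition-local, so each thread block stages its partition's reduction objects in shared memory and needs no inter-block combination.

  #define NODES_PER_PART 1024   // assumed max reduction objects per partition

  __global__ void reduce_partition(const int *local_n0, const int *local_n1,
                                   const float *edge_val,   // Process() results
                                   const int *edge_start, const int *edge_end,
                                   const int *node_base, const int *node_count,
                                   float *global_robj)
  {
      __shared__ float robj[NODES_PER_PART];
      int p = blockIdx.x;                               // one reduction partition per block
      int base = node_base[p], n = node_count[p];

      for (int i = threadIdx.x; i < n; i += blockDim.x)
          robj[i] = global_robj[base + i];              // stage partition in shared memory
      __syncthreads();

      for (int e = edge_start[p] + threadIdx.x; e < edge_end[p]; e += blockDim.x) {
          atomicAdd(&robj[local_n0[e]],  edge_val[e]);  // conflicts stay inside the block
          atomicAdd(&robj[local_n1[e]], -edge_val[e]);
      }
      __syncthreads();

      for (int i = threadIdx.x; i < n; i += blockDim.x)
          global_robj[base + i] = robj[i];              // write back; no global combine
  }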

  8. Main Issues
  • Device memory limitation on the GPU
  • Partitioning overhead
    • Partitioning cost increases with data volume
    • The GPU sits idle waiting for partitioning results
  • Low CPU utilization
    • The CPU only performs partitioning
    • The CPU sits idle while the GPU computes

  9. Contributions
  • A novel multi-level partitioning framework
    • Parallelizes irregular reductions on a heterogeneous architecture (CPU + GPU)
    • Eliminates the device memory limitation on the GPU
  • Runtime support
    • Pipelining scheme
    • Work-stealing based scheduling strategy
  • Significant performance improvements
    • Exhaustive evaluations
    • 11% and 22% improvements for Euler and Molecular Dynamics, respectively

  10. Multi-level Partitioning Framework

  11. Computation Space Partitioning
  • Partitioning over the iterations of the computation loop
  • Pros
    • Load balance on computation
  • Cons
    • Unequal reduction size in each partition
    • Replicated reduction elements (4 of the 16 nodes in the example)
    • Combination cost
      • Between CPU and GPU (first level)
      • Between different thread blocks (second level)
  [Figure: an example 16-node mesh cut into 4 partitions by computation space; nodes shared by partitions are replicated]
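
  A host-side sketch of this scheme under simplifying assumptions (the illustrative Edge type from the earlier sketch, per-edge values precomputed): iterations are split into equal chunks, each chunk accumulates into a private copy of the reduction space, and the private copies are folded together afterwards, which is exactly the combination cost listed above.

  #include <stdlib.h>

  typedef struct { int n0, n1; } Edge;

  void compute_space_reduction(const Edge *edges, size_t num_edges,
                               const double *edge_val,
                               double *robj, size_t num_nodes, int nparts)
  {
      /* one private copy of the reduction space per partition */
      double *priv = (double *)calloc((size_t)nparts * num_nodes, sizeof *priv);
      size_t chunk = (num_edges + nparts - 1) / nparts;

      for (int p = 0; p < nparts; ++p) {        /* balanced iteration counts */
          size_t lo = (size_t)p * chunk;
          size_t hi = lo + chunk < num_edges ? lo + chunk : num_edges;
          double *mine = priv + (size_t)p * num_nodes;
          for (size_t e = lo; e < hi; ++e) {
              mine[edges[e].n0] += edge_val[e];
              mine[edges[e].n1] -= edge_val[e];
          }
      }
      /* combination step: the cost paid between CPU/GPU or thread blocks */
      for (int p = 0; p < nparts; ++p)
          for (size_t i = 0; i < num_nodes; ++i)
              robj[i] += priv[(size_t)p * num_nodes + i];
      free(priv);
  }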

  12. Reduction Space Partitioning
  • Partitioning over the reduction elements
  • Pros
    • Balanced reduction space
    • Shared memory becomes feasible on the GPU
    • Each pair of partitions is independent
      • No communication between CPU and GPU (first level)
      • No communication between thread blocks (second level)
    • Avoids combination cost
      • Between different thread blocks
      • Between CPU and GPU
  • Cons
    • Imbalance in the computation space
    • Replicated work caused by crossing edges
  [Figure: the same 16-node mesh cut into 4 partitions by reduction space; crossing edges are processed by both end partitions]
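
  A sketch of how edges could be distributed under reduction-space partitioning (illustrative; the paper uses a multi-dimensional partitioner rather than the precomputed node-to-partition map assumed here). Each edge goes to the partition of each of its endpoints, so a crossing edge appears in two edge lists: that duplication is the replicated work listed under Cons.

  #include <stddef.h>

  typedef struct { int n0, n1; } Edge;

  /* part[v] gives the partition that owns reduction object v; lists[p]
     must be pre-allocated large enough to hold partition p's edges.   */
  void build_partition_edges(const Edge *edges, size_t num_edges,
                             const int *part, int nparts,
                             Edge **lists, size_t *counts)
  {
      for (int p = 0; p < nparts; ++p) counts[p] = 0;
      for (size_t e = 0; e < num_edges; ++e) {
          int p0 = part[edges[e].n0];
          int p1 = part[edges[e].n1];
          lists[p0][counts[p0]++] = edges[e];       /* owner of one endpoint */
          if (p1 != p0)                             /* crossing edge: also   */
              lists[p1][counts[p1]++] = edges[e];   /* replicated to the other owner */
      }
  }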

  13. Task Scheduling Framework

  14. Task Scheduling Framework
  • Pipelining scheme
    • K blocks are assigned to the GPU in one global load
    • Partitioning and computation are pipelined within a global load
  • Work-stealing scheduling
    • Scheduling granularity (large for the GPU, small for the CPU)
      • Too large: better pipelining effect, but worse load balance
      • Too small: better load balance, but a shorter pipeline
    • Work stealing achieves both maximum pipeline length and good load balance
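
  A host-side sketch of how such a scheduler might look (all helper functions are stubs I have assumed, not the authors' API): a shared atomic counter hands out partitions; the GPU worker claims K at a time and the CPU worker claims one, so partitioning of later blocks overlaps GPU computation of earlier ones while both devices stay busy.

  #include <algorithm>
  #include <atomic>
  #include <thread>

  static const int total_blocks = 64;  // illustrative partition count
  static const int K = 8;              // GPU grant size per steal

  std::atomic<int> next_block{0};      // shared work-queue index

  // Stubs standing in for the real runtime pieces (assumed, not the paper's API).
  void wait_until_partitioned(int b) { /* block until the partitioner emits block b */ }
  void gpu_compute(int b)            { /* launch the reduction kernel for block b  */ }
  void cpu_compute(int b)            { /* reduce block b on idle CPU cores         */ }

  void gpu_worker() {
      for (int b; (b = next_block.fetch_add(K)) < total_blocks; ) {
          int end = std::min(b + K, total_blocks);  // coarse grant: long pipeline
          for (int i = b; i < end; ++i) {
              wait_until_partitioned(i);
              gpu_compute(i);   // overlaps partitioning of blocks > i
          }
      }
  }

  void cpu_worker() {
      for (int b; (b = next_block.fetch_add(1)) < total_blocks; ) {
          wait_until_partitioned(b);
          cpu_compute(b);       // fine grant: good load balance
      }
  }

  int main() {
      std::thread g(gpu_worker), c(cpu_worker);
      g.join();
      c.join();
      return 0;
  }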

  15. Experiment Setup
  • Platform
    • GPU: NVIDIA Tesla C2050 "Fermi" (14 x 32 = 448 cores), 2.68 GB device memory, 64 KB configurable shared memory
    • CPU: quad-core Intel Xeon E5520 at 2.27 GHz, 48 GB memory
    • x16 PCI Express 2.0 link between them
  • Applications
    • Euler (computational fluid dynamics)
    • MD (Molecular Dynamics)

  16. Scalability - Molecular Dynamics
  • Scalability of MD on the multi-core CPU and the GPU across different datasets

  17. Pipelining - Euler
  • Effect of pipelining CPU partitioning with GPU computation (EU)
    • Partitioning overhead is hidden for all but the first partition
    • Performance improves as the pipeline grows deeper

  18. Heterogeneous Performance - EU and MD
  • Benefits from dividing computation between CPU and GPU
    • GPU + CPU outperforms the GPU-only version by 11% for EU and 22% for MD

  19. Work Stealing - Euler
  • Comparison of fine-grained, coarse-grained, and work-stealing strategies
    • Granularity = 1 (fine-grained): good load balance, poor pipelining effect
    • Granularity = 5 (coarse-grained): good pipelining effect, poor load balance
    • Work stealing: good pipelining effect and good load balance

  20. Conclusions
  • A multi-level partitioning framework ports irregular reductions onto heterogeneous architectures
  • The pipelining scheme overlaps partitioning on the CPU with computation on the GPU
  • Work-stealing scheduling achieves the best combination of pipelining effect and load balance

  21. Thank you! Questions?
  Contacts:
  Xin Huo - huox@cse.ohio-state.edu
  Vignesh T. Ravi - raviv@cse.ohio-state.edu
  Gagan Agrawal - agrawal@cse.ohio-state.edu
