
Approaches for Parallelizing Reductions on Modern GPUs


Presentation Transcript


  1. Approaches for Parallelizing Reductions on Modern GPUs Xin Huo, Vignesh T. Ravi, Wenjing Ma and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus, OH 43210

  2. Outline • Motivation • Challenges • Contributions • Generalized Reductions • Parallelization Approaches • Full Replication • Locking Scheme • Hybrid Scheme • Evaluation • Conclusions

  3. Motivation • Deluge of scientific and data-intensive applications • Floating-point/double data types and the use of shared memory are commonplace • Different applications require different synchronization mechanisms • State-of-the-art mechanisms to avoid race conditions in CUDA applications (compute capability <= 1.3): • Replication: a private copy for each thread • Fine-grained atomic operations on device memory (integer only) • Fine-grained atomic operations on shared memory (integer only) • Visible gap between application requirements and CUDA support • No floating-point atomic operations • No robust coarse-grained locking • Disadvantages of existing mechanisms • Replication: huge memory and combination overhead • Atomic operations: introduce heavy conflicts

  4. Challenges • Provide additional mechanisms to avoid race conditions • Enable floating-point atomic operations (on both device and shared memory) • Enable coarse-grained locking • Overheads of Replication • Memory requirement grows with data size, number of threads, and application parameters • Combination overhead increases with the number of threads • Large replicated copies mostly preclude the use of shared memory • Overheads of Locking • Heavy conflicts per word can occur with large numbers of threads • How to improve the existing mechanisms? • Provide a mechanism that balances the trade-offs between Replication and Locking

  5. Contributions • Additional locking mechanisms • A wrapper-based, floating-point fine-grained locking scheme • A robust, deadlock-free coarse-grained locking scheme • Explicit conditional branch • Explicit warp serialization • A novel Hybrid Scheme combining Replication and Locking • Balances the overheads of both replication and locking • All schemes are handled transparently for the user

  6. Generalized Reduction Computations • Similar to the MapReduce model, but with only one stage, Reduction • Reduction object, Robj, exposed to the programmer • Large intermediate results avoided • Reduction operation, Reduc, is associative and commutative • Order of processing can be arbitrary • This work targets this particular class of applications
    {* Outer sequential loop *}
    While (unfinished) {
        {* Reduction loop *}
        Foreach (element e) {
            (i, val) = compute(e)
            Robj(i) = Reduc(Robj(i), val)
        }
    }
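As a concrete illustration (assumed, not the paper's code), the pseudocode above maps onto a CUDA kernel roughly as follows; compute(), K, and the element type are placeholders for application-specific details, and the update of Robj(i) is exactly the step the following schemes make safe.

    // Minimal sketch of the generalized reduction loop in CUDA.
    // compute() stands in for an application-specific function; here it just
    // bins a value into one of K entries of the reduction object.
    #define K 10

    __device__ void compute(float e, int *i, float *val)
    {
        *i   = ((int)e % K + K) % K;   // which entry of the reduction object to update
        *val = e;                      // contribution to that entry
    }

    __global__ void generalized_reduction(const float *elements, int n, float *robj)
    {
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;

        for (int e = tid; e < n; e += stride) {
            int   i;
            float val;
            compute(elements[e], &i, &val);
            // Robj(i) = Reduc(Robj(i), val), here simply an addition. The order is
            // arbitrary, but concurrent updates to the same entry race; the schemes
            // on the following slides make this update safe.
            robj[i] += val;   // unsafe as written -- placeholder for the schemes below
        }
    }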

  7. Parallelization Schemes - Full Replication (Diagram: on the device, every thread in every block keeps its own private copy of the reduction object; on the host, the per-thread copies are combined into the final result.)
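A hedged sketch of how Full Replication could be realized, reusing compute() and K from the sketch above; the layout (one contiguous K-entry copy per thread in device memory) is an assumption for illustration.

    // Full Replication sketch: each thread owns a private K-entry copy of the
    // reduction object, so the reduction loop needs no synchronization at all.
    __global__ void reduction_full_replication(const float *elements, int n,
                                               float *robj_copies /* nthreads * K */)
    {
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        float *my_robj = robj_copies + tid * K;   // this thread's private copy

        for (int e = tid; e < n; e += stride) {
            int   i;
            float val;
            compute(elements[e], &i, &val);
            my_robj[i] += val;                    // race-free: private to this thread
        }
        // A separate combination step must later fold all nthreads copies into the
        // final reduction object; its cost grows with the number of threads.
    }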

  8. Parallelization Schemes - Locking Schemes • Fine-grained locking • Coarse-grained locking (Diagram: on the device, all threads in one block share the same copy of the reduction object; on the host, the per-block copies are combined into the final result.)

  9. Fine-grained Locking • Lock for updating a particular word (atomic operation), atomically executing *address = *address + val • Supports single- and double-precision floating-point computation • Implemented by wrapping the atomicCAS operation provided by CUDA (Flowchart: read old_value = *address, compute new_value = old_value + val, and call AtomicCAS(address, old_value, new_value); if *address still equals old_value the swap succeeds and the update is done, otherwise re-read *address and retry.)
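The wrapper the slide describes can be sketched with the standard atomicCAS retry loop; this follows the well-known pattern for building floating-point atomics and may differ in detail from the authors' implementation.

    // Floating-point fine-grained locking: atomically perform *address += val by
    // wrapping CUDA's integer atomicCAS in a compare-and-retry loop.
    __device__ float atomicAddFloat(float *address, float val)
    {
        int *address_as_int = (int *)address;
        int  old = *address_as_int;
        int  assumed;
        do {
            assumed = old;
            // Attempt to swap in (old + val); succeeds only if *address is unchanged.
            old = atomicCAS(address_as_int, assumed,
                            __float_as_int(__int_as_float(assumed) + val));
        } while (assumed != old);     // another thread updated *address first: retry
        return __int_as_float(old);
    }

The same pattern over the 64-bit atomicCAS gives the double-precision version.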

  10. Coarse-grained Locking • Lock for a group of operations (a critical section) • A lock word locking = 0 means free, locking = 1 means busy (Flowchart: each thread calls AtomicCAS(locking, 0, 1); if the lock was free, it becomes busy and the thread executes the critical section, then frees the lock by setting locking = 0; otherwise the thread spins and retries. The spin lock causes thread divergence within a warp and is deadlock prone.)
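A sketch of the straightforward spin lock the flowchart describes, showing why it is deadlock prone; the function name and the critical-section body are illustrative.

    // Naive coarse-grained lock: 0 = free, 1 = busy. Threads of a warp execute in
    // SIMT lock-step on this hardware, so if one thread acquires the lock while its
    // siblings keep spinning, the warp may never execute the lock holder's path and
    // the lock is never released -- the warp can deadlock.
    __device__ void critical_section_naive(int *locking, float *robj, int i, float val)
    {
        while (atomicCAS(locking, 0, 1) != 0)
            ;                       // spin until the lock looks free (deadlock prone)
        robj[i] += val;             // critical section: a group of non-atomic updates
        __threadfence();            // make the updates visible before releasing
        atomicExch(locking, 0);     // free the lock
    }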

  11. Deadlock-free solutions • Explicit Warp Serialization: the threads of a warp (thread IDs 0-31) take turns; each thread in sequence acquires the lock, executes the critical section, and releases the lock before the next thread proceeds • Explicit Conditional Branch: each thread loops with a flag (do = true); in every iteration only the thread that successfully acquires the lock executes the critical section, releases the lock, and sets do = false to leave the loop
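A hedged sketch of the explicit conditional branch variant; the explicit warp serialization variant instead iterates over the 32 lane IDs so that only one thread of a warp attempts the lock at a time. Details are assumptions, not the authors' exact code.

    // Deadlock-free coarse-grained locking via an explicit conditional branch:
    // only the thread that wins the lock in a given iteration enters the critical
    // section and then drops out of the retry loop, so the lock holder is never
    // stuck behind spinning threads of its own warp.
    __device__ void critical_section_safe(int *locking, float *robj, int i, float val)
    {
        bool done = false;
        while (!done) {
            if (atomicCAS(locking, 0, 1) == 0) {   // lock acquired by this thread
                robj[i] += val;                    // critical section
                __threadfence();
                atomicExch(locking, 0);            // release the lock
                done = true;                       // leave the retry loop
            }
        }
    }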

  12. Hybrid Scheme • Balance between Full Replication and the Locking Scheme • Insert a middle layer, "group", into the thread organization • Intra-group: Locking Scheme • Inter-group: Full Replication • The benefits of the Hybrid Scheme vary with group size • Advantages with an appropriate group size: • Reduced memory overhead • Reduced combination overhead • Better use of shared memory • Reduced conflicts
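One plausible realization of the Hybrid Scheme, reusing K, compute(), and atomicAddFloat() from the earlier sketches; the group count, the thread-to-group mapping, and the placement of per-group copies in shared memory are illustrative choices, not the paper's exact configuration.

    // Hybrid Scheme sketch: within each block, threads are split into groups; each
    // group shares one private K-entry copy of the reduction object in shared memory
    // (replication across groups), and threads inside a group synchronize on that
    // copy with the fine-grained lock (locking within a group).
    #define GROUPS_PER_BLOCK 4   // illustrative; the group count is a tuning knob

    __global__ void reduction_hybrid(const float *elements, int n,
                                     float *robj_out /* gridDim.x * GROUPS_PER_BLOCK * K */)
    {
        __shared__ float group_robj[GROUPS_PER_BLOCK * K];

        int tid   = blockIdx.x * blockDim.x + threadIdx.x;
        int group = threadIdx.x % GROUPS_PER_BLOCK;   // this thread's group

        // Zero the per-group copies cooperatively.
        for (int j = threadIdx.x; j < GROUPS_PER_BLOCK * K; j += blockDim.x)
            group_robj[j] = 0.0f;
        __syncthreads();

        for (int e = tid; e < n; e += gridDim.x * blockDim.x) {
            int   i;
            float val;
            compute(elements[e], &i, &val);
            atomicAddFloat(&group_robj[group * K + i], val);   // contention only within the group
        }
        __syncthreads();

        // Flush the per-group copies: far fewer copies than Full Replication remain
        // to be combined, and far fewer threads contend than under pure Locking.
        for (int j = threadIdx.x; j < GROUPS_PER_BLOCK * K; j += blockDim.x)
            robj_out[blockIdx.x * GROUPS_PER_BLOCK * K + j] = group_robj[j];
    }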

  13. Model of Hybrid Scheme (Diagram: on the device, the threads of each block are organized into groups, and each group has a private copy of the reduction object; the per-group intermediate results are combined on the host into the final result.)

  14. Experiment Setup • Setup • NVIDIA Tesla C1060 • 4GB device memory, 16KB shared memory • Compute capability 1.3 • AMD Opteron 2218 host processors • Applications • K-Means Clustering • Principal Component Analysis (PCA) • K-Nearest Neighbor Search (KNN) • Evaluate the performance of the three parallelization techniques: • Full Replication • Locking Scheme • Hybrid Scheme • Analyze the factors influencing performance

  15. K-Means: K=10, data size = 2GB (Chart: results as the number of groups is varied; fewer groups give low memory and combination overhead, but high contention.)

  16. K-Means: K=10 • Comparison of the best configurations • Hybrid outperforms both Full Replication and the Locking Scheme

  17. KNN: K=10, data size = 20MB • High contention with a small reduction object (K=10) • Locking uses explicit warp serialization, which gives better performance than the explicit conditional branch • The Hybrid Scheme uses 32 groups, matching the number of threads in a warp • No two threads in the same warp go to the same group, so race conditions exist only among threads in different warps • Locking is much more sensitive to the number of threads due to the high overhead of coarse-grained locking

  18. KNN: K=10 • Comparison of the best configurations • Full Replication: 9.6 times faster than the Locking Scheme • Hybrid: 62.3 times faster than the Locking Scheme

  19. KNN: CUDA Profiler Results • Hybrid achieves a balance between divergent branches and global stores • 1.6% of the divergent branches of Locking • 0.7% of the global stores of Full Replication

  20. KNN: CUDA Profiler Results • As K increases, the number of divergent branches in Locking does not change much, but the number of global stores in Full Replication increases dramatically

  21. KNN: varying K • Based on the observed trends in divergent branches and global stores, Locking will outperform Full Replication as K increases

  22. Conclusions • Performance depends on • Characteristics of the application • Thread configuration • Choice of scheme • Full Replication • Viable when the reduction object space is small and the combination cost is low • Locking Scheme • Viable when the reduction object is large enough to keep contention overhead low • Hybrid Scheme • Obtains a balance between combination and memory overhead on one side and synchronization cost on the other • Achieves the best performance for the benchmarks we considered

  23. Thank you Questions? Contacts: Xin Huo huox@cse.ohio-state.edu Vignesh Ravi raviv@cse.ohio-state.edu Wenjing Ma mawe@cse.ohio-state.edu Gagan Agrawal agrawal@cse.ohio-state.edu
