Hardware Transactional Memory for GPU Architectures






Presentation Transcript


  1. Hardware Transactional Memory for GPU Architectures Wilson W. L. Fung Inderpeet Singh Andrew Brownsword Tor M. Aamodt University of British Columbia In Proc. 2011 ACM/IEEE Int’l Symp. Microarchitecture (MICRO-44)

  2. Motivation: Lifetime of GPU Application Development. Developers typically get functionality working first, then tune for performance. E.g., N-Body with 5M bodies: the CUDA SDK O(n²) version (barriers) takes 1640 s, while Barnes Hut O(n log n) (fine-grained locks) takes 5.2 s. Transactional memory aims to offer the functionality of simple synchronization while approaching fine-grained locking performance. [Figure: development timelines contrasting fine-grained locking vs. transactional memory]

  3. Are TM and GPUs Incompatible? GPUs differ from multi-core CPUs: 1000s of concurrent scalar threads create challenges from a TM perspective. Our solution: KILO TM, a hardware TM for GPUs.

  4. Hardware TM for GPUs, Challenge #1: SIMD Hardware. On GPUs, scalar threads in a warp/wavefront execute in lockstep. Consider a warp with 4 scalar threads executing:

    ...
    TxBegin
    LD  r2,[B]
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit
    ...

If some threads commit while others abort, the warp diverges at TxCommit: branch divergence! [Figure: threads T0–T3 of one warp splitting into committed and aborted groups]

  5. KILO TM – Solution to Challenge #1: SIMD Hardware. Treat a transaction abort like a loop: an aborted thread simply branches back to TxBegin and re-executes, so KILO TM only extends the SIMT stack that already handles divergent loops. A sketch of these semantics follows below.

    ...
    TxBegin      <- aborted threads restart here
    LD  r2,[B]
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit
    ...
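To make the abort-as-loop semantics concrete, here is a minimal CUDA-style sketch (an illustration, not the hardware interface): tx_try_commit() is a hypothetical stand-in for the hardware TxCommit, and the do/while shows why the existing SIMT stack machinery for divergent loops suffices.

    __device__ bool tx_try_commit();  // hypothetical stand-in for hardware TxCommit

    __device__ void tx_example(int *A, const int *B) {
        bool committed = false;
        do {                              // TxBegin: re-entry point after abort
            int r2 = *B;                  // LD  r2,[B]
            r2 = r2 + 2;                  // ADD r2,r2,2
            *A = r2;                      // ST  r2,[A] (goes to the write-log)
            committed = tx_try_commit();  // fails on conflict => abort
        } while (!committed);             // abort == branch back, like any loop
    }

Threads in the warp that commit on the first try simply fall out of the loop and wait at the reconvergence point, exactly as they would after ordinary divergent loop iterations.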

  6. Hardware TM for GPUs, Challenge #2: Transaction Rollback. A CPU core has 10s of registers: checkpoint the register file at TX entry, restore it at TX abort. A GPU core (SM) has a 32k-register file shared by many warps; hardware checkpoints for all of them would require about 2MB of additional on-chip storage. [Figure: CPU vs. GPU (SM) register files, with checkpoint at TX entry/abort]

  7. KILO TM – Solution to Challenge #2: Transaction Rollback. Use a software register checkpoint: only registers that are live into the transaction are saved. In most TXs, registers are overwritten at first use (like r2 below) and need no checkpoint; the TX in Barnes Hut checkpoints just 2 registers. A sketch follows below.

    TxBegin
    LD  r2,[B]   <- r2 overwritten at first use: no checkpoint needed
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit
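A minimal sketch of the software checkpoint under the same assumptions (tx_try_commit() remains a hypothetical stand-in): the compiler spills only registers that are both live into the transaction and modified inside it, and restores them at the top of the retry loop so an abort rolls them back.

    __device__ void tx_with_checkpoint(int *A, const int *B, int live0, int live1) {
        int ckpt0 = live0, ckpt1 = live1;  // SW checkpoint: only 2 live-in registers
        bool committed = false;
        do {
            live0 = ckpt0; live1 = ckpt1;  // restore on entry and after each abort
            live0 += *B;                   // live0/live1 are modified in the TX,
            *A = live0 + live1;            // so they need the checkpoint; temporaries
            committed = tx_try_commit();   // overwritten at first use need none
        } while (!committed);
    }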

  8. Hardware TM for GPUs, Challenge #3: Conflict Detection. Existing HTMs rely on a cache coherence protocol, which is not available on GPUs, and there is no private data cache per thread. Signatures? At 1024 bits per thread, 30k threads would need 3.8MB of signature storage.

  9. Hardware TM for GPUs, Challenge #4: Write Buffer. Could the per-core L1 buffer transactional writes? Fermi's 48kB L1 data cache holds 384 × 128B lines, but each GPU core (SM) runs 1024–1536 threads. Problem: 384 lines / 1536 threads is less than 1 line per thread!

  10. KILO TM: Value-Based Conflict Detection. Each transaction keeps a read-log and a write-log in its private memory. Example, with global memory initially A=1, B=0: TX1 runs atomic {B=A+1} (TxBegin; LD r1,[A]; ADD r1,r1,1; ST r1,[B]; TxCommit), logging read A=1 and write B=2. TX2 runs atomic {A=B+2} (TxBegin; LD r2,[B]; ADD r2,r2,2; ST r2,[A]; TxCommit), logging read B=0 and write A=2. At commit, a transaction self-validates: it re-reads every address in its read-log and compares against the logged values; any mismatch aborts the TX. Self-validation + abort only detects the existence of a conflict, not its identity. A sketch of the commit logic follows below.
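This sketch assumes a flat array log layout for illustration (the real logs live in private memory with a different format): self-validation re-reads every address in the read-log, and only if every value still matches does the transaction publish its write-log.

    struct LogEntry { int *addr; int value; };  // assumed layout, for illustration

    __device__ bool validate_and_commit(const LogEntry *read_log,  int n_reads,
                                        const LogEntry *write_log, int n_writes) {
        for (int i = 0; i < n_reads; ++i)        // self-validation
            if (*read_log[i].addr != read_log[i].value)
                return false;                    // some conflict happened: abort
        for (int i = 0; i < n_writes; ++i)       // validation passed: commit
            *write_log[i].addr = write_log[i].value;
        return true;
    }

The mismatch test only says that some value changed, not which transaction changed it. It also quietly assumes nothing commits between the two loops – exactly the race the next slide exposes.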

  11. Parallel Validation? Data Race!?! The serializable outcomes are: Tx1 then Tx2 gives A=4, B=2; Tx2 then Tx1 gives A=2, B=3. But if both transactions validate in parallel, each sees its logged values still intact (TX1's read A=1, TX2's read B=0), both pass, and both commit – leaving A=2, B=2, a result matching neither serial order.

  12. Serialize Validation? A commit unit validates and commits (V+C) one transaction at a time against global memory: TX2 stalls until TX1's V+C completes (V = validation, C = commit). Benefit #1: no data race. Benefit #2: no livelock. Drawback: it serializes non-conflicting transactions ("collateral damage").

  13. Solution: Speculative Validation. Key idea: split conflict detection into two parts: (1) validation against recently committed TXs proceeds in parallel; (2) detection against concurrently committing TXs is approximate and resolved in commit order, stalling a TX only on a suspected read-set overlap. Conflicts are rare, so commit parallelism is good (V = validation, C = commit). A sketch of the two-part check follows below.
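A hedged pseudocode sketch of the split, reusing the LogEntry layout from the commit sketch above. The hash-indexed last_writer table below is an assumed structure for illustration; the actual commit unit's last writer history is organized differently. Part 1 is the exact value check against global memory, covering already-committed transactions and running in parallel; part 2 approximately flags overlap with older, still-committing transactions, stalling the committer only on a (rare) hit so it can re-validate in commit order.

    enum TxOutcome { TX_COMMIT, TX_ABORT, TX_STALL };
    #define SLOTS 1024
    __device__ int last_writer[SLOTS];  // assumed: youngest committing TX id per slot

    __device__ TxOutcome speculative_validate(const LogEntry *read_log,
                                              int n_reads, int my_tx_id) {
        TxOutcome out = TX_COMMIT;
        for (int i = 0; i < n_reads; ++i) {
            // Part 1: exact value check (conflicts with already-committed TXs)
            if (*read_log[i].addr != read_log[i].value)
                return TX_ABORT;
            // Part 2: approximate check against older TXs still committing
            unsigned h = ((unsigned long long)read_log[i].addr >> 2) % SLOTS;
            if (last_writer[h] != 0 && last_writer[h] < my_tx_id)
                out = TX_STALL;  // suspected overlap: stall, then re-validate
        }
        return out;
    }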

  14. KILO TM Implementation. Minimal modification to the existing GPU architecture: extended SIMT stacks, a TX log unit, and a commit unit.

  15. Evaluation Methodology • GPGPU-Sim 3.0 (BSD license), detailed model with IPC correlation of 0.93 vs. GT200 • KILO TM modeled with timing-driven memory accesses • GPU TM applications: Hash Table (HT-H, HT-L), Bank Account (ATM), Cloth Physics (CL), Barnes Hut (BH), CudaCuts (CC), Data Mining (AP)

  16. Performance (vs. Serializing TX). Higher is better; serializing TX ≈ coarse-grained locks. [Chart: performance relative to serializing TX execution]

  17. Performance (Exec. Time). Lower is better. KILO TM captures 59% of fine-grained (FG) lock performance. [Chart: Normalized Exec. Time (0–3) for Ideal TM, KILO TM, and FG Lock across HT-H, HT-L, ATM, CL, BH, CC, AP]

  18. Implementation Complexity • TX logs stored in private memory, cached in the L1 data cache • Commit unit: 5kB last writer history unit, 19kB transaction status, 32kB read-set and write-set buffer • CACTI 5.3 @ 40nm: 0.40mm² × 6 memory partitions = 0.5% of a 520mm² die

  19. Summary • KILO TM: hardware TM for GPUs • 1000s of concurrent scalar TXs • Handles scalar TX abort on SIMD hardware • No dependence on a cache coherence protocol • Word-level conflict detection • Unbounded transactions • 59% of fine-grained locking performance • 128X faster than serializing TX execution • 0.5% area overhead. Questions?

  20. Backup Slides

  21. ABA Problem? Classic example: a linked-list-based stack. Thread 0 runs pop() while thread 2 interleaves pop A, pop B, push A; the atomicCAS still succeeds because top again equals A, even though the stack underneath has changed. [Figure: stack top→A→B→C→Null; after thread 2's pops/push, top→A→C→Null; after the stale CAS, top→B (already removed)]

    // Thread 0 – pop():
    while (true) {
        t = top;
        next = t->next;
        // thread 2 interleaves here: pop A, pop B, push A
        if (atomicCAS(&top, t, next) == t)
            break;  // succeeds, but installs the stale next pointer (B)!
    }

  22. ABA Problem? atomicCAS protects only a single word – only part of the data structure. Value-based conflict detection protects all relevant parts of the data structure, because every value the transaction read must still match at commit time. [Figure: stack top→A→B→C→Null]

    while (true) {
        t = top;
        next = t->next;   // t->next is NOT protected by the CAS on top
        if (atomicCAS(&top, t, next) == t)
            break;
    }
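For contrast, a sketch of the same pop() as a KILO TM transaction, with tx_begin()/tx_try_commit() again hypothetical stand-ins for the hardware interface. The read-log records both top and t->next, so thread 2's pop-A/pop-B/push-A interleaving changes the logged value of A->next from B to C, self-validation fails, and the transaction retries instead of corrupting the stack.

    struct Node { Node *next; };

    __device__ void tx_begin();       // hypothetical hardware intrinsics,
    __device__ bool tx_try_commit();  // as in the earlier sketches

    __device__ Node *tx_pop(Node **top) {
        Node *t;
        bool committed = false;
        do {
            tx_begin();
            t = *top;                            // read-log: (top, A)
            Node *next = t ? t->next : nullptr;  // read-log: (&A->next, B)
            *top = next;                         // write-log; visible at commit
            committed = tx_try_commit();         // aborts if top OR A->next changed
        } while (!committed);
        return t;
    }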
