Cache Coherence for GPU Architectures. Inderpreet Singh (1), Arrvindh Shriraman (2), Wilson Fung (1), Mike O’Connor (3), Tor Aamodt (1). (1) University of British Columbia, (2) Simon Fraser University, (3) AMD Research. Image source: www.forces.gc.ca
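What is a GPU? The CPU spawns work on the GPU as workgroups, which are made up of wavefronts, and continues once the GPU is done. [Figure: GPU cores, each with a private L1 data cache (L1D), connected through an interconnect to the L2 banks; a timeline alternating between CPU (spawn) and GPU (done).]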
Evolution of GPUs: from the graphics pipeline (OpenGL/DirectX: vertex shader, pixel shader) to compute (OpenCL, CUDA), e.g. matrix multiplication.
Evolution of GPUs. Future: a coherent memory space, enabling efficient critical sections (lock shared structure … computation … unlock), load balancing, and stencil computation. [Figure label: Workgroups.]
GPU Coherence Challenges. Challenge 1: Coherence traffic. [Chart: interconnect traffic for no coherence, MESI and GPU-VI on applications that do not require coherence.] Recalls: [Figure: cores C1 to C4 each issue a stream of loads (Load C, Load D, Load E, …) and each L1D holds A and B. The L2/directory also holds A and B; when it receives a gets C, it sends a recall (rcl A) to every L1D and collects an ack from each.]
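The recall traffic in the figure can be illustrated with a small counting model. This is only a sketch under stated assumptions: the `Directory` class, its oldest-entry-first eviction, and the one-recall-plus-one-ack cost per sharer are hypothetical simplifications, not the MESI or GPU-VI protocol evaluated in the talk, and request and data messages are not counted.

```python
class Directory:
    """Hypothetical inclusive directory that tracks sharers per block."""

    def __init__(self, capacity):
        self.capacity = capacity   # number of blocks the directory can track
        self.sharers = {}          # block -> set of L1 ids (insertion ordered)
        self.recall_traffic = 0    # recall + ack messages sent so far

    def gets(self, block, core):
        """Handle a read request (gets) from `core` for `block`."""
        if block not in self.sharers and len(self.sharers) >= self.capacity:
            # No free entry: evict the oldest block and recall every L1 copy.
            victim = next(iter(self.sharers))
            holders = self.sharers.pop(victim)
            self.recall_traffic += 2 * len(holders)  # one rcl + one ack each
        self.sharers.setdefault(block, set()).add(core)


directory = Directory(capacity=2)
for core in range(4):          # four cores all cache A and B
    directory.gets("A", core)
    directory.gets("B", core)
directory.gets("C", 0)         # forces A out of the directory
print(directory.recall_traffic)
```

With four sharers, a single request for C costs eight extra messages in this model, even though no core ever wrote to A.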
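GPU Coherence Challenges. Challenge 2: Tracking in-flight requests. Transient states (e.g. S_M between S, Shared, and M, Modified) mean the L2/directory must buffer in-flight requests in MSHRs, and that storage is a significant % of the L2.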
GPU Coherence Challenges. Challenge 3: Complexity. [Figure: state/event transition tables for a non-coherent L1 and L2 next to those for the MESI L1 and L2.]
GPU Coherence Challenges. All three challenges result from introducing coherence messages on a GPU. Traffic: transferring them. Storage: tracking them. Complexity: managing them. GPU cache coherence without coherence messages? YES, using global time.
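Temporal Coherence (TC). All cores and L2 banks see a global time. Local timestamp > global time: the L1 copy is VALID. Global timestamp < global time: there are NO L1 COPIES. [Figure: Core 1 and Core 2, each with an L1D, connected through the interconnect to an L2 bank; block A=0 starts with timestamp 0.]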
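Temporal Coherence (TC). Example with no coherence messages. [Figure: at T=0, Core 1 issues Load A; the L2 bank returns A=0 with timestamp 10 and records timestamp 10 for A. At T=11, the copy in Core 1's L1D has expired. At T=15, Core 2 issues Store A=1; the L2 bank sees that A's timestamp is below the global time and writes A=1 without contacting any L1D.]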
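The two timestamp rules can be exercised with a minimal simulation. This is a sketch, not the authors' implementation: the `L2Bank` and `L1Cache` classes, the fixed lifetime of 10, and the choice to raise an error on a store to an unexpired block are all assumptions made for illustration.

```python
class L2Bank:
    """Hypothetical L2 bank holding a value and a global timestamp per block."""

    def __init__(self):
        self.value = {}
        self.global_ts = {}

    def load(self, addr, now, lifetime):
        # Grant the requester a copy that is valid until now + lifetime.
        ts = now + lifetime
        self.global_ts[addr] = max(self.global_ts.get(addr, 0), ts)
        return self.value.get(addr, 0), ts

    def may_have_l1_copies(self, addr, now):
        # Rule from the slide: global timestamp < global time => no L1 copies.
        return not (self.global_ts.get(addr, 0) < now)

    def store(self, addr, value, now):
        if self.may_have_l1_copies(addr, now):
            raise RuntimeError("unexpired block: not modeled in this sketch")
        self.value[addr] = value   # no invalidations, no acks


class L1Cache:
    """Hypothetical L1 that self-invalidates by comparing timestamps."""

    def __init__(self, l2, lifetime=10):
        self.l2, self.lifetime, self.lines = l2, lifetime, {}

    def load(self, addr, now):
        line = self.lines.get(addr)
        if line is not None and line[1] > now:   # local timestamp > global time
            return line[0]
        self.lines[addr] = self.l2.load(addr, now, self.lifetime)
        return self.lines[addr][0]


l2 = L2Bank()
core1 = L1Cache(l2)
print(core1.load("A", now=0))    # miss, A=0 cached with timestamp 10
print(core1.load("A", now=5))    # hit, timestamp 10 > 5
l2.store("A", 1, now=15)         # timestamp 10 < 15, so the write proceeds
print(core1.load("A", now=15))   # local copy expired, refetches A=1
```

The store never sends a message to Core 1; the stale copy disappears on its own because its local timestamp is no longer greater than the global time.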
Temporal Coherence (TC). What lifetime values should be requested on loads? Use a predictor to predict lifetime values. What about stores to unexpired blocks? Stall them at the L2?
TC Stalling Issues. Stall? Problem #1: Sensitive to mispredictions. Problem #2: Impedes other accesses. Problem #3: Hurts existing GPU applications. Solution: TC-Weak.
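Problem #1 can be made concrete with a little arithmetic. The sketch below assumes that a store stalled at the L2 may complete at the first time step where the block's global timestamp is below the global time; the function names are hypothetical.

```python
def store_completion_time(arrival, global_ts):
    """Earliest time a stalled store can write, given the block's timestamp."""
    return max(arrival, global_ts + 1)


def stall_cycles(load_time, lifetime, store_time):
    """Stall seen by a store that follows a load granted `lifetime`."""
    global_ts = load_time + lifetime
    return store_completion_time(store_time, global_ts) - store_time


# A load at T=0 followed by a store at T=5, for increasingly long lifetimes.
for lifetime in (10, 100, 1000):
    print(lifetime, stall_cycles(load_time=0, lifetime=lifetime, store_time=5))
```

In this model the stall grows in step with the predicted lifetime, so an over-long prediction translates directly into store latency.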
TC-Weak. Stores return a Global Write Completion Time (GWCT), and there is no stalling at the L2. [Figure: a wavefront on GPU Core 1 executes (1) data=NEW, (2) FENCE, (3) flag=SET. GPU Core 2's L1D holds data=OLD with timestamp 30. The L2 bank holds data with timestamp 30 and flag=NULL with timestamp 47. At T=1 the store data=NEW is written at the L2 and its GWCT is recorded in the core's GWCT table, which has one entry per wavefront (W0, W1). The FENCE holds the wavefront until T=31, after which the store flag=SET is issued.]
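The GWCT mechanism can be sketched as follows. The `L2Bank` and `Wavefront` classes are hypothetical, and the release rule (a fence completes once the global time exceeds the wavefront's largest GWCT) is an assumption chosen to match the T=31 shown in the figure.

```python
class L2Bank:
    """Hypothetical L2 bank for TC-Weak: stores never stall."""

    def __init__(self, timestamps):
        self.value = {}
        self.global_ts = dict(timestamps)   # block -> latest granted timestamp

    def store(self, addr, value):
        self.value[addr] = value            # written immediately
        return self.global_ts.get(addr, 0)  # GWCT: when stale L1 copies are gone


class Wavefront:
    """Hypothetical wavefront with a single GWCT table entry."""

    def __init__(self):
        self.gwct = 0

    def store(self, l2, addr, value):
        # Keep the largest GWCT returned by any earlier store.
        self.gwct = max(self.gwct, l2.store(addr, value))

    def fence(self, now):
        # The fence completes once global time has passed the recorded GWCT.
        return max(now, self.gwct + 1)


l2 = L2Bank({"data": 30, "flag": 47})
w0 = Wavefront()
w0.store(l2, "data", "NEW")    # issued at T=1, GWCT = 30
t = w0.fence(now=1)            # wavefront resumes at T=31
w0.store(l2, "flag", "SET")    # issued after the fence, GWCT = 47
print(t, w0.gwct)
```

The cost of waiting moves from the L2, where it would block the store, to the fence in the issuing wavefront, so code without fences never pays it.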
Methodology. GPGPU-Sim v3.1.2 for the GPU core model. GEMS Ruby v2.1.1 for the memory system. All protocols written in SLICC. Models a generic NVIDIA Fermi-based GPU (see paper for details). Applications: 6 that do not require coherence and 6 that require coherence. Listed applications: Barnes Hut, Cloth Physics, Versatile Place and Route, Max-Flow Min-Cut, 3D Wave Equation Solver, Octree Partitioning. Coherence uses: locks, stencil communication, load balancing.
Interconnect Traffic. TC-Weak reduces traffic by 53% over MESI and 23% over GPU-VI for intra-workgroup applications. Lower traffic than a 16x-sized 32-way directory. [Chart: interconnect traffic for NO-COH, MESI, GPU-VI and TC-Weak on applications that do not require coherence.]
Performance. TC-Weak with a simple predictor performs 85% better than disabling L1 caches. It performs 28% better than TC with stalling. Larger directory sizes do not improve performance. [Chart: speedup for NO-L1, MESI, GPU-VI and TC-Weak on applications that require coherence.]
Complexity. [Figure: state/event transition tables for the non-coherent L1 and L2, the MESI L1 and L2, and the TC-Weak L1 and L2.]
Summary. First work to characterize GPU coherence challenges. Save traffic and energy by using global time. Reduce protocol complexity. 85% performance improvement over no coherence. Questions?
Lifetime Predictor. One prediction value per L2 bank. Events local to the L2 bank update the prediction value. Expired load: prediction ↑. Unexpired store: prediction ↓. Unexpired eviction: prediction ↓. [Figure: an L2 bank at T=0 and T=20 handling Load A and Store A for block A with timestamps 10 and 30; the events apply prediction++ and prediction-- to the bank's prediction value.]
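The three update rules map onto a small per-bank counter. This is a sketch under assumptions: the class name, the initial value, the step size of 1 and the clamping bounds are all hypothetical, since the slide gives only the direction of each update.

```python
class LifetimePredictor:
    """Hypothetical per-L2-bank lifetime predictor."""

    def __init__(self, initial=10, step=1, minimum=0, maximum=1000):
        self.value = initial
        self.step = step
        self.minimum = minimum
        self.maximum = maximum

    def _adjust(self, delta):
        self.value = min(max(self.value + delta, self.minimum), self.maximum)

    def on_load(self, block_ts, now):
        # Expired load: the previous lifetime ran out too early, so raise it.
        if block_ts is not None and block_ts < now:
            self._adjust(+self.step)
        return self.value   # lifetime to grant to this load

    def on_store(self, block_ts, now):
        # Unexpired store: L1 copies outlived their usefulness, so lower it.
        if block_ts is not None and block_ts >= now:
            self._adjust(-self.step)

    def on_eviction(self, block_ts, now):
        # Unexpired eviction: the block leaves the L2 while still leased.
        if block_ts is not None and block_ts >= now:
            self._adjust(-self.step)


p = LifetimePredictor(initial=10)
print(p.on_load(block_ts=None, now=0))   # first touch, prediction unchanged
p.on_store(block_ts=10, now=5)           # unexpired store, prediction drops
print(p.value)
print(p.on_load(block_ts=10, now=20))    # expired load, prediction rises
```

Because every update uses only information already present at the L2 bank, the predictor needs no extra messages from the cores.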
TC-Strong vs TC-Weak. [Charts: speedup over all applications in two settings, a fixed lifetime for all applications and the best lifetime for each application, comparing TCSUO, TCS, TCSOO, TCW and TCW w/ predictor.]