Many-Thread Aware Prefetching Mechanisms for GPGPU Applications

Presentation Transcript

Introduction

[Diagram: a GPGPU core with SIMD execution units, shared memory, and a memory request buffer, connected to DRAM.]

  • General-Purpose GPUs (GPGPUs) are becoming popular

    • High-performance capability (NVIDIA GeForce GTX 580: 1.5 TFLOPS)

    • GPGPUs rely on SIMD execution, many cores, and large-scale multi-threading

      • Warp – the basic unit of execution in a core (SIMD unit)


Memory Latency Problem

[Diagram: execution timeline with 4 active threads. Each thread runs computation (C), issues a memory access (M), and then executes instructions dependent on memory (D). The core switches from T0 to T1, T2, and T3 while each memory access is outstanding, so the memory latency is hidden and there are no stalls.]

  • Tolerating memory latency is critical in CPUs

    • Many techniques have been proposed

      • Caches, prefetching, multi-threading, etc.

  • Is this critical in GPGPUs as well?

    • GPGPUs already employ multi-threading to tolerate it



Memory Problems in GPGPUs

[Diagram: the same timeline with only 2 active threads. After switching between T0 and T1 there is no other thread left to run, so the remaining memory latency shows up as stall cycles.]

  • What if there are not enough threads in GPGPUs?

  • Limited thread-level parallelism

    • Application behavior

      • Algorithmic lack of parallelism

    • Limited by resource constraints

      • # registers per thread, # threads per block, shared memory usage per block (a rough calculation follows below)
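To make the resource constraint concrete, here is a small back-of-the-envelope sketch. All of the numbers (registers, shared memory, and the hardware thread limit per core) are illustrative assumptions, not values taken from the talk:

#include <algorithm>
#include <cstdio>

int main() {
  // Assumed per-core resources (illustrative only).
  const int regs_per_core  = 8 * 1024;    // register file entries per core
  const int smem_per_core  = 16 * 1024;   // bytes of shared memory per core
  const int max_threads_hw = 768;         // hardware thread limit per core

  // Assumed per-thread requirements of some kernel.
  const int regs_per_thread = 32;
  const int smem_per_thread = 64;         // bytes

  int by_regs = regs_per_core / regs_per_thread;   // 256 threads allowed by registers
  int by_smem = smem_per_core / smem_per_thread;   // 256 threads allowed by shared memory
  int active  = std::min({by_regs, by_smem, max_threads_hw});

  // Only 256 threads can be active, far below the hardware limit:
  // not enough thread-level parallelism to hide long memory latencies.
  printf("active threads per core = %d\n", active);
  return 0;
}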



Prefetching in GPGPUs

  • Problem: when multi-threading is not enough, how can we hide memory latency?

    • Other solutions

      • Caching (NVIDIA Fermi)

      • Prefetching (in this talk)

  • Many prefetching mechanisms have been proposed for CPUs

    • stride, stream, Markov, CDP, GHB, helper thread, etc.

  • Question: will the existing mechanisms work in GPGPUs?


Characteristic #1. Many Threads

[Diagram: a CPU prefetcher is trained by the access stream of one or two threads, whereas a GPU prefetcher sees the interleaved accesses of many threads at once.]

  • Problem #1. Training of the prefetcher (scalability)

    • Accesses from many threads are interleaved (see the sketch below)

  • Problem #2. Amplified negative effects (SIMT)

    • One useless prefetch request per thread → many useless prefetches
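A minimal sketch of the training problem, using a toy per-PC last-address stride detector (the detector and the access stream are illustrative, not the design evaluated in the talk):

#include <cstdio>

// Toy per-PC stride detector: one entry that remembers the last address and stride.
// A single thread accessing A, A+4, A+8, ... trains a stable stride of 4.
// When accesses from many threads to the same PC are interleaved, the observed
// "strides" jump around and the detector never locks onto the per-thread pattern.
struct StrideEntry { long last_addr = -1; long stride = 0; };

int main() {
  StrideEntry e;                                    // single entry for one PC
  long tids[] = {0, 17, 3, 42, 0, 17, 3, 42};       // interleaved thread ids
  for (int i = 0; i < 8; ++i) {
    long addr = 1000 + 4 * tids[i] + 4 * (i / 4);   // each thread's own stride is 4
    if (e.last_addr >= 0) {
      long observed = addr - e.last_addr;           // 68, -56, 156, -164, ... never settles
      printf("observed stride = %ld\n", observed);
      e.stride = observed;
    }
    e.last_addr = addr;
  }
  return 0;
}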


Characteristic #1. Many Threads (cont'd)

[Diagram: with prefetch degree 1, the outstanding prefetches from all threads still fit in the cache; with prefetch degree 2, the number of prefetched blocks far exceeds the cache and causes capacity misses.]

  • Problem #3. Single-Configuration Many-Threads (SCMT)

    • One prefetcher configuration controls too many threads at once

      Prefetch degree 1: prefetched data fit in the cache

      Prefetch degree 2: prefetched data far exceed the cache size (a rough footprint calculation follows below)
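A rough footprint calculation for the SCMT problem. The active thread count and the cache-line size are assumptions; the 16 KB cache matches the per-core prefetch cache used later in the evaluation:

#include <cstdio>

int main() {
  const int threads_per_core = 160;        // assumed number of active threads issuing prefetches
  const int line_size        = 64;         // assumed bytes per cache line
  const int cache_size       = 16 * 1024;  // 16 KB prefetch cache per core

  // A single prefetch-degree setting applies to every thread at once,
  // so the total footprint grows with (threads x degree x line size).
  for (int degree = 1; degree <= 2; ++degree) {
    int footprint = threads_per_core * degree * line_size;
    printf("degree %d: footprint = %d KB (%s the %d KB cache)\n",
           degree, footprint / 1024,
           footprint <= cache_size ? "fits in" : "exceeds", cache_size / 1024);
  }
  return 0;
}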


Characteristic #2. Data-Level Parallelism

[Diagram: a thread in a sequential program lives long enough that a prefetch issued one memory latency ahead of the demand is useful; a thread in a parallel program offers little opportunity because it either terminates before the data are needed or issues the prefetch too close to the demand.]

  • Problem #4. Short thread lifetime

    • Threads in parallel programs are much shorter than threads in sequential programs because the work is split across many threads


Goal

  • Design hardware/software prefetching mechanisms for GPGPU applications

    • Step 1. Prefetcher for Many-thread Architecture

      • Many-Thread Aware Prefetching Mechanisms (Problems #1 and #4)

    • Step 2. Feedback mechanism to reduce negative effects

      • Prefetch Throttling (Problems #2 and #3)




Many-Thread Aware Hardware Prefetcher

[Diagram: structure of the many-thread aware hardware prefetcher. Each memory request (PC, ADDR, TID) trains a stride prefetcher and an inter-thread (IP) prefetcher; a promotion table holds promoted strides; decision logic selects which component generates the prefetch address.]

  • (Conventional) Stride prefetcher

  • Promotion table for stride prefetcher (Problem #1)

  • Inter-Thread prefetcher (Problem #4)

  • Decision logic



Solving the Scalability Problem

[Diagram: redundant per-thread entries that share the same stride are promoted into a single entry in the promotion table.]

  • Problem #1. Training of the prefetcher (scalability)

  • Stride Promotion

    • Similar (or even identical) access patterns across threads

    • Without promotion, the table is occupied by redundant entries

      • With promotion, we can manage the table storage effectively

    • Reduces training time by reusing earlier threads' information (sketch below)
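A minimal sketch of the promotion idea, assuming a per-(PC, TID) training table and a shared promotion table. The field names, the agreement threshold, and the exact promotion rule are illustrative assumptions rather than the paper's precise design:

#include <cstdio>
#include <map>
#include <utility>

// Per-(PC, TID) training entries. When enough threads learn the same stride for a PC,
// that stride is promoted into a single shared entry: redundant per-thread entries are
// no longer needed, and later threads prefetch immediately without retraining.
struct PerThreadEntry { long last_addr; long stride; int confirmed; };

std::map<std::pair<long,int>, PerThreadEntry> per_thread;  // keyed by (PC, TID)
std::map<long, long> promotion_table;                      // PC -> promoted stride
const int kPromoteThreshold = 2;                           // assumed: threads that must agree

void train(long pc, int tid, long addr) {
  if (promotion_table.count(pc)) {
    printf("tid %d: prefetch %ld (promoted stride)\n", tid, addr + promotion_table[pc]);
    return;                                                // no per-thread training needed
  }
  auto key = std::make_pair(pc, tid);
  auto it = per_thread.find(key);
  if (it == per_thread.end()) { per_thread[key] = {addr, 0, 0}; return; }
  long stride = addr - it->second.last_addr;
  if (stride == it->second.stride) it->second.confirmed = 1;
  it->second.stride = stride;
  it->second.last_addr = addr;
  int agree = 0;                                           // threads agreeing on this stride
  for (auto &kv : per_thread)
    if (kv.first.first == pc && kv.second.confirmed && kv.second.stride == stride) ++agree;
  if (agree >= kPromoteThreshold) promotion_table[pc] = stride;
}

int main() {
  // Threads 0 and 1 independently learn stride 8 at PC 0x1a, which triggers promotion;
  // thread 7 then benefits immediately without any training of its own.
  train(0x1a, 0, 100); train(0x1a, 0, 108); train(0x1a, 0, 116);
  train(0x1a, 1, 200); train(0x1a, 1, 208); train(0x1a, 1, 216);
  train(0x1a, 7, 900);
  return 0;
}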


Solving the Short Thread Lifetime Problem

[Diagram: a prefetch is useful only if it is issued roughly one memory latency before the demand access.]

for (ii = 0; ii < 100; ++ii) {
    prefetch(A[ii+1]);
    prefetch(B[ii+1]);
    C[ii] = A[ii] + B[ii];
}

// there are 100 threads
__global__ void KernelFunction(…) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int varA = aa[tid];
    int varB = bb[tid];
    C[tid] = varA + varB;
}

The sequential version has a loop, so each iteration can prefetch data for the next one; the CUDA kernel has no loop, only 2 memory accesses and 1 computation per thread.

Problem #4 (Short thread lifetime)

Highly parallelized code often eliminates prefetching opportunities


Inter-Thread Prefetching


// there are 100 threads
__global__ void KernelFunction(…) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int next_tid = tid + 32;
    prefetch(aa[next_tid]);
    prefetch(bb[next_tid]);
    int varA = aa[tid];
    int varB = bb[tid];
    C[tid] = varA + varB;
}

[Diagram: threads T0, T1, T2 prefetch the data that threads T32, T33, ... will access; those threads in turn prefetch for T64 and beyond, so each thread covers the memory accesses of a thread 32 thread ids ahead.]

  • Instead, we can prefetch for other threads

    • Inter-Thread Prefetching (IP)

    • In CUDA, memory addresses are typically a function of the thread id [SIMT]



IP Pattern Detection in Hardware

[Worked example: four requests from the same PC (0x1a)]

Req 1: Addr 200, TID 1
Req 2: Addr 1100, TID 10
Req 3: Addr 400, TID 3
Req 4: Addr 2100, TID 1 (arrives after the entry is trained)

Detecting strides across threads: divide the address delta (Addr ∆) by the thread-id delta (TID ∆).

Delta (Req2 - Req1) = 900 / 9 = 100
Delta (Req3 - Req1) = 200 / 2 = 100
Delta (Req3 - Req2) = -700 / -7 = 100

All three deltas are the same, so we found a pattern: a per-thread stride of 100.

When Req 4 (Addr 2100, Stride 100) hits the trained entry, a prefetch request for addr + stride = 2200 is launched.

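A small sketch of the cross-thread delta check from the example above. The structure and the three-pair check are a simplified reading of the slide, not the exact hardware implementation:

#include <cstdio>

// One training entry per PC: keep the last few (addr, tid) samples and check whether
// the address delta divided by the thread-id delta is the same for every pair.
struct Req { long addr; int tid; };

// Returns the per-thread stride if all three pairwise deltas agree, 0 otherwise.
long detect_ip_stride(const Req r[3]) {
  long d01 = (r[1].addr - r[0].addr) / (r[1].tid - r[0].tid);
  long d02 = (r[2].addr - r[0].addr) / (r[2].tid - r[0].tid);
  long d12 = (r[2].addr - r[1].addr) / (r[2].tid - r[1].tid);
  return (d01 == d02 && d01 == d12) ? d01 : 0;
}

int main() {
  // The three training requests from the example: (200, tid 1), (1100, tid 10), (400, tid 3).
  Req reqs[3] = { {200, 1}, {1100, 10}, {400, 3} };
  long stride = detect_ip_stride(reqs);                  // 100 per thread id
  if (stride != 0) {
    // The trained entry now launches a prefetch of addr + stride for the next request
    // at this PC, e.g. Req 4 (Addr 2100) triggers a prefetch of 2200.
    printf("pattern found: stride = %ld, prefetch %ld\n", stride, 2100 + stride);
  }
  return 0;
}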


MT-Aware Hardware Prefetcher

[Diagram: the MT-aware hardware prefetcher pipelined across three cycles; a request (PC, ADDR, TID) trains the stride prefetcher, the promotion table and the inter-thread (IP) prefetcher are consulted, and the decision logic emits the prefetch address.]


Goal

  • Design a hardware/software prefetcher for GPGPU applications

    • Step 1. Prefetcher for Many-thread Architecture

      • Many-Thread Aware Prefetching Mechanisms

    • Step 2. Feedback mechanism to reduce negative effects

      • Prefetch Throttling


Outline

Motivation

Step 1. Many-Thread Aware Prefetching

Step 2. Prefetch Throttling

Evaluation

Conclusion


Prefetch Throttling

  • Recall problems in GPGPU prefetching

    • Problem #2. Amplifying negative effects

    • Problem #3. Single-Configuration Many-Thread

  • To identify whether prefetching is effective, we use two kinds of metrics

    • Metrics

      • Usefulness – accurate and timely prefetches

      • Harmfulness – inaccurate or too-early prefetches

        • Some late prefetches can be tolerated

      • Similar to Srinath et al. [HPCA 2007]


Throttling Metrics

  • Merged memory requests

    • A new request with the same address as an existing entry

      • Inside a core (in the MSHRs)

    • Correspond to late prefetches in CPUs

    • Indicate accuracy (because of massive multi-threading)

      • Less correlated with timeliness

  • Early block evictions from the prefetch cache

    • Caused by capacity misses, regardless of accuracy

    • Indicate both timeliness and accuracy

  • Periodic updates

    • To cope with runtime behavior (a bookkeeping sketch follows below)
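A bookkeeping sketch for these metrics. The counter names, the ratio definitions, and the idea of resetting each period are assumptions made for illustration:

#include <cstdio>

struct PrefFeedback {
  long prefetches_sent = 0;
  long merged_requests = 0;   // demand request merged with an in-flight prefetch (in the MSHRs)
  long early_evictions = 0;   // prefetched block evicted from the prefetch cache before first use
};

// Called once per update interval: turn raw counts into ratios for the throttling logic,
// then reset the counters so each period reflects current runtime behavior.
void periodic_update(PrefFeedback &fb, double &merge_ratio, double &early_evict_ratio) {
  if (fb.prefetches_sent > 0) {
    merge_ratio       = (double)fb.merged_requests / fb.prefetches_sent;
    early_evict_ratio = (double)fb.early_evictions / fb.prefetches_sent;
  } else {
    merge_ratio = early_evict_ratio = 0.0;
  }
  fb = PrefFeedback{};
}

int main() {
  PrefFeedback fb;
  fb.prefetches_sent = 100; fb.merged_requests = 8; fb.early_evictions = 55;
  double merge_ratio = 0.0, early_evict_ratio = 0.0;
  periodic_update(fb, merge_ratio, early_evict_ratio);
  printf("merge ratio = %.2f, early-eviction ratio = %.2f\n", merge_ratio, early_evict_ratio);
  return 0;
}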


Heuristic for Prefetch Throttling

* An ideal case (accurate prefetches with perfect timing) has both a low early-eviction ratio and a low merge ratio.

  • Throttle degree

    • Varies from 0 (prefetch everything) to 5 (no prefetching)

    • Default: 2 (a sketch of a possible update rule follows below)
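A sketch of how the two ratios could drive the throttle degree. The thresholds and the raise/lower rule below are assumptions for illustration; the paper's actual heuristic table differs in its details:

#include <cstdio>

// Throttle degree 0 = prefetch everything, 5 = no prefetching; the default is 2.
// Many early evictions indicate inaccurate or too-early prefetches, so the degree is raised.
// Low early eviction together with a low merge ratio is the ideal case (accurate and timely),
// so the degree is lowered. A high merge ratio with low eviction means accurate-but-late
// prefetches, which the talk treats as tolerable, so the degree is left unchanged.
int update_throttle_degree(int degree, double merge_ratio, double early_evict_ratio) {
  const double kHighEvict = 0.50;   // assumed threshold
  const double kLowEvict  = 0.10;   // assumed threshold
  const double kLowMerge  = 0.10;   // assumed threshold
  if (early_evict_ratio > kHighEvict) {
    if (degree < 5) ++degree;       // prefetching is hurting: throttle harder
  } else if (early_evict_ratio < kLowEvict && merge_ratio < kLowMerge) {
    if (degree > 0) --degree;       // ideal case: throttle less
  }
  return degree;
}

int main() {
  int degree = 2;                                          // default degree
  degree = update_throttle_degree(degree, 0.05, 0.70);     // harmful period -> degree 3
  degree = update_throttle_degree(degree, 0.05, 0.02);     // ideal period   -> degree 2
  printf("throttle degree = %d\n", degree);
  return 0;
}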


Evaluation Methodology

  • MacSim simulator

    • A cycle-accurate, in-house simulator

    • Trace-driven (traces generated with GPUOcelot [Diamos])

  • Baseline

    • 14 cores (8-wide SIMD) at 900 MHz, 16 banks / 8 channels, 1.2 GHz memory frequency, 900 MHz bus, FR-FCFS scheduling

    • NVIDIA G80 architecture

  • 14 memory-intensive benchmarks

    • From CUDA SDK, Merge, Rodinia, and Parboil

    • Types

      • Stride, MP (massively parallel), uncoalesced

    • Non-memory-intensive benchmarks (in the paper)


Evaluation Methodology – cont'd

  • Prefetch configuration

    • Stream, Stride, and GHB prefetchers evaluated

    • 16 KB prefetch cache per core (results for other sizes are in the paper)

    • Prefetch distance: 1, degree: 1 (the optimal configuration)

  • Results

    • Hardware prefetcher

    • Software prefetcher (in the paper)


MT Hardware Prefetcher Results

[Results chart: 15% over Stride]

GHB and Stride prefetchers do not work on the MP-type and uncoalesced-type benchmarks

IP (Inter-Thread Prefetching) can be effective

Stride Promotion improves the performance of a few benchmarks


MT-HWP with Throttling Results

[Results chart: 15% over Stride + Throttling]

  • Feedback-driven hardware prefetchers can be effective

  • Throttling eliminates the negative effects (e.g., of the stream prefetcher)

    * There are more negative cases with the software prefetching mechanisms


Outline

Motivation

Step 1. Many-Thread Aware Prefetching

Step 2. Prefetch Throttling

Evaluation

Conclusion


Conclusion

  • Memory latency is an important problem in GPGPUs as well.

  • GPGPU prefetching has four problems:

    • scalability, amplified negative effects, SCMT, and short thread lifetime

  • Goal: design hardware/software prefetchers

    • Step 1. Many-thread aware prefetcher (promotion, IP)

    • Step 2. Prefetch throttling

  • The MT-aware hardware prefetcher shows a 15% performance improvement, and prefetch throttling removes all the negative effects.

  • Future work

    • Study other many-thread architectures.

      • Other programming models, architectures with caches




NVIDIA Fermi Result


Different Prefetch Cache Size


Software MT Prefetcher Results



