
Detailed Cache Coherence Characterization for OpenMP Benchmarks


Presentation Transcript


  1. Detailed Cache Coherence Characterization for OpenMP Benchmarks
  Jaydeep Marathe (1), Anita Nagarajan (2), Frank Mueller (1)
  (1) Department of Computer Science, NCSU; (2) Intel Technology India Pvt. Ltd.
  NC State University

  2. Our Focus
  • Target shared-memory SMP systems.
  • Characterize coherence behavior at the application level.
  • Metrics guide coherence optimization.
  [Diagram: bus-based shared-memory SMP — Processor-1 through Processor-N, each with a cache, connected by a bus running the coherence protocol]

  3. Invalidation-based coherence protocol
  • Cache lines have state bits.
  • Data migrates between processor caches; state transitions maintain coherence.
  • MESI protocol, 4 states: M (Modified), E (Exclusive), S (Shared), I (Invalid).
  [Diagram: 1. Processor A reads "x": I→E ("exclusive"). 2. Processor B reads "x": A goes E→S, B goes I→S ("shared"). 3. Processor A writes "x": A goes S→M ("modified/dirty"), B goes S→I ("invalid") — B's cache line was invalidated.]

  4. A performance perspective
  • Writes to shared variables cause invalidation traffic (E→I, M→I, S→I).
  • Worse, invalidations lead to coherence misses!
  [Diagram: 1. Proc A writes shared vars Var_1 … Var_n — invalidations! 2. Proc B reads Var_1 … Var_n — coherence misses in Proc B! 3. Proc A writes Var_1 … Var_3 — invalidations!]
  • "Coherence bottlenecks": reducing coherence misses and invalidations → improved performance!

  5. Question: Does a coherence bottleneck exist?
  One approach: time-based profilers.
  [Screenshot: load imbalance between 2 threads in KAI GuideView — parallel time vs. imbalance time]
  • Problem: implicit information — does imbalance / speedup loss indicate a coherence bottleneck? We can't tell!

  6. Does a coherence bottleneck exist? (contd.)
  Another approach: hardware counters.
  [Chart: total misses, coherence misses, and invalidations per processor, P-1 … P-4]
  • + Detects potential coherence bottlenecks.
  • − Block-level statistics only: perturbation prevents fine-grained monitoring.
  • − Can't diagnose the cause: which source code references? Which data structures?
  • Need more detailed information!

  7. What we offer
  • Hierarchical levels of detail: overall statistics → per-reference metrics → invalidator set for each reference.

    Invalidator | True-Sharing | False-Sharing
    ------------|--------------|--------------
    V2          | 4811         | 20
    V3          | 3199         | 0
    Clock       | 200          | 1000

  • Rich metrics: coherence misses, true & false sharing, per-reference invalidators.
  • Facilitates easy isolation of bottleneck references!

  8. Our Framework
  • Bind OpenMP threads to processors for SMP parallelism.
  • Generate access traces using dynamic binary rewriting.
  • Traces drive incremental SMP cache simulation (L1 + L2 + coherence).
  [Diagram: the controller instruments the target OpenMP executable; instrumentation handlers in Thread-0 … Thread-N generate compressed access traces during execution; the traces, together with a target descriptor (instruction line, file, global & local variables), feed the SMP cache simulator, which extracts detailed coherence metrics]

  9. Target Executable Instrumentation
  [Diagram: source code — myfunc() with "#pragma omp parallel for; for (I=0; I < N; I++) { A[I] = B[I] + C[I]; } … #pragma omp barrier" — beside the generated machine code (CALL _xlsmpParallelDoSetup, LOAD B[I], LOAD C[I], STORE A[I], exit from parallel for, CALL _xlsmpBarrier) and its CFG, with instrumentation points inserted by the controller via DynInst]
  • Enhanced DynInst dynamic binary rewriting package (U. Wisconsin).
  • Instrument memory access instructions (LD/ST).
  • Instrument compiler-generated OpenMP construct functions.

  10. Per-reference metrics
  • Uniprocessor misses
  • Coherence misses
  • Invalidations
  • List & count of invalidator references
  [Diagram: the OpenMP "fork-join" model — serial code alternating with "#pragma omp parallel" regions. Invalidations, true sharing, and false sharing are each classified per region as "in"-region vs. "across"-region.]

  11. In-depth Example: SMG2000
  • ASCI Purple benchmark; has been scaled up to 3150 processors.
  • Hybrid OpenMP + MPI parallelization.
  • Code is known to be memory-intensive.
  • Code characteristics: 72 files, 24213 lines (non-whitespace).
  • Instrumentation characteristics (4 OpenMP threads, default workload):
    • Functions instrumented: 313 (69 OpenMP, 244 others)
    • Access points instrumented: 10692
      • 8531 loads (2184 64-bit, 6329 32-bit, 18 8-bit)
      • 2161 stores (722 64-bit, 1425 32-bit, 14 8-bit)
  • Tracing: 16.73 million accesses logged.

  12. Overall Performance: SMG2000
  [Charts: A. overall misses; B. cumulative coherence misses]
  • Most L2 misses are coherence misses.
  • Most invalidations result in coherence misses.
  • Only ~280 out of 10692 access points show coherence activity (2.6%!).
  • Only ~10% of these points account for >= 90% of the coherence misses.

  13. Drilling Down: Per-Access-Point Metrics
  [Table: top-5 metrics for Processor-1]
  • Group-1 refs: false-sharing in-region (same OpenMP region) invalidations dominate.
  • Group-2 ref: true-sharing in-region invalidations only.

  14. Drilling Further: A Ref & Its Invalidators

    for k = 0 to Kmax
      for j = 0 to Jmax
        #pragma omp parallel do
        for i = 0 to Imax {
          ...
          rp[k][j][i] = rp[k][j][i] - Ai[] * xp[];
        } //end omp do

  [Diagram: invalidator list; a cache line of rp shared across P1 … P4]
  • Large number of false in-region invalidations!
  • Sub-optimal parallelization: fine-grained sharing.
  • Solution: parallelize the outermost loop (coarsening).

  15. Another Optimization

    #pragma omp parallel
    num_threads = omp_get_num_threads();

  [Diagram: invalidator list; num_threads on a cache line touched by P1 … P4]
  • Multiple threads updating the same shared variable!
  • Solution: remove unnecessary sharing (SharedRemoval):

    num_threads = omp_get_max_threads();

  16. Impact of Optimizations
  • SMG2000 run on IBM SP Blue Horizon (POWER3).
  • Wall-clock times for recommended full-sized workloads (threads = 1, 2, 4, 8).
  • Maximum of 73% improvement for the 4th workload.

  17. Highlights & Future Directions
  Highlights
  • First tool for characterizing OpenMP SMP coherence performance.
  • Dynamic binary rewriting — no source code modification!
  • Detailed source-correlated statistics.
  • Rich set of coherence metrics.
  Future Directions
  • Use of partial access traces (intermittent instrumentation).
  • Other threading models — Pthreads, etc.
  • Characterizing perennial server applications (Apache).

  18. The End

  19. Simulator Accuracy: Comparing #Invalidations
  • NAS 3.0 OpenMP benchmarks + NBF.
  • Total invalidations: hardware counters (IBM SP: HPM) vs. simulator (ccSIM).
  • Accounts for OpenMP runtime overhead in HPM.
  • 16.5% maximum absolute error; most benchmarks have <= 7% error.

  20. Related Work
  MemSpy: Martonosi et al. (1992)
  • + Execution-driven simulation
  • + Classifies by code & data objects
  • + Invalidations & coherence misses
  • − No true/false sharing
  • − No invalidator lists
  • − Compiler-inserted instrumentation
  • − Uniprocessor-simulated parallel threads
  SM-Prof: Brorsson et al. (1995)
  • + Variable classification tool
  • + Access classes of "shared/private, read/write, few/many"
  • − Can't detect true/false sharing or its magnitude
  • − No coherence misses, no invalidator lists
  Rsim, Proteus, SimOS
  • Architecture-oriented simulators; only bulk statistics
  • Not meant for application developers

  21. Tracing Overhead (earlier work: METRIC)
  • 1–3 orders of magnitude overhead in most cases.
  • Conventional breakpoints (TRAP) have > 4 orders of magnitude overhead.

  22. Trap-based instrumentation (from DynInst documentation)

    application  | # operations | ops/sec   | dyninst time (sec) | gdb time (sec)
    -------------|--------------|-----------|--------------------|---------------
    compress95   | 32,513       | 406,655.7 | 0.08               | 74.35
    li (xlmatch) | 110,209      | 43,607.7  | 2.53               | 221.04
    li (compare) | 4,475        | 640.2     | 6.99               | 16.39
    li (binary)  | 401          | 19.4      | 20.69              | 21.62

  23. Compression Ratio (earlier work: METRIC)
  • Comparison of uncompressed and compressed stream sizes.
  • 1 million accesses logged.
  • 2–4 orders of magnitude compression in most cases.
