
Detailed Cache Coherence Characterization for OpenMP Benchmarks


Presentation Transcript


  1. Detailed Cache Coherence Characterization for OpenMP Benchmarks
  Jaydeep Marathe (1), Anita Nagarajan (2), Frank Mueller (1)
  (1) Department of Computer Science, NCSU; (2) Intel Technology India Pvt. Ltd.
  NC State University

  2. Our Focus
  • Target shared-memory SMP systems.
  • Characterize coherence behavior at the application level.
  • Metrics guide coherence optimization.
  [Diagram: bus-based shared-memory SMP — Processor-1 through Processor-N, each with a cache, connected by a bus running the coherence protocol]

  3. Invalidation-based coherence protocol
  • Cache lines have state bits.
  • Data migrates between processor caches; state transitions maintain coherence.
  • MESI protocol, 4 states: M (Modified), E (Exclusive), S (Shared), I (Invalid).
  [Diagram: 1. Processor A reads "x": I→E ("exclusive"). 2. Processor B reads "x": A goes E→S, B goes I→S ("shared"). 3. Processor A writes "x": A goes S→M ("modified/dirty"), B goes S→I ("invalid") — B's cache line was invalidated.]

  4. A performance perspective
  • Writes to shared variables cause invalidation traffic (E→I, M→I, S→I).
  • Worse, invalidations lead to coherence misses!
  [Diagram: 1. Proc A writes shared vars Var_1 … Var_n — invalidations! 2. Proc B reads Var_1 … Var_n — coherence misses in Proc B! 3. Proc A writes Var_1 … Var_3 — invalidations!]
  • "Coherence bottlenecks": reducing coherence misses and invalidations → improved performance!

  5. Question: Does a coherence bottleneck exist?
  One approach: time-based profilers.
  [Screenshot: load imbalance between 2 threads in KAI GuideView — parallel time vs. imbalance time]
  • Problem: implicit information — does imbalance / speedup loss indicate a coherence bottleneck? We can't tell!

  6. Does a coherence bottleneck exist? (contd.)
  Another approach: hardware counters.
  [Chart: total misses, coherence misses, and invalidations per processor, P-1 … P-4]
  • + Detects potential coherence bottlenecks.
  • − Block-level statistics only: perturbation prevents fine-grained monitoring.
  • − Can't diagnose the cause: which source code references? Which data structures?
  • Need more detailed information!

  7. What we offer
  • Hierarchical levels of detail: overall statistics → per-reference metrics → invalidator set for each reference.

    Invalidator | True-Sharing | False-Sharing
    ------------|--------------|--------------
    V2          | 4811         | 20
    V3          | 3199         | 0
    Clock       | 200          | 1000

  • Rich metrics: coherence misses, true & false sharing, per-reference invalidators.
  • Facilitates easy isolation of bottleneck references!

  8. Our Framework
  • Bind OpenMP threads to processors for SMP parallelism.
  • Generate access traces using dynamic binary rewriting.
  • Traces drive incremental SMP cache simulation (L1 + L2 + coherence).
  [Diagram: the controller instruments the target OpenMP executable; instrumentation handlers in Thread-0 … Thread-N generate compressed access traces during execution; the traces, together with a target descriptor (instruction line, file, global & local variables), feed the SMP cache simulator, which extracts detailed coherence metrics]

  9. Target Executable Instrumentation
  [Diagram: source code — myfunc() with "#pragma omp parallel for; for (I=0; I < N; I++) { A[I] = B[I] + C[I]; } … #pragma omp barrier" — beside the generated machine code (CALL _xlsmpParallelDoSetup, LOAD B[I], LOAD C[I], STORE A[I], exit from parallel for, CALL _xlsmpBarrier) and its CFG, with instrumentation points inserted by the controller via DynInst]
  • Enhanced DynInst dynamic binary rewriting package (U. Wisconsin).
  • Instrument memory access instructions (LD/ST).
  • Instrument compiler-generated OpenMP construct functions.

  10. Per-reference metrics
  • Uniprocessor misses
  • Coherence misses
  • Invalidations
  • List & count of invalidator references
  [Diagram: the OpenMP "fork-join" model — serial code alternating with "#pragma omp parallel" regions. Invalidations, true sharing, and false sharing are each classified per region as "in"-region vs. "across"-region.]

  11. In-depth Example: SMG2000
  • ASCI Purple benchmark; has been scaled up to 3150 processors.
  • Hybrid OpenMP + MPI parallelization.
  • Code is known to be memory-intensive.
  • Code characteristics: 72 files, 24213 lines (non-whitespace).
  • Instrumentation characteristics (4 OpenMP threads, default workload):
    • Functions instrumented: 313 (69 OpenMP, 244 others)
    • Access points instrumented: 10692
      • 8531 loads (2184 64-bit, 6329 32-bit, 18 8-bit)
      • 2161 stores (722 64-bit, 1425 32-bit, 14 8-bit)
  • Tracing: 16.73 million accesses logged.

  12. Overall Performance: SMG2000
  [Charts: A. overall misses; B. cumulative coherence misses]
  • Most L2 misses are coherence misses.
  • Most invalidations result in coherence misses.
  • Only ~280 out of 10692 access points show coherence activity (2.6%!).
  • Only ~10% of these points account for >= 90% of the coherence misses.

  13. Drilling Down: Per-Access-Point Metrics
  [Table: top-5 metrics for Processor-1]
  • Group-1 refs: false-sharing in-region (same OpenMP region) invalidations dominate.
  • Group-2 ref: true-sharing in-region invalidations only.

  14. Drilling Further: A Ref & Its Invalidators

    for k = 0 to Kmax
      for j = 0 to Jmax
        #pragma omp parallel do
        for i = 0 to Imax {
          ...
          rp[k][j][i] = rp[k][j][i] - Ai[] * xp[];
        } //end omp do

  [Diagram: invalidator list; a cache line of rp shared across P1 … P4]
  • Large number of false in-region invalidations!
  • Sub-optimal parallelization: fine-grained sharing.
  • Solution: parallelize the outermost loop (coarsening).

  15. Another Optimization

    #pragma omp parallel
    num_threads = omp_get_num_threads();

  [Diagram: invalidator list; num_threads on a cache line touched by P1 … P4]
  • Multiple threads updating the same shared variable!
  • Solution: remove unnecessary sharing (SharedRemoval):

    num_threads = omp_get_max_threads();

  16. Impact of Optimizations
  • SMG2000 run on IBM SP Blue Horizon (POWER3).
  • Wall-clock times for recommended full-sized workloads (threads = 1, 2, 4, 8).
  • Maximum of 73% improvement for the 4th workload.

  17. Highlights & Future Directions
  Highlights
  • First tool for characterizing OpenMP SMP coherence performance.
  • Dynamic binary rewriting — no source code modification!
  • Detailed source-correlated statistics.
  • Rich set of coherence metrics.
  Future Directions
  • Use of partial access traces (intermittent instrumentation).
  • Other threading models — Pthreads, etc.
  • Characterizing perennial server applications (Apache).

  18. The End

  19. Simulator Accuracy: Comparing #Invalidations
  • NAS 3.0 OpenMP benchmarks + NBF.
  • Total invalidations: hardware counters (IBM SP: HPM) vs. simulator (ccSIM).
  • Accounts for OpenMP runtime overhead in HPM.
  • 16.5% maximum absolute error; most benchmarks have <= 7% error.

  20. Related Work
  MemSpy: Martonosi et al. (1992)
  • + Execution-driven simulation
  • + Classifies by code & data objects
  • + Invalidations & coherence misses
  • − No true/false sharing
  • − No invalidator lists
  • − Compiler-inserted instrumentation
  • − Uniprocessor-simulated parallel threads
  SM-Prof: Brorsson et al. (1995)
  • + Variable classification tool
  • + Access classes of "shared/private, read/write, few/many"
  • − Can't detect true/false sharing or its magnitude
  • − No coherence misses, no invalidator lists
  Rsim, Proteus, SimOS
  • Architecture-oriented simulators; only bulk statistics
  • Not meant for application developers

  21. Tracing Overhead (earlier work: METRIC)
  • 1–3 orders of magnitude overhead in most cases.
  • Conventional breakpoints (TRAP) have > 4 orders of magnitude overhead.

  22. Trap-based instrumentation (from DynInst documentation)

    application  | # operations | ops/sec   | dyninst time (sec) | gdb time (sec)
    -------------|--------------|-----------|--------------------|---------------
    compress95   | 32,513       | 406,655.7 | 0.08               | 74.35
    li (xlmatch) | 110,209      | 43,607.7  | 2.53               | 221.04
    li (compare) | 4,475        | 640.2     | 6.99               | 16.39
    li (binary)  | 401          | 19.4      | 20.69              | 21.62

  23. Compression Ratio (earlier work: METRIC)
  • Comparison of uncompressed and compressed stream sizes.
  • 1 million accesses logged.
  • 2–4 orders of magnitude compression in most cases.
