This research presents a framework for identifying memory bottlenecks through data cache sampling: an interrupt fires after every n cache misses, and the miss address is attributed to the variable or dynamically allocated block that contains it. The approach is evaluated in a simulator built with the ATOM binary rewriting tool, using SPEC 95 applications. The talk also describes SIGMA, a family of tools for MPI and OpenMP programs in Fortran and C that collects compact memory traces, predicts cache misses, and provides hints about restructuring code for better memory access patterns.
Instrumentation and Performance Analysis for Finding Memory Bottlenecks • Jeff Hollingsworth, Luiz DeRose, K. Ekanadham • University of Maryland • Thomas J. Watson Research Center
Using Data Cache Sampling • Hardware Requirements: • Periodic interrupt on cache miss • Ability to determine miss address • Associate count with each object • Variable or dynamically allocated memory • Interrupt after every n cache misses • Obtain address of miss • Find object containing it and increment count • Advantage: simplicity
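The sampling steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `MissSampler`, the `(name, start, size)` object table, and the `on_miss` callback standing in for the hardware interrupt are all assumptions made for the example.

```python
import bisect

class MissSampler:
    """Count every n-th cache miss against the object containing its address.

    Hypothetical sketch: on_miss() stands in for the hardware miss event,
    and `objects` stands in for the table of variables and heap blocks.
    """

    def __init__(self, objects, interval):
        # objects: iterable of (name, start_address, size_in_bytes), non-overlapping
        self.objects = sorted(objects, key=lambda o: o[1])
        self.starts = [o[1] for o in self.objects]
        self.interval = interval
        self.pending = 0      # misses seen since the last sample
        self.counts = {}      # object name -> number of sampled misses

    def find_object(self, addr):
        """Return the name of the object containing addr, or None."""
        i = bisect.bisect_right(self.starts, addr) - 1
        if i >= 0:
            name, start, size = self.objects[i]
            if addr < start + size:
                return name
        return None

    def on_miss(self, addr):
        """Called on every miss; only every `interval`-th miss is sampled."""
        self.pending += 1
        if self.pending < self.interval:
            return
        self.pending = 0
        name = self.find_object(addr)
        if name is not None:
            self.counts[name] = self.counts.get(name, 0) + 1

    def shares(self):
        """Percentage of sampled misses attributed to each object."""
        total = sum(self.counts.values())
        return {k: 100.0 * v / total for k, v in self.counts.items()} if total else {}
```

The sorted start addresses allow a binary search per sample, which keeps the per-interrupt cost small even with many objects.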
Experimental Evaluation • Implemented in simulation • Simulator uses ATOM binary rewriting tool • Instrument loads/stores for cache simulation • Instrument basic blocks for virtual cycle count • Simulates necessary hardware support • Sampling and n-way search run under simulation • Tested using SPEC 95 applications • tomcatv, swim, su2cor, mgrid, applu, compress, ijpeg • Sampled 1 in 50,000 misses
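To make "instrument loads/stores for cache simulation" concrete, here is a minimal sketch of the kind of cache model an instrumented load or store could call into. The slides do not give the simulated cache configuration, so the size, line length, associativity, and LRU replacement below are assumptions, as is the class name `SetAssociativeCache`.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Set-associative cache with LRU replacement (illustrative parameters)."""

    def __init__(self, size_bytes=32 * 1024, line_bytes=32, ways=2):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One OrderedDict per set; insertion order tracks recency of use.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Simulate one load or store; return True on a hit."""
        line = addr // self.line_bytes
        lru = self.sets[line % self.num_sets]
        if line in lru:
            lru.move_to_end(line)       # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lru) >= self.ways:
            lru.popitem(last=False)     # evict the least recently used line
        lru[line] = True
        return False
```

In an instrumented binary, each load or store would pass its effective address to `access`, and a miss would be the event that a sampling mechanism counts.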
Quality of Results

Application  Variable   Actual          Sample
                        Rank     %      Rank     %
tomcatv      RY           1    22.5       2    17.6
             RX           2    22.5       1    37.1
             AA           3    15.0       5    10.1
             DD           4    10.0       3    15.0
             X            5    10.0       6     9.8
             Y            6    10.0       7     0.2
             D            7    10.0       4    10.2
applu        A            1    22.9       2    23.0
             B            2    22.9       3    19.9
             C            3    22.6       1    25.8
             D            4    17.4       4    16.7
             rsd          5     6.9       5     7.7
Varying Sampling Interval

Application  Variable   Actual          Sample
                        Rank     %      Rank     %
tomcatv      RY           1    22.5       1    22.6
             RX           2    22.5       2    22.5
             AA           3    15.0       3    15.6
             DD           4    10.0       7     9.4
             X            5    10.0       6     9.7
             Y            6    10.0       4    10.5
             D            7    10.0       5     9.8

Lesson re-learned: randomly vary sampling interval
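One plausible reason a fixed interval distorts the results is aliasing: if misses arrive in a regular pattern whose period divides the sampling interval, every sample lands on the same variable. The sketch below illustrates this with a synthetic, perfectly periodic miss stream. The stream, the interval of 8, and the uniform jitter of plus or minus 4 are assumptions chosen for the demonstration, not values from the experiments.

```python
import random
from collections import Counter

def sample_misses(miss_stream, interval, jitter=0, seed=0):
    """Sample roughly one miss per `interval` misses and count by variable.

    With jitter > 0, each gap after the first is drawn uniformly from
    [interval - jitter, interval + jitter]; jitter must be < interval.
    """
    rng = random.Random(seed)
    counts = Counter()
    countdown = interval
    for item in miss_stream:
        countdown -= 1
        if countdown == 0:
            counts[item] += 1
            countdown = interval + (rng.randint(-jitter, jitter) if jitter else 0)
    return counts

# Synthetic miss stream: four variables missing in strict rotation.
pattern = ["RX", "RY", "AA", "DD"]
stream = [pattern[i % 4] for i in range(100_000)]

fixed = sample_misses(stream, interval=8)             # period 4 divides 8
varied = sample_misses(stream, interval=8, jitter=4)  # randomized gaps
```

With the fixed interval, `fixed` contains a single variable even though all four miss equally often; with randomized gaps, `varied` spreads the samples across all four.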
Sigma Goals • A research project • Less of a production tool than others from ACTC • Family of tools to understand caches • Focus on detailed statistics • Complement existing hardware counters • Ability to handle real applications • MPI and OpenMP programs • Fortran and C • Provide hints about restructuring • Padding (both inter and intra data structures) • Blocking
Approach • Run instrumented program • Capture full information about memory use • Produce compact trace • Extract loops and memory strides • Post-execution tools • Memory profiler • Share of accesses due to each data structure • Cache Prediction Tool • Predict cache misses using symbolic equations • Detailed simulator • Full discrete event simulator
Cache Prediction Tool • Predict cache misses • Operate on compact traces • Only expand to full trace if needed • Use algorithms developed for compilers • Re-use vectors • Cache miss equations • Capacity, cold, and conflict misses are identified
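The three miss categories can be illustrated with a small simulation. Note that this is a swapped-in technique: the tool described here identifies the categories symbolically from cache miss equations, whereas the sketch below uses the common simulation-based definition (cold: first touch of a line; capacity: also misses in a fully associative LRU cache of the same size; conflict: everything else). The direct-mapped organization and the default sizes are assumptions for the example.

```python
from collections import OrderedDict

def classify_misses(addresses, num_lines=4, line_bytes=32):
    """Classify accesses to a direct-mapped cache as hit, cold, capacity, or conflict."""
    direct = [None] * num_lines   # direct-mapped cache: one line per slot
    full = OrderedDict()          # fully associative LRU cache of equal capacity
    seen = set()                  # every line touched so far
    counts = {"hit": 0, "cold": 0, "capacity": 0, "conflict": 0}
    for addr in addresses:
        line = addr // line_bytes

        # Update the fully associative reference cache.
        full_hit = line in full
        if full_hit:
            full.move_to_end(line)
        else:
            if len(full) >= num_lines:
                full.popitem(last=False)
            full[line] = True

        # Update the direct-mapped cache being studied.
        slot = line % num_lines
        direct_hit = direct[slot] == line
        direct[slot] = line

        if direct_hit:
            counts["hit"] += 1
        elif line not in seen:
            counts["cold"] += 1
        elif full_hit:
            counts["conflict"] += 1   # only the mapping caused this miss
        else:
            counts["capacity"] += 1   # working set exceeds the cache size
        seen.add(line)
    return counts
```

For example, alternating between two addresses that map to the same slot produces conflict misses after the two initial cold misses, which is the pattern that padding is meant to remove.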
Iteration Space • Re-use vectors • define points in the iteration space that access the same data • Miss equations • describe points in the iteration space that cause misses on conflicts
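A brute-force sketch can make re-use vectors concrete for a small loop nest. Compilers derive these vectors analytically from the array subscripts; the enumeration below is only an illustration, and the function name `reuse_vectors` and its arguments are hypothetical. It walks the iteration space in loop order and records the distance from each iteration to the most recent earlier iteration that touched the same element.

```python
from itertools import product

def reuse_vectors(index_fn, bounds):
    """Enumerate re-use vectors of one array reference by brute force.

    index_fn maps loop indices to the element accessed; bounds gives the
    trip count of each loop, outermost first.
    """
    last_touch = {}   # element -> most recent iteration point that accessed it
    vectors = set()
    # product() varies the last index fastest, matching loop nest order.
    for point in product(*(range(n) for n in bounds)):
        element = index_fn(*point)
        if element in last_touch:
            prev = last_touch[element]
            vectors.add(tuple(c - p for c, p in zip(point, prev)))
        last_touch[element] = point
    return vectors

# for i in range(3): for j in range(4): ... B[j] ...
b_reuse = reuse_vectors(lambda i, j: j, (3, 4))
# for i in range(3): for j in range(3): ... A[i + j] ...
a_reuse = reuse_vectors(lambda i, j: i + j, (3, 3))
```

Here `B[j]` is re-used one outer iteration later at the same `j`, giving the vector (1, 0), while `A[i + j]` is re-used along the diagonal, giving (1, -1).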
Structure of SIGMA • [Figure: data collection pipeline. Labels: source files, Sigma Compile/Link, .lst files, dumpMap, .addr, instrumented binary, program execution, trace files, Cache Simulator, Prediction Tool, MemoryRef Tool]
Representing Program Execution • Capture full execution behavior • Record all basic blocks and memory addresses • Produces large traces (due to looping) • Trace compression • Maintain pattern buffer • Scan for repeating patterns • Extract memory strides • Repeat algorithms for nested loops • [Figure: compressed trace layout. Record types: RPT, BLK, ADR. Fields: Base, Count, Length, Stride]
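The "extract memory strides" step can be sketched as a simple run-length encoding over address deltas. This is a simplified illustration that covers only constant-stride runs of a single reference: it does not implement the pattern buffer or nested-loop repetition, and the `(base, count, stride)` record is a reduced, assumed form of the fields named on the slide.

```python
def compress_strides(addresses):
    """Collapse runs of constant-stride addresses into (base, count, stride) records."""
    records = []
    i, n = 0, len(addresses)
    while i < n:
        base = addresses[i]
        if i + 1 == n:
            records.append((base, 1, 0))   # trailing single address
            break
        stride = addresses[i + 1] - base
        count = 2
        # Extend the run while the delta between neighbors stays the same.
        while i + count < n and addresses[i + count] - addresses[i + count - 1] == stride:
            count += 1
        records.append((base, count, stride))
        i += count
    return records

def expand(records):
    """Rebuild the full address trace from compressed records."""
    return [base + k * stride for base, count, stride in records for k in range(count)]

trace = [100, 104, 108, 112, 500, 508, 516, 7]
records = compress_strides(trace)
```

A regular loop over an array collapses to one record regardless of trip count, which is consistent with the compression ratio depending on how regular the access pattern is.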
Trace Information • Compression ratio a function of regularity • Slowdown depends on fraction of instructions that load/store memory
Cache Prediction Tool • Use compressed traces • Convert memory refs back to array refs • Solve Cache Miss Equations • compute re-use vectors • define misses as a system of linear equations • use Omega library to solve • Provides • count of misses • information about iterations that cause misses
Using Dyninst to Gather Data • Extend Dyninst to support memory ops • Load/store/prefetch instrumentation points • Done and working on Power and SPARC • Extend Dyninst AST to include effective address • Allows instrumentation code to use the memory address • Dyninst for SIGMA instrumentation provides • Multi-platform support • Dynamic control of instrumentation • Selection of specific functions, loops, memory ops • Possible use of CFGs to optimize instrumentation