This presentation describes dynSIGMA, which combines the IBM/UMD SIGMA memory-analysis tools with Dyninst to trace program execution by sampling, with the goal of understanding and improving cache behavior in real applications. SIGMA complements existing hardware counters with detailed statistics and supports MPI and OpenMP programs written in Fortran and C. The original approach uses static instrumentation to produce a compact trace that post-execution tools (a simulator and a memory profiler) consume; dynamic instrumentation adds sampling to reduce execution time and trace size. The talk also covers trace compression by pattern scanning and a modified linear regression for modeling data accesses, both in support of hints for restructuring memory layouts through padding and blocking. The UMD effort is funded by PERC2.
Stochastic Program Execution Tracing Jeff Odom, UMD
SIGMA Goals • IBM/UMD tools to understand caches • Focus on detailed statistics • Complement existing hardware counters • Ability to handle real applications • MPI and OpenMP programs • Fortran and C • Provide hints about restructuring • Padding (both between and within data structures) • Blocking • UMD effort funded by PERC2
Original SIGMA Approach • Static instrumentation • Capture full information about memory use • Produce compact trace • Extracts loops and memory strides • Post execution tools • Detailed simulator • Full discrete event simulator • Memory profiler • Portion of accesses attributed to each data structure
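The memory profiler's output, the portion of accesses attributed to each data structure, can be illustrated with a small sketch. This is not SIGMA's profiler: the `DataStructure` record, the `accessShares` function, and the "(unknown)" bucket are assumptions made for illustration, and the sketch works on a flat list of addresses rather than a compressed trace.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of one data structure's address range.
struct DataStructure {
    std::string name;
    std::uint64_t start;  // first byte of the object
    std::uint64_t size;   // size in bytes
};

// Returns, for each data structure, the fraction of traced accesses that fall
// inside its address range. Accesses outside every range go to "(unknown)".
std::map<std::string, double> accessShares(const std::vector<DataStructure>& objects,
                                           const std::vector<std::uint64_t>& addresses) {
    std::map<std::string, double> share;
    if (addresses.empty()) return share;
    for (std::uint64_t addr : addresses) {
        std::string owner = "(unknown)";
        for (const DataStructure& obj : objects) {
            if (addr >= obj.start && addr - obj.start < obj.size) {
                owner = obj.name;
                break;
            }
        }
        share[owner] += 1.0;  // count the access
    }
    for (auto& entry : share) {
        entry.second /= static_cast<double>(addresses.size());  // count -> fraction
    }
    return share;
}
```

A structure that receives a large share of accesses is a natural first candidate for the padding and blocking hints mentioned in the goals.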
Representing Program Execution • Capture full execution behavior • Record all basic blocks and memory addresses • Produces large traces (due to looping) • Trace compression • Maintain pattern buffer • Scan for repeating patterns • Extract memory strides • Repeat algorithms for nested loops
[Figure: example compressed trace made of RPT, BLK, and ADR records, with fields labeled Base, Count, Length, and Stride]
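The stride-extraction step can be sketched as follows. This is an illustration only, not SIGMA's compressor: the `StrideRun` record and both function names are hypothetical, and the sketch greedily folds a flat address stream into (base, stride, count) runs without the pattern buffer or nested-loop handling listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record for one run: base, base+stride, ..., base+(count-1)*stride.
struct StrideRun {
    std::uint64_t base;
    std::int64_t stride;
    std::size_t count;
};

// Greedily folds an address stream into constant-stride runs.
std::vector<StrideRun> compressStrides(const std::vector<std::uint64_t>& addrs) {
    std::vector<StrideRun> runs;
    std::size_t i = 0;
    while (i < addrs.size()) {
        StrideRun run{addrs[i], 0, 1};
        if (i + 1 < addrs.size()) {
            // The first two addresses fix the candidate stride.
            run.stride = static_cast<std::int64_t>(addrs[i + 1] - addrs[i]);
            run.count = 2;
            // Extend the run while the stride stays the same.
            while (i + run.count < addrs.size() &&
                   static_cast<std::int64_t>(addrs[i + run.count] -
                                             addrs[i + run.count - 1]) == run.stride) {
                ++run.count;
            }
        }
        runs.push_back(run);
        i += run.count;
    }
    return runs;
}

// Expands the runs back into the original address stream (lossless round trip).
std::vector<std::uint64_t> expandStrides(const std::vector<StrideRun>& runs) {
    std::vector<std::uint64_t> out;
    for (const StrideRun& r : runs) {
        assert(r.count > 0);
        for (std::size_t k = 0; k < r.count; ++k) {
            out.push_back(r.base + static_cast<std::uint64_t>(r.stride) * k);
        }
    }
    return out;
}
```

For a loop that walks an array, seven addresses such as 100, 104, 108, 112, 500, 508, 516 collapse into two runs. The greedy choice is simple but not optimal: a stray address directly before a regular run is paired with the run's first element.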
Trace Compression Isn’t Enough • A few seconds… • Slows execution considerably • Generates gigabytes
Sampling • We want… • Shorter execution times • Smaller traces • We need… • Representative traces • Where to sample? • Timestep boundary • Outermost loop • Requires manual identification (for now)
Dyninst + SIGMA = dynSIGMA • Dyninst adds flexibility • Vary sample rate without recompilation • Adaptive/progressive rate during execution • Target application runs at native speed when instrumentation turned off • Leverage existing SIGMA infrastructure • Only generate trace • Offline simulation/profiling steps unchanged • Dual application framework • Mutatee generates trace • Mutator toggles instrumentation
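The sampling schedule itself can be modeled independently of Dyninst. The sketch below is an assumption-laden illustration: `TimestepSampler` is a hypothetical helper, and in dynSIGMA the on/off decision would be carried out by the mutator inserting and removing instrumentation, which is not shown here. It traces one timestep out of every `period` and allows the rate to change during execution.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: decides at each timestep boundary whether the
// upcoming timestep should be traced.
class TimestepSampler {
public:
    explicit TimestepSampler(std::size_t period) : period_(period) {
        assert(period > 0);
    }

    // Changes the sampling rate while the application runs.
    // The new period takes effect after the next traced timestep.
    void setPeriod(std::size_t period) {
        assert(period > 0);
        period_ = period;
    }

    // Call once per timestep boundary; returns true if tracing should be on.
    bool beginTimestep() {
        if (untilNext_ == 0) {
            untilNext_ = period_ - 1;  // skip the next period_ - 1 timesteps
            return true;
        }
        --untilNext_;
        return false;
    }

private:
    std::size_t period_;
    std::size_t untilNext_ = 0;  // timesteps left before the next traced one
};
```

With a period of 4, timesteps 0, 4, 8, and so on are traced, and the application runs uninstrumented in between.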
Memtime • Simple but effective metric of application memory performance
Characteristic Pattern • Local and global data objects given canonical name • Vector of objects’ memtime is characteristic data pattern • Comparison of characteristic patterns done with simple linear correlation • Can also be applied for function objects
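Comparing two characteristic patterns can be sketched as a correlation between two memtime vectors indexed by the same canonical object names. The slide says only "simple linear correlation"; the sketch assumes that this means the Pearson correlation coefficient. The function name and the choice to return 0 for a constant vector are also assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation of two equally long memtime vectors.
// Returns 0 when either vector is constant (the correlation is undefined;
// reporting 0 is a choice made for this sketch).
double linearCorrelation(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && !x.empty());
    const double n = static_cast<double>(x.size());

    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= n;
    meanY /= n;

    double cov = 0.0, varX = 0.0, varY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        cov += dx * dy;
        varX += dx * dx;
        varY += dy * dy;
    }
    if (varX == 0.0 || varY == 0.0) return 0.0;
    return cov / std::sqrt(varX * varY);
}
```

A value near 1 suggests that two samples spend memory time on the same objects in similar proportions, which is one way to judge whether a sample is representative.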
Example Application: seis • Seismic simulation from SPEChpc2002 • Models multiple seismic processes • Process results pipelined • Variable timesteps • Different data pattern for each process • C & Fortran • Fortran – data processing • C – dynamic memory management, IO
Space & Time Gains From Sampling • Includes 0:12 instrumentation overhead
Challenge of Irregularity • Compression requires regular accesses • Sampling may hide poor compression • Each sample may compress poorly • Offset by low sampling rate • Sampling may not be accurate enough • Control flow sampled as well • Sample boundary requires manual definition
Hybrid Traces • Accuracy may be more important than execution time, but storage capacity may be limited • Modeling data access at particular points can be more accurate than timestep sampling • Many codes are mostly regular, but irregular patterns spoil compression
Modified Linear Regression • Establish linear pattern (min 3 points) at each memory access location • Look for repetitions of pattern with higher-level strides • Once input no longer matches pattern, treat further input as irregular until new pattern discovered
Modified Linear Regression • Irregular sequence modeled using uniform distribution • Pattern matching done local to each instrumentation (memory access) point • Original SIGMA matches patterns globally
Modified Linear Regression • Example: 0, 1, 2, 5, 9, 10, 11, 12, 2, 5 • Becomes: 0 + x + 10y + {5,9,2,5} • Becomes: 0 + x + 10y + {l:2, h:9} (irregular values summarized by the low and high bounds of a uniform distribution)
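The example can be reproduced with a short sketch. It is one possible reading of the slides, not the authors' implementation: all names are hypothetical, and it assumes that a candidate pattern broken before reaching three points is flushed whole to the irregular set (which is what makes 9 irregular here). Runs of at least three points become linear patterns, and the irregular values are summarized by their low and high bounds.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record for one linear pattern: base + stride * x, 0 <= x < length.
struct LinearRun {
    std::int64_t base;
    std::int64_t stride;
    std::size_t length;
};

struct AccessModel {
    std::vector<LinearRun> runs;          // patterns with at least kMinPoints points
    std::vector<std::int64_t> irregular;  // values that fit no pattern
    std::int64_t low = 0, high = 0;       // bounds of the uniform model
};

constexpr std::size_t kMinPoints = 3;  // minimum points to establish a pattern

AccessModel modelAccesses(const std::vector<std::int64_t>& v) {
    AccessModel m;
    std::size_t i = 0;
    while (i < v.size()) {
        std::size_t len = 1;
        std::int64_t stride = 0;
        if (i + 1 < v.size()) {
            stride = v[i + 1] - v[i];  // candidate stride from two points
            len = 2;
            while (i + len < v.size() && v[i + len] - v[i + len - 1] == stride) ++len;
        }
        if (len >= kMinPoints) {
            m.runs.push_back({v[i], stride, len});
        } else {
            // Assumption: a short candidate is treated as irregular in full.
            const auto first = v.begin() + static_cast<std::ptrdiff_t>(i);
            m.irregular.insert(m.irregular.end(), first,
                               first + static_cast<std::ptrdiff_t>(len));
        }
        i += len;
    }
    if (!m.irregular.empty()) {
        const auto mm = std::minmax_element(m.irregular.begin(), m.irregular.end());
        m.low = *mm.first;
        m.high = *mm.second;
    }
    return m;
}

// Sets `outer` to the higher-level stride and returns true when every run has
// the same stride and length and the run bases are evenly spaced.
bool higherLevelStride(const AccessModel& m, std::int64_t& outer) {
    if (m.runs.size() < 2) return false;
    outer = m.runs[1].base - m.runs[0].base;
    for (std::size_t k = 1; k < m.runs.size(); ++k) {
        if (m.runs[k].stride != m.runs[0].stride ||
            m.runs[k].length != m.runs[0].length ||
            m.runs[k].base - m.runs[k - 1].base != outer) {
            return false;
        }
    }
    return true;
}
```

On the sequence from the slide, the sketch finds the runs 0, 1, 2 and 10, 11, 12 (stride 1, higher-level stride 10) and the irregular values 5, 9, 2, 5 with bounds 2 and 9.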
Experiment Setup • NAS Parallel Benchmarks 3.2 Serial Version, Class S • IBM XL C 8.0, XL Fortran 10.1 • DyninstAPI 5.0, including • Liveness analysis • Up to 90% runtime reduction by excluding one SPR (MQ) • Additional 3% improvement with other GPR/FPR • Transactional instrumentation • Instrumentation always on (no sampling)
Transactional Instrumentation
BPatch_thread *thr;
BPatch_process *proc;
proc = thr->getProcess();
proc->beginInsertionSet();         // start batching snippet insertions
…
thr->insertSnippet(…);
thr->insertSnippet(…);
…
proc->finalizeInsertionSet(true);  // true: insert the whole batch atomically
• Reduces • Memory allocation • Insertion time • Atomic operation
Future Work • Larger datasets (NPB Class B, C) • Some results already gathered for Class W • Distributions other than uniform • Irregular control flow • Example: an upper triangular matrix does not require iterating over all M×N values • Uses edge instrumentation • BPatch_basicBlock::getIncomingEdges • BPatch_basicBlock::getOutgoingEdges • BPatch_edge::getPoint
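The upper triangular example can be made concrete with a small sketch of the inner-loop trip counts. It uses no Dyninst calls, the function name is hypothetical, and it assumes that row i visits columns i through N-1.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inner-loop trip count for each row of an upper triangular traversal
// over a rows x cols matrix (row i visits columns i .. cols-1).
std::vector<std::size_t> upperTriangularTrips(std::size_t rows, std::size_t cols) {
    std::vector<std::size_t> trips;
    trips.reserve(rows);
    for (std::size_t i = 0; i < rows; ++i) {
        trips.push_back(i < cols ? cols - i : 0);
    }
    return trips;
}
```

For a 4×4 matrix the trip counts are 4, 3, 2, 1, so 10 elements are visited instead of 16. The loop bound changes on every outer iteration, which is why a trace model built on fixed repeat counts needs control-flow information in addition to addresses.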