The Memory Behavior of Data Structures Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley Department of Computer Sciences The University of Texas at Austin
Memory hierarchy trends • Growing latency to main memory • Growing cache complexity • More cache levels • New mechanisms, optimizations • Growing application complexity • Lots of abstraction • Result: hard to predict how an application will perform on a specific system
Application understanding is hard • Observations can generate gigabytes of data • Aggregation is necessary • Current metrics are too lossy • Different application behaviors → similar miss rate • New metrics needed, richer but still concise • Our approach: data structure decomposition
Why decompose by data structure? • Irregular app = multiple regular data structures • while (tmp) tmp=tmp->next; • Data structures are high-level • Results easy to visualize • Can be correlated back to application source code
Outline • Data structure decomposition using DTrack • Automatic instrumentation + timing simulation • Methodology • Tools, configurations simulated, benchmarks studied • Results • Data structures causing the most misses • Different types of access patterns • Case study: data structure criticality
Conventional simulation methodology • Simulated application shares resources with simulator • disk, file system, network • …but is not aware of it (Diagram: the Application runs inside the Simulator, which runs on the Host Processor and its Resources)
A different perspective • Application can communicate with simulator • Leave core application oblivious; automatically add simulator-aware instrumentation (Diagram: the Application now signals the Simulator directly, alongside the shared Resources)
DTrack • Application Sources → Source Translator → Instrumented Sources → Compiler → Application Executable → Simulator → Detailed Statistics • DTrack's protocol governs application-simulator communication
DTrack's protocol • Application stores mapping at a predetermined shared location • (start address, end address) → variable name • Application signals simulator • We enhance the ISA with a new opcode • Other techniques possible • Simulator detects signal, reads shared location • Result: simulator now knows the variable names of address regions
DTrack instrumentation • Global variables: just after initialization • Before: int Time ; int main () { … } • After: int Time ; int main () { print (FILE, "Time", &Time, sizeof(Time)); … asm ("mop"); }
DTrack instrumentation Heap variables: just after allocation Before: x = malloc(4); After: x = malloc(4); DTRACK_PTR = x ; DTRACK_NAME = “x” ; DTRACK_SIZE = 4 ; asm(“mop”);
Design decisions • Source-based rather than binary-based translation • Local variables – no instrumentation • Instrumenting every call/return is too much overhead • Doesn’t cause many cache misses anyway • Dynamic allocation on the stack: handle alloca just like malloc • Signalling opcode: overload an existing one • avoid modifying compiler, allow running natively
Minimizing perturbance • Instrumentation can perturb app behavior • Global variables are easy • One-time cost • Heap variables are hard • DTRACK_PTR, etc. always hit in the cache • Measuring perturbance • Communicate specific start and end points in application to simulator • Compare instruction counts between them with and without instrumentation • Result: instruction count overhead <4% even with frequent malloc
Outline • Data structure decomposition using DTrack • Automatic instrumentation + timing simulation • Methodology • Tools, configurations simulated, benchmarks studied • Results • Data structures causing the most misses • Different types of access patterns • Case study: data structure criticality
Methodology • Source translator: C-Breeze • Compiler: Alpha GEM cc • Simulator: sim-alpha • Validated model of 21264 pipeline • Simulated machine: Alpha 21264 • 4-way issue, 64KB 3-cycle DL1 • Benchmarks: 12 C applications from SPEC CPU2000 suite
Major data structures by DL1 misses (Chart: per-benchmark breakdown of % DL1 misses by data structure)
Large variety in access patterns • art: f1[i] and bu[i], advancing with i=i+1 (array streaming) • mcf: node[i] with i=i+1, and node = DFS(node) over node->child, node->parent, node->sibling, node->siblingp (pointer-based tree walk) • twolf: t1 = b[c[i]->cblock]; t2 = t1->tileterm; t3 = n[t2->net]; … with i=rand() (dependent loads off a random index) • Code + Data profile = Access pattern
Most misses ≣ Most pipeline stalls? • Process: • Detect stall cycles when no instructions were committed • Assign blame to data structure of oldest instruction in pipeline • Result • Stall cycle ranks track miss count ranks • Exceptions: • tds in 179.art • search in 186.crafty
Summary • Toolchain for mapping addresses to high-level data structures • Communicates information to simulator • Reveals new patterns about applications • Applications show a wide variety of miss distributions • Within an application, data structures have a variety of access patterns • Misses are not correlated with accesses or footprint • …but they correlate well with data structure criticality