The Memory Behavior of Data Structures Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley Department of Computer Sciences The University of Texas at Austin
Memory hierarchy trends • Growing latency to main memory • Growing cache complexity • More cache levels • New mechanisms, optimizations • Growing application complexity • Lots of abstraction • Result: hard to predict how an application will perform on a specific system
Application understanding is hard • Observations can generate gigabytes of data • Aggregation is necessary • Current metrics are too lossy • Different application behaviors → similar miss rate • New metrics needed, richer but still concise • Our approach: data structure decomposition
Why decompose by data structure? • Irregular app = multiple regular data structures • while (tmp) tmp=tmp->next; • Data structures are high-level • Results easy to visualize • Can be correlated back to application source code
Outline • Data structure decomposition using DTrack • Automatic instrumentation + timing simulation • Methodology • Tools, configurations simulated, benchmarks studied • Results • Data structures causing the most misses • Different types of access patterns • Case study: data structure criticality
Conventional simulation methodology • Simulated application shares resources with simulator • disk, file system, network • …but is not aware of it (Diagram: the Application runs inside the Simulator, which runs on the Host Processor and its Resources)
A different perspective • Application can communicate with simulator • Leave core application oblivious; automatically add simulator-aware instrumentation (Diagram: the Application now signals the Simulator directly, alongside the shared Resources)
DTrack • Application Sources → Source Translator → Instrumented Sources → Compiler → Application Executable → Simulator → Detailed Statistics • DTrack's protocol governs application-simulator communication
DTrack's protocol • Application stores mapping at a predetermined shared location • (start address, end address) → variable name • Application signals simulator • We enhance the ISA with a new opcode • Other techniques possible • Simulator detects signal, reads shared location • Result: simulator now knows the variable names of address regions
DTrack instrumentation • Global variables: just after initialization • Before: int Time ; int main () { … } • After: int Time ; int main () { print (FILE, "Time", &Time, sizeof(Time)); … asm ("mop"); }
DTrack instrumentation Heap variables: just after allocation Before: x = malloc(4); After: x = malloc(4); DTRACK_PTR = x ; DTRACK_NAME = “x” ; DTRACK_SIZE = 4 ; asm(“mop”);
Design decisions • Source-based rather than binary-based translation • Local variables – no instrumentation • Instrumenting every call/return is too much overhead • Doesn’t cause many cache misses anyway • Dynamic allocation on the stack: handle alloca just like malloc • Signalling opcode: overload an existing one • avoid modifying compiler, allow running natively
Minimizing perturbance • Instrumentation can perturb app behavior • Global variables are easy • One-time cost • Heap variables are hard • DTRACK_PTR, etc. always hit in the cache • Measuring perturbance • Communicate specific start and end points in application to simulator • Compare instruction counts between them with and without instrumentation • Result: instruction count overhead <4% even with frequent malloc
Outline • Data structure decomposition using DTrack • Automatic instrumentation + timing simulation • Methodology • Tools, configurations simulated, benchmarks studied • Results • Data structures causing the most misses • Different types of access patterns • Case study: data structure criticality
Methodology • Source translator: C-Breeze • Compiler: Alpha GEM cc • Simulator: sim-alpha • Validated model of 21264 pipeline • Simulated machine: Alpha 21264 • 4-way issue, 64KB 3-cycle DL1 • Benchmarks: 12 C applications from SPEC CPU2000 suite
Major data structures by DL1 misses (Chart: per-benchmark breakdown of % DL1 misses by data structure)
Large variety in access patterns • art: f1[i] and bu[i], advancing with i=i+1 (array streaming) • mcf: node[i] with i=i+1, and node = DFS(node) over node->child, node->parent, node->sibling, node->siblingp (pointer-based tree walk) • twolf: t1 = b[c[i]->cblock]; t2 = t1->tileterm; t3 = n[t2->net]; … with i=rand() (dependent loads off a random index) • Code + Data profile = Access pattern
Most misses ≣ Most pipeline stalls? • Process: • Detect stall cycles when no instructions were committed • Assign blame to data structure of oldest instruction in pipeline • Result • Stall cycle ranks track miss count ranks • Exceptions: • tds in 179.art • search in 186.crafty
Summary • Toolchain for mapping addresses to high-level data structures • Communicates information to simulator • Reveals new patterns about applications • Applications show a wide variety of miss distributions • Within an application, data structures have a variety of access patterns • Misses are not correlated with accesses or footprint • …but they correlate well with data structure criticality