
Design and Evaluation of Architectures for Commercial Applications

Presentation Transcript


  1. Design and Evaluation of Architectures for Commercial Applications Part II: tools & methods Luiz André Barroso

  2. Overview • Evaluation methods/tools • Introduction • Software instrumentation (ATOM) • Hardware measurement & profiling • IPROBE • DCPI • ProfileMe • Tracing & trace-driven simulation • User-level simulators • Complete machine simulators (SimOS) UPC, February 1999

  3. Studying commercial applications: challenges • Size of the data sets and programs • Complex control flow • Complex interactions with Operating System • Difficult tuning process • Lack of access to source code (?) • Vendor restrictions on publications • important to have a rich set of tools UPC, February 1999

  4. Tools are useful in many phases • Understanding behavior of workloads • Tuning • Performance measurements in existing systems • Performance estimation for future systems UPC, February 1999

  5. Using ordinary system tools • Measuring CPU utilization and balance • Determining user/system breakdown • Detecting I/O bottlenecks • Disks • Networks • Monitoring memory utilization and swap activity UPC, February 1999

  6. Gathering symbol table information • Most database programs are large statically linked stripped binaries • Most tools will require symbol table information • However, distributions typically consist of object files with symbolic data • Simple trick: • replace the system linker with a wrapper that removes the “strip” flag, then calls the real linker UPC, February 1999

  7. ATOM: A Tool-Building System Developed at WRL by Alan Eustace & Amitabh Srivastava Easy to build new tools Flexible enough to build interesting tools Fast enough to run on real applications Compiler independent: works on existing binaries UPC, February 1999

  8. Code Instrumentation • Application appears unchanged • ATOM adds code and data to the application • Information collected as a side effect of execution [Diagram: the TOOL is carried inside the application like a Trojan horse] UPC, February 1999

  9. ATOM Programming Interface Given an application program: • Navigation: Move around • Interrogation: Ask questions • Definition: Define interface to analysis procedures • Instrumentation: Add calls to analysis procedures Pass ANYTHING as arguments! PC, effective addresses, constants, register values, arrays, function arguments, line numbers, procedure names, file names, etc. UPC, February 1999

  10. Navigation Primitives • Get{First,Last,Next,Prev}Obj • Get{First,Last,Next,Prev}ObjProc • Get{First,Last,Next,Prev}Block • Get{First,Last,Next,Prev}Inst • GetInstBlock - Find enclosing block • GetBlockProc - Find enclosing procedure • GetProcObj - Find enclosing object • GetInstBranchTarget - Find branch target • ResolveTargetProc - Find subroutine destination UPC, February 1999

  11. Interrogation • GetProgramInfo(PInfo) • number of procedures, blocks, and instructions. • text and data addresses • GetProcInfo(Proc *, BlockInfo) • Number of blocks or instructions • Procedure frame size, integer and floating point save masks • GetBlockInfo(Inst *, InstInfo) • Number of instructions • Any piece of the instruction (opcode, ra, rb, displacement) UPC, February 1999

  12. Interrogation(2) • ProcFileName • Returns the file name for this procedure • InstLineNo • Returns the line number of this instruction • GetInstRegEnum • Returns a unique register specifier • GetInstRegUsage • Computes Source and Destination masks UPC, February 1999

  13. Interrogation(3) • GetInstRegUsage • Computes instruction source and destination masks
GetInstRegUsage(instFirst, &usageFirst);
GetInstRegUsage(instSecond, &usageSecond);
if (usageFirst.dreg_bitvec[0] & usageSecond.ureg_bitvec[0]) {
    /* set followed by a use */
}
Exactly what you need to find static pipeline stalls! UPC, February 1999
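To make the test concrete outside of ATOM, here is a minimal standalone sketch of the same dependence check, using plain bit vectors of my own rather than ATOM's actual register-usage types (whose exact names are not shown in the deck):

#include <stdio.h>

/* Hypothetical stand-ins for the register-usage masks that ATOM's
 * GetInstRegUsage would fill in: bit r of dreg is set if the instruction
 * writes register r, bit r of ureg is set if it reads register r. */
typedef struct {
    unsigned long dreg;   /* destination (written) registers */
    unsigned long ureg;   /* source (read) registers */
} RegUsage;

/* Nonzero if 'second' reads a register that 'first' writes: a set
 * followed by a use, i.e. a static pipeline stall candidate. */
static int set_then_use(const RegUsage *first, const RegUsage *second)
{
    return (first->dreg & second->ureg) != 0;
}

int main(void)
{
    /* ldq r1, 0(r2)  followed by  addq r1, r3, r4 */
    RegUsage ld  = { 1ul << 1, 1ul << 2 };
    RegUsage add = { 1ul << 4, (1ul << 1) | (1ul << 3) };

    if (set_then_use(&ld, &add))
        printf("set followed by a use: potential static stall\n");
    return 0;
}

In a real ATOM instrumentation file the same check would be applied to each adjacent pair of instructions in a basic block, with the masks obtained from GetInstRegUsage as in the snippet above.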

  14. Definition AddCallProto(“function(argument list)”) • Constants • Character strings • Program counter • Register contents • Cycle counter • Constant arrays • Effective Addresses • Branch Condition Values UPC, February 1999

  15. Instrumentation • AddCallProgram(Program{Before,After}, “name”,args) • AddCallProc(p, Proc{Before,After}, “name”,args) • AddCallBlock(b, Block{Before,After}, “name”,args) • AddCallInst(i, Inst{Before,After}, “name”,args) • ReplaceProc(p, “new”) UPC, February 1999

  16. Example #1: Procedure Tracing What procedures are executed by the following mystery program?
#include <stdio.h>
main()
{
    printf("Hello world!\n");
}
Hint: main => printf => ??? UPC, February 1999

  17. Procedure Tracing Example
> cc hello.c -non_shared -g1 -o hello
> atom hello ptrace.inst.c ptrace.anal.c -o hello.ptrace
> hello.ptrace
=> __start
=> main
=> printf
=> _doprnt
=> __getmbcurmax
<= __getmbcurmax
=> memcpy
<= memcpy
=> fwrite
UPC, February 1999

  18. Procedure Trace (2) UPC, February 1999

  19. Example #2: Cache Simulator Write a tool that computes the miss rate of the application running in a 64KB, direct-mapped data cache with 32-byte lines.
> atom spice cache.inst.o cache.anal.o -o spice.cache
> spice.cache < ref.in > ref.out
> more cache.out
5,387,822,402 620,855,884 11.523%
Great use for 64-bit integers! UPC, February 1999

  20. Cache Tool Implementation [Diagram: the application before and after instrumentation. ATOM inserts a call to Reference(effective address), e.g. Reference(-32592(gp)), before each load and store, and a call to PrintResults() at program exit. Note: addresses are passed as if the code were uninstrumented!] UPC, February 1999

  21. Cache Instrumentation File
#include <stdio.h>
#include <cmplrs/atom.inst.h>

unsigned InstrumentAll(int argc, char **argv)
{
    Obj *o; Proc *p; Block *b; Inst *i;

    AddCallProto("Reference(VALUE)");
    AddCallProto("Print()");
    for (o = GetFirstObj(); o != NULL; o = GetNextObj(o)) {
        if (BuildObj(o)) return (1);
        if (o == GetFirstObj()) AddCallObj(o, ObjAfter, "Print");
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
                for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
                    if (IsInstType(i, InstTypeLoad) || IsInstType(i, InstTypeStore))
                        AddCallInst(i, InstBefore, "Reference", EffAddrValue);
        WriteObj(o);
    }
    return (0);
}
UPC, February 1999

  22. Cache Analysis File
#include <stdio.h>

#define CACHE_SIZE   65536
#define BLOCK_SHIFT  5

long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;

void Reference(long address)
{
    int index = (address & (CACHE_SIZE - 1)) >> BLOCK_SHIFT;
    long tag = address >> BLOCK_SHIFT;
    if (cache[index] != tag) {
        misses++;
        cache[index] = tag;
    }
    refs++;
}

void Print()
{
    FILE *file = fopen("cache.out", "w");
    fprintf(file, "%ld %ld %.3f%%\n", refs, misses, 100.0 * misses / refs);
    fclose(file);
}
UPC, February 1999

  23. Example #3: TPC-B runtime information • Statistics per transaction: • Instructions 180,398 • Loads (% shared) 47,643 (24%) • Stores (% shared) 21,380 (22%) • Lock/Unlock 118 • Memory barriers (MBs) 241 • Footprints/CPU: • Instr. 300 KB (1.6 MB in pages) • Private data 470 KB (4 MB in pages) • Shared data 7 MB (26 MB in pages) • 50% of the shared data footprint is touched by at least one other process UPC, February 1999

  24. TPC-B (2) UPC, February 1999

  25. TPC-B (3) UPC, February 1999

  26. Oracle SGA activity in TPC-B UPC, February 1999

  27. ATOM wrap-up • Very flexible “hack-it-yourself” tool • Discover detailed information on dynamic behavior of programs • Especially good when you don’t have source code • Shipped with Digital Unix • Can be used for tracing (later) UPC, February 1999

  28. Hardware measurement tools • IPROBE • interface to CPU event counters • DCPI • hardware assisted profiling • ProfileMe • hardware assisted profiling for complex CPU cores UPC, February 1999

  29. IPROBE • Developed by Digital’s Performance Group • Uses the event counters provided by Alpha CPUs • Operation: • set a counter to monitor a particular event (e.g., icache_miss) • start the counter • on every counter overflow, an interrupt wakes up a handler and events are accumulated • stop the counter and read the total • User can select: • which processes to count • user level, kernel level, or both UPC, February 1999

  30. IPROBE: 21164 event types
• Issue: issues, single_issue_cycles, dual_issue_cycles, triple_issue_cycles, quad_issue_cycles, split_issue_cycles, pipe_dry, pipe_frozen, replay_trap, ldu_replays, wb_maf_full_replays
• Cycles and stalls: cycles, long_stalls, mem_barrier_cycles
• Branches: branches, cond_branches, jsr_ret, branch_mispr, pc_mispr
• Operations: integer_ops, float_ops, loads, stores, loads_merged, load_locked
• TLBs: itb_miss, dtb_miss
• On-chip caches: icache_access, icache_miss, dcache_access, dcache_miss, scache_access, scache_read, scache_read_miss, scache_write, scache_write_miss, scache_sh_write, scache_miss, scache_victim
• Board-level cache: bcache_hit, bcache_miss, bcache_victim
• System/external: sys_req, sys_read_req, sys_inv, external
UPC, February 1999

  31. IPROBE: what you can do • Directly measure relevant events (e.g. cache performance) • Overall CPU cycle breakdown diagnosis: • microbenchmark the machine to estimate latencies • combine latencies with event counts (see the sketch below) • Main source of inaccuracy: • load/store overlap in the memory system UPC, February 1999
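A back-of-the-envelope version of that diagnosis, with every event name, count, and latency below a made-up placeholder rather than a measurement from this deck, multiplies each event count by its microbenchmarked latency and expresses the product as a share of total cycles:

#include <stdio.h>

/* Hypothetical per-event data: counts would come from IPROBE runs on the
 * workload, latencies (in CPU cycles) from microbenchmarks of the machine. */
typedef struct {
    const char *event;
    double count;     /* events observed during the run */
    double latency;   /* estimated cycles lost per event */
} StallSource;

int main(void)
{
    /* Placeholder numbers, purely for illustration. */
    StallSource src[] = {
        { "scache_miss (clean)", 4.0e8,  50.0 },
        { "bcache_miss (clean)", 1.2e8,  80.0 },
        { "bcache_miss (dirty)", 3.0e7, 130.0 },
        { "dtb_miss",            5.0e7,  40.0 },
    };
    double total_cycles = 1.0e11;   /* from the cycles counter */
    double instructions = 2.0e10;   /* from an instruction-count event */
    int i;

    printf("overall CPI = %.1f\n", total_cycles / instructions);
    for (i = 0; i < (int)(sizeof src / sizeof src[0]); i++) {
        double stall = src[i].count * src[i].latency;
        printf("%-22s ~%.1f%% of cycles\n",
               src[i].event, 100.0 * stall / total_cycles);
    }
    return 0;
}

The main caveat is exactly the one on the slide: independent misses can overlap in the memory system, so summing count × latency tends to overestimate stall time.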

  32. IPROBE example: 4-CPU SMP [Charts: breakdown of CPU cycles and estimated breakdown of stall cycles; measured CPI = 7.4] UPC, February 1999

  33. Why did it run so badly?!? • Nominal memory latencies were good: 80 cycles • Micro-benchmarks determined that: • latency under load is over 120 cycles on 4 processors • base dirty miss latency was over 130 cycles • off-chip cache latency was high • IPROBE data uncovered significant sharing: • for P=2, 15% of bcache misses are to dirty blocks • for P=4, 20% of bcache misses are to dirty blocks UPC, February 1999

  34. Dirty miss latency on RISC SMPs • SPEC benchmarks have no significant sharing • Current processors/systems optimize local cache access • All RISC SMPs have high dirty miss penalties UPC, February 1999

  35. DCPI: continuous profiling infrastructure • Developed by SRC and WRL researchers • Based on periodic sampling • Hardware generates periodic interrupts • OS handles the interrupts and stores data • Program Counter (PC) and any extra info • Analysis Tools convert data • for users • for compilers Other examples: SGI Speedshop, Unix’s prof(), VTune UPC, February 1999
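Conceptually, the interrupt-time work in such a profiler is tiny: hash the restart PC (plus a process or image identifier) into a buffer of counters and let user-level tools attribute the counts later. Below is a minimal user-level sketch of that binning step, with a made-up sample record and hash layout rather than DCPI's actual data structures:

#include <stdio.h>

/* Hypothetical sample record: what a periodic counter-overflow interrupt
 * might hand to the profiler (DCPI's real records and tables differ). */
typedef struct {
    unsigned long pc;     /* restart PC of the interrupted instruction */
    unsigned int  pid;    /* process that was running */
} Sample;

#define BUCKETS 4096

typedef struct {
    unsigned long pc;
    unsigned int  pid;
    unsigned long count;
} Bucket;

static Bucket table[BUCKETS];

/* Accumulate one sample into an open-addressed hash keyed by (pid, pc). */
static void record(const Sample *s)
{
    unsigned long h = (s->pc ^ ((unsigned long)s->pid << 16)) % BUCKETS;
    unsigned int probe;
    for (probe = 0; probe < BUCKETS; probe++) {
        Bucket *b = &table[(h + probe) % BUCKETS];
        if (b->count == 0 || (b->pc == s->pc && b->pid == s->pid)) {
            b->pc = s->pc;
            b->pid = s->pid;
            b->count++;
            return;
        }
    }
    /* Table full: a real profiler would flush buckets to a daemon here. */
}

int main(void)
{
    Sample demo[] = { { 0x957c, 42 }, { 0x957c, 42 }, { 0x9530, 42 } };
    unsigned int i;
    for (i = 0; i < sizeof demo / sizeof demo[0]; i++)
        record(&demo[i]);
    for (i = 0; i < BUCKETS; i++)
        if (table[i].count)
            printf("pid %u  pc 0x%lx: %lu samples\n",
                   table[i].pid, table[i].pc, table[i].count);
    return 0;
}

Keeping the per-interrupt work this small is what makes the 1%-3% overhead quoted on the next slide plausible.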

  36. Sampling vs. Instrumentation • Much lower overhead than instrumentation • DCPI: program 1%-3% slower • Pixie: program 2-3 times slower • Applicable to large workloads • 100,000 TPS on Alpha • AltaVista • Easier to apply to whole systems (kernel, device drivers, shared libraries, ...) • Instrumenting kernels is very tricky • No source code needed UPC, February 1999

  37. Information from Profiles DCPI estimates • Where CPU cycles went, broken down by • image, procedure, instruction • How often code was executed • basic blocks and CFG edges • Where peak performance was lost and why UPC, February 1999

  38. Example: Getting the Big Picture
Total samples for event type cycles = 6095201

  cycles      %    cum%  load file
 2257103  37.03%  37.03%  /usr/shlib/X11/lib_dec_ffb_ev5.so
 1658462  27.21%  64.24%  /vmunix
  928318  15.23%  79.47%  /usr/shlib/X11/libmi.so
  650299  10.67%  90.14%  /usr/shlib/X11/libos.so

  cycles      %    cum%  procedure               load file
 2064143  33.87%  33.87%  ffb8ZeroPolyArc         /usr/shlib/X11/lib_dec_ffb_ev5.so
  517464   8.49%  42.35%  ReadRequestFromClient   /usr/shlib/X11/libos.so
  305072   5.01%  47.36%  miCreateETandAET        /usr/shlib/X11/libmi.so
  271158   4.45%  51.81%  miZeroArcSetup          /usr/shlib/X11/libmi.so
  245450   4.03%  55.84%  bcopy                   /vmunix
  209835   3.44%  59.28%  Dispatch                /usr/shlib/X11/libdix.so
  186413   3.06%  62.34%  ffb8FillPolygon         /usr/shlib/X11/lib_dec_ffb_ev5.so
  170723   2.80%  65.14%  in_checksum             /vmunix
  161326   2.65%  67.78%  miInsertEdgeInET        /usr/shlib/X11/libmi.so
  133768   2.19%  69.98%  miX1Y1X2Y2InRegion      /usr/shlib/X11/libmi.so
UPC, February 1999

  39. Example: Using the Microscope Where peak performance is lost and why UPC, February 1999

  40. Example: Summarizing Stalls
I-cache (not ITB)      0.0% to  0.3%
ITB/I-cache miss       0.0% to  0.0%
D-cache miss          27.9% to 27.9%
DTB miss               9.2% to 18.3%
Write buffer           0.0% to  6.3%
Synchronization        0.0% to  0.0%
Branch mispredict      0.0% to  2.6%
IMUL busy              0.0% to  0.0%
FDIV busy              0.0% to  0.0%
Other                  0.0% to  0.0%
Unexplained stall      2.3% to  2.3%
Unexplained gain      -4.3% to -4.3%
-------------------------------------------------------------
Subtotal dynamic      44.1%
Slotting               1.8%
Ra dependency          2.0%
Rb dependency          1.0%
Rc dependency          0.0%
FU dependency          0.0%
-------------------------------------------------------------
Subtotal static        4.8%
-------------------------------------------------------------
Total stall           48.9%
Execution             51.2%
Net sampling error    -0.1%
-------------------------------------------------------------
Total tallied        100.0% (35171, 93.1% of all samples)
UPC, February 1999

  41. Example: Sorting Stalls
  %     cum%   cycles   cnt   cpi   blame    PC    file:line
10.0%  10.0%  109885   4998  22.0  dcache   957c  comp.c:484
 9.9%  19.8%  108776   5513  19.7  dcache   9530  comp.c:477
 7.8%  27.6%   85668   3836  22.3  dcache   959c  comp.c:488
UPC, February 1999

  42. Typical Hardware Support • Timers • Clock interrupt after N units of time • Performance Counters • Interrupt after N cycles, issues, loads, L1 Dcache misses, branch mispredicts, uops retired, ... • Alpha 21064, 21164; PPro, PII;… • Easy to measure total cycles, issues, CPI, etc. Only extra information is restart PC UPC, February 1999

  43. Problem: Inaccurate Attribution • Experiment • count data loads • loop: a single load plus hundreds of nops • In-order processor (Alpha 21164): samples skew, but pile up in one large peak • Out-of-order processor (Intel Pentium Pro): samples skew and smear across many instructions UPC, February 1999

  44. Ramifications of Misattribution • No skew or smear • Instruction-level analysis is easy! • Skew is a constant number of cycles • Instruction-level analysis is possible • Adjust sampling period by amount of skew • Infer execution counts, CPI, stalls, and stall explanations from cycle samples and the program • Smear • Instruction-level analysis seems hopeless • Examples: PII, StrongARM UPC, February 1999

  45. Desired Hardware Support • Sample fetched instructions • Save PC of sampled instruction • E.g., interrupt handler reads Internal Processor Register • Makes skew and smear irrelevant • Gather more information UPC, February 1999

  46. ProfileMe: Instruction-Centric Profiling [Pipeline diagram: fetch → map → issue → exec → retire. When the fetch counter overflows, a randomly selected fetched instruction is tagged (“ProfileMe tag!”); as it flows past the branch predictor, icache, dcache, and arithmetic units, internal processor registers capture its PC, icache/dcache miss flags, branch mispredict flag and history, stage latencies, effective address, and retire/done status, and an interrupt delivers the sample at the end (“capture!”).] UPC, February 1999

  47. Instruction-Level Statistics • PC + Retire Status → execution frequency • PC + Cache Miss Flag → cache miss rates • PC + Branch Mispredict → mispredict rates • PC + Event Flag → event rates • PC + Branch Direction → edge frequencies • PC + Branch History → path execution rates • PC + Latency → instruction stalls “100-cycle dcache miss” vs. “dcache miss” UPC, February 1999
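Turning those per-sample fields into per-PC statistics is plain bookkeeping. The sketch below aggregates a batch of hypothetical ProfileMe-style records (the field names are mine, chosen for illustration) into retire counts, dcache miss rates, and mean latencies per PC; keeping the latency is what lets the analysis tell a 100-cycle dcache miss from a cheap one, as the slide suggests:

#include <stdio.h>

/* Hypothetical ProfileMe-style sample: the fate of one tagged instruction. */
typedef struct {
    unsigned long pc;
    int retired;        /* 1 if the instruction retired */
    int dcache_miss;    /* 1 if it missed in the dcache */
    int latency;        /* cycles observed for the sampled stage */
} PmSample;

/* Report per-PC retire counts, dcache miss rate, and mean latency for a
 * small batch of samples (a real tool would hash PCs instead of scanning). */
static void summarize(const PmSample *s, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++) {
        int seen = 0;
        for (k = 0; k < i; k++)
            if (s[k].pc == s[i].pc) { seen = 1; break; }
        if (seen) continue;                /* this PC was already reported */

        long samples = 0, retired = 0, misses = 0, latency = 0;
        for (j = 0; j < n; j++) {
            if (s[j].pc != s[i].pc) continue;
            samples++;
            retired += s[j].retired;
            misses  += s[j].dcache_miss;
            latency += s[j].latency;
        }
        printf("pc 0x%lx: %ld samples, %ld retired, miss rate %.0f%%, mean latency %.1f\n",
               s[i].pc, samples, retired,
               100.0 * misses / samples, (double)latency / samples);
    }
}

int main(void)
{
    PmSample batch[] = {
        { 0x957c, 1, 1, 100 }, { 0x957c, 1, 0, 3 }, { 0x957c, 0, 1, 95 },
        { 0x9530, 1, 0, 2 },   { 0x9530, 1, 1, 110 },
    };
    summarize(batch, (int)(sizeof batch / sizeof batch[0]));
    return 0;
}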

  48. Data Analysis [Diagram: samples and compiled code feed an ANALYSIS step that produces per-instruction frequency, cycles per instruction, and stall explanations.] • Cycle samples are proportional to total time at the head of the issue queue (at least on in-order Alphas) • Frequency indicates frequent paths • CPI indicates stalls UPC, February 1999

  49. Estimating Frequency from Samples [Example: 1,000,000 cycle samples could mean 1,000,000 executions at 1 CPI or 10,000 executions at 100 CPI.] • Problem • given cycle samples, compute frequency and CPI • Approach • Let F = Frequency / Sampling Period • E(Cycle Samples) = F × CPI • So … F = E(Cycle Samples) / CPI UPC, February 1999

  50. Estimating Frequency (cont.) F = E(Cycle Samples) / CPI • Idea • If no dynamic stall, then know CPI, so can estimate F • So… assume some instructions have no dynamic stalls • Consider a group of instructions with the same frequency (e.g., basic block) • Identify instructions w/o dynamic stalls; then average their sample counts for better accuracy • Key insight: • Instructions without stalls have smaller sample counts UPC, February 1999
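A toy version of that procedure, under two simplifying assumptions of mine (the stall-free CPI of every instruction in the block is 1, and the sample counts are already scaled by the sampling period), picks the instructions whose counts sit near the block minimum as the stall-free ones, averages them to estimate the frequency, and then reads off per-instruction CPI:

#include <stdio.h>

/* Estimate a basic block's execution frequency from per-instruction cycle
 * sample counts. Assumptions (for illustration only): stall-free CPI is 1
 * for every instruction, samples are already multiplied by the sampling
 * period, and instructions within ~20% of the block's minimum sample
 * count are treated as having no dynamic stall. */
static double estimate_frequency(const double *samples, int n)
{
    double min = samples[0];
    double sum = 0.0;
    int used = 0;
    int i;

    for (i = 1; i < n; i++)
        if (samples[i] < min) min = samples[i];

    for (i = 0; i < n; i++) {
        if (samples[i] <= 1.2 * min) {   /* likely stall-free */
            sum += samples[i];
            used++;
        }
    }
    return sum / used;   /* F = E(cycle samples) / CPI, with CPI = 1 */
}

int main(void)
{
    /* Cycle samples for one basic block: two instructions stall badly. */
    double samples[] = { 1010, 980, 22000, 1005, 19700, 995 };
    int n = (int)(sizeof samples / sizeof samples[0]);
    int i;

    double freq = estimate_frequency(samples, n);
    printf("estimated frequency: %.0f executions\n", freq);
    for (i = 0; i < n; i++)
        printf("  inst %d: CPI ~ %.1f\n", i, samples[i] / freq);
    return 0;
}

In the real analysis the stall-free CPI comes from the compile-time schedule rather than being assumed to be 1; averaging the stall-free instructions' counts, as on the slide, is what improves the accuracy of the frequency estimate.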
