
Design and Evaluation of Architectures for Commercial Applications

Presentation Transcript


  1. Design and Evaluation of Architectures for Commercial Applications Part II: tools & methods Luiz André Barroso

  2. Overview • Evaluation methods/tools • Introduction • Software instrumentation (ATOM) • Hardware measurement & profiling • IPROBE • DCPI • ProfileMe • Tracing & trace-driven simulation • User-level simulators • Complete machine simulators (SimOS) UPC, February 1999

  3. Studying commercial applications: challenges • Size of the data sets and programs • Complex control flow • Complex interactions with Operating System • Difficult tuning process • Lack of access to source code (?) • Vendor restrictions on publications • important to have a rich set of tools UPC, February 1999

  4. Tools are useful in many phases • Understanding behavior of workloads • Tuning • Performance measurements in existing systems • Performance estimation for future systems UPC, February 1999

  5. Using ordinary system tools • Measuring CPU utilization and balance • Determining user/system breakdown • Detecting I/O bottlenecks • Disks • Networks • Monitoring memory utilization and swap activity UPC, February 1999

  6. Gathering symbol table information • Most database programs are large statically linked stripped binaries • Most tools will require symbol table information • However, distributions typically consist of object files with symbolic data • Simple trick: • replace the system linker with a wrapper that removes the “strip” flag, then calls the real linker UPC, February 1999

  7. ATOM: A Tool-Building System Developed at WRL by Alan Eustace & Amitabh Srivastava Easy to build new tools Flexible enough to build interesting tools Fast enough to run on real applications Compiler independent: works on existing binaries UPC, February 1999

  8. Code Instrumentation • Application appears unchanged • ATOM adds code and data to the application • Information collected as a side effect of execution [Diagram: the TOOL is carried inside the application like a Trojan horse] UPC, February 1999

  9. ATOM Programming Interface Given an application program: • Navigation: Move around • Interrogation: Ask questions • Definition: Define interface to analysis procedures • Instrumentation: Add calls to analysis procedures Pass ANYTHING as arguments! PC, effective addresses, constants, register values, arrays, function arguments, line numbers, procedure names, file names, etc. UPC, February 1999

  10. Navigation Primitives • Get{First,Last,Next,Prev}Obj • Get{First,Last,Next,Prev}ObjProc • Get{First,Last,Next,Prev}Block • Get{First,Last,Next,Prev}Inst • GetInstBlock - Find enclosing block • GetBlockProc - Find enclosing procedure • GetProcObj - Find enclosing object • GetInstBranchTarget - Find branch target • ResolveTargetProc - Find subroutine destination UPC, February 1999

  11. Interrogation • GetProgramInfo(PInfo) • number of procedures, blocks, and instructions. • text and data addresses • GetProcInfo(Proc *, BlockInfo) • Number of blocks or instructions • Procedure frame size, integer and floating point save masks • GetBlockInfo(Inst *, InstInfo) • Number of instructions • Any piece of the instruction (opcode, ra, rb, displacement) UPC, February 1999

  12. Interrogation(2) • ProcFileName • Returns the file name for this procedure • InstLineNo • Returns the line number of this instruction • GetInstRegEnum • Returns a unique register specifier • GetInstRegUsage • Computes Source and Destination masks UPC, February 1999

  13. Interrogation(3) • GetInstRegUsage • Computes instruction source and destination masks
GetInstRegUsage(instFirst, &usageFirst);
GetInstRegUsage(instSecond, &usageSecond);
if (usageFirst.dreg_bitvec[0] & usageSecond.ureg_bitvec[0]) {
    /* set followed by a use */
}
Exactly what you need to find static pipeline stalls! UPC, February 1999
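To make the test concrete outside of ATOM, here is a minimal standalone sketch of the same dependence check, using plain bit vectors of my own rather than ATOM's actual register-usage types (whose exact names are not shown in the deck):

#include <stdio.h>

/* Hypothetical stand-ins for the register-usage masks that ATOM's
 * GetInstRegUsage would fill in: bit r of dreg is set if the instruction
 * writes register r, bit r of ureg is set if it reads register r. */
typedef struct {
    unsigned long dreg;   /* destination (written) registers */
    unsigned long ureg;   /* source (read) registers */
} RegUsage;

/* Nonzero if 'second' reads a register that 'first' writes: a set
 * followed by a use, i.e. a static pipeline stall candidate. */
static int set_then_use(const RegUsage *first, const RegUsage *second)
{
    return (first->dreg & second->ureg) != 0;
}

int main(void)
{
    /* ldq r1, 0(r2)  followed by  addq r1, r3, r4 */
    RegUsage ld  = { 1ul << 1, 1ul << 2 };
    RegUsage add = { 1ul << 4, (1ul << 1) | (1ul << 3) };

    if (set_then_use(&ld, &add))
        printf("set followed by a use: potential static stall\n");
    return 0;
}

In a real ATOM instrumentation file the same check would be applied to each adjacent pair of instructions in a basic block, with the masks obtained from GetInstRegUsage as in the snippet above.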

  14. Definition AddCallProto(“function(argument list)”) • Constants • Character strings • Program counter • Register contents • Cycle counter • Constant arrays • Effective Addresses • Branch Condition Values UPC, February 1999

  15. Instrumentation • AddCallProgram(Program{Before,After}, “name”,args) • AddCallProc(p, Proc{Before,After}, “name”,args) • AddCallBlock(b, Block{Before,After}, “name”,args) • AddCallInst(i, Inst{Before,After}, “name”,args) • ReplaceProc(p, “new”) UPC, February 1999

  16. Example #1: Procedure Tracing What procedures are executed by the following mystery program?
#include <stdio.h>
main()
{
    printf("Hello world!\n");
}
Hint: main => printf => ??? UPC, February 1999

  17. Procedure Tracing Example
> cc hello.c -non_shared -g1 -o hello
> atom hello ptrace.inst.c ptrace.anal.c -o hello.ptrace
> hello.ptrace
=> __start
=> main
=> printf
=> _doprnt
=> __getmbcurmax
<= __getmbcurmax
=> memcpy
<= memcpy
=> fwrite
UPC, February 1999

  18. Procedure Trace (2) UPC, February 1999

  19. Example #2: Cache Simulator Write a tool that computes the miss rate of the application running in a 64KB, direct-mapped data cache with 32-byte lines.
> atom spice cache.inst.o cache.anal.o -o spice.cache
> spice.cache < ref.in > ref.out
> more cache.out
5,387,822,402 620,855,884 11.523%
Great use for 64-bit integers! UPC, February 1999

  20. Cache Tool Implementation [Diagram: the application before and after instrumentation. ATOM inserts a call to Reference(effective address), e.g. Reference(-32592(gp)), before each load and store, and a call to PrintResults() at program exit. Note: addresses are passed as if the code were uninstrumented!] UPC, February 1999

  21. Cache Instrumentation File
#include <stdio.h>
#include <cmplrs/atom.inst.h>

unsigned InstrumentAll(int argc, char **argv)
{
    Obj *o; Proc *p; Block *b; Inst *i;

    AddCallProto("Reference(VALUE)");
    AddCallProto("Print()");
    for (o = GetFirstObj(); o != NULL; o = GetNextObj(o)) {
        if (BuildObj(o)) return (1);
        if (o == GetFirstObj()) AddCallObj(o, ObjAfter, "Print");
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
                for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
                    if (IsInstType(i, InstTypeLoad) || IsInstType(i, InstTypeStore))
                        AddCallInst(i, InstBefore, "Reference", EffAddrValue);
        WriteObj(o);
    }
    return (0);
}
UPC, February 1999

  22. Cache Analysis File
#include <stdio.h>

#define CACHE_SIZE   65536
#define BLOCK_SHIFT  5

long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;

void Reference(long address)
{
    int index = (address & (CACHE_SIZE - 1)) >> BLOCK_SHIFT;
    long tag = address >> BLOCK_SHIFT;
    if (cache[index] != tag) {
        misses++;
        cache[index] = tag;
    }
    refs++;
}

void Print()
{
    FILE *file = fopen("cache.out", "w");
    fprintf(file, "%ld %ld %.3f%%\n", refs, misses, 100.0 * misses / refs);
    fclose(file);
}
UPC, February 1999

  23. Example #3: TPC-B runtime information • Statistics per transaction: • Instructions 180,398 • Loads (% shared) 47,643 (24%) • Stores (% shared) 21,380 (22%) • Lock/Unlock 118 • Memory barriers (MBs) 241 • Footprints/CPU: • Instr. 300 KB (1.6 MB in pages) • Private data 470 KB (4 MB in pages) • Shared data 7 MB (26 MB in pages) • 50% of the shared data footprint is touched by at least one other process UPC, February 1999

  24. TPC-B (2) UPC, February 1999

  25. TPC-B (3) UPC, February 1999

  26. Oracle SGA activity in TPC-B UPC, February 1999

  27. ATOM wrap-up • Very flexible “hack-it-yourself” tool • Discover detailed information on dynamic behavior of programs • Especially good when you don’t have source code • Shipped with Digital Unix • Can be used for tracing (later) UPC, February 1999

  28. Hardware measurement tools • IPROBE • interface to CPU event counters • DCPI • hardware assisted profiling • ProfileMe • hardware assisted profiling for complex CPU cores UPC, February 1999

  29. IPROBE • Developed by Digital’s Performance Group • Uses the event counters provided by Alpha CPUs • Operation: • set a counter to monitor a particular event (e.g., icache_miss) • start the counter • on every counter overflow, an interrupt wakes up a handler and events are accumulated • stop the counter and read the total • User can select: • which processes to count • user level, kernel level, or both UPC, February 1999

  30. IPROBE: 21164 event types
• Issue: issues, single_issue_cycles, dual_issue_cycles, triple_issue_cycles, quad_issue_cycles, split_issue_cycles, pipe_dry, pipe_frozen, replay_trap, ldu_replays, wb_maf_full_replays
• Cycles and stalls: cycles, long_stalls, mem_barrier_cycles
• Branches: branches, cond_branches, jsr_ret, branch_mispr, pc_mispr
• Operations: integer_ops, float_ops, loads, stores, loads_merged, load_locked
• TLBs: itb_miss, dtb_miss
• On-chip caches: icache_access, icache_miss, dcache_access, dcache_miss, scache_access, scache_read, scache_read_miss, scache_write, scache_write_miss, scache_sh_write, scache_miss, scache_victim
• Board-level cache: bcache_hit, bcache_miss, bcache_victim
• System/external: sys_req, sys_read_req, sys_inv, external
UPC, February 1999

  31. IPROBE: what you can do • Directly measure relevant events (e.g. cache performance) • Overall CPU cycle breakdown diagnosis: • microbenchmark the machine to estimate latencies • combine latencies with event counts (see the sketch below) • Main source of inaccuracy: • load/store overlap in the memory system UPC, February 1999
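A back-of-the-envelope version of that diagnosis, with every event name, count, and latency below a made-up placeholder rather than a measurement from this deck, multiplies each event count by its microbenchmarked latency and expresses the product as a share of total cycles:

#include <stdio.h>

/* Hypothetical per-event data: counts would come from IPROBE runs on the
 * workload, latencies (in CPU cycles) from microbenchmarks of the machine. */
typedef struct {
    const char *event;
    double count;     /* events observed during the run */
    double latency;   /* estimated cycles lost per event */
} StallSource;

int main(void)
{
    /* Placeholder numbers, purely for illustration. */
    StallSource src[] = {
        { "scache_miss (clean)", 4.0e8,  50.0 },
        { "bcache_miss (clean)", 1.2e8,  80.0 },
        { "bcache_miss (dirty)", 3.0e7, 130.0 },
        { "dtb_miss",            5.0e7,  40.0 },
    };
    double total_cycles = 1.0e11;   /* from the cycles counter */
    double instructions = 2.0e10;   /* from an instruction-count event */
    int i;

    printf("overall CPI = %.1f\n", total_cycles / instructions);
    for (i = 0; i < (int)(sizeof src / sizeof src[0]); i++) {
        double stall = src[i].count * src[i].latency;
        printf("%-22s ~%.1f%% of cycles\n",
               src[i].event, 100.0 * stall / total_cycles);
    }
    return 0;
}

The main caveat is exactly the one on the slide: independent misses can overlap in the memory system, so summing count × latency tends to overestimate stall time.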

  32. IPROBE example: 4-CPU SMP [Charts: breakdown of CPU cycles and estimated breakdown of stall cycles; measured CPI = 7.4] UPC, February 1999

  33. Why did it run so badly?!? • Nominal memory latencies were good: 80 cycles • Micro-benchmarks determined that: • latency under load is over 120 cycles on 4 processors • base dirty miss latency was over 130 cycles • off-chip cache latency was high • IPROBE data uncovered significant sharing: • for P=2, 15% of bcache misses are to dirty blocks • for P=4, 20% of bcache misses are to dirty blocks UPC, February 1999

  34. Dirty miss latency on RISC SMPs • SPEC benchmarks have no significant sharing • Current processors/systems optimize local cache access • All RISC SMPs have high dirty miss penalties UPC, February 1999

  35. DCPI: continuous profiling infrastructure • Developed by SRC and WRL researchers • Based on periodic sampling • Hardware generates periodic interrupts • OS handles the interrupts and stores data • Program Counter (PC) and any extra info • Analysis Tools convert data • for users • for compilers Other examples: SGI Speedshop, Unix’s prof(), VTune UPC, February 1999
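Conceptually, the interrupt-time work in such a profiler is tiny: hash the restart PC (plus a process or image identifier) into a buffer of counters and let user-level tools attribute the counts later. Below is a minimal user-level sketch of that binning step, with a made-up sample record and hash layout rather than DCPI's actual data structures:

#include <stdio.h>

/* Hypothetical sample record: what a periodic counter-overflow interrupt
 * might hand to the profiler (DCPI's real records and tables differ). */
typedef struct {
    unsigned long pc;     /* restart PC of the interrupted instruction */
    unsigned int  pid;    /* process that was running */
} Sample;

#define BUCKETS 4096

typedef struct {
    unsigned long pc;
    unsigned int  pid;
    unsigned long count;
} Bucket;

static Bucket table[BUCKETS];

/* Accumulate one sample into an open-addressed hash keyed by (pid, pc). */
static void record(const Sample *s)
{
    unsigned long h = (s->pc ^ ((unsigned long)s->pid << 16)) % BUCKETS;
    unsigned int probe;
    for (probe = 0; probe < BUCKETS; probe++) {
        Bucket *b = &table[(h + probe) % BUCKETS];
        if (b->count == 0 || (b->pc == s->pc && b->pid == s->pid)) {
            b->pc = s->pc;
            b->pid = s->pid;
            b->count++;
            return;
        }
    }
    /* Table full: a real profiler would flush buckets to a daemon here. */
}

int main(void)
{
    Sample demo[] = { { 0x957c, 42 }, { 0x957c, 42 }, { 0x9530, 42 } };
    unsigned int i;
    for (i = 0; i < sizeof demo / sizeof demo[0]; i++)
        record(&demo[i]);
    for (i = 0; i < BUCKETS; i++)
        if (table[i].count)
            printf("pid %u  pc 0x%lx: %lu samples\n",
                   table[i].pid, table[i].pc, table[i].count);
    return 0;
}

Keeping the per-interrupt work this small is what makes the 1%-3% overhead quoted on the next slide plausible.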

  36. Sampling vs. Instrumentation • Much lower overhead than instrumentation • DCPI: program 1%-3% slower • Pixie: program 2-3 times slower • Applicable to large workloads • 100,000 TPS on Alpha • AltaVista • Easier to apply to whole systems (kernel, device drivers, shared libraries, ...) • Instrumenting kernels is very tricky • No source code needed UPC, February 1999

  37. Information from Profiles DCPI estimates • Where CPU cycles went, broken down by • image, procedure, instruction • How often code was executed • basic blocks and CFG edges • Where peak performance was lost and why UPC, February 1999

  38. Example: Getting the Big Picture
Total samples for event type cycles = 6095201

  cycles      %    cum%  load file
 2257103  37.03%  37.03%  /usr/shlib/X11/lib_dec_ffb_ev5.so
 1658462  27.21%  64.24%  /vmunix
  928318  15.23%  79.47%  /usr/shlib/X11/libmi.so
  650299  10.67%  90.14%  /usr/shlib/X11/libos.so

  cycles      %    cum%  procedure               load file
 2064143  33.87%  33.87%  ffb8ZeroPolyArc         /usr/shlib/X11/lib_dec_ffb_ev5.so
  517464   8.49%  42.35%  ReadRequestFromClient   /usr/shlib/X11/libos.so
  305072   5.01%  47.36%  miCreateETandAET        /usr/shlib/X11/libmi.so
  271158   4.45%  51.81%  miZeroArcSetup          /usr/shlib/X11/libmi.so
  245450   4.03%  55.84%  bcopy                   /vmunix
  209835   3.44%  59.28%  Dispatch                /usr/shlib/X11/libdix.so
  186413   3.06%  62.34%  ffb8FillPolygon         /usr/shlib/X11/lib_dec_ffb_ev5.so
  170723   2.80%  65.14%  in_checksum             /vmunix
  161326   2.65%  67.78%  miInsertEdgeInET        /usr/shlib/X11/libmi.so
  133768   2.19%  69.98%  miX1Y1X2Y2InRegion      /usr/shlib/X11/libmi.so
UPC, February 1999

  39. Example: Using the Microscope Where peak performance is lost and why UPC, February 1999

  40. Example: Summarizing Stalls
I-cache (not ITB)      0.0% to  0.3%
ITB/I-cache miss       0.0% to  0.0%
D-cache miss          27.9% to 27.9%
DTB miss               9.2% to 18.3%
Write buffer           0.0% to  6.3%
Synchronization        0.0% to  0.0%
Branch mispredict      0.0% to  2.6%
IMUL busy              0.0% to  0.0%
FDIV busy              0.0% to  0.0%
Other                  0.0% to  0.0%
Unexplained stall      2.3% to  2.3%
Unexplained gain      -4.3% to -4.3%
-------------------------------------------------------------
Subtotal dynamic      44.1%
Slotting               1.8%
Ra dependency          2.0%
Rb dependency          1.0%
Rc dependency          0.0%
FU dependency          0.0%
-------------------------------------------------------------
Subtotal static        4.8%
-------------------------------------------------------------
Total stall           48.9%
Execution             51.2%
Net sampling error    -0.1%
-------------------------------------------------------------
Total tallied        100.0% (35171, 93.1% of all samples)
UPC, February 1999

  41. Example: Sorting Stalls
  %     cum%   cycles   cnt   cpi   blame    PC    file:line
10.0%  10.0%  109885   4998  22.0  dcache   957c  comp.c:484
 9.9%  19.8%  108776   5513  19.7  dcache   9530  comp.c:477
 7.8%  27.6%   85668   3836  22.3  dcache   959c  comp.c:488
UPC, February 1999

  42. Typical Hardware Support • Timers • Clock interrupt after N units of time • Performance Counters • Interrupt after N cycles, issues, loads, L1 Dcache misses, branch mispredicts, uops retired, ... • Alpha 21064, 21164; PPro, PII;… • Easy to measure total cycles, issues, CPI, etc. Only extra information is restart PC UPC, February 1999

  43. Problem: Inaccurate Attribution • Experiment • count data loads • loop: a single load plus hundreds of nops • In-order processor (Alpha 21164): samples skew, but pile up in one large peak • Out-of-order processor (Intel Pentium Pro): samples skew and smear across many instructions UPC, February 1999

  44. Ramifications of Misattribution • No skew or smear • Instruction-level analysis is easy! • Skew is a constant number of cycles • Instruction-level analysis is possible • Adjust sampling period by amount of skew • Infer execution counts, CPI, stalls, and stall explanations from cycle samples and the program • Smear • Instruction-level analysis seems hopeless • Examples: PII, StrongARM UPC, February 1999

  45. Desired Hardware Support • Sample fetched instructions • Save PC of sampled instruction • E.g., interrupt handler reads Internal Processor Register • Makes skew and smear irrelevant • Gather more information UPC, February 1999

  46. ProfileMe: Instruction-Centric Profiling [Pipeline diagram: fetch → map → issue → exec → retire. When the fetch counter overflows, a randomly selected fetched instruction is tagged (“ProfileMe tag!”); as it flows past the branch predictor, icache, dcache, and arithmetic units, internal processor registers capture its PC, icache/dcache miss flags, branch mispredict flag and history, stage latencies, effective address, and retire/done status, and an interrupt delivers the sample at the end (“capture!”).] UPC, February 1999

  47. Instruction-Level Statistics • PC + Retire Status → execution frequency • PC + Cache Miss Flag → cache miss rates • PC + Branch Mispredict → mispredict rates • PC + Event Flag → event rates • PC + Branch Direction → edge frequencies • PC + Branch History → path execution rates • PC + Latency → instruction stalls “100-cycle dcache miss” vs. “dcache miss” UPC, February 1999
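Turning those per-sample fields into per-PC statistics is plain bookkeeping. The sketch below aggregates a batch of hypothetical ProfileMe-style records (the field names are mine, chosen for illustration) into retire counts, dcache miss rates, and mean latencies per PC; keeping the latency is what lets the analysis tell a 100-cycle dcache miss from a cheap one, as the slide suggests:

#include <stdio.h>

/* Hypothetical ProfileMe-style sample: the fate of one tagged instruction. */
typedef struct {
    unsigned long pc;
    int retired;        /* 1 if the instruction retired */
    int dcache_miss;    /* 1 if it missed in the dcache */
    int latency;        /* cycles observed for the sampled stage */
} PmSample;

/* Report per-PC retire counts, dcache miss rate, and mean latency for a
 * small batch of samples (a real tool would hash PCs instead of scanning). */
static void summarize(const PmSample *s, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++) {
        int seen = 0;
        for (k = 0; k < i; k++)
            if (s[k].pc == s[i].pc) { seen = 1; break; }
        if (seen) continue;                /* this PC was already reported */

        long samples = 0, retired = 0, misses = 0, latency = 0;
        for (j = 0; j < n; j++) {
            if (s[j].pc != s[i].pc) continue;
            samples++;
            retired += s[j].retired;
            misses  += s[j].dcache_miss;
            latency += s[j].latency;
        }
        printf("pc 0x%lx: %ld samples, %ld retired, miss rate %.0f%%, mean latency %.1f\n",
               s[i].pc, samples, retired,
               100.0 * misses / samples, (double)latency / samples);
    }
}

int main(void)
{
    PmSample batch[] = {
        { 0x957c, 1, 1, 100 }, { 0x957c, 1, 0, 3 }, { 0x957c, 0, 1, 95 },
        { 0x9530, 1, 0, 2 },   { 0x9530, 1, 1, 110 },
    };
    summarize(batch, (int)(sizeof batch / sizeof batch[0]));
    return 0;
}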

  48. Data Analysis [Diagram: samples and compiled code feed an ANALYSIS step that produces per-instruction frequency, cycles per instruction, and stall explanations.] • Cycle samples are proportional to total time at the head of the issue queue (at least on in-order Alphas) • Frequency indicates frequent paths • CPI indicates stalls UPC, February 1999

  49. Estimating Frequency from Samples [Example: 1,000,000 cycle samples could mean 1,000,000 executions at 1 CPI or 10,000 executions at 100 CPI.] • Problem • given cycle samples, compute frequency and CPI • Approach • Let F = Frequency / Sampling Period • E(Cycle Samples) = F × CPI • So … F = E(Cycle Samples) / CPI UPC, February 1999

  50. Estimating Frequency (cont.) F = E(Cycle Samples) / CPI • Idea • If no dynamic stall, then know CPI, so can estimate F • So… assume some instructions have no dynamic stalls • Consider a group of instructions with the same frequency (e.g., basic block) • Identify instructions w/o dynamic stalls; then average their sample counts for better accuracy • Key insight: • Instructions without stalls have smaller sample counts UPC, February 1999
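A toy version of that procedure, under two simplifying assumptions of mine (the stall-free CPI of every instruction in the block is 1, and the sample counts are already scaled by the sampling period), picks the instructions whose counts sit near the block minimum as the stall-free ones, averages them to estimate the frequency, and then reads off per-instruction CPI:

#include <stdio.h>

/* Estimate a basic block's execution frequency from per-instruction cycle
 * sample counts. Assumptions (for illustration only): stall-free CPI is 1
 * for every instruction, samples are already multiplied by the sampling
 * period, and instructions within ~20% of the block's minimum sample
 * count are treated as having no dynamic stall. */
static double estimate_frequency(const double *samples, int n)
{
    double min = samples[0];
    double sum = 0.0;
    int used = 0;
    int i;

    for (i = 1; i < n; i++)
        if (samples[i] < min) min = samples[i];

    for (i = 0; i < n; i++) {
        if (samples[i] <= 1.2 * min) {   /* likely stall-free */
            sum += samples[i];
            used++;
        }
    }
    return sum / used;   /* F = E(cycle samples) / CPI, with CPI = 1 */
}

int main(void)
{
    /* Cycle samples for one basic block: two instructions stall badly. */
    double samples[] = { 1010, 980, 22000, 1005, 19700, 995 };
    int n = (int)(sizeof samples / sizeof samples[0]);
    int i;

    double freq = estimate_frequency(samples, n);
    printf("estimated frequency: %.0f executions\n", freq);
    for (i = 0; i < n; i++)
        printf("  inst %d: CPI ~ %.1f\n", i, samples[i] / freq);
    return 0;
}

In the real analysis the stall-free CPI comes from the compile-time schedule rather than being assumed to be 1; averaging the stall-free instructions' counts, as on the slide, is what improves the accuracy of the frequency estimate.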
