
Hardware and Software Tracing

Hardware and Software Tracing. David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu. Trace Collection Methodologies. Hardware Monitors and instrumentation Microcode Software Trap-based system Emulators


Presentation Transcript


  1. Hardware and Software Tracing David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu

  2. Trace Collection Methodologies • Hardware • Monitors and instrumentation • Microcode • Software • Trap-based system • Emulators • Code annotation (source, object, executable) • Direct execution

  3. Metrics for Evaluating Trace Collection Methodologies • Speed – trace capture rate • Memory – extra memory used • Accuracy – address perturbation • Intrusiveness – tracing overhead • Completeness – OS, interrupts, libraries • Granularity – smallest traceable unit • Flexibility – ease of use • Portability – platform dependence • Capacity – trace storage space • Cost - $$, time

  4. Hardware Monitors • Capture trace at peak execution rates • Challenge - match storage media speed to tracing needs utilizing interleaving and multiplexing • Pros: • Non-intrusive • Accurate • Complete • Cons: • Expensive • Limited probeability • Limited trace length

  5. Examples of Hardware Monitors • Monster (U. of Michigan 1992) – R2000 traces using a DAS9200 • BACH (BYU 1992) – i486, Pentium, SPARC, 68K – developed a customized pod – being used by Intel today • Real-time Tracer (IBM 1992) – Customized SRAM array • National Instruments (2006) – provides a family of programmable instrumentation monitors

  6. Microcode-based Tracing • Places hooks in microcode to capture machine state • Pros: • Complete (OS, application) • Minimal slowdown (2-10x) • Cons: • Microcode is dated technology • Nonportable

  7. Example Microcode-based Tracing • ATUM (Stanford 1986) – VAX traces • PatchWrx (DEC WRL 1995, NU 1996) – Complete OS-rich traces on Alpha running NT

  8. Instrumenting NT-based Workloads

  9. Participants • Chakib Ouarraoui – EMC • Jason Casmira – Intel • John Fraser – US Air Force • David Hunter – VMWare • Sharon Smith – HP • Richard Sites – Adobe Systems

  10. Tracing tools that capture OS activity

  11. OS Rich and NT-based Instrumentation Tools • SimOS • UNIX-based platforms – (basis for VMWare) • OS, memory, I/O activity • High overhead (10X - 50,000X) • Etch • Intel x86-based platform • No OS activity • 35X slowdown

  12. PatchWrx Overview • Dynamic execution tracing tool suite • Captures full system workloads • Traces branches executed by the processor • Reconstructs full instruction stream • DEC Alpha 21064 Windows NT 4.0 platforms • Low overhead with minimum slowdown • 2X while running • 4X while tracing

  13. PatchWrx Components • PALcode – Alpha Privileged Architecture Library • Reserves trace buffer upon boot • Captures trace info • Facilitates long branches • Patch – instrument all NT images • Trace – collect runtime information • Reconstruct – reconstitute the information

  14. Patching an Image • Instrument all WinNT binary image types • COM, EXE, DLL, SYS, DRV • Replace branch-type instructions with branches to PatchWrx PAL calls • Log trace entry of branch type into buffer • Branch to original target
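
The patching steps above can be sketched with a toy model. This is an illustration only, not PatchWrx code: the `Image` class, the `patch_image` and `run` helpers, the dictionary-based "image", and the use of unconditional branches only are all assumptions made for the example. Each branch is redirected to a stub in a patch section; the stub logs a trace entry (standing in for the PAL call) and then continues to the original target.

```python
from dataclasses import dataclass, field


@dataclass
class Image:
    # address -> (opcode, branch target or None); a toy stand-in for a binary image
    code: dict
    # stub address -> (original branch address, original branch target)
    patch_section: dict = field(default_factory=dict)


def patch_image(image, patch_base):
    """Redirect every branch in the image to its own stub in the patch section."""
    next_stub = patch_base
    for addr in sorted(image.code):
        opcode, target = image.code[addr]
        if opcode == "BR":
            # The stub remembers where the branch came from and where it was going.
            image.patch_section[next_stub] = (addr, target)
            image.code[addr] = ("BR", next_stub)
            next_stub += 4
    return image


def run(image, start, max_steps):
    """Execute the toy image and return the logged (branch, target) trace."""
    trace = []
    pc = start
    for _ in range(max_steps):
        if pc in image.patch_section:
            source, target = image.patch_section[pc]
            trace.append((source, target))  # stands in for the PAL call that logs the entry
            pc = target                     # then branch to the original target
            continue
        if pc not in image.code:
            break
        opcode, target = image.code[pc]
        pc = target if opcode == "BR" else pc + 4
    return trace
```

The program's behavior is unchanged by patching in this model; only the detour through the stub is added, which is where the tracing overhead comes from.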

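  15. Patching an Image [Figure: the original image (blocks A, B) beside the patched image (blocks A’, B) and its patch section; numbered arrows 1–4 connect A’, the PWX PAL call, the patch-section BR, and B.]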

  16. Patching Large Images • Normal Alpha ISA branch instruction • (PC+4) + SEXT(disp21) * 4 • New PatchWrx long branches • LBR (PC+4) + SEXT(disp25) * 4 • LBSR (PC+4) + ZEXT(disp20) * 32
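
The three target formulas above can be evaluated directly. The sketch below is a worked example of the arithmetic only; the function names (`sext`, `zext`, `br_target`, `lbr_target`, `lbsr_target`) are made up for illustration and are not part of PatchWrx or the Alpha toolchain.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def zext(value, bits):
    """Zero-extend: keep only the low `bits` bits of value."""
    return value & ((1 << bits) - 1)


def br_target(pc, disp21):
    """Normal Alpha branch: (PC+4) + SEXT(disp21) * 4."""
    return (pc + 4) + sext(disp21, 21) * 4


def lbr_target(pc, disp25):
    """PatchWrx long branch LBR: (PC+4) + SEXT(disp25) * 4."""
    return (pc + 4) + sext(disp25, 25) * 4


def lbsr_target(pc, disp20):
    """PatchWrx long branch LBSR: (PC+4) + ZEXT(disp20) * 32."""
    return (pc + 4) + zext(disp20, 20) * 32


# Largest forward reach of each form, in bytes past PC+4.
BR_REACH = sext((1 << 20) - 1, 21) * 4      # just under 4 MiB
LBR_REACH = sext((1 << 24) - 1, 25) * 4     # just under 64 MiB
LBSR_REACH = zext((1 << 20) - 1, 20) * 32   # just under 32 MiB, forward only
```

The normal 21-bit displacement reaches roughly 4 MiB in either direction, which is why images larger than that need the longer forms to get to the patch section.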

  17. Patching Large Images [Figure: a long patched image; numbered arrows 1–6 connect A’, the PAL call, the patch section (capture, PWX PAL, BR), and B.]

  18. Tracing with PatchWrx • Trace • User controlled start/stop/dump • Dumps captured trace to binary file • Captures VA mapping snapshot of active processes during trace capture

  19. Reconstructing Execution [Figure: the reconstruct tool takes the raw trace, the VA map, images 0..n, and symbol tables 0..n, and produces the I-stream and/or D-stream.]
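
Reconstruction works because code between two logged branches executes sequentially. The sketch below is a minimal illustration of that idea, not the PatchWrx reconstruct tool: the `reconstruct` function, the `(branch_pc, target_pc)` record layout, and the fixed 4-byte instruction size are assumptions, and the real tool also consults the images, VA map, and symbol tables.

```python
def reconstruct(start_pc, branch_trace, end_pc, insn_size=4):
    """Expand (branch_pc, target_pc) records into the full stream of executed PCs."""
    stream = []
    pc = start_pc
    for branch_pc, target_pc in branch_trace:
        if pc > branch_pc:
            raise ValueError(f"trace record at {branch_pc:#x} is behind current PC {pc:#x}")
        # Straight-line code runs up to and including the logged branch.
        while pc <= branch_pc:
            stream.append(pc)
            pc += insn_size
        pc = target_pc
    # Tail of the trace: sequential execution until the stop point.
    while pc <= end_pc:
        stream.append(pc)
        pc += insn_size
    return stream
```

A trace with two records expands to seven instruction addresses here, which shows why logging only branches keeps the raw trace much smaller than the reconstructed instruction stream.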

  20. OS-Rich Workload Characterization • Execution domain analysis • Hot EXEs / DLLs (system resources) • Instruction mix • Application-only • Full system • Branching behavior • Branch frequency (average basic block size) • Branch prediction in presence of OS

  21. Workloads Investigated

  22. Five most frequently used images in each benchmark or application

  23. Average basic block lengths

  24. Conditional Branch Prediction 2-level BTB, 12-bit PHR, 4096 entries, gshare
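
For readers unfamiliar with gshare, a minimal sketch follows, using the 12-bit history and 4096 entries listed above. The rest is assumed for illustration and is not taken from the study: 2-bit saturating counters initialised to weakly not-taken, 4-byte instructions (so the PC is shifted right by 2), and the class and function names.

```python
class Gshare:
    """Gshare: index a table of 2-bit counters with (PC XOR global branch history)."""

    def __init__(self, history_bits=12, entries=4096):
        self.history_mask = (1 << history_bits) - 1
        self.index_mask = entries - 1      # entries is assumed to be a power of two
        self.history = 0
        self.counters = [1] * entries      # 0-1 predict not-taken, 2-3 predict taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & self.index_mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def accuracy(trace, predictor):
    """Fraction of (pc, taken) records the predictor gets right, training as it goes."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)
```

Because the history register is global, kernel and DLL branches interleaved with application branches change which counters the application indexes, which is one way OS activity can shift prediction accuracy in either direction.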

  25. Summary of Results • Benchmarks execute almost entirely within the application domain • Desktop applications execute across many images and interact with the kernel and system DLLs • Branch prediction accuracy can change drastically (sometimes it can even improve) when the operating system interaction is considered • The instruction mix in desktop applications changes significantly in the presence of OS • Increased number of indirect branches and privileged instructions (e.g., PALcalls)

  26. For Further Information 1. “Tracing and Characterization of Windows NT-based System Workloads,” J.P. Casmira, D.P. Hunter and D.R. Kaeli, Digital Technical Journal, Vol. 10, No. 1, 1998, pp. 6-21 (www.digital.com/info/DTJ01/DTJ01HM.HTM). 2. “Operating System Impact on Trace-Driven Simulation,” J.P. Casmira, J. Fraser and D.R. Kaeli, Proceedings of the 31st Simulation Symposium, Boston, MA, April 1998, pp. 76-82. 3. “A Code Annotation Tool for Capturing Operating System Execution,” J. Fraser, Northeastern University Technical Report, NUCAR_6-97-1, June 1997 (on the NUCAR website). http://www.ece.neu.edu/groups/nucar

  27. And now back to tracing……..

  28. Trap Based • Interrupt the application at selected points in order to save trace records • Pros: • Available on many CPUs • Portable • Inexpensive • Cons: • Considerable slowdown (1000x) • Intrusive (ISR), especially when considering real-time events • How do we decide where to interrupt the processor and still maintain a representative trace?

  29. Example Trap Based Systems • VAX-Tracer – Clark&Emer study on VAX • OS2-Tracer – Intel 386 • Wisconsin Wind Tunnel – ECC error trapping – CM5 (SPARC) • Tapeworm II system – ECC error trapping – OS trap handler

  30. Emulators • Simulate the target ISA using one or more machine instructions on the host ISA • Pros: • Minimal slowdown (10-100x) • Opportunity for JIT compilation • Portable • Flexible – software controlled • Cons: • Serious programming effort needed • Extra memory needed • Typically single process tracing

  31. Emulators • Shade (UW 1994) – dynamic translation • Compiles emulated instructions to native instructions (many elements of Shade have shown up in Transmeta products) • Host – SPARC-V8 • Targets – SPARC-V8, SPARC-V9, MIPS • Spa (Sun 1993) – Iterative interpretation • Reinterprets instructions on each occurrence • Host – MIPS-1 • Targets – MIPS-1, MIPS-2 • SPIM (U of Wisc 1991) – predecoded interpretation • Provides pointers to instruction handler and operands to speed decoding • Hosts – SPARC, 680x0, MIPS, HP-PA • Target – MIPS-1
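
The difference between iterative and predecoded interpretation can be shown on a toy ISA. Everything in this sketch is hypothetical (the two-instruction ISA, the text encoding, and the `decode` and `run` helpers); it is not how Spa or SPIM are implemented. It only illustrates that predecoding pays the decode cost once per static instruction, while iterative interpretation pays it on every execution.

```python
def do_addi(state, reg, imm):
    state["regs"][reg] += imm
    state["pc"] += 1


def do_jmp(state, target):
    state["pc"] = target


def decode(text):
    """Turn one instruction string into (handler, operands)."""
    op, *args = text.split()
    if op == "addi":
        return do_addi, (args[0], int(args[1]))
    if op == "jmp":
        return do_jmp, (int(args[0]),)
    raise ValueError(f"unknown opcode: {op}")


def run(program, steps, predecode):
    """Execute up to `steps` instructions; return (registers, number of decodes)."""
    state = {"regs": {"r0": 0}, "pc": 0}
    decodes = 0
    table = None
    if predecode:
        # Predecoded interpretation: decode every instruction once, up front.
        table = [decode(text) for text in program]
        decodes = len(program)
    for _ in range(steps):
        if state["pc"] >= len(program):
            break
        if predecode:
            handler, operands = table[state["pc"]]
        else:
            # Iterative interpretation: decode again on each occurrence.
            handler, operands = decode(program[state["pc"]])
            decodes += 1
        handler(state, *operands)
    return state["regs"], decodes
```

Both modes produce the same register state; on a two-instruction loop run for ten steps the iterative mode decodes ten times and the predecoded mode twice. Dynamic translation, as in Shade, goes one step further by emitting host code instead of a handler table.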

  32. More Recent Emulators • VisualDSP (Analog Devices 1995-present) • Simulator for SHARC and BlackFin DSPs that runs on WinTel and Linux-x86 • Provides C/C++ compilation environment • Statistical profiling • Cycle-accurate simulator • Provides a full visualization environment for machine performance • AMD Opteron X86-64 (2003) • Simulator for the new 64-bit X86 from AMD • Runs on 32-bit Linux-x86 • Comes complete with an X86-64 version of gcc • http://www.x86-64.org/

  33. MP Emulators • MINT (University of Rochester 1994) • Predecoded interpretation – memory references • Host – R3000 (SGI, DECstations) • Target – R3000, (an Alpha-based derivative was developed called AINT) • RSim (Rice Univ 1997) – Simulator for high-ILP Multiprocessors • Detailed cycle-based emulation • Host – SPARC, SGI PowerChallenge • Target – MIPS R10K

  34. Machine Emulators • Simics (1996-present) Virtutech • Developed out of research work at SICS • Provides a large number of CPU targets • Alpha, ARM, Itanium, MIPS, Pentium, PowerPC, SPARC, X86-64 • Provides both detailed simulation/emulation and high throughput • http://www.simics.com/ • SimOS (1997) Stanford University • Originally designed to run on an SGI platform • Actually boots a full operating system (SGI IRIX and DEC UNIX) • Implementations on Alpha and MIPS platforms • Designed around the operating system, emulating IO and other system-related events • Provided the base technology for VMWare products

  35. Code Annotation • Instrumented program produces trace while the application is run • Three levels of annotation • Source code modification • Object code modification • Binary code modification • Pros: • Ease of implementation • Small slowdown (10x) • Inexpensive • Cons: • Limited completeness (OS, multiprocessing) • May not capture DLLs • Memory dilation

  36. Source Code Annotation • TRAPEDS (Univ. of Illinois 1989) • Adds a call upon exit from a basic block • MPTrace (Univ. of Washington 1990) • I386, instruments only MP-relevant events • Tangolite (Stanford 1993) • Annotates all memory events in an MP environment

  37. Object Code Annotation • Epoxie (DEC WRL 1989) – Titan MP • Epoxie2 (DEC WRL 1993) – R3000 • ATOM (DEC WRL 1994) – Alpha • Alto (Univ. of Arizona 1996) – Alpha • PLTO (Univ. of Arizona 2001) – IA32

  38. Binary Code Annotation • Pixie (DEC 1991) – MIPS • Goblin (IBM/CMU 1991) – RS/6000 • IDtrace (Univ. of Mich.) – i486 • QPT (Univ. of Wisc.) – MIPS, SPARC • EEL (Univ. of Wisc.) – MIPS, SPARC • DSPTune (NEU) – ADI SHARC DSP • Pin (Intel 2005) – X86, XScale, Itanium

  39. Embedded Systems Profiling Tools • Enhance current embedded system compilation environments, providing profile-driven analysis and feedback capabilities • DSPTune - instrumentation and analysis package for the SHARC family of DSPs • Allows for full instrumentation of C and C++ codes at the source, assembly and ELF binary levels • Supported by Analog Devices and the NSF

  40. The DSPTune Toolset • A set of library routines that enable the user to instrument C and assembly programs • Function calls can be inserted at various locations in the application code, enabling execution driven simulation • The user provides: • instrumentation routines, which specify the selected instrumentation events (e.g., loads, branches, traps) • analysis routines, which carry out the desired simulation (e.g., caches, stacks, branch predictors)
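
The split between instrumentation routines and analysis routines can be illustrated with a toy example. This is not the DSPTune API: the `instrument` function, the `(kind, address)` event tuples, and the `DirectMappedCache` class are assumptions made for the sketch. The instrumentation side selects which events to forward, and the analysis side is a small cache simulation driven by those events.

```python
class DirectMappedCache:
    """Analysis routine: a tiny direct-mapped cache model that counts hits and misses."""

    def __init__(self, lines=4, line_bytes=16):
        self.lines = lines
        self.line_bytes = line_bytes
        self.tags = [None] * lines
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        block = addr // self.line_bytes
        index = block % self.lines
        if self.tags[index] == block:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[index] = block  # fill the line on a miss


def instrument(events, selected_kinds, analysis):
    """Instrumentation routine: forward only the selected event kinds to the analysis."""
    for kind, addr in events:
        if kind in selected_kinds:
            analysis(addr)
```

A usage example: `instrument(events, {"load"}, cache.access)` drives the cache model with loads only. Swapping in a different analysis callable (a stack model or a branch predictor) leaves the instrumentation side untouched, which is the point of keeping the two kinds of routine separate.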

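  41. [Figure: DSPTune instrumentation flow] • Step I – Parser: user application code → intermediate representation (IR) • Step II – Instrumenting tool: IR → instrumented IR • Step III – Code generator: instrumented IR → instrumented application code • Step IV – Assembler/linker: instrumented application code → instrumented application executable • Side inputs shown in the figure: user instrumentation code, user analysis code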

  42. BDSPTune • Provides capabilities similar to DSPTune • Allows ELF binaries to be instrumented • Enables instrumentation and profiling to include library routines

  43. Summary of Tracing Methodologies

  44. Counter-based Profiling and Instrumentation David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu

  45. Counters are used to: • Identify Performance Bottlenecks • especially unpredictable dynamic stalls, e.g., cache misses, branch mispredicts, TLB misses, etc. • complex out-of-order processors make this difficult • Guide Optimizations • help programmers understand and improve code • automatic, profile-driven optimizations • Profile Production Workloads • low overhead • transparent • profile whole system

  46. Performance Counters • Interfaced through a device driver and supporting GUI (e.g., VTune) • Counters increment based on a set of events of interest (e.g., cache misses, pipeline stalls) • Interrupt will occur that signals that the counter has overflowed • An interrupt service routine reads the counter information and tags it to a program counter (PC) value • Information is then available for offline analysis
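
The overflow-and-sample mechanism above can be modelled in a few lines. This is a simulation under stated assumptions, not driver code: the `sample_profile` function, the `(pc, events)` stream format, and the reload-by-subtraction behaviour are all hypothetical. The counter accumulates events, and each overflow is attributed to the PC that is current when the "interrupt" fires.

```python
from collections import Counter


def sample_profile(event_stream, overflow_threshold):
    """Attribute each counter overflow to the PC executing when it occurs.

    event_stream yields (pc, events) pairs, where `events` is how many counted
    events (e.g. cache misses) that instruction caused.
    """
    counter = 0
    samples = Counter()
    for pc, events in event_stream:
        counter += events
        while counter >= overflow_threshold:
            samples[pc] += 1               # the "ISR" tags the overflow with the current PC
            counter -= overflow_threshold  # re-arm the counter for the next sample
    return samples
```

In the usage below, two instructions each cause half of the events, yet with an even threshold every sample lands on the second one. This is a small instance of why sampled counts can mislead when used to attribute events to individual instructions.

```python
stream = [(0x10, 1), (0x14, 0), (0x18, 1)] * 50
profile = sample_profile(stream, overflow_threshold=10)
```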

  47. Performance Counters • Low overhead method for obtaining performance and profiling information • Typically less than 5% slowdown • Requires no modification of the binary • May require root level access to system • Lacks precision in cause/effect analysis • Come for free on most ISAs • Commonly used today to measure performance and estimate power usage

  48. Counter Library • A number of counter libraries are available to provide an API to program and access common architectures • Rabbit • for Intel/AMD Processors and Linux • URL: www.scl.ameslab.gov/Projects/Rabbit/ • PAPI • Linux IA32, IA64 • Allows counters to be captured on a per thread basis • URL: icl.cs.utk.edu/projects/papi/

  49. Counters available on different ISAs

  50. Events countable on different ISAs
