
Ubiquitous Memory Introspection (UMI)



Presentation Transcript


  1. Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS; Rodric Rabbah, IBM; Saman Amarasinghe, MIT; Larry Rudolph, MIT; Weng-Fai Wong, NUS. CGO 2007, March 14, 2007, San Jose, CA

  2. The complexity crunch hardware complexity ↑ + software complexity ↑ ⇓ too many things happening concurrently, complex interactions and side-effects ⇓ less understanding of program execution behavior ↓

  3. The importance of program behavior characterization • S/W developer: computation vs. communication ratio, function/path frequency, test coverage • H/W designer: cache behavior, branch prediction • System administrator: interaction between processes. Better understanding of program characteristics can lead to more robust software, and more efficient hardware and systems

  4. Common approaches to program understanding • Space and time overhead • Coarse grained summaries vs. instruction-level and contextual information • Portability and ability to adapt to different uses

  5. Common approaches to program understanding

  6. Common approaches to program understanding

  7. Slowdown due to HW counters (counting L1 misses for 181.mcf on P4) — bar chart of time in seconds, with slowdowns of >2000%, 325%, 35%, 10%, 1%, and 1% across configurations

  8. Common approaches to program understanding

  9. Common approaches to program understanding: UMI

  10. Key components • Dynamic Binary Instrumentation • Complete coverage, transparent, language independent, versatile, … • Bursty Simulation • Sampling and fast forwarding techniques • Detailed context information • Reasonable extrapolation and prediction

  11. Ubiquitous Memory Introspection online mini-simulations analyze short memory access profiles recorded from frequently executed code regions • Key concepts • Focus on hot code regions • Selectively instrument instructions • Fast online mini-simulations • Actionable profiling results for online memory optimizations

  12. Working prototype • Implemented in DynamoRIO • Runs on Linux • Used on Intel P4 and AMD K7 • Benchmarks • SPEC 2000, SPEC 2006, Olden • Server apps: MySQL, Apache • Desktop apps: Acrobat-reader, MEncoder

  13. UMI is cheap and non-intrusive (SPEC2K reference workloads on P4) • Average overhead is 14% • 1% more than DynamoRIO alone (bar chart: % slowdown compared to native execution, 0-80%, for DynamoRIO and UMI across SPEC 2000 and Olden benchmarks)

  14. What can UMI do for you? • Inexpensive introspection everywhere • Coarse grained memory analysis • Quick and dirty • Fine grained memory analysis • Expose opportunities for optimizations • Runtime memory-specific optimizations • Pluggable prefetching, learning, adaptation

  15. Coarse grained memory analysis • Experiment: measure cache misses in three ways • HW counters • Full cache simulator (Cachegrind) • UMI • Report correlation between measurements • Correlation measures the linear relationship of two sets of data: 1 = strong positive correlation, 0 = no correlation, -1 = strong negative correlation
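The correlation used here is the standard Pearson coefficient over per-benchmark miss counts; a minimal sketch (the sample values below are made up for illustration):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples.

    Returns a value in [-1, 1]: 1 = strong positive correlation,
    0 = no correlation, -1 = strong negative correlation.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-benchmark miss counts from two measurement methods:
hw = [10.0, 20.0, 30.0, 40.0]
umi = [11.0, 19.0, 33.0, 41.0]
print(pearson(hw, umi))  # close to 1: strong positive correlation
```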

  16. Cache miss correlation results • HW counter vs. Cachegrind: 0.99 correlation, but 20x to 100x slowdown • HW counter vs. UMI: 0.88 correlation, less than 2x slowdown in worst case

  17. What can UMI do for you? • Inexpensive introspection everywhere • Coarse grained memory analysis • Quick and dirty • Fine grained memory analysis • Expose opportunities for optimizations • Runtime memory-specific optimizations • Pluggable prefetching, learning, adaptation

  18. Fine grained memory analysis • Experiment: predict delinquent loads using UMI • Individual loads with cache miss rate greater than threshold • Delinquent load set determined according to full cache simulator (Cachegrind) • Loads that contribute 90% of total cache misses • Measure and report two metrics (Venn diagram: delinquent loads predicted by UMI vs. delinquent loads identified by Cachegrind; the overlap determines recall, the non-overlap determines false positives)
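The two metrics reduce to simple set arithmetic over load sites; a hedged sketch (the load PCs are hypothetical):

```python
def delinquency_metrics(predicted, reference):
    """Compare a set of predicted delinquent loads against a reference
    set (e.g., the loads a full simulator blames for 90% of misses).

    recall          = |predicted ∩ reference| / |reference|
    false positive  = |predicted - reference| / |predicted|
    """
    predicted, reference = set(predicted), set(reference)
    recall = len(predicted & reference) / len(reference)
    false_pos = len(predicted - reference) / len(predicted)
    return recall, false_pos

# Hypothetical load PCs from UMI and from Cachegrind:
r, fp = delinquency_metrics({0x400a, 0x400b, 0x400c},
                            {0x400a, 0x400b, 0x400d})
# r = 2/3 (two of three reference loads found), fp = 1/3
```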

  19. UMI delinquent load prediction accuracy

  20. What can UMI do for you? • Inexpensive introspection everywhere • Coarse grained memory analysis • Quick and dirty • Fine grained memory analysis • Expose opportunities for optimizations • Runtime memory-specific optimizations • Pluggable prefetcher, learning, adaptation

  21. Experiment: online stride prefetching • Use results of delinquent load prediction • Discover stride patterns for delinquent loads • Insert instructions to prefetch data to L2 • Compare runtime for UMI and P4 with HW stride prefetcher
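A stride discovery pass of this kind can be sketched as a scan over a delinquent load's recorded address profile (function name and repeat threshold are illustrative, not the paper's implementation):

```python
def detect_stride(addresses, min_repeats=3):
    """Look for a constant stride in a load's address profile.

    Returns the stride if the same delta between consecutive addresses
    repeats at least `min_repeats` times in a row; otherwise None.
    A runtime optimizer could then insert prefetches for
    addr + k*stride ahead of the load.
    """
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    run = 0
    for i in range(1, len(deltas)):
        run = run + 1 if deltas[i] == deltas[i - 1] else 0
        if run + 1 >= min_repeats:
            return deltas[i]
    return None

# A load walking an array of 64-byte records:
print(detect_stride([0x1000, 0x1040, 0x1080, 0x10c0, 0x1100]))  # 64
```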

  22. Data prefetching results summary — bar chart of running time normalized to native execution (lower is better) for three configurations: SW prefetching + DynamoRIO + UMI, HW prefetching + native binary, and combined prefetching + DynamoRIO + UMI, over 171.swim, 172.mgrid, 179.art, 181.mcf, 183.equake, 188.ammp, 191.fma3d, 301.apsi, em3d, ft, mst, and their average. More than offsets slowdowns from binary instrumentation!

  23. The Gory Details

  24. UMI components • Region selector • Instrumentor • Profile analyzer

  25. Region selector • Identify representative code regions • Focus on traces, loops • Frequently executed code • Piggy back on binary instrumentor tricks • Reinforce with sampling • Time based, or leverage HW counters • Naturally adapt to program phases
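One plausible way to piggy-back region selection on trace execution counters — a sketch under assumed names, not DynamoRIO's actual API:

```python
from collections import defaultdict

class RegionSelector:
    """Hypothetical sketch: pick hot traces for a profiling burst when
    their execution counter crosses a sampling threshold; resetting the
    counter lets selection naturally re-adapt to program phases."""

    def __init__(self, threshold=64):
        self.threshold = threshold
        self.counts = defaultdict(int)

    def on_trace_exec(self, trace_id):
        """Called on each trace entry; returns True when the trace
        should be (re)instrumented for a short profiling burst."""
        self.counts[trace_id] += 1
        if self.counts[trace_id] >= self.threshold:
            self.counts[trace_id] = 0
            return True
        return False
```

With `threshold=3`, a trace triggers a burst on every third execution, so profiling cost stays proportional to sampling frequency rather than to total execution count.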

  26. Instrumentor • Insert instructions to record address referenced by memory operation • Manage profiling overhead • Clone code trace (akin to Arnold-Ryder scheme) • Selective instrumentation of memory operations • E.g., ignore stack and static data

  27. Recording profiles (diagram: each code trace T1, T2 carries a counter and an early trace exit; its memory operations append referenced addresses to a per-trace address profile; page protection detects profile overflow)

  28. Mini-simulator • Triggered when code or address profile is full • Simple cache simulator • Currently simulates the L2 cache of the host • LRU replacement • Improve approximations with techniques similar to offline fast-forwarding simulators • Warm-up and periodic flushing • Other possible analyzers • Reference affinity model • Data reuse and locality analysis
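A mini-simulator of this sort can be a few dozen lines; here is a sketch of a set-associative LRU cache replayed over a recorded address profile (the geometry parameters are illustrative, not the paper's actual L2 configuration):

```python
from collections import OrderedDict

class MiniCacheSim:
    """Tiny set-associative cache model with LRU replacement, in the
    spirit of an online mini-simulation over short address profiles."""

    def __init__(self, sets=1024, ways=8, line=64):
        self.ways, self.line, self.nsets = ways, line, sets
        self.sets = [OrderedDict() for _ in range(sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        tag = addr // self.line          # line-granularity tag
        s = self.sets[tag % self.nsets]  # index into the cache set
        if tag in s:
            s.move_to_end(tag)           # LRU update on hit
            self.hits += 1
        else:
            self.misses += 1
            s[tag] = True
            if len(s) > self.ways:
                s.popitem(last=False)    # evict least recently used

    def replay(self, profile):
        for addr in profile:
            self.access(addr)
        return self.misses
```

Replaying each burst's address profile through such a model yields per-load miss estimates without any hardware counters or full-program simulation.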

  29. Mini-simulations and parameter sensitivity: 181.mcf (chart: recall ratio, false positive ratio, and normalized performance vs. sampling frequency threshold from 1 to 1024) • Regular data structures • If sampling threshold too high, starts to exceed loop bounds: miss out on profiling important loops • Adaptive threshold is best

  30. Mini-simulations and parameter sensitivity: 197.parser (chart: recall ratio, false positive ratio, and normalized performance vs. address profile length from 64 to 32K) • Irregular data structures • Need longer profiles to reach useful conclusions

  31. Summary • UMI is lightweight and has a low overhead • 1% more than DynamoRIO • Can be done with Pin, Valgrind, etc. • No added hardware necessary • No synchronization or syscall headaches • Other cores can do real work! • Practical for extracting detailed information • Online and workload specific • Instruction level memory reference profiles • Versatile and user-programmable analysis • Facilitate migration of offline memory optimizations to online setting

  32. Future work • More types of online analysis • Include global information • Incremental (leverage previous execution info) • Combine analysis across multiple threads • More runtime optimizations • E.g., advanced prefetch optimization • Hot data stream prefetch • Markov prefetcher • Locality enhancing data reorganization • Pool allocation • Cooperative allocation between different threads • Your ideas here…
