
A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling

A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling. Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang, and Guangming Tan. Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). ISPASS 2012, April 2, 2012.




Presentation Transcript


  1. A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling • Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang, and Guangming Tan • Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS) • ISPASS 2012, April 2, 2012

  2. Background • Memory behavior is a key factor in program performance. • Understanding memory behavior is important for identifying bottlenecks in both the architecture and the application (done by memory profiling). • For example, • The TLB is an essential component of the memory system • Applications’ working sets tend to grow larger and larger, leading to frequent TLB misses • Study 1: TLB misses can degrade system performance by 5~14% [Bhargava’08] • Study 2: a large number of TLB misses in multi-threaded programs are redundant and predictable, which implies optimization potential [Bhattacharjee’08]

  3. Memory Profiling • Memory profiling collects memory behavior information during program execution. • Profiling can be performed for • different hardware components: TLB/Cache/DRAM • at different software levels: Function, Objects (Array, List etc.), Application, Whole System

  4. Object Memory Profiling • An object is a group of data stored as a unit [Wu’04] • Object-level traces distinguish regular patterns from mixed and irregular traces • Valuable for optimization • Memory trace compression • Data layout • Object-level prefetching • Cache partition [Soft-OLP, PACT 2009] [Figure labels: Object Trace, Application Traces, Whole System Traces; Irregular, Regular]

  5. Current Profiling Approaches • Existing approaches • Compiler-driven: requires re-compiling/re-linking and source code • Instrumentation: heavy overhead • Simulation: accuracy problems, slow • Performance counters: lack detailed information • None of them can observe page table walks caused by TLB misses • We propose a hybrid hardware/software approach for object memory profiling • Accurate: real application & real system • Lightweight • Tracks page table walks at object level

  6. Outline • Background • Design and Implementation • Experimental Results • Conclusion

  7. An Overview [Figure: a physical address trace is translated into a virtual address trace and then into an object access pattern; in the example, the virtual addresses 0x1f05000, 0x1f06000, 0x1f07000, …… belong to the object Matrix (VA: 0x1f05000)]

  8. HMTT • Hybrid Memory Trace Toolkit • A DDR3 SDRAM compatible memory trace monitoring system • Adopts hardware snooping technology • Memory Trace: <time_stamp, r/w, phy_addr> • Advantages: • Platform independent • Negligible overhead • Full-system real memory traces, including OS, page table walks [Photo labels: PCIE cable connector; DIMM plugged on the other side]

  9. Challenge (1) • How to translate the physical address trace into the virtual address trace of a specific process? • Modify the OS kernel to obtain the page table • Look up each phy_addr in the dumped page table • Generate the virtual address trace of each process
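
The lookup step can be illustrated with a minimal Python sketch. It assumes 4 KB pages and models the dumped page table as a list of (pid, virtual address, physical address) tuples; the names `PAGE_SHIFT`, `build_reverse_table`, and `phys_to_virt`, and the pid 1234, are hypothetical and not part of HMTT. Frames shared by several processes are not handled here.

```python
PAGE_SHIFT = 12                      # assumption: 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def build_reverse_table(dumped_entries):
    """Build {physical frame number: (pid, virtual page number)}.

    dumped_entries is an iterable of (pid, virt_addr, phys_addr) tuples,
    a hypothetical stand-in for the page table dumped from the kernel.
    """
    table = {}
    for pid, virt_addr, phys_addr in dumped_entries:
        table[phys_addr >> PAGE_SHIFT] = (pid, virt_addr >> PAGE_SHIFT)
    return table

def phys_to_virt(table, phys_addr):
    """Return (pid, virt_addr) for a traced physical address, or None if unmapped."""
    entry = table.get(phys_addr >> PAGE_SHIFT)
    if entry is None:
        return None
    pid, vpn = entry
    # Keep the page offset, swap the frame number for the virtual page number.
    return pid, (vpn << PAGE_SHIFT) | (phys_addr & PAGE_MASK)

# Illustrative mapping: one page of process 1234.
reverse_table = build_reverse_table([(1234, 0x1F05000, 0x398F000)])
```

With this table, `phys_to_virt(reverse_table, 0x398F24A)` yields the owning pid together with the virtual address that shares the same page offset.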

  10. Challenge (2) • How to synchronize hardware and software when a page table update occurs in the kernel? • Physical page allocation/free in the kernel • Trigger annotations in the OS VM module • Update the dumped page table • Send a sync_tag to the hardware
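
Why the ordering matters can be shown with a small Python sketch. It assumes the hardware trace and the kernel's page table updates have already been merged into one ordered event stream, with each update sitting at the position of its sync_tag; the event shapes and the function `replay` are hypothetical, and 4 KB pages are assumed.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages

def replay(events):
    """Translate accesses in order, applying page table updates at each sync point.

    events is a hypothetical merged stream:
      ("map", pid, virt_addr, phys_addr)   page allocated, logged with a sync_tag
      ("unmap", phys_addr)                 page freed, logged with a sync_tag
      ("access", phys_addr)                ordinary traced memory access
    """
    frame_to_page = {}
    translated = []
    for ev in events:
        kind = ev[0]
        if kind == "map":
            _, pid, virt, phys = ev
            frame_to_page[phys >> PAGE_SHIFT] = (pid, virt >> PAGE_SHIFT)
        elif kind == "unmap":
            frame_to_page.pop(ev[1] >> PAGE_SHIFT, None)
        else:
            phys = ev[1]
            hit = frame_to_page.get(phys >> PAGE_SHIFT)
            if hit is None:
                translated.append(None)
            else:
                pid, vpn = hit
                offset = phys & ((1 << PAGE_SHIFT) - 1)
                translated.append((pid, (vpn << PAGE_SHIFT) | offset))
    return translated

# The same physical frame is freed by process 1 and reused by process 2.
result = replay([
    ("map", 1, 0x1F05000, 0x398F000),
    ("access", 0x398F010),
    ("unmap", 0x398F000),
    ("map", 2, 0x7000000, 0x398F000),
    ("access", 0x398F010),
])
```

The two accesses hit the same physical address but resolve to different processes and virtual addresses, which is only possible if updates are applied at the right point in the trace.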

  11. Challenge (3) • How to translate virtual addresses to objects without modifying source code? • The role of malloc() is to map VA to object • Use dynamic library overwrite to replace malloc() [Figure: in the virtual address space, the call matrix = malloc(0x1000) is replaced by matrix = mymalloc(0x1000), which records the object matrix in the Object-VA Mapping Table]
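
The mapping table itself can be modeled in a few lines of Python. This sketch does not perform the malloc() replacement; it only shows how recorded allocations could be searched by virtual address. The class `ObjectMap` and its methods are hypothetical, free() is not handled, and the address 0x1F05000 is used purely for illustration.

```python
import bisect

class ObjectMap:
    """Hypothetical model of an object-VA mapping table built from malloc() records."""

    def __init__(self):
        self._starts = []   # sorted start addresses, for binary search
        self._objs = []     # (start, size, name), kept in the same order

    def record_alloc(self, name, start, size):
        """Record that malloc() returned `start` for an object of `size` bytes."""
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._objs.insert(i, (start, size, name))

    def lookup(self, virt_addr):
        """Return (object name, offset within object), or None if no object matches."""
        i = bisect.bisect_right(self._starts, virt_addr) - 1
        if i < 0:
            return None
        start, size, name = self._objs[i]
        if virt_addr < start + size:
            return name, virt_addr - start
        return None

objects = ObjectMap()
objects.record_alloc("matrix", 0x1F05000, 0x1000)   # matrix = malloc(0x1000)
```

A sorted list with binary search keeps each lookup logarithmic in the number of live objects, which matters when every trace record has to be attributed.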

  12. Put them all together [Figure: the physical address trace, interleaved with sync_tag and page walk records, is translated through the Dumped Page Table into a virtual address trace and through the Object-VA Mapping Table into an object access pattern, e.g. Matrix (VA: 0x1f05000)] • Use the page table to distinguish three types of memory access • sync_tag → update the page table • Access to the page table itself → page table walk due to a TLB miss • Other memory access → virtual address
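
The three-way classification can be sketched in Python as follows. The record shapes, the function `classify`, and the example frame numbers are all hypothetical; the sketch assumes 4 KB pages and that the set of physical frames holding page table pages is known from the dumped page table.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def classify(record, page_table_frames, frame_to_vpn):
    """Classify one trace record as 'sync', 'page_walk', 'data', or 'unknown'.

    record: ("sync",) or ("mem", phys_addr)   (hypothetical record shapes)
    page_table_frames: set of physical frames that hold page table pages
    frame_to_vpn: {physical frame: virtual page number} from the dumped page table
    """
    if record[0] == "sync":
        return "sync", None                  # caller should update the page table
    phys = record[1]
    frame = phys >> PAGE_SHIFT
    if frame in page_table_frames:
        return "page_walk", phys             # access to the page table itself
    vpn = frame_to_vpn.get(frame)
    if vpn is None:
        return "unknown", phys               # frame not owned by the profiled process
    return "data", (vpn << PAGE_SHIFT) | (phys & PAGE_MASK)

# Illustrative state: one page table frame and one data frame.
pt_frames = {0x1AF}
data_frames = {0x398F: 0x1F05}
```

For example, `classify(("mem", 0x1AF4AA), pt_frames, data_frames)` is reported as a page walk, while an access to frame 0x398F is translated to a virtual address in page 0x1F05.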

  13. Evaluation Methodology

  14. Validation • For the SpMV benchmark (CSR): y = ax * xhost • Micro-benchmark: the error is less than 2% • Our system is able to distinguish regular access patterns from irregular ones

  15. Overhead • Two main sources of overhead: • Dumping page table traces: + dump_pt • Dumping object-VA mapping: + dump_obj • Monitoring only objects >= 4KB still covers most memory references [Chart callouts: <2%, <1%]

  16. Case Study 1: BFS (Breadth-First Search) • The column object accounts for about 71% of page walks → key object • Optimization: use huge pages for the column object • Speedup: about 12% for 8-thread, 8% for 128-thread [Chart callout: 8.18%]

  17. Case Study 2: Canneal (PARSEC) • Cache-aware simulated annealing (SA) to minimize the routing cost of a chip design • Two objects contribute most of the memory accesses: _elements and _location • The memory accesses remain almost unchanged as the number of threads increases.

  18. Case Study 2: Canneal • The _elements object contributes most of the increased page walks • Put the _elements object into huge pages to reduce TLB misses → Speedup: about 5% for 8-thread

  19. A Visual Demo of the HMTT

  20. Conclusion • We have designed and implemented a hybrid hardware/software approach to object-relative memory profiling. • Accurate: real application & real system • Lightweight • Tracks page table walks at object level • We present two case studies to show how the approach helps users better understand memory behavior and optimize performance. • We intend to use this approach to analyze virtual machines running on real machines.

  21. Thanks! & Questions?

  22. Extra Slides

  23. Memory Profiling Approaches Note: √-Yes, ×-No, *-Maybe

  24. Reverse Page Table • Physical address → pid, virtual address

  25. Validation • Access objects with different patterns: • a0: all read accesses, forward • a1: 3/4 read and 1/4 write accesses, forward • a2: 2/4 read and 2/4 write accesses, forward • a3: 1/4 read and 3/4 write accesses, backward • a4: all write accesses, backward • Size 256MB, access step 64B, requests: 4M [Figure labels: a0, a4]
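
The micro-benchmark parameters are self-consistent: 256 MB divided by a 64 B step gives 4M requests, i.e. one pass over each object. The Python sketch below generates such access sequences. The size, step, read/write ratios, and directions come from the slide; the generator `accesses` and the placement of writes within each group of four accesses are assumptions.

```python
SIZE = 256 * 1024 * 1024   # 256 MB per object, from the slide
STEP = 64                  # 64 B access step, from the slide
REQUESTS = SIZE // STEP    # one full pass over the object

# name: (writes per 4 accesses, direction), from the slide
PATTERNS = {
    "a0": (0, "forward"),
    "a1": (1, "forward"),
    "a2": (2, "forward"),
    "a3": (3, "backward"),
    "a4": (4, "backward"),
}

def accesses(name, limit=None):
    """Yield (op, byte offset) pairs for one pass over object `name`.

    Assumption: within every group of 4 accesses, the writes come first.
    `limit` truncates the pass, which is convenient for inspection.
    """
    writes_per_4, direction = PATTERNS[name]
    n = REQUESTS if limit is None else min(limit, REQUESTS)
    for i in range(n):
        idx = i if direction == "forward" else REQUESTS - 1 - i
        op = "w" if (i % 4) < writes_per_4 else "r"
        yield op, idx * STEP
```

Comparing the read/write mix and direction recovered from the profiled trace against these known sequences is one way such a validation could be checked.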

  26. HMTT Configuration Space • A reserved physical memory region • Can be accessed from both source code and binary code
