
Memory Management & Multi-Thread

Performance study of CMS Reco: tcmalloc, pool allocator. 17 July ‘08, High Performance Computing for High Energy Physics.



Presentation Transcript


  1. Performance study of CMS Reco: tcmalloc, pool allocator. 17 July ‘08, High Performance Computing for High Energy Physics. V.I. -- Performance Task Force

  2. Memory Management & Multi-Thread
     Multi-threading adds new freedom to the lifestyles of objects…
     • Object lifetime restricted to a single thread
       • Event parallelism
     • Objects created in one thread, consumed in (all) others
       • Conditions, geometry
     • Objects fully shared in read/write among many threads
       • Real-time/interactive applications, parallel algorithms
     • Processor and kernel architecture play a big role
       • NUMA, caches, pages, TLBs, dynamic loader
     • A single allocation strategy may not fit all sizes!

  3. Memory management tools
     • Thread-Caching Malloc (by Google): http://goog-perftools.sourceforge.net/doc/tcmalloc.html
     • Boost pool allocator: http://www.boost.org/doc/libs/1_35_0/libs/pool/doc/concepts.html

  4. Installing TCMalloc
     Now all applications (including the shell!) will use tcmalloc in place of malloc

  5. Boost pool allocator
     • In Boost for a long time
     • The “interface” may need to be revamped…
     • Simple local allocator (useful for allocating working space)
     • Singleton (sic) used as std::allocator
       • Uses a mutex for sharing among threads (extremely slow!)
       • Easy, but error-prone, to transform the singleton into TSS (thread-specific storage)
       • The application has to guarantee the container is thread-local
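The idea of replacing a mutex-protected singleton pool with one pool per thread can be sketched as follows. This is a minimal illustration, not the author's home-made version and not the Boost.Pool interface: `FixedPool`, `local_pool`, `TlsPoolAllocator`, `PooledMap` and `build_pooled_map` are hypothetical names, and C++11 `thread_local` stands in for whatever TSS mechanism was actually used. It assumes element types need no more than `alignof(std::max_align_t)` alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <new>
#include <utility>
#include <vector>

// Hypothetical fixed-size block pool with an intrusive free list.
// No locking: each instance is meant to be used by one thread only.
class FixedPool {
public:
    explicit FixedPool(std::size_t block_size, std::size_t blocks_per_chunk = 1024)
        : block_size_(round_up(block_size)), blocks_per_chunk_(blocks_per_chunk) {}
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;
    ~FixedPool() {
        for (void* chunk : chunks_) ::operator delete(chunk);
    }

    void* allocate() {
        if (free_list_ == nullptr) grow();
        Node* node = free_list_;
        free_list_ = node->next;
        return node;
    }
    void deallocate(void* p) noexcept {
        assert(p != nullptr);
        free_list_ = new (p) Node{free_list_};  // push the block onto the free list
    }
    std::size_t chunk_count() const { return chunks_.size(); }

private:
    struct Node { Node* next; };

    // Blocks must hold a Node and keep every block suitably aligned.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        if (n < sizeof(Node)) n = sizeof(Node);
        return (n + a - 1) / a * a;
    }
    void grow() {
        chunks_.reserve(chunks_.size() + 1);  // so push_back below cannot throw
        char* chunk = static_cast<char*>(::operator new(block_size_ * blocks_per_chunk_));
        chunks_.push_back(chunk);
        for (std::size_t i = 0; i < blocks_per_chunk_; ++i) {
            deallocate(chunk + i * block_size_);
        }
    }

    std::size_t block_size_;
    std::size_t blocks_per_chunk_;
    Node* free_list_ = nullptr;
    std::vector<void*> chunks_;
};

// One pool per thread and per block size: the TSS replacement for a singleton.
template <std::size_t Size>
FixedPool& local_pool() {
    thread_local FixedPool pool(Size);
    return pool;
}

// Minimal std-compatible allocator on top of the thread-local pool.
// Caveat from the slide: the container must stay in the thread that created it,
// otherwise blocks return to the wrong pool or outlive their pool.
template <typename T>
struct TlsPoolAllocator {
    using value_type = T;
    TlsPoolAllocator() = default;
    template <typename U>
    TlsPoolAllocator(const TlsPoolAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n == 1) return static_cast<T*>(local_pool<sizeof(T)>().allocate());
        return static_cast<T*>(::operator new(n * sizeof(T)));  // arrays bypass the pool
    }
    void deallocate(T* p, std::size_t n) noexcept {
        if (n == 1) local_pool<sizeof(T)>().deallocate(p);
        else ::operator delete(p);
    }
};
template <typename T, typename U>
bool operator==(const TlsPoolAllocator<T>&, const TlsPoolAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const TlsPoolAllocator<T>&, const TlsPoolAllocator<U>&) { return false; }

using PooledMap =
    std::map<int, int, std::less<int>, TlsPoolAllocator<std::pair<const int, int>>>;

// Builds and destroys a map whose nodes come from the calling thread's pool.
std::size_t build_pooled_map(int entries) {
    PooledMap m;
    for (int i = 0; i < entries; ++i) m.emplace(i, 2 * i);
    return m.size();
}
```

Calling `build_pooled_map(1000)` from several threads exercises one independent pool per thread with no lock on the allocation path; the price is the thread-locality guarantee that the application has to provide.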

  6. Simple tests of tcmalloc
     • Simple tests using small programs are not conclusive
     • Never MUCH better than malloc (actually never better!)
     • Observed 3x worse timing for a naïve multi-threaded creation of std::map!
       • It seems to migrate memory to the transfer cache for no reason…
       • Recovers when using a home-made TSS version of the Boost pool allocator (the default one uses a singleton with a mutex!)
     • Managed to allocate hugepages (2 MB) (needs sudo!)
       • Requires the application to allocate memory in chunks of 2 MB
       • Easy with the Boost pool allocator
     • Small tests are (at best) inconclusive
     • Need a more realistic test bench

  7. A few results from a simple test
     • Allocate/deallocate 50K times in 6 threads
       • a map<int,int> with 1000 entries
       • 10K small (3 floats) objects
     • May hide additional complexity because the pattern is too repetitive
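A test of this shape could look roughly like the sketch below. It is not the program used for the talk: `Small`, `churn`, `Result` and `run_threads` are hypothetical names, and building both the map and the small objects in the same iteration is an assumption (the two workloads may well have been measured separately). Setting either count to zero runs only the other workload.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <thread>
#include <vector>

struct Small { float x, y, z; };  // "small (3 floats) object"

// One thread's workload: repeatedly build and tear down the containers.
std::size_t churn(int iterations, int map_entries, int small_objects) {
    std::size_t checksum = 0;
    for (int it = 0; it < iterations; ++it) {
        std::map<int, int> m;
        for (int i = 0; i < map_entries; ++i) m.emplace(i, i);

        std::vector<Small*> objs;
        objs.reserve(static_cast<std::size_t>(small_objects));
        for (int i = 0; i < small_objects; ++i) objs.push_back(new Small{1.f, 2.f, 3.f});

        checksum += m.size() + objs.size();  // keeps the work observable
        for (Small* p : objs) delete p;
    }  // the map is freed here, node by node
    return checksum;
}

struct Result {
    double seconds;        // wall-clock time for all threads
    std::size_t checksum;  // total number of elements created
};

// Runs the same workload in `threads` threads and times the whole batch.
Result run_threads(int threads, int iterations, int map_entries, int small_objects) {
    assert(threads > 0);
    std::vector<std::size_t> sums(static_cast<std::size_t>(threads), 0);
    std::vector<std::thread> workers;

    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&sums, t, iterations, map_entries, small_objects] {
            sums[static_cast<std::size_t>(t)] = churn(iterations, map_entries, small_objects);
        });
    }
    for (std::thread& w : workers) w.join();
    const auto stop = std::chrono::steady_clock::now();

    Result r{std::chrono::duration<double>(stop - start).count(), 0};
    for (std::size_t s : sums) r.checksum += s;
    return r;
}
```

With the slide's parameters the call would be `run_threads(6, 50000, 1000, 10000)`, run once with the default malloc and once with the alternative allocator preloaded or plugged in. As the slide warns, such a repetitive pattern may hide complexity that a real application would expose.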

  8. TCMalloc for 32-bit on xa64
     • The solution I found for using tcmalloc with a 32-bit application on xa64 is to first initialize the environment to use the 32-bit compiler and libraries (as in current CMSSW) and then use “linux32”
     • Does not work with hugepages…

  9. Measuring CMS reconstruction
     • Stefan has installed perfmon on lxbuild118
     • I used CMSSW 209
     • Ran the preparatory simulation steps
       • cmsDriver.py B_JETS -s STEP -n 100, with STEP = GEN, SIM, DIGI
     • Then ran pfmon_deluxe.py cmsDriver.py B_JETS -s RECO -n 100
     • Repeated with tcmalloc preloaded (32-bit)

  10. Perfmon counters

  11. Perfmon ratio
