Memory Management & Multi-Thread

Performance study of CMS Reco: tcmalloc, pool allocator
17 July '08, High Performance Computing for High Energy Physics
V.I. -- Performance Task Force

Memory Management & Multi-Thread
Multi-threading adds new freedom to the lifestyles of objects…
• Object lifetime restricted to a single thread
  • Event parallelism
• Objects created in one thread, consumed in (all) others
  • Conditions, geometry
• Objects fully shared in read/write among many threads
  • Real-time/interactive applications, parallel algorithms
• Processor and kernel architecture play a big role
  • NUMA, caches, pages, TLBs, dynamic loader
• A single allocation strategy may not fit all sizes!

Memory management tools
• Thread-Caching Malloc (by Google): http://goog-perftools.sourceforge.net/doc/tcmalloc.html
• Boost pool allocator: http://www.boost.org/doc/libs/1_35_0/libs/pool/doc/concepts.html

Installing TCMalloc
• Now all applications (including the shell!) will use tcmalloc in place of malloc

Boost pool allocator
• In Boost for a long time
  • The "interface" may need to be revamped…
• Simple local allocator (useful to allocate working space)
• Singleton (sic) used as std::allocator
  • Uses a mutex for sharing among threads (extremely slow!)
  • Easy, but error-prone, to transform the singleton into TSS (thread-specific storage)
  • Application has to guarantee the container is thread-local
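The idea of replacing the mutex-protected singleton with per-thread storage can be sketched as follows. This is a minimal illustration only, not the home-made TSS version of the Boost pool allocator: it uses the C++11 thread_local keyword and plain operator new, and the names ThreadLocalPool, TlsPoolAllocator, PooledMap and fill_maps_in_threads are hypothetical. It assumes element types that need no more than default alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <new>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size block pool. Each thread owns its own State,
// so allocate/deallocate need no mutex.
template <std::size_t BlockSize>
class ThreadLocalPool {
    union Node {
        Node* next;  // link used while the block sits on the free list
        alignas(std::max_align_t) unsigned char storage[BlockSize];
    };
    struct State {
        Node* free_list = nullptr;
        std::vector<void*> chunks;  // raw chunks, released at thread exit
        ~State() { for (void* c : chunks) ::operator delete(c); }
    };
    static State& state() {
        thread_local State s;  // one instance per thread
        return s;
    }
public:
    static void* allocate() {
        State& s = state();
        if (s.free_list == nullptr) {  // refill from a fresh chunk
            const std::size_t blocks = 256;
            Node* chunk = static_cast<Node*>(::operator new(blocks * sizeof(Node)));
            s.chunks.push_back(chunk);
            for (std::size_t i = 0; i < blocks; ++i) {
                chunk[i].next = s.free_list;
                s.free_list = &chunk[i];
            }
        }
        Node* n = s.free_list;
        s.free_list = n->next;
        return n;
    }
    static void deallocate(void* p) {
        assert(p != nullptr);
        State& s = state();
        Node* n = static_cast<Node*>(p);
        n->next = s.free_list;  // push the block back on this thread's list
        s.free_list = n;
    }
};

// Minimal std-compatible allocator: single objects come from the pool,
// array requests fall back to operator new.
template <typename T>
struct TlsPoolAllocator {
    using value_type = T;
    TlsPoolAllocator() noexcept = default;
    template <typename U>
    TlsPoolAllocator(const TlsPoolAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        if (n == 1) return static_cast<T*>(ThreadLocalPool<sizeof(T)>::allocate());
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t n) noexcept {
        if (n == 1) ThreadLocalPool<sizeof(T)>::deallocate(p);
        else ::operator delete(p);
    }
};
template <typename T, typename U>
bool operator==(const TlsPoolAllocator<T>&, const TlsPoolAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const TlsPoolAllocator<T>&, const TlsPoolAllocator<U>&) noexcept { return false; }

using PooledMap = std::map<int, int, std::less<int>,
                           TlsPoolAllocator<std::pair<const int, int>>>;

// Each worker creates, fills and destroys its map inside one thread.
inline std::size_t fill_maps_in_threads(int threads, int entries) {
    std::vector<std::size_t> sizes(threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&sizes, t, entries] {
            PooledMap m;
            for (int i = 0; i < entries; ++i) m[i] = i;
            sizes[t] = m.size();
        });
    for (auto& w : workers) w.join();
    std::size_t total = 0;
    for (std::size_t s : sizes) total += s;
    return total;
}
```

The error-prone part is visible in the sketch: nothing stops a PooledMap from being handed to another thread, in which case its nodes would be returned to the wrong pool. Keeping the container thread-local remains the application's responsibility.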

Simple tests of tcmalloc
• Simple tests using small programs are not conclusive
• Never MUCH better than malloc (actually never better!)
• Observed 3x worse timing for a naive multi-threaded creation of std::map!
  • It seems to migrate memory to the transfer cache for no reason…
  • Recovers when using a home-made TSS version of the Boost pool allocator (the default one uses a singleton with a mutex!)
• Managed to allocate hugepages (2 MB) (needs sudo!)
  • Requires the application to allocate memory in chunks of 2 MB
  • Easy with the Boost pool allocator
• Small tests are (at best) inconclusive
  • Need a more realistic test bench
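One way an application could request memory in 2 MB chunks on Linux is sketched below. It is an assumption for illustration, not the method used in these tests: it calls mmap with the MAP_HUGETLB flag where that flag exists and falls back to ordinary pages when no huge pages are available. The names kHugePageSize, Chunk, round_up_to_huge_page, allocate_chunk and release_chunk are hypothetical. Huge pages still have to be reserved by an administrator beforehand.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

constexpr std::size_t kHugePageSize = 2u * 1024u * 1024u;  // 2 MB

struct Chunk {
    void* ptr;         // start of the mapping, nullptr on failure
    std::size_t size;  // mapped size, a multiple of 2 MB
    bool huge;         // true if backed by huge pages
};

// Round a request up to a whole number of 2 MB pages.
inline std::size_t round_up_to_huge_page(std::size_t bytes) {
    assert(bytes > 0);
    return (bytes + kHugePageSize - 1) / kHugePageSize * kHugePageSize;
}

inline Chunk allocate_chunk(std::size_t bytes) {
    const std::size_t size = round_up_to_huge_page(bytes);
    const int prot = PROT_READ | PROT_WRITE;
    const int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_HUGETLB
    // First try a huge-page mapping; this fails if none are reserved.
    void* p = mmap(nullptr, size, prot, flags | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) return {p, size, true};
#endif
    // Fallback: ordinary pages, same 2 MB granularity.
    void* q = mmap(nullptr, size, prot, flags, -1, 0);
    if (q == MAP_FAILED) return {nullptr, 0, false};
    return {q, size, false};
}

inline void release_chunk(const Chunk& c) {
    if (c.ptr != nullptr) munmap(c.ptr, c.size);
}
```

A pool allocator fits this constraint naturally, because it already obtains memory from the system in large chunks and carves small blocks out of them.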

A few results from a simple test
• Allocate/deallocate 50K times in 6 threads
  • a map<int,int> with 1000 entries
  • 10K small (3 floats) objects
• The overly repetitive pattern may hide additional complexity
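A workload of this shape could look roughly like the sketch below. It is not the original test program: the slide does not say whether the map and the small objects were exercised together or separately, so the sketch assumes both are built and destroyed in every iteration. The names Small, one_iteration and run_test are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <thread>
#include <vector>

struct Small { float x, y, z; };  // "small (3 floats) object"

// Build and destroy one map and one batch of small heap objects.
// Returns the number of elements created, so the work is observable.
inline std::size_t one_iteration(int map_entries, int small_objects) {
    std::map<int, int> m;
    for (int i = 0; i < map_entries; ++i) m[i] = i;

    std::vector<std::unique_ptr<Small>> objs;
    objs.reserve(small_objects);
    for (int i = 0; i < small_objects; ++i)
        objs.push_back(std::unique_ptr<Small>(new Small{1.f, 2.f, 3.f}));

    return m.size() + objs.size();
}  // everything is deallocated here

// Run the same iteration repeatedly in several threads.
inline std::size_t run_test(int threads, int iterations,
                            int map_entries, int small_objects) {
    assert(threads > 0);
    std::vector<std::size_t> counts(threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&counts, t, iterations, map_entries, small_objects] {
            for (int i = 0; i < iterations; ++i)
                counts[t] += one_iteration(map_entries, small_objects);
        });
    for (auto& w : workers) w.join();

    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    return total;
}
```

With the parameters quoted above the call would be run_test(6, 50000, 1000, 10000), wrapped in a timer and repeated with each allocator under study. Every iteration requests the same sizes in the same order, which is exactly the repetitive pattern the slide warns about.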

TCMalloc for 32 bit on x86_64
• The solution I found to use tcmalloc with a 32-bit application on x86_64 is to first initialize the environment to use the 32-bit compiler and libraries (as in current CMSSW) and then use "linux32"
• Does not work with hugepages…

Measuring CMS reconstruction
• Stefan has installed perfmon on lxbuild118
• I used CMSSW 209
• Run the preparatory simulation steps
  • cmsDriver.py B_JETS -s STEP -n 100 with STEP = GEN, SIM, DIGI
• Then run pfmon_deluxe.py cmsDriver.py B_JETS -s RECO -n 100
• Repeated with tcmalloc (32 bit) preloaded

Perfmon counters

Perfmon ratio
