Hybrid (Software-Hardware) Dynamic Memory Allocator

Hybrid (Software-Hardware) Dynamic Memory Allocator Prepared by Mustafa Özgür Akduran İstanbul, 2006 Boğaziçi Üniversitesi CMPE 511, Fall 2006

Outline • Introduction • Related Research • Proposed Hybrid Allocator • Complexity and Performance Comparison • Conclusion • References CMPE 511, Fall 2006

Introduction • Need for • Efficient Implementation of Memory Management Functions • Memory Usage • Execution Performance Dynamic Memory Allocation (DMA) Garbage Collection Modern Programming Languages CMPE 511, Fall 2006

Introduction Dynamic Memory Management DMM Current Systems • Execution time spent on Memory Management is 42%. • Still important researches on • Good execution performance • Memory locality • How to get free chunks of memory ? • Software Allocator • Hardware Allocator Pure Software Low Cost Allocator CMPE 511, Fall 2006

Software Allocator Different Search Techniques to organize available chunks of free memory Disadvantage Search could be in the critical path of allocators causing a major performance bottleneck. Hardware Allocator Parallel Search Speed up Memory Allocation Improve Performance Hide execution latency of freeing objects Coalescing of free chunks of memory Disadvantage Potential Hardware Complexity Introduction CMPE 511, Fall 2006

Introduction A New Hybrid Software-Hardware Allocator Chang’s Hardware Allocator PHK (Poul-Henning Kamp) Allocation Algorithm used in Free-BSD System Aim is to balance the hardware complexity with performance by using both hardware and software together. CMPE 511, Fall 2006

Related Research PHK (Poul-Henning Kamp) Allocator Two most popular general purpose open source allocator 1. Doug Lea used in LINUX System 2. PHK used in Free-BSD System Difference between them is less than 3% for memory allocation intensive benchmarks in SPEC 2000 CPU. PHK Allocator chosen bacause of its suitability for hardware/software co-design. Free-BSD (Berkeley Software Distribution ) is an advanced operating system for x86 compatible (including Pentium® and Athlon™), architectures. It is derived from BSD, the version of UNIX® developed at the University of California, Berkeley. It is developed and maintained by a large team of individuals. CMPE 511, Fall 2006

Related Research PHK (Poul-Henning Kamp) Allocator • Page based allocator • Each page can only contain objects of one size • For a large object sufficient number of pages allocated • For small objects less than a half page, object size is padded to the nearest power of 2 • Allocator keeps a page directory for all allocated pages and at the beginning of each small object page, bitmap of allocation information is created • While allocating small objects, PHK Allocator performs a linear search on the bitmap to find the first available chunk in that page CMPE 511, Fall 2006

Related Research • Chang’s Hardware Allocator • Based on Buddy System invented by Knuth • The buddy memory allocation technique divides memory into partitions to satisfy a memory request as suitably as possible • This system makes use of splitting memory into halves to try to give a best-fit • Compared to the memory allocation techniques (such as paging) that modern OS such as MS Windows and Linux use, the buddy memory allocation is relatively easy to implement, and does not have the hardware requirement of a memory management unit • Chang’s algorithm is a first method based on a binary OR-tree and a binary AND-tree. CMPE 511, Fall 2006

Chang’s Hardware Allocator Each leaf node of the OR-tree represents base size of the smallest unit of memory that can be allocated The leaves of OR-tree together represent the entire memory AND-tree has the same number of leaves as the OR-tree Input of the AND-tree is generated by a complex interconnection network of the OR-tree Related Research Or Gates CMPE 511, Fall 2006

Chang’s Hardware Allocator Or-Tree Determine if there is a large enough space for allocation request AND-Tree Find the beginning address of that memory chunk Flip the bits corresponding to the memory chunk in the bit-vector Related Research Bit-vector CMPE 511, Fall 2006

Related Research CMPE 511, Fall 2006

Related Research • The interconnection between the OR-tree and the AND-tree is the most complex part of the Chang’s allocator • The interconnection has the same critical path delay as the OR/AND-tree • Final allocation result is produced by the output of the AND-tree through a set of multiplexers • The Hardware complexity, in terms of number of gates is O(n logn) # the memory chunks Critical path delay CMPE 511, Fall 2006

Proposed Hybrid Allocator Problems of hardware-software only allocators 1. Complexity of the hardware increases with the size of the memory managed 2. Poor object locality Pure hardware allocators based on buddy system Poor execution performance Software Allocators CMPE 511, Fall 2006

Proposed Hybrid Allocator • Using small , fixed hardware to help manage the memory • Software portion which is based on PHK algorithm provides better object localities than buddy system • Hardware portion improves execution performance of the software portion New Hybrid Allocator CMPE 511, Fall 2006

Proposed Hybrid Allocator • Creating page indexes • For large sized objects (>half a page) does the allocation without any assistance from hardware • Allocation for a small sized object, it will locate the bitmap of a page with free memory and issues a search request to the hardware Software portion Responsible for • Search the page index (or bitmap) in parallel to find a free chunk • Mark the bitmap to indicate an allocation Hardware portion CMPE 511, Fall 2006

Proposed Hybrid Allocator • OR-tree responsible for determining if there is a free chunk in a page (similar to Chang’s system) • AND-tree will locate the position of the first free chunk in the page (similar to Chang’s system) • Because an OR-tree and an AND-tree are dedicated to one object size, complex interconnections between OR and AND tree are not needed( unlike Chang’s) CMPE 511, Fall 2006

Proposed Hybrid Allocator • MUX uses opcode to select the address of the bit needed to be flipped. • If the opcode is “alloc” the address from the AND-tree will be chosen • If the opcode is “free” the address from the request will be selected • D-latches are used as storage devices where the bitmap will be loaded from the page in accordance with the allocation size • DEMUX used to decode the address from the MUX CMPE 511, Fall 2006

Proposed Hybrid Allocator • Bit-flippers use the decoded address and the opcode to determine how to flip a desired bit Block Diagram of Proposed Hardware Component (For Page Size 4096 bytes and Object Size 16 bytes) CMPE 511, Fall 2006

Overall design of the system with 4096-byte pages For different object sizes, the hardware needed to support the bit-map will be different In our design, preselected object sizes are from 16-bytes to 2048-bytes and include hardware to support pages for these objects MUX is used to select the hardware unit that will be responsible for supporting objects of a given size The larger the object size, the smaller the amount of hardware needed to support the bit-maps indicating the availability of chunks in that page Proposed Hybrid Allocator CMPE 511, Fall 2006

With 4096-byte pages, we have 8 different sized objects ranging from 16-bytes to 2048-bytes. For allocating 2048-byte objects we need a tree with two leaves 16-byte objects we need trees with 256 leaves For a 16-byte object we need only 255 AND/OR gates For overall system 1+3+7+15+31+63+127+255=502 AND gates and 502 OR gates are needed Very small amount compared to billions of transistors available on modern processor chips Proposed Hybrid Allocator CMPE 511, Fall 2006

Complexity and Performance Comparison • Complexity Comparison • Existing hardware allocator designs implement the buddy system • The amount of hardware that is used to implement a buddy allocator is dependent on the size of memory • That makes buddy system based allocators not scalable. • Our design has much lower hardware complexity than Chang’s allocator. (Buddy System) CMPE 511, Fall 2006

Complexity and Performance Comparison M: Total dynamic memory size P: Page size S: Smallest allocated object size CMPE 511, Fall 2006

Complexity and Performance Comparison Performance Analysis Conventional CPU using SimpleScalar simulation tool set V2.0 Hardware-assisted PHK allocator CMPE 511, Fall 2006

Complexity and Performance Comparison CMPE 511, Fall 2006

Complexity and Performance Comparison • We show the reduced memory management execution cycles normalized to the original execution cycles spent on memory management functions by software only allocator • Cfrac application shows the best performance improvement • Ave.obj.size is 8 bytes which means that most pages allocated contain 256 objects • Linear search in the software implementation for that many objects will be very slow • The hardware speeds up the search, leading to 76.2% normalized performance improvement over the software only allocation • Benchmark espresso with average object size of 250 bytes shows the least amount of improvement using the hybrid allocator • Pages allocated for espresso contain fewer than 20 objects • Linear search of 20 objects is not significant, and the hardware allocator only shows 48.0% nornalized performance improvement • Other benchmarks have average object sizes of 16 bytes to 48 bytes, so the performance gains are not significant as cfrac, but better than espresso • On average, the Hybrid allocator reduces the memory management time by 58.9%. The average overall execution speedup of our design when compared to a software only allocator implementation is 12.7% CMPE 511, Fall 2006

Conclusion Our Design • Significantly lower hardware complexity • Lower critical path delays • Our design has a fixed hardware complexity which is dependent on the size of a memory page (not the total user memory being managed) Compared to Hardware only allocators • Overall execution performance is 12.7% better on memory intensive benchmarks • Memory management efficiency improved by 58.9% Compared to Software only allocators CMPE 511, Fall 2006

Conclusion • Future Work • Exploring variable sized pages such that the number of allocated objects are the same in each page • All the bitmaps will have the same number of bits • Thus, we need only one pair of AND-tree and Or-tree in the design • That will further reduce the hardware complexity • This will also improve the memory management efficiency of allocators for large objects CMPE 511, Fall 2006

References • [1] W.Li, S.P.Mohanty and K.Kavi, “A Page-based Hybrid (Software-Hardware) Dynamic Memory Allocator” IEEE Computer Architecture Letters (accepted in July 2006 for future issue) • [2] J.M. Chang and E.F.Gehringer, “A High-Performance Memory Allocator for Object-oriented Systems”, IEEE Transactions on Computers, Mar. 1996, pp 357-366. • [3] P.H.Kamp.“Malloc(3)revisited”, http://phk.freebsd.dk/pubs/malloc.pdf • [4]D.E.Knuth, The Art of Computer Programming Vol.I: Fundamental Algorithms., Addison-Wesley, 1968. • [5]D.Burgerand, T.M.Austin, “The Simple Scalar Tool Set, V2.0”, Tech Report CS-1342, University of Wisconsin-Madison, Jun. 1997. CMPE 511, Fall 2006

Questions ? CMPE 511, Fall 2006

Hybrid (Software-Hardware) Dynamic Memory Allocator