hybrid software hardware dynamic memory allocator n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Hybrid (Software-Hardware) Dynamic Memory Allocator PowerPoint Presentation
Download Presentation
Hybrid (Software-Hardware) Dynamic Memory Allocator

Loading in 2 Seconds...

play fullscreen
1 / 31

Hybrid (Software-Hardware) Dynamic Memory Allocator - PowerPoint PPT Presentation


  • 214 Views
  • Uploaded on

Hybrid (Software-Hardware) Dynamic Memory Allocator. Prepared by Mustafa Özgür Akduran İ stanbul, 200 6 Bo ğaziçi Üniversitesi. Outline. Introduction Related Research Proposed Hybr id Allocator Complexity and Performance Comparison Conclusion References. Introduction. Need for

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Hybrid (Software-Hardware) Dynamic Memory Allocator' - sagittarius


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
hybrid software hardware dynamic memory allocator

Hybrid (Software-Hardware) Dynamic Memory Allocator

Prepared by Mustafa Özgür Akduran

İstanbul, 2006

Boğaziçi Üniversitesi

CMPE 511, Fall 2006

outline
Outline
  • Introduction
  • Related Research
  • Proposed Hybrid Allocator
  • Complexity and Performance Comparison
  • Conclusion
  • References

CMPE 511, Fall 2006

introduction
Introduction
  • Need for
    • Efficient Implementation of Memory Management Functions
      • Memory Usage
      • Execution Performance

Dynamic Memory Allocation (DMA)

Garbage Collection

Modern Programming Languages

CMPE 511, Fall 2006

introduction1
Introduction

Dynamic Memory Management

DMM

Current Systems

  • Execution time spent on Memory Management is 42%.
  • Still important researches on
    • Good execution performance
    • Memory locality
  • How to get free chunks of memory ?
    • Software Allocator
    • Hardware Allocator

Pure Software

Low Cost Allocator

CMPE 511, Fall 2006

introduction2
Software Allocator

Different Search Techniques to organize available chunks of free memory

Disadvantage

Search could be in the critical path of allocators causing a major performance bottleneck.

Hardware Allocator

Parallel Search

Speed up Memory Allocation

Improve Performance

Hide execution latency of freeing objects

Coalescing of free chunks of memory

Disadvantage

Potential Hardware Complexity

Introduction

CMPE 511, Fall 2006

introduction3
Introduction

A New Hybrid Software-Hardware Allocator

Chang’s Hardware Allocator

PHK (Poul-Henning Kamp) Allocation Algorithm used in Free-BSD System

Aim is to balance the hardware complexity with performance by using both hardware and software together.

CMPE 511, Fall 2006

related research
Related Research

PHK (Poul-Henning Kamp) Allocator

Two most popular general purpose open source allocator

1. Doug Lea used in LINUX System

2. PHK used in Free-BSD System

Difference between them is less than 3% for memory allocation intensive benchmarks in SPEC 2000 CPU.

PHK Allocator chosen bacause of its suitability for hardware/software co-design.

Free-BSD (Berkeley Software Distribution ) is an advanced operating system for x86 compatible (including Pentium® and Athlon™), architectures. It is derived from BSD, the version of UNIX® developed at the University of California, Berkeley. It is developed and maintained by a large team of individuals.

CMPE 511, Fall 2006

related research1
Related Research

PHK (Poul-Henning Kamp) Allocator

  • Page based allocator
  • Each page can only contain objects of one size
  • For a large object sufficient number of pages allocated
  • For small objects less than a half page, object size is padded to the nearest power of 2
  • Allocator keeps a page directory for all allocated pages and at the beginning of each small object page, bitmap of allocation information is created
  • While allocating small objects, PHK Allocator performs a linear search on the bitmap to find the first available chunk in that page

CMPE 511, Fall 2006

related research2
Related Research
  • Chang’s Hardware Allocator
    • Based on Buddy System invented by Knuth
    • The buddy memory allocation technique divides memory into partitions to satisfy a memory request as suitably as possible
    • This system makes use of splitting memory into halves to try to give a best-fit
    • Compared to the memory allocation techniques (such as paging) that modern OS such as MS Windows and Linux use, the buddy memory allocation is relatively easy to implement, and does not have the hardware requirement of a memory management unit
    • Chang’s algorithm is a first method based on a binary OR-tree and a binary AND-tree.

CMPE 511, Fall 2006

related research3
Chang’s Hardware Allocator

Each leaf node of the OR-tree represents base size of the smallest unit of memory that can be allocated

The leaves of OR-tree together represent the entire memory

AND-tree has the same number of leaves as the OR-tree

Input of the AND-tree is generated by a complex interconnection network of the OR-tree

Related Research

Or Gates

CMPE 511, Fall 2006

related research4
Chang’s Hardware Allocator

Or-Tree

Determine if there is a large enough space for allocation request

AND-Tree

Find the beginning address of that memory chunk

Flip the bits corresponding to the memory chunk in the bit-vector

Related Research

Bit-vector

CMPE 511, Fall 2006

related research5
Related Research

CMPE 511, Fall 2006

related research6
Related Research
  • The interconnection between the OR-tree and the AND-tree is the most complex part of the Chang’s allocator
  • The interconnection has the same critical path delay as the OR/AND-tree
  • Final allocation result is produced by the output of the AND-tree through a set of multiplexers
  • The Hardware complexity, in terms of number of gates is

O(n logn)

# the memory chunks

Critical path delay

CMPE 511, Fall 2006

proposed hybr id allocator
Proposed Hybrid Allocator

Problems of hardware-software only allocators

1. Complexity of the hardware increases with the size of the memory managed

2. Poor object locality

Pure hardware allocators based on buddy system

Poor execution performance

Software Allocators

CMPE 511, Fall 2006

proposed hybr id allocator1
Proposed Hybrid Allocator
  • Using small , fixed hardware to help manage the memory
  • Software portion which is based on PHK algorithm provides better object localities than buddy system
  • Hardware portion improves execution performance of the software portion

New Hybrid Allocator

CMPE 511, Fall 2006

proposed hybr id allocator2
Proposed Hybrid Allocator
  • Creating page indexes
  • For large sized objects (>half a page) does the allocation without any assistance from hardware
  • Allocation for a small sized object, it will locate the bitmap of a page with free memory and issues a search request to the hardware

Software portion

Responsible for

  • Search the page index (or bitmap) in parallel to find a free chunk
  • Mark the bitmap to indicate an allocation

Hardware portion

CMPE 511, Fall 2006

proposed hybr id allocator3
Proposed Hybrid Allocator
  • OR-tree responsible for determining if there is a free chunk in a page (similar to Chang’s system)
  • AND-tree will locate the position of the first free chunk in the page (similar to Chang’s system)
  • Because an OR-tree and an AND-tree are dedicated to one object size, complex interconnections between OR and AND tree are not needed( unlike Chang’s)

CMPE 511, Fall 2006

proposed hybr id allocator4
Proposed Hybrid Allocator
  • MUX uses opcode to select the address of the bit needed to be flipped.
    • If the opcode is “alloc” the address from the AND-tree will be chosen
    • If the opcode is “free” the address from the request will be selected
  • D-latches are used as storage devices where the bitmap will be loaded from the page in accordance with the allocation size
  • DEMUX used to decode the address from the MUX

CMPE 511, Fall 2006

proposed hybr id allocator5
Proposed Hybrid Allocator
  • Bit-flippers use the decoded address and the opcode to determine how to flip a desired bit

Block Diagram of Proposed Hardware Component (For Page Size 4096 bytes and Object Size 16 bytes)

CMPE 511, Fall 2006

proposed hybr id allocator6
Overall design of the system with 4096-byte pages

For different object sizes, the hardware needed to support the bit-map will be different

In our design, preselected object sizes are from 16-bytes to 2048-bytes and include hardware to support pages for these objects

MUX is used to select the hardware unit that will be responsible for supporting objects of a given size

The larger the object size, the smaller the amount of hardware needed to support the bit-maps indicating the availability of chunks in that page

Proposed Hybrid Allocator

CMPE 511, Fall 2006

proposed hybr id allocator7
With 4096-byte pages, we have 8 different sized objects ranging from 16-bytes to 2048-bytes.

For allocating

2048-byte objects we need a tree with two leaves

16-byte objects we need trees with 256 leaves

For a 16-byte object we need only 255 AND/OR gates

For overall system 1+3+7+15+31+63+127+255=502 AND gates and 502 OR gates are needed

Very small amount compared to billions of transistors available on modern processor chips

Proposed Hybrid Allocator

CMPE 511, Fall 2006

complexity and performance comparison
Complexity and Performance Comparison
  • Complexity Comparison
    • Existing hardware allocator designs implement the buddy system
    • The amount of hardware that is used to implement a buddy allocator is dependent on the size of memory
    • That makes buddy system based allocators not scalable.
    • Our design has much lower hardware complexity than Chang’s allocator. (Buddy System)

CMPE 511, Fall 2006

complexity and performance comparison1
Complexity and Performance Comparison

M: Total dynamic memory size

P: Page size

S: Smallest allocated object size

CMPE 511, Fall 2006

complexity and performance comparison2
Complexity and Performance Comparison

Performance Analysis

Conventional CPU using SimpleScalar simulation tool set V2.0

Hardware-assisted PHK allocator

CMPE 511, Fall 2006

complexity and performance comparison5
Complexity and Performance Comparison
  • We show the reduced memory management execution cycles normalized to the original execution cycles spent on memory management functions by software only allocator
  • Cfrac application shows the best performance improvement
    • Ave.obj.size is 8 bytes which means that most pages allocated contain 256 objects
    • Linear search in the software implementation for that many objects will be very slow
    • The hardware speeds up the search, leading to 76.2% normalized performance improvement over the software only allocation
  • Benchmark espresso with average object size of 250 bytes shows the least amount of improvement using the hybrid allocator
    • Pages allocated for espresso contain fewer than 20 objects
    • Linear search of 20 objects is not significant, and the hardware allocator only shows 48.0% nornalized performance improvement
  • Other benchmarks have average object sizes of 16 bytes to 48 bytes, so the performance gains are not significant as cfrac, but better than espresso
  • On average, the Hybrid allocator reduces the memory management time by 58.9%. The average overall execution speedup of our design when compared to a software only allocator implementation is 12.7%

CMPE 511, Fall 2006

conclusion
Conclusion

Our Design

  • Significantly lower hardware complexity
  • Lower critical path delays
  • Our design has a fixed hardware complexity which is dependent on the size of a memory page (not the total user memory being managed)

Compared to Hardware only allocators

  • Overall execution performance is 12.7% better on memory intensive benchmarks
  • Memory management efficiency improved by 58.9%

Compared to Software only allocators

CMPE 511, Fall 2006

conclusion1
Conclusion
  • Future Work
    • Exploring variable sized pages such that the number of allocated objects are the same in each page
    • All the bitmaps will have the same number of bits
    • Thus, we need only one pair of AND-tree and Or-tree in the design
    • That will further reduce the hardware complexity
    • This will also improve the memory management efficiency of allocators for large objects

CMPE 511, Fall 2006

references
References
  • [1] W.Li, S.P.Mohanty and K.Kavi, “A Page-based Hybrid (Software-Hardware) Dynamic Memory Allocator” IEEE Computer Architecture Letters (accepted in July 2006 for future issue)
  • [2] J.M. Chang and E.F.Gehringer, “A High-Performance Memory Allocator for Object-oriented Systems”, IEEE Transactions on Computers, Mar. 1996, pp 357-366.
  • [3] P.H.Kamp.“Malloc(3)revisited”, http://phk.freebsd.dk/pubs/malloc.pdf
  • [4]D.E.Knuth, The Art of Computer Programming Vol.I: Fundamental Algorithms., Addison-Wesley, 1968.
  • [5]D.Burgerand, T.M.Austin, “The Simple Scalar Tool Set, V2.0”, Tech Report CS-1342, University of Wisconsin-Madison, Jun. 1997.

CMPE 511, Fall 2006

questions
Questions ?

CMPE 511, Fall 2006