
Smart Memory for Smart Phones





Presentation Transcript


  1. Smart Memory for Smart Phones Chris Clack University College London clack@cs.ucl.ac.uk

  2. Outline • Target Architecture • Problems • Focus on Fragmentation • Results from UT • A fast allocator (not embedded) • Doug Lea’s Allocator • Can We Do Better? • Overheads • Results

  3. Target Architecture • Small hand-held integrated phone/PDA devices • Soft real-time, “open box”, constrained applications heap • Competition pressure for more, more flexible, and better (larger) applications

  4. Problems (1) • [Diagram: semi-space heap with live and free regions, a TOP pointer, and a free fragment] • To compact: copy when nearly full • Memory overhead • Compaction delay

  5. Problems (2) • [Diagram: heap with live and free regions and a TOP pointer] • To compact: do sliding compaction when nearly full • Compaction delay

  6. Problems (3) • [Diagram: heap with live blocks and free blocks linked on a FREE LIST] • To compact: do sliding compaction when allocation fails • Compaction delay

  7. Focus on Fragmentation • What happens in real programs? • Great paper by Mark Johnstone and Paul Wilson (UT): • “The Memory Fragmentation Problem: Solved?”, M.Johnstone & P.Wilson, 1997 • Fragmentation experiments using real programs running on real data

  8. [Table: statistics for the test programs – max live data at any time, max Kb at any time, average lifetime of an allocated byte]

  9. RESULTS • No difference within experimental error

  10. MEASURE OF FRAGMENTATION • [Graph with points #3 and #4 marked] • e.g. %frag #4 = (value_at_3 – value_at_2) * 100 / value_at_2
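The formula on the slide can be sketched as plain arithmetic. The interpretation below is an assumption (the slide only labels graph points): value_at_2 is taken as the memory the program actually requested at that point, and value_at_3 as the memory the allocator consumed, so the measure is the percentage overhead.

```c
#include <assert.h>

/* Hedged sketch of the slide's fragmentation measure #4:
 * percentage by which allocator memory use (point #3) exceeds
 * program-requested memory (point #2). The meaning of the two
 * points is an assumed interpretation, not stated on the slide. */
static double frag_percent(double value_at_3, double value_at_2) {
    return (value_at_3 - value_at_2) * 100.0 / value_at_2;
}
```

For example, an allocator using 105 Kb to satisfy 100 Kb of requests would score 5% fragmentation, matching the "roughly 5%" figure reported in the results slides.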

  11. No difference within experimental error

  12. Johnstone & Wilson’s conclusion • The best free-list management policy • in terms of fragmentation behaviour • on real programs is BEST-FIT • (Knuth notwithstanding)

  13. A Fast Best-Fit Allocator • IMPLICATION: use Best-fit allocation and we (maybe?) won’t ever need to compact • At least, compaction delays will be minimized • BUT: best-fit allocation is S-L-O-W • Worst-case: have to scan the entire free list • Let’s look at a widely-used best-fit allocator: Doug Lea’s malloc • (arguably) the fastest best-fit allocator

  14. [Diagram: heap blocks carrying boundary tags at each end – used for coalescing]
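A minimal sketch of why boundary tags make coalescing O(1). Everything here is an illustrative assumption, not the slide's exact layout: each block stores a tag (size in bytes, low bit = "in use") at both its head and its foot, so free() can read the previous block's footer directly and merge adjacent free blocks without any search. A block at the very start of the heap is assumed to have a sentinel neighbour so hdr[-1] is always valid.

```c
#include <assert.h>
#include <stdint.h>

#define IN_USE 1u   /* low bit of a tag flags "allocated" */

/* Tags pack an even byte size with the in-use bit. */
static uint32_t tag_make(uint32_t size, int in_use) {
    return size | (in_use ? IN_USE : 0);
}
static uint32_t tag_size(uint32_t tag) { return tag & ~IN_USE; }
static int      tag_used(uint32_t tag) { return tag & IN_USE; }

/* The footer of the previous physical block sits just before this
 * block's header, so the previous header is one subtraction away. */
static uint32_t *prev_header(uint32_t *hdr) {
    uint32_t prev_size = tag_size(hdr[-1]);
    return (uint32_t *)((char *)hdr - prev_size);
}

/* Merge a just-freed block with a free previous neighbour;
 * returns the header of the (possibly merged) block. */
static uint32_t *coalesce_prev(uint32_t *hdr) {
    uint32_t *p = prev_header(hdr);
    if (!tag_used(*p)) {
        uint32_t merged = tag_size(*p) + tag_size(*hdr);
        *p = tag_make(merged, 0);                       /* new header */
        uint32_t *foot = (uint32_t *)((char *)p + merged) - 1;
        *foot = tag_make(merged, 0);                    /* new footer */
        return p;
    }
    return hdr;
}
```

The same footer read works symmetrically for the next physical block, which is the point of paying for a tag at both ends.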

  15. [Diagram of Doug Lea's allocator: exact-fit bins plus fixed-width bins sorted by size, with wilderness block W] • Sorting costs time • Worst case: all free blocks in one bin – reduces to O(n) search

  16. Can we do better? • Support boundary tags and coalescing • Simple Idea (1) (of 4): • Probability of fragmentation triggering compaction depends on RANGE of allocatable block sizes • Very large block alloc more likely to fail due to frags • Very small free blocks create frags • (NB if all blocks same size, fragmentation is zero!)

  17. Restrict range of allocatable sizes and create an exact-fit table: [bins lb, lb+1, lb+2, lb+3, …, ub-2, ub-1, ub] • No need to sort • Worst case: O(n) search for next highest occupied bin

  18. [Exact-fit table lb … ub with occupancy bitmap 00110000000000000000000000000101] • Old idea • Use an occupancy bitmap • If (ub-lb) = 31, the bitmap is just one word • To search/allocate: read bitmap; AND with mask; find highest set bit; maybe modify the bit and write it back
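The search step above can be sketched in a few lines of C. One assumption is needed about bit layout: matching the slide's "find highest set bit" (and "search to the right" on a later slide), bin i is placed at bit (31 - i), so smaller sizes sit in higher bits and the best (smallest adequate) bin is the highest set bit after masking off the bins that are too small.

```c
#include <assert.h>
#include <stdint.h>

/* One-word occupancy bitmap over 32 exact-fit bins (sizes lb..lb+31).
 * Assumed convention: bin i occupies bit (31 - i). Returns the index
 * of the smallest occupied bin >= want, or -1 if none fits. */
static int find_bin(uint32_t bitmap, int want) {
    uint32_t mask = 0xFFFFFFFFu >> want;   /* keep bins >= want      */
    uint32_t hits = bitmap & mask;         /* AND with mask          */
    if (hits == 0) return -1;              /* nothing big enough     */
    int bit = 31;                          /* find highest set bit   */
    while (!(hits & (1u << bit))) bit--;
    return 31 - bit;                       /* back to a bin index    */
}
```

A real implementation would use a hardware priority-encode instruction (or a compiler built-in) for the find-highest-set-bit step rather than a loop; the loop keeps the sketch portable.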

  19. Problem • What if range is very large? • E.g. Nikhil wants to allocate blocks that vary from 2 words to 2^12 words • 2^12 different block sizes • Worst case = linear search of 128 bitmap words (128 reads + …) • Two solutions: • Use more efficient bitmapping • Use unconstrained hybrid scheme (see later)

  20. More efficient bitmapping • Simple Idea (2) • Use a bitmap tree: • Requires 128 + 4 + 1 words • Requires worst case 5 reads, 3 tests for zero, 3 masks, 3 finds of greatest set bit, 3 modify&writes • Generally: O(log32 ((ub-lb)/32)) • (Depends what you are counting … but it is fast!) • Ten times faster than any other scheme we know
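The idea can be illustrated with a two-level version of the tree (the slides describe a three-level tree of 128 + 4 + 1 words for 2^12 bins; the search principle is identical, just one more level). Everything below is a sketch under assumed conventions: 32 leaf words cover 1024 bins, a root word summarizes which leaf words are non-empty, and the lowest bit is the smallest bin.

```c
#include <assert.h>
#include <stdint.h>

/* Two-level bitmap tree over 1024 exact-fit bins:
 * root bit i is set iff leaf[i] has any bin set. */
typedef struct {
    uint32_t root;
    uint32_t leaf[32];
} BitmapTree;

static void tree_set(BitmapTree *t, int bin) {
    t->leaf[bin / 32] |= 1u << (bin % 32);
    t->root           |= 1u << (bin / 32);
}

static void tree_clear(BitmapTree *t, int bin) {
    t->leaf[bin / 32] &= ~(1u << (bin % 32));
    if (t->leaf[bin / 32] == 0)
        t->root &= ~(1u << (bin / 32));   /* keep summary consistent */
}

/* Find the lowest occupied bin >= want, or -1: two masked word
 * scans instead of a linear walk over all 32 leaf words. */
static int tree_find(const BitmapTree *t, int want) {
    int w = want / 32;
    /* try the leaf word containing 'want', masking smaller bins */
    uint32_t hits = t->leaf[w] & (0xFFFFFFFFu << (want % 32));
    if (hits == 0) {
        /* consult the root: is any later leaf word occupied? */
        uint32_t up = (w == 31) ? 0 : (t->root & (0xFFFFFFFFu << (w + 1)));
        if (up == 0) return -1;
        w = 0; while (!(up & (1u << w))) w++;   /* lowest set bit */
        hits = t->leaf[w];
    }
    int b = 0; while (!(hits & (1u << b))) b++;
    return w * 32 + b;
}
```

This is the source of the O(log32((ub-lb)/32)) bound: each level costs one read, one mask, and one find-set-bit, and the tree depth grows only with the 32-logarithm of the bin count.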

  21. LIFO/FIFO? • Simple Idea (3) • Although J&W found no difference between LIFO/FIFO/AO best fit, this might be different for embedded apps • So far, we can only do LIFO • We can achieve FIFO if we double-link ALL free blocks into one big chain • Drawback – now free takes as long as malloc (but still O(log32 ((ub-lb)/32)))

  22. [Diagram: bitmap tree over exact-fit bins lb … ub; freed blocks placed at heads of chains] • If requested size not available, for LIFO: search bitmap tree to the right • Or for FIFO: search bitmap tree to the left, then follow link to next highest free block

  23. Simple Idea (4) • We can trivially also support Worst-fit by adding a pointer that always refers to the biggest block • And this is where we put our wilderness block! • We have no data on fragmentation behaviour of worst-fit • If it turns out to be similar to best fit, it would be preferable because we would have O(1) alloc and O(log32 ((ub-lb)/32)) free.

  24. [Diagram: bitmap tree over bins lb … ub, with a max pointer to the biggest block and wilderness block W]

  25. Overheads • Dynamic per-block overhead • Depends on (ub-lb) – can be very small • Example (total 32 bits per live block): • 16 bit signed int for size and availability of current block • 16 bit signed int for size and availability of previous block • Could optimize for live block overhead: 1 bit in header + free blocks also hold size at end of block • But, if 4-byte aligned and ANY overhead per block, can’t do better than this! • Free blocks additionally need to hold two pointers • minimum block size = header + 2 pointers
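The 32-bit per-block overhead described above can be sketched concretely. The encoding is an assumption for illustration: each 16-bit signed field carries a block's size in words in its magnitude and its availability in its sign (negative = free), so a single load yields both facts for this block and for its physical predecessor.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit block header matching the slide's budget:
 * one 16-bit signed int for the current block, one for the
 * previous physical block. Sign = availability, |value| = size. */
typedef struct {
    int16_t prev;   /* size of previous block; negative if free */
    int16_t self;   /* size of this block;     negative if free */
} BlockHeader;

static int16_t encode(uint16_t size_words, int is_free) {
    return is_free ? (int16_t)-(int16_t)size_words
                   : (int16_t)size_words;
}
static uint16_t hdr_size(int16_t f) { return (uint16_t)(f < 0 ? -f : f); }
static int      hdr_free(int16_t f) { return f < 0; }
```

With 16-bit sizes the scheme caps blocks at 2^15 - 1 words, which fits the talk's premise of a restricted range of allocatable sizes on a constrained handset heap.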

  26. Static overheads • Code • A few registers (e.g. max) • Data structures: • Bitmap tree: 133 words • Table: (ub-lb) words • NOTE: if (ub-lb) spans the whole heap, the table is as large as the heap! (same overhead as semi-space) • So we don’t want to use this scheme for large size ranges!!! – instead use a hybrid

  27. Hybrid scheme • Most used range of block sizes: • Use the bitmap tree and exact-fit bins as described • Bigger block sizes: • These are all kept on the double-linked chain above the biggest exact-fit block. • Can use fixed-width bins like Lea, together with a separate bitmap tree, • We lose the worst-case property of the primary scheme

  28. RESULTS • Re-run Johnstone and Wilson’s tests, using our allocator on their trace files

  29. Test 1 • [Graph: memory required by gmalloc vs. memory required by new allocator vs. memory requested by the program] • Memory requirement halved! • Roughly 5% fragmentation?

  30. Test 2 • [Graph: memory required by gmalloc vs. memory required by new allocator vs. memory requested by the program]

  31. Test 3 • [Graph: memory required by gmalloc vs. memory required by new allocator vs. memory requested by the program]

  32. Test 4 • [Graph: memory required by gmalloc vs. memory required by new allocator vs. memory requested by the program] • Memory requirements consistently halved! • Fragmentation consistently ~5% (?)

  33. Status • Currently working with Symbian to conduct malloc-replacement trials using real smartphone applications
