
Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Paper by:

Chris Lattner and Vikram Adve

University of Illinois at Urbana-Champaign

Best Paper Award at PLDI 2005

Presented by:

Jeff Da Silva

CARG - Aug 2nd 2005

Motivation
  • Computer architecture and compiler research has primarily focused on analyzing & optimizing memory access patterns for dense arrays rather than for pointer-based data structures.
    • e.g. caches, prefetching, loop transformations, etc.
  • Why?
    • Compilers have precise knowledge of the runtime layout and traversal patterns associated with arrays.
    • The layout of heap allocated data structures and traversal patterns can be difficult to statically predict (generally it’s not worth the effort).
Intuition
  • Improving a dynamic data structure’s spatial locality through program analysis (like shape analysis) is probably too difficult and might not be the best approach.
  • What if you can somehow influence the layout dynamically so that the data structure is allocated intelligently and possibly appears like a dense array?
  • A new approach: develop a new technique that operates at the “macroscopic” level
    • i.e. at the level of the entire data structure rather than individual pointers or objects
What is the problem?

What the program creates: [Figure: three separate logical data structures — List 1 nodes, List 2 nodes, and Tree nodes]
What is the problem?

What the compiler sees: [Figure: List 1, List 2, and Tree nodes intermixed in a single undifferentiated heap]
What is the problem?

What we want the program to create, and the compiler to see: [Figure: List 1, List 2, and Tree nodes each segregated into its own region of the heap]
Their Approach: Segregate the Heap
  • Step #1: Memory Usage Analysis
    • Build context-sensitive points-to graphs for program
    • We use a fast unification-based algorithm
  • Step #2: Automatic Pool Allocation
    • Segregate memory based on points-to graph nodes
    • Find lifetime bounds for memory with escape analysis
    • Preserve points-to graph-to-pool mapping
  • Step #3: Follow-on pool-specific optimizations
    • Use segregation and points-to graph for later optimizations
Why Segregate Data Structures?
  • Primary Goal: Better compiler information & control
    • Compiler knows where each data structure lives in memory
    • Compiler knows order of data in memory (in some cases)
    • Compiler knows type info for heap objects (from points-to info)
    • Compiler knows which pools point to which other pools
  • Second Goal: Better performance
    • Smaller working sets
    • Improved spatial locality
      • Especially if allocation order matches traversal order
    • Sometimes convert irregular strides to regular strides
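The spatial-locality goal above can be illustrated with a toy sketch (the names here are illustrative, not from the paper): when every node of a list is carved from its own contiguous region, successive nodes land at adjacent addresses, which is exactly the regular-stride layout the compiler and hardware want.

```c
#include <assert.h>

typedef struct Node { struct Node *Next; int Data; } Node;

/* A trivial per-structure region: all nodes of one list come from
   one contiguous buffer, mimicking what pool allocation arranges. */
typedef struct { Node Buf[8]; int N; } Region;

static Node *region_alloc(Region *R) { return &R->Buf[R->N++]; }

/* Build a list whose nodes all live in region R. */
static Node *build(Region *R, int len) {
  Node *Head = 0;
  for (int i = 0; i < len; i++) {
    Node *N = region_alloc(R);
    N->Data = i;
    N->Next = Head;
    Head = N;
  }
  return Head;
}
```

Traversing such a list walks the buffer at a fixed stride of sizeof(Node) bytes, so caches and prefetchers see it much like a dense array.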
Contributions of this Paper
  • First “region inference” technique for C/C++:
    • Previous work required type-safe programs: ML, Java
    • Previous work focused on memory management
  • Region inference driven by pointer analysis:
    • Enables handling non-type-safe programs
    • Simplifies handling imperative programs
    • Simplifies further pool+ptr transformations
  • New pool-based optimizations:
    • Exploit per-pool and pool-specific properties
  • Evaluation of impact on memory hierarchy:
    • We show that pool allocation reduces working sets
Outline
  • Introduction & Motivation
  • Automatic Pool Allocation Transformation
  • Pool Allocation-Based Optimizations
  • Pool Allocation & Optimization Performance Impact
  • Conclusion
Example

  struct list { list *Next; int *Data; };

  list *createnode(int *Data) {
    list *New = malloc(sizeof(list));
    New->Data = Data;
    return New;
  }

  void splitclone(list *L, list **R1, list **R2) {
    if (L == 0) { *R1 = *R2 = 0; return; }
    if (some_predicate(L->Data)) {
      *R1 = createnode(L->Data);
      splitclone(L->Next, &(*R1)->Next, R2);
    } else {
      *R2 = createnode(L->Data);
      splitclone(L->Next, R1, &(*R2)->Next);
    }
  }

Example

  void processlist(list *L) {
    list *A, *B, *tmp;
    // Clone L, splitting nodes into lists A and B.
    splitclone(L, &A, &B);
    processPortion(A);  // Process first list
    processPortion(B);  // Process second list
    // free A list
    while (A) { tmp = A->Next; free(A); A = tmp; }
    // free B list
    while (B) { tmp = B->Next; free(B); B = tmp; }
  }

Note that lists A and B use distinct heap memory; it would therefore be beneficial if they were allocated from separate pools of memory.

Pool Alloc Runtime Library Interface

  void poolcreate(Pool *PD, uint Size, uint Align);
    Initializes a pool descriptor (obtains one or more pages of memory using malloc).

  void pooldestroy(Pool *PD);
    Releases the pool's memory and destroys the pool descriptor.

  void *poolalloc(Pool *PD, uint numBytes);
  void poolfree(Pool *PD, void *ptr);
  void *poolrealloc(Pool *PD, void *ptr, uint numBytes);
    Allocate, free, and reallocate memory within the pool.

The interface also includes poolinit_bp(..), poolalloc_bp(..), and pooldestroy_bp(..).
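The slide lists only prototypes. A minimal sketch of how such a runtime could work is given below; this is a simplified slab-plus-freelist design for fixed-size objects, with an assumed 4096-byte slab size, not the paper's actual library.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Slab { struct Slab *Next; } Slab;

typedef struct Pool {
  Slab *Slabs;       /* chain of malloc'd slabs, all released at pooldestroy */
  char *Cur, *End;   /* bump region inside the current slab */
  void *FreeList;    /* objects recycled by poolfree */
  unsigned Size;     /* fixed (aligned) object size for this pool */
} Pool;

void poolcreate(Pool *PD, unsigned Size, unsigned Align) {
  if (Size < sizeof(void *)) Size = sizeof(void *); /* room for freelist link */
  PD->Size = (Size + Align - 1) & ~(Align - 1);
  PD->Slabs = 0;
  PD->Cur = PD->End = 0;
  PD->FreeList = 0;
}

void *poolalloc(Pool *PD, unsigned numBytes) {
  assert(numBytes <= PD->Size && "sketch handles fixed-size objects only");
  if (PD->FreeList) {                      /* reuse a freed object first */
    void *P = PD->FreeList;
    PD->FreeList = *(void **)P;
    return P;
  }
  if (PD->Cur + PD->Size > PD->End) {      /* current slab exhausted */
    Slab *S = malloc(sizeof(Slab) + 4096); /* assumed slab size */
    S->Next = PD->Slabs;
    PD->Slabs = S;
    PD->Cur = (char *)(S + 1);
    PD->End = PD->Cur + 4096;
  }
  void *P = PD->Cur;
  PD->Cur += PD->Size;
  return P;
}

void poolfree(Pool *PD, void *ptr) {       /* push onto the pool freelist */
  *(void **)ptr = PD->FreeList;
  PD->FreeList = ptr;
}

void pooldestroy(Pool *PD) {               /* release every slab at once */
  while (PD->Slabs) {
    Slab *S = PD->Slabs;
    PD->Slabs = S->Next;
    free(S);
  }
}
```

Because pooldestroy frees whole slabs, tearing down a data structure costs O(#slabs) rather than O(#objects), one reason region-style deallocation is cheap.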

Algorithm Steps
  • Generate a DS graph (points-to graph) for each function
  • Insert code to create and destroy pool descriptors for DS nodes whose lifetime does not escape a function
  • Add pool descriptor arguments for every DS node that escapes a function
  • Replace calls to malloc and free with calls to poolalloc and poolfree
  • Further refinements and optimizations
Points-To Graph – DS Graph
  • Builds a points-to graph for each function in Bottom Up [BU] order
  • Context sensitive naming of heap objects
    • More advanced than the traditional “allocation callsite”
  • A unification-based approach
    • Allows for a fast and scalable analysis
    • Ensures every pointer points to one unique node
  • Field Sensitive
    • Added accuracy
  • Also used to compute escape info
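The "unification" in the analysis is essentially union-find over points-to nodes: whenever two pointers may refer to the same memory (e.g., through a copy), the nodes they point to are merged, so each pointer ends up pointing to exactly one node. A minimal sketch of the merging machinery (node numbering and names are illustrative):

```c
#include <assert.h>

#define MAXNODES 16
static int parent[MAXNODES];   /* union-find forest over points-to nodes */

static void ptg_init(int n) {
  for (int i = 0; i < n; i++) parent[i] = i;
}

static int ptg_find(int x) {   /* find representative, with path halving */
  while (parent[x] != x) {
    parent[x] = parent[parent[x]];
    x = parent[x];
  }
  return x;
}

/* A copy "p = q" forces p and q to point to the same node: unify. */
static void ptg_unify(int a, int b) {
  parent[ptg_find(a)] = ptg_find(b);
}
```

Unification sacrifices some precision (merged nodes can never be split again) in exchange for near-linear running time, which is what makes the analysis fast and scalable.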
Example – DS Graph

  list *createnode(int *Data) {
    list *New = malloc(sizeof(list));
    New->Data = Data;
    return New;
  }

Example – DS Graph

[Figure: DS graph for splitclone, with separate heap nodes P1 and P2 for the two result lists]

  void splitclone(list *L, list **R1, list **R2) {
    if (L == 0) { *R1 = *R2 = 0; return; }
    if (some_predicate(L->Data)) {
      *R1 = createnode(L->Data);
      splitclone(L->Next, &(*R1)->Next, R2);
    } else {
      *R2 = createnode(L->Data);
      splitclone(L->Next, R1, &(*R2)->Next);
    }
  }

Example – DS Graph

[Figure: DS graph for processlist, with heap nodes P1 and P2]

  void processlist(list *L) {
    list *A, *B, *tmp;
    // Clone L, splitting nodes into lists A and B.
    splitclone(L, &A, &B);
    processPortion(A);  // Process first list
    processPortion(B);  // Process second list
    // free A list
    while (A) { tmp = A->Next; free(A); A = tmp; }
    // free B list
    while (B) { tmp = B->Next; free(B); B = tmp; }
  }

Example – Transformation

  list *createnode(Pool *PD, int *Data) {
    list *New = poolalloc(PD, sizeof(list));
    New->Data = Data;
    return New;
  }

Example – Transformation

[Figure: heap nodes P1 and P2, now mapped to pools PD1 and PD2]

  void splitclone(Pool *PD1, Pool *PD2,
                  list *L, list **R1, list **R2) {
    if (L == 0) { *R1 = *R2 = 0; return; }
    if (some_predicate(L->Data)) {
      *R1 = createnode(PD1, L->Data);
      splitclone(PD1, PD2, L->Next, &(*R1)->Next, R2);
    } else {
      *R2 = createnode(PD2, L->Data);
      splitclone(PD1, PD2, L->Next, R1, &(*R2)->Next);
    }
  }

Example – Transformation

[Figure: heap nodes P1 and P2]

  void processlist(list *L) {
    list *A, *B, *tmp;
    Pool PD1, PD2;
    poolcreate(&PD1, sizeof(list), 8);
    poolcreate(&PD2, sizeof(list), 8);
    splitclone(&PD1, &PD2, L, &A, &B);
    processPortion(A);  // Process first list
    processPortion(B);  // Process second list
    // free A list
    while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
    // free B list
    while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
    pooldestroy(&PD1);
    pooldestroy(&PD2);
  }

More Algorithm Details
  • Indirect Function Call Handling:
    • Partition functions into equivalence classes:
      • If F1 and F2 have a common call site, they belong to the same class
    • Merge points-to graphs for each equivalence class
    • Apply previous transformation unchanged
  • Global variables pointing to memory nodes
    • Use a global pool variable rather than passing the descriptor through function arguments
More Algorithm Details
  • poolcreate / pooldestroy placement:
    • Move calls earlier/later by analyzing the pool’s lifetime
    • Reduces memory usage
    • Enables poolfree elimination
  • poolfree elimination
    • Eliminate unnecessary poolfree calls
      • No allocations between poolfree & pooldestroy
    • Behaves like static garbage collection
Example – poolcreate/pooldestroy placement

Before placement optimization:

  void processlist(list *L) {
    list *A, *B, *tmp;
    Pool PD1, PD2;
    poolcreate(&PD1, sizeof(list), 8);
    poolcreate(&PD2, sizeof(list), 8);
    splitclone(&PD1, &PD2, L, &A, &B);
    processPortion(A);  // Process first list
    processPortion(B);  // Process second list
    // free A list
    while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
    // free B list
    while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
    pooldestroy(&PD1);
    pooldestroy(&PD2);
  }

After placement (pooldestroy(&PD1) moves up to just after the last use of pool PD1, shrinking the pool's lifetime):

  void processlist(list *L) {
    list *A, *B, *tmp;
    Pool PD1, PD2;
    poolcreate(&PD1, sizeof(list), 8);
    poolcreate(&PD2, sizeof(list), 8);
    splitclone(&PD1, &PD2, L, &A, &B);
    processPortion(A);  // Process first list
    processPortion(B);  // Process second list
    // free A list
    while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
    pooldestroy(&PD1);
    // free B list
    while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
    pooldestroy(&PD2);
  }

Example – poolfree Elimination

Starting point, with explicit poolfree calls:

  void processlist(list *L) {
    list *A, *B, *tmp;
    Pool PD1, PD2;
    poolcreate(&PD1, sizeof(list), 8);
    poolcreate(&PD2, sizeof(list), 8);
    splitclone(&PD1, &PD2, L, &A, &B);
    processPortion(A);  // Process first list
    processPortion(B);  // Process second list
    // free A list
    while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
    pooldestroy(&PD1);
    // free B list
    while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
    pooldestroy(&PD2);
  }

No allocations occur between each poolfree and the corresponding pooldestroy, so the poolfree calls can be eliminated, leaving empty traversal loops:

    // free A list
    while (A) { tmp = A->Next; A = tmp; }
    pooldestroy(&PD1);
    // free B list
    while (B) { tmp = B->Next; B = tmp; }
    pooldestroy(&PD2);

The now-dead loops are then removed entirely:

  void processlist(list *L) {
    list *A, *B, *tmp;
    Pool PD1, PD2;
    poolcreate(&PD1, sizeof(list), 8);
    poolcreate(&PD2, sizeof(list), 8);
    splitclone(&PD1, &PD2, L, &A, &B);
    processPortion(A);  // Process first list
    processPortion(B);  // Process second list
    pooldestroy(&PD1);
    pooldestroy(&PD2);
  }

Outline
  • Introduction & Motivation
  • Automatic Pool Allocation Transformation
  • Pool Allocation-Based Optimizations
  • Pool Allocation & Optimization Performance Impact
  • Conclusion
PAOpts (1/4) and (2/4)
  • Selective Pool Allocation
    • Don’t pool allocate when not profitable
    • Avoids creating and destroying a pool descriptor (a minor saving), and avoids significant wasted space when the object is much smaller than the smallest internal page
  • PoolFree Elimination
    • Remove explicit de-allocations that are not needed
Looking closely: Anatomy of a heap
  • Fully general malloc-compatible allocator:
    • Supports malloc/free/realloc/memalign etc.
    • Standard malloc overheads: object header, alignment
    • Allocates slabs of memory with exponential growth
    • By default, all returned pointers are 8-byte aligned
  • In memory, things look like (16-byte allocs):

[Figure: heap layout of 16-byte allocations — each object carries a 4-byte object header and 4-byte padding for user-data alignment, shown against one 32-byte cache line]

PAOpts (3/4): Bump Pointer Optzn
  • If a pool has no poolfree’s:
    • Eliminate per-object header
    • Eliminate freelist overhead (faster object allocation)
  • Eliminates 4 bytes of inter-object padding
    • Pack objects more densely in the cache
  • Interacts with poolfree elimination (PAOpt 2/4)!
    • If poolfree elim deletes all frees, BumpPtr can apply

16-byte

user data

16-byte

user data

16-byte

user data

16-byte

user data

One 32-byte Cache Line
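The fast path this slide describes can be sketched in a few lines. poolalloc_bp below shares its name with the runtime interface listed earlier, but this body is an illustration (fixed region, refill path omitted), not the paper's implementation:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
  char *Cur;   /* next free byte */
  char *End;   /* end of the current slab */
} BumpPool;

/* Bump-pointer allocation: no per-object header, no freelist,
   just advance a cursor. */
static void *poolalloc_bp(BumpPool *P, size_t n) {
  n = (n + 7) & ~(size_t)7;        /* preserve 8-byte alignment */
  if (P->Cur + n > P->End)
    return 0;                      /* slab exhausted; refill omitted */
  void *Mem = P->Cur;
  P->Cur += n;
  return Mem;
}
```

Allocation is a compare, an add, and a pointer copy, which is why eliminating poolfree (and hence the freelist) makes object allocation noticeably faster.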

PAOpts (4/4): Alignment Analysis
  • Malloc must return 8-byte aligned memory:
    • It has no idea what types will be used in the memory
    • Some machines raise a bus error; others suffer performance penalties for unaligned accesses
  • Type-safe pools infer a type for the pool:
    • Use 4-byte alignment for pools we know don’t need it
    • Reduces inter-object padding

4-byte object header

16-byte

user data

16-byte

user data

16-byte

user data

16-byte

user data

One 32-byte Cache Line
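The padding saved by relaxing alignment is easy to quantify. The helper below rounds an object size up to the next multiple of the alignment (the standard power-of-two trick); the function name is illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to a multiple of align (align must be a power of two). */
static size_t padded_size(size_t size, size_t align) {
  return (size + align - 1) & ~(align - 1);
}
```

For example, a 12-byte object costs 16 bytes under 8-byte alignment but only 12 bytes under 4-byte alignment, so a type-known pool recovers 4 bytes of inter-object padding per object.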

Outline
  • Introduction & Motivation
  • Automatic Pool Allocation Transformation
  • Pool Allocation-Based Optimizations
  • Pool Allocation & Optimization Performance Impact
  • Conclusion
Implementation & Infrastructure
  • Link-time transformation using the LLVM Compiler Infrastructure
  • Uses the LLVM-to-C back-end; the resulting code is compiled with GCC 3.4.2 -O3
  • Evaluated on AMD Athlon MP 2100+
    • 64KB L1, 256KB L2

Most programs are 0% to 20% faster with pool allocation alone

Pool Allocation Speedup
  • Several programs unaffected by pool allocation (see paper)
  • Sizable speedups across many pointer-intensive programs
  • Some programs (ft, chomp) are an order of magnitude faster
Two are 10x faster, one is almost 2x faster.

Most are 5-15% faster with optimizations than with Pool Alloc alone

Pool Optimization Speedup (FullPA)

Baseline 1.0 = Run Time with Pool Allocation

  • Optimizations help all of these programs:
    • Despite being very simple, they make a big impact

Figure 9 shows pool-allocation time (with a different baseline).

One is 44% faster, the other is 29% faster.

Pool optimizations help some programs that pool allocation itself does not.

The pool optimizations' effect can be additive with the pool allocation effect.
Cache/TLB Miss Reduction

Miss rates measured with perfctr on an AMD Athlon 2100+ (Figure 10).

Sources of the reduction:
  • Defragmented heap
  • Reduced inter-object padding
  • Segregating the heap!
Pool Allocation Conclusions
  • Segregate heap based on points-to graph
    • Improved Memory Hierarchy Performance
    • Give compiler some control over layout
    • Give compiler information about locality
  • Optimize pools based on per-pool properties
    • Very simple (but useful) optimizations proposed here
    • Optimizations could be applied to other systems
Pool Allocation: Example

[Figure: DS graph with heap nodes P1 and P2]

Original code:

  list *makeList(int Num) {
    list *New = malloc(sizeof(list));
    New->Next = Num ? makeList(Num-1) : 0;
    New->Data = Num; return New;
  }

  int twoLists() {
    list *X = makeList(10);
    list *Y = makeList(100);
    GL = Y;
    processList(X);
    processList(Y);
    freeList(X);
    freeList(Y);
  }

After the transformation (X's nodes live in the local pool P1; Y's nodes escape through the global GL, so their pool P2 is passed in as an argument):

  list *makeList(int Num, Pool *P) {
    list *New = poolalloc(P, sizeof(list));
    New->Next = Num ? makeList(Num-1, P) : 0;
    New->Data = Num; return New;
  }

  int twoLists(Pool *P2) {
    Pool P1; poolinit(&P1);
    list *X = makeList(10, &P1);
    list *Y = makeList(100, P2);
    GL = Y;
    processList(X);
    processList(Y);
    freeList(X, &P1);
    freeList(Y, P2);
    pooldestroy(&P1);
  }

Calls to free become calls to poolfree, so explicit deallocation is retained.

Pool Specific Optimizations

Different Data Structures Have Different Properties

[Figure: one pool with a simple build → traverse → destroy allocation pattern vs. one with a complex allocation pattern; three lists ("list: HMR") whose nodes each hold a list* and an int]

  • Pool allocation segregates heap:
    • Roughly into logical data structures
    • Optimize using pool-specific properties
  • Examples of properties we look for:
    • Pool is type-homogenous
    • Pool contains data that only requires 4-byte alignment
    • Opportunities to reduce allocation overhead
Benchmarks
  • Pointer-intensive SPECINT 2000, Ptrdist, Olden, FreeBench suites
  • povray, espresso, fpgrowth, llu-bench, chomp
  • Benchmarks with custom allocators are not evaluated, except for parser, which the authors hand-modified.