# Profiling, Instrumentation, and Profile Based Optimization

Robert Cohn Robert.Cohn@compaq.com

Mark T. Vandevoorde

Introduction

Understanding the dynamic interaction between programs and processors

• What do programs do?
• How do processors perform?
• How can we make it faster?

Profiling Tutorial

What to do?

Build tools!

• Profiling
• Instrumentation
• Profile based optimization

The Big Picture

[Figure: diagram relating Sampling, Instrumentation, Profiling, Profile Based Optimization, Analysis, and Modeling]

Instrumentation
• User level view
• Executable editing

Code Instrumentation

[Figure: a box labeled TOOL with arrows pointing down into the application]

Trojan Horse

• Application appears unchanged
• Data collected as a side effect of execution

Instrumentation Example

Before instrumentation:

    if (b > c)
        t = 1;
    else
        b = 3;

After instrumentation:

    if (b > c) {
        bb[0]++;
        t = 1;
    } else {
        bb[1]++;
        b = 3;
    }
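The effect of the inserted counters can be tried out by hand. The following is a minimal sketch in plain C, not ATOM output: the function wrapper, the pointer parameters, and the input values are assumptions made for illustration; only the `bb[]` counters come from the slide.

```c
#include <assert.h>
#include <stdio.h>

/* One counter per instrumented basic block, as in the slide's bb[] array. */
static long bb[2];

/* The slide's branch with a counter bump at the top of each arm.
   The wrapper function and its parameters are illustrative only. */
static void instrumented_branch(int *b, int c, int *t) {
    if (*b > c) {
        bb[0]++;          /* instrumentation */
        *t = 1;
    } else {
        bb[1]++;          /* instrumentation */
        *b = 3;
    }
}

/* Run the branch over hypothetical inputs b = 0..9 with c = 6. */
static void run_demo(void) {
    int t = 0;
    for (int i = 0; i < 10; i++) {
        int b = i;
        instrumented_branch(&b, 6, &t);
    }
    /* Every execution bumps exactly one counter. */
    assert(bb[0] + bb[1] == 10);
    printf("then: %ld  else: %ld\n", bb[0], bb[1]);
}
```

Calling `run_demo()` from `main` collects the counts as a side effect of running the code, which is the "Trojan Horse" idea: the computed values of `b` and `t` are the same as in the uninstrumented version.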

Instrumentation Uses
• Profiles
• Model new hardware
  • What will this new branch predictor do?
  • What is the miss rate of this new cache?
• Optimization opportunities
  • find unnecessary loads and stores
  • find divides by 1

What Tool Does Instrumentation?
• Compiler
  • Compiler inserts extra operations
• Executable editor
  • Post-link tool inserts instrumentation code
  • No rebuild, source code not required
  • More difficult to relate back to source

Instrumentation Tools for Alpha
• All executable based
• General instrumentation:
  • Atom on Digital Unix
    • Distributed with Digital Unix
  • Ntatom on Windows NT
• Specialized tools based on above
  • hiprof, pixie, 3rd, ...

ATOM
• Tool for customized instrumentation
• User writes program that describes how to instrument application
• Instrumentation program applied to application, generates instrumented application
• Instrumented application is run
• Data is collected

User Supplies
• Instrumentation routines: user written program that inserts instrumentation
• calls to analysis routines
• Analysis routines: do the instrumentation work at runtime (e.g. count a basic block)

Atom Programming Model

[Figure: the hierarchy an ATOM tool iterates over. Objects: spice, libc.so, libm.so. Procedures: main(), Compute(), _exit(). Basic blocks: block1 to block5. Instructions: ldq r1, 8(sp); stq r2, 8(sp); bne r1, 0x1ffc40.]

• Objects (binary, shared library)
  • GetFirstObj, GetNextObj
• Procedures
  • GetFirstProc, GetNextProc
• Basic blocks
  • GetFirstBlock, GetNextBlock
• Instructions
  • GetFirstInst, GetNextInst

ATOM Instrumentation API: Interrogation
• GetObjInfo, GetProcInfo, GetBlockInfo, GetInstInfo
• IsBranchTarget
• GetInstRegUsage
• InstPC
• InstLineNo
• ...

ATOM Instrumentation API: Definition
• tells atom the types of the arguments for calls to analysis routines

ATOM Instrumentation API: Instrumentation
• Insert before or after

Arguments to analysis routines
• Constants
  • variables in instrumentation program, but constant at instrumentation point
  • e.g. uninstrumented PC, function name
• VALUE computed at runtime
  • effective address, branch taken predicate
• Register
  • r3, arguments, return value

Sample #1: Cache Simulator

Write a tool that computes the miss rate of the application running in a 64KB, direct mapped data cache with 32 byte lines.

    > atom spice cache.inst.o cache.anal.o -o spice.cache
    > spice.cache < ref.in > ref.out
    > more cache.out
    5,387,822,402 620,855,884 11.523%

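Cache Tool Implementation

The instrumentation inserts a call to Reference at each load and store, passing the effective address as a VALUE argument, and a call to PrintResults when the application finishes:

    main:
        clr t0
    loop:
        Reference(0(a0));     <- instrumentation
        ldl t2,0(a0)
        Reference(0(a0));     <- instrumentation
        stl t2,0(a0)
        bne t3,loop
        PrintResults();       <- instrumentation
        ret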

Cache Analysis File

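    #include <stdio.h>

    #define CACHE_SIZE  65536   /* 64KB direct mapped cache */
    #define BLOCK_SHIFT 5       /* 32 byte lines */

    long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;

    /* Called at every load and store with the effective address.
       The function header is missing from the transcript; this
       signature is reconstructed from the closing brace and the
       Reference(...) calls on the previous slide. */
    void Reference(long address) {
        /* Parentheses are required: >> binds tighter than & in C. */
        int index = (address & (CACHE_SIZE - 1)) >> BLOCK_SHIFT;
        long tag = address >> BLOCK_SHIFT;
        if (cache[index] != tag) { misses++; cache[index] = tag; }
        refs++;
    }

    /* Write references, misses, and miss rate to cache.out. */
    void Print() {
        FILE *file = fopen("cache.out", "w");
        fprintf(file, "%ld %ld %.2f\n", refs, misses, 100.0 * misses / refs);
        fclose(file);
    }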
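The analysis routine can be exercised without ATOM by feeding it addresses directly. This is a minimal sketch under stated assumptions: the address trace (two sweeps over a 128KB array of 8-byte words) is made up, the `valid` array is an addition that avoids a false hit on tag 0, and the lowercase names are hypothetical stand-ins for the slide's routines.

```c
#include <assert.h>
#include <stdio.h>

#define CACHE_SIZE  65536                 /* 64KB, from the slide */
#define BLOCK_SHIFT 5                     /* 32 byte lines */
#define NUM_LINES   (CACHE_SIZE >> BLOCK_SHIFT)

static long tags[NUM_LINES];
static char valid[NUM_LINES];             /* added: no false hit on tag 0 */
static long refs, misses;

/* Analysis routine: one call per simulated load or store. */
static void reference(unsigned long address) {
    unsigned long index = (address & (CACHE_SIZE - 1)) >> BLOCK_SHIFT;
    long tag = (long)(address >> BLOCK_SHIFT);
    if (!valid[index] || tags[index] != tag) {
        misses++;
        valid[index] = 1;
        tags[index] = tag;
    }
    refs++;
}

/* Hypothetical trace: sweep a 128KB array of 8-byte words twice. */
static void run_trace(void) {
    const unsigned long base = 0x10000000UL, bytes = 128 * 1024;
    for (int pass = 0; pass < 2; pass++)
        for (unsigned long off = 0; off < bytes; off += 8)
            reference(base + off);
    assert(misses <= refs);
    printf("%ld %ld %.2f%%\n", refs, misses, 100.0 * misses / refs);
}
```

Because the array is twice the cache size and the cache is direct mapped, each line is evicted before it is revisited, so only the three later words of each 32-byte line hit. Calling `run_trace()` from `main` therefore reports a 25% miss rate for this trace.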

Cache Instrumentation File

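    #include <stdio.h>
    #include <cmplrs/atom.inst.h>

    unsigned Instrument(int argc, char **argv, Obj *o) {
        Inst *i; Block *b; Proc *p;

        /* Visit every instruction of every basic block of every procedure. */
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
                for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i)) {
                    /* Loop body lost in the transcript. Per the
                       implementation slide, a call to Reference with the
                       effective address is inserted at each load and store. */
                }
    }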

Sample #2: Profiler

Write a tool that outputs the address of each basic block and the number of times it is executed.

Hello world

120001030 1

120001038 1

12000103c 1

120001058 33

120001064 1

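Profiler Tool Implementation

The instrumentation inserts a call to Init with the number of basic blocks and a call to Count at the start of each block. The arguments are constants:

    main:
        Init(3);      <- instrumentation
        Count(0);     <- instrumentation
        clr t0
    loop:
        Count(1);     <- instrumentation
        ldl t2,0(a0)
        stl t2,0(a0)
        bne t3,loop
        Count(2);     <- instrumentation
        ret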

Profiler: prof.anal.c

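    #include <stdio.h>
    #include <stdlib.h>   /* malloc */
    #include <string.h>   /* memset */

    long *counts;

    /* Allocate and zero one counter per basic block. */
    void Init(int nblocks) {
        counts = (long *)malloc(nblocks * sizeof(long));
        memset(counts, 0, nblocks * sizeof(long));
    }

    /* Called at the start of basic block number index. */
    void Count(int index) { counts[index]++; }

    /* Write each block's address and execution count. */
    void Print(long *blocks, int nblocks) {
        int i; FILE *file = fopen("prof.out", "w");
        for (i = 0; i < nblocks; i++)
            fprintf(file, "%lx %ld\n", blocks[i], counts[i]);
        fclose(file);
    }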

Profiler: prof.inst.c

    #include <stdio.h>
    #include <stdlib.h>   /* malloc */
    #include <cmplrs/atom.inst.h>

    void CallInitPrint(long *addresses, int nblocks);

    void Instrument(int argc, char **argv, Obj *o) {
        Block *b; Proc *p; int index = 0;
        int nblocks = GetObjInfo(o, ObjNumberBlocks);
        long *addresses = (long *)malloc(nblocks * sizeof(long));

        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b)) {
                /* Loop body lost in the transcript. Per the implementation
                   slide, each block gets a call to Count with its index. */
            }
    }

Profiler: prof.inst.c

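    void CallInitPrint(long *addresses, int nblocks)
    {
        char buffer[100];

        /* Build the prototype string for Print; the array length is nblocks.
           The original call omitted the argument for %d. */
        sprintf(buffer, "Print(const stable int[%d],int)", nblocks);
        /* The remaining calls on this slide are lost in the transcript. */
    }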

Executable editors
• Input: executable, output: executable
• Instrument, optimize, translate
• Executable = image = binary = shared library = shared object = dynamically linked library (DLL)
• Executable editor, executable optimizer, binary rewriter, binary translator, post link optimizer

Executable Editing
• Insert/delete/reorder instructions and data
• Obstacle to modification
• Registers are bound

Obstacles

Source:

    if (a) a = b;

Compiled code:

    beq r1,+2
    ldl r1,0x1000

Instrumentation to insert at the load:

    lda a0,0x1000
    bsr Reference

• Is a0 free?

Phases

1. Decompose

2. Build IR

3. Insert instrumentation

4. Convert IR to executable

1. Decompose Executable

[Figure: an executable decomposed into program code and data (Text (code), Data, Rdata) and metadata (Exception Info, Relocations, Debug)]

Decompose
• Break executable into units
  • unit: minimum data that must be kept together
  • code: unit is instruction
  • data: unit is data section
    • alternative: unit is data item

2. Build Internal Representation

[Figure: the internal representation holds an instruction list (e.g. beq), data sections (Data, Sdata), Exception Info, and Relocations]

Intermediate Representation
• Similar to compiler
• except unstructured, untyped data
• 1 to 1 mapping for IR and machine instructions
• Base representation should be compact
• fit in physical memory
• initial/final phases do multiple passes
• Representations built/thrown away for procedures

Data:

    1
    2
    0x12345678
    3

Code:

    br +4
    ldah r0,0x1234
    lda r0,0x5678(r0)

Begin: 0x12345678

End: 0x12345680

• No translation
• Dynamic translation
• Static translation

No translation
• Leave code and data at same address

Original:

        beq r1,L2
        ldl r1,0x1234
    L2:

Instrumented (the load is replaced by a branch to out-of-line code):

        beq r1,L2
        br L1
    L2:
        ...
        ...
    L1:
        lda a0,0x1234
        bsr Reference
        ldl r1,0x1234
        br L2

Dynamic translation
• Image has map of old->new address
• Better:
  • Do PC relative branches statically
  • Keep data section at original address
• Still: indirect calls and jumps (not returns)

Static translation
• Address computation is altered for new layout
• Determine what they point to:
• unit, offset
• Insert instrumentation

• combine separately compiled objects
• unit is section of object (data, text)
• unit is entire image
• Use relocations

Relocations

Data:

    1
    2
    0x12345678
    3

Code:

    br +4                   No relocation required
    ldl r1,10(gp)           May require relocation
    ldah r0,0x1234          Requires relocation
    lda r0,0x5678(r0)

Relocation example:

    type: ldah literal
    object: 0x12345670
    external:

• example: procedure begin, procedure end
• implicit in structure of data
• example: literal address in data section
• use relocations
• example: pc relative branch, offset for base pointer
• may not need adjustment, usually no relocation

• Address and Address + Offset point to same unit: ok, unit moved as a unit
• Example:

    a->field1               ar[4]
    ldl r0,field1(a)        ldl r0,16(ar)

• Offset spans multiple units
• example:

Jump table:

    base:
        br l1
        br l2
        br l3
        br l4

PC relative branch:

    br +4

Must be 1 unit

Map address to unit and offset

• in code: interpret instructions

    br +4
    ldah r0,0x1234
    lda r0,0x5678(r0)

• in data: data is address

    .data
    0x12345678

Map address to unit and offset

• to code: pointer to instruction
• to data: data section and offset
• alternative: data item and offset
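Mapping an address to a unit and offset, then rebuilding it for the new layout, can be sketched as follows. This is an illustrative sketch only: the `Unit` record, the function names, and the example layout are hypothetical and do not come from any particular executable editor.

```c
#include <assert.h>

/* Hypothetical unit descriptor: where a unit sat in the original image,
   how big it is, and where the rewritten image places it. */
typedef struct {
    unsigned long old_start;
    unsigned long size;
    unsigned long new_start;
} Unit;

/* Binary search for the unit containing addr in a table sorted by
   old_start. Returns its index, or -1 if addr falls in no unit. */
static int find_unit(const Unit *units, int n, unsigned long addr) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (addr < units[mid].old_start)
            hi = mid - 1;
        else if (addr - units[mid].old_start >= units[mid].size)
            lo = mid + 1;
        else
            return mid;
    }
    return -1;
}

/* Split an original address into (unit, offset) and rebuild it from the
   unit's new position. Unmapped addresses are returned unchanged. */
static unsigned long translate(const Unit *units, int n, unsigned long addr) {
    int u = find_unit(units, n, addr);
    if (u < 0)
        return addr;
    unsigned long offset = addr - units[u].old_start;
    assert(offset < units[u].size);
    return units[u].new_start + offset;
}

/* Example layout: inserted code grew the text, so data moved up by 0x2000. */
static const Unit example_units[] = {
    { 0x12340000UL, 0x4000UL, 0x12340000UL },   /* text */
    { 0x12344000UL, 0x2000UL, 0x12346000UL },   /* data section */
};
```

With this layout, `translate(example_units, 2, 0x12345678UL)` yields `0x12347678`: the address lies at offset 0x1678 in the data unit, and the unit moves as a whole, so the offset is preserved.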

3. Insert Instrumentation

[Figure: the internal representation during instrumentation: instruction list (e.g. beq), data sections (Data, Sdata, Ndata), Exception Info, and Relocations]

• Instrumentation requires free registers
• wrapper routine saves and restores registers

Instrumented code:

    beq r1,+2
    save registers
    lda a0,0x1000
    bsr ra,wrapper
    restore registers
    ldl r1,0(r2)

wrapper:

    Save registers on stack
    bsr ra,Reference
    Restore registers
    return

• Local/global/interprocedural analysis finds free registers

4. Convert IR to Executable

[Figure: the rewritten executable: program code and data (Text, Data, Rdata, Ndata) and metadata (Exception Info, Relocations, Debug)]

### Profile Based Optimization

Profile based optimization
• Collect profile information
• example: how often basic blocks are executed
• Use profile to guide optimization
• example: inlining

Profile based Optimization
• Available on Alpha, MIPS, PA, PPC, Sparc, x86
• Used in compilers and executable optimizers
• Used for SPEC results, and in products, too

Speedup from code layout

User level view

Compiler:

• Compile
• Instrument
• Run scenario1
• Run scenario2
• Merge profiles
• Recompile

Executable optimizer:

• Instrument
• Run scenario1
• Run scenario2
• Merge profiles
• Optimize

Optimization’s sensitivity to training data
• Experience with varying training
• Some training sets are better than others
• Can find one or a combination that gives best results in all scenarios
• Sometimes requires tuning of optimizations

Types of optimizations
• Enhance conventional optimization with weights based on profile
• Transformations driven by profile info
• Examples
• Register allocation
• Code layout
• Inlining

Register allocation

Source:

    while (a) {
        if (a > 3)
            b++;
        else
            c++;
        a--;
    }

Generated code:

    top:
        cmpgt a,3,t0
        brfalse t0,then
        br join
    then:
        ldl t0,c
        stl t0,c
    join:
        subl a,1,a
        brtrue a,top

• a, b, and c live for entire loop
• Should b or c get the last register?
• Information: block counts
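One way block counts can settle the question is to weight each candidate by its dynamic reference count. The sketch below assumes a made-up profile for a loop entered with a = 100; the `Candidate` record and function names are hypothetical, and a production allocator would weigh many more factors.

```c
#include <assert.h>

/* Hypothetical candidate: a variable plus the profiled execution count
   of the block that references it. */
typedef struct {
    const char *name;
    long block_count;     /* times the referencing block ran */
    int refs_in_block;    /* static references inside that block */
} Candidate;

/* Weight = dynamic references = block count x static references. */
static long weight(const Candidate *c) {
    return c->block_count * c->refs_in_block;
}

/* Return the index of the candidate that saves the most memory accesses. */
static int pick_for_register(const Candidate *cands, int n) {
    assert(n > 0);
    int best = 0;
    for (int i = 1; i < n; i++)
        if (weight(&cands[i]) > weight(&cands[best]))
            best = i;
    return best;
}

/* Made-up profile for the loop above, entered with a = 100:
   a > 3 holds for a = 100..4 (97 iterations), so c++ runs 3 times. */
static const Candidate loop_vars[] = {
    { "b", 97, 2 },   /* b++ kept in memory costs one load and one store */
    { "c",  3, 2 },
};
```

Under this profile `pick_for_register(loop_vars, 2)` selects `b`, which matches the generated code above, where only `c` is loaded and stored.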

Code layout: Reduce the number of taken branches
• Greedy algorithm, lay out common paths sequentially
• Information:
  • flow edge counts

[Figure: a flow graph of blocks 1 to 7 shown in two layouts, with flow edge counts 60/40 and 45/55]
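A greedy layout driven by flow edge counts can be sketched as bottom-up chaining: visit edges from hottest to coldest and make the destination the fall-through successor of the source when both ends are still free. This is one common scheme, not necessarily the algorithm behind the slide; the graph, the counts, and all names are assumptions, and blocks are numbered from 0.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_BLOCKS 16

typedef struct { int src, dst; long count; } Edge;

static int by_count_desc(const void *a, const void *b) {
    long ca = ((const Edge *)a)->count, cb = ((const Edge *)b)->count;
    return (ca < cb) - (ca > cb);
}

/* Greedy chaining. Block 0 is the entry and always stays first.
   Writes the layout to order[] and returns the number of blocks placed. */
static int layout(Edge *edges, int nedges, int nblocks, int *order) {
    int next[MAX_BLOCKS], prev[MAX_BLOCKS];
    assert(nblocks <= MAX_BLOCKS);
    for (int i = 0; i < nblocks; i++) next[i] = prev[i] = -1;

    qsort(edges, nedges, sizeof *edges, by_count_desc);
    for (int e = 0; e < nedges; e++) {
        int s = edges[e].src, d = edges[e].dst;
        /* Skip if src already falls through somewhere, dst already has a
           layout predecessor, or the link would displace the entry. */
        if (s == d || d == 0 || next[s] != -1 || prev[d] != -1) continue;
        /* Walk to the tail of d's chain; reaching s means a cycle. */
        int t = d;
        while (next[t] != -1) t = next[t];
        if (t == s) continue;
        next[s] = d;
        prev[d] = s;
    }

    /* Emit chains, starting with the one headed by the entry block. */
    int n = 0;
    for (int head = 0; head < nblocks; head++) {
        if (prev[head] != -1) continue;      /* not the start of a chain */
        for (int b = head; b != -1; b = next[b])
            order[n++] = b;
    }
    return n;
}

/* Made-up flow graph: two diamonds in sequence. */
static Edge example_edges[] = {
    {0, 1, 60}, {0, 2, 40}, {1, 3, 60}, {2, 3, 40},
    {3, 4, 45}, {3, 5, 55}, {4, 6, 45}, {5, 6, 55},
};
```

For this graph the resulting order is 0 1 3 5 6 2 4: the hot path falls through from block to block, and the two cold blocks are moved out of line where they cost a taken branch only on the less frequent paths.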

Inlining

[Figure: call graph of RtnA, RtnB, RtnC, and RtnD with call edge counts 1000, 2, and 0]

• Probably no advantage to inline RtnD into RtnA
• RtnB is almost always called from RtnA
  • thus no cache penalty for inlining
• Information: Call edge counts
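The reasoning about call edge counts can be turned into a simple decision rule. The sketch below is an assumption-laden illustration: the thresholds are arbitrary, the `CallEdge` record is hypothetical, and the assignment of the counts 1000, 2, and 0 to particular edges is only one possible reading of the figure.

```c
#include <assert.h>

/* Hypothetical call edge record from a profile. */
typedef struct {
    const char *caller, *callee;
    long count;               /* times this call site ran */
    long callee_total_calls;  /* calls to callee from all sites */
} CallEdge;

/* Inline when the site is hot and accounts for nearly all calls to the
   callee, so the out-of-line copy is rarely executed and the inlined
   copy does not compete with it for cache space. */
static int should_inline(const CallEdge *e) {
    const long min_count = 100;              /* arbitrary hotness threshold */
    assert(e->count <= e->callee_total_calls);
    if (e->count < min_count)
        return 0;
    /* Require at least 90% of all calls, using integer arithmetic. */
    return e->count * 10 >= e->callee_total_calls * 9;
}
```

Reading the figure as RtnA calling RtnB 1000 times out of 1002 total calls, the rule inlines RtnB into RtnA; an edge from RtnA to RtnD with a count of 0 is rejected by the hotness threshold.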

Information to drive optimization

Basic:

• basic block counts
• flow edge counts
• call edge counts

• path profiles
• cache misses
• branch mispredicts

Computing basic block counts
• Instrumentation
• Use atom tool
• Use 64 bit integers
• Sampling

Computing call edges

[Figure: call graph of rtna, rtnb, rtnc, and rtnd]

PC relative call: call edge count is the same as the basic block count

    rtna:
        move 1,a0
        move 2,a1
        bsr rtnb

Indirect call: keep hash table of targets and counts

    rtna:
        move 1,a0
        move 2,a1
        ldl r0,20(t0)
        jsr r0
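The per-site hash table for indirect calls might look like the following. This is a minimal sketch, assuming a fixed-size table with linear probing; the type and function names are hypothetical, and a real tool would size the table dynamically.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 64              /* power of two; fixed for the sketch */

typedef struct {
    uint64_t target;               /* callee address, 0 = empty slot */
    long long count;
} Slot;

typedef struct { Slot slots[TABLE_SIZE]; int used; } CallSiteTable;

/* Hash: drop the two alignment bits, then mask to the table size. */
static unsigned slot_for(uint64_t target) {
    return (unsigned)((target >> 2) & (TABLE_SIZE - 1));
}

/* Analysis routine for one indirect call site: bump the count for the
   target that is about to be called. Linear probing on collision. */
static void record_call(CallSiteTable *t, uint64_t target) {
    assert(target != 0);
    unsigned i = slot_for(target);
    while (t->slots[i].target != 0 && t->slots[i].target != target)
        i = (i + 1) & (TABLE_SIZE - 1);
    if (t->slots[i].target == 0) {
        assert(t->used < TABLE_SIZE - 1);  /* keep one slot empty so probes end */
        t->slots[i].target = target;
        t->used++;
    }
    t->slots[i].count++;
}

/* Look up the call edge count from this site to target. */
static long long calls_to(const CallSiteTable *t, uint64_t target) {
    unsigned i = slot_for(target);
    while (t->slots[i].target != 0) {
        if (t->slots[i].target == target)
            return t->slots[i].count;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}
```

A zero-initialized `CallSiteTable` is an empty table. Each entry `(target, count)` is one call edge from the instrumented site, so the counts can be written to the profile alongside the edges of direct calls.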

Computing flow edge counts from basic block counts
• Basic block count = Σ incoming edges = Σ outgoing edges
• Exceptions, longjmp/setjmp are implicit edges
• Tolerate inconsistencies

[Figure: flow graph with counts 10, 30, 10 and 10, 20, 10]

Computing flow edge counts from basic block counts
• Some graphs have multiple solutions
• Guess!
• Instrument edges
• Instrument minimum number of blocks and edges

Source:

    while (a) a--;

Generated code:

        bzero a,skip
    top:
        subl a,1,a
        bnzero a,top
    skip:

[Figure: two solutions for the same basic block counts (10, 20, 10): one with edge counts 1, 9, 19, 1 and one with edge counts 9, 1, 11, 9]
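The ambiguity can be checked mechanically. The sketch below encodes the flow graph of `while (a) a--;` with block counts 10 (entry), 20 (top), and 10 (skip) and tests edge assignments against flow conservation. The edge names and the reading of the figure's numbers as the assignments (1, 9, 19, 1) and (9, 1, 11, 9) are assumptions.

```c
#include <assert.h>

/* Edges of the loop's flow graph:
   entry->top, entry->skip, top->top (back edge), top->skip. */
typedef struct { long entry_top, entry_skip, top_top, top_skip; } EdgeCounts;

/* Flow conservation: each block count equals the sum of its incoming
   edges and the sum of its outgoing edges. */
static int consistent(const EdgeCounts *e, long entry, long top, long skip) {
    return e->entry_top + e->entry_skip == entry     /* out of entry */
        && e->entry_top + e->top_top == top          /* into top */
        && e->top_top + e->top_skip == top           /* out of top */
        && e->entry_skip + e->top_skip == skip;      /* into skip */
}

/* Count the non-negative assignments that satisfy conservation alone.
   x = 0 passes the equations even though top could then never be
   entered; a real solver would also reject that case. */
static int count_solutions(long entry, long top, long skip) {
    assert(entry >= 0 && top >= 0 && skip >= 0);
    int n = 0;
    for (long x = 0; x <= entry; x++) {              /* x = entry->top */
        EdgeCounts e = { x, entry - x, top - x, x };
        if (e.top_top >= 0 && consistent(&e, entry, top, skip))
            n++;
    }
    return n;
}
```

Both assignments from the figure pass `consistent`, and `count_solutions(10, 20, 10)` finds 11 assignments in total, so block counts alone leave one degree of freedom here: how many of the 10 entries actually went into the loop.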

Computing flow edge counts from basic block counts
• Spanning tree algorithm
• given flow graph, costs, finds lowest cost set of instrumentation points
• costs derived from static analysis or earlier runs
• Read Ball and Larus for details
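The core of the spanning tree idea can be sketched with Kruskal's algorithm: treat the flow graph as undirected, keep the most expensive edges in a spanning tree, and place counters only on the edges left out, since the counts of tree edges follow from conservation. This is a simplified illustration, not the published algorithm in full; the graph, the weights, the virtual exit-to-entry edge, and all names are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_NODES 32

typedef struct { int src, dst; long weight; int instrument; } FlowEdge;

static int parent[MAX_NODES];

/* Union-find lookup with path halving. */
static int find(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

static int heavier_first(const void *a, const void *b) {
    long wa = ((const FlowEdge *)a)->weight, wb = ((const FlowEdge *)b)->weight;
    return (wa < wb) - (wa > wb);
}

/* Build a maximum-weight spanning tree; every edge that would close a
   cycle gets a counter. Returns the number of counters needed. */
static int choose_counters(FlowEdge *edges, int nedges, int nnodes) {
    assert(nnodes <= MAX_NODES);
    for (int i = 0; i < nnodes; i++) parent[i] = i;
    qsort(edges, nedges, sizeof *edges, heavier_first);
    int counters = 0;
    for (int e = 0; e < nedges; e++) {
        int a = find(edges[e].src), b = find(edges[e].dst);
        if (a != b) {
            parent[a] = b;              /* tree edge: count derived later */
            edges[e].instrument = 0;
        } else {
            edges[e].instrument = 1;    /* closes a cycle: needs a counter */
            counters++;
        }
    }
    return counters;
}

/* Made-up diamond: 0 = entry, 1 = then, 2 = else, 3 = join, plus a
   virtual join->entry edge weighted so it is never instrumented. */
static FlowEdge diamond[] = {
    {0, 1, 90, 0}, {0, 2, 10, 0}, {1, 3, 90, 0}, {2, 3, 10, 0},
    {3, 0, 100, 0},
};
```

For the diamond, two counters suffice instead of one per block, and their combined estimated cost is 100 executions (one edge on the hot side, one on the cold side). The weights would come from static analysis or from an earlier run, as the slide notes.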

Instrumenting flow edges
• ATOM: branch taken value can be passed to analysis routine
• branch not taken: insert call to count after conditional branch
• taken branch, indirect jump: insert new basic block between branch and target

Merging multiple profiles
• Multiple runs generate multiple profiles, how do we combine them?
• Should the profiles be weighted equally?
• User defined
• Scale so that sums are equal
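Scaling so that the sums are equal can be done without floating point by cross-multiplying with the other run's total. This is a minimal sketch for two profiles over the same blocks; the function name is hypothetical, and the counts are assumed small enough that the 64-bit products do not overflow.

```c
#include <assert.h>

/* Merge two profiles over the same n blocks so that each run contributes
   the same total weight: scale a by sum(b) and b by sum(a), then add. */
static void merge_scaled(const long long *a, const long long *b,
                         long long *out, int n) {
    long long sum_a = 0, sum_b = 0;
    for (int i = 0; i < n; i++) {
        sum_a += a[i];
        sum_b += b[i];
    }
    assert(sum_a > 0 && sum_b > 0);
    for (int i = 0; i < n; i++)
        out[i] = a[i] * sum_b + b[i] * sum_a;
}
```

For example, merging a short run {10, 90} with a long run {500, 500} gives {60000, 140000}. Without scaling, the long run would dominate and the merged profile would look almost evenly split.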

Using profiles
• Edge, block counts are in database
• For each procedure, compiler locates counts in database and copies them to IR
• Every flow edge, call edge, block labeled with execution count
• Optimizations that modify flow graph must update profile information

IR/Profiled program mismatch

[Figure: the profiled Executable compared with the compiler IR]
• Does the flow graph of the program you profiled match the flow graph in the compiler IR?
• Optimization
• Code generation
• Usually ok if you disable optimization
• Not a problem for executable optimizers

Persistence
• If the program is modified, can you use an old profile?
• Generating a profile can be difficult and time consuming
• Don’t hold up build process generating a new profile every time

Usability
• Make it easy or no one will use it
• Limited changes to build process
• Limited opportunities for user to mess up

Profile based optimization nirvana
• Profile any build
• tolerate IR/profiled program mismatches
• No instrumentation step
• Low cost profiling, < 5%
• No restructuring of makefile
• Big speedup!

DCPI

Tools for profile based optimization
• Unix
  • cc, f77
  • om: executable optimizer called from cc
  • cord: user specified procedure ordering
• NT
  • scc: calls Visual C
  • spike: executable optimizer
  • link /order: user specified procedure ordering
  • wstune generates ordering
