
Profiling, Instrumentation, and Profile Based Optimization

Robert Cohn Robert.Cohn@compaq.com

Mark T. Vandevoorde

Introduction

Understanding the dynamic interaction between programs and processors

  • What do programs do?
  • How do processors perform?
  • How can we make it faster?

Profiling Tutorial

What to do?

Build tools!

  • Profiling
  • Instrumentation
  • Profile based optimization

The Big Picture

Sampling

Instrumentation

Profiling

Profile Based Optimization

Analysis

Modeling

Instrumentation
  • User level view
  • Executable editing

Code Instrumentation

Trojan Horse

  • Application appears unchanged
  • Data collected as a side effect of execution

Instrumentation Example

Original code:

    if (b > c)
        t = 1;
    else
        b = 3;

  • Add extra code

Instrumented code:

    if (b > c) {
        bb[0]++;
        t = 1;
    } else {
        bb[1]++;
        b = 3;
    }
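
To see the counters at work, the instrumented fragment can be wrapped in a function and run over a few inputs. This is only a sketch: the function names, the return value, and the input pairs are made up for illustration and are not part of the slides.

```c
#include <stdio.h>

/* One counter per basic block, as in the slide's bb[] array. */
long bb[2];

/* The slide's instrumented fragment, wrapped in a function (hypothetical
   wrapper) so that it can be executed several times. */
int instrumented(int b, int c)
{
    int t = 0;
    if (b > c) {
        bb[0]++;        /* inserted: count the "then" block */
        t = 1;
    } else {
        bb[1]++;        /* inserted: count the "else" block */
        b = 3;
    }
    return t + b;       /* the result does not depend on the counters */
}

/* Run the fragment over a few made-up inputs and print the block counts. */
void run_example(void)
{
    int inputs[4][2] = { {5, 1}, {2, 9}, {7, 3}, {0, 0} };
    for (int i = 0; i < 4; i++)
        instrumented(inputs[i][0], inputs[i][1]);
    printf("then: %ld, else: %ld\n", bb[0], bb[1]);
}
```

Calling run_example() from main prints how often each branch arm ran; the application's own result is unchanged, which is the "Trojan horse" property described earlier.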

Instrumentation Uses
  • Profiles
  • Model new hardware
    • What will this new branch predictor do?
    • What is the miss rate of this new cache?
  • Optimization opportunities
    • find unnecessary loads and stores
    • find divides by 1

What Tool Does Instrumentation?
  • Compiler
    • Compiler inserts extra operations
    • Requires recompile, access to source code
  • Executable editor
    • Post-link tool inserts instrumentation code
    • No rebuild, source code not required
    • More difficult to relate back to source

Instrumentation Tools for Alpha
  • All executable based
  • General instrumentation:
    • Atom on Digital Unix
      • Distributed with Digital Unix
    • Ntatom on Windows NT
      • New! Download from web
  • Specialized tools based on above
    • hiprof, pixie, 3rd, ...

ATOM
  • Tool for customized instrumentation
  • User writes program that describes how to instrument application
  • Instrumentation program applied to application, generates instrumented application
  • Instrumented application is run
  • Data is collected

User Supplies
  • Instrumentation routines: user written program that inserts instrumentation
    • calls to analysis routines
  • Analysis routines: do the instrumentation work at runtime (e.g. count a basic block)


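Atom Programming Model

Iterate over the program as a hierarchy, one level at a time:

  • Objects: spice, libc.so, libm.so
  • Procedures: main(), Compute(), _exit()
  • Basic blocks: block1, block2, block3, block4, block5
  • Instructions:

    ldq r1, 8(sp)
    addq r1, 0x1, r2
    stq r2, 8(sp)
    bne r1, 0x1ffc40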

ATOM Instrumentation API: Navigation
  • Objects (binary, shared library)
    • GetFirstObj, GetNextObj
  • Procedures
    • GetFirstProc, GetNextProc
  • Basic blocks
    • GetFirstBlock, GetNextBlock
  • Instructions
    • GetFirstInst, GetNextInst

ATOM Instrumentation API: Interrogation
  • GetObjInfo, GetProcInfo, GetBlockInfo, GetInstInfo
  • IsBranchTarget
  • GetInstRegUsage
  • InstPC
  • InstLineNo
  • ...

ATOM Instrumentation API: Definition
  • AddCallProto
    • tells atom the types of the arguments for calls to analysis routines

ATOM Instrumentation API: Instrumentation
  • AddCallProgram, AddCallObj, AddCallProc, AddCallBlock, AddCallInst, ReplaceProcedure
  • Insert before or after

Arguments to analysis routines
  • Constants
    • variables in instrumentation program, but constant at instrumentation point
    • e.g. uninstrumented PC, function name
  • VALUE computed at runtime
    • effective address, branch taken predicate
  • Register
    • r3, arguments, return value

Sample #1: Cache Simulator

Write a tool that computes the miss rate of the application running in a 64KB, direct mapped data cache with 32 byte lines.

> atom spice cache.inst.o cache.anal.o -o spice.cache

> spice.cache < ref.in > ref.out

> more cache.out

5,387,822,402 620,855,884 11.523%

Cache Tool Implementation

Application code, with the inserted instrumentation calls marked on the right:

main:
        clr t0
loop:
        Reference(0(a0));       <- instrumentation (VALUE argument)
        ldl t2,0(a0)
        addl t0,4,t0
        addl t2,0x10,t2
        Reference(0(a0));       <- instrumentation (VALUE argument)
        stl t2,0(a0)
        bne t3,loop
        ret

        PrintResults();         <- instrumentation, called when the program exits

Cache Analysis File

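#include <stdio.h>

#define CACHE_SIZE  65536       /* 64KB direct mapped cache */
#define BLOCK_SHIFT 5           /* 32 byte lines */

/* One tag per cache line, plus reference and miss counters. */
long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;

/* Called before every load and store with the effective address. */
void Reference(long address) {
    /* Parentheses are required: >> binds tighter than & in C. */
    int index = (address & (CACHE_SIZE - 1)) >> BLOCK_SHIFT;
    long tag = address >> BLOCK_SHIFT;
    if (cache[index] != tag) { misses++; cache[index] = tag; }
    refs++;
}

/* Called once when the program exits. */
void Print(void) {
    FILE *file = fopen("cache.out", "w");
    fprintf(file, "%ld %ld %.2f\n", refs, misses, 100.0 * misses / refs);
    fclose(file);
}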
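
The cache model itself does not depend on ATOM, so it can be tried outside an instrumented binary. The following is a minimal sketch under stated assumptions: the `sweep` driver and the lowercase names are hypothetical, and stored tags are offset by one so that a zero-initialized entry means "empty" (the slide's version instead treats tag 0 as already present).

```c
#include <stdio.h>

#define CACHE_SIZE  65536                       /* 64KB */
#define BLOCK_SHIFT 5                           /* 32 byte lines */
#define NUM_LINES   (CACHE_SIZE >> BLOCK_SHIFT)

long tags[NUM_LINES];   /* 0 means empty; stored tags are block number + 1 */
long refs, misses;

/* Simulate one data reference to a direct mapped cache. */
void reference(unsigned long address)
{
    unsigned long block = address >> BLOCK_SHIFT;   /* which 32 byte block */
    unsigned long index = block % NUM_LINES;        /* which cache line */
    long tag = (long)block + 1;
    if (tags[index] != tag) { misses++; tags[index] = tag; }
    refs++;
}

/* Hypothetical driver: touch `count` consecutive 4 byte words from `base`. */
void sweep(unsigned long base, int count)
{
    for (int i = 0; i < count; i++)
        reference(base + 4UL * (unsigned long)i);
}
```

For example, sweep(0, 16) touches two lines and misses twice; a following sweep(65536, 8) maps to line 0 again and evicts the first block, which is the conflict behavior a direct mapped cache is expected to show.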

Cache Instrumentation File

#include <stdio.h>
#include <cmplrs/atom.inst.h>

unsigned Instrument(int argc, char **argv, Obj *o) {
    Inst *i; Block *b; Proc *p;

    /* Declare the analysis routines and their argument types. */
    AddCallProto("Reference(VALUE)");
    AddCallProto("Print()");

    /* Write the results when the program exits. */
    AddCallProgram(ProgramAfter, "Print");

    /* Call Reference with the effective address before every load and store. */
    for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
        for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
            for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
                if (IsInstType(i, InstTypeLoad) || IsInstType(i, InstTypeStore))
                    AddCallInst(i, InstBefore, "Reference", EffAddrValue);
    return 0;
}

Sample #2: Profiler

Write a tool that outputs the address of each basic block and the number of times it is executed.

vssad-27> atom a.out prof.inst.c prof.anal.c

vssad-28> a.out.atom

Hello world

vssad-29> head prof.out

120001030 1

120001038 1

12000103c 1

120001058 33

120001064 1

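Profiler Tool Implementation

Application code, with the inserted instrumentation calls marked on the right:

        Init(3)                 <- instrumentation, called before the program starts
main:
        Count(0)                <- instrumentation (Constant argument)
        clr t0
loop:
        Count(1)                <- instrumentation (Constant argument)
        ldl t2,0(a0)
        addl t0,4,t0
        addl t2,0x10,t2
        stl t2,0(a0)
        bne t3,loop
        Count(2)                <- instrumentation (Constant argument)
        ret

        PrintResults(addresses,3);      <- instrumentation, called when the program exits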

Profiler: prof.anal.c

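#include <stdio.h>
#include <stdlib.h>     /* malloc */
#include <string.h>     /* memset */

long *counts;           /* one execution counter per basic block */

/* Called once before the program starts. */
void Init(int nblocks) {
    counts = (long *)malloc(nblocks * sizeof(long));
    memset(counts, 0, nblocks * sizeof(long));
}

/* Called at the start of each basic block. */
void Count(int index) { counts[index]++; }

/* Called once when the program exits. */
void Print(long *blocks, int nblocks) {
    int i;
    FILE *file = fopen("prof.out", "w");
    for (i = 0; i < nblocks; i++)
        fprintf(file, "%lx %ld\n", blocks[i], counts[i]);
    fclose(file);
}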

Profiler: prof.inst.c

#include <stdio.h>
#include <stdlib.h>     /* malloc */
#include <cmplrs/atom.inst.h>

void CallInitPrint(long *addresses, int nblocks);

void Instrument(int argc, char **argv, Obj *o) {
    Block *b; Proc *p; int index = 0;
    int nblocks = GetObjInfo(o, ObjNumberBlocks);
    long *addresses = (long *)malloc(nblocks * sizeof(long));

    CallInitPrint(addresses, nblocks);

    /* Record each block's address and count executions by block index. */
    for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
        for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b)) {
            addresses[index] = InstPC(GetFirstInst(b));
            AddCallInst(GetFirstInst(b), InstBefore, "Count", index++);
        }
}

Profiler: prof.inst.c

void CallInitPrint(long *addresses, int nblocks)
{
    char buffer[100];

    AddCallProto("Count(int)");
    AddCallProto("Init(int)");
    AddCallProgram(ProgramBefore, "Init", nblocks);

    /* The array length in the prototype is filled in from nblocks. */
    sprintf(buffer, "Print(const stable int[%d],int)", nblocks);
    AddCallProto(buffer);
    AddCallProgram(ProgramAfter, "Print", addresses, nblocks);
}

Executable editors
  • Input: executable, output: executable
  • Instrument, optimize, translate
  • Executable = image = binary = shared library = shared object = dynamically linked library (DLL)
  • Executable editor, executable optimizer, binary rewriter, binary translator, post link optimizer

Executable Editing
  • Insert/delete/reorder instructions and data
  • Obstacle to modification
    • Addresses are bound
    • Registers are bound

Obstacles

Source:

    if (a) a = b;

Compiled code:

        beq r1,+2
        ldl r1,0x1000

Instrumentation to insert:

        lda a0,0x1000
        bsr Reference

  • Is a0 free?
  • Adjust branch offsets
  • Adjust literal addresses

Phases

1. Decompose

2. Build IR

3. Insert instrumentation

4. Convert IR to executable

1. Decompose Executable

Executable:

  • Header
  • Program code & data
    • Text (code)
    • Data
    • Rdata
  • Metadata
    • Exception Info
    • Relocations
    • Debug

Decompose
  • Break executable into units
  • unit: minimum data that must be kept together
  • code: unit is instruction
  • data: unit is data section
    • alternative: unit is data item

2. Build Internal Representation

  • Instruction list: add, load, beq
  • Data sections: Data, Sdata
  • MetaData: Exception Info, Relocations

Intermediate Representation
  • Similar to compiler
    • except unstructured, untyped data
    • 1 to 1 mapping for IR and machine instructions
  • Base representation should be compact
    • fit in physical memory
      • initial/final phases do multiple passes
  • Representations built/thrown away for procedures

Bound addresses

Data:

1

2

0x12345678

3

Code:

br +4

ldah r0,0x1234

lda r0,0x5678(r0)

Metadata:

Begin: 0x12345678

End: 0x12345680

Adjusting addresses
  • No translation
  • Dynamic translation
  • Static translation

No translation
  • Leave code and data at same address

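Original code:

        beq r1,L2
        ldl r1,0x1234
L2:

Instrumented code (the load is replaced in place by a branch to out-of-line code, so nothing else moves):

        beq r1,L2
        br L1
L2:
        ...
        ...
L1:
        lda a0,0x1234
        bsr Reference
        ldl r1,0x1234
        br L2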

Dynamic translation
  • Address computation is unchanged
  • Image has map of old->new address
  • Code inserted to map old->new address at runtime for load/store/branch
  • Better:
    • Do PC relative branches statically
    • Keep data section at original address
    • Still: indirect calls and jumps (not returns)

Static translation
  • Address computation is altered for new layout
  • Find addresses
  • Determine what they point to:
    • unit, offset
  • Insert instrumentation
  • Adjust literals or offsets to compute new address of unit
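
As one concrete case of adjusting literals, consider the ldah/lda pair used elsewhere in these slides to build 0x12345678. On Alpha, ldah adds its 16 bit displacement shifted left by 16, and lda adds a sign-extended 16 bit displacement, so when the low half is 0x8000 or more the high half has to be bumped by one. The sketch below splits an address into the two literals and joins them again; the type and function names are hypothetical, and it assumes a non-negative address that fits in 31 bits.

```c
#include <stdint.h>

/* The two 16 bit literals of an ldah/lda pair (hypothetical type). */
typedef struct {
    int32_t high;   /* ldah displacement, scaled by 65536 */
    int32_t low;    /* lda displacement, sign-extended */
} LiteralPair;

/* Split an address so that high * 65536 + low == address. */
LiteralPair split_address(int64_t address)
{
    LiteralPair p;
    int64_t low = address & 0xffff;
    if (low >= 0x8000)
        low -= 0x10000;         /* lda sign-extends, so borrow from high */
    p.low = (int32_t)low;
    p.high = (int32_t)((address - low) / 65536);
    return p;
}

/* Recompute the address that the instruction pair produces. */
int64_t join_address(LiteralPair p)
{
    return (int64_t)p.high * 65536 + p.low;
}
```

An editor that moves a unit would compute the unit's new address, call split_address on it, and patch the two displacements; split_address(0x12345678) gives the 0x1234 and 0x5678 literals shown on the slides.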

Other tools that change addresses
  • Linker
    • combine separately compiled objects
    • adjust addresses based on assigned load address
    • unit is section of object (data, text)
  • Loader
    • Load address != link address for DLL
    • unit is entire image
  • Use relocations

Relocations

Data:

1

2

0x12345678

3

No relocation required

Code:

br +4

ldl r1,10(gp)

ldah r0,0x1234

lda r0,0x5678(r0)

May require relocation

Relocation example:

address: 0x200

type: ldah literal

object: 0x12345670

external:

Requires relocation

How to recognize addresses?
  • Metadata
    • example: procedure begin, procedure end
    • implicit in structure of data
  • Absolute addresses
    • example: literal address in data section
    • use relocations
  • Relative addresses: address offset
    • example: pc relative branch, offset for base pointer
    • may not need adjustment; usually no relocation

Relative Addresses
  • Address computed as offset of another address
  • Address and Address + Offset point to same unit: ok, unit moved as a unit
  • Example:

    a->field1      ldl r0,field1(a)
    ar[4]          ldl r0,16(ar)

Relative Addresses
  • Offset spans multiple units
  • example:

Jump table:

ad = base + i

jmp ad

base:

br l1

br l2

br l3

br l4

PC relative branch

br +4

Must be 1 unit

Map address to unit and offset

Reference -> address

  • in code: interpret instructions

br +4

ldah r0,0x1234

lda r0,0x5678(r0)

  • in data: data is address

.data

0x12345678

Map address to unit and offset

(relocation,address) -> (unit,offset)

  • to code: pointer to instruction
  • to data: data section and offset
    • alternative: data item and offset
  • offset = address - unit address
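
The mapping above can be sketched as a lookup over a table of units. This is an illustrative sketch only: the `Unit` structure, its fields, and the linear search are assumptions made for the example, not how any particular executable editor stores its units.

```c
#include <stddef.h>

/* A unit that moves as a whole (hypothetical layout record). */
typedef struct {
    unsigned long old_address;  /* where the unit was in the input image */
    unsigned long size;         /* size of the unit in bytes */
    unsigned long new_address;  /* where the unit lands in the output image */
} Unit;

/* Find the unit that contains an old address, or NULL if there is none. */
const Unit *find_unit(const Unit *units, size_t n, unsigned long address)
{
    for (size_t i = 0; i < n; i++)
        if (address >= units[i].old_address &&
            address - units[i].old_address < units[i].size)
            return &units[i];
    return NULL;
}

/* Map an old address to its new address: keep the offset, move the unit. */
unsigned long translate(const Unit *units, size_t n, unsigned long address)
{
    const Unit *u = find_unit(units, n, address);
    if (u == NULL)
        return address;         /* not an address this sketch manages */
    unsigned long offset = address - u->old_address;
    return u->new_address + offset;
}
```

With two units where the second one has been pushed back by 0x40 bytes of inserted code, an address 8 bytes into the second unit is still 8 bytes into it after translation.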

3. Insert Instrumentation

  • Instruction list: add, load, load, beq
  • Data sections: Data, Sdata, Ndata
  • MetaData: Exception Info, Relocations

Adding instrumentation code
  • Instrumentation requires free registers
    • wrapper routine saves and restores registers

Application:

        beq r1,+2
        save registers
        lda a0,0x1000
        bsr ra,wrapper
        restore registers
        ldl r1,0(r2)

wrapper (calls the analysis routine Reference):

        Save registers on stack
        bsr ra,Reference
        Restore registers
        return

  • Local/global/interprocedural analysis finds free registers

4. Convert IR to Executable

Executable:

  • Header
  • Program code & data
    • Text
    • Data
    • Rdata
    • Ndata
  • Metadata
    • Exception Info
    • Relocations
    • Debug

Profile based optimization
  • Collect profile information
    • example: how often basic blocks are executed
  • Use profile to guide optimization
    • example: inlining

Profile based Optimization
  • Available on Alpha, MIPS, PA, PPC, Sparc, x86
  • Used in compilers and executable optimizers
  • Used for Spec results and in products, too

Speedup from code layout

User level view

Compiler:

  • Compile
  • Instrument
  • Run scenario1
  • Run scenario2
  • Merge profiles
  • Recompile

Executable optimizer:

  • Instrument
  • Run scenario1
  • Run scenario2
  • Merge profiles
  • Optimize

Optimization’s sensitivity to training data
  • Experience with varying training
    • compiler, spreadsheet, CAD, Spec95
  • Some training sets are better than others
  • Can find one or a combination that gives best results in all scenarios
  • Sometimes requires tuning of optimizations

Types of optimizations
  • Enhance conventional optimization with weights based on profile
  • Transformations driven by profile info
  • Examples
    • Register allocation
    • Code layout
    • Inlining

Register allocation

Source:

    while (a) {
        if (a > 3)
            b++;
        else
            c++;
        a--;
    }

Generated code (b is in a register, c is in memory):

top:
        cmpgt a,3,t0
        brfalse t0,then
        addl b,1,b
        br join
then:
        ldl t0,c
        addl t0,1,t0
        stl t0,c
join:
        subl a,1,a
        brtrue a,top

  • a, b, and c live for entire loop
  • Should b or c get the last register?
  • Information: block counts
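
One way to turn block counts into an answer is to estimate how many memory operations a register would save for each candidate. The sketch below is an assumption-laden illustration: the `BlockUse` structure and the simple count-times-references weighting are made up for the example, and a real allocator would also account for spill code placement and live ranges.

```c
#include <stddef.h>

/* A variable's references inside one basic block (hypothetical record). */
typedef struct {
    long block_count;   /* how often the block executed, from the profile */
    int  references;    /* loads and stores avoided if kept in a register */
} BlockUse;

/* Estimated memory operations saved by giving the variable a register. */
long register_benefit(const BlockUse *uses, size_t n)
{
    long benefit = 0;
    for (size_t i = 0; i < n; i++)
        benefit += uses[i].block_count * uses[i].references;
    return benefit;
}
```

In the loop above, the variable left in memory costs one load and one store per increment. If a hypothetical profile says the b++ block ran 10 times and the c++ block ran 990 times, the benefit is 20 for b and 1980 for c, so c should get the last register.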

Code layout: Reduce the number of taken branches
  • Greedy algorithm, lay out common paths sequentially
  • Information:
    • flow edge counts

[Figure: a flow graph of blocks 1 through 7 shown in two layouts, with flow edge counts 60/40 and 45/55 at the branches]
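
The slides do not spell out the greedy algorithm, so the following is only one plausible sketch of "lay out common paths sequentially": visit flow edges from hottest to coldest and glue the source block's chain to the destination block's chain whenever the source ends one chain and the destination starts another. The names, the fixed MAX_BLOCKS limit, and the 0-based block numbering are assumptions for illustration.

```c
#include <stdlib.h>

#define MAX_BLOCKS 64   /* hypothetical limit for this sketch */

typedef struct { int src, dst; long count; } Edge;

/* qsort comparator: hottest edge first. */
int by_count_desc(const void *a, const void *b)
{
    long ca = ((const Edge *)a)->count, cb = ((const Edge *)b)->count;
    return (ca < cb) - (ca > cb);
}

/* Compute a block order in `order` (nblocks entries). Sorts `edges` in place. */
void layout(int nblocks, Edge *edges, int nedges, int *order)
{
    int next[MAX_BLOCKS], prev[MAX_BLOCKS], head[MAX_BLOCKS];
    for (int i = 0; i < nblocks; i++) { next[i] = prev[i] = -1; head[i] = i; }

    qsort(edges, (size_t)nedges, sizeof(Edge), by_count_desc);
    for (int e = 0; e < nedges; e++) {
        int s = edges[e].src, d = edges[e].dst;
        /* Link only if s ends a chain, d starts one, and the chains differ. */
        if (next[s] != -1 || prev[d] != -1 || head[s] == head[d])
            continue;
        next[s] = d;
        prev[d] = s;
        for (int b = d; b != -1; b = next[b])
            head[b] = head[s];
    }

    /* Emit chains in order of their first block's number. */
    int n = 0;
    for (int i = 0; i < nblocks; i++)
        if (prev[i] == -1)
            for (int b = i; b != -1; b = next[b])
                order[n++] = b;
}
```

On a graph with two back-to-back diamonds and edge counts 60/40 and 45/55, the hot path comes out as one fall-through chain and the two cold blocks are placed after it, so the common path executes no taken branches.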

Inlining

  • Probably no advantage to inline RtnD into RtnA
  • RtnB is almost always called from RtnA
    • thus no cache penalty for inlining
  • Information: Call edge counts

[Figure: call graph of RtnA, RtnB, RtnC, and RtnD with call edge counts 1000, 2, and 0]

Information to drive optimization

Basic:

  • basic block counts
  • flow edge counts
  • call edge counts

More advanced:

  • path profiles
  • cache misses
  • branch mispredicts

Computing basic block counts
  • Instrumentation
    • Use atom tool
    • Use 64 bit integers
  • Sampling

Computing call edges

PC relative call: call edge count is the same as the basic block count

rtna:
        move 1,a0
        move 2,a1
        bsr rtnb

Indirect call: keep hash table of targets and counts

rtna:
        move 1,a0
        move 2,a1
        ldl r0,20(t0)
        jsr r0

[Figure: call graph of rtna, rtnb, rtnc, and rtnd]
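
A hash table of targets and counts for an indirect call could look roughly like the sketch below. It is an assumption-heavy illustration: the table size, the open-addressing scheme, and the function names are made up, and a real tool would likely keep one table per call site (or key on the call site as well) rather than a single global table.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256  /* power of two; hypothetical capacity */

typedef struct { uint64_t target; long count; } CallEntry;

CallEntry table[TABLE_SIZE];    /* count == 0 marks an empty slot */

/* Analysis routine sketch: called before an indirect call with its target. */
void count_indirect_call(uint64_t target)
{
    /* Drop the low two bits: Alpha instructions are 4 byte aligned. */
    size_t slot = (size_t)(target >> 2) & (TABLE_SIZE - 1);
    for (size_t probe = 0; probe < TABLE_SIZE; probe++) {
        CallEntry *e = &table[(slot + probe) & (TABLE_SIZE - 1)];
        if (e->count == 0 || e->target == target) {
            e->target = target;
            e->count++;
            return;
        }
    }
    /* Table full: this sketch simply drops the sample. */
}

/* Look up how many calls reached a given target. */
long calls_to(uint64_t target)
{
    size_t slot = (size_t)(target >> 2) & (TABLE_SIZE - 1);
    for (size_t probe = 0; probe < TABLE_SIZE; probe++) {
        const CallEntry *e = &table[(slot + probe) & (TABLE_SIZE - 1)];
        if (e->count == 0)
            return 0;
        if (e->target == target)
            return e->count;
    }
    return 0;
}
```

Each distinct target then yields one call edge from the calling block, weighted by its count.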

Computing flow edge counts from basic block counts
  • Basic block count = Σ incoming edges = Σ outgoing edges
  • Exceptions, longjmp/setjmp are implicit edges
  • Tolerate inconsistencies

[Figure: a basic block with count 30 and flow edges labeled 10, 10, 10, 10, and 20]

Computing flow edge counts from basic block counts
  • Some graphs have multiple solutions
  • Guess!
  • Instrument edges
  • Instrument minimum number of blocks and edges

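    while (a) a--;

        bzero a,skip
top:    subl a,1,a
        bnzero a,top
skip:

Two solutions for same bb count (block counts: entry 10, top 20, skip 10):

    Edge              Solution 1    Solution 2
    entry -> top           1             9
    entry -> skip          9             1
    top -> top            19            11
    top -> skip            1             9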
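
The ambiguity can be checked mechanically: a set of edge counts is acceptable if every block's count equals the sum of its incoming edges and the sum of its outgoing edges. The checker below is a sketch; the `FlowEdge` type, the function name, and the convention of skipping the incoming check for the entry block and the outgoing check for the exit block are assumptions for this example.

```c
typedef struct { int src, dst; long count; } FlowEdge;

/* Return 1 if the edge counts agree with the block counts, else 0.
   The entry block's incoming edges and the exit block's outgoing edges
   lie outside the fragment, so those two sums are not checked. */
int consistent(const long *block_count, int nblocks,
               const FlowEdge *edges, int nedges,
               int entry, int exit_block)
{
    for (int b = 0; b < nblocks; b++) {
        long in = 0, out = 0;
        for (int e = 0; e < nedges; e++) {
            if (edges[e].dst == b) in += edges[e].count;
            if (edges[e].src == b) out += edges[e].count;
        }
        if (b != entry && in != block_count[b]) return 0;
        if (b != exit_block && out != block_count[b]) return 0;
    }
    return 1;
}
```

Numbering the blocks of the while loop 0 (the bzero block), 1 (top), and 2 (skip) with counts 10, 20, and 10, both edge assignments 1, 9, 19, 1 and 9, 1, 11, 9 pass the check, which is why block counts alone cannot pin down the edge counts here.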

Computing flow edge counts from basic block counts
  • Spanning tree algorithm
    • given flow graph, costs, finds lowest cost set of instrumentation points
    • costs derived from static analysis or earlier runs
  • Read Ball and Larus for details

Instrumenting flow edges
  • ATOM: branch taken value can be passed to analysis routine
  • branch not taken: insert call to count after conditional branch
  • taken branch, indirect jump: insert new basic block between branch and target

Merging multiple profiles
  • Multiple runs generate multiple profiles, how do we combine them?
    • Add them together
    • Should the profiles be weighted equally?
      • User defined
      • Scale so that sums are equal
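
The "scale so that sums are equal" option can be sketched as follows: rescale the second profile so its total matches the first profile's total, then add the two. The function name and the choice of the first profile's total as the common scale are assumptions for illustration; user-defined weights would simply replace the computed scale factor.

```c
/* Merge two profiles of n counters each, weighting both runs equally by
   scaling the second profile to the first profile's total. */
void merge_scaled(const long *first, const long *second,
                  double *merged, int n)
{
    double sum1 = 0.0, sum2 = 0.0;
    for (int i = 0; i < n; i++) {
        sum1 += (double)first[i];
        sum2 += (double)second[i];
    }
    for (int i = 0; i < n; i++) {
        /* An empty second profile contributes nothing. */
        double scaled = (sum2 > 0.0) ? (double)second[i] * sum1 / sum2 : 0.0;
        merged[i] = (double)first[i] + scaled;
    }
}
```

Without scaling, a run that executes ten times as many blocks would dominate the merged profile; with scaling, a short run of 100 block executions and a long run of 1000 contribute equally.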

Using profiles
  • Edge, block counts are in database
  • For each procedure, compiler locates counts in database and copies them to IR
  • Every flow edge, call edge, block labeled with execution count
  • Optimizations that modify flow graph must update profile information


IR/Profiled program mismatch
  • Does the flow graph of the program you profiled match the flow graph in the compiler IR?
    • Optimization
    • Code generation
  • Usually ok if you disable optimization
  • Not a problem for executable optimizers

Persistence
  • If the program is modified, can you use an old profile?
  • Generating a profile can be difficult and time consuming
  • Don’t hold up build process generating a new profile every time

Usability
  • Make it easy or no one will use it
  • Limited changes to build process
  • Limited opportunities for user to mess up

Profile based optimization nirvana
  • Profile any build
    • tolerate IR/profiled program mismatches
  • No instrumentation step
  • Low cost profiling, < 5%
  • No restructuring of makefile
  • Big speedup!

DCPI

Tools for profile based optimization
  • Unix
    • cc, f77
    • om: executable optimizer called from cc
    • cord: user specified procedure ordering
  • NT
    • scc: calls Visual C
    • spike: executable optimizer
    • link /order: user specified procedure ordering
      • wstune generates ordering
