Profile Guided Optimizations in Visual C++ 2005

Presentation Transcript

Profile Guided Optimizations in Visual C++ 2005

Andrew Pardoe

Phoenix Team (C++ Optimizer)

What do optimizers do?

int setArray(int a, int *array)
{
    int x;
    for (x = 0; x < a; ++x)
        array[x] = 0;
    return x;
}

  • The compiler knows nothing about the value of ‘a’
  • The compiler knows nothing about the array’s alignment
  • The compiler doesn’t look at all the source files together
  • The compiler doesn’t know how the program will execute
What is PGO (pronounced PoGO)?
  • A “profile” details a program’s behavior in a specific scenario
  • Profile-guided optimizations use the profile to guide the optimizer for that given scenario
  • PGO tells the optimizer which areas of the application were most frequently executed
  • This information lets the optimizer be more selective in optimizing the program
  • PGO has its own set of optimizations as well as improving traditional optimizations
Example of a PGO win
  • Compiler optimizations make assumptions based on static analysis and standard heuristics
    • For example, we assume that a loop executes multiple times

for (p = list; *p; p = p->next) {
    p->f = sqrt(F);
}

    • The optimizer would hoist the call to the loop invariant sqrt(F)

tmp = sqrt(F);
for (p = list; *p; p = p->next) {
    p->f = tmp;
}

    • If the profile shows the loop body rarely or never executes (the list is typically empty), we will not hoist the call
How is PGO used?

[Build-flow diagram:]
Source code → Instrumented binary (with PGO probes)
Instrumented binary + Scenarios → Profile
Source code + Profile → Optimized binary

How is PGO used?
  • PGO is built on top of Link-Time Code Generation
  • Must link object files twice: once for instrumented build, once for optimized build
  • Can be used on almost all native code
    • exe, dll, lib
    • COM/MFC
    • Windows services
  • Cannot be used on system or managed code
    • Drivers or kernel mode code
    • No code compiled with /CLR
  • Incorrect scenarios could cause worse optimizations!
PGO profile gathering
  • Two major themes of PGO profile gathering
    • Identify “hot paths” in program execution and optimize to make these paths perform well; likewise, identify “cold paths” to separate cold code—or dead code—from hot code
    • Identify “typical” values such as switch values, loop induction variables and targets of indirect calls, and optimize code for these values
PGO main optimizations: inlining
  • Improved inlining heuristics
    • Inline based on frequency of call, not function size or depth of call stack
    • “Hot” call sites: inline aggressively
    • “Cold” call sites: only inline if there are other optimization opportunities (such as folding)
    • “Dead” call sites: only inline the trivial cases
PGO main optimizations: inlining
  • Speculative inlining: used for virtual call speculation
    • Indirect calls are profiled to find typical targets
    • An indirect call heavily biased toward certain target(s) can be multi-versioned
    • The new sequence contains direct call(s) to typical target(s), which can be inlined
  • Partial inlining: only inline the portions of the callee we execute. If the cold code is called, call the non-inlined function.
PGO main optimizations: code size
  • Choice of favoring size versus speed made on a per-function basis
      • Program execution should be dominated by functions optimized for speed and less-frequently used functions should be small
  • PGO computes a dynamic instruction count for each profiled function.
    • Inlining effects are taken into account.
  • Sorts functions in descending order by count.
  • Functions in the upper 99% of total dynamic instruction count are optimized for speed. Others are compressed.
  • In large applications (Vista, SQL) most functions are optimized for size.
PGO main optimizations: locality
  • Reorder the code to “fall through” wherever possible
  • Intra-function layout reorders basic blocks so that the major trace falls through whenever possible.
  • Inter-function layout tries to place frequent caller-callee pairs near one another in the image.
  • Extract “dead” code from the .text section and put it in a remote section of the image
  • Dead code can be entire functions that are not called or basic blocks inside a function
  • Penalty for being wrong is very large so the profile must be accurate!
What code benefits most?
  • C++ programs: many virtual calls can be inlined once the target is determined through profiling
  • Large applications where size and speed are important
  • Code with frequent branches that are difficult to predict at compile time
  • Code which can be separated by profiling into “hot” and “cold” blocks to help instruction cache locality
  • Code for which you know the typical usage patterns and can produce accurate profiling scenarios
Scenario 1
  • Customer compiles with /O2 and gets pretty good performance but wants to take advantage of advanced optimizations like LTCG and PGO
  • Code is tested by the dev team throughout development cycle using unit and bug regression tests
  • Customer has done performance measurements of the code but has no automated performance tests; they believe performance can improve.
  • Is this customer ready to try PGO? Probably not.
Scenario 2
  • Customer has well-defined performance goals and tests set up to measure performance
  • Customer knows typical usage patterns for the application
  • Application is being built with LTCG
  • Most of the execution time is spent in tightly-nested loops doing heavy floating-point calculations
  • Is this customer ready to use PGO? Maybe…
Scenario 3
  • Customer has well-defined performance goals and tests set up to measure performance
  • Customer knows typical usage patterns for the application
  • Application is being built with LTCG
  • Application spends most of its time in branches and calls
  • Application is fairly large and makes use of inheritance
  • Is this customer ready to use PGO? Definitely.
Scenario 4
  • Customer has a build lab and wants to enable PGO in nightly builds
  • But profiling every night seems too expensive
  • Solution: PGO Incremental Update
    • Avoid running profile scenarios at every build
    • PGU uses “stale” profile data
    • Can check in profile data and refresh weekly
  • PGU restricts optimizations
    • Functions which have changed will not be optimized
    • Effects of localized changes are usually negligible
PGO sweeper
  • Some scenarios are difficult to collect profile data for
    • Profile scenario may not begin and end with application launch and shutdown
    • Some components cannot write a file
    • Some components cannot link to the PGO runtime DLL
  • PGO sweeper collects profile data from running instrumented processes
  • This allows you to close a currently open .pgc file and create a new one without exiting the instrumented binary
  • You get one .pgc file per run or sweep. You can delete any .pgc files you do not want reflected in your scenario.
PGO Manager
  • PGO manager adds profile data from one or more .pgc files into the .pgd file
  • The .pgd file is the main profile database
  • Allows you to profile multiple scenarios (.pgc) for a single codebase into one profile database (.pgd)
  • PGO manager also lets you generate reports from the .pgd file to see that your scenarios “feel right” in the code
  • Information in the reports includes
    • Module count, function count, arc and value count
    • Static (all) instruction count, dynamic (hot) instruction count
    • Basic block count, average basic block size
    • Function entry count
How much performance does PGO get?
  • Performance gain is architecture and application specific
    • IA64 sees biggest gains
    • x64 benefits more than x86
    • Large applications benefit more than small: SQL server saw over 30% gains through PGO
    • Many parts of Windows use PGO to balance size vs. speed
  • If you understand your real-world scenarios and have adequate, repeatable tests, PGO is almost always a win
  • Once your testing is in place, integrating PGO into your build process should be easy
Call-graph profiling
  • Given this call graph, determine which code paths are hot and which are cold

[Call graph diagram over the functions a, foo, bar, baz, and bat]

Call-graph profiling continued
  • Measure the frequency of calls

[Call graph diagram annotated with call frequencies of 10, 15, 20, 50, 75, and 100 on the edges between a, foo, bar, baz, and bat]

Call-graph profiling after inlining
  • Inline functions based on call profile
    • Highest-frequency calls are (bar, baz) and (bat, bar)

[Call graph diagram after inlining: the (bar, baz) and (bat, bar) pairs have been merged into their callers; remaining edge frequencies are 10, 15, 20, 100, and 125]

Reordering basic blocks
  • Change code layout to improve instruction cache locality

[Diagram: execution profile with default and optimized layouts]
Execution profile: block A branches to C 100 times and to B 10 times; C reaches D 100 times, B reaches D 10 times.
Default layout: A, B, C, D
Optimized layout: A, C, D, B — the hot path A → C → D falls through

Speculative inlining of virtual calls
  • Profiling shows the dynamic type of object A in function Func was almost always Foo (and almost never Bar)

class Base
{
    virtual void call();
};

class Foo : Base
{
    void call();
};

class Bar : Base
{
    void call();
};

void Func(Base *A)
{
    while (true)
    {
        if (type(A) == Foo:Base)  // pseudocode: dynamic type test
        {
            // inline of A->call();
        }
        else // virtual dispatch
            A->call();
    }
}

void Bar(Base *A)
{
    while (true)
    {
        A->call();
    }
}

Partial inlining
  • Profiling shows that condition Cond favors the left branch (hot code) over the right branch (cold code)

[Flow graph: Basic Block 1 → Cond → Hot Code / Cold Code → More Code]

Partial inlining concluded
  • We can inline the hot path, and not the cold path. We can make different decisions at each call site!

[Flow graph: Basic Block 1 → Cond → inlined Hot Code / out-of-line Cold Code → More Code]

Using PGO (in more detail)

[Build-flow diagram:]
Source code → compile with /GL and opts → Object files
Object files → link with /LTCG:PGI → Instrumented binary + .PGD file
Instrumented binary + Scenarios → .PGC file(s)
Object files + .PGD file + .PGC files → link with /LTCG:PGO → Optimized binary

PGO tips
  • The scenarios used to generate the profile data should be real-world scenarios. The scenarios are NOT an attempt at code coverage.
  • Training with scenarios that are not representative of real-world use can result in code that performs worse than if PGO had not been used.
  • Name the optimized binary something different from the instrumented binary, for example, app.opt.exe and app.inst.exe. This way you can rerun the instrumented application to supplement your set of scenario profiles without rerunning everything.
  • To tweak results, use the /clear option of pgomgr to clear out a .PGD file.
PGO tips
  • If you have two scenarios that run for different amounts of time, but would like them to be weighted equally, you can use the weight switch (/merge:weight in pgomgr) on .PGC files to adjust them.
  • You can use the speed switch to change the speed/size thresholds.
  • You can control the inlining threshold with a switch, but use it with care: the values from 0-100 aren't linear.
  • Integrate PGO into your build process and update scenarios frequently for the most consistent results and best performance increases.
In summary
  • Using PGO is very easy, with four simple steps
    • CL to parse the source files
      • cl /c /O2 /GL *.cpp
    • LINK / PGI to generate instrumented image
      • link /ltcg:pgi /pgd:appname.pgd *.obj *.lib
      • Also generates a PGD file (PGO database)
    • Run your program on representative scenarios
      • Generates PGC files (PGO profile data)
    • LINK / PGO to generate optimized image
      • Implicitly uses the generated PGC files
      • link /ltcg:pgo /pgd:appname.pgd *.obj *.lib
More information
  • Matt Pietrek’s Under the Hood column from May 2002 has a fantastic explanation of LTCG internals
  • Multiple articles on PGO located on MSDN
    • The links are long: just search for PGO on MSDN
  • Look through articles by Kang Su Gatlin on his blog at http://blogs.msdn.com/kangsu or on MSDN
  • Improvements are coming in the new VC++ backend
    • Based on the Phoenix optimization framework
    • Profiling is a major scenario for the Phoenix-based optimizer
    • There will be a talk on Phoenix later today