Seminar Series
  • Static and Dynamic Compiler Optimizations (6/28).
  • Speculative Compiler Optimizations (7/05)
  • ADORE: An Adaptive Object Code ReOptimization System (7/19)
  • Current Trends in CMP/CMT Processors (7/26)
  • Static and Dynamic Helper Thread Prefetching (8/02)
  • Dynamic Instrumentation/Translation (8/16)
  • Virtual Machine Technologies and their Emerging Applications (8/23)
Professional Background
  • CE BS and CE MS, NCTU
  • CS Ph.D. University of Wisconsin, Madison
  • Cray Research, 1987-1993
    • Architect for Cray Y-MP, Cray C-90, FAST
    • Compiler optimization for Cray X-MP, Y-MP, Cray-2, Cray-3
  • Hewlett Packard, 1993 -1999
    • Compiler technical lead for HP-7200, HP-8000, IA-64
    • Lab technical lead for adaptive systems
  • University of Minnesota, 2000-now
    • ADORE/Itanium and ADORE/Sparc systems
  • Sun Microsystems, 2004-2005
    • Visiting professor
Background
  • Optimization:

A process of making something as effective as possible

  • Compiler:

A computer program that translates programs written in high-level languages into machine instructions

  • Compiler Optimization:

The phases of compilation that generate good code, using the target machine as efficiently as possible.

background cont
Background (cont.)
  • Static Optimization:

compile time optimization – one time, fixed optimization that will not change after distribution.

  • Dynamic Optimization:

optimization performed at program execution time – adaptive to the execution environment.

some examples
Some Examples
  • Redundancy elimination

C = (A+B)*(A+B)  →  t = A+B; C = t*t;

  • Register allocation

keep frequently used data items in registers

  • Instruction scheduling

to avoid pipeline bubbles

  • Cache prefetching

to minimize cache miss penalties
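The redundancy-elimination example above can be written out concretely. This is an illustrative sketch (the function names are mine), showing only that the transformed code computes the same value:

```python
# Redundancy elimination from the slide: C = (A+B)*(A+B) becomes t = A+B; C = t*t.

def redundant(a, b):
    return (a + b) * (a + b)   # the common subexpression a+b is evaluated twice

def eliminated(a, b):
    t = a + b                  # computed once, reused
    return t * t

# the transformation must not change the result
assert redundant(3, 4) == eliminated(3, 4) == 49
```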

How Important Is Compiler Optimization?
  • In the last 15 years, computer performance has increased by ~2000 times.
    • Clock rate increased by ~100X
    • Micro-architecture contributed ~5-10X (enabled by transistor counts doubling every 18 months)
    • Compiler optimization added ~2-3X for single processors
Static compilation system

[Diagram: C, C++, and Fortran front ends emit a common intermediate language (IL/IR); an IL-to-IL interprocedural optimizer performs machine-independent optimizations, followed by machine-dependent optimizations and code generation.]

  • Machine-independent optimizations: platform neutral, performed on the IL/IR
  • Machine-dependent optimizations: tied to the target machine
Criteria for optimizations
  • Must preserve the meaning of programs
    • Example: hoisting the loop-invariant expression c/N

      T1 = c/N;
      for (i = 0; i < N; i++)
          A[i] += b[i] + T1;

      vs. the original

      for (i = 0; i < N; i++)
          A[i] += b[i] + c/N;

      What if N == 0? The original performs no division, but the transformed code divides by zero.

    • Example: hoisting expressions out of a guarded statement

      if (C > 0)
          A += b[j] + d[j];

      becomes, with T1 = b[j] and T2 = d[j] computed before the test:

      if (C > 0)
          A += T1 + T2;

      What if evaluating b[j] raises an exception exactly when C <= 0?
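The first hazard above can be reproduced directly. This sketch (the function names are mine) mimics the two code shapes:

```python
def original(A, b, c, N):
    # division happens inside the loop: zero iterations means zero divisions
    for i in range(N):
        A[i] += b[i] + c / N
    return A

def hoisted(A, b, c, N):
    t1 = c / N                 # hoisted: evaluated even when the loop runs 0 times
    for i in range(N):
        A[i] += b[i] + t1
    return A

# the two versions agree when N > 0 ...
assert original([1.0], [2.0], 4.0, 1) == hoisted([1.0], [2.0], 4.0, 1) == [7.0]

# ... but only the hoisted version traps when N == 0
try:
    hoisted([], [], 4.0, 0)
    raised = False
except ZeroDivisionError:
    raised = True
assert raised and original([], [], 4.0, 0) == []
```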

Basic Concepts
  • Optimizations improve performance, but do not give optimal performance
  • Optimizations generally (or statistically) improve performance. They could also slow down the code.
    • Example: LICM, Cache prefetching, Procedure inlining
  • Must be absolutely (not statistically!) correct (safe or conservative)
  • Some optimizations are more important in general purpose compilers
    • Loop optimizations, reg allocation, inst scheduling
Optimization at different levels
  • Local (within a basic block)
  • Global (cross basic blocks but within a procedure)
  • Inter-procedural
  • Cross module (link time)
  • Post-link time (such as Spike/iSpike)
  • Runtime (as in dynamic compilation)
Tradeoff in Optimizations
  • Space vs. Speed
    • Usually favors speed. However, on machines with small memory or I-cache, space is equally important
  • Compile time vs. Execution Time
    • Usually favors execution time, but that is not necessarily true in recent years (e.g. JIT, large apps)
  • Absolutely robust vs. statistically robust
    • Decrease the default optimization level in less important regions.
  • Complexity vs. Efficiency
    • Select between complex but more efficient and simple but less efficient (easier to maintain) algorithms.
Overview of Optimizations

Early Optimizations

scalar replacement, constant folding

local/global value numbering

local/global copy propagation

Redundancy Elimination

local/global CSE, PRE


code hoisting

Loop Optimizations

strength reduction

induction variable removal

unnecessary bound checking elimination
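Strength reduction and induction variables can be sketched as follows; the address computation is a hypothetical example (not from the slides):

```python
def addresses_multiply(base, n):
    # induction variable i used in a multiply each iteration (e.g. &A[i] = base + i*4)
    return [base + i * 4 for i in range(n)]

def addresses_reduced(base, n):
    # strength-reduced: the per-iteration multiply is replaced by a running addition
    out, addr = [], base
    for _ in range(n):
        out.append(addr)
        addr += 4
    return out

assert addresses_multiply(1000, 4) == addresses_reduced(1000, 4) == [1000, 1004, 1008, 1012]
```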

Overview of Optimizations (cont.)

Procedure Optimizations

tail-recursion elimination, in-line expansion, leaf-routine optimization, shrink wrapping, memoization

Register Allocation

graph coloring

Instruction Scheduling

local/global code scheduling

software pipelining

trace scheduling, superblock formation

Overview of Optimizations (cont.)

Memory Hierarchy Optimizations

loop blocking, loop interchange

memory padding, cache prefetching, data re-layout
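Loop blocking, listed above, can be sketched like this. The tile size and matrix are illustrative; in Python only the iteration order changes, while the cache-locality benefit applies to compiled code:

```python
def plain_sum(matrix):
    return sum(sum(row) for row in matrix)

def blocked_sum(matrix, B=2):
    # visit the matrix in BxB tiles so each tile stays cache-resident
    n, m = len(matrix), len(matrix[0])
    total = 0
    for ii in range(0, n, B):              # tile origins
        for jj in range(0, m, B):
            for i in range(ii, min(ii + B, n)):
                for j in range(jj, min(jj + B, m)):
                    total += matrix[i][j]
    return total

M = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
assert blocked_sum(M) == plain_sum(M) == 45
```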

Loop Transformations

reduction recognition, loop collapsing, loop reversal, strip mining, loop fusion, loop distribution

Peephole Optimizations

Profile Guided Optimizations

Code re-positioning, I-cache prefetching, profile-guided in-lining, register allocation (RA), instruction scheduling (IS), …

Overview of Optimizations (cont.)

More Optimizations

SIMD Transformation, VLIW Transformation

Communication Optimizations

(See David Bacon and Susan Graham’s survey paper)

Optimization Evaluation

Is there a commonly accepted method?

  • user’s choice
  • benchmarks
    • Livermore loops (14 kernels from scientific code)
    • SPEC
Importance of Individual Opt.
  • How much performance does an optimization contribute?
    • Is this optimization commonplace?

      Does it happen in only one particular instance?

      Does it happen in only one particular program?

      Does it happen for one particular type of application?

    • How much difference does it make?
    • Does it enable other optimizations?

      e.g. procedure integration, unrolling

  • Ordering is important; some dependences exist between optimizations
    • Procedure integration and loop unrolling usually enable other optimizations
    • Loop transformations should be done before address linearization.
  • No optimal ordering
  • Some optimizations should be applied multiple times (e.g. copy propagation, DCE)
  • Some recent research advocates exhaustive search with intelligent pruning
Example Organization

[Diagram: build the flow graph and identify loops; compute reaching definitions and define-use chains; then apply global CSE, copy propagation, and code motion.]
Loops in Flow Graph
  • Dominators

A node d of a flow graph dominates node n, written d dom n, if every path from the initial node of the flow graph to n goes through d.

Example (from the slide's flow graph):

    1 dom all nodes
    3 dom 4, 5, 6, 7
    4 dom 5, 6, 7
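The dominator sets can be computed by the standard iterative method, Dom(n) = {n} ∪ ⋂ Dom(p) over predecessors p. The edge list below is my own guess at a graph consistent with the slide's answers (the actual figure is not recoverable):

```python
def dominators(preds, entry):
    """Iterative fixed point of Dom(n) = {n} | intersection of Dom(p) over preds."""
    nodes = list(preds)
    dom = {n: set(nodes) for n in nodes}   # start with "everything dominates n"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

# hypothetical CFG: 1->2, 1->3, 2->3, 3->4, 4->3 (back edge), 4->5, 4->6,
# 5->7, 6->7, 7->1 (back edge); written as predecessor lists
preds = {1: [7], 2: [1], 3: [1, 2, 4], 4: [3], 5: [4], 6: [4], 7: [5, 6]}
dom = dominators(preds, entry=1)

assert all(1 in dom[n] for n in preds)   # 1 dominates every node
assert dom[4] == {1, 3, 4}               # 3 dominates 4, 5, 6, 7
assert dom[7] == {1, 3, 4, 7}            # 4 dominates 5, 6, 7
```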

Loops in Flow Graph (cont.)
  • Natural loops
    • A loop must have a single entry point, called the “header”, which dominates all nodes in the loop.
    • There is at least one path back to the header.
  • Back edge
    • An edge in the flow graph whose head dominates its tail. For example,

edge 4→3 and edge 7→1

Global Data Flow Analysis
  • To provide global information about how a procedure manipulates its data.
  • Example: given an assignment A = 3, can we propagate the constant 3 to later uses of A?

Data Flow Equations

A typical data flow equation has the form

Out [S] = Gen[S] U (In[S] – Kill[S])

S means a statement

Gen[S] means definitions generated within S

Kill[S] means definitions killed as control flows through S

In[S] means definitions live at the beginning of S

Out[S] means definitions available at the end of S

Reaching Definitions
  • A definition d reaches a point p if there is a path from the point immediately following d to p, such that d is not killed along that path.

[Example flow graph:

    B1:  d1: i = m-1;   d2: j = n;   d3: a = u1
    B2:  d4: i = i+1;   d5: j = j-1
    B3:  d6: a = u2

d1, d2, d5 reach B2; d5 kills d2 (both define j), so d2 does not reach past d5.]
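Reaching definitions can be solved iteratively using Out[B] = Gen[B] ∪ (In[B] − Kill[B]). The block structure and gen/kill sets below are assumptions mirroring the example's shape (B1 feeding a loop block B2, then B3):

```python
def reaching_definitions(blocks, preds, gen, kill):
    """In[B] = union of Out[P] over predecessors; Out[B] = gen[B] | (In[B] - kill[B])."""
    out = {b: set() for b in blocks}
    inn = {b: set() for b in blocks}
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for b in blocks:
            inn[b] = set().union(*(out[p] for p in preds[b]))
            new_out = gen[b] | (inn[b] - kill[b])
            if new_out != out[b]:
                out[b] = new_out
                changed = True
    return inn, out

blocks = ["B1", "B2", "B3"]
preds = {"B1": [], "B2": ["B1", "B2"], "B3": ["B2"]}
# B1 defines i (d1), j (d2), a (d3); B2 redefines i (d4) and j (d5); B3 redefines a (d6)
gen  = {"B1": {"d1", "d2", "d3"}, "B2": {"d4", "d5"}, "B3": {"d6"}}
kill = {"B1": {"d4", "d5", "d6"}, "B2": {"d1", "d2"}, "B3": {"d3"}}

inn, out = reaching_definitions(blocks, preds, gen, kill)
assert "d2" in inn["B2"]      # d2 reaches the top of B2 from B1
assert "d2" not in out["B2"]  # d5 kills d2, so d2 does not reach past B2
```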
Data Flow Equation for Reaching Definition

For a single assignment S:  d1: a = b + c

    gen[S]  = {d1}
    kill[S] = all other definitions of a
    out[S]  = gen[S] ∪ (in[S] − kill[S])

For S composed of two alternative branches S1 and S2 (e.g. if-then-else):

    gen[S]  = gen[S1] ∪ gen[S2]
    kill[S] = kill[S1] ∩ kill[S2]
    out[S]  = out[S1] ∪ out[S2]

For S a loop with body S1:

    gen[S]  = gen[S1]
    kill[S] = kill[S1]
    in[S1]  = in[S] ∪ gen[S1]
    out[S]  = out[S1]
Transformation example: LICM
  • Loop Invariant Code Motion
    • A loop invariant is an instruction (a load or a calculation) in a loop whose result is always the same in every iteration.
    • Once we have identified loops and tracked the locations at which operand values are defined (i.e. reaching definitions), we can recognize an instruction as loop invariant if each of its operands

1) is a constant,

2) has reaching definitions that all lie outside the loop, or

3) has a single reaching definition that is itself a loop invariant.
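The effect of LICM can be shown as a before/after sketch; `scale` and the inputs are hypothetical names of my own:

```python
def before(xs, a, b):
    out = []
    for x in xs:
        scale = a * b + 1      # loop invariant: both operands defined outside the loop
        out.append(x * scale)
    return out

def after(xs, a, b):
    scale = a * b + 1          # hoisted out of the loop: computed once
    out = []
    for x in xs:
        out.append(x * scale)
    return out

# the motion must not change the result
assert before([1, 2, 3], 2, 3) == after([1, 2, 3], 2, 3) == [7, 14, 21]
```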

Static Compilers
  • Traditional compilation model for C, C++, Fortran, …
  • Extremely mature technology
  • Static design point allows for extremely deep and accurate analyses supporting sophisticated program transformation for performance.
  • ABI enables a useful level of language interoperability


Static compilation…the downsides
  • CPU designers restricted by requirement to deliver increasing performance to applications that will not be recompiled
    • Slows down the uptake of new ISA and micro-architectural features
    • Constrains the evolution of CPU design by discouraging radical changes
  • The model for applying feedback information from application profiles to the optimization and code generation components is awkward and not widely adopted, thus diluting the performance achieved on the system
Static compilation…the downsides (cont.)
  • Largely unable to satisfy our increasing desire to exploit dynamic traits of the application
  • Even link-time is too early to be able to catch some high-value opportunities for performance improvement
  • Whole classes of speculative optimizations are infeasible without heroic efforts
Tyranny of the “Dusty Deck”
  • Binary compatibility is one of the crowning achievements of the early computer years, but…
  • It does (or at least should) make CPU architects think very carefully about adding anything new because
    • you can almost never get rid of anything you add
    • it takes a long time to find out for sure whether anything you add is a good idea or not
Profile-Directed Feedback (PDF)

Two-step optimization process:

  • First pass instruments the generated code to collect statistics about the program execution
    • Developer exercises this program with common inputs to collect representative data
    • Program may be executed multiple times to reflect variety of common inputs
  • Second pass re-optimizes the program based on the profile data collected

Also called Profile-Guided Optimization (PGO) or Profile-Based Optimization (PBO)

Data collected by PDF
  • Basic block execution counters
    • How many times each basic block in the program is reached
    • Used to derive branch and call frequencies
  • Value profiling
    • Collects a histogram of values for a particular attribute of the program
    • Used for specialization
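Value profiling and the specialization it enables can be sketched as follows; the function names and the "hot" exponent are hypothetical:

```python
from collections import Counter

profile = Counter()

def instrumented_pow(x, n):
    profile[n] += 1            # pass 1: record a histogram of the exponent's values
    return x ** n

# training run with representative inputs
for n in [2, 2, 2, 2, 3, 2, 2]:
    instrumented_pow(1.5, n)

hot_n, _ = profile.most_common(1)[0]
assert hot_n == 2              # the histogram identifies the dominant value

# pass 2: specialize for the dominant value, keep a general fallback
def specialized_pow(x, n):
    if n == 2:                 # fast path for the profiled common case
        return x * x
    return x ** n

assert specialized_pow(3.0, 2) == 9.0
assert specialized_pow(2.0, 5) == 32.0
```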
Other PDF Opportunities
  • Path Profile
  • Alias Profile
  • Cache Miss Profile
    • I-cache miss
    • D-cache miss
    • Miss types
    • ITLB/DTLB misses
  • Speculation Failure Profile
  • Event Correlation Profile
Optimizations affected by PDF
  • Inlining
    • Uses call frequencies to prioritize inlining sites
  • Function partitioning
    • Groups the program into cliques of routines with high call affinity
  • Speculation
    • Control speculative execution, data speculative execution and value speculation based optimizations.
  • Predication
  • Code Layout
  • Superblock formation
Optimizations triggered by PDF(in the IBM compiler)
  • Specialization triggered by value profiling
    • Arithmetic ops, built-in function calls, pointer calls
  • Extended basic block creation
    • Organizes code so that branches frequently fall through
  • Specialized linkage conventions
    • Treats all registers as non-volatile for infrequent calls
  • Branch hinting
    • Sets branch-prediction hints available on the ISA
  • Dynamic memory reorganization
    • Groups frequently accessed heap storage
Impact of PDF on SpecInt 2000

On a POWER4 system running AIX using the latest IBM compilers, at the highest available optimization level (-O5)

Sounds great…what’s the problem?
  • Only the die-hard performance types use it (e.g. HPC, middleware)
  • It’s tricky to get right…you only want to train the system to recognize things that are characteristic of the application and somehow ignore artifacts of the input set
  • In the end, it’s still static and runtime checks and multiple versions can only take you so far
  • Undermines the usefulness of benchmark results as a predictor of application performance when upgrading hardware
  • In summary, it presents a usability issue for developers that shows no sign of going away anytime soon
Dynamic Compilation System

[Diagram: bytecode flows into the Java Virtual Machine, which invokes its JIT Compiler.]

JVM Evolution
  • First-generation JVMs were entirely interpreted. Pure interpretation is good for proof-of-concept, but too slow for executing real code.
  • Second-generation JVMs used JIT (just-in-time) compilers to convert bytecode into machine code lazily, before execution.
  • HotSpot is the third-generation technology. It combines interpretation, profiling, and dynamic compilation, compiling only the frequently executed code. It also comes with two compilers: the server compiler (optimized for speed) and the client compiler (optimized for start-up time and memory footprint).
  • Newer dynamic compilation techniques for JVMs include CPO (continuous program optimization), i.e. continuous recompilation, and OSR (on-stack replacement), which can switch code from interpreted mode to a compiled version while it runs.
Dynamic Compilation
  • Traditional model for languages like Java
  • Rapidly maturing technology
  • Exploitation of current invocation behaviour on exact CPU model
  • Recompilation and other dynamic techniques enable aggressive speculations
  • Profile feedback to optimizer is performed online (transparent to user/application)
  • Compile time budget is concentrated on hottest code with the most (perceived) opportunities


Dynamic compilation…the downsides
  • Some important analyses not affordable at runtime even if applied only to the hottest code (array data flow, global scheduling, dependency analysis, loop transformations, …)
  • Non-determinism in the compilation system can be problematic
    • For some users, it severely challenges their notions of quality assurance
    • Requires new approaches to RAS and to getting reproducible defects for the compiler service team
  • Introduces a very complicated code base into each and every application
  • Compile time budget is concentrated on hottest code with the most (perceived) opportunities and not on other code, which in aggregate may be as important a contributor to performance
    • What do you do when there’s no hot code?
The best of both worlds

[Diagram (recoverable labels: “Java / .NET”, “MIL, etc”, “Portable High-Level Optimizer”, “Feedback (PDF)”): a combined static/dynamic pipeline in which front ends emit a common intermediate language consumed by a portable high-level optimizer, with profile-directed feedback flowing back from execution.]

More boxes, but is it better?
  • If ubiquitous, could enable a new era in CPU architectural innovation by reducing the load of the dusty deck millstone
    • Deprecated ISA features supported via binary translation or recompilation from “IL-fattened” binary
    • No latency effect in seeing the value of a new ISA feature
    • New feature mistakes become relatively painless to undo
There’s more
  • Transparently bring the benefits of dynamic optimization to traditionally static languages while still leveraging the power of static analysis and language-specific semantic information
    • All of the advantages of dynamic profile-directed feedback (PDF) optimizations with none of the static PDF drawbacks
      • No extra build step
      • No input artifacts skewing specialization choices
      • Code specialized to each invocation on exact processor model
      • More aggressive speculative optimizations
      • Recompilation as a recovery option
    • Static analyses inform value profiling choices
      • New static analysis goal of identifying the inhibitors to optimizations for later dynamic testing and specialization
  • A crossover point has been reached between dynamic and static compilation technologies.
  • They need to be converged/combined to overcome their individual weaknesses
  • Hardware designers struggle under the mounting burden of maintaining high performance backwards compatibility