Exhaustive search to improve the efficiency of compiler-generated code by evaluating optimization phase orders. Addresses the long-standing phase ordering problem, describes the experimental framework, and outlines future research directions.
Exhaustive Phase Order Search Space Exploration and Evaluation
Prasad Kulkarni (Florida State University)
Compiler Optimizations
• Goal: improve the efficiency of compiler-generated code
• Optimization phases require enabling conditions
  • need specific patterns in the code
  • many also need available registers
• Phases interact with each other
• Applying optimizations in different orders generates different code
Phase Ordering Problem
• Find an ordering of optimization phases that produces code that is optimal over all possible phase orderings
• Evaluating each sequence involves compiling, assembling, linking, executing, and verifying results
• The best optimization phase ordering depends on
  • the source application
  • the target platform
  • the implementation of the optimization phases
• A long-standing problem in compiler optimization!
Phase Ordering Space
• Current compilers incorporate numerous different optimization phases
  • 15 distinct phases in our compiler backend
  • 15! = 1,307,674,368,000 orderings
• Phases can enable each other
  • any phase can be active multiple times
  • 15^15 = 437,893,890,380,859,375 sequences of length 15
  • cannot restrict sequence length to 15
  • 15^44 ≈ 5.598 * 10^51 sequences of length 44
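For scale, these counts are easy to reproduce (plain Python; nothing here is compiler-specific):

    import math

    PHASES = 15
    print(f"{math.factorial(PHASES):,}")   # 1,307,674,368,000 orderings, no repeats
    print(f"{PHASES ** PHASES:,}")         # 437,893,890,380,859,375 length-15 sequences
    print(f"{PHASES ** 44:.3e}")           # ~5.598e+51 length-44 sequences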
Addressing Phase Ordering
• Exhaustive search
  • universally considered intractable
• We are now able to exhaustively evaluate the optimization phase order space
Re-stating the Phase Ordering Problem
• Earlier approach
  • explicitly enumerate all possible optimization phase orderings
• Our approach
  • explicitly enumerate all function instances that can be produced by any combination of phases
Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions
Experimental Framework
• We used the VPO compilation system
  • established compiler framework, under development since 1988
  • performance comparable to gcc -O2
• VPO performs all transformations on a single representation (RTLs), so it is possible to perform most phases in an arbitrary order
• Experiments use all 15 re-orderable optimization phases in VPO
• Target architecture was the StrongARM SA-100 processor
Disclaimers
• Did not include optimization phases normally associated with compiler front ends
  • no memory hierarchy optimizations
  • no inlining or other interprocedural optimizations
• Did not vary how phases are applied
• Did not include optimizations that require profile data
Benchmarks
• 12 MiBench benchmarks; 244 functions
Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions
Terminology
• Active phase: an optimization phase that modifies the function representation
• Dormant phase: a phase that is unable to find any opportunity to change the function
• Function instance: any semantically, syntactically, and functionally correct representation of the source function (that can be produced by our compiler)
Naïve Optimization Phase Order Space
• All combinations of optimization phase sequences are attempted
[Figure: search tree over four phases a, b, c, d; the root (level L0) expands into all four phases at level L1, and each L1 node again expands into all four phases at level L2]
Eliminating Consecutively Applied Phases
• A phase just applied in our compiler cannot be immediately active again
[Figure: the same search tree with each node's own phase removed from its children, so no phase appears twice in a row]
Eliminating Dormant Phases
• Get feedback from the compiler indicating whether any transformations were successfully applied in a phase
[Figure: the search tree with the edges of dormant phases pruned, further narrowing each node's children]
Identical Function Instances
• Some optimization phases are independent
  • example: branch chaining & register allocation
• Different phase sequences can produce the same code

  Starting code:
      r[2] = 1;
      r[3] = r[4] + r[2];

  Path 1: instruction selection
      r[3] = r[4] + 1;

  Path 2: constant propagation
      r[2] = 1;
      r[3] = r[4] + 1;
  followed by dead assignment elimination
      r[3] = r[4] + 1;
Equivalent Function Instances

  Source code:
      sum = 0;
      for (i = 0; i < 1000; i++)
          sum += a[i];

  Register allocation before code motion:
      r[10]=0;
      r[12]=HI[a];
      r[12]=r[12]+LO[a];
      r[1]=r[12];
      r[9]=4000+r[12];
  L3: r[8]=M[r[1]];
      r[10]=r[10]+r[8];
      r[1]=r[1]+4;
      IC=r[1]?r[9];
      PC=IC<0,L3;

  Code motion before register allocation:
      r[11]=0;
      r[10]=HI[a];
      r[10]=r[10]+LO[a];
      r[1]=r[10];
      r[9]=4000+r[10];
  L5: r[8]=M[r[1]];
      r[11]=r[11]+r[8];
      r[1]=r[1]+4;
      IC=r[1]?r[9];
      PC=IC<0,L5;

  After mapping registers, both become:
      r[32]=0;
      r[33]=HI[a];
      r[33]=r[33]+LO[a];
      r[34]=r[33];
      r[35]=4000+r[33];
  L01: r[36]=M[r[34]];
      r[32]=r[32]+r[36];
      r[34]=r[34]+4;
      IC=r[34]?r[35];
      PC=IC<0,L01;
Efficient Detection of Unique Function Instances
• After pruning dormant phases there may be tens or hundreds of thousands of unique instances
• Use a CRC (cyclic redundancy check) checksum on the bytes of the RTLs representing the instructions
• Use a hash table to check whether an identical or equivalent function instance already exists in the DAG
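A minimal sketch of this detection scheme in Python (the RTL serialization and the register re-mapping helper are simplified assumptions, not VPO's actual code):

    import re
    import zlib

    def canonicalize(rtls):
        """Renumber registers by order of first appearance, so instances that
        differ only in register assignment serialize to identical bytes
        (the 'equivalent function instances' case from the previous slides)."""
        mapping = {}
        def renumber(m):
            reg = m.group(1)
            if reg not in mapping:
                mapping[reg] = str(32 + len(mapping))   # r[32], r[33], ...
            return f"r[{mapping[reg]}]"
        return "\n".join(re.sub(r"r\[(\d+)\]", renumber, insn) for insn in rtls)

    seen = {}   # CRC checksum -> set of canonical texts (confirms on collision)

    def is_new_instance(rtls):
        """True if this function instance is neither identical nor equivalent
        to one already recorded in the DAG."""
        text = canonicalize(rtls)
        bucket = seen.setdefault(zlib.crc32(text.encode()), set())
        if text in bucket:
            return False
        bucket.add(text)
        return True

    # The two loop bodies from the 'Equivalent Function Instances' slide collide:
    print(is_new_instance(["r[10]=0;", "r[10]=r[10]+r[8];"]))   # True
    print(is_new_instance(["r[11]=0;", "r[11]=r[11]+r[8];"]))   # False (equivalent)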
Eliminating Identical/Equivalent Function Instances
• The resulting search space is a DAG of function instances
[Figure: the pruned search space collapsed into a DAG, where paths that produce the same function instance merge into a single node]
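Combining the prunings, the shape of the exhaustive enumeration can be illustrated with a toy model (Python; string rewrites stand in for real optimization phases, purely to show the search structure):

    from collections import deque

    # Toy stand-ins for optimization phases: each rewrites the "function"
    # (here just a string) and is dormant when it makes no change.
    phases = {
        "a": lambda s: s.replace("xx", "y"),
        "b": lambda s: s.replace("yz", "z"),
        "c": lambda s: s.rstrip("w"),
    }

    def enumerate_space(root):
        """Breadth-first enumeration of the phase order space as a DAG."""
        seen = {root}                       # identical-instance pruning
        worklist = deque([(root, None)])    # (instance, last phase applied)
        while worklist:
            inst, last = worklist.popleft()
            for name, phase in phases.items():
                if name == last:            # phase cannot be immediately active again
                    continue
                new = phase(inst)
                if new == inst:             # dormant phase: prune this edge
                    continue
                if new not in seen:         # merge identical instances into one node
                    seen.add(new)
                    worklist.append((new, name))
        return seen

    print(sorted(enumerate_space("xxyzww")))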
Exhaustively enumerated the optimization phase order space to find an optimal phase ordering with respect to code size [Published in CGO ’06]
Determining Program Performance
• Almost 175,000 distinct function instances per function, on average
  • the largest enumerated function has 2,882,021 instances
• Too time consuming to execute each distinct function instance
  • assembling, linking, and executing are more expensive than compiling
• Many embedded development environments use simulation
  • simulation is orders of magnitude more expensive than native execution
• Instead, use data obtained from a few executions to estimate the performance of all remaining function instances
Determining Program Performance (cont.)
• Function instances with identical control-flow graphs execute each basic block the same number of times
• Execute the application once for each distinct control-flow structure
• Statically estimate the number of cycles required to execute each basic block
• dynamic frequency measure = Σ over all blocks (static cycles * block frequency)
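The estimate itself is just a weighted sum; a minimal sketch (the block data below is hypothetical):

    def dynamic_frequency_measure(blocks):
        """Sum of (statically estimated cycles * measured block frequency)
        over all basic blocks of a function instance."""
        return sum(cycles * freq for cycles, freq in blocks)

    # Hypothetical function instance: (static cycles, execution frequency) per block
    print(dynamic_frequency_measure([(4, 20), (25, 5), (20, 15), (2, 2), (10, 5)]))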
Predicting Relative Performance (I)
[Figure: two function instances with identical control flow; each basic block is annotated with its execution frequency and a static cycle estimate (e.g., 4 vs. 4 cycles, 27 vs. 25, 22 vs. 20, ...); the frequency-weighted totals of 789 vs. 744 cycles predict which instance is faster]
Correlation: Dynamic Frequency Counts vs. Simulator Cycles
• Static performance estimation is inaccurate
  • ignores cache and branch misprediction penalties
• Most embedded systems have simpler architectures
  • estimation may be sufficiently accurate
  • simulator cycles are close to executed cycles
• We show a strong correlation between our measure of performance and simulator cycles
Complete Function Correlation
• Example: init_search in stringsearch
Leaf Function Correlation
• Leaf function instances are generated when no additional phases can be successfully applied
• Leaf instances provide a good sampling
  • they represent the only code that can be generated by an aggressive compiler, like VPO
  • at least one leaf instance represents an optimal phase ordering for over 86% of functions
  • a significant percentage of leaf instances are among the optimal
Leaf Function Correlation Statistics
• Pearson’s correlation coefficient:

  Pcorr = (Σxy − (Σx·Σy)/n) / sqrt( (Σx² − (Σx)²/n) * (Σy² − (Σy)²/n) )

• Accuracy of our estimate of optimal performance:

  Lcorr = (cycle count for best leaf) / (cycle count for leaf with best dynamic frequency count)
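In code, the slide's sum-based form of Pearson's coefficient (Python; the sample numbers are made up):

    import math

    def pearson(xs, ys):
        """Pearson's correlation coefficient, mirroring the slide's formula:
        x = our dynamic frequency estimate, y = simulator cycle count."""
        n = len(xs)
        sx, sy = sum(xs), sum(ys)
        sxy = sum(x * y for x, y in zip(xs, ys))
        sx2, sy2 = sum(x * x for x in xs), sum(y * y for y in ys)
        return (sxy - sx * sy / n) / math.sqrt(
            (sx2 - sx * sx / n) * (sy2 - sy * sy / n))

    # Made-up estimates vs. simulated cycles for five leaf function instances
    print(round(pearson([744, 789, 810, 902, 955],
                        [8120, 8655, 8770, 9640, 10230]), 4))  # close to 1.0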
Exhaustively evaluated the optimization phase order space to find a near-optimal phase ordering with respect to simulator cycles [Published in LCTES ’06]
Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions
Phase Enabling Interaction
• b enables a along the path a-b-a
[Figure: DAG fragment over phases a, b, c, d in which phase a becomes applicable again after phase b is applied]
Phase Disabling Interaction
• b disables a along the path b-c-d
[Figure: DAG fragment over phases a, b, c, d in which phase a is no longer applicable after phase b is applied]
Faster Conventional Compiler
• Modified VPO to use enabling and disabling phase probabilities to decrease compilation time

  # p[i]    - current probability of phase i being active
  # e[i][j] - probability of phase j enabling phase i
  # d[i][j] - probability of phase j disabling phase i
  # st      - the start of compilation (no phases applied yet)
  Foreach phase i do
      p[i] = e[i][st]
  While (any p[i] > 0) do
      Select j as the current phase with the highest probability of being active
      Apply phase j
      If phase j was active then
          Foreach phase i, where i != j, do
              p[i] += ((1 - p[i]) * e[i][j]) - (p[i] * d[i][j])
      p[j] = 0
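A runnable sketch of this driver (Python; the phase names, the e/d tables, and apply_phase below are illustrative assumptions, not VPO's real interface):

    import random

    def probabilistic_phase_order(phases, e, d, apply_phase, start="st"):
        """Repeatedly run the phase most likely to be active, then update every
        other phase's probability from the enabling/disabling statistics,
        following the slide's pseudocode."""
        p = {i: e[i][start] for i in phases}
        while any(prob > 0 for prob in p.values()):
            j = max(p, key=p.get)            # phase with highest active probability
            if apply_phase(j):               # True when phase j changed the function
                for i in phases:
                    if i != j:
                        p[i] += (1 - p[i]) * e[i][j] - p[i] * d[i][j]
            p[j] = 0                         # j must re-earn a nonzero probability

    # Toy run with two hypothetical phases and made-up statistics
    random.seed(1)
    e = {"instsel":  {"st": 0.9, "deadelim": 0.3},
         "deadelim": {"st": 0.5, "instsel": 0.7}}
    d = {"instsel":  {"st": 0.0, "deadelim": 0.1},
         "deadelim": {"st": 0.0, "instsel": 0.0}}
    probabilistic_phase_order(["instsel", "deadelim"], e, d,
                              lambda ph: random.random() < 0.5)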
Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions
Conclusions
• Phase ordering problem
  • long-standing problem in compiler optimization
  • exhaustive evaluation was always considered infeasible
• Exhaustively evaluated the phase order space
  • re-interpretation of the problem
  • novel application of search algorithms
  • fast pruning techniques
  • accurate prediction of relative performance
• Analyzed properties of the phase order space to speed up conventional compilation
• Published in CGO ’06 and LCTES ’06; submitted to TOPLAS
Challenges
• Exhaustive phase order search is a severe stress test for the compiler
  • isolate the analysis required and invalidated by each phase
  • produce correct code for all phase orderings
  • eliminate all memory leaks
• The search algorithm needs to be highly efficient
  • used CRCs and hashes for function comparisons
  • stored intermediate function instances to reduce disk access
  • maintained logs to restart the search after a crash
Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions
VISTA
• Provides an interactive code improvement paradigm
  • view the low-level program representation
  • apply existing phases and manual changes in any order
  • browse and undo previous changes
  • automatically obtain performance information
  • automatically search for effective phase sequences
• Useful as both a research and a teaching tool
  • used at three universities
• Published in LCTES ’03 and TECS ’06
Faster Genetic Algorithm Searches
• Improving the performance of genetic algorithms
  • avoid redundant executions of the application (see the sketch below)
    • over 87% of executions were avoided
    • reduced search time by 62%
  • modify the search to obtain comparable results in fewer generations
    • reduced GA generations by 59%
    • reduced search time by 35%
• Published in PLDI ’04 and TACO ’05
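One way to realize the redundancy avoidance is to memoize fitness by a checksum of the generated code, so different sequences that produce identical code trigger only one execution (an illustrative sketch, not the papers' exact mechanism; compile_fn and run_fn are hypothetical stand-ins):

    import zlib

    fitness_cache = {}   # checksum of generated code -> measured fitness

    def fitness(sequence, compile_fn, run_fn):
        """Execute the application only when the generated code is new."""
        code = compile_fn(sequence)
        key = zlib.crc32(code)
        if key not in fitness_cache:
            fitness_cache[key] = run_fn(code)   # expensive: assemble, link, execute
        return fitness_cache[key]

    # Toy check: two sequences yielding identical code cause a single "execution"
    runs = []
    compile_fn = lambda seq: b"".join(sorted(seq))   # order-insensitive toy compiler
    run_fn = lambda code: runs.append(code) or len(code)
    print(fitness([b"b", b"a"], compile_fn, run_fn),
          fitness([b"a", b"b"], compile_fn, run_fn), len(runs))   # 2 2 1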
Heuristic Search Algorithms
• Analyzing the phase order space to improve heuristic algorithms
  • detailed performance and cost comparison of different heuristic algorithms
  • demonstrated the importance and difficulty of selecting the correct sequence length
  • illustrated the importance of leaf function instances
  • proposed modifications to existing algorithms, as well as new search algorithms
• To be published in CGO ’07
Dynamic Compilation
• Explored asynchronous dynamic compilation in a virtual machine
  • demonstrated shortcomings of the currently popular compilation strategy
  • described the importance of a minimum compiler utilization
  • discussed new compilation strategies
  • explored the changes needed in current compilation strategies to exploit free cycles
• Submitted to VEE ’07
Outline • Experimental framework • Exhaustive phase order space evaluation • Faster conventional compilation • Conclusions • Summary of my other work • Future research directions