Run Time Optimization. 15-745: Optimizing Compilers Pedro Artigas. Motivation. A good reason Compiling a language that contains run-time constructs Java dynamic class loading Perl or Matlab eval(“statement”) Faster than interpreting A better reason
Run Time Optimization 15-745: Optimizing Compilers Pedro Artigas
Motivation • A good reason • Compiling a language that contains run-time constructs • Java dynamic class loading • Perl or Matlab eval(“statement”) • Faster than interpreting • A better reason • May use program information only available at run time
Example of run-time information • The processor that will be used to run the program • inc ax is faster on a Pentium III • add ax,1 is faster on a Pentium 4 • No need to recompile if generating code at run time • The actual program input/run-time behavior • Is my profile information accurate for the current program input? YES!
The life cycle of a program • Compile: one object file, global analysis • Link: one binary, whole program analysis • Load/Run: one process. Analysis? No, observation! • Larger scope, better information about program behavior
New strategies are possible • Pessimistic vs. optimistic approaches • Ex: Does int *a point to the same location as int *b? • Compile time/Pessimistic: Prove that in ANY execution those pointers point to different addresses • Run time/Optimistic: Up to now in the current execution a and b point to different locations • Assume this holds • If the assumption breaks, invalidate the generated code and generate new code
A sanity check • Using run time information does not require run time code generation • Example: Versioning • ISA may allow cheaper tests • IA-64 • Transmeta
    if (a != b) {
        <generate code assuming a != b>
    } else {
        <generate code assuming a == b>
    }
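The versioning idea can be sketched in Python. This is only an illustration under stated assumptions: the function name `accumulate_twice` and the one-element lists standing in for pointers are hypothetical, and `a is not b` plays the role of the `a != b` pointer test. Both versions exist ahead of time; the run-time test only selects one.

```python
def accumulate_twice(a, b):
    """Semantics: a[0] += b[0]; a[0] += b[0], for one-element lists a and b."""
    if a is not b:
        # Version specialized for a != b: writing a[0] cannot change b[0],
        # so b[0] is loaded once and the two additions are folded.
        t = b[0]
        a[0] = a[0] + 2 * t
    else:
        # Version for a == b: every write to a[0] also changes b[0],
        # so the value must be reloaded before the second addition.
        a[0] += b[0]
        a[0] += b[0]
    return a[0]
```

Using the specialized version when the two names alias would give a different result (3x instead of 4x for an initial value x), which is why the test is needed.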
Drawbacks • Code generation has to be FAST • Rule of thumb: almost linear in program size • Code quality: Compromise on quality to achieve fast code generation • Shoot for good, not great • Also this usually means: • No time for classical Iterative Data Flow Analysis (IDFA) at run time
No classical IDFA: Solutions • Quasi-Static and/or Staged Compilation • Perform IDFA at compile time • Specialize the dynamic code generator for the obtained information • That is, encode the obtained data flow information in the “binary” • Do not rely on classical IDFA • Use algorithms that do not require it • Ex: Dominator based value numbering (coming up!) • Generate code in a style that does not require it • Ex: One entry multiple exits traces • as in deco and dynamo
Code generation Strategies • Compiling a language that requires run-time code generation: • Compile adaptively: • Use a very simple and fast code generation scheme • Re-compile frequently used regions using more advanced techniques
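A minimal sketch of the adaptive scheme, assuming a simple call counter as the frequency measure. The class name `AdaptiveFunction`, the `optimize` callback and the `threshold` parameter are all hypothetical names for this illustration, not part of any real system.

```python
class AdaptiveFunction:
    """Runs cheaply generated code first; re-compiles once the region is hot."""

    def __init__(self, baseline, optimize, threshold):
        self.code = baseline          # output of the simple, fast code generator
        self.optimize = optimize      # callable that returns better code (slow to run)
        self.threshold = threshold    # number of calls tolerated before re-compiling
        self.calls = 0
        self.optimized = False

    def __call__(self, *args):
        self.calls += 1
        if not self.optimized and self.calls > self.threshold:
            # The region turned out to be frequently used: pay for the
            # more advanced code generation once, then keep using its result.
            self.code = self.optimize()
            self.optimized = True
        return self.code(*args)


def slow_sum(n):
    # Stand-in for naively generated code.
    return sum(range(n + 1))


def make_fast_sum():
    # Stand-in for the optimizing compiler: returns a closed-form version.
    return lambda n: n * (n + 1) // 2
```

For instance, `AdaptiveFunction(slow_sum, make_fast_sum, threshold=2)` keeps the baseline code for the first two calls and switches on the third.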
Adaptive Compilation: Motivation • Very simple code generation • Higher execution cost • Elaborate code generation • Higher compilation cost • Problem: • We may not know in advance how frequently a region will execute • Measure frequencies and re-compile dynamically • [Figure: cost of the fast compiler versus the optimizing compiler, with the threshold marked]
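The trade-off can be made concrete with a simple linear cost model, which is an assumption of this sketch and not taken from the slides: total cost = compilation cost + (number of executions) x (cost per execution). The function name and the numbers in the usage note are hypothetical.

```python
def break_even_executions(c_fast, e_fast, c_opt, e_opt):
    """Number of executions after which the optimizing compiler is cheaper overall.

    c_* are one-time compilation costs, e_* are per-execution costs, all in
    the same (arbitrary) time unit. Assumes c_opt >= c_fast.
    Solves c_fast + n * e_fast == c_opt + n * e_opt for n.
    """
    if e_opt >= e_fast:
        # The optimized code is not faster per execution: it never pays off.
        return None
    return max(0.0, (c_opt - c_fast) / (e_fast - e_opt))
```

With made-up costs `c_fast=10, e_fast=5, c_opt=110, e_opt=1`, the extra 100 units of compilation are recovered at 4 units per execution, so the threshold is 25 executions.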
Code generation Strategies • Compiling selected regions that benefit from run-time code generation: • Pick only the regions that should benefit the most • Which regions? • Select them statically • Use profile information • Re-compile (that is, select them dynamically) • Usually all of the above
Code Optimization Unit • What is the run-time unit of optimization? • Option: Procedures/static code regions • Similar to static compilers • Option: Traces • Start at the target of a backward branch • Include all the instructions in a path • May include procedure calls and returns • Branches • Fall through = remain in the trace • Target = exit the trace • [Figure: a CFG with blocks 1, 2, 3, 4 and the traces built from it]
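Trace selection can be sketched as follows. The sketch makes two assumptions that are not in the slides: block ids follow layout order, so a transition to an equal or smaller id counts as a backward branch, and a trace ends when the next backward branch is taken. The function name `select_trace` is hypothetical.

```python
def select_trace(path):
    """Return the first trace found in an observed execution path.

    path: block ids in the order they executed. The trace starts at the
    target of the first backward branch and follows the executed path.
    """
    for i in range(1, len(path)):
        if path[i] <= path[i - 1]:            # backward branch taken
            trace = [path[i]]                  # its target heads the trace
            for j in range(i + 1, len(path)):
                if path[j] <= path[j - 1]:     # next backward branch: stop
                    break
                trace.append(path[j])          # stay on the executed path
            return trace
    return []                                  # no backward branch observed
```

For the observed path `[1, 2, 4, 1, 3, 4, 1, 2, 4]`, the branch from 4 back to 1 starts a trace, and the recorded trace is `[1, 3, 4]`: one entry at block 1, with every other branch target acting as an exit.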
Run-Time code generation: Case studies • Two examples of algorithms that are suitable for run-time code generation • Run time CSE/PRE replacement: • Dominator based value numbering • Run time Register Allocation: • Linear scan register allocation
Sidebar • With traces CSE/PRE become almost trivial • No need for register allocation if optimizing a binary (ex: dynamo) • [Figure: repeated A+B computations, labeled PRE and CSE]
Review: Local value numbering • Store expressions already computed (in a hash table) • Store the variable name → VN mapping in the VN array • Store the VN → variable name mapping in the Name array • Same value number ⇒ same value
    for each basic block
        Table.empty()
        for each computed expression ("x = y op z")
            if V = Table.lookup("y op z")
                // Expression was computed in the past, check if result is available
                if VN[Name[V]] == V        // expression is still there
                    replace "x = y op z" with "x = Name[V]"
                else
                    Name[V] = "x"
                VN["x"] = V                // updated after the check, in case Name[V] is "x" itself
            else
                // New expression, add to the table
                VN["x"] = new_value_number()
                Table.insert("y op z", VN["x"])
                Name[VN["x"]] = "x"
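A runnable sketch of local value numbering for one basic block. Assumptions of this sketch: instructions are `(x, y, op, z)` tuples meaning `x = y op z`, the table is keyed on the value numbers of the operands rather than on their names, and the availability check is done before `VN[x]` is updated so that a redefined `x` is not mistaken for a holder of the old value. The function name and the tuple encoding are hypothetical.

```python
import itertools


def local_value_numbering(block):
    """Rewrite redundant 'x = y op z' into copies ('x', holder) within one block."""
    table = {}                    # (op, VN[y], VN[z]) -> value number
    vn = {}                       # variable name -> value number
    name = {}                     # value number -> variable that received it
    fresh = itertools.count()

    def number_of(var):
        # Variables defined outside the block get a fresh number on first use.
        if var not in vn:
            vn[var] = next(fresh)
            name[vn[var]] = var
        return vn[var]

    out = []
    for x, y, op, z in block:
        key = (op, number_of(y), number_of(z))
        v = table.get(key)
        if v is not None:
            holder = name[v]
            if vn.get(holder) == v:       # the old result is still available
                out.append((x, holder))   # replace with the copy x = holder
            else:
                out.append((x, y, op, z)) # recompute; x is the new holder
                name[v] = x
            vn[x] = v
        else:
            v = next(fresh)               # new expression
            table[key] = v
            vn[x] = v
            name[v] = x
            out.append((x, y, op, z))
    return out
```

For the block `a = b + c; d = b + c; a = e * f; g = b + c`, the second instruction becomes the copy `d = a`. The last one is recomputed because `a` was overwritten and only one holder per value number is tracked, even though `d` still holds the value.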
Local value numbering • Works in time linear in program size • Assuming accesses to the array and the hash table occur in constant time • Can we make it work in a scope larger than a basic block? (Hint: Yes) • What are the potential problems?
Problems • How to propagate the hash table contents across basic blocks? • How to make sure that it is safe to access the location containing the expression in other basic blocks? • How do we make sure that the location containing the expression is fresh? • Remember: no IDFA
Control flow issues • On split points things are simple • Just keep the content of the hash table from the predecessor • What about merge points? • We do not know if the same expression was computed in all incoming paths • We do not want to check the fact anyway (why?) • Reset the state of the hash table to a safe state it had in the past • Which program point in the past? • The immediate dominator of the merge block
Data flow issues • Making sure the def of an expression is fresh and reaches the blocks of interest • How? • By construction! SSA • All names are fresh (Single Assignment) • All defs dominate their uses (regular uses, not φ-functions) • As, by construction, we introduce new defs using φ-functions at every point where this would not hold
Dominator/SSA based value numbering
    DVN(Block B)
        Table.PushScope()
        // First process the φ expressions
        for each exp "n = φ(…)"
            if (exp is redundant or meaningless)    // meaningless: φ(x0,x0)
                VN["n"] = Table.lookup("φ(…)" or "x0")
                remove("n = φ(…)")
            else
                VN["n"] = "n"
                Table.insert("φ(…)", VN["n"])
        // Then the regular ones
        for each exp "x = y op z"
            if ("v" = Table.lookup("y op z"))
                VN["x"] = "v"
                remove("x = y op z")
            else
                VN["x"] = "x"
                Table.insert("y op z", VN["x"])
        // Propagate info about inputs and call DVN recursively
        for each successor s of B
            adjust the φ inputs
        for each dominator tree child c in CFG reverse post-order
            DVN(c)
        Table.PopScope()
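A runnable sketch of DVN, under several assumptions: the CFG is a dict of blocks with `phis`, `ops`, `preds`, `succs` and `dom_children` fields (hypothetical names), φ arguments are ordered like `preds`, the scoped table is a list of dicts, and φ keys include the block id so that only φs in the same block can match. The sample CFG is a four-block diamond modelled on the VN example; the edges and the assignment of statements to blocks 2 and 3 are assumptions.

```python
def dvn(b, cfg, vn, scopes):
    """Dominator-based value numbering of block b and its dominator-tree children."""
    scopes.append({})                                   # Table.PushScope()

    def lookup(key):
        for scope in reversed(scopes):                  # innermost scope first
            if key in scope:
                return scope[key]
        return None

    node = cfg[b]
    kept_phis = []
    for n, args in node["phis"]:                        # first the phi expressions
        key = ("phi", b, tuple(args))
        if len(set(args)) == 1:                         # meaningless: phi(x0, x0)
            vn[n] = args[0]
        elif lookup(key) is not None:                   # redundant phi
            vn[n] = lookup(key)
        else:
            vn[n] = n
            scopes[-1][key] = n
            kept_phis.append((n, args))
    node["phis"] = kept_phis

    kept_ops = []
    for x, y, op, z in node["ops"]:                     # then the regular ones
        key = (op, vn.get(y, y), vn.get(z, z))
        v = lookup(key)
        if v is not None:
            vn[x] = v                                   # redundant: drop it
        else:
            vn[x] = x
            scopes[-1][key] = x
            kept_ops.append((x, key[1], op, key[2]))
    node["ops"] = kept_ops

    for s in node["succs"]:                             # adjust the phi inputs
        i = cfg[s]["preds"].index(b)
        for _, args in cfg[s]["phis"]:
            args[i] = vn.get(args[i], args[i])
    for c in node["dom_children"]:                      # assumed listed in reverse post-order
        dvn(c, cfg, vn, scopes)
    scopes.pop()                                        # Table.PopScope()


def example_cfg():
    def blk(phis, ops, preds, succs, kids):
        return {"phis": phis, "ops": ops, "preds": preds,
                "succs": succs, "dom_children": kids}
    return {
        1: blk([], [("u0", "a0", "+", "b0"), ("v0", "c0", "+", "d0"),
                    ("w0", "e0", "+", "f0")], [], [2, 3], [2, 3, 4]),
        2: blk([], [("x0", "c0", "+", "d0"), ("y0", "c0", "+", "d0")], [1], [4], []),
        3: blk([], [("u1", "a0", "+", "b0"), ("x1", "e0", "+", "f0"),
                    ("y1", "e0", "+", "f0")], [1], [4], []),
        4: blk([("u2", ["u0", "u1"]), ("x2", ["x0", "x1"]), ("y2", ["y0", "y1"])],
               [("u3", "a0", "+", "b0")], [2, 3], [], []),
    }
```

Calling `dvn(1, cfg, {}, [])` on `cfg = example_cfg()` empties blocks 2 and 3, and leaves block 4 with the single φ `x2 = φ(v0, w0)`: `u2` is meaningless, `y2` is redundant with `x2`, and `u3` reuses `u0`.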
VN Example
    Block 1: u0=a0+b0   v0=c0+d0   w0=e0+f0
    Block 2: x0=c0+d0   y0=c0+d0
    Block 3: u1=a0+b0   x1=e0+f0   y1=e0+f0
    Block 4: u2=φ(u0,u1)   x2=φ(x0,x1)   y2=φ(y0,y1)   u3=a0+b0
Problems • Does not catch: • [Figure: CFG examples with the statements x0=a0+b0, x0=a0+b0, x1=a0+b0, x1=φ(x0,x2), x2=a0+b0, x2=φ(x0,x1)] • But it performs almost as well as CSE • And runs much faster • Linear time? (YES? NO?)
Homework #4 • The DVN algorithm scans the CFG in a similar way as the second phase of SSA translation • SSA translation phase #1 • Placing φ-functions • SSA translation phase #2 • Assigning unique numbers to variables • Combine both and save one pass • Gives us a smaller constant • But, at run time, it pays off!
Run time register allocation • Graph Coloring? Not an option • Even the simple stack based heuristic shown in class is O(n^2) • Not even counting: • Building the graph • Move coalescing optimization • But register allocation is VERY important in terms of performance • Remember, memory is REALLY slow • We need a simple but effective (almost) linear time algorithm
Let’s start simple • Start with a local (basic block) linear time algorithm • Assuming only one def and one use per variable (More constrained than SSA) • Assuming that if a variable is spilled it must remain spilled (Why?) • Can we find an optimum linear time algorithm? (Hint: Yes) • Ideas? • Think about liveness first …
Simple Algorithm: Computing Liveness • One def and one use per variable, only one block • A live range is merely the interval between the def and the use • Live Interval: Interval between the first def and the last use • OBS: Live Range = Live Interval if there is no control flow, only one def and use • We could compute live intervals using a linear scan if we store the def instructions (beginning of the interval) in a hash table
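The single scan can be sketched like this. The encoding of an instruction as a `(defs, uses)` pair of name lists and the function name `live_intervals` are assumptions of the sketch. Instructions are numbered from 1, and every variable is assumed to be defined inside the block before it is used.

```python
def live_intervals(block):
    """Map each variable to [first def, last use] in one pass over the block."""
    intervals = {}                       # hash table: name -> [start, end]
    for i, (defs, uses) in enumerate(block, start=1):
        for v in uses:
            if v in intervals:
                intervals[v][1] = i      # extend the interval to this use
        for v in defs:
            if v not in intervals:
                intervals[v] = [i, i]    # the def opens the interval
    return intervals
```

For the block `A=1; B=2; C=3; D=A; E=B; use(E); use(D); use(C)` this gives A = [1, 4], B = [2, 5], C = [3, 8], D = [4, 7] and E = [5, 6].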
Example
    S1: A=1
    S2: B=2
    S3: C=3
    S4: D=A
    S5: E=B
    S6: use(E)
    S7: use(D)
    S8: use(C)
Now Register Allocation • Another linear scan • Keep the active intervals in a list (active) • Assumption: an interval, when spilled, will remain spilled • Two scenarios • #1: active.length() < R • No problem • #2: active.length() == R • Must spill • Which interval?
Spilling heuristic • Since there is no second chance: • That is, a spilled variable will always remain spilled • Spill the interval that ends last • Intuition: As one spill must occur … • Pick the one that makes the remaining allocation least constrained • That is, the interval that ends last • This is the provably optimal solution (given all the constraints)
Linear Scan Register Allocation
    active = {}
    freeregs = {all_registers}
    for each interval I (in order of increasing start point)
        // Expire old intervals
        for each interval J in active
            if J.end > I.start
                continue
            active.remove(J)
            freeregs.insert(J.register)
        end for each interval J
        if active.length() == R
            // Must spill, pick either the last interval in active or the new interval
            spill_candidate = active.last()
            if (spill_candidate.end > I.end)
                I.register = spill_candidate.register
                spill(spill_candidate)
                active.remove(spill_candidate)
                active.insert_sorted(I)        // sorted by end point
            else
                spill(I)
        else
            // No constraints
            I.register = freeregs.pop()        // get any register from the free list
            active.insert_sorted(I)            // sorted by end point
    end for each interval I
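A runnable sketch of the allocator, assuming intervals are given as a dict from variable name to `(start, end)` and registers are numbered `0..R-1`. The function name `linear_scan` and the return shape are hypothetical. Because `active` is kept sorted by end point, expiring from the front is equivalent to scanning it with `continue`.

```python
import bisect


def linear_scan(intervals, num_regs):
    """Return (register assignment, list of spilled variables)."""
    if num_regs < 1:
        raise ValueError("need at least one register")
    free = list(range(num_regs))
    active = []                          # (end, name) pairs, sorted by end point
    reg = {}
    spilled = []
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire old intervals: everything that ended at or before this start.
        while active and active[0][0] <= start:
            _, old = active.pop(0)
            free.append(reg[old])
        if len(active) == num_regs:
            # Must spill: either the active interval that ends last, or the new one.
            cand_end, cand = active[-1]
            if cand_end > end:
                reg[name] = reg.pop(cand)        # take over the candidate's register
                spilled.append(cand)
                active.pop()
                bisect.insort(active, (end, name))
            else:
                spilled.append(name)
        else:
            reg[name] = free.pop()               # any free register will do
            bisect.insort(active, (end, name))
    return reg, spilled
```

With the intervals A = (1, 4), B = (2, 5), C = (3, 8), D = (4, 7), E = (5, 6) and two registers, C is the only spill: it starts while A and B are active and ends after both. D then reuses A's register and E reuses B's.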
Example (R=2)
    S1: A=1
    S2: B=2
    S3: C=3
    S4: D=A
    S5: E=B
    S6: use(E)
    S7: use(D)
    S8: use(C)
[Figure: live intervals of A, B, C, D and E]
Is the second pass really linear? • Invariant: active.length()<=R • Complexity O(R*n) • R is usually a small constant (128 at most) • Therefore: O(n)
And we are done! Right? • YES and NO • Use the same algorithm as before for register assignment • Program representation: Linear list of instructions • Live intervals are not precise anymore given control flow and multiple def/uses • Not optimum, but still FAST • Code quality: within 10% of graph coloring for spec95 benchmarks (One problem with this claim)
The worst problem: Obtaining precise live intervals • How to obtain precise live interval information FAST? • Claim of 10% relies on live interval information obtained using liveness analysis (IDFA) • IDFA is SLOW, O(n^3) • Most recent solutions: • Use the local interval algorithm for variables that are live only inside one basic block • Use liveness analysis for more global variables • Alleviates the problem, does not fully solve it
More problems: Live intervals may not be precise • OBS: The idea of lifetime holes leads to allocators that also try to use these holes to assign the same register to other live ranges (bin-packing) • Such an allocator is used in the Alpha family of compilers (GEM compilers)
Other problems: Linearization order • Register allocation quality depends on the chosen block linearization order • Orders that work well in practice: • layout order • depth first traversal of the CFG • With both, the generated code is only about 10% slower than with graph coloring
Graph coloring versus linear scan • [Figure: compilation cost scaling]
Conclusion • Run time code generation provides new optimization opportunities • Challenges • Identify new optimization opportunities • Design new compilation strategies • example: optimistic versus conservative • Design algorithms and implementations that: • minimize run time overhead • Do not compromise much on code quality • Recent examples indicate: • extending fast local methods is a promising way to obtain fast run-time code generation