1 / 64

Towards a More Principled Compiler: Progressive Backend Compiler Optimization

Towards a More Principled Compiler: Progressive Backend Compiler Optimization. David Koes 8/28/2006. Performance Gains Due to Compiler (gcc). 2.8Ghz Pentium 4, 1GB RAM, -O3 …. The Future of Compiler Optimization. is this possible?. How do we exploit the existing optimization potential?.

cheng
Download Presentation

Towards a More Principled Compiler: Progressive Backend Compiler Optimization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Towards a More Principled Compiler:Progressive Backend Compiler Optimization David Koes 8/28/2006

  2. Performance Gains Due to Compiler (gcc) 2.8Ghz Pentium 4, 1GB RAM, -O3 …

  3. The Future of Compiler Optimization is this possible? How do we exploit the existing optimization potential? Yes! Need a more principled compiler 10-30% improvement just from reordering compiler phases http://www.cs.rice.edu/~keith/Adapt/

  4. Compiler code size improvement

  5. A Principled Compiler • A compiler that • knows right from wrong • (less optimal from more optimal) • follows a rigorous procedure to get the desired output

  6. Problems some phases not internally optimal purely heuristic solution machine description mostly ignored lack of integration between phases target independent copy prop … GVN DCE strength reduct SCCP inlining PRE code motion const prop loop unroll Today’s Compiler target dependent insn sched machine description reg alloc insn select branch opt peephole … optimized program

  7. each phase locally optimal makes full use of machine description tight integration between phases copy prop DCE PRE loop unroll const prop code motion GVN inline strength reduct CSE SCCP peep-hole reg alloc branch opt … insn select Ideal Compiler Absolutely no idea how to do this or if it’s even possible machine description optimized program

  8. each phase locally optimal makes full use of machine description tight integration between phases reg alloc insn select Towards a More Principled Compiler copy prop DCE PRE loop unroll const prop code motion GVN inline strength reduct CSE SCCP peep-hole reg alloc branch opt machine description … optimized program

  9. Outline • Motivation • Related Work • Completed Work • Proposed Work • Contributions & Timeline

  10. Related Work eax ebx ecx edx esi edi esp ebp Register Allocation Problem unbounded number of program variables limited number of processor registers + slow memory spill code optimization … v = 1 w = v + 3 x = w + v u = v t = u + x print(x); print(w); print(t); print(u); … register preferences rematerialization register allocator live range splitting memory operands

  11. Related Work Register Allocation Previous Work

  12. Related Work ? Instruction Selection Problem IR Assem instruction selector IR Representation movl (p),t1 leal (x,t1),t2 leal 1(y),t3 leal (t2,t3),r minimum cost tiling

  13. Related Work Instruction Selection Previous Work

  14. Outline • Motivation • Related Work • Completed Work • Proposed Work • Contributions & Timeline

  15. Completed Work fully utilize machine description explicit and expressive model of costs of allocation for given architecture optimal solutions A More Principled Register Allocator reg alloc machine description

  16. Completed Work Multi-commodity Network Flow: An Expressive Model • Given network (directed graph) with • cost and capacity on each edge • sources & sinks for multiple commodities • Find lowest cost flow of commodities • NP-complete for integer flows b a Example: edges have unit capacity 1 0 b a

  17. Completed Work r0 r0 r1 r1 mem mem 1 1 Register Allocation as a MCNF Variables  Commodities Variable Definition  Source Variable Last Use  Sink Nodes  Allocation Classes (Reg/Mem/Const) Registers Limits  Node Capacities Spill Costs  Edge Costs Allocation  Flow a a r1 mem 1 3

  18. Completed Work Example Source Code int example(int a, int b) { int d = 1; int c = a - b; return c+d; } load cost insn pref cost Pre-alloc Assembly MOVE 1 -> d SUB a,b -> c ADD c,d -> c MOVE c -> r0 mem access cost

  19. Completed Work a: %eax a: %eax a: mem a: mem Control Flow • MCNF can only represent straight-line code • need to link together networks from basic blocks Extend MCNF model with merge and split nodes to implement boundary constraints. a: %eax details in proposal document… along with modeling persistence of values in memory a: mem a: mem

  20. Completed Work fully utilize machine description explicit and expressive model of costs of allocation for given architecture: Global MCNF locally optimal NP-hard, so use progressive solution technique A Better Register Allocator reg alloc machine description

  21. Completed Work fully utilize machine description explicit and expressive model of costs of allocation for given architecture: Global MCNF locally optimal NP-hard, so use progressive solution technique A Better Register Allocator reg alloc machine description

  22. Completed Work Technique: Lagrangian relaxation directed allocators Progressive Solution Technique • Quickly find a good allocation • Then progressively find better allocations • until optimal allocation found • or time limit is reached Allocation Quality Compile Time

  23. Completed Work b a 1 0+1 b a b a 1 0 with price, solution to single commodity flow can be solution to multicommodity flow b a Lagrangian Relaxation: Intuition • Relaxes the hard constraints • only have to solve single commodity flow • Combines easy subproblems using a Lagrangian multiplier (price) • an additional price on each edge • a price on each split/merge node Example: edges have unit capacity

  24. Completed Work b a 1 0+1 b a b a 1 0+1 b a Solution Procedure • Compute prices with iterative subgradient optimization • guaranteed converge to optimal prices • optimal for linear relaxation • At each iteration, construct a feasible integer solution using current prices • iterative allocator in document • simultaneous allocator • trace-based simultaneous allocator

  25. Completed Work X X Simultaneous Allocator Edges to/from memory cost 3 Current cost: -1 -3 -2

  26. Completed Work Trace-Based Allocation • Decompose function into traces of basic blocks • run simultaneous allocator on each trace • control flow internal to trace presents difficulty addressed in proposal document

  27. Completed Work Evaluation • Implemented in gcc 3.4.4 targeting x86 • Optimize for code size • perfect static evaluation • important metric in its own right • MediaBench, MiBench, Spec95, Spec2000 • over 10,000 functions

  28. Completed Work Progressiveness squareEncrypt

  29. Completed Work Progressiveness quicksort

  30. Completed Work Progressive! Code Size

  31. Completed Work Optimality Proven optimality

  32. Completed Work Compile Time Slowdown :-( 9.2x slower

  33. Completed Work fully utilize machine description explicit and expressive model of costs of allocation for given architecture: Global MCNF locally optimal approach optimality using progressive solution technique: Lagrangian directed allocators reg alloc machine description A Better Register Allocator

  34. Outline • Motivation • Related Work • Completed Work • Proposed Work • Contributions & Timeline

  35. Proposed Work A Better Better Register Allocator • Solver Improvements • Improve initial solution • Improve quality as prices converge • Hope to prove approximation bounds • Model Improvements • Improve accuracy of model • Model simplification • Represent uniform register sets efficiently

  36. Proposed Work Model Simplification Summarize overly expressive sections of the model • Conservative simplification • does not change optimal value • Aggressive simplification • explore tradeoff between model complexity and optimality

  37. Proposed Work Instruction Selection Interaction • which instruction is best • depends on the register allocator • so let register allocator decide perform same operation

  38. Proposed Work Register Allocation Aware Instruction SElection (RA2ISE) • Instruction selection not finalized until register allocation • IR tiled with Register Allocation Aware Tiles (RAATs) • A RAAT represents several instruction sequences • different costs • a sequence for every possible register allocation

  39. Proposed Work RA2ISE RAAT IR tiling model creation register allocation cwtl %eax

  40. Proposed Work Implementing RA2ISE • Add side-constraints to Global MCNF model • implement inter-variable preferences and constraints • “if x allocated to r1 and y allocated to r2, then save three bytes” • “x and y must be allocated to the same register” • Implement x86 RAATs • RAAT tables created manually • GMCNF RAAT representation automatically generated from RAAT table with minimum use of side constraints • Algorithms for tiling RAATs • leverage existing algorithms • exploit feedback between passes

  41. Proposed Work 3 3 1 1 1 4 tiling 1 1 1 1 1 1 1 2 2 5 5 1 3 3 3 1 2 4 registerallocate 3 4 1 mem mem edx 3 2 1 4 eax 3 Tiling RAATs 4 2 2 4 1 3 feedback

  42. Proposed Work Evaluation • Implement in production quality compiler (gcc) • Evaluate code size and simple code speed metric • Evaluate on three different architectures • x86 (8 registers) • 68k/ColdFire (16 registers) • PPC (32 registers)

  43. Outline • Motivation • Related Work • Completed Work • Proposed Work • Contributions & Timeline

  44. Contributions • RA2ISE • register allocation aware tiles (RAATs) explicitly encode effect of register allocation on instruction sequence • algorithms for tiling RAATs • expressive model of register allocation that operates on RAATs and explicitly represents all important components of register allocation • progressive solver for this model that can quickly find decent solution and approaches optimality as more time is allowed for compilation • Comprehensive evaluation of RA2ISE

  45. Thesis Statement • RA2ISE is a principled and effective system for performing instruction selection and register allocation.

  46. reg alloc insn select One Step Towards a More Principled Compiler copy prop DCE PRE loop unroll const prop code motion GVN inline strength reduct CSE SCCP peep-hole reg alloc branch opt machine description … optimized program

  47. Timeline

  48. Andrew Richard Koes

  49. ? Questions?

  50. Processor Performance

More Related