
CSE P501 – Compiler Construction



Presentation Transcript


  1. CSE P501 – Compiler Construction Compiler Backend Organization: Instruction Selection • Instruction Scheduling • Register Allocation • Peephole Optimization • Peephole Instruction Selection Jim Hogg - UW - CSE - P501

  2. A Compiler • Front End: chars → tokens (Scan) → AST (Parse) → IR (Convert) • 'Middle End': IR → IR (Optimize) • Back End: IR → Machine Code (Select Instructions, Allocate Registers, Emit) → Target • Instruction Selection: processors don't support IR; need to convert IR into real machine code • AST = Abstract Syntax Tree • IR = Intermediate Representation Jim Hogg - UW - CSE - P501

  3. The Big Picture Compiler = lots of fast stuff, followed by hard problems • Scanner: O(n) • Parser: O(n) • Analysis & Optimization: ~ O(n log n) • Instruction selection: fast or NP-Complete • Instruction scheduling: NP-Complete • Register allocation: NP-Complete • Recall: the approach in P501 to describing the backend is 'survey' level • A deeper dive would require another full 10-week course Jim Hogg - UW - CSE - P501

  4. Compiler Backend: 3 Parts • Select Instructions • eg: Best x86 instruction to implement: vr17 = 12(arp) ? • Schedule Instructions • eg: Given instructions a,b,c,d,e,f,g would the program run faster if they were issued in a different order, such as d,c,a,b,g,e,f ? • Allocate Registers • eg: which variables to store in registers? which to store in memory? Jim Hogg - UW - CSE - P501

  5. Instruction Selection is . . . Mid-Level IR → Low-Level IR (Select Real Instructions) • Mid-Level IR: TAC - Three Address Code, eg: t1 ← a op b • Tree or Linear • ∞ supply of temps • Array address calcs. explicit • Optimizations all done • Storage locations decided • Low-Level IR: Specific to one chip/ISA • Uses chip addressing modes • But still not decided on registers Jim Hogg - UW - CSE - P501
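The TAC form above can be sketched concretely. Below is an illustrative Python encoding; the class and field names are made up for this sketch, not taken from any particular compiler:

```python
from dataclasses import dataclass

@dataclass
class TAC:
    """One three-address instruction: dest <- lhs op rhs."""
    dest: str
    op: str
    lhs: str
    rhs: str

    def __str__(self):
        return f"{self.dest} = {self.lhs} {self.op} {self.rhs}"

# a = b - 2 * c, lowered with an unlimited supply of temps (t1, t2, ...)
code = [TAC("t1", "*", "2", "c"),
        TAC("a", "-", "b", "t1")]
print("\n".join(str(i) for i in code))
```

Note how each instruction has at most one operator and three names, which is what makes tree- and peephole-based selection over TAC tractable.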

  6. Instruction Scheduling is . . . Schedule: instructions a b c d e f g h → issued in a new order • Execute in-order to get correct answer • Issue in new order • eg: memory fetch is slow • eg: divide is slow • Overall faster • Still get correct answer! • Originally devised for super-computers • Now used everywhere: • in-order procs - Atom, older ARM • out-of-order procs - newer x86 • Compiler does the 'heavy lifting' - reduces chip power Jim Hogg - UW - CSE - P501
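As a rough illustration of the idea, here is a toy greedy list scheduler in Python. The instruction names, latencies, and dependence sets are invented for the example; real schedulers model functional units, issue slots, and cycle counts far more carefully:

```python
def list_schedule(instrs, deps, latency):
    """instrs: names in program order; deps[i]: set of instrs that must
    precede i; latency[i]: rough cost in cycles."""
    done, order = set(), []
    while len(order) < len(instrs):
        # "ready" = not yet issued and all predecessors already issued
        ready = [i for i in instrs
                 if i not in done and deps.get(i, set()) <= done]
        # greedy heuristic: start the longest-latency ready instruction first
        nxt = max(ready, key=lambda i: latency.get(i, 1))
        order.append(nxt)
        done.add(nxt)
    return order

# program order puts the slow load in the middle; the scheduler hoists it
order = list_schedule(["inc", "load", "use"],
                      {"use": {"load"}},
                      {"inc": 1, "load": 3, "use": 1})
print(order)  # ['load', 'inc', 'use']
```

The answer is still correct because the dependence `use` ← `load` is respected; only independent instructions are reordered.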

  7. Register Allocation is . . . Allocate: virtual registers or temps (∞ supply) → real machine registers • eg: EAX, EBP • Very finite supply! • Enregister/Spill • eg: spill code wraps a use of R1 in push R1 ... pop R1 • After allocating registers, may do a second pass of scheduling to improve speed of spill code Jim Hogg - UW - CSE - P501

  8. Instruction Selection • Instruction selection chooses which target instructions (eg: for x86, for ARM) to use for each IR instruction • Selecting the best instructions is massively difficult because modern ISAs provide: • huge choice of instructions • wide choice of addressing modes • Eg: Intel's x64 Instruction Set Reference Manual = 1422 pages Jim Hogg - UW - CSE - P501

  9. Choices, choices ... • Most chip ISAs provide many ways to do the same thing: • eg: setting eax to 0 on x86 has several alternatives: mov eax, 0 • xor eax, eax • sub eax, eax • imul eax, 0 • Many machine instructions do several things at once – eg: register arithmetic and effective address calculation. Recall: lea rdst, [rbase + rindex*scale + offset] Jim Hogg - UW - CSE - P501

  10. Overview: Instruction Selection • Map IR into near-assembly code • For MiniJava, we emitted textual assembly code • Commercial compilers emit binary code directly • Assume known storage layout and code shape • ie: optimization phases have already done their thing • Combine low-level IR operations into machine instructions (take advantage of addressing modes, etc) Jim Hogg - UW - CSE - P501

  11. Criteria for Instruction Selection • Several possibilities • Fastest • Smallest • Minimal power (eg: don’t use a function-unit if leaving it powered-down is a win) • Sometimes not obvious • eg: if one of the function-units in the processor is idle and we can select an instruction that uses that unit, it effectively executes for free, even if that instruction wouldn’t be chosen normally • (Some interaction with scheduling here…) Jim Hogg - UW - CSE - P501

  12. Instruction Selection: Approaches • Two main techniques: • Tree-based matching (eg: maximal munch algorithm) • Peephole-based generation • Note that we select instructions from the target ISA. • We do not decide which physical registers to use. That comes later, during Register Allocation • We have a few generic registers: ARP, StackPointer, ResultReg Jim Hogg - UW - CSE - P501

  13. Tree-Based Instruction Selection How to generate target code for the simple tree e × f, ie: ×( ◆ID<e,arp,4>, ◆ID<f,arp,8> )? We could use a template approach - similar to converting an AST into IR: case ID: t1 = offset(node); t2 = base(node); reg = nextReg(); emit(loadAO, t1, t2, reg) Resulting codegen is correct, but dumb Doesn't even cover: call-by-val, call-by-ref, enregistered, different data types, etc Jim Hogg - UW - CSE - P501

  14. Template Code Generation e × f, ie: ×( ◆ID<e,arp,4>, ◆ID<f,arp,8> ) case ID: t1 = offset(node); t2 = base(node); reg = nextReg(); emit(loadAO, t1, t2, reg) Naive: loadI 4 => r5 • loadAO rarp, r5 => r6 • loadI 8 => r7 • loadAO rarp, r7 => r8 • mult r6, r8 => r9 Ideal: loadAI rarp, 4 => r5 • loadAI rarp, 8 => r6 • mult r5, r6 => r7 Jim Hogg - UW - CSE - P501
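The template scheme above can be sketched in Python. The node encoding, helper names, and register numbering are illustrative assumptions, but the emitted sequence mirrors the "naive" column:

```python
def make_codegen():
    """Returns a gen(node, out) function with its own register counter."""
    counter = [0]
    def new_reg():
        counter[0] += 1
        return f"r{counter[0]}"
    def gen(node, out):
        kind = node[0]
        if kind == "ID":                      # ("ID", base_reg, offset)
            _, base, offset = node
            r_off = new_reg()
            out.append(f"loadI {offset} => {r_off}")          # offset -> reg
            r_val = new_reg()
            out.append(f"loadAO {base}, {r_off} => {r_val}")  # fetch base+off
            return r_val
        if kind == "*":                       # ("*", left, right)
            r1, r2 = gen(node[1], out), gen(node[2], out)
            r = new_reg()
            out.append(f"mult {r1}, {r2} => {r}")
            return r
        raise ValueError(f"no template for {kind}")
    return gen

gen = make_codegen()
code = []
gen(("*", ("ID", "rarp", 4), ("ID", "rarp", 8)), code)
print("\n".join(code))   # the "naive" sequence, modulo register numbering
```

One fixed template per node kind is exactly why the result is correct but dumb: the ID template cannot see that the offset is a constant, so it never emits the single loadAI of the "ideal" column.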

  15. IR (3-address Code) Tree-lets Rules for Tree-to-Target Conversion (prefix notation); ◆ = Memory Dereference
  Production → ILOC Template
  5 ...
  6 Reg ← Lab1 → loadI l1 => rn
  8 Reg ← Num1 → loadI n1 => rn
  9 Reg ← ◆ Reg1 → load r1 => rn
  10 Reg ← ◆ + Reg1 Reg2 → loadAO r1, r2 => rn
  11 Reg ← ◆ + Reg1 Num2 → loadAI r1, n2 => rn
  14 Reg ← ◆ + Lab1 Reg2 → loadAI r2, l1 => rn
  15 Reg ← + Reg1 Reg2 → add r1, r2 => rn
  16 Reg ← + Reg1 Num2 → addI r1, n2 => rn
  19 Reg ← + Lab1 Reg2 → addI r2, l1 => rn
  20 ...
  Jim Hogg - UW - CSE - P501

  16. Example: Tiling a tiny 4-node tree Load variable <c,@G,12>: ◆( + ( Num 12, Lab @G ) ) • The tree above shows code to access variable c, stored at offset 12 bytes from label @G • How many ways can we tile this tree into equivalent ILOC code? Jim Hogg - UW - CSE - P501

  17. Potential Matches - eight tilings of ◆( + ( Num 12, Lab @G ) ), named by the rules used:
  <6,11>: loadI @G => ri • loadAI ri, 12 => rj
  <8,14>: loadI 12 => ri • loadAI ri, @G => rj
  <6,8,10>: loadI @G => ri • loadI 12 => rj • loadAO ri, rj => rk
  <8,6,10>: loadI 12 => ri • loadI @G => rj • loadAO ri, rj => rk
  <6,16,9>: loadI @G => ri • addI ri, 12 => rj • load rj => rk
  <8,19,9>: loadI 12 => ri • addI ri, @G => rj • load rj => rk
  <6,8,15,9>: loadI @G => ri • loadI 12 => rj • add ri, rj => rk • load rk => rl
  <8,6,15,9>: loadI 12 => ri • loadI @G => rj • add ri, rj => rk • load rk => rl

  18. Tree-Pattern Matching • A given tiling “implements” a tree if: • covers every node in the tree, and • overlap between any two tiles (trees) is limited to a single node • If <node,op> is in the tiling, then node is also covered by a leaf in another operation tree in the tiling – unless it is the root • Where two operation trees meet, they must be compatible (ie: expect the same value in the same location) Jim Hogg - UW - CSE - P501
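The greedy "maximal munch" strategy - prefer the largest tile that matches at the root, then recur on the uncovered subtrees - can be sketched for the 4-node tree of slide 16. The tree encoding below and the two rules handled are an illustrative subset of the slide 15 table, not a full matcher:

```python
def munch(tree, out, counter):
    """Tile `tree` greedily, emitting ILOC into `out`; returns result reg."""
    def new_reg():
        counter[0] += 1
        return f"r{counter[0]}"
    kind = tree[0]
    # Largest tile first: a dereference of (+ Num, subtree) matches the
    # loadAI template (rule 11 style), folding the constant into the load.
    if kind == "deref" and tree[1][0] == "+":
        left, right = tree[1][1], tree[1][2]
        if left[0] == "Num":
            rb = munch(right, out, counter)
            rd = new_reg()
            out.append(f"loadAI {rb}, {left[1]} => {rd}")
            return rd
    # Small tiles: a bare label or number loads via loadI (rules 6 and 8).
    if kind in ("Lab", "Num"):
        rd = new_reg()
        out.append(f"loadI {tree[1]} => {rd}")
        return rd
    raise ValueError(f"no tile matches {tree}")

code = []
munch(("deref", ("+", ("Num", 12), ("Lab", "@G"))), code, [0])
print(code)  # the two-instruction <6,11>-style tiling: loadI then loadAI
```

Because the big loadAI tile is tried before the small ones, the greedy pass picks a two-instruction tiling rather than one of the three- or four-instruction alternatives from slide 17.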

  19. IR AST for Simple Expression a = b - 2 × c • a: local var, offset 4 from arp • b: call-by-ref param, offset -16 from arp • c: var, offset 12 from label @G • Prefix form - same info as the IR tree: = + Val1 Num1 - ◆ ◆ + Val2 Num2 × Num3 ◆ + Lab1 Num4 Jim Hogg - UW - CSE - P501

  20. IR AST as Prefix Text String = + Val1 Num1 - ◆ ◆ + Val2 Num2 × Num3 ◆ + Lab1 Num4 No parentheses? - don't need them: evaluate this expression from right-to-left, using a simple stack machine. Rewrite Rules - another BNF, but including ambiguity!
  Production → ILOC Template
  5 ...
  6 Reg ← Lab1 → loadI l1 => rn
  7 Reg ← Val1
  8 Reg ← Num1 → loadI n1 => rn
  9 Reg ← ◆ Reg1 → load r1 => rn
  10 Reg ← ◆ + Reg1 Reg2 → loadAO r1, r2 => rn
  11 Reg ← ◆ + Reg1 Num2 → loadAI r1, n2 => rn
  14 Reg ← ◆ + Lab1 Reg2 → loadAI r2, l1 => rn
  15 Reg ← + Reg1 Reg2 → add r1, r2 => rn
  16 Reg ← + Reg1 Num2 → addI r1, n2 => rn
  19 Reg ← + Lab1 Reg2 → addI r2, l1 => rn
  20 ...
  Jim Hogg - UW - CSE - P501

  21. Select Instructions with LR Parser = + Val1 Num1 - ◆ ◆ + Val2 Num2 × Num3 ◆ + Lab1 Num4 (rewrite rules and ILOC templates as on slide 20) • Use an LR parser for the grammar to parse the prefix expression • Ambiguous, so lots of parse-action conflicts: resolve with tie-breaker rules: • Lowest cost (need to augment grammar productions with costs) • Favor large reductions over smaller ("maximal munch") • reduce-reduce conflict - choose the longer reduction • shift-reduce conflict - choose shift • => largest number of prefix ops translated into each machine instruction Jim Hogg - UW - CSE - P501

  22. Peephole Optimization • Originally devised as the last optimization pass of a compiler • Examine a small, sliding window of target code (a few adjacent instructions) and optimize Reminder (Cooper & Torczon "ILOC"): storeAI r1 => rarp, 8 means Memory[rarp + 8] = r1 • loadAI rarp, 8 => r15 means r15 = Memory[rarp + 8] Original: storeAI r1 => rarp, 8 • loadAI rarp, 8 => r15 Optimized: storeAI r1 => rarp, 8 • i2i r1 => r15 (ie: r15 = r1) Jim Hogg - UW - CSE - P501

  23. Peeps, 2 Original: addI r2, 0 => r7 • mult r4, r7 => r10 (r7 = r2 + 0; r10 = r4 * r7) Optimized: mult r4, r2 => r10 (r10 = r4 * r2) Original: jumpI L10 ... L10: jumpI L20 Optimized: jumpI L20 ... L10: jumpI L20 Jim Hogg - UW - CSE - P501
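A minimal sliding-window peephole pass for the store-then-load pattern of slide 22 might look like this in Python. The tuple encoding of ILOC instructions is an assumption made for the sketch:

```python
def peephole(code):
    """Two-instruction window: a store immediately followed by a load of
    the same address becomes a register copy (i2i)."""
    out, i = [], 0
    while i < len(code):
        a = code[i]
        b = code[i + 1] if i + 1 < len(code) else None
        # storeAI r1 => base, off  followed by  loadAI base, off => r2
        if (b is not None and a[0] == "storeAI" and b[0] == "loadAI"
                and a[2:] == b[1:3]):            # same base and offset
            out.append(a)                        # keep the store
            out.append(("i2i", a[1], b[3]))      # replace the load with a copy
            i += 2
        else:
            out.append(a)
            i += 1
    return out

code = [("storeAI", "r1", "rarp", 8), ("loadAI", "rarp", 8, "r15")]
print(peephole(code))
```

Each extra pattern (add of zero, jump to jump, ...) would be another window test of the same shape, which is why naive linear matching stops scaling as the pattern set grows - the motivation for the next slide.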

  24. Modern Peephole Optimizer • Modern ISAs are enormous • Linear search for a match is no longer fast enough • So . . . IR → Expander → LIR → Simplifier → LIR → Matcher → LIR • Like a miniature compiler • Used for both peeps and instruction-selection Jim Hogg - UW - CSE - P501

  25. Instruction Selection via Peeps Storage: a 4(arp) local variable • b -16(arp) call-by-reference parameter • c 12(@G) variable
  IR: t1 = 2 × c • a = b - t1
  LIR (14 instructions): r10 = 2 • r11 = @G • r12 = 12 • r13 = r11 + r12 • r14 = M[r13] • r15 = r10 * r14 • r16 = -16 • r17 = rarp + r16 • r18 = M[r17] • r19 = M[r18] • r20 = r19 - r15 • r21 = 4 • r22 = rarp + r21 • M[r22] = r20
  Simplified LIR (8 instructions): r10 = 2 • r11 = @G • r14 = M[r11 + 12] • r15 = r10 * r14 • r18 = M[rarp - 16] • r19 = M[r18] • r20 = r19 - r15 • M[rarp + 4] = r20
  ILOC: loadI 2 => r10 • loadI @G => r11 • loadAI r11, 12 => r14 • mult r10, r14 => r15 • loadAI rarp, -16 => r18 • load r18 => r19 • sub r19, r15 => r20 • storeAI r20 => rarp, 4
  Jim Hogg - UW - CSE - P501

  26. The Simplifier slides a 3-instruction window over the LIR, doing constant propagation and dead-code elimination, then rolls forward. Eg, the early windows of slide 25's LIR:
  r10 = 2 • r11 = @G • r12 = 12 → roll
  r11 = @G • r12 = 12 • r13 = r11 + r12 → const prop: r11 = @G • r13 = r11 + 12
  r13 = r11 + 12 • r14 = M[r13] • r15 = r10 * r14 → fold address: r14 = M[r11 + 12] • r15 = r10 * r14
  ... and similarly r17 = rarp + (-16) folds into r18 = M[rarp - 16], and r22 = rarp + 4 folds into M[rarp + 4] = r20.
  This example is simplified - it never re-uses a virtual register, so we can delete an instruction after a constant propagation. In general, we would have to generate liveness info for each virtual register to perform safe DCE. Jim Hogg - UW - CSE - P501
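The constant-propagation-plus-DCE step can be sketched as below, under the slide's simplifying assumption that each virtual register is defined once and never reused. The tuple encoding and the single add-folding rewrite handled are illustrative, not a full Simplifier:

```python
def simplify(code):
    """One constant-propagation + DCE pass. code: (dest, op, *args) tuples."""
    consts = {}                 # virtual reg -> known constant value
    out = []
    for dest, op, *args in code:
        args = [consts.get(a, a) for a in args]   # propagate known constants
        if op == "loadI":
            # Record the constant and drop the instruction entirely; this is
            # only safe because (per the slide) each vreg is never re-used.
            consts[dest] = args[0]
        elif op == "add" and isinstance(args[1], int):
            out.append((dest, "addI", args[0], args[1]))  # fold to immediate
        else:
            out.append((dest, op, *args))
    return out

code = [("r12", "loadI", 12),
        ("r13", "add", "r11", "r12"),
        ("r14", "load", "r13")]
print(simplify(code))  # [('r13', 'addI', 'r11', 12), ('r14', 'load', 'r13')]
```

With real code, `consts` entries could only be discarded once liveness shows the defining loadI is dead, matching the slide's caveat.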

  27. Next • Instruction Scheduling • Register Allocation • And more…. Jim Hogg - UW - CSE - P501
