Code Generation CS 480
Can be complex • To do a good job of teaching about code generation I could easily spend ten weeks • But, don’t have ten weeks, so I’ll do it in one lecture • (Obviously, I’ll omit a lot of material)
Remember Reverse Polish Notation • Remember RPN? • To evaluate y * (x + 4) Push Y on stack Push x on stack Push 4 on stack Do addition Do multiplication
Code for (x + 12) * 7 Push fp Push -8 Add Get value (i.e., address is on stack, get value) Push 12 Add Push 7 multiply
Addressing modes • Early discovery, some patterns, such as adding fp + constant, occur really often. Why not make then part of instruction • Common address modes Absolute address (i.e., global) #Constant (small integer values) Register – contents of register (reg) – value at register as an address Con(reg) – value at c plus register (e.g., 8(fp) )
More unusual addressing modes • PDP had a number of unusual addressing modes -(reg) use register, then decrement (reg)+ use register, then increment Look familiar? Could be used to build a great stack system
Code for (x + 12) * 7, again Move -8(fp), -(sp) Move #12, -(sp) Add (sp)+, (sp) Move #7, -(sp) Mult (sp)+, (sp) Still easy to do, but still lots of memory access
Generating stack-style code is easy • We can use the AR stack just like the stack in a RPN calculator • Simply do a post-order traversal of the AST • Operands (variables, constants) push on stack • Operators generate code to take arguments from stack, compute the result, and push back on stack • Simple. This is what you are doing in prog 6
Then what’s the problem? • Remember the von Neumann machine design? • CPU is separated from memory by a thin slow wire, which is costly to traverse • Every time you do a memory access things slow down
Memory access in stack code generation • One memory access to get instruction • One or two memory access to get arguments • Then you can do the operation • One more memory access to write result back to memory • That’s a lot of memory access
A few things you can do to speed up • There are a few things that you can do with hardware to speed this up just a bit • Caching – (speeds instruction fetch, operand fetch not so much) • Pipelines – (but results coming from memory cause frequent stalls)
Can be done much faster Move -8(fp), r1 Add #12, r1 Mult #7, r1 Registers live on-chip, allow much faster access, code will run much faster.
Registers can be used to hold intermediate results • Example (a + b) * (c + d) Move a, r1 Add b, r1 Move c, r2 Add d, r2 Mult r2, r1 But notice we need to use two registers
No matter now many you have • Can always come up with an expression that will be more complicated than the number of registers you have (a + b) * (c + d) * (e + f) * …. But are such things common? Still. Need to handle it, called a register spill. Solution: Save temp to memory, load when you need it (basically going back to stack style).
Making effective use of addressing modes • Addressing modes give rise to two classes of arguments • Those that can be represented as addressing modes (local variables, constants), • And those that can not (expressions) • Need to make effective use of those
Four cases for addition X + y (that is, two addressing mode expressions) move x, r1 add y, r1 X + (…) (that is, one addressing mode expression) compute (…) and place into r1 add x, r1 (…) + x compute (…) and place into r1 add x, r1 (…) + (….) (that is, neither side addressing mode) compute left (…) and place into r1 compute right (…) and place into r2 add r2, r1 (and now r2 is free)
What about subtraction? Non commutative X - y (that is, two addressing mode expressions) move x, r1 sub y, r1 X - (…) (that is, one addressing mode expression) compute (…) and place into r1 sub x, r1 (NO! this is the inverse of the result we want) (…) - x compute (…) and place into r1 add x, r1 (…) - (….) (that is, neither side addressing mode) compute left (…) and place into r1 compute right (…) and place into r2 sub r2, r1 (and now r2 is free)
Choices on subtraction • Use two registers, or • Use one register and create an invert instruction • Depends upon if you have enough registers • Lots of special cases (division, multiplication, shifts).
But registers have many other uses • Can use registers to hold variables that are being accessed a lot (such as loop index values) • Can use registers for passing parameters (save memory access) • So a good machine design would have lots of registers, right? Not so fast
Using registers for variables • Can make your code dramatically faster • But need to discover WHICH variables are commonly used – complicated data flow analysis • Or (as C does), allow the user to give you hints (which are frequently wrong)
Using registers for parameters • If the callee can use the values directly in registers, then the code can be much faster • If not, when they they were going to save register values anyway, nothing is lost • But callee might not know if they want parameter is memory or in register
Cost to save an restore • Remember the parameter calling sequence? • The more registers you have, the more you need to save and restore on procedure entry and exit • So there is a trade-off – faster function invocation, slower execution of body of function, or faster execution, slower invocation
Worse yet – don’t know ahead of time • don’t know until you have seen entire function body how many registers you will need • Save them all – too much execution time • Don’t save enough – run out of registers • (old C trick, branch to end of function, and generate saves after you have seen everything else).
Interesting idea Register Window • Interesting idea. Put a huge amount of registers (say, 256) on-chip, but only allow process to see a few (say, 16) at time • When you execute a function, simply move up window on registers • Saves and restores are automatic, make function invocation faster
Even better, overlapping windows • What if six of those registers are shared with caller, and ten are new? • What might you use those six for? • Can make for really function invocation
Downside of register windows • Eventually you run out, and they must be spilled as well (and restored) • Context switch is now really slow, as you need to save an restore ALL 256 registers • Still an interesting idea
How is it really done? • You can spend a lot of time worrying about things that hardly ever happen (expressions with 32 temporary values in the middle, functions with 17 arguments) • And programming styles change (OOP has smaller functions, but fewer arguments) • So there is never a clear answer, and it keeps changing