Dynamic Binary Optimization

Dynamic Binary Optimization Kim Sung-moo

Contents
• Optimization Framework
• Code Reordering
• Code Optimizations
• Same-ISA Optimization Systems

Optimization Framework
[Figure: basic blocks A, B, C flow from original source code, through an intermediate form, to optimized target code with compensation code (Comp) added.]
• Collect basic blocks using profile information
• Convert to intermediate form; place in buffer
• Schedule and optimize
• Generate target code
• Add compensation code; place in code cache

Optimization of Trace & Superblock (detailed in Section 4.5.1)
[Figure: a trace and a superblock, each shown before and after optimization, with compensation code blocks added.]

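Caution during Optimization
• After optimization, the program's state at a trap may no longer match what the original code would have produced. (detailed in Section 4.6.2)

Example 1: removing a dead assignment
  Source              Target
  r4 ← r6 + 1         r4 ← r6 + 1
  r1 ← r2 + r3        r1 ← r4 + r5
  r1 ← r4 + r5        r6 ← r1 * r7
  r6 ← r1 * r7
  The dead assignment r1 ← r2 + r3 is removed. If it would have trapped in the source, the target no longer traps.

Example 2: rescheduling
  Source              Target (rescheduled)    Target with saved reg. state
  r1 ← r2 + r3        r1 ← r2 + r3            r1 ← r2 + r3
  r9 ← r1 + r5        r6 ← r1 * r7            s1 ← r1 * r7
  r6 ← r1 * r7        r9 ← r1 + r5            r9 ← r1 + r5
  r3 ← r6 + 1         r3 ← r6 + 1             r6 ← s1
                                              r3 ← r6 + 1
  If r9 ← r1 + r5 traps, the rescheduled target has already overwritten r6. Solution: compute the early result into s1 and copy it to r6 only after the potentially trapping instruction.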
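The rescheduling example can be checked with a small simulation. This is a minimal sketch, not the presenter's code: the `run` helper, the `(dest, fn)` instruction encoding, and the initial register values are all hypothetical, and the instruction writing r9 is simply assumed to be the one that traps.

```python
# Sketch: register state a trap handler would see under three schedules.
# run(), the (dest, fn) encoding and the initial values are illustrative assumptions.

def run(program, regs, trapping_dest):
    """Execute (dest, fn) pairs until the instruction writing trapping_dest traps."""
    regs = dict(regs)
    for dest, fn in program:
        if dest == trapping_dest:
            return regs          # state observed by the trap handler
        regs[dest] = fn(regs)
    return regs

add = ("r1", lambda r: r["r2"] + r["r3"])
use = ("r9", lambda r: r["r1"] + r["r5"])   # assumed to trap
mul = ("r6", lambda r: r["r1"] * r["r7"])
last = ("r3", lambda r: r["r6"] + 1)

source = [add, use, mul, last]
rescheduled = [add, mul, use, last]
saved_state = [add, ("s1", mul[1]), use, ("r6", lambda r: r["s1"]), last]

init = {"r1": 0, "r2": 2, "r3": 3, "r5": 5, "r6": 60, "r7": 7, "r9": 0}

def architected(regs):
    """Drop the scratch register s1, which the source program cannot see."""
    return {k: v for k, v in regs.items() if k != "s1"}

expected = run(source, init, "r9")
print(architected(run(rescheduled, init, "r9")) == expected)  # False: r6 already overwritten
print(architected(run(saved_state, init, "r9")) == expected)  # True
```

Holding the product in s1 costs one extra copy on the non-trapping path, which is the price of keeping r6 recoverable.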

Consistent Register Mapping
• When one superblock branches to another, or jumps to the interpreter, the source-to-target register mapping must be correctly managed.
• Solution: Section 4.6.3
[Figure: Superblocks A, B, and C and the interpreter, labeled with the mappings R1 ↔ r1, R2 ↔ r1, and R3 ↔ r1.]

Code Reordering
• Basis of all optimizations
• Easy to understand
• Deep relation to pipelining

Primitive Instruction Reordering
• reg: an instruction that produces a register result
  • load, ALU, shift
  • easily undone
• mem: an instruction that places a value in memory
  • store
  • not easily undone
• br: a branch instruction
• join: a point where a branch target enters the code sequence

Move instr. around branch
[Figure: reg and mem instructions moved across a br, each leaving a compensation copy (reg (compensation), mem (compensation)).]

  Before                    After
  R1 ← mem(R6)              R1 ← mem(R6)
  R2 ← mem(R6+4)            R2 ← mem(R6+4)
  R3 ← R1 + 1               R3 ← R1 + 1
  R4 ← R1 << 2              br exit if R7 == 0
  br exit if R7 == 0        R4 ← R1 << 2
  R7 ← R7 + 1               R7 ← R7 + 1
  mem(R6) ← R3              mem(R6) ← R3

  Compensation code on the exit path: R4 ← R1 << 2

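Move instr. from below to above branch
[Figure: a reg instruction moved above a br writes a temporary (reg(T)), followed by a copy R ← T below the branch. A mem instruction cannot be moved this way: the previous memory value is unrecoverable.]

  Before                    After the move            After removing the copy
  R2 ← R1 << 2              R2 ← R1 << 2              R2 ← R1 << 2
  br exit if R8 == 0        T1 ← R7 * R2              T1 ← R7 * R2
  R6 ← R7 * R2              br exit if R8 == 0        br exit if R8 == 0
  mem(R6) ← R3              R6 ← T1                   mem(T1) ← R3
  R6 ← R2 + 2               mem(T1) ← R3              R6 ← R2 + 2
                            R6 ← R2 + 2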

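Move instr. above join point
[Figure: reg and mem instructions moved above a join point, each leaving a compensation copy (reg (compensation), mem (compensation)) on the joining path.]

  Before                    After
  R1 ← R1 + 1               R1 ← R1 + 1
  (join point)              R7 ← mem(R6)
  R7 ← mem(R6)              (join point)
  R7 ← R7 + 1               R7 ← R7 + 1

  Compensation code on the joining path: R7 ← mem(R6)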

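Move instr. in straight-line
[Figure: a reg instruction moved above another reg or mem instruction writes a temporary (reg(T)), with the copy R ← T left at its original position.]

  Before                    After
  R1 ← R1 * 3               R1 ← R1 * 3
  mem(R6) ← R1              T1 ← R7 << 3
  R7 ← R7 << 3              mem(R6) ← R1
  R9 ← R7 + R2              R7 ← T1
                            R9 ← R7 + R2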

Implementing a Scheduling Algorithm
• Translate to Single-Assignment Form
• Form Register Map
• Reorder Code
• Determine Checkpoints
• Assign Registers
• Add Compensation Code

1. Translate to Single-Assignment Form
• Single-Assignment Form → A register is assigned a new value only once.

  Original Source Code        Single-Assignment Form
  add %eax, %ebx              t5 ← r1 + r2, set CR0
  bz L1                       bz CR0, L1
  mov %ebx, 4(%eax)           t6 ← mem(t5 + 4)
  mul %ebx, 10                t7 ← t6 * 10
  add %ebx, 1                 t8 ← t7 + 1
  add %ecx, 1                 t9 ← r3 + 1, set CR0
  bz L2                       bz CR0, L2
  add %ebx, %eax              t10 ← t8 + t5
  br L3                       b L3

2. Form Register Map
• Register map (RMAP) → makes it possible to track the values as assigned in the original source code.

  Single-Assignment Form       Register Map (RMAP)
                               eax   ebx   ecx   edx
  t5 ← r1 + r2, set CR0        t5    r2    r3    r4
  bz CR0, L1                   t5    r2    r3    r4
  t6 ← mem(t5 + 4)             t5    t6    r3    r4
  t7 ← t6 * 10                 t5    t7    r3    r4
  t8 ← t7 + 1                  t5    t8    r3    r4
  t9 ← r3 + 1, set CR0         t5    t8    t9    r4
  bz CR0, L2                   t5    t8    t9    r4
  t10 ← t8 + t5                t5    t10   t9    r4
  b L3                         t5    t10   t9    r4
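Steps 1 and 2 can be sketched together: each write gets a fresh temporary, and the register map is snapshotted after every instruction. This is an illustrative sketch only. The `(dest, template, sources)` encoding and the function name are assumptions, branches are omitted because they write no general register, and the initial mapping eax→r1, ebx→r2, ecx→r3, edx→r4 is read off the slide's RMAP.

```python
def to_single_assignment(instrs, reg_map, first_temp=5):
    """Rename every written register to a fresh temporary t<n>.

    instrs:  list of (dest, template, sources) using source-ISA register names
    reg_map: initial source-register -> target-register mapping
    Returns the renamed code and the RMAP row after each instruction.
    """
    current = dict(reg_map)
    code, rmap = [], []
    n = first_temp
    for dest, template, srcs in instrs:
        # Read operands under the current mapping before redefining dest.
        operands = [current[s] for s in srcs]
        temp = f"t{n}"
        n += 1
        current[dest] = temp
        code.append(f"{temp} <- {template.format(*operands)}")
        rmap.append(tuple(current[r] for r in reg_map))
    return code, rmap

# The slide's example without its three branches.
instrs = [
    ("eax", "{0} + {1}", ["eax", "ebx"]),    # add %eax, %ebx
    ("ebx", "mem({0} + 4)", ["eax"]),        # mov %ebx, 4(%eax)
    ("ebx", "{0} * 10", ["ebx"]),            # mul %ebx, 10
    ("ebx", "{0} + 1", ["ebx"]),             # add %ebx, 1
    ("ecx", "{0} + 1", ["ecx"]),             # add %ecx, 1
    ("ebx", "{0} + {1}", ["ebx", "eax"]),    # add %ebx, %eax
]
reg_map = {"eax": "r1", "ebx": "r2", "ecx": "r3", "edx": "r4"}

code, rmap = to_single_assignment(instrs, reg_map)
for line, row in zip(code, rmap):
    print(f"{line:22} {row}")
```

The last RMAP row is `('t5', 't10', 't9', 'r4')`, matching the final row of the slide's table.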

3. Reorder Code
• Goal: run the code more efficiently, e.g. by reducing pipeline stalls.

  Before Scheduling               After Scheduling
  a: t5 ← r1 + r2, set CR0        a: t5 ← r1 + r2, set CR0
  b: bz CR0, L1                   c: t6 ← mem(t5 + 4)
  c: t6 ← mem(t5 + 4)             b: bz CR0, L1
  d: t7 ← t6 * 10                 d: t7 ← t6 * 10
  e: t8 ← t7 + 1                  f: t9 ← r3 + 1, set CR0
  f: t9 ← r3 + 1, set CR0         g: bz CR0, L2
  g: bz CR0, L2                   e: t8 ← t7 + 1
  h: t10 ← t8 + t5                h: t10 ← t8 + t5
  i: b L3                         i: b L3

  Register Map (RMAP), in scheduled order
  instr.   eax   ebx   ecx   edx
  a        t5    r2    r3    r4
  c        t5    t6    r3    r4
  b        t5    r2    r3    r4
  d        t5    t7    r3    r4
  f        t5    t8    t9    r4
  g        t5    t8    t9    r4
  e        t5    t8    r3    r4
  h        t5    t10   t9    r4
  i        t5    t10   t9    r4

  At L2: t8 ← t7 + 1 (e now executes after g, so the L2 exit path needs its own copy)

4. Determine Checkpoints
• Commit: an instruction is committed once all instructions that precede it in the original code have completed.
• Checkpoint: the most recently committed instruction.
• If an instruction traps, its checkpoint is the backup point. → precise state recovery.

  After Scheduling              Register Map (RMAP)       Commit    Checkpoint
                                eax   ebx   ecx   edx
  a: t5 ← r1 + r2, set CR0      t5    r2    r3    r4      a         @
  c: t6 ← mem(t5 + 4)           t5    t6    r3    r4                a
  b: bz CR0, L1                 t5    r2    r3    r4      b,c       a
  d: t7 ← t6 * 10               t5    t7    r3    r4      d         c
  f: t9 ← r3 + 1, set CR0       t5    t8    t9    r4                d
  g: bz CR0, L2                 t5    t8    t9    r4                d
  e: t8 ← t7 + 1                t5    t8    r3    r4      e,f,g     d
  h: t10 ← t8 + t5              t5    t10   t9    r4      h         g
  i: b L3                       t5    t10   t9    r4      i         h
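The Commit and Checkpoint columns follow mechanically from the original and scheduled orders. The sketch below is one way to compute them under the definitions on this slide; the function name and the use of "@" for the initial state are assumptions carried over from the table, not code from the presentation.

```python
def checkpoints(original, scheduled):
    """For each scheduled instruction, return (instr, newly_committed, checkpoint).

    An instruction commits once it and every instruction before it in the
    original order have completed. The checkpoint of an instruction is the
    last instruction committed before it executes ("@" = initial state).
    """
    done = set()
    committed = 0            # number of original instructions committed so far
    rows = []
    for instr in scheduled:
        checkpoint = original[committed - 1] if committed else "@"
        done.add(instr)
        newly = []
        # Advance the commit point over the completed prefix of the original order.
        while committed < len(original) and original[committed] in done:
            newly.append(original[committed])
            committed += 1
        rows.append((instr, newly, checkpoint))
    return rows

rows = checkpoints(list("abcdefghi"), list("acbdfgehi"))
for instr, newly, checkpoint in rows:
    print(f"{instr}: commit {','.join(newly) or '-':6} checkpoint {checkpoint}")
```

For the slide's schedule this reproduces the table: c and f commit nothing when they complete, b commits b and c together, and e commits e, f, and g.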

5. Assign Registers
• “X”: marks where a live range has been extended.
  • because of a branch or trap
[Figure: register live ranges.]

  After Assignment              Register Map (RMAP)
                                eax   ebx   ecx   edx
  a: r1 ← r1 + r2, set CR0      r1    r2    r3    r4
  c: r5 ← mem(r1 + 4)           r1    r5    r3    r4
  b: bz CR0, L1                 r1    r2    r3    r4
  d: r2 ← r5 * 10               r1    r2    r3    r4
  f: r5 ← r3 + 1, set CR0       r1    r2    r5    r4
  g: bz CR0, L2                 r1    r2    r5    r4
  e: r2 ← r2 + 1                r1    r2    r3    r4
  h: r2 ← r2 + r1               r1    r2    r5    r4
  i: b L3                       r1    r2    r5    r4

6. Add Compensation Code
• Consider compensation code and consistent register mapping.

  After Assignment              Compensation Code Added       PowerPC Code
  a: r1 ← r1 + r2, set CR0      a: r1 ← r1 + r2, set CR0      a: add.   r1, r1, r2
  c: r5 ← mem(r1 + 4)           c: r5 ← mem(r1 + 4)           c: lwz    r5, 4(r1)
  b: bz CR0, L1                 b: bz CR0, L1                 b: beq    CR0, L1
  d: r2 ← r5 * 10               d: r2 ← r5 * 10               d: muli   r2, r5, 10
  f: r5 ← r3 + 1, set CR0       f: r5 ← r3 + 1, set CR0       f: addic. r5, r3, 1
  g: bz CR0, L2                 g: bz CR0, L2’                g: beq    CR0, L2’
  e: r2 ← r2 + 1                e: r2 ← r2 + 1                e: addi   r2, r2, 1
  h: r2 ← r2 + r1               h: r2 ← r2 + r1               h: add    r2, r2, r1
  i: b L3                       i: b L3                       i: b      L3

  Register copy on the path to L3: r3 ← r5 (PowerPC: mr r3, r5)
  Compensation block at L2’: r3 ← r5, then r2 ← r2 + 1 (PowerPC: mr r3, r5, then addi r2, r2, 1)

5a. Assign Registers with Condition Codes
• “y”: marks where a live range must be extended.
• The condition code must be materialized.
[Figure: register live ranges.]

  After Assignment              Register Map (RMAP)
                                eax   ebx   ecx   edx
  a: r6 ← r1 + r2, set CR0      r6    r2    r3    r4
  c: r5 ← mem(r6 + 4)           r6    r5    r3    r4
  b: bz CR0, L1                 r6    r2    r3    r4
  d: r2 ← r5 * 10               r6    r2    r3    r4
  f: r5 ← r3 + 1, set CR0       r6    r2    r5    r4
  g: bz CR0, L2                 r6    r2    r5    r4
  e: r2 ← r2 + 1                r6    r2    r3    r4
  h: r2 ← r2 + r6               r6    r2    r5    r4
  i: b L3                       r6    r2    r5    r4
     r1 ← r6                    r1    r2    r5    r4

Superblocks vs Traces

Basic Optimizations
• Constant Propagation & Constant Folding
• Strength Reduction
• Code Sinking
• Dead-Assignment Elimination
• Copy Propagation
• Common-Subexpression Elimination (CSE)
• Hoisting a loop-invariant expression out of a loop (Loop-Invariant Code Motion)

Basic Optimizations (1)

  Original           Constant           Constant          Constant           Strength
                     Propagation        Folding           Propagation        Reduction
  R1 ← 6             R1 ← 6             R1 ← 6            R1 ← 6             R1 ← 6
  R5 ← R1 + 2        R5 ← 6 + 2         R5 ← 8            R5 ← 8             R5 ← 8
  R6 ← R7 * R5       R6 ← R7 * R5       R6 ← R7 * R5      R6 ← R7 * 8        R6 ← R7 << 3
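The chain on this slide can be reproduced with a single forward pass over a basic block. The following is a minimal sketch under stated assumptions: the `(dest, op, a, b)` tuple encoding and the `optimize` name are hypothetical, only `+`, `*`, and `mov` are handled, and the multiply is assumed to be a non-trapping integer multiply so that replacing it with a shift is safe.

```python
def optimize(block):
    """Constant propagation, constant folding and strength reduction in one pass.

    block: list of (dest, op, a, b); operands are register names or ints.
    """
    consts = {}                      # register -> known constant value
    out = []
    for dest, op, a, b in block:
        # Constant propagation: replace registers with known constant values.
        a = consts.get(a, a)
        b = consts.get(b, b)
        if op == "mov" and isinstance(a, int):
            consts[dest] = a
        elif op in ("+", "*") and isinstance(a, int) and isinstance(b, int):
            # Constant folding: evaluate at translation time.
            a = a + b if op == "+" else a * b
            op, b = "mov", None
            consts[dest] = a
        else:
            consts.pop(dest, None)   # dest no longer holds a known constant
            if op == "*" and isinstance(b, int) and b > 0 and (b & (b - 1)) == 0:
                # Strength reduction: multiply by 2**k becomes a left shift by k.
                op, b = "<<", b.bit_length() - 1
        out.append((dest, op, a, b))
    return out

block = [
    ("R1", "mov", 6, None),
    ("R5", "+", "R1", 2),
    ("R6", "*", "R7", "R5"),
]
for instr in optimize(block):
    print(instr)
# ('R1', 'mov', 6, None)
# ('R5', 'mov', 8, None)
# ('R6', '<<', 'R7', 3)
```

A single pass suffices here because the block is straight-line code; a join point would invalidate the `consts` table, which is the limitation the next slide illustrates.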

Basic Optimizations (2)
[Figure: two paths, one ending in R1 ← 28 and the other in R1 ← 6, join before the code below.]

  R5 ← R1 + 2
  R6 ← R7 * R5

• A join point inhibits code optimization.

Basic Optimizations (3)

  Original                    After code sinking            After removal
  R1 ← 1                      R1 ← 1                        R1 ← 1
  R3 ← R3 + R2  (1)           Br L1 if R7 != 0              Br L1 if R7 != 0
  Br L1 if R7 != 0            R3 ← R3 + R2  (2)             R3 ← R7 + 1
  R3 ← R7 + 1                 R3 ← R7 + 1
  L1: R3 ← R3 + 1             L1: R3 ← R3 + R2              L1: R3 ← R3 + R2
                                  R3 ← R3 + 1                   R3 ← R3 + 1

  (1) partially dead (code sinking)
  (2) fully dead (can be removed)

Basic Optimizations (4)

  Original              Copy Propagation      Dead-Assignment Elimination
  R1 ← R2 + R3          R1 ← R2 + R3          R1 ← R2 + R3
  R4 ← R1               R4 ← R1               R5 ← R5 * R1
  R5 ← R5 * R4          R5 ← R5 * R1          ∙ ∙ ∙
  ∙ ∙ ∙                 ∙ ∙ ∙                 R4 ← R7 + R8
  R4 ← R7 + R8          R4 ← R7 + R8

  (R4 ← R1 can be removed because R4 no longer appears on a right-hand side before it is reassigned.)
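Both steps on this slide fit in a few lines for a single straight-line block. This is a hedged sketch: the `(dest, op, srcs)` encoding, the function names, and the `live_out` set are assumptions, the elided `∙ ∙ ∙` instructions are left out, and every instruction is assumed to be free of side effects and traps so that deleting it is safe.

```python
def copy_propagate(block):
    """Replace uses of a copied register with the register it was copied from."""
    copies = {}                                   # dest -> source of an active copy
    out = []
    for dest, op, srcs in block:
        srcs = tuple(copies.get(s, s) for s in srcs)
        # Writing dest invalidates any copy that reads or defines it.
        copies = {d: s for d, s in copies.items() if d != dest and s != dest}
        if op == "copy" and srcs[0] != dest:
            copies[dest] = srcs[0]
        out.append((dest, op, srcs))
    return out

def eliminate_dead_assignments(block, live_out):
    """Backward pass that drops assignments whose result is never read."""
    live = set(live_out)
    kept = []
    for dest, op, srcs in reversed(block):
        if dest not in live:
            continue                              # dead: overwritten or unused later
        live.discard(dest)
        live.update(srcs)
        kept.append((dest, op, srcs))
    kept.reverse()
    return kept

block = [
    ("R1", "+", ("R2", "R3")),
    ("R4", "copy", ("R1",)),
    ("R5", "*", ("R5", "R4")),
    ("R4", "+", ("R7", "R8")),
]
propagated = copy_propagate(block)
result = eliminate_dead_assignments(propagated, live_out={"R1", "R4", "R5"})
for instr in result:
    print(instr)
# ('R1', '+', ('R2', 'R3'))
# ('R5', '*', ('R5', 'R1'))
# ('R4', '+', ('R7', 'R8'))
```

Copy propagation does not remove anything by itself; it only makes `R4 ← R1` dead so that the second pass can delete it.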

Basic Optimizations (5)

  Original              Copy Propagation      Common-Subexpression Elimination (CSE)
  R1 ← R2 + R3          R1 ← R2 + R3          R1 ← R2 + R3
  R5 ← R2               R5 ← R2               R5 ← R2
  R6 ← R5 + R3          R6 ← R2 + R3          R6 ← R1
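The CSE step can be sketched as a table of available expressions over one basic block. The encoding and the `cse` name below are illustrative assumptions; the input is the block after copy propagation, and operand order is compared literally, so `R2 + R3` and `R3 + R2` are not recognized as equal.

```python
def cse(block):
    """Local common-subexpression elimination over (dest, op, srcs) tuples."""
    available = {}                   # (op, srcs) -> register holding that value
    out = []
    for dest, op, srcs in block:
        key = (op, srcs)
        if op != "copy" and key in available:
            # Same expression already computed: reuse the earlier result.
            op, srcs = "copy", (available[key],)
        # Writing dest invalidates expressions that read dest or live in dest.
        available = {
            k: reg for k, reg in available.items()
            if reg != dest and dest not in k[1]
        }
        if op != "copy" and dest not in srcs:
            available[(op, srcs)] = dest
        out.append((dest, op, srcs))
    return out

# The slide's block after copy propagation.
block = [
    ("R1", "+", ("R2", "R3")),
    ("R5", "copy", ("R2",)),
    ("R6", "+", ("R2", "R3")),
]
for instr in cse(block):
    print(instr)
# ('R1', '+', ('R2', 'R3'))
# ('R5', 'copy', ('R2',))
# ('R6', 'copy', ('R1',))
```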

Basic Optimizations (6)
• Hoisting a loop-invariant expression out of a loop (Loop-Invariant Code Motion)

  Before                        After
  L1: R1 ← R2 + R3                  R1 ← R2 + R3
      mem(R4) ← R1              L1: mem(R4) ← R1
      R4 ← R4 + 4                   R4 ← R4 + 4
      ∙ ∙ ∙                         ∙ ∙ ∙
      br L1 if R7 != 0              br L1 if R7 != 0

Compatibility Issues
• Whether an optimization is safe or unsafe matters a great deal.
• Usually safe optimizations
  • do not remove trapping instructions
  • e.g. copy propagation, constant propagation, constant folding
• Optimizations that require more care
  • can remove trapping instructions
  • e.g. dead-assignment elimination, loop-invariant code motion, strength reduction

Inter-superblock Optimizations
• We want additional optimizations across superblock boundaries.
• Solutions
  • Use tree groups: Section 4.3.5
  • Remove some of the register copies at exit points
  • Use epilog & prolog side tables

  Superblock 1                Superblock 2
  r2 ← r7                     L1: r2 ← r3 + 2
  ∙                               ∙
  br L1 if r4==0                  ∙
  ∙
  r2 ← r1 + 2

  r2 ← r7 needs to be eliminated: r2 does not appear on a right-hand side after it.

Epilog & Prolog Side Table

                                        r1   r2   r3   ……   rn
  Epilog side table (Superblock 1)      0    1    1    ……   0
  Prolog side table (Superblock 2)      0    1    0    ……   1
  ------------------------------------------------------------
  AND                                   0    1    0    ……   0

• Register r2 is dead along the path from superblock 1 to superblock 2.
• Instruction r2 ← r7 in superblock 1 can be removed.

Epilog & Prolog Side Table
• Epilog side table
  • consulted when a superblock is exited
  • keeps a mask indicating the dead registers
• Prolog side table
  • consulted when a superblock is entered
  • keeps a mask indicating the registers that are written before being read (first encountered on a left-hand side)
• When two superblocks are linked, AND the bit masks.
• Any bit that remains set → dead register.
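The linking check is a single bitwise AND. The sketch below assumes a hypothetical four-register machine (r1 to r4) and helper names of my own choosing; the mask values mirror the columns that are visible on the previous slide, with r4 standing in for rn.

```python
REGS = ["r1", "r2", "r3", "r4"]      # hypothetical register file, bit i = REGS[i]

def to_mask(names):
    """Encode a set of register names as a bit mask."""
    return sum(1 << REGS.index(name) for name in names)

def from_mask(mask):
    """Decode a bit mask back into register names."""
    return [reg for i, reg in enumerate(REGS) if (mask >> i) & 1]

def dead_on_link(epilog_mask, prolog_mask):
    """Registers dead along the path from the exiting to the entered superblock."""
    return from_mask(epilog_mask & prolog_mask)

epilog = to_mask(["r2", "r3"])       # superblock 1: dead-register mask at the exit
prolog = to_mask(["r2", "r4"])       # superblock 2: written before being read
print(dead_on_link(epilog, prolog))  # ['r2']
```

Keeping the tables as masks means the check costs one AND per link, which matters because linking happens at run time.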

Instruction-Set-Specific Optimizations
• Why? Each instruction set has its own features.
• Example 1: an ISA that requires aligned accesses touches unaligned data.
  • Default path: the access invokes a trap, and the trap handler emulates it with multiple instructions. This is extremely slow.
  • Optimization: use an inlined multi-instruction sequence instead.
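The slide does not show the inlined sequence, so the following is only one plausible shape for it, modeled in Python: two aligned loads, two shifts, and an OR. The 32-bit word size, little-endian byte order, and both function names are assumptions made for illustration.

```python
def aligned_load(mem: bytes, addr: int) -> int:
    """Model of an aligned 32-bit little-endian load; misalignment is an error."""
    assert addr % 4 == 0, "aligned access required"
    return int.from_bytes(mem[addr:addr + 4], "little")

def unaligned_load(mem: bytes, addr: int) -> int:
    """Inline replacement for a trapping unaligned load, built from aligned loads."""
    base = addr & ~3                 # aligned word containing the first byte
    shift = (addr & 3) * 8           # bit offset of the value inside that word
    low = aligned_load(mem, base)
    if shift == 0:
        return low                   # already aligned, one load is enough
    high = aligned_load(mem, base + 4)
    # Take the upper bytes of the first word and the lower bytes of the next.
    return ((low >> shift) | (high << (32 - shift))) & 0xFFFFFFFF

mem = bytes(range(16))
print(hex(unaligned_load(mem, 1)))   # 0x4030201
```

The sequence always executes a handful of simple instructions, whereas the trap path pays for a handler invocation on every unaligned access.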

If-conversion (1)
• Example 2: if-conversion.
• An instruction set can be enhanced by adding new instructions.
• If a conditional move instruction (cmovgt) is added, the hammock region can be removed.

  if (r4 > 0) then
      r5 = r5 + 1
  else
      r5 = r5 - 1;
  r6 = r6 + r5

[Figure: control-flow graph of the same code, with the then/else blocks marked as the hammock region.]

If-conversion (2)
• Assembly code with branch

        cmpi   cr0, r4, 0       ;compare r4 with zero
        ble    cr0, skip        ;branch to skip if r4<=0
        addi   r5, r5, 1        ;add 1 to r5
        b      next             ;branch to next
  skip: subi   r5, r5, 1        ;sub 1 from r5
  next: add    r6, r6, r5       ;accumulate r5 values in r6

• Assembly code after if-conversion

        cmpi   cr0, r4, 0       ;compare r4 with zero
        addi   r30, r5, 1       ;add 1 to r5
        subi   r5, r5, 1        ;sub 1 from r5
        cmovgt r5, r30, cr0     ;conditional move r30 to r5 if r4>0
        add    r6, r6, r5       ;accumulate r5 values in r6

• The hammock region is removed.
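That the if-converted sequence computes the same result as the source-level hammock can be checked by modeling both in Python. This is a sketch for illustration: `cmovgt` here is a plain function standing in for the hypothetical conditional-move instruction, and the function names are my own.

```python
def with_branch(r4, r5, r6):
    """Source-level hammock: one of two arms executes."""
    if r4 > 0:
        r5 = r5 + 1
    else:
        r5 = r5 - 1
    return r5, r6 + r5

def cmovgt(dest, src, greater_than):
    """Model of the conditional move: the result is src if the flag is set, else dest."""
    return src if greater_than else dest

def if_converted(r4, r5, r6):
    """Straight-line version: both arms execute, cmovgt selects the result."""
    gt = r4 > 0                      # cmpi   cr0, r4, 0
    r30 = r5 + 1                     # addi   r30, r5, 1
    r5 = r5 - 1                      # subi   r5, r5, 1
    r5 = cmovgt(r5, r30, gt)         # cmovgt r5, r30, cr0
    return r5, r6 + r5               # add    r6, r6, r5

for r4 in range(-3, 4):
    assert with_branch(r4, 10, 100) == if_converted(r4, 10, 100)
print(if_converted(5, 10, 100))      # (11, 111)
print(if_converted(-5, 10, 100))     # (9, 109)
```

The if-converted form does more arithmetic per execution, so it pays off mainly when the branch is hard to predict.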

Same-ISA Optimization Systems
• Fast initial emulation of the source binary is easy to perform.
• Dynamic optimization is not a necessity.
• Sample-based profiling is more attractive.
• No instruction semantic mismatch problems.

Optimization Using Basic Block Cache
[Figure: source binary blocks (W, A, B, C, X, D, E, Y), a map table, a basic block cache, and a superblock cache. Labels shown: stub, link, indirect jump.]

Code Patching
[Figure: source binary blocks (W, A, B, C, X, D, E, Y) and a superblock cache. Labels shown: patch, link, indirect jump.]
