
# Dynamic Binary Optimization - PowerPoint PPT Presentation

Dynamic Binary Optimization. Presenter: Kim Jin Chul. Contents: 1. Overview of Applying Optimization on VMs. 2. Dynamic Program Behavior. 3. Profiling. 4. Optimizing Translation Blocks.

Presentation Transcript

### Dynamic Binary Optimization

Presenter: Kim Jin Chul

Contents

1. Overview of Applying Optimization on VMs

2. Dynamic Program Behavior

3. Profiling

4. Optimizing Translation Blocks

Classical Optimizations

movl 4(%eax), %edx

Translation from IA-32 to PowerPC code:

addi r16, r4, 4 ; add 4 to %eax

lwzx r17, r2, r16 ; load operand from memory

add r7, r17, r7 ; perform add of %edx

addi r16, r4, 4 ; add 4 to %eax

stwx r7, r2, r16 ; store %edx value into memory

After applying common subexpression elimination (the second addi is redundant):

addi r16, r4, 4 ; add 4 to %eax

lwzx r17, r2, r16 ; load operand from memory

add r7, r17, r7 ; perform add of %edx

stwx r7, r2, r16 ; store %edx value into memory
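The elimination above can be illustrated with a minimal Python sketch. It is not the presenter's implementation: the tuple encoding `(opcode, dest, src1, src2)`, the `PURE_OPS` set, and the function name are assumptions made for illustration, and the sketch only drops a recomputation when the value is still held in the same destination register (a fuller pass would also rewrite later uses).

```python
# Hypothetical encoding: (opcode, dest, src1, src2), following the slide's PowerPC listing.
PURE_OPS = {"addi", "add"}  # register-only ops whose result depends only on their sources

def eliminate_common_subexpressions(block):
    """Drop instructions that recompute a value still held in the same register."""
    available = {}  # (opcode, src1, src2) -> register currently holding that value
    result = []
    for opcode, dest, src1, src2 in block:
        key = (opcode, src1, src2)
        if opcode in PURE_OPS and available.get(key) == dest:
            continue  # the value is already in dest: the instruction is redundant
        result.append((opcode, dest, src1, src2))
        if opcode == "stwx":
            continue  # a store writes memory, not a register
        # dest is overwritten: forget expressions that read it or were held in it
        available = {k: r for k, r in available.items()
                     if r != dest and dest not in k[1:]}
        if opcode in PURE_OPS and dest not in (src1, src2):
            available[key] = dest
    return result

block = [
    ("addi", "r16", "r4", 4),
    ("lwzx", "r17", "r2", "r16"),
    ("add", "r7", "r17", "r7"),
    ("addi", "r16", "r4", 4),    # recomputes %eax + 4
    ("stwx", "r7", "r2", "r16"),
]
optimized = eliminate_common_subexpressions(block)
```

The second `addi` is removed because neither `r4` nor `r16` is overwritten between the two computations; if either were, the sketch would keep the instruction.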

Basic Block A

...

R3 ← ...

R7 ← ...

R1 ← R2 + R3

Br L1 if R3 == 0

Basic Block B

...

R6 ← R1 + R6

...

Basic Block C

L1: R1 ← 0

...

[Figure: three versions of this code. In the later versions, R1 ← R2 + R3 is removed from Basic Block A and appears as compensation code ahead of Basic Block B; Basic Blocks B and C are unchanged. The figure marks the use and the def of R1.]

Optimization Based on Profiling

Original code:

Basic Block A

...

R3 ← ...

R7 ← ...

R1 ← R2 + R3

Br L1 if R3 == 0

Basic Block B

...

R6 ← R1 + R6

...

Basic Block C

L1: R1 ← 0

...

After forming a superblock:

Superblock

...

R3 ← ...

R7 ← ...

Br L2 if R3 != 0

R1 ← 0

...

Basic Block B

L2: R1 ← R2 + R3

R6 ← R1 + R6

...

A staged optimization system

Stages: Interpret → Basic translation → Optimized blocks → Highly optimized blocks

Startup: fast → very slow

Profiling: simple → extensive

Components: Interpreter, Binary memory image, Basic block cache, Code cache, Profile data, Optimizer, Translator, Emulation manager
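As a rough illustration of how an emulation manager might move code through these stages, the Python sketch below picks a stage from an execution count. The threshold values and the names `STAGES` and `stage_for` are hypothetical; the slides name the stages but give no trigger values.

```python
# Hypothetical thresholds: the slides name the stages but not when to switch.
STAGES = [
    (0, "interpret"),
    (50, "basic translation"),
    (1000, "optimized block"),
    (10000, "highly optimized block"),
]

def stage_for(execution_count):
    """Return the most aggressive stage whose threshold the count has reached."""
    chosen = STAGES[0][1]
    for threshold, name in STAGES:
        if execution_count >= threshold:
            chosen = name
    return chosen
```

Rarely executed code stays in the cheap, fast-starting stages, while only frequently executed code pays for the slower, more heavily profiled ones.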

Dynamic Program Behavior

• Dynamic control flow is highly predictable

...

R3 ← 100

loop: R1 ← mem(R2)

Br found if R1 == -1

R2 ← R2 + 4

R3 ← R3 - 1

Br loop if R3 != 0

...

found: ...
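To see why the branches in this loop are predictable, the following Python sketch simulates it and counts how each branch resolves. The function name, the dictionary keys, and the use of a list index in place of a byte address are assumptions made for illustration.

```python
def run_search_loop(memory, limit=100):
    """Simulate the slide's search loop and count how each branch resolves."""
    counts = {"found_taken": 0, "found_not_taken": 0,
              "loop_taken": 0, "loop_not_taken": 0}
    r2 = 0       # list index standing in for the address in R2
    r3 = limit   # R3 <- 100
    while True:
        r1 = memory[r2]                  # loop: R1 <- mem(R2)
        if r1 == -1:                     # Br found if R1 == -1
            counts["found_taken"] += 1
            break
        counts["found_not_taken"] += 1
        r2 += 1                          # R2 <- R2 + 4 (one word further)
        r3 -= 1                          # R3 <- R3 - 1
        if r3 != 0:                      # Br loop if R3 != 0
            counts["loop_taken"] += 1
        else:
            counts["loop_not_taken"] += 1
            break
    return counts

no_match = run_search_loop([0] * 100)
```

When no element equals -1, the backward branch is taken 99 times out of 100 and the forward branch is never taken, so both are decided the same way almost every time.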

Dynamic Program Behavior

• Distribution of taken conditional branches

[Figure: histogram of the fraction of static conditional branches (y-axis, 0% to 50%) against percent taken (x-axis, bins from 0-10% to >90%).]

Predominantly not taken: 28%

Predominantly taken: 42%

Dynamic Program Behavior

• Consistency of conditional branches

• The high percentage consists of backward branches

[Figure: bar chart of dynamic branches decided the same as the previous time (y-axis, 0% to 100%) for the SPEC benchmarks 176.gcc, 181.mcf, 197.parser, 252.eon, 256.bzip2, 171.swim, 173.applu, 177.mesa, 187.facerec, and 189.lucas.]

Dynamic Program Behavior

• The predictability of indirect jumps

• Some jump destination addresses seldom change

[Figure: histogram of the percent of indirect jumps (y-axis, 0% to 25%) against the number of different destinations (x-axis, 1 to >9).]

Dynamic Program Behavior

• The predictability of data value

[Figure: bar chart of the fraction with constant value (y-axis, 0 to 0.7) by instruction type (All, Logic, Shift, Set), with two series. Static: static instructions that always compute the same value. Dynamic: dynamic instructions that execute those static instructions.]

Profiling

• The process of collecting instruction and data statistics for an executing program

• Optimization builds on the profiling work

[Figure: the staged optimization system, showing the interpreter, binary memory image, basic block cache, code cache, profile data, optimizer, translator, and emulation manager.]

The Role of Profiling

[Figure: conventional profile-based compilation. Components shown: HLL program, compiler frontend, compiler backend, instrumented code, program execution with test data, program statistics, optimizing compiler, and optimized binary, alongside a control flow graph of blocks A through F.]

The Role of Profiling

• On-the-fly profiling in a dynamic optimizing VM

[Figure: components shown are the program binary, program data, interpreter, partial program statistics, and translator/optimizer, alongside a control flow graph of blocks A, B, D, and E.]

Types of Profiles

• Several types of profile data

• How frequently are different code regions executed?

• This can be used to decide the level of optimization

• Is control flow predictable?

• This may be used as the basis for gathering and rearranging basic blocks

• Rearranged basic blocks can then be merged into a superblock

Types of Profiles

[Figure: a basic block profile and an edge profile for a control flow graph of blocks A through F, with an execution count on each block and each edge respectively.]

A basic block profile

An edge profile
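The difference between the two profile types can be sketched in Python. The execution trace below is made up for illustration (it is not the data from the slide's figure), and the function names are assumptions. The sketch also shows that a block profile can be recovered from an edge profile, which is one reason edge profiles carry more information.

```python
from collections import Counter

def profiles_from_trace(trace):
    """Build a basic block profile and an edge profile from an execution trace."""
    block_profile = Counter(trace)                   # how often each block ran
    edge_profile = Counter(zip(trace, trace[1:]))    # how often each edge was followed
    return block_profile, edge_profile

def block_counts_from_edges(edge_profile, entry_block):
    """Recover block counts: incoming edge counts plus one initial entry."""
    counts = Counter({entry_block: 1})
    for (_, dst), n in edge_profile.items():
        counts[dst] += n
    return counts

# Hypothetical trace of executed basic blocks.
trace = ["A", "B", "D", "A", "B", "D", "A", "C", "D"]
blocks, edges = profiles_from_trace(trace)
```

Going the other way is not possible in general: knowing only that D ran three times does not say how often it was reached from B rather than from C.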

Collecting Profiles

• Instrumentation-based profiling

• Probes target specific program-related events and count all instances of the events being profiled

• Software-based vs. hardware-based

• Speed? Support? Flexibility?

• Sampling-based profiling

• The program runs in its unmodified form; at intervals it is interrupted and an event is captured

• Instrumentation vs. sampling

• Overhead per collected data point: instrumentation < sampling

• Each sample causes a trap!

Profiling During Interpretation

Profile Table for Collecting an Edge Profile During Interpretation

[Figure: the branch PC is hashed (HASH) to index a profile table whose entries hold the PC, a taken count, and a not-taken count.]

PowerPC Branch Conditional Interpreter Routine

Instruction function list

...

branch_conditional(inst) {

  BO = extract(inst, 25, 5);

  BI = extract(inst, 20, 5);

  displacement = extract(inst, 15, 14) * 4;

  ...

  // code to compute whether branch should be taken

  ...

  if (branch_taken)

    PC = PC + displacement;

  else

    PC = PC + 4;

}
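The profile table described above could be modelled roughly as follows in Python. This is a sketch under stated assumptions: the class name, the table size, the chained buckets, and the hash (dropping the two alignment bits and taking a modulus) are all hypothetical choices, since the slide only shows that the branch PC is hashed to find taken and not-taken counters.

```python
TABLE_SIZE = 1024  # hypothetical size; the slide does not give one

class BranchProfileTable:
    """Edge profile for conditional branches, indexed by a hash of the branch PC."""

    def __init__(self, size=TABLE_SIZE):
        self.size = size
        # Each bucket holds entries of the form [pc, taken_count, not_taken_count].
        self.buckets = [[] for _ in range(size)]

    def _entry(self, pc):
        # Assumed hash: drop the 2 alignment bits, then reduce modulo the table size.
        bucket = self.buckets[(pc >> 2) % self.size]
        for entry in bucket:
            if entry[0] == pc:   # the stored PC disambiguates hash collisions
                return entry
        entry = [pc, 0, 0]
        bucket.append(entry)
        return entry

    def record(self, pc, taken):
        """Called by the interpreter routine once the branch outcome is known."""
        self._entry(pc)[1 if taken else 2] += 1

    def counts(self, pc):
        _, taken, not_taken = self._entry(pc)
        return taken, not_taken
```

An interpreter routine such as `branch_conditional` would call `record(PC, branch_taken)` just before updating the PC; storing the PC in each entry is what lets two branches that hash to the same bucket keep separate counts.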

Profiling Translated Code

Edge Profiling Code Inserted into Stubs of a Binary Translated Basic Block

Fall-through stub:

increment edge counter (i)

if (counter (i) > trigger) then invoke optimizer

else branch to fall-through basic block

Target stub:

increment edge counter (j)

if (counter (j) > trigger) then invoke optimizer

else branch to target basic block
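The stub logic above can be sketched in Python as a small class with one counter per exit edge. The class name, the edge labels, and the callback are hypothetical; only the increment, compare-with-trigger, and invoke-optimizer steps come from the slide.

```python
class TranslatedBlockStub:
    """Exit stubs of one translated basic block, each with an edge counter."""

    def __init__(self, trigger, invoke_optimizer):
        self.trigger = trigger
        self.invoke_optimizer = invoke_optimizer   # hypothetical callback
        self.edge_counters = {"fall_through": 0, "target": 0}

    def exit_via(self, edge):
        self.edge_counters[edge] += 1                  # increment edge counter
        if self.edge_counters[edge] > self.trigger:    # counter > trigger?
            return self.invoke_optimizer(edge)         # then invoke optimizer
        # else branch to the corresponding successor block
        return "branch to " + edge.replace("_", "-") + " basic block"
```

With `trigger=2`, the first two exits through an edge simply branch onward and the third hands control to the optimizer.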

Emulation Stages

Profiling Overhead

• Profiling during interpretation incurs 10-20% overhead

• Profiling overheads can be reduced

• Reduce the number of instrumentation points by selecting a smaller set of key points

Optimizing Translation Blocks

• Two-part strategy for optimizing

• Using dominant control flow to enhance memory locality

• Making translation blocks larger

• Traces, superblocks, tree groups

• The two parts of the strategy are actually relatively independent

Improving Locality

• Two kinds of memory locality

• Spatial locality

• An access to a memory location is soon followed by an access to an adjacent memory location

• Temporal locality

• A memory location that is accessed is accessed again in the near future

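Improving Locality

• Example code sequence

[Figure: control flow graph of blocks A through G with profiled edge counts. Labels as extracted, top to bottom: 3 at the entry; A: 30, 70; D and B: 1, 29, 68, 2; E, F, and C: 29, 68, 1; G: 97, 1.]

A

Br cond1 == true

B

Br cond2 == false

C

Br uncond

D

Br cond3 == true

E

Br uncond

F

G

Br cond4 == true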

Improving Locality

• Rearrange the blocks in memory

[Figure: the same control flow graph of blocks A through G with profiled edge counts.]

A

Br cond1 == false

D

Br cond3 == true

E

G

Br cond4 == true

Br uncond

B

Br cond2 == false

C

Br uncond

F

Br uncond
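One way such a layout could be derived from an edge profile is a greedy walk along the hottest edges, sketched below in Python. This is an illustration, not necessarily the method the presenter had in mind. The edge counts loosely follow the slide's figure; the B to E edge of count 1 and the omission of the exit edge from G are assumptions, and the function name is hypothetical.

```python
def layout_blocks(edges, entry):
    """Greedy layout: start at a seed block and keep following the hottest edge."""
    incoming = {}
    for src, succs in edges.items():
        incoming.setdefault(src, 0)
        for dst, count in succs.items():
            incoming[dst] = incoming.get(dst, 0) + count
    placed, traces = set(), []
    # The entry block seeds the first trace; later seeds go from hottest to coldest.
    seeds = [entry] + sorted((b for b in incoming if b != entry),
                             key=lambda b: -incoming[b])
    for seed in seeds:
        block, trace = seed, []
        while block is not None and block not in placed:
            trace.append(block)
            placed.add(block)
            succs = edges.get(block, {})
            block = max(succs, key=succs.get) if succs else None
        if trace:
            traces.append(trace)
    return traces

# Edge counts loosely based on the figure; B -> E is an assumed edge.
edges = {
    "A": {"B": 30, "D": 70},
    "B": {"C": 29, "E": 1},
    "C": {"G": 29},
    "D": {"E": 68, "F": 2},
    "E": {"G": 68},
    "F": {"G": 1},
    "G": {"A": 97},
}
layout = layout_blocks(edges, "A")
```

The resulting order, A D E G followed by B C and then F, matches the rearranged sequence on the slide: the dominant path becomes straight-line code and the rarely executed blocks move out of the way.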

Improving Locality

• Procedure inlining

• Positive and negative effects?

[Figure: procedure inlining. Code regions A, B and K, L each contain "Call proc xyz"; "Proc xyz" consists of X, Y, Z and Return. The inlined versions place copies of the procedure's blocks at the call sites.]

Traces

• Trace

• A contiguous sequence of basic blocks

• May have both side entrances and side exits

[Figure: Relations between Superblocks and Traces. The control flow graph of blocks A through G, with profiled edge counts, is divided into Trace 1, Trace 2, and Trace 3.]

Superblocks

• Superblocks

• Regions of code with only one entry and one or more exit points

[Figure: the control flow graph of blocks A through G before and after superblock formation; block G is duplicated in the superblock version.]

Superblocks

Rearranged layout:

A

Br cond1 == false

D

Br cond3 == true

E

G

Br cond4 == true

Br uncond

B

Br cond2 == false

C

Br uncond

F

Br uncond

Layout with G duplicated after C and after F:

A

Br cond1 == false

D

Br cond3 == true

E

G

Br cond4 == true

Br uncond

B

Br cond2 == false

C

G

Br cond4 == true

Br uncond

F

G

Br cond4 == true

Br uncond
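The duplication of G shown above is commonly called tail duplication. A minimal Python sketch of the idea follows; the function name and the `hottest_successor` map are assumptions, and the sketch only handles the simple case where a region would otherwise fall into the middle of another region.

```python
def form_superblocks(traces, hottest_successor):
    """Turn traces into single-entry regions by copying shared tail blocks."""
    heads = {trace[0] for trace in traces}   # legal entry points
    superblocks = []
    for trace in traces:
        blocks = list(trace)
        nxt = hottest_successor.get(blocks[-1])
        # Jumping into the middle of another region would be a side entrance,
        # so append a private copy of that block instead. Stop at a region head
        # (a legal entry) or when the block is already part of this region.
        while nxt is not None and nxt not in heads and nxt not in blocks:
            blocks.append(nxt)
            nxt = hottest_successor.get(nxt)
        superblocks.append(blocks)
    return superblocks

traces = [["A", "D", "E", "G"], ["B", "C"], ["F"]]
hottest_successor = {"A": "D", "B": "C", "C": "G", "D": "E",
                     "E": "G", "F": "G", "G": "A"}
superblocks = form_superblocks(traces, hottest_successor)
```

Each resulting region is entered only at its first block: C and F now continue into their own copies of G rather than jumping into the middle of the A D E G region.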

Tree Groups

• Tree groups

• Regions of code with only one entry and one or more exit points

[Figure 4.7: a tree group over blocks A, D, B, E, F, and C, in which block G appears three times.]

Thank You!

SPEC benchmarks

• Integer SPEC benchmarks

• 176.gcc – GNU Compiler

• 181.mcf – Combinatorial Optimization

• 197.parser – Word Processor

• 252.eon – Computer Visualization

• 256.bzip2 – Compression

• Floating-Point SPEC benchmarks

• 171.swim – Shallow Water Modeling

• 173.applu – Parabolic

• 187.facerec – Image Processing

• 189.lucas – Number Theory