
DYNAMO vs. ADORE: A Tale of Two Dynamic Optimizers

Wei Chung Hsu

徐慰中

Computer Science Department

交通大學 (National Chiao Tung University)

(work done at the University of Minnesota, Twin Cities)

3/05/2010



Dynamo

  • Dynamo is a dynamic optimizer

  • It won the best paper award at PLDI 2000 and has been cited 612 times

  • Work started at HP Labs and the HP systems lab.

  • MIT took over and ported it to x86 as DynamoRIO. This group later started a company, Determina (since acquired by VMware)

  • Considered revolutionary, since optimizations had previously always been performed statically (i.e., at compile time)



SPEC CINT2006 for Opteron X4

Time = CPI × Instruction count × Clock period

Very high cache miss rates

Ideal CPI should be 0.33
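The iron-law relation on this slide can be made concrete with a small sketch; the miss rate, miss penalty, and clock numbers below are illustrative, not taken from the Opteron X4 data:

```python
def exec_time_seconds(instructions, base_cpi, miss_rate, miss_penalty, clock_ghz):
    """Time = CPI x instruction count x clock period, where the
    effective CPI is the base CPI plus memory stall cycles per
    instruction (miss rate x miss penalty)."""
    cpi = base_cpi + miss_rate * miss_penalty
    return cpi * instructions / (clock_ghz * 1e9)

# Ideal CPI of 0.33 vs. the same code with a 2% miss rate and a
# 200-cycle miss penalty: the memory stalls dominate the CPI.
ideal = exec_time_seconds(1e9, 0.33, 0.0, 200, 2.5)
stalled = exec_time_seconds(1e9, 0.33, 0.02, 200, 2.5)
```

Even a 2% miss rate pushes the effective CPI from 0.33 to 4.33 here, which is why the slide singles out cache misses.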



Where have all the cycles gone?

  • Cache misses

    • Capacity, Compulsory/Cold, Conflict, Coherence

    • I-cache and D-cache

    • TLB misses

  • Branch mis-predictions

    • Static and dynamic prediction

    • Mis-speculation

  • Pipeline stalls

    • Ineffective code scheduling, often caused by memory aliasing

These events are unpredictable and hard to deal with at compile time



Trend of Multi-cores

[Image] Intel Core i7 die photo

Exploiting these potentials demands thread-level parallelism



Exploiting Thread-Level Parallelism

[Diagram] Sequential execution runs Store *p, then Load *q, one after the other. Traditional parallelization: when the compiler cannot prove p != q, it gives up and keeps the code sequential. Thread-Level Speculation (TLS): threads run in parallel assuming p != q; if at runtime p == q, the dependence is detected (a speculative Load saw the stale value 20 while an earlier thread performed Store 88), speculation fails, and the load re-executes, now seeing 88. Parallel execution with potentially more parallelism through speculation, but the benefit is unpredictable.
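The squash-and-re-execute rule above can be mimicked in a toy model. This is not how hardware TLS works (real designs track read/write sets in the cache hierarchy); it is only an illustration, and every name in it is invented:

```python
def run_with_tls(epochs, memory):
    """Toy thread-level speculation: run all epochs optimistically
    against a snapshot, recording read/write sets; at in-order commit,
    squash and sequentially re-run any epoch that read a location an
    earlier epoch wrote (the p == q case on the slide)."""
    snapshot = dict(memory)
    results = []
    for fn in epochs:                       # "parallel" speculative runs
        local, reads, writes = dict(snapshot), set(), set()
        fn(local, reads, writes)
        results.append((local, reads, writes))
    failures = 0
    for i, (local, reads, writes) in enumerate(results):
        earlier_writes = set()
        for j in range(i):
            earlier_writes |= results[j][2]
        if reads & earlier_writes:          # dependence detected
            failures += 1                   # speculation failure: redo
            local, reads, writes = dict(memory), set(), set()
            epochs[i](local, reads, writes)
        for k in writes:                    # in-order commit
            memory[k] = local[k]
    return failures

# The slide's conflict: a later epoch loads what an earlier one stores.
mem = {'p': 20}
def writer(m, r, w): m['p'] = 88; w.add('p')
def reader(m, r, w): r.add('p'); m['x'] = m['p']; w.add('x')
fails = run_with_tls([writer, reader], mem)
```

The reader epoch speculatively sees the stale value 20, is squashed, and re-executes after the writer's commit, ending with x == 88.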



Dynamic Optimizers

Dynamic optimizers fall into two families:

  • Java VMs (JVM) with JIT compilers (dynamic compilation or adaptive optimization)

  • Dynamic Binary Optimizers (DBO):

    • Native-to-native dynamic binary optimizers (x86 → x86, x86-32 → x86-64, IA64 → IA64)

    • Non-native dynamic binary translators (e.g. x86 → IA64, ARM → MIPS, PPC → x86; QEMU, VMware, Rosetta)



More on why dynamic binary optimization

  • New architecture/micro-architecture features offer more opportunity for performance, but are not effectively exploited by legacy binaries.

    • x86 P5/P6/PII/PIII, x86-32/x86-64, PA 7200/8000, …

  • Software evolution and ISV behaviors reduce the effectiveness of traditional static optimizers

    • DLLs, middleware, binary distribution, …

  • Profile-sensitive optimizations would be more effective if performed at runtime

    • predication, speculation, branch prediction, prefetching

  • Multi-core environments with dynamic resource sharing make static optimization challenging

    • shared caches, off-chip bandwidth, shared FUs



How Dynamo Works

[Flowchart] Dynamo is VM based. The main loop:

  • Interpret until a taken branch, then look up the branch target in the code cache; on a hit, jump to the code cache.

  • On a miss, check the start-of-trace condition and increment the counter for the branch target.

  • When the counter exceeds the threshold, switch to interpret + code gen until the end-of-trace condition is met, then create the trace, optimize it, and emit it into the cache.

  • Control returns from the code cache (via the signal handler) to the interpreter loop.
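The flowchart compresses into a few lines. This is a minimal sketch with an invented program encoding (each pc maps to its taken-branch target), not Dynamo's actual implementation, and the threshold is illustrative:

```python
HOT_THRESHOLD = 3        # illustrative only; not Dynamo's real value
MAX_TRACE_LEN = 8

def select_trace(program, head):
    """End-of-trace condition: stop at a cycle back to the head or a length cap."""
    trace, pc = [head], program[head]
    while pc != head and len(trace) < MAX_TRACE_LEN:
        trace.append(pc)
        pc = program[pc]
    return trace

def dynamo_step(program, pc, code_cache, counters):
    """One trip around the flowchart: code-cache lookup; on a miss,
    'interpret' one taken branch, bump the target's counter, and once
    it is hot, create and cache a trace starting there."""
    if pc in code_cache:                     # hit: jump to code cache
        trace = code_cache[pc]
        return program[trace[-1]], True      # follow the trace's exit
    target = program[pc]                     # interpret until taken branch
    counters[target] = counters.get(target, 0) + 1
    if counters[target] > HOT_THRESHOLD:     # start-of-trace condition
        code_cache[target] = select_trace(program, target)
    return target, False

# Drive a 3-block loop until execution migrates into the code cache.
program = {0: 1, 1: 2, 2: 0}
code_cache, counters, pc, hits = {}, {}, 0, 0
for _ in range(30):
    pc, hit = dynamo_step(program, pc, code_cache, counters)
    hits += hit
```

After a few iterations the hot loop head is cached and nearly every subsequent step is a code-cache hit, which is the "execution migrates to code cache" effect shown later.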



Trace Selection

[Diagram] Trace selection follows a hot path through the control-flow graph, here A → C → D → F → G → E → I, crossing a call and its return, and lays those blocks out contiguously in the trace cache. Side exits leave the trace back to the runtime: to B, to H, and the return to the caller.



Backpatching

[Diagram] When H becomes hot, a new trace is selected starting from H, and the trace exit branch in block F is backpatched to branch to the new trace instead of returning to the runtime. The remaining exits (to B, back to runtime) are unchanged.
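A dictionary-based sketch of the backpatching idea; trace bodies and "links" are toy stand-ins for real branch patching, and all names are invented:

```python
def install_trace(code_cache, pending_exits, head, body, exits):
    """Install a trace whose side exits initially return to the runtime,
    then backpatch: any older trace waiting on `head` (like block F's
    exit on the slide) now branches straight to the new trace."""
    code_cache[head] = {"body": body, "links": {}}
    for target in exits:
        if target in code_cache:                 # target already cached
            code_cache[head]["links"][target] = target
        else:                                    # unresolved: exit to runtime
            pending_exits.setdefault(target, []).append(head)
    for waiter in pending_exits.pop(head, []):   # backpatch older traces
        code_cache[waiter]["links"][head] = head

cache, pending = {}, {}
install_trace(cache, pending, "A", ["A", "C", "D", "F"], exits=["B", "H"])
install_trace(cache, pending, "H", ["H", "I", "E"], exits=[])
```

Installing the H trace resolves A's pending exit to H, while A's exit to B still falls back to the runtime.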



Execution Migrates to Code Cache

[Diagram] Over time, execution migrates out of the original a.out: the interpreter/emulator hands hot targets to the trace selector, the optimizer emits optimized traces into the code cache, and an increasing share of execution runs from the code cache.


Trace Based Optimizations

  • Full and partial redundancy elimination

  • Dead code elimination

  • Trace scheduling

  • Instruction cache locality improvement

  • Dynamic procedure inlining (or procedure outlining)

  • Some loop based optimizations
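As one concrete instance, dead-code elimination over a straight-line trace is a single backward liveness pass; this sketch uses an invented three-address representation and ignores the side-exit live sets a real trace optimizer must account for:

```python
def eliminate_dead_code(trace, live_out):
    """Backward liveness pass over a straight-line trace of
    (dest, op, srcs) tuples: drop any instruction whose destination
    is not live afterwards. Traces make this cheap because there is a
    single entry and no internal control flow (a real system must
    also treat side-exit live values as live)."""
    live, kept = set(live_out), []
    for dest, op, srcs in reversed(trace):
        if dest in live:
            live.discard(dest)
            live.update(srcs)
            kept.append((dest, op, srcs))
    kept.reverse()
    return kept

trace = [("t1", "add", ("a", "b")),
         ("t2", "mul", ("a", "a")),   # dead: t2 is never used
         ("r",  "add", ("t1", "c"))]
optimized = eliminate_dead_code(trace, live_out={"r"})
```

The dead t2 computation is removed while the chain feeding the live result r is kept.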



Summary of Dynamo

  • Dynamic Binary Optimization customizes performance delivery:

    • Code is optimized by how the code is used

      • Dynamic trace formation and trace-based optimizations

    • Code is optimized for the machine it runs on

    • Code is optimized when all executables are available

    • Only the parts of the code that really matter are optimized



ADORE

  • ADORE stands for ADaptive Object code RE-optimization

  • Was developed at the CSE department, U. of Minnesota, Twin Cities

  • Applied a very different model for dynamic optimization systems

  • Considered evolutionary; cited 61 times



Dynamic Binary Optimizer’s Models

[Diagram] Two models, each drawn as a stack of Application Binaries / DBO / Operating System / Hardware Platform:

  • Translate only hot execution paths and keep them in the code cache

    • Lower overhead

    • ADORE (IA64, SPARC); COBRA (IA64, x86, ongoing)

  • Translate most execution paths and keep them in the code cache

    • Easy to maintain control

    • Dynamo (PA-RISC); DynamoRIO (x86)



ADORE Framework

[Diagram] The kernel initializes the hardware Performance Monitoring Unit (PMU), which interrupts on events; an interrupt on kernel-buffer overflow delivers samples to the dynamic optimization thread. That thread runs phase detection, performs trace selection on a phase change, passes traces to the optimizer, and deploys the optimized traces by patching them into the code cache used by the main thread (initial code cache → optimized traces).
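A toy version of the phase-detection step. ADORE's real detector works on sampled PC and event profiles from the PMU; the windowing scheme and all numbers here are invented for illustration:

```python
def detect_phase_changes(samples, window=4, tolerance=0.25):
    """Toy phase detector over a stream of PMU-derived samples
    (e.g., CPI per sampling interval): flag a phase change when the
    mean of the current window drifts from the previous window's
    mean by more than `tolerance` (relative)."""
    changes, prev_mean = [], None
    for i in range(0, len(samples) - window + 1, window):
        mean = sum(samples[i:i + window]) / window
        if prev_mean is not None and abs(mean - prev_mean) > tolerance * prev_mean:
            changes.append(i)          # new stable behavior begins here
        prev_mean = mean
    return changes

# A workload whose CPI doubles halfway through: one phase change.
samples = [1.0] * 8 + [2.0] * 8
changes = detect_phase_changes(samples)
```

Triggering trace selection only on such changes is what keeps the profiling-and-optimization loop cheap enough to run continuously.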



Thread Level View

[Diagram] Thread 1 (the application) initializes ADORE and then runs normally. Thread 2 (ADORE) sleeps until the user buffer fills: the kernel-buffer overflow handler fills the user buffer, ADORE is invoked to process the samples, then goes back to sleep. The user buffer is maintained for one main event, usually CPU_CYCLES.



Perf. of ADORE/Itanium on SPEC2000



Performance on BLAST



ADORE vs. Dynamo



ADORE on Multi-cores

  • COBRA (Continuous Object code Re-Adaptation) is a follow-up framework, implemented on Itanium Montecito and on x86 multi-core machines.

  • ADORE on SPARC Panther (Ultra Sparc IV+) multi-core machines.

  • ADORE for TLS tuning



COBRA Framework

[Diagram] Optimization thread (centralized control): initialization, trace selection, trace optimization, trace patching. Monitor threads (localized control): per-thread profiles.



Startup of 4 thread OpenMP Program




Prefetch vs. NoPrefetch

The prefetch version, when running with 4 threads, suffers significantly from L2_OZQ_FULL stalls.

[Chart: 26% and 34% markers]



Prefetch vs. Prefetch with .excl

The .excl hint prefetches a cache line in exclusive state instead of shared state (relevant under an invalidation-based cache coherence protocol).

[Chart: 15% and 12% markers]



Execution time on 4-way SMP

  • noprefetch: up to 15%, average 4.7% speedup

  • prefetch.excl: up to 8%, average 2.7% speedup




Execution time on cc-NUMA

  • noprefetch: up to 68%, average 17.5% speedup

  • prefetch.excl: up to 18%, average 8.5% speedup




Summary of Results from COBRA

We showed that coherence misses caused by aggressive prefetching can limit the scalability of multithreaded programs on scalable shared-memory multiprocessors.

Guided by the runtime profile, we experimented with two optimizations:

  • Reducing the aggressiveness of prefetching

    • Up to 15%, average 4.7% speedup on a 4-way SMP

    • Up to 68%, average 17.5% speedup on an SGI Altix cc-NUMA

  • Using the exclusive hint for prefetches

    • Up to 8%, average 2.7% speedup on a 4-way SMP

    • Up to 18%, average 8.5% speedup on an SGI Altix cc-NUMA



ADORE/SPARC

ADORE has run on the SPARC/Solaris platform since 2005.

Some porting issues:

  • ADORE uses the libcpc interface on Solaris for runtime profiling. A kernel-buffer enhancement was added to Solaris 10.0 to reduce profiling and phase-detection overhead.

  • Reachability is a real problem (e.g. Oracle, Dyna3D).

  • The lack of a branch trace buffer is painful (e.g. BLAST).



Performance of In-Thread Opt. (USIII+)



[Diagram] On the first core, the main thread takes an L2 cache miss and fires a trigger to activate the helper (about 65 cycles of delay). On the second core, the helper thread leaves its spin-wait, issues the prefetches, then spins again waiting for the next trigger. The main thread's subsequent cache misses are avoided.
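The trigger/spin-wait cycle can be mimicked with ordinary threads. In this sketch a shared dict stands in for the shared cache and a semaphore for the trigger; real helper threads issue hardware prefetch instructions, so everything here is an invented illustration:

```python
import threading

def run_with_helper(data, cache, chunk=4):
    """Toy helper-thread prefetching: before each chunk the main
    thread fires a trigger; the helper wakes from its wait, copies
    ('prefetches') the upcoming chunk into the shared cache, and goes
    back to waiting for the next trigger, as in the slide's timeline."""
    trigger = threading.Semaphore(0)
    done = threading.Event()

    def helper():
        start = 0
        while not done.is_set():
            trigger.acquire()                    # spin/wait for trigger
            if done.is_set():
                break
            for i in range(start, min(start + chunk, len(data))):
                cache[i] = data[i]               # the 'prefetch'
            start += chunk

    t = threading.Thread(target=helper)
    t.start()
    total, hits = 0, 0
    for i, x in enumerate(data):
        if i % chunk == 0:
            trigger.release()                    # fire the trigger
            while i not in cache:                # brief wait for warm-up
                pass
        hits += i in cache
        total += cache.get(i, x)                 # miss falls back to memory
    done.set()
    trigger.release()                            # unblock a waiting helper
    t.join()
    return total, hits

data = list(range(16))
shared_cache = {}
total, hits = run_with_helper(data, shared_cache)
```

The main thread computes the same result either way; the helper just turns would-be misses into hits, which is the whole point of the technique.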



Performance of Dynamic Helper Thread (on Sun UltraSPARC IV+)



Evaluation Environment for TLS

Benchmarks

  • SPEC2000 benchmarks written in C, compiled with -O3

    Underlying architecture

  • 4-core, chip-multiprocessor (CMP)

  • speculation supported by coherence

    Simulator

  • Superscalar with detailed memory model

  • simulates communication latency

  • models bandwidth and contention

[Diagram] Four processors (P), each with a private cache (C), connected by an interconnect. Detailed, cycle-accurate simulation.



Dynamic Tuning for TLS

[Chart: speedups of 1.37x, 1.23x, and 1.17x; parallel code overhead indicated]



Summary of ADORE

  • ADORE uses Hardware Performance Monitoring (HPM) capability to implement a lightweight runtime profiling system. Efficient profiling and phase detection are the key to the success of dynamic native binary optimizers.

  • ADORE can speed up real-world large applications optimized by production compilers.

  • ADORE works on two architectures: Itanium and SPARC. COBRA is a follow-up system of ADORE. It works on Itanium and x86.

  • ADORE/COBRA can also optimize for multi-cores.

  • ADORE has recently been applied to dynamic TLS tuning.



Conclusion

“It was the best of times, it was the worst of times…”

-- opening line of “A Tale of Two Cities”

  • Best of times for research: new areas where innovations are needed

  • Worst of times for research: saturated areas where technologies are mature or well understood, and it is hard to innovate

