Day 4 symbiotic optimization
This presentation is the property of its rightful owner.
Sponsored Links
1 / 32

Day 4: Symbiotic Optimization PowerPoint PPT Presentation


  • 40 Views
  • Uploaded on
  • Presentation posted in: General

Day 4: Symbiotic Optimization. Kim Hazelwood ACACES Summer School July 2009. Modern Computing Challenges. Performance Power & temperature Reliability Parallelism (multicore) Heterogeneity Limited resources (embedded computing). SW. HW. Typical Approaches.

Download Presentation

Day 4: Symbiotic Optimization

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Day 4 symbiotic optimization

Day 4: Symbiotic Optimization

Kim Hazelwood

ACACES Summer School

July 2009


Modern computing challenges

Modern Computing Challenges

  • Performance

  • Power & temperature

  • Reliability

  • Parallelism (multicore)

  • Heterogeneity

  • Limited resources (embedded computing)


Typical approaches

SW

HW

Typical Approaches

  • Optimize using SW or HW techniques in isolation

  • Performance

    • SW: Compile-time optimizations, OS scheduling

    • HW: Architectural improvements, VLSI technology

  • Reliability: Code/data duplication (HW or SW)

  • Power & Temperature

    • HW control mechanisms

    • Profile + recompile cycle


Modern design constraints

Modern Design Constraints

  • Compilers – “Compile once, run anywhere”

    • Cannot ship “MS Office for 3Q08 batch of Core2 3GHz, > 8GB RAM, BrandX power supply, located in high altitudes…”

  • OSes – Scheduling algorithms don’t scale to modern architectures

  • Microarchitecture – Limited window of application knowledge

    • Past must predict the future

  • Circuits – Guaranteed correctness, reliability,

    • Must design for the worst case


Tortola symbiotic optimization

Tortola: Symbiotic Optimization

  • Enable HW/SW Communication

  • What small changes can we make to HW/OS to enable collaborative solutions?

SW Applications

runtime traits

Binary Modifier

hardware feedback;

os load

scheduling hints

OS/HW


The power of virtualization

x86

Initially

x86

SWI

Eventually

HWI

The Power of Virtualization

  • No longer restricted to a fixed ISA

  • Reduce hardware complexity

    • No more backwards compatibility warts

    • Fix bugs after shipment

    • Reduce time to market for new architectures

SW Applications

Binary Modifier

HW


Tortola applications

Tortola Applications

  • Combine global program information with run-time feedback

    • System-specific power usage

    • Application-specific heat anomalies

    • Workload/input specific performance optimization

  • Two case studies

    • The di/dt problem – solved using hardware feedback and dynamic optimization

    • Heterogeneous multicore scheduling – solved using hardware feedback, OS feedback, and OS hints from the binary modifier


The di dt problem

The di/dt Problem

  • Voltage stability is important for reliability, performance

  • Low-power techniques have a negative side effect: current variation

  • ITRS cites noise management as a Grand Challenge for 5-10 year time frame

  • Dips(undershoots) in supply voltage – can cause incorrect values to be calculated or stored

  • Spikes (overshoots) in supply voltage – can cause reliability problems


Di dt solutions

Co-Designed MicroArch & SW Binary Modifier

di/dt Solutions

Software

MicroArch

Circuit-Level

Compiler Optimizations

Sensor/Actuator Mechanisms

Decoupling capacitors More Vdd Gnd pins on package


Sensor actuator mechanisms

Sensor-Actuator Mechanisms

  • On-chip voltage sensors detect abnormally high/low voltage levels

  • On-chip actuator then attempts to quickly raise/lower the processor’s current draw

    • Phantom firing

      • increases current (at the expense of power)

    • Resource throttling

      • reduces current (at the expense of performance)


Detecting imminent emergencies

Detecting Imminent Emergencies

Hard

Emergency

Soft

Emergency

Control

Threshold

1.05V

1.03V

1V

0.97V

0.95V

Operating

Voltage Range


Targeting mid frequency di dt

20 cycles

60 cycles

Minimum Voltage

Maximum Voltage

Minimum Voltage

Targeting Mid-Frequency Di/dt

  • Problematic:wide current spike

  • Worst case: pulse at the resonant frequency

Processor Current (A)

Processor Current (A)

*From:

Joseph et al.

HPCA 2003

Supply Voltage (V)

Supply Voltage (V)

Time (Cycles)

Time (Cycles)


A di dt stressmark

A di/dt Stressmark

  • But…Actuator engages every loop iteration degrading performance

  • Why not correct the problem in the code?

BEGIN_LOOP:

ldt$f1, ($4)

divt$f1, $f2, $f3

divt$f3, $f2, $f3

stt$f3, 8($4)

ldq$7, 8($4)

cmovne$31, $7, $3

stq$3, $(4)

stq$3, $(4)

stq$3, $(4)

stq$3, $(4)

JUMP BEGIN_LOOP

Sequential

Low Power

Parallel

High Power


Why use dynamic binary modification

Why use Dynamic Binary Modification?

  • Modify the instruction stream at run time

  • Much easier to react to an emergency than to predict one

  • Emergencies are processor dependent, but software should not be!

  • Enables run-time guarantees


Proposed solution

Proposed Solution

  • Leverage our additional software layer to supplement existing solutions

  • Microarchitectureprovides feedback to our software-based virtual layer

Altered

Executable

Binary

Modifier

VL

Executable

SW

HW

Sensor+Actuator Ext

Microprocessor


Required investigations

Required Investigations

  • Characterizing emergencies

    • How often do we see di/dt emergency loops?

    • Are emergencies usually code-based?

  • Communication between the microarchitecture and the virtual layer

    • What information should be passed to virtual layer during an emergency?

  • Fixing di/dtvia binary modification

    • Will existing techniques help?

    • New algorithms?


Last executed branch

Last-Executed Branch

Data suggests modifying a few code sequences will eliminate many voltage emergencies


Possible compiler optimizations

Possible Compiler Optimizations

  • Our goal is to

    • Smooth out current profile, or

    • Knock pulses off of the resonant frequency

  • Some existing options

    • Software pipelining, code motion, instruction padding

Executable

Apply

Optimizations

Altered

Executable

Binary

Modifier

Sensor+Actuator Ext’ns

Microprocessor


Loop unrolling sw pipelining

A

A

B

B

Iteration=1

A

A

B

B

Iteration=2

Software pipelining

smoothes profile

A

A

B

B

Iteration=3

Current

Loop Unrolling & SW Pipelining

A

A

B

B

Problematic

loop:

Current

A

A

A

A

B

B

B

B

Loop unrolling disrupts resonance pulse

Unrolled

loop:

Current


Unrolling the di dt stressmark

Unrolling the Di/dt Stressmark

H1

H

H2

L

L1

L2


Two case studies

Two Case Studies

  • The di/dt problem – solved using hardware feedback and dynamic optimization

  • Heterogeneous multicore scheduling – solved using hardware feedback, OS feedback, and hints from the binary modifier


Heterogeneous multicores

Heterogeneous Multicores

  • Process Heterogeneity

    • Process variation

  • Design Heterogeneity

    • Specialized processor cores

Today’s Multicore Designs

Future Multicore Designs


The challenge scheduling

The Challenge: Scheduling

  • Many OSes assume identical core resources

    • Bad assumption, even today (hyperthreading)

  • The OS may not have enough information to make the best scheduling decisions

    • depending on the type of heterogeneity

  • Ideal process-to-core mappings change dynamically

  • Should this be a task for the OS alone?


Our approach

Our Approach

  • Claim: OSes could benefit from scheduling hints

  • Solution: Combine historical performance data with application phase information

SW Applications

phase information

Binary Modifier

performance counter info;

os load

scheduling hints

OS/HW


Initial investigation

Initial Investigation

  • Hardware configuration

    • In-order core

    • Out-of-order core

  • Software configuration

    • Two applications executing (SPEC all combinations)

  • Migration indicators

    • IPC (from HW)

    • Phase (from SW) 1M cycle granularity

App1

App2

out of order

in order


Results

Results

  • Calculated ideal (omniscient) scheduling

  • Calculated random scheduling

  • Explored two migration heuristics

    • IPC-threshold scheduling

    • IPC-delta scheduling

  • Metric: Distance from ideal


Random scheduling

Random Scheduling

60%

40%

20%

0%

Distance from Ideal

10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Probability of Swapping at Each 1M Timeslice


Ipc threshold scheduling

IPC Threshold Scheduling

30

20

10

0

Percent Distance from Ideal

0 Backoff 0.5 Backoff 1.0 Backoff 1.5 Backoff 1.9 Backoff 2.0 Backoff

0.5 IPC

0.75 IPC

1.0 IPC

1.25 IPC

1.5 IPC

1.75 IPC

2.0 IPC


Ipc delta scheduling

IPC Delta Scheduling

30

20

10

0

Percent Distance from Ideal

0.05 0.1 0.15 0.2 0.3 0.4 0.5 0.75 1 1.25 1.5 1.75 2

Migration Trigger - IPC Change


Ongoing investigations

Ongoing Investigations

  • Varying heterogeneity

    • cache sizes

    • floating-point availability

  • Determining migration indicators

    • cache misses

  • Combinations

  • Other software feedback


Symbiotic optimization

Symbiotic Optimization

  • Cross-layer approaches can be powerful

  • Minor changes enable communication channels

  • Hardware feedback and binary modification can help solve the di/dt problem

  • Hardware feedback and program phases can guide OS scheduling decisions

  • The Tortola design can also target power reduction, temperature reduction, reliability, etc.

  • http://www.tortolaproject.com/


Course summary

Course Summary

  • Now you know …

  • Process VMs versus system VMs

  • Research issues in building process VMs

  • How to use process VMs for a variety of other research projects

  • Why cross-layer approaches can be powerful and how to use process VMs to get there

  • Thanks and enjoy the rest of your summer!


  • Login