


Cross-stack Energy Optimization: Fact or Fiction?

Kevin Skadron

University of Virginia

Dept. of Computer Science


Flavors of X-Stack

  • “Up” the stack

    • CircuitsMicroarchitecture

    • HWSW

      • eg, sensorsthrottling

      • Ideally, application itself can adapt (algorithm, precision, QoS, etc.)

  • “Down” the stack

    • Often overlooked, but OS, HW can benefit from application knowledge

    • SWHW

      • eg, access patterns, thread priorities, private/shared, etc.

        • GPU example: texture (APIdriverHW)

      • eg, reconfigurable hardware


Up: Dymaxion: Index Transformation

  • SIMD/SIMT: Because SIMD requires contiguous access for efficiency, data layout/traversal needs to be transformed

  • Usermiddleware(device driver)(hardware)

[Figure: an access to feature[index] is remapped to feature’[transform(index)]]


Code Example

Original Version

DEVICE

__global__ void kmeans_kernel_orig(float *feature_d, ...) {
    int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
    /* ... */
    for (int l = 0; l < nclusters; l++) {
        /* row-major index into the feature array */
        index = point_id * nfeatures + l;
        ... feature_d[index] ...
    }
}

HOST

cudaMemcpy(feature_d, feature, …);
kmeans_kernel_orig<<<dimGrid, dimBlock>>>(feature_d, ...);

Dymaxion Version

DEVICE

__global__ void kmeans_kernel_map(float *feature_remap, ...) {
    int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
    /* ... */
    for (int l = 0; l < nclusters; l++) {
        /* same logical index, translated to the remapped layout */
        index = point_id * nfeatures + l;
        ... feature_remap[transform_row2col(index, npoints, nfeatures)] ...
    }
}

HOST

map_row2col(feature_remap, feature, …);
kmeans_kernel_map<<<dimGrid, dimBlock>>>(feature_remap, ...);
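The slides do not show the bodies of transform_row2col or map_row2col. The sketch below is one plausible host-side reading, written in plain C++ so it can be run without a GPU. It assumes the original array is row-major (index = point * nfeatures + feature) and the remapped array is column-major (feature * npoints + point). The signatures and return types are illustrative assumptions and may differ from the real Dymaxion API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed semantics (not taken from the slides): translate a row-major
// index (point * nfeatures + f) into the matching column-major index
// (f * npoints + point).
std::size_t transform_row2col(std::size_t index,
                              std::size_t npoints,
                              std::size_t nfeatures) {
    assert(index < npoints * nfeatures);
    const std::size_t point = index / nfeatures;  // row in the original layout
    const std::size_t f = index % nfeatures;      // column in the original layout
    return f * npoints + point;
}

// Assumed semantics: build the remapped copy so that
// remap[transform_row2col(i, ...)] == feature[i] for every i.
std::vector<float> map_row2col(const std::vector<float>& feature,
                               std::size_t npoints,
                               std::size_t nfeatures) {
    assert(feature.size() == npoints * nfeatures);
    std::vector<float> remap(feature.size());
    for (std::size_t i = 0; i < feature.size(); ++i) {
        remap[transform_row2col(i, npoints, nfeatures)] = feature[i];
    }
    return remap;
}
```

Under this layout, threads handling points p and p + 1 read the same feature from adjacent addresses (f * npoints + p and f * npoints + p + 1), which is the contiguous access pattern SIMD/SIMT hardware favors.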


Down: Lack of Sensors and Actuators

  • Feedback control: sensors and actuators

  • Chicken and egg problem

  • Lack of sensors is a big problem now

    • Can’t control what we can’t measure

    • Performance monitors not designed for this

      • Too coarse-grained, can’t monitor enough

    • Moving in the right direction

  • Need more actuators, too

    • Currently we mainly have only DVFS and scheduling/placement

    • Some HDDs offer DRPM (dynamic RPM)

    • Reconfiguration is a form of actuation, too
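To make the sensor/actuator pairing concrete, here is a minimal sketch of a feedback loop: a temperature reading comes in and a throttle decision goes out. The ThrottleController type, its two thresholds, and the Celsius units are illustrative assumptions, not something the slides specify. Two thresholds are used so the actuator does not flip on every small fluctuation around a single set point.

```cpp
#include <cassert>

// Hypothetical two-threshold (hysteresis) controller: the sensor input is a
// temperature reading, the actuator output is a throttle on/off decision.
class ThrottleController {
public:
    ThrottleController(double high_c, double low_c)
        : high_c_(high_c), low_c_(low_c), throttled_(false) {
        assert(low_c < high_c);
    }

    // Feed one sensor reading; returns true while throttling should be active.
    bool update(double temp_c) {
        if (temp_c >= high_c_) {
            throttled_ = true;   // too hot: engage the actuator
        } else if (temp_c <= low_c_) {
            throttled_ = false;  // cooled down enough: release it
        }
        // Between the two thresholds the previous decision is kept.
        return throttled_;
    }

private:
    double high_c_;
    double low_c_;
    bool throttled_;
};
```

Even this trivial loop depends on a sensor that is accurate and fast enough, and an actuator that is cheap enough, to run at the granularity of interest, which is the gap the slide points at.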


Wish List

  • Sensors/constraint communication

    • Up: Structure occupancies, interval behavior, fine-grained/instruction-level responsiveness, physical location, etc.

      • Expand perf-counter system, add informing loads (ISCA ~00), allow HW to query microarchitectural state, expose chip/rack/datacenter/geographic location, etc.

    • Down: Access patterns, private/shared, priority/performance expectations, etc.

      • Requires new programming constructs and new (possibly privileged) instructions

  • Actuators

    • Many system components hard to control

      • e.g., HDDs, DRAM, power supply

    • Control memory behavior, light sleep modes

      • Ordering/buffering/prefetching/contention

    • More reconfigurability, coarse-grained architectures

      • Why use a cache when you can use a scratchpad, or registers and a routed network when you can do direct producer-consumer communication, etc.?


Summary

  • Turn fiction into non-fiction!

  • Some good ideas already in papers

    • Revisit: why weren’t they adopted?

  • New ideas:

    • Imagine ideal sensing and actuation

    • Show a promising control/adaptation/reconfiguration algorithm

    • Propose plausible sensors/actuators



What is “Cross Stack”?

  • Layer X adapts based on information in Layer Y

    • Example: OS uses hardware info

      • e.g., temp sensors, structure occupancies, # pending cache misses guide thread co-location

    • Or hardware uses OS info

      • e.g., thread priorities, task deadlines guide hardware DVFS policy

    • Important: leverage information across layers to make globally efficient decisions

    • Ultimately: break down costly interfaces

      • Unnecessary copies, extra state, redundant computation

  • Different from energy optimization happening independently in multiple layers

    • e.g., hardware DVFS (based on instruction flow) + OS DVFS (based on task deadlines)

    • Risky: control loops can fight
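As a rough illustration of a single cross-layer decision, replacing two independent loops, the sketch below picks a frequency using an OS-supplied deadline together with a remaining-work estimate. The function name, the MHz and microsecond units, and the assumption that the level list is sorted in ascending order are all illustrative; real DVFS interfaces differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical policy: return the lowest frequency (MHz) that still finishes
// the remaining work before the deadline. levels_mhz is assumed to be sorted
// in ascending order. With f in MHz, cycles / f gives time in microseconds.
double pick_frequency_mhz(const std::vector<double>& levels_mhz,
                          double cycles_remaining,
                          double time_to_deadline_us) {
    assert(!levels_mhz.empty());
    for (std::size_t i = 0; i < levels_mhz.size(); ++i) {
        const double finish_time_us = cycles_remaining / levels_mhz[i];
        if (finish_time_us <= time_to_deadline_us) {
            return levels_mhz[i];  // slowest level that meets the deadline
        }
    }
    // No level meets the deadline: run as fast as possible.
    return levels_mhz.back();
}
```

For example, with levels of 800, 1600, and 2400 MHz, 1.6 million cycles remaining, and 1500 µs to the deadline, 800 MHz would need 2000 µs, so the sketch selects 1600 MHz.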


Fact or Fiction

  • Should be fact!

  • But mostly fiction

    • Can’t measure power/energy effectively in many systems and components

    • Control options are typically high-overhead

      • DVFS, task migration, etc.

    • Most solutions are single-layer

  • Baby steps

    • Cluster/datacenter front end monitors per-node activity and temperature, then schedules accordingly

    • Autotuning

    • Reducing copies
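The front-end scheduling step can be sketched as a simple placement rule over per-node telemetry. The NodeStatus fields, the temperature limit, and the "least utilized among cool nodes" policy are assumptions made for illustration; the slides do not describe a specific policy.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-node telemetry reported to the front end.
struct NodeStatus {
    double utilization;  // 0.0 (idle) to 1.0 (fully busy)
    double temp_c;       // current temperature in degrees Celsius
};

// Returns the index of the least utilized node whose temperature is below
// temp_limit_c, or -1 if every node is at or above the limit.
int pick_node(const std::vector<NodeStatus>& nodes, double temp_limit_c) {
    int best = -1;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (nodes[i].temp_c >= temp_limit_c) {
            continue;  // skip nodes that are already too hot
        }
        if (best < 0 ||
            nodes[i].utilization <
                nodes[static_cast<std::size_t>(best)].utilization) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

A caller would place the next job on the returned node and treat -1 as a signal to queue or defer the work.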

