Cross-stack Energy Optimization: Fact or Fiction?


Kevin Skadron, University of Virginia, Dept. of Computer Science

Presentation Transcript
Cross-stack Energy Optimization: Fact or Fiction?

Kevin Skadron

University of Virginia

Dept. of Computer Science

Flavors of X-Stack
  • “Up” the stack
    • CircuitsMicroarchitecture
    • HWSW
      • eg, sensorsthrottling
      • Ideally, application itself can adapt (algorithm, precision, QoS, etc.)
  • “Down” the stack
    • Often overlooked, but OS, HW can benefit from application knowledge
    • SWHW
      • eg, access patterns, thread priorities, private/shared, etc.
        • GPU example: texture (APIdriverHW)
      • eg, reconfigurable hardware
Up: Dymaxion: Index Transformation
  • SIMD/SIMT: Because SIMD requires contiguous access for efficiency, data layout/traversal needs to be transformed
  • Usermiddleware(device driver)(hardware)

[Figure: accesses to feature[index] are remapped to feature’[transform(index)]]

Code Example

Original Version

DEVICE

__global__ void kmeans_kernel_orig(float *feature_d, ...) {
    int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
    /* ... */
    for (int l = 0; l < nclusters; l++) {
        /* row-major index: one point's features are contiguous */
        index = point_id * nfeatures + l;
        ...feature_d[index]...
    }
}

HOST

cudaMemcpy(feature_d, feature, …);
kmeans_kernel_orig<<<dimGrid, dimBlock>>>(feature_d, ...);

Dymaxion Version

DEVICE

__global__ void kmeans_kernel_map(float *feature_remap, ...) {
    int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
    /* ... */
    for (int l = 0; l < nclusters; l++) {
        index = point_id * nfeatures + l;
        /* same logical index, translated into the remapped layout */
        ...feature_remap[transform_row2col(index, npoints, nfeatures)]...
    }
}

HOST

/* replaces the plain cudaMemcpy: remaps the layout on the way to the device */
map_row2col(feature_remap, feature, …);
kmeans_kernel_map<<<dimGrid, dimBlock>>>(feature_remap, ...);
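
The Dymaxion version calls transform_row2col and map_row2col without showing their bodies. The following is a minimal host-side sketch in plain C of what a row-major to column-major remapping could look like. The signatures and bodies are assumptions made for illustration, not Dymaxion's actual implementation, and the real map_row2col presumably also handles the transfer to the device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: translate a row-major index (point-major layout,
 * index = point * nfeatures + feature) into the matching column-major
 * index (feature-major layout, feature * npoints + point). */
static size_t transform_row2col(size_t index, size_t npoints, size_t nfeatures)
{
    assert(index < npoints * nfeatures);
    size_t point = index / nfeatures;
    size_t feature = index % nfeatures;
    return feature * npoints + point;
}

/* Hypothetical sketch: copy src (row-major) into dst (column-major).
 * Both buffers must hold npoints * nfeatures floats and must not overlap. */
static void map_row2col(float *dst, const float *src,
                        size_t npoints, size_t nfeatures)
{
    for (size_t i = 0; i < npoints * nfeatures; i++)
        dst[transform_row2col(i, npoints, nfeatures)] = src[i];
}
```

With npoints = 4 and nfeatures = 3, feature 0 of points 0 and 1 sits at row-major indices 0 and 3 but at remapped indices 0 and 1, so threads that handle adjacent points read adjacent floats.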

Down: Lack of Sensors and Actuators
  • Feedback control: sensors and actuators
  • Chicken and egg problem
  • Lack of sensors is a big problem now
    • Can’t control what we can’t measure
    • Performance monitors not designed for this
      • Too coarse-grained, can’t monitor enough
    • Moving in the right direction
  • Need more actuators, too
    • Currently mainly have just DVFS and scheduling/placement
    • Some HDDs offer DRPM
    • Reconfiguration is a form of actuation, too
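
The sensor/actuator pairing above can be sketched as a single feedback step: one temperature reading in, one DVFS level out. This is only an illustrative threshold controller with a hysteresis band. The function name, the level numbering, and the parameters are assumptions, not something the slides specify.

```c
#include <assert.h>

/* Hypothetical sensor-to-actuator step. Level 0 is the slowest DVFS
 * setting and max_level the fastest. */
static int next_dvfs_level(int level, int max_level, double temp_c,
                           double limit_c, double hysteresis_c)
{
    assert(max_level >= 0 && level >= 0 && level <= max_level);
    assert(hysteresis_c >= 0.0);
    if (temp_c > limit_c && level > 0)
        return level - 1;   /* too hot: throttle one step */
    if (temp_c < limit_c - hysteresis_c && level < max_level)
        return level + 1;   /* comfortably cool: speed up one step */
    return level;           /* inside the band, or already at a bound */
}
```

The hysteresis band keeps the controller from oscillating between two levels when the reading hovers near the limit. A coarse or slow sensor makes this kind of loop much less useful, which is the point of the sensor complaints above.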
Wish List
  • Sensors/constraint communication
    • Up: Structure occupancies, interval behavior, fine-grained/instruction-level responsiveness, physical location, etc.
      • Expand perf-counter system, add informing loads (ISCA ~00), allow HW to query microarchitectural state, expose chip/rack/datacenter/geographic location, etc.
    • Down: Access patterns, private/shared, priority/performance expectations, etc.
      • Requires new programming constructs and new (possibly privileged) instructions
  • Actuators
    • Many system components hard to control
      • e.g., HDDs, DRAM, power supply
    • Control memory behavior, light sleep modes
      • Ordering/buffering/prefetching/contention
    • More reconfigurability, coarse-grained architectures
      • Why use cache when you can use scratchpad; registers, routed network when you can do direct producer-consumer, etc.?
Summary
  • Turn fiction into non-fiction!
  • Some good ideas already in papers
    • Revisit: why weren’t they adopted?
  • New ideas:
    • Imagine ideal sensing and actuation
    • Show a promising control/adaptation/reconfiguration algorithm
    • Propose plausible sensors/actuators
What is “Cross Stack”?
  • Layer X adapts based on information in Layer Y
    • Example: OS uses hardware info
      • e.g., temp sensors, structure occupancies, # pending cache misses guide thread co-location
    • Or hardware uses OS info
      • e.g., thread priorities, task deadlines guide hardware DVFS policy
    • Important—leverage information across layers to make globally efficient decisions
    • Ultimately: break down costly interfaces
      • Unnecessary copies, extra state, redundant computation
  • Different from energy optimization happening independently in multiple layers
    • e.g., hardware DVFS (based on instruction flow) + OS DVFS (based on task deadlines)
    • Risky—control loops can fight
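
As an illustration of hardware using OS information, the sketch below picks the slowest frequency that still meets a task deadline supplied by the OS. It is a simplified model under stated assumptions: the function name and interface are hypothetical, remaining work is known in cycles, and switching overhead is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "down the stack" policy: the OS passes the remaining work
 * and the time to the deadline, and the DVFS logic picks the slowest
 * frequency that still finishes in time. freqs_mhz must be sorted in
 * ascending order. */
static unsigned pick_freq_mhz(unsigned long long cycles_left,
                              unsigned long long us_to_deadline,
                              const unsigned *freqs_mhz, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        /* A clock of f MHz executes f cycles per microsecond. */
        if ((unsigned long long)freqs_mhz[i] * us_to_deadline >= cycles_left)
            return freqs_mhz[i];
    }
    return freqs_mhz[n - 1];   /* deadline unreachable: run at the maximum */
}
```

Because a single policy sees both the deadline and the frequency table, there is only one control loop, which avoids the fighting-loops risk noted above.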
Fact or Fiction
  • Should be fact!
  • But mostly fiction
    • Can’t measure power/energy effectively in many systems and components
    • Control options are typically high-overhead
      • DVFS, task migration, etc.
    • Most solutions are single-layer
  • Baby steps
    • Cluster/datacenter front end monitors per-node activity, temperature—schedules accordingly
    • Autotuning
    • Reducing copies
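
The datacenter front-end baby step can be sketched as a placement function: given per-node temperature and activity, choose where the next job goes. The struct, field names, and the "coolest node under a utilization cap" rule are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node report collected by the front end. */
struct node_status {
    double temp_c;       /* reported node temperature */
    double utilization;  /* fraction of capacity in use, 0.0 to 1.0 */
};

/* Return the index of the coolest node whose utilization is below
 * max_util, or -1 if every node is at or above the cap. */
static int pick_node(const struct node_status *nodes, size_t n, double max_util)
{
    assert(nodes != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].utilization >= max_util)
            continue;   /* node is too busy to take more work */
        if (best < 0 || nodes[i].temp_c < nodes[best].temp_c)
            best = (int)i;
    }
    return best;
}
```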