Exploring Cross-Stack Energy Optimization: Fact or Fiction?
This paper by Kevin Skadron from the University of Virginia dives into the concept of cross-stack energy optimization. It discusses how information can be shared between layers—from circuits to microarchitecture—to enhance energy efficiency. The paper highlights both upward and downward adaptation within the stack, examining the interdependencies between software and hardware. It addresses the current challenges in sensor and actuator deployment as well as the need for innovative programming constructs. Ultimately, it advocates for leveraging layer information for efficient decision-making in system-wide energy management.
Presentation Transcript
Cross-stack Energy Optimization: Fact or Fiction? Kevin Skadron University of Virginia Dept. of Computer Science
Flavors of X-Stack • “Up” the stack • Circuits→Microarchitecture • HW→SW • e.g., sensors→throttling • Ideally, the application itself can adapt (algorithm, precision, QoS, etc.) • … • “Down” the stack • Often overlooked, but the OS and HW can benefit from application knowledge • SW→HW • e.g., access patterns, thread priorities, private/shared, etc. • GPU example: texture (API→driver→HW) • e.g., reconfigurable hardware
Up: Dymaxion: Index Transformation • SIMD/SIMT: Because SIMD requires contiguous access for efficiency, data layout/traversal needs to be transformed • User→middleware→(device driver)→(hardware) • feature[index] → feature’[transform(index)]
Code Example
Original Version
DEVICE
    __global__ void kmeans_kernel_orig(float *feature_d, ...) {
        int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
        /* ... */
        for (int l = 0; l < nclusters; l++) {
            index = point_id * nfeatures + l;
            ... feature_d[index] ...
        }
    }
HOST
    cudaMemcpy(feature_d, feature, …);
    kmeans_kernel_orig<<<dimGrid, dimBlock>>>(feature_d, ...);
Dymaxion Version
DEVICE
    __global__ void kmeans_kernel_map(float *feature_remap, ...) {
        int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
        /* ... */
        for (int l = 0; l < nclusters; l++) {
            index = point_id * nfeatures + l;
            ... feature_remap[transform_row2col(index, npoints, nfeatures)] ...
        }
    }
HOST
    map_row2col(feature_remap, feature, …);
    kmeans_kernel_map<<<dimGrid, dimBlock>>>(feature_remap, ...);
Down: Lack of Sensors and Actuators • Feedback control: sensors and actuators • Chicken and egg problem • Lack of sensors is a big problem now • Can’t control what we can’t measure • Performance monitors not designed for this • Too coarse-grained, can’t monitor enough • Moving in the right direction • Need more actuators, too • Currently mainly have just DVFS and scheduling/placement • Some HDDs offer DRPM • Reconfiguration is a form of actuation, too
Wish List • Sensors/constraint communication • Up: Structure occupancies, interval behavior, fine-grained/instruction-level responsiveness, physical location, etc. • Expand perf-counter system, add informing loads (ISCA ~00), allow HW to query microarchitectural state, expose chip/rack/datacenter/geographic location, etc. • Down: Access patterns, private/shared, priority/performance expectations, etc. • Requires new programming constructs and new (possibly privileged) instructions • Actuators • Many system components hard to control • e.g., HDDs, DRAM, power supply • Control memory behavior, light sleep modes • Ordering/buffering/prefetching/contention • More reconfigurability, coarse-grained architectures • Why use cache when you can use scratchpad; registers, routed network when you can do direct producer-consumer, etc.?
Summary • Turn fiction into non-fiction! • Some good ideas already in papers • Revisit: why weren’t they adopted? • New ideas: • Imagine ideal sensing and actuation • Show a promising control/adaptation/reconfiguration algorithm • Propose plausible sensors/actuators
What is “Cross Stack”? • Layer X adapts based on information in Layer Y • Example: OS uses hardware info • e.g., temp sensors, structure occupancies, # pending cache misses guide thread co-location • Or hardware uses OS info • e.g., thread priorities, task deadlines guide the hardware DVFS policy • Important—leverage information across layers to make globally efficient decisions • Ultimately: break down costly interfaces • Unnecessary copies, extra state, redundant computation • Different from energy optimization happening independently in multiple layers • e.g., hardware DVFS (based on instruction flow) + OS DVFS (based on task deadlines) • Risky—control loops can fight
Fact or Fiction • Should be fact! • But mostly fiction • Can’t measure power/energy effectively in many systems and components • Control options are typically high-overhead • DVFS, task migration, etc. • Most solutions are single-layer • Baby steps • Cluster/datacenter front end monitors per-node activity, temperature—schedules accordingly • Autotuning • Reducing copies