
Presentation Transcript


  1. Introduction • Minimizing energy consumption is crucial for computing systems • Battery-operated systems • Data centers • A wide variety of techniques has been proposed • Static (offline) optimizations • Compiler optimizations • Accelerators • Dynamic (online) optimizations • OS scheduling • C-state management in Intel processors

  2. Introduction • Static techniques • Can afford to take a global view of the problem • More complex algorithms can be used • Dynamic techniques • Fast – low overhead • Have more information about the current state of the system • Hybrid techniques

  3. Our contributions • Static + dynamic optimizations for energy efficiency • Exploiting workload variation in DVFS capable systems • Assuring application-level correctness for programs • Fine-grained accelerator integration with processors

  4. Energy efficient multiprocessor task scheduling under input-dependent variation

  5. Outline • Introduction and motivation • Related work • Problem formulation • Proposed algorithm • Experimental results

  6. Introduction and Motivation • Embedded systems are typically required to meet a fixed performance target • Example – frame rate for decoding of streaming video • A system that has better performance provides no significant benefit • Dynamic Voltage + Frequency Scaling (DVFS) is an effective technique for reducing dynamic energy consumption of processors • Quadratic dependence of energy on voltage (almost) • Linear dependence of performance (frequency) on voltage (almost)
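The (almost) quadratic and linear relations above can be sketched in code. The linear V-f proportionality and the unit constants below are simplifying assumptions for illustration, not the exact processor model used in this work:

```python
# Illustrative DVFS model (assumption: V scales linearly with f,
# dynamic energy per cycle scales with V^2).
def dynamic_energy(cycles, freq, c_eff=1.0, v_nominal=1.0, f_nominal=1.0):
    """Dynamic energy ~ C * V^2 per cycle, with V proportional to f."""
    v = v_nominal * (freq / f_nominal)
    return cycles * c_eff * v * v

# Halving the frequency (and hence the voltage) quarters the energy
# per cycle, at the cost of doubled execution time.
e_full = dynamic_energy(100, 1.0)   # 100.0
e_half = dynamic_energy(100, 0.5)   # 25.0
```

This is why running slower at lower voltage saves energy whenever slack exists before the deadline.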

  7. Introduction • DVFS problem: Given a task graph G (edges represent precedence constraints) and a latency constraint L • Determine the schedule and voltage assignment for each task to minimize energy consumption • Traditional techniques consider the worst-case computation time of every task • Ensures that the latency constraint is satisfied • Real-world applications exhibit significant variation in execution times

  8. Example: Huffman Encoder in JPEG • Probability distribution of the execution time of the Huffman encoder • Shows significant variation in execution time • Variation in execution time can be exploited to further minimize energy consumption [Figure: probability vs. # cycles distribution]

  9. Example – Energy Consumption in Worst-case and Typical Case • CMOS-based equations model the relation between energy, frequency and voltage • Workload – # cycles that a task takes to complete • Input dependent • 4-processor system with a latency constraint of 300 time units • Worst-case scheduling – clock period of 1 time unit for each task • Energy consumption 400*C • Typical case – 75 cycles per task • Energy consumption 168.75*C • Potentially 58% reduction in energy!
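The slide's figures are reproducible under the same approximations, assuming V ∝ f = 1/clock period, per-task energy W·C·V², and that in the typical case each task's 100-time-unit slot is stretched over its 75 cycles (these modeling choices are inferred, not stated explicitly on the slide):

```python
def task_energy(cycles, clock_period, c_eff=1.0):
    f = 1.0 / clock_period         # performance ~ linear in V
    return cycles * c_eff * f * f  # energy ~ quadratic in V

# Worst case: 4 tasks, 100 cycles each, clock period 1 time unit
worst = 4 * task_energy(100, 1.0)            # 400*C
# Typical case: 75 cycles in the same slot -> clock period 4/3
typical = 4 * task_energy(75, 100.0 / 75.0)  # 168.75*C
savings = 1 - typical / worst                # ~0.58
```

The 58% figure falls out of the quadratic dependence: 75/100 of the cycles at (3/4)^2 of the energy per cycle.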

  10. Outline • Introduction and motivation • Related work • Problem formulation • Proposed algorithm • Experimental results

  11. Related Work • Single processor systems • List scheduling based heuristics – Gruian 2003 • Minimizing expected energy consumption by exhaustive search – Leung 2005, Xu 2005, Xu 2007 • Convex optimization – Andrei DATE 2005 • Multiprocessor systems • Dynamic slack reclamation – Zhu 2001, Chen 2004 • Partitioning for expected energy minimization – Xian 2007 • Schedule table based • For conditional task graphs – Shin 2003, Wu 2003 • Restricted to conditional task graphs • Convex optimization – Andrei DATE 2005 • Exponential enumeration if applied to multiprocessor systems • Dynamic programming – Qiu DATE 2007 • Exponential enumeration

  12. Exploiting Variation • Schedule table • Provides a list of scenarios and how to scale voltage/frequency when a particular scenario becomes active • How to build the schedule table? • Enumerate all possible scenarios and optimize separately • Enumerate all possible combinations of the number of cycles consumed by tasks • Number of scenarios explodes very quickly! • For a 10-node task graph with 4 possible execution times for each task, the number of scenarios is 4^10 • Our contribution – a method to build the schedule table efficiently without exponential enumeration • Optimal for task chains

  13. Processor and Application Model • Processor model • Homogeneous multiprocessor system • Voltage of each processor can be tuned independently in the range [V_lower, V_upper] • Use a quadratic approximation to model the relation between energy and frequency • Application model • Task graph G with nodes representing tasks • Edges represent precedence constraints • Mapping of tasks to processors assumed to be given • If not, use a priority-based mapping heuristic

  14. Idea – Task Chains • What would an (imaginary) Oracle do? • For tasks 4 and 5, the voltage to use does not depend on the individual cycles consumed by tasks 1, 2 and 3 • It depends only on the total number of cycles consumed by the sub-chain • Task 4 will start at the same time for a given value of the sub-chain length • No need to enumerate #cycles for individual tasks [Figure: chain of tasks 1→2→3→4→5; two scenarios with per-task cycle counts (70, 100, 120) and (90, 60, 140) both total 290, so task 4 starts at the same time in either case]

  15. Exploiting Variation – Schedule Table • W(v) • Number of cycles for v to execute • Different from execution time (which can vary with voltage) • Cycles elapsed – CE(v) • Number of cycles elapsed when a task v is ready to start • Schedule Table • One row for each task • Each entry in a row is a tuple of the form <ce, cp> • cp is the clock period of task v when the value of CE(v) is ce • Constructed statically (offline) • At run-time, a table look-up is performed to determine the clock period to use for a particular task • Goal: Construct a schedule table such that the average energy consumption of the system for the given task graph is minimized.
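A minimal sketch of the run-time table look-up, assuming the table stores one list of <ce, cp> tuples per task, sorted by ce (the task name, the entries, and the clamping behavior below are illustrative, not taken from the paper):

```python
import bisect

# Hypothetical schedule table: one sorted list of (ce, cp) tuples per task.
schedule_table = {
    "v4": [(150, 1.0), (175, 2.0), (200, 3.0)],  # illustrative entries
}

def clock_period(task, cycles_elapsed):
    """Run-time lookup: pick the entry with the largest ce <= CE(v)."""
    entries = schedule_table[task]
    i = bisect.bisect_right(entries, (cycles_elapsed, float("inf"))) - 1
    return entries[max(i, 0)][1]  # clamp below the first entry
```

Because the table is built offline, the run-time cost is a single binary search per task dispatch.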

  16. Example – Schedule Table • Latency constraint of 650 time units • Definitions: cp(v) – clock period for task v; W(v) – #cycles for task v; CE(v) – cycles elapsed when v is ready • v1: Start=0, cp=2, W=75, CE=0, Finish=150 • v2: Start=150, cp=3, W=100, CE=75, Finish=450 • v3: Start=150, cp=3, W=100, CE=75, Finish=450 • v4: Start=450, cp=2, W=75, CE=175, Finish=600 • Each table entry is a tuple <ce, cp>

  17. Constructing the Schedule Table • Based on the J. Cong, W. Jiang and Z. Zhang ASP-DAC'07 formulation • Time budgeting for operations to minimize energy consumption in high-level synthesis • Latency constraint • Variable definitions • b_i is the latency of task i • s_i is the start time of task i • cp(i) is the clock period to use while running task i • Convex optimization with linear constraints • Does not consider variation in the latency of individual operations

  18. Constructing the Schedule Table • Idea: instead of maintaining a single start and finish time for every task, maintain a list of start and finish times • One start time and clock period for each distinct value of CE(v) – s_{v,j}, cp_{v,j} • One finish time for each distinct value of CE(v) + W(v) – f_{v,j} • CE(v) helps decide the precedence constraints between the finish times of a task and the start times of its successors • Precedence constraints only between finish-time variables and start-time variables associated with permitted combinations of workload and CE(v) • Avoids enumeration of all possible workloads

  19. Constructing the Schedule Table • Each task maintains a list of start and finish times • Each start time (and finish time) is associated with the number of cycles elapsed • Constraints are imposed only on valid combinations of start and finish times [Figure: tasks v1–v4; f1 = finish time of v1 when v1 takes 75 cycles, f2 = finish time when it takes 100 cycles; s1 = start time of v2 when v1 takes 75 cycles, s2 = start time when v1 takes 100 cycles; precedence constraints s1 ≥ f1 and s2 ≥ f2; no constraint is needed between f2 and s1, so v2 can start earlier if v1 takes 75 cycles]

  20. Constructing the Schedule Table • Determine the valid combinations of CE(v) for every pair of tasks connected by an edge • Precedence constraint s_{v,j} ≥ f_{u,m}, where s_{v,j} is the start time of task v when CE(v) is ce_{v,j}, and f_{u,m} is the finish time of task u when CE(u) is ce_{u,k} and W(u) is w_{u,l}, imposed only when ce_{v,j} ≥ ce_{u,k} + w_{u,l} (a valid combination) • Objective function: average energy consumption
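The pruning of precedence constraints to valid combinations can be sketched as follows; the function and argument names are hypothetical, not from the paper:

```python
# Sketch: generate precedence constraints only for valid combinations of
# elapsed cycles and workloads, rather than all workload vectors.
def precedence_constraints(ce_values_u, workloads_u, ce_values_v):
    """Yield (j, k, l) triples requiring s[v][j] >= f[u][k][l],
    imposed only when ce_v[j] >= ce_u[k] + w_u[l] (a valid combination)."""
    constraints = []
    for j, ce_v in enumerate(ce_values_v):
        for k, ce_u in enumerate(ce_values_u):
            for l, w_u in enumerate(workloads_u):
                if ce_v >= ce_u + w_u:
                    constraints.append((j, k, l))
    return constraints

# e.g. CE(u) in {0}, W(u) in {75, 100}, CE(v) in {75, 100}:
# v's start at CE=75 is constrained only by u finishing within 75 cycles.
combos = precedence_constraints([0], [75, 100], [75, 100])
```

With K candidate values per task this yields at most K^2 constraints per edge, matching the complexity claim on slide 22.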

  21. Determining the Values of CE(v) • To keep the problem size from exploding, we keep a constant number of values (K) of CE(v) at each task • Profiling to determine the probability distribution of the workload of a task v and of CE(v) • Heuristics to determine the values of CE(v) to use at each node • Divide the range of CE(v) into K equal parts • Divide the area under the probability vs. CE(v) curve into K equal regions [Figure: distribution of CE(v) split into K=5 equal-probability regions; probability vs. # cycles]
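The two heuristics can be sketched directly, assuming `samples` is the profiled list of CE(v) observations; the equal-area variant is implemented here as a simple quantile split (an illustrative stand-in for whatever binning the paper actually uses):

```python
# Pick K representative CE(v) values from profiled samples.
def equal_range(samples, k):
    """Heuristic 1: divide the range of CE(v) into K equal parts."""
    lo, hi = min(samples), max(samples)
    step = (hi - lo) / k
    return [lo + step * (i + 1) for i in range(k)]

def equal_probability(samples, k):
    """Heuristic 2: divide the area under the probability curve into
    K equal regions (quantile-style split of the sorted samples)."""
    s = sorted(samples)
    return [s[min(len(s) * (i + 1) // k, len(s) - 1)] for i in range(k)]
```

The second heuristic places more representative values where the distribution has mass, which matters for skewed workloads like the Huffman encoder's.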

  22. Complexity • No more than K values of CE(v) per node • Number of constraints • Up to K^2 precedence constraints per edge • Up to K^2 latency constraints per task • O(K^2(m+n)) linear constraints • Number of variables • Up to K start-time, clock-period and finish-time variables per task • O(Kn) variables • Corresponds to the size of the table to be stored • Convex objective function • Solved in polynomial time

  23. Results – Random Task Graphs • Random task graphs generated by TGFF • Compared to • Greedy, dynamic slack reclamation algorithm • Oracle which can correctly predict workload for each task (before execution) • 15% worse (on average) than Oracle • 20% better than dynamic slack reclamation technique

  24. Real-world Applications • Experimentation methodology • SESC+Wattch for energy of processor cores – 90nm • CACTI for caches • Energy values for ALU, decoder, etc. obtained by scaling the 180nm values provided by Wattch • CACTI provides energy values for 90nm SRAM-based array structures in the CPU • FIFOs for communication between processors • Similar to the Fast Simplex Links provided by Xilinx • Processors modeled similar to the Intel XScale • 7 voltage levels with speeds varying from 100MHz to 800MHz

  25. MJPEG Encoder – Variation • Only the Huffman encoder module shows variation • Unpredictable variation

  26. MJPEG Encoder - Results • Only 4% energy savings (because variation is low) • 15% energy savings when workload can be predicted

  27. Results – MPEG-4 Decoder • Main components • Parser (P), Copy-Controller (CC), Inverse-DCT (IDCT), Motion Compensation (MC) and Texture Update (TU) • IDCT shows no variation • Up to 6 MC and 6 IDCT tasks per macroblock • Task graph unrolled • Performance constraint of 20 frames/s

  28. MPEG-4 Variation • CC and MC show significant variation

  29. MPEG-4 Decoder – Results • Comparison with the dynamic slack reclamation algorithm • Up to 20% savings in energy over dynamic slack reclamation • We measure the effect of the number of values in the schedule table

  30. Summary • Exploiting variation in execution time provides significant opportunity for energy minimization • Schedule table based approach • Construction of schedule table in polynomial time • Formulated as convex optimization problem with polynomial number of linear constraints • Optimal for certain special graphs – chains and trees • Average of 20% improvement over dynamic slack reclamation algorithm • Only 15% away from Oracle method • 20% energy saving for MPEG-4 decoder compared to dynamic slack reclamation algorithm

  31. Assuring application-level correctness against soft errors

  32. Motivation • Soft errors – issue for correct operation of CMOS circuits • Problem becomes more severe – ITRS 2009 • Smaller device sizes • Low supply voltages • Effect of soft errors on circuits • Karnik 2004, Nguyen 2003 • Effect of soft errors on software and processors • Li et al 2005, Wang et al 2004

  33. Motivation • Traditional notion of correctness • Every last bit of every variable in a program should be correct • Referred to as numerical correctness • Application-level correctness • Several applications can tolerate a degree of error • Image viewers, video decoding, etc. • However, there exist critical instructions even in such applications • Example: the state machine in a video decoder

  34. Motivation • Goal: Detect all “critical” instructions in the program • Protect “critical” instructions in the program against soft errors • Using duplication

  35. Outline • Motivation • Definition of critical instructions • Program representation • Static analysis to detect critical instructions • Profiling and runtime monitoring • Results

  36. Outline • Motivation • Definition of critical instructions • Program representation • Static analysis to detect critical instructions • Profiling and runtime monitoring • Results

  37. Defining critical instructions • Elastic outputs – program outputs that can tolerate a certain amount of error • Media applications – image, video, etc. • Heuristics – support vector machine • Characterizing the quality of elastic outputs – fidelity metric • Examples: PSNR (peak signal-to-noise ratio) for JPEG, bit error rate

  38. Defining critical instructions • Given application A: • I is the input to the application • A set of outputs Oc – numerical correctness required • A set of elastic outputs O • Fidelity metric F(I,O) for elastic outputs • T – threshold for acceptable output • An execution of A is said to satisfy application-level correctness if: • All outputs ∈ Oc are numerically correct • F(I,O) ≥ T for elastic outputs • Nmin – the minimum number of elements of O that need to be erroneous for F(I,O) to fall below T

  39. Example: JPEG decoder • PSNR of 35dB is assumed to be good quality • Corresponding MSE threshold = 20.56 • Using 8-bit pixel values (MAX=255) • Maximum per-pixel error = 255 • For a 1024x768 pixel image, Nmin ~ 251
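These numbers can be checked with the standard PSNR definition, PSNR = 10·log10(MAX²/MSE), under the assumption that each erroneous pixel carries the worst-case error of 255; this arithmetic gives Nmin ≈ 249, close to the slide's ~251 (the small gap is presumably rounding):

```python
import math

# Assumption: PSNR = 10*log10(MAX^2 / MSE), and every erroneous pixel
# has the worst-case error of MAX.
def n_min(psnr_db, width, height, max_val=255):
    mse_threshold = max_val ** 2 / 10 ** (psnr_db / 10)
    # N_min worst-error pixels push the image MSE past the threshold:
    #   N_min * max_val^2 / (width * height) >= mse_threshold
    return math.ceil(mse_threshold * width * height / max_val ** 2)

mse = 255 ** 2 / 10 ** 3.5   # ~20.56, as on the slide
n = n_min(35, 1024, 768)     # 249 with this formulation (slide: ~251)
```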

  40. Defining critical instructions • An instruction X is said to be critical if • X affects one of the outputs in Oc (numerical correctness required), OR • X affects at least Nmin elements of the elastic outputs O

  41. Outline • Motivation • Definition of critical instructions • Program representation • Static analysis to detect critical instructions • Profiling and runtime monitoring • Results

  42. Program representation • LLVM compiler infrastructure • LLVM intermediate representation • Weighted program dependence graph (PDG) – G

  43. Example LLVM IR – 3 address code

  44. Example PDG - based on LLVM IR

  45. Example • Node for computing X

  46. Example • Node for computing X • Node (out_i) to compute C[Z]+X • Node (so) to store C[Z]+X into the output array

  47. Example • Node for computing X • Node (so) to store C[Z]+X into the output array • Edge representing the dependence between X and out_i • Edge representing the dependence between out_i and so

  48. Assigning edge weights • Edge weight u→v – how many instances of node v are affected by 1 instance of u? • Example: • X outside the loop, out_i inside the loop • Edge weight N (the loop trip count) • Nodes out_i and so are in the same basic block • Edge weight 1
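One way to model the rule above, assuming a representation where each node records the loops enclosing it and an edge's weight is the product of trip counts of loops enclosing the sink but not the source (this representation is an illustrative assumption, not the paper's data structure):

```python
from math import prod

def edge_weight(loops_u, loops_v, trip_counts):
    """Weight of edge u -> v: product of trip counts of the loops that
    enclose v but not u; 1 if both are in the same set of loops."""
    extra = set(loops_v) - set(loops_u)
    return prod(trip_counts[l] for l in extra)  # empty product == 1

# X outside the loop, out_i inside a loop with trip count N = 100:
w1 = edge_weight([], ["L1"], {"L1": 100})       # 100
# out_i and so in the same basic block (same loop nest):
w2 = edge_weight(["L1"], ["L1"], {"L1": 100})   # 1
```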

  49. Outline • Motivation • Definition of critical instructions • Program representation • Static analysis to detect critical instructions • Profiling and runtime monitoring • Results

  50. Static analysis for detecting critical instructions • Find how many instances of the output O are affected by a node x • propagate(x → v) is the number of instances of v that are affected by an instance of x
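The transcript cuts off here, so the following is only a sketch of how such a propagation could work, assuming propagate multiplies edge weights along each path of an acyclic PDG, sums over paths, and compares the count reaching the elastic outputs against Nmin (all names and the graph encoding are hypothetical):

```python
# Sketch: count output instances affected by a node via weighted paths.
def propagate(graph, weights, x, output):
    """graph: adjacency dict node -> successors; weights: (u, v) -> int.
    Returns the number of instances of `output` affected by one
    instance of `x`, assuming the PDG is acyclic."""
    memo = {}
    def count(u):
        if u == output:
            return 1
        if u not in memo:
            memo[u] = sum(weights[(u, v)] * count(v)
                          for v in graph.get(u, []))
        return memo[u]
    return count(x)

def is_critical(graph, weights, x, output, n_min):
    return propagate(graph, weights, x, output) >= n_min

# Using the earlier example: X feeds out_i (weight N=100), which feeds so.
graph = {"x": ["out_i"], "out_i": ["so"]}
weights = {("x", "out_i"): 100, ("out_i", "so"): 1}
affected = propagate(graph, weights, "x", "so")  # 100
```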
