
Advanced Embedded Systems



  1. Advanced Embedded Systems – Lecture 7: Advances in Embedded Systems CPUs

  2. Advanced Embedded Systems • The first part of the course is an analysis of the paper "Efficient Embedded Computing" by W. J. Dally, J. Balfour, D. Black-Schaffer, J. Chen, R. Curtis Harting, V. Parikh, J. Park and D. Sheffield, Computer, July 2008, pp. 27 – 32; • The paper addresses the trade-off between ASICs and programmable processors; • ASICs are 50 times more efficient than programmable processors; they fully meet the efficiency requirements of ESs, in terms of speed and energy consumption; the cost is a lack of flexibility, because there is no programmability; • Programmable processors have the advantage of flexibility but are slower than ASICs and use more energy; • The paper presents a processor that applies several techniques for reducing energy consumption, narrowing the gap to ASICs to about a factor of 3;

  3. Advanced Embedded Systems • The weight of computation has, in recent years, shifted from desktops, laptops and data centers to embedded devices; • More than one billion cell phones are sold each year, and a 3G cell phone performs more operations per second than a typical desktop CPU; • A 3G mobile phone receiver requires 35 – 40 GOPS of performance to handle a 14.4 Mbps channel; a typical desktop computer has a peak performance of a few GOPS; • For a power dissipation of about 1 W, typical of an embedded media device, this works out to an efficiency of 25 mW/GOPS, or 25 pJ/op, for a 3G receiver; • A well-designed ASIC can achieve an efficiency of 5 pJ/op; • Very efficient processors, including DSPs, require about 250 pJ/op, and a mid-range laptop processor requires 20 nJ/op, meaning 4000 times the energy consumption of an ASIC;
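The efficiency figures above follow from simple arithmetic on the quoted power budget and throughput; a quick sanity check of the slide's numbers (all values taken from the text) could look like:

```python
# Sanity check of the energy-per-operation figures quoted above.
power_w = 1.0      # power budget of an embedded media device (W)
ops_per_s = 40e9   # 3G receiver workload, upper bound: 40 GOPS

pj_per_op = power_w / ops_per_s * 1e12   # J/op converted to pJ/op

asic_pj = 5.0      # well-designed ASIC, pJ/op
laptop_nj = 20.0   # laptop-class processor, nJ/op

print(pj_per_op)                      # 25.0 pJ/op for the 3G receiver
print(laptop_nj * 1000.0 / asic_pj)   # 4000.0: laptop vs. ASIC energy gap
```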

  4. Advanced Embedded Systems • ASICs meet the energy efficiency demands of embedded applications but are difficult to design and inflexible; • About 2 years are needed to design a typical ASIC, at a cost of about $20 million; as a consequence, ASICs are adequate only for large-volume applications; • The long design cycle causes ASICs to lag far behind the latest developments in algorithms, modems and codecs; • Inflexibility increases an ASIC's area and complexity; for example, if a system needs several parallel interfaces, an ASIC implementation must provide separate hardware for each interface, even if only one will be used at any time; a processor implementation needs a single piece of hardware with multiple software configurations; • It is more and more difficult for ASICs to cover all the new and dynamic requirements of embedded applications; • As a consequence, embedded applications require increasing efficiency but also flexibility;

  5. Advanced Embedded Systems • An embedded processor spends most of its energy on fetching instructions and reading and writing data; • The figure shows how the processor's energy is spent: • The processor spends 42% of its energy on fetching instructions and 28% on supplying data; only 6% is used for arithmetic operations; • Of the 6% spent on arithmetic, only 59% goes to useful arithmetic, the rest being overhead such as updating loop indices and calculating memory addresses;

  6. Advanced Embedded Systems • For every 10 pJ arithmetic operation, the processor spends 70 pJ on instruction fetching and 47 pJ on data supply; • There is also an instruction overhead: 1.7 instructions must be fetched and supplied with data for every useful instruction; • The figure gives a deeper view of the instruction fetching energy: • The 8 Kbyte instruction cache consumes most of the energy; • Fetching an instruction requires accessing both ways of the two-way set-associative cache and reading two tags; these operations cost 107 pJ of energy;

  7. Advanced Embedded Systems • The figure details the data supply energy; • The 8 Kbyte data cache accounts for 50% of the data supply energy; • The 40-word multiported general-purpose register file accounts for 41% of the energy, and pipeline registers account for the balance; • 131 pJ of energy are needed to supply a word of data from the data cache; supplying the same word from the register file requires only 17 pJ; • Two words must be supplied and one consumed for every 10 pJ arithmetic operation;

  8. Advanced Embedded Systems • The next table lists each component's energy cost; • Pipeline registers consume an additional 12 pJ, passing each instruction down the five-stage RISC pipeline; • The total energy for fetching an instruction is thus 119 pJ for a 10 pJ arithmetic operation; • Because of overhead instructions, 1.7 instructions must be fetched for each useful instruction; • The energy required for fetching instructions and supplying data to the arithmetic units of a RISC processor therefore ranges from 15 to 50 times the energy for actually carrying out the instructions; • To improve the efficiency of programmable processors, efforts must be focused on instruction fetching and data supply; • Instruction fetching energy can be reduced 50 times by using a deeper hierarchy with explicit control, eliminating overhead instructions and exposing the pipeline; • The processor must supply instructions without cycling the cache;
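Combining the per-component costs above gives the per-useful-operation fetch overhead; the sketch below redoes that arithmetic with the numbers from the slides (the quoted 15–50× range also accounts for data supply, which is omitted here):

```python
# Instruction-fetch energy per useful arithmetic operation,
# using the component costs quoted in the slides.
cache_access_pj = 107.0   # both ways + two tags of the 8 KB I-cache
pipeline_pj = 12.0        # five-stage pipeline registers, per instruction
fetch_pj = cache_access_pj + pipeline_pj   # 119 pJ per fetched instruction

overhead = 1.7            # instructions fetched per useful instruction
alu_op_pj = 10.0          # one useful arithmetic operation

ratio = fetch_pj * overhead / alu_op_pj
print(round(ratio, 1))    # ~20x the ALU energy, within the 15-50x range
```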

  9. Advanced Embedded Systems (table: energy cost of each processor component)

  10. Advanced Embedded Systems • The figure shows the proposed processor, called ELM: • Instructions are fetched from a small set of distributed instruction registers and not from the cache; • The cost of reading an instruction bit from this instruction register file, IRF, is 0.1 pJ versus 3.4 pJ for the cache, a reduction of 34 times; • Open problems: how are instructions placed in the IRF? Doesn't this add overhead? What is the cost in terms of energy consumption? What is the cost in terms of speed?

  11. Advanced Embedded Systems • The IRF is another, smaller level of the instruction memory hierarchy; why wasn't it introduced in the past? • When caches were introduced, the goal was performance, not energy consumption; • To maximize performance, the hierarchy's lowest level is sized as large as possible while still being accessible in a single cycle; • Making the cache smaller would decrease performance by increasing the miss rate, without improving the cycle time; • That is why level 1 instruction caches are typically 8 to 64 Kbytes; • Optimizing for energy requires minimizing the hierarchy's bottom level while still capturing the critical loops of the kernels that dominate media applications; • The proposed processor has an IRF with 64 registers that can be partitioned, so that smaller loops need only cycle the register bit lines needed to hold the loop;

  12. Advanced Embedded Systems • The ELM processor manages the IRF as a register file and not as a cache; • This means the compiler, not the hardware, allocates registers and schedules transfers; • This explicit management of the IRF has two main advantages: • It avoids stalls by prefetching a block of instructions into the IRF as soon as the processor identifies the block to be executed; • In contrast, a cache waits until the first instruction is needed, then stalls the processor while it fetches the instruction from backing memory; • Some caches avoid this problem with hardware prefetch units, but at the cost of: • More energy consumed, • Unneeded instruction fetches and • Rarely anticipating branches off the straight-line instruction sequence; • Cases where the working set does not quite fit in the IRF are also better managed, leading to higher efficiency;

  13. Advanced Embedded Systems • The energy for fetching each bit of an instruction is thus reduced from 3.4 pJ to 0.1 pJ; however, moving this bit down the pipeline still costs 0.4 pJ, which is too much; • This cost is eliminated by exposing the pipeline; • With a conventional, hidden pipeline, the processor fetches all at once an instruction whose parts are needed over many cycles, then delays each part via a series of pipeline registers until it is needed; • The system uses the register read address during the register read pipeline stage, the opcode during the execute pipeline stage, and so on; • This is the classic approach, but it has a high energy consumption; • An exposed pipeline splits up instructions and fetches each part of the instruction during the cycle in which it is needed; • This requires more work on the compiler's part but eliminates the power-hungry instruction pipeline; • Together, the IRF and the exposed pipeline reduce the cost of fetching each instruction bit;

  14. Advanced Embedded Systems • Eliminating overhead instructions reduces the number of instruction bits which must be supplied; • The processor spends most overhead instructions on managing loop indices and calculating addresses; • The instruction set can be modified so that the system performs the most common cases of these overhead functions automatically, as side effects of other instructions; • For example, a proposed load instruction loads a word from memory with the address post-incremented by a constant and wrapped to a start address when it reaches a limit; • This allows implementing a circular buffer using a single load instruction rather than a sequence of several instructions; • Adding these side effects to instructions represents a selective return to complex instruction sets; this is an acceptable approach when energy is the major concern; • Tests showed that the ELM processor fetches 63% fewer dynamic instruction bits than the RISC processor;
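To see what the post-increment-and-wrap load saves, the toy model below (plain Python with hypothetical names; the actual ELM encoding is not shown in the slides) folds the circular buffer's index bookkeeping into a single "load" operation:

```python
# Model of a load with post-increment and wrap-around: the address
# update and the wrap test become side effects of the load itself,
# instead of separate add/compare/branch instructions.
class CircularLoader:
    def __init__(self, memory, start, limit, step=1):
        self.memory = memory   # backing storage
        self.addr = start      # current address
        self.start = start     # wrap target
        self.limit = limit     # one past the last valid address
        self.step = step       # post-increment constant

    def load(self):
        """One 'instruction': read, post-increment, wrap at the limit."""
        value = self.memory[self.addr]
        self.addr += self.step
        if self.addr >= self.limit:
            self.addr = self.start
        return value

buf = CircularLoader(memory=[10, 20, 30], start=0, limit=3)
print([buf.load() for _ in range(5)])   # [10, 20, 30, 10, 20]
```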

  15. Advanced Embedded Systems • Data supply energy can be reduced 21 times by using a deeper storage hierarchy with indexed register files; • Most data in the RISC processor is supplied from the general register file at a cost of 17 pJ/word; • Accessing the register file is costly because of its 40-word size and multiple read and write ports; • The next figure shows the solution applied in the ELM processor:

  16. Advanced Embedded Systems • The ELM reduces the data supply energy by placing a small, four-entry operand register file (ORF) on each arithmetic logic unit's input; • Because of its small size and port count, reading a word from an ORF requires only 1.3 pJ, a savings of 13 times; • The ORF's small size also allows reading it in the same clock cycle as the arithmetic operation, without appreciably lengthening the cycle time; this helps simplify the exposed pipeline; • The figure also shows the use of an indexed register file (XRF) as the hierarchy's next level; • The system can access the XRF either directly (a field of the instruction specifies the register to access) or indirectly (a field of the instruction specifies an index register, which in turn specifies the register to access); • Allowing indirect, or indexed, access to this register file eliminates the need for many references to the data cache or memory; • Many media kernels make frequent accesses to small arrays that can fit in the register file but require indirect access; on ELM, these arrays can be kept in the indexed register file, reducing their access energy;
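The direct and indirect XRF access modes can be illustrated with a small model (the register names, sizes and field layout here are illustrative, not the actual ELM encoding):

```python
# Toy model of the indexed register file (XRF): a small array, such as
# a filter coefficient table, lives in registers instead of the cache.
xrf = [0] * 16          # indexed register file contents
index_regs = [0, 0]     # index registers used for indirect access

xrf[0:5] = [3, 1, 4, 1, 5]   # a 5-entry array kept in the XRF

def xrf_read_direct(reg):
    """Direct mode: an instruction field names the register."""
    return xrf[reg]

def xrf_read_indexed(idx):
    """Indirect mode: an instruction field names an index register,
    whose contents select the XRF register to read."""
    return xrf[index_regs[idx]]

index_regs[0] = 2
print(xrf_read_direct(4))    # 5
print(xrf_read_indexed(0))   # 4, i.e. array[2] without touching the cache
```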

  17. Advanced Embedded Systems • Tests showed that the ELM data memory reads 77% fewer words than the RISC data cache; • Exposing the movement of data and instructions with IRFs, ORFs and XRFs imposes new tasks on the compiler: • It must manage the transfer of instruction blocks into the IRFs; • It must coordinate the movement of data between XRFs, ORFs and ALUs; • It must map arrays into XRFs; • The exposed pipeline creates new compilation challenges, particularly at the boundaries of basic blocks, where the compiler can overlap operations from multiple blocks; • The exposed pipeline lets the compiler precisely control the movement of data through the data path and avoid unnecessarily cycling data through costly register files; • These tasks are not part of conventional compilers but can be implemented as compiler technology advances;

  18. Advanced Embedded Systems • As a consequence, a new compiler was developed; • To quantify the compiler's efficiency, a benchmark was run and the compiled code was compared to hand-optimized code for both the ELM and RISC processors; • The results showed a performance degradation of 1.7 times for both processors when moving from hand-optimized code to compiled code; • Much of the performance degradation on the ELM processor results from the current compiler performing only basic scheduling optimizations at the boundaries of basic blocks; • The current scheduling algorithms sometimes introduce brief bubbles in the pipeline when a branch instruction jumps to a new instruction; • The principal goal of the ELM design was to narrow the gap to ASICs;

  19. Advanced Embedded Systems • Benchmarks were run comparing the implementation of several kernels on the ELM and on ASICs; • The implementations used the same technology and design flow; • The results showed that: • On kernels such as AES encryption and discrete cosine transform computation, where the ELM processor stores part of the data working set in its local memory, the ELM processor consumes about 3 times the energy of an ASIC; • On compute-intensive kernels such as FIR filtering, where the data register hierarchy captures the working sets, the ELM processor consumes only 1.5 times the energy of an ASIC; • Further optimization is possible in the register hierarchies; • The ELM processor's efficiency is thus close to that of an ASIC, and better results can be obtained with more efficient custom circuits and layout, particularly in the instruction and data hierarchies; improvements in the microarchitecture can also be foreseen;

  20. Advanced Embedded Systems • Increasingly complex modern media applications demand high performance and low power; • The high complexity and cost of ASICs point toward a programmable solution; • However, classical programmable processors are not efficient enough for these applications, because of the high energy consumed in fetching instructions and supplying data to their arithmetic units; • The ELM solution is a compromise and a starting point for more energy-efficient programmable processing; • Several future improvement directions can be outlined: • In the area of instruction fetching, additional levels can be added to the hierarchy; • The code itself can be encoded more efficiently: instructions can share common parts, and no-operation fields of instructions can be compressed;

  21. Advanced Embedded Systems • Data supply energy can also be improved; for example, compound operations that perform several instructions as a unit, without incurring the cost of cycling intermediate data through even the smallest of register files, can be envisioned; • The ELM processor was designed for embedded applications, but the techniques and solutions implemented in it might be useful for other applications too; • For example, the servers used in large data centers have a huge energy consumption, which must be limited; the energy cost of operating them exceeds their purchase price within 2 years; • In 2007, the US expended 1% of its electricity supply operating large servers; • The applications running on these servers differ from embedded media applications; in particular, they have far less instruction locality, making IRFs less attractive; • However, some of the described techniques could still lead to significant power savings.

  22. Advanced Embedded Systems • The second part of the course deals with the paper "Amdahl's Law in the Multicore Era" by M. D. Hill and M. R. Marty, Computer, July 2008, pp. 33 – 38; • The paper deals with the extension of Amdahl's law to multicore processors; • Although multicore processors are still seldom used in ESs, the increasing complexity of embedded applications may require the integration of multiple cores on the same circuit; • For example, the TMS320DM6441 DaVinci processor is a dual-core processor developed for the portable media player market; • It includes a TMS320C64x+ DSP core and an ARM926EJ-S core; • The ARM core is a 32-bit RISC processor and the DSP core is also 32-bit, with two cache levels; • The processor also includes a video/imaging coprocessor;

  23. Advanced Embedded Systems (figure: block diagram of the TMS320DM6441 DaVinci processor)

  24. Advanced Embedded Systems • The general version of Amdahl's law states that if one enhances a fraction f of a computation by a speedup s, the overall speedup is: Speedup(f, s) = 1 / ((1 - f) + f/s); • The law was then applied to systems with n processors running in parallel; it was considered that a fraction f of the program is parallelizable with no scheduling overhead and the remaining fraction, 1 - f, is entirely sequential; • The speedup of a system with n processors is then: Speedup(f, n) = 1 / ((1 - f) + f/n); • Amdahl's equations assume that the computation's problem size doesn't change when running on enhanced machines, meaning that the fraction of the program that is parallelizable remains fixed;
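The two forms of the law translate directly into code; a minimal sketch:

```python
# Amdahl's law: a fraction f of the work is enhanced by a speedup s.
def amdahl(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# Parallel special case: fraction f runs perfectly on n processors.
def amdahl_parallel(f, n):
    return amdahl(f, n)

print(amdahl(0.5, 2))            # 1.333...: half the work, twice as fast
print(amdahl_parallel(0.9, 16))  # 6.4: the serial 10% dominates
```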

  25. Advanced Embedded Systems • Multicore processor designers have to address questions such as: • How many cores? • Should cores use simple pipelines or a single powerful multi-issue pipeline? • Should cores use the same or different microarchitectures? • How can the power from both dynamic and static sources be managed? • Extending Amdahl's law to multicore processors allows designers to view the entire chip's performance rather than focusing on core efficiencies; • First, it is assumed that a multicore chip of given size and technology can contain at most n base core equivalents (BCEs); this budget doesn't include chip resources expended on shared caches, interconnection networks, memory controllers and so on; it is simplistically assumed that these non-processor resources are roughly constant across the multicore variations considered;

  26. Advanced Embedded Systems • Second, it is assumed that designers have techniques for using the resources of multiple BCEs to create a core with greater sequential performance; if the performance of a single BCE is 1, the resources of r BCEs can be expended to create a more powerful core with sequential performance perf(r); • If perf(r) > r, devoting resources to larger cores speeds up both sequential and parallel execution; if perf(r) < r, only the sequential execution gains performance; • One model considers perf(r) = √r, meaning that a sequential performance of √r is obtained with the resources of r BCEs; thus, performance is doubled with the resources of 4 BCEs, tripled with the resources of 9 BCEs, and so on; • Next, a symmetric multicore chip requires that all its cores have the same cost; • For example, a symmetric multicore chip with resources of n = 16 BCEs can support 16 cores of one BCE each, four cores of four BCEs each, or, in general, n/r cores of r BCEs each;

  27. Advanced Embedded Systems • Figures a and b show two examples of symmetric multicore chips for n = 16: • Under Amdahl's law, the speedup of a symmetric multicore chip, relative to using one single-BCE core, depends on: • The software fraction that is parallelizable, f, • The total chip resources in BCEs, n, and • The BCE resources, r, devoted to increasing each core's performance; • The chip uses one core to execute sequentially at performance perf(r); • It uses all n/r cores to execute in parallel at performance perf(r) × n/r;

  28. Advanced Embedded Systems • The overall equation is: Speedup_symmetric(f, n, r) = 1 / ((1 - f)/perf(r) + f · r/(perf(r) · n)); • Part a of the next figure assumes a symmetric multicore chip of n = 16 BCEs and perf(r) = √r; • The x axis gives the resources devoted to increasing each core's performance; a value of 1 indicates that the chip has 16 base cores, while a value of r = 16 uses all resources for a single core; • Lines assume different values for the parallel fraction, f = 0.5, 0.9, …, 0.999; • The y axis gives the symmetric multicore chip's speedup relative to running on one single-BCE base core; • For example, the maximum speedup for f = 0.9 is 6.7, using eight cores at a cost of two BCEs each; • Similarly, part b illustrates how the trade-offs change with n = 256 BCEs per chip; with f = 0.975, the maximum speedup of 51.2 is obtained with 36 cores of 7.1 BCEs each;
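The symmetric-chip numbers quoted above can be reproduced from the equation; the sketch below scans the core sizes that evenly divide n = 16 and recovers the f = 0.9 maximum of about 6.7 at r = 2 (eight 2-BCE cores):

```python
import math

def perf(r):
    return math.sqrt(r)   # sequential performance of an r-BCE core

def speedup_symmetric(f, n, r):
    # One r-BCE core runs the sequential fraction at perf(r); all
    # n/r cores run the parallel fraction at perf(r) * n / r.
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))

f, n = 0.9, 16
best_r = max([1, 2, 4, 8, 16], key=lambda r: speedup_symmetric(f, n, r))
print(best_r)                                     # 2: eight 2-BCE cores
print(round(speedup_symmetric(f, n, best_r), 1))  # 6.7
```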

  29. Advanced Embedded Systems (figure: speedup curves for symmetric multicore chips, parts a and b)

  30. Advanced Embedded Systems • Result 1: Amdahl's law applies to multicore chips, because achieving the best speedups requires values of f near 1; thus, finding parallelism is still critical; • Efforts should be targeted at increasing f through architectural support, compiler techniques, programming model improvements and so on; • Result 2: using more BCEs per core, r > 1, can be optimal, even when performance grows by only √r; for a given f, the maximum speedup can occur with one big core, n base cores, or an intermediate number of middle-sized cores; • Researchers should seek methods of increasing core performance even at a high cost; • Result 3: moving to denser chips increases the likelihood that cores will be non-minimal; even at f = 0.99, minimal base cores are optimal at chip size n = 16, but more powerful cores help at n = 256; • Denser multicore chips should therefore provide more powerful cores;

  31. Advanced Embedded Systems • An alternative to a symmetric multicore chip is an asymmetric, or heterogeneous, multicore chip, in which one or more cores are more powerful than the others; • Under the simplistic assumptions of Amdahl's law, it makes most sense to devote the extra resources to increasing the capability of only one core, as shown in part c; • For example, with the resources of n = 16 BCEs, an asymmetric multicore chip can have one 4-BCE core and 12 one-BCE cores, one 9-BCE core and 7 one-BCE cores, and so on; • In general, the chip can have 1 + n - r cores, because the single larger core uses r resources, leaving n - r resources for the one-BCE cores; • Amdahl's law has a different effect on an asymmetric multicore chip: the chip uses the single more powerful core to execute sequentially at performance perf(r); in the parallel fraction, it gets performance perf(r) from the larger core and performance 1 from each of the n - r base cores;

  32. Advanced Embedded Systems • The resulting equation is: Speedup_asymmetric(f, n, r) = 1 / ((1 - f)/perf(r) + f/(perf(r) + n - r)); • Part c shows asymmetric speedup curves for n = 16 BCEs and part d gives the curves for n = 256 BCEs; • These curves are quite different from the corresponding symmetric curves; the symmetric curves typically show either immediate performance improvement or immediate loss as the chip uses more powerful cores, depending on the level of parallelism; in contrast, asymmetric chips often reach a maximum speedup between the extremes; • Result 4: asymmetric multicore chips can offer potential speedups that are much greater than symmetric chips offer, and never worse; for example, for n = 256 and f = 0.975, the best asymmetric speedup is 125, whereas the best symmetric speedup is 51.2; • Result 5: denser multicore chips increase both the speedup benefit of going asymmetric and the optimal performance of the single large core; for f = 0.975 and n = 1024, the best speedup is obtained with one core of 345 BCEs and 679 single-BCE cores;
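The asymmetric maximum quoted above can be checked numerically; scanning integer sizes r for the single large core under perf(r) = √r recovers the best speedup of about 125 for n = 256 and f = 0.975 (for instance, r = 64 gives 1/(0.025/8 + 0.975/200) = 125 exactly):

```python
import math

def perf(r):
    return math.sqrt(r)

def speedup_asymmetric(f, n, r):
    # The single r-BCE core runs the sequential fraction at perf(r);
    # in the parallel fraction it contributes perf(r) while each of
    # the n - r base cores contributes 1.
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r))

f, n = 0.975, 256
best = max(speedup_asymmetric(f, n, r) for r in range(1, n + 1))
print(round(best))                             # 125, vs. 51.2 symmetric
print(round(speedup_asymmetric(f, n, 64), 3))  # 125.0
```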

  33. Advanced Embedded Systems • Another approach is dynamic multicore chips: up to r cores are dynamically combined to boost the performance of only the sequential component, as shown in the next figure: • In sequential mode, the dynamic multicore chip can execute at performance perf(r) when the dynamic techniques can use r BCEs; • In parallel mode, a dynamic multicore chip gets performance n, using all base cores in parallel;

  34. Advanced Embedded Systems • Part e displays dynamic speedups when using r cores in sequential mode, with perf(r) = √r, for n = 16 BCEs, and part f shows the curves for n = 256 BCEs; • As the graphs show, performance always gets better as the software can exploit more BCE resources to improve the sequential component; • Result 6: dynamic multicore chips can offer speedups that can be greater than (and never worse than) those of asymmetric chips with identical perf(r) functions; for example, with f = 0.99 and n = 256, a speedup of 223 is obtained, which is greater than the comparable asymmetric speedup of 165; however, achieving much greater speedups than asymmetric chips implies dynamic techniques that harness more cores for sequential mode than is possible today; • This result holds because it is assumed that dynamic chips can gang all resources together for sequential execution and also use them all for parallel execution;
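From the two modes described above, the dynamic speedup follows as 1 / ((1 - f)/perf(r) + f/n); the f = 0.99, n = 256 figure quoted in the slide checks out:

```python
import math

def perf(r):
    return math.sqrt(r)

def speedup_dynamic(f, n, r):
    # Sequential mode fuses r BCEs into one core running at perf(r);
    # parallel mode uses all n base cores for total performance n.
    return 1.0 / ((1.0 - f) / perf(r) + f / n)

# All 256 BCEs fused for the sequential fraction, f = 0.99:
print(round(speedup_dynamic(0.99, 256, 256)))   # 223
```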

  35. Advanced Embedded Systems • The extension of Amdahl's law described here is affected by real conditions, which are more complex than the ones assumed; • First, on the hardware side: • Designers can't build cores with arbitrarily high performance simply by adding more resources; • Little is known about how to dynamically harness many cores for sequential use without undue performance and hardware resource overhead; • The effects of dynamic and static power were ignored; • The effects of on-chip and off-chip memories and interconnections were ignored; • Second, on the software side: • Software is not simply divisible into a perfectly parallel part and a purely sequential part; • Software tasks and data movements add overhead; • It is more costly to develop parallel software than sequential software; • Scheduling software tasks on asymmetric and dynamic multicore chips could be difficult and add overhead.
