
DRIM: A Low Power Dynamically Reconfigurable Instruction Memory Hierarchy for Embedded Systems


Presentation Transcript


  1. DRIM: A Low Power Dynamically Reconfigurable Instruction Memory Hierarchy for Embedded Systems. Zhiguo Ge, Weng-Fai Wong, and Hock-Beng Lim. Proceedings of the Design, Automation and Test in Europe Conference (DATE’07), April 2007

  2. Abstract • Power consumption is of crucial importance to embedded systems. In such systems, the instruction memory hierarchy consumes a large portion of the total energy consumption. A well designed instruction memory hierarchy can greatly decrease the energy consumption and increase performance. The performance of the instruction memory hierarchy is largely determined by the specific application. Different applications achieve better energy-performance with different configurations of the instruction memory hierarchy. • Moreover, applications often exhibit different phases during execution, each exacting different demands on the processor and in particular the instruction memory hierarchy. For a given hardware resource budget, an even better energy-performance may be achievable if the memory hierarchy can be reconfigured before each of these phases.

  3. Abstract – Cont. • In this paper, we propose a new dynamically reconfigurable instruction memory hierarchy to take advantage of these two characteristics so as to achieve significant energy-performance improvement. Our proposed instruction memory hierarchy, which we called DRIM, consists of four banks of on-chip instruction buffers. Each of these can be configured to function as a cache or as a scratchpad memory (SPM) according to the needs of an application and its execution phases. Our experimental results using six benchmarks from the MediaBench and the MiBench suites show that DRIM can achieve significant energy reduction.

  4. What’s the Problem
  • The instruction delivery system constitutes a significant portion of the processor energy consumption, as instructions are fetched almost every cycle
  • Scratchpad memory (SPM) is more energy efficient than cache
  • However, existing work on instruction SPM does not consider the phased behavior of applications during execution

  5. Related Works
  • Reduce energy consumption in I-caches: reconfigure the cache to adapt to the application, e.g., by shutting down cache ways [18, 1]
  • Reduce energy and instruction conflicts using pure SPM or hybrid SPM-and-cache architectures:
    • Static mapping of instructions into SPM [16, 9]: a static architecture with static mapping
    • Dynamic instruction replacement for SPM [7, 4, 14]: a static architecture with dynamic instruction replacement
    • Reconfiguring the memory hierarchy ($/SPM) for a given application [11, 15]: static architecture exploration with static mapping
    • Dynamically reconfigurable data memory with $ and SPM [6]
  • This paper: a dynamically reconfigurable instruction memory with $ and SPM, i.e., dynamic architecture tuning across execution phases, with a reconfiguration management algorithm

  6. Idea of the Dynamically Reconfigurable Instruction Memory (DRIM)
  • Reconfigure the instruction memory architecture at runtime
  • Exploit the differing requirements of the phases within an application
  • The four banks can be dynamically reconfigured as cache or SPM

  7. DRIM Architecture – Part 1
  • Based on a four-way set-associative cache
  • The four banks can be configured dynamically as cache or SPM
  • The tag bank is gated when its bank is configured as SPM
  • The configuration bit ci is set to 1 when bank i is used as SPM; address bits a7…a0 index within a 256-byte bank

  8. DRIM Architecture – Part 2
  • Decide whether an instruction resides in SPM: the fetch address is checked against the upper and lower bound addresses of the instruction block residing in SPM
  • Loading from memory to SPM is performed when needed
  • With each data bank being 256 bytes, the four banks cover the offset ranges 0x000~0x0FF, 0x100~0x1FF, 0x200~0x2FF, and 0x300~0x3FF; address bits [a9:a8] generate the data bank selection signal Di
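  A minimal behavioral sketch in C of the address check and bank selection described on this slide. The names (spm_lower_bound, spm_upper_bound, bank_is_spm) are illustrative and not taken from the paper; the 256-byte banks and the use of bits [a9:a8] follow the slide.

      #include <stdint.h>
      #include <stdbool.h>

      #define NUM_BANKS 4
      #define BANK_SIZE 256                 /* bytes per data bank, as on the slides */

      /* Per-bank configuration bit ci: 1 = bank i is used as SPM, 0 = part of the cache. */
      static uint8_t bank_is_spm[NUM_BANKS];

      /* Upper and lower bound addresses of the instruction block currently in SPM
       * (hypothetical bound registers). */
      static uint32_t spm_lower_bound, spm_upper_bound;

      /* SPM hit: the fetch address falls inside the block loaded into SPM. */
      static bool spm_hit(uint32_t fetch_addr)
      {
          return fetch_addr >= spm_lower_bound && fetch_addr <= spm_upper_bound;
      }

      /* Data bank selection Di: bits [a9:a8] pick one of the four 256-byte banks
       * (0x000-0x0FF -> bank 0, 0x100-0x1FF -> bank 1, ...). The bits are taken from
       * the offset into the SPM block, which equals the address bits when the block
       * is suitably aligned. */
      static unsigned select_data_bank(uint32_t fetch_addr)
      {
          uint32_t offset = fetch_addr - spm_lower_bound;   /* offset within the SPM block */
          return (offset >> 8) & 0x3;                       /* bits [a9:a8] */
      }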

  9. DRIM Architecture – Part 3
  • The SPM_hit signal controls the gating of the tag and data banks:
    • if (SPM_hit) then all tag banks are gated; else only the tag banks configured as cache are searched
    • if (SPM_hit) then the SPM data bank is selected by Di; else only the data banks configured as cache are searched
  • Each data bank i has an enable signal (1: enable, 0: disable)
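  Continuing the behavioral sketch above (reusing spm_hit, select_data_bank, and bank_is_spm), the gating driven by SPM_hit could be modeled roughly as follows. The enable-signal names are illustrative, and the mapping from Di to a physical SPM bank is simplified.

      /* Per-fetch enable signals derived from SPM_hit. */
      static void compute_enables(uint32_t fetch_addr,
                                  bool tag_enable[NUM_BANKS], bool data_enable[NUM_BANKS])
      {
          bool hit = spm_hit(fetch_addr);
          unsigned di = select_data_bank(fetch_addr);        /* Di from bits [a9:a8] */

          for (unsigned i = 0; i < NUM_BANKS; i++) {
              if (hit) {
                  tag_enable[i]  = false;                    /* all tag banks are gated */
                  data_enable[i] = (i == di);                /* only the selected SPM bank */
              } else {
                  tag_enable[i]  = !bank_is_spm[i];          /* search only cache tag banks */
                  data_enable[i] = !bank_is_spm[i];          /* search only cache data banks */
              }
          }
      }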

  10. Compiler Support for Dynamic Reconfiguration & Instruction Load
  • Get the required execution statistics:
    • Execution counts of CFG edges
    • Number of procedure invocations
  • Optimize the instruction layout within each procedure:
    • Bring the frequently executed basic blocks together
  • With the optimized instruction layout, determine the architectural configuration for the different phases:
    • When to reconfigure and what configuration to use
    • Which instructions to allocate to SPM
  • Generate the code chunks and load them into SPM:
    • Group instruction blocks for SPM
    • Insert instructions for reconfiguration
    • Insert instructions for trace loading

  11. Preface of Reconfiguration and Instruction Allocation
  • A Loop Procedure Hierarchy Graph (LPHG) represents the program, capturing all loops, procedure calls, and their relations
  • Most of the energy consumed by instruction fetch is assumed to occur inside loops
  • If a loop's iteration count exceeds a threshold, it is beneficial to use SPM for it
  • Deeper loops in the LPHG have higher execution frequency, so allocation starts from leaf loops and proceeds to their parent loops
  • If a loop is larger than the SPM size, the cache is used to buffer the rest of the loop
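  A minimal sketch in C of an LPHG node and the per-loop heuristic on this slide. The field names and the threshold value are illustrative, not taken from the paper.

      #include <stdbool.h>

      #define MAX_CHILDREN 8
      #define ITERATION_THRESHOLD 100       /* assumed value; the slide only says "a threshold" */

      struct loop_node {
          unsigned iterations;              /* profiled iteration count */
          unsigned size;                    /* loop body size, e.g., in instructions */
          struct loop_node *children[MAX_CHILDREN];
          unsigned n_children;
          bool in_spm;                      /* frequently executed part mapped to SPM */
          bool has_reconfig_point;          /* reconfiguration inserted at the loop entry */
      };

      /* A loop is an SPM candidate only if it iterates often enough to amortize the
       * cost of reconfiguring and loading its instructions into SPM. */
      static bool worth_using_spm(const struct loop_node *loop)
      {
          return loop->iterations > ITERATION_THRESHOLD;
      }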

  12. Algorithm for Reconfiguration and Instruction Allocation
  • The algorithm walks the LPHG and decides whether it is beneficial to allocate more SPM space from the free banks (free_banks)
  • Leaf node: allocate the frequently executed instructions inside the loop to SPM
  • Internal node: delete all reconfiguration points inserted in its child loops and add a new reconfiguration point at the entry of the loop, since only one code chunk can reside in SPM
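  A hedged sketch of the bottom-up pass over the LPHG, reusing struct loop_node and worth_using_spm from the sketch above. How much SPM space to claim from the free banks is decided by a separate benefit check (sketched after the next slide); here a loop is only marked for SPM and the reconfiguration points are maintained.

      static void allocate_loop(struct loop_node *loop)
      {
          /* Children first: deeper loops in the LPHG execute more frequently. */
          for (unsigned i = 0; i < loop->n_children; i++)
              allocate_loop(loop->children[i]);

          if (!worth_using_spm(loop))
              return;

          loop->in_spm = true;              /* frequently executed instructions go to SPM */

          /* Only one code chunk can reside in SPM at a time, so an allocated internal
           * node removes the reconfiguration points of its child loops and places a
           * single reconfiguration point at its own entry. */
          for (unsigned i = 0; i < loop->n_children; i++)
              loop->children[i]->has_reconfig_point = false;
          loop->has_reconfig_point = true;
      }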

  13. Example of How to Evaluate Conflicts
  • The evaluation function considers a configuration beneficial when reducing the cache size does not severely increase the I-cache miss rate
  • In the example LPHG, the number inside each circle is the loop iteration count and the number beside each circle is the loop size
  • Step 1: configure one bank as SPM and allocate loop E to it; the total size of the remaining cache banks (64x3) is larger than each of B, C, and D, so there is no conflict (safe)
  • Step 2: configure one more bank as SPM and move loop D in; the remaining cache banks (64x2) are larger than each of B and C, so there is still no conflict (safe)
  • Step 3: configure one more bank as SPM and move loop B in; the remaining cache bank (64x1) is smaller than C, which causes a severe cache conflict
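  A hedged sketch of the safety check implied by this slide's example: claiming another bank for SPM is treated as safe only if every loop still served by the cache fits in the remaining cache capacity. The capacity unit (64 per bank) follows the slide's example, and the function name is illustrative; the earlier includes are reused.

      #define BANK_CAPACITY 64              /* per-bank capacity unit used in the example */

      static bool config_is_safe(unsigned cache_banks_left,
                                 const unsigned cached_loop_sizes[], unsigned n_cached_loops)
      {
          unsigned cache_capacity = cache_banks_left * BANK_CAPACITY;
          for (unsigned i = 0; i < n_cached_loops; i++)
              if (cached_loop_sizes[i] > cache_capacity)
                  return false;             /* a cached loop no longer fits: severe conflict */
          return true;                      /* no conflict, so the configuration is safe */
      }

  In the slide's example this check passes with three or two banks left as cache, but fails once only one bank remains, because loop C no longer fits.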

  14. Optimization: Hoist Reconfiguration Position
  • Goal: reduce the number of reconfigurations
  • If a loop does not have any sibling loops, hoist the reconfiguration point from the inner loop to the outer loop
  • Original: the code chunk is loaded into SPM whenever the child loop is executed; Optimized: the reconfiguration is performed once at the entry of the outer loop B
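  A hedged sketch of the hoisting rule, reusing struct loop_node from the earlier sketches: when a loop with a reconfiguration point has no sibling loops, its reconfiguration point can be moved to the entry of its parent, so the code chunk is not reloaded into SPM on every execution of the inner loop.

      static void hoist_reconfig_points(struct loop_node *loop)
      {
          /* Hoist within the children first so points can bubble up through the LPHG. */
          for (unsigned i = 0; i < loop->n_children; i++)
              hoist_reconfig_points(loop->children[i]);

          /* An only child's reconfiguration point is hoisted to the entry of this loop. */
          if (loop->n_children == 1 && loop->children[0]->has_reconfig_point) {
              loop->children[0]->has_reconfig_point = false;
              loop->has_reconfig_point = true;
          }
      }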

  15. Experimental Setup
  • DRIM is based on a 4-way set-associative I-cache
  • Each bank is 256 bytes in size
  • Energy consumption is modeled using CACTI for a 0.13μm technology
  • The logic that performs the address checking and SPM control is also included
  • A table of energy consumption per access lists the energy of the cache portion when DRIM is configured with 1, 2, 3, or 4 banks as cache, and the SPM energy as the energy of one data bank plus the overhead of accessing the SPM

  16. Performance Improvement
  • The average improvement is 15.6% in I-cache miss rate and 10.2% in execution time
  • The improvement comes from mapping the frequently executed instructions of important loops into SPM

  17. Energy Saving
  • The reduction in energy consumption by DRIM ranges from 14.3% to 65.2%, with an average reduction of 41%
  • There are energy savings even when there is no miss-rate reduction
  • The reduction comes from the improved I-cache miss rate (fewer SDRAM accesses) and from the lower energy consumption per access of SPM compared with cache

  18. Conclusions
  • This paper proposed a low-power Dynamically Reconfigurable Instruction Memory (DRIM)
  • The I-cache banks can be configured as SPM for different applications as well as for different phases of an application's execution
  • A compilation flow supports DRIM by determining the reconfiguration points and the instructions allocated to SPM
  • Experimental results show that DRIM reduces energy consumption by up to 65.2%

  19. Comment for This Paper
  • The DRIM architecture is clear and easy to understand
  • It also shows that the tag bank is not utilized when a bank is configured as SPM
  • The complex compiler framework makes it hard to migrate to other instruction set architectures (ISAs)
