
Dynamic and Application-Driven I-Cache Partitioning for Low-Power Embedded Multitasking

Mathew Paul and Peter Petrov. Proceedings of the IEEE Symposium on Application Specific Processors (SASP ’09), July 2009.





Presentation Transcript


  1. Dynamic and Application-Driven I-Cache Partitioning for Low-Power Embedded Multitasking Mathew Paul and Peter Petrov Proceedings of the IEEE Symposium on Application Specific Processors (SASP ’09) July 2009

  2. Abstract • The abundance of wireless connectivity and the increased workload complexity have further underlined the importance of energy efficiency for modern embedded applications. The cache memory is a major contributor to the system power consumption, and as such is a primary target for energy reduction techniques. Recent advances in configurable cache architecture have enabled an entirely new set of approaches for application-driven energy- and cost-efficient cache resource utilization. • We propose a run-time cross-layer specialization methodology, which leverages configurable cache architectures to achieve an energy- and performance-conscious adaptive mapping of instruction cache resources to tasks in dynamic multitasking workloads.

  3. Abstract – Cont. • Sizable leakage and dynamic power reductions are achieved with only a negligible and system-controlled performance impact. The methodology assumes no prior information regarding the dynamics and the structure of the workload. As the proposed dynamic cache partitioning alleviates the detrimental effects of cache interference, performance is maintained very close to the baseline case, while achieving 50%–70% reductions in dynamic and leakage power for the on-chip instruction cache.

  4. What’s the Problem • The cache memory is a major contributor to the total dynamic and leakage power • Caches occupy up to 50% of the die area and 80% of the transistor budget • How can the configurable cache be customized dynamically to give a task only its required cache volume? • Goal: reduce power consumption with limited degradation in performance [Figure: a normal cache vs. an energy-efficient cache running Task0 — performance does not improve noticeably beyond half of the cache, so the remainder can stay idle]

  5. The Proposed Methodology for Dynamic Cache Customization • Partition the instruction cache and adapt its utilization at run time • Cache partitioning: eliminates cache interference • Utilize a configurable cache: only the required subsection of the cache is active [Figure: dynamic multitasking workload timeline for Task0, Task1, and Task2 — only one task is active at a time (e.g., from t2 to t3), and the unused cache sections are idle]

  6. Functional Overview • Based on the cache partition formation (initial partition) policy and the cache requirements of each task in the dynamic multitasking workload (detailed later) • Example requirements — Task0: 2K 2-way, Task1: 8K 4-way, Task2: 4K 2-way [Figure: 16K 4-way baseline cache — during Task2’s execution, only a subsection equal to its required cache size is active; the rest is kept in low-power drowsy mode]

  7. Functional Overview – Cont. • However, overlapping cache partitions are sometimes inevitable • Some tasks may require larger cache partitions • Overlap brings the problem of cache interference • The result is performance worse than the required miss-rate bound • Such cases are handled through dynamic partition update • The overlapped partitions are updated dynamically [Figure: in the ideal case, Task0, Task1, and Task2 map to exclusive partitions; the initial partition may overlap, and the dynamic partition update enlarges a partition when its performance degrades]

  8. Dynamic Cache Customization • The mechanisms required for efficient cache utilization with minimal interference: • Initial partition formation • Identify each task’s cache requirement at compile time • Use the cache miss statistics local to each task • Initial partition assignment • Assign the initial partition to a task at run time • Set the “Cache Way Select Register (CWSR)” and the “mask register” to vary the number of active sets • Dynamic partition update policy • Fine-tune the partition size when performance degrades • Ensure the miss rate remains within the threshold bounds
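The slide names two control registers — a Cache Way Select Register (CWSR) that enables a subset of ways, and a mask register that restricts the set index. A minimal sketch of how such registers could encode a partition is shown below; the bit layout, register widths, and the 16K 4-way geometry are assumptions for illustration, not the paper's actual hardware interface.

```python
# Illustrative model of a configurable cache's partition registers.
# Assumed geometry: 16 KB, 4 ways, 32-byte lines -> 128 sets total.
TOTAL_WAYS = 4
TOTAL_SETS = 128

def partition_registers(active_ways, active_sets):
    """Compute a CWSR value (one bit per enabled way) and a set-index
    mask that restricts lookups to the first `active_sets` sets."""
    assert 1 <= active_ways <= TOTAL_WAYS
    assert active_sets & (active_sets - 1) == 0, "sets must be a power of two"
    cwsr = (1 << active_ways) - 1   # enable the low `active_ways` ways
    set_mask = active_sets - 1      # set index is ANDed down to active_sets
    return cwsr, set_mask

# A 4 KB 2-way partition: 2 ways x 64 sets x 32 B lines
cwsr, mask = partition_registers(active_ways=2, active_sets=64)
print(bin(cwsr), bin(mask))  # -> 0b11 0b111111
```

Writing these two values at a context switch is what maps a task onto its subsection of the cache; the disabled ways and sets can then be held in a low-power mode.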

  9. Part 1: Initial Partition Formation • Identify the cache requirement and determine the initial partition size for each task • Aim to reduce energy while keeping performance close to the baseline case, i.e., BASE(Ti) — the actual baseline miss rate of task Ti under interference from sharing the baseline cache • BASE(Ti) is not available at compile time, so the task-specific IND_BASE(Ti) is used instead — the miss rate of task Ti when it uses the baseline cache in isolation • A “Threshold” is then defined to account for the cache interference • Hence, the miss-rate bound for a task is IND_BASE(Ti) + Threshold • The starting cache configuration is picked such that MISS(Pi,j) ≦ IND_BASE(Ti) + Threshold

  10. Part 1: Initial Partition Formation – Example • MCS (Miss-rate Cache Space) Table • Cache miss statistics for each cache configuration (way sizes 512 B–8K, varying numbers of ways), obtained through profiling • Find the minimal cache configuration that satisfies the bound, with Threshold = 0.1% • Task0: MISS(P0,j) ≦ IND_BASE(T0) + 0.1% = 0% + 0.1% → starting configuration for G721: 8K 2-way • Task1: MISS(P1,j) ≦ IND_BASE(T1) + 0.1% = 0.15% + 0.1% → starting configuration for LAME: 4K 4-way • Task2: MISS(P2,j) ≦ IND_BASE(T2) + 0.1% = 0.17% + 0.1% → starting configuration for GSM: 8K 2-way
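The selection rule above can be sketched as a table lookup: given a profiled MCS table mapping each configuration to its miss rate, pick the smallest configuration whose miss rate satisfies MISS(Pi,j) ≦ IND_BASE(Ti) + Threshold. The table entries and miss-rate numbers below are hypothetical, not the paper's measured data.

```python
# Sketch of initial partition formation (Part 1).
# mcs_table: profiled (size_KB, ways) -> miss rate for one task.

def pick_starting_config(mcs_table, ind_base, threshold=0.001):
    """Return the smallest configuration whose profiled miss rate
    satisfies MISS(P_ij) <= IND_BASE(T_i) + Threshold."""
    bound = ind_base + threshold
    # Order candidates by total size, then by way count, so the first
    # feasible entry is the minimal (lowest-power) one.
    for (size_kb, ways), miss in sorted(mcs_table.items()):
        if miss <= bound:
            return (size_kb, ways)
    return None  # no small configuration meets the bound; use the full cache

# Hypothetical MCS table for one task (miss rates as fractions)
mcs = {
    (2, 2): 0.012,
    (4, 2): 0.004,
    (4, 4): 0.0009,
    (8, 2): 0.0005,
    (16, 4): 0.0002,
}
print(pick_starting_config(mcs, ind_base=0.0))  # -> (4, 4)
```

Because the table is scanned smallest-first, the chosen partition is the cheapest one that still honors the per-task miss-rate bound.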

  11. Part 2: Initial Partition Assignment • Assign the initial partition to a task at run time • Set the control register and mask register of the configurable cache • Attempt to assign partitions exclusive of each other • But this is not always possible: the total cache requirement of G721, LAME, and GSM is 20K, yet only 16K is available • At time t0, allocate 8K 2-way to G721 • At time t1, allocate 4K 4-way to LAME — it cannot be exclusive of G721, so overlapping is allowed; the overlap is unavoidable due to the nature of the LAME partition • At time t2, allocate 8K 2-way to GSM (with a small portion shared with LAME)
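A greedy capacity-only sketch of this assignment step follows: place each task's starting partition in free cache space, and mark it as overlapping when exclusive placement is impossible. This deliberately ignores the way/set geometry that drives the actual overlap in the slide's example (LAME's 4-way requirement), so which task ends up overlapped differs; the sizes mirror the G721/LAME/GSM scenario.

```python
# Simplified sketch of initial partition assignment (Part 2).
# Sizes are in KB; only total capacity is modeled, not way geometry.

def assign_partitions(requests, total_kb=16):
    """Place each (task, size) request; tasks that cannot fit in the
    remaining free space are recorded as overlapping."""
    placements, free_kb, overlapped = {}, total_kb, []
    for task, size in requests:
        if size <= free_kb:
            free_kb -= size            # exclusive placement
        else:
            overlapped.append(task)    # must share space already in use
        placements[task] = size
    return placements, overlapped

# 8 + 4 + 8 = 20 KB requested, but only 16 KB is available
_, shared = assign_partitions([("G721", 8), ("LAME", 4), ("GSM", 8)])
print(shared)  # -> ['GSM']
```

Overlapped tasks are exactly the ones the dynamic partition update (Part 3) must later watch for interference.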

  12. Part 3: Dynamic Partition Update • Tasks with overlapping partitions cannot be prevented • Interference may push miss rates beyond the bound • A hardware miss counter inside the CPU detects when the miss rate exceeds IND_BASE(Ti) + Threshold, which triggers the dynamic partition update (partition rescaling) • The partition size is enlarged until the miss rate is less than the miss-rate bound
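The trigger logic can be sketched as a simple comparison against the per-task bound, stepping up a ladder of cache configurations when the bound is violated. The ladder below is illustrative (it includes the 4K 4-way → 6K 3-way step from the next slide); the paper's actual rescaling consults the task's MCS table.

```python
# Sketch of the dynamic partition update trigger (Part 3).
# The hardware miss counter yields observed_miss_rate each interval.

CONFIG_LADDER = [(4, 4), (6, 3), (8, 2), (12, 3), (16, 4)]  # (KB, ways)

def maybe_rescale(current, observed_miss_rate, ind_base, threshold=0.001):
    """Keep the partition if the miss rate is within
    IND_BASE(T_i) + Threshold; otherwise step to the next larger one."""
    if observed_miss_rate <= ind_base + threshold:
        return current                       # within bound: no change
    i = CONFIG_LADDER.index(current)
    return CONFIG_LADDER[min(i + 1, len(CONFIG_LADDER) - 1)]

# LAME-style numbers: bound = 0.15% + 0.1% = 0.25%, observed 0.4%
print(maybe_rescale((4, 4), observed_miss_rate=0.004, ind_base=0.0015))
# -> (6, 3): the bound is exceeded, so the partition is enlarged
```

Repeating this check each monitoring interval enlarges an overlapped partition only as far as needed to bring the task back under its bound.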

  13. Part 3: Dynamic Partition Update – Example • Partition rescaling trades off power savings for meeting the performance requirement • For LAME, the miss-rate bound IND_BASE(T1) + Threshold = 0.25% is exceeded in the overlapped region • The next configuration after 4K 4-way with a miss rate below 0.25% is 6K 3-way • GSM is rescaled to 12K 3-way due to increased overlap with the rescaled LAME partition

  14. Part 3: Dynamic Partition Update – Example • Partition reshuffling • When a task leaves on completing execution, its cache resource is freed up and becomes available to the currently executing tasks • A previously rescaled partition is then considered for reshuffling: the task’s starting configuration can now be allocated completely, without overlap • At time t4, both G721 and GSM complete; only LAME is left executing • LAME is reshuffled back to its starting configuration — reverting to the smaller partition reduces power
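Reshuffling can be sketched as follows: when tasks finish, remove them, then revert each remaining task to its starting configuration whenever that now fits exclusively. Sizes are in KB; the t4 scenario (G721 and GSM finish, the rescaled 6K LAME partition reverts to 4K) follows the slide, and the simple capacity check is an assumption standing in for the paper's placement logic.

```python
# Sketch of partition reshuffling (Part 3).

def reshuffle(running, finished, starting, total_kb=16):
    """Free the partitions of `finished` tasks, then revert any remaining
    task to its starting configuration if it now fits without overlap."""
    sizes = {t: kb for t, kb in running.items() if t not in finished}
    for task in sizes:
        others = sum(kb for t, kb in sizes.items() if t != task)
        if starting[task] + others <= total_kb:
            sizes[task] = starting[task]   # smaller partition -> less power
    return sizes

# At t4: G721 and GSM complete; the rescaled LAME partition (6 KB)
# reverts to its 4 KB starting configuration.
print(reshuffle({"G721": 8, "LAME": 6, "GSM": 12},
                {"G721", "GSM"},
                {"G721": 8, "LAME": 4, "GSM": 8}))  # -> {'LAME': 4}
```

The power benefit comes from the reverted partition's disabled ways and sets returning to the low-power drowsy mode.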

  15. Experiment Setup • Use the cache configurations found in high-end embedded processors (Intel XScale and ARM9) • 16K 4-way • 32K 4-way • Scheduling policy to model multitasking • Round-robin policy with a context-switch frequency of 33K instructions • The miss-rate impact threshold is set to 0.1% • Two categories of benchmarks are evaluated • Static benchmarks: all tasks start and finish at the same time • Dynamic benchmarks: tasks enter and leave the workload at different times [Figure: structure of the dynamic benchmarks]

  16. Miss-Rate Impact: Miss-Rate Increase Compared to the Baseline Cache • Partitioning: apply the initial partition assignment only • Rescaling: apply partitioning + rescaling • Reshuffling: apply partitioning + rescaling + reshuffling • After rescaling, the miss-rate impact is reduced • For some configurations, rescaling and reshuffling are omitted, since the miss-rate impact is already within the threshold after the initial assignment (lower is better)

  17. BM_3 Individual Task Miss-Rates for the 16K Cache • GSM is subjected to rescaling: its miss-rate bound of 0.27% is exceeded due to interference in the overlapped region • Partition reshuffling maximizes the power reduction • Power reduction is achieved while keeping the miss-rate impact below the threshold value; some tasks even improve performance, with miss rates lower than the baseline case (lower is better)

  18. [Figure: thrashing in a shared cache — task0 through task4 interfering with one another when the baseline cache is shared without partitioning]
