Fine-Grain Power-Gating on STT-MRAM Peripheral Circuits with Locality-aware Access Control

Eishi Arima†, Hiroki Noguchi*, Takashi Nakada†, Shinobu Miwa†, Susumu Takeda*, Shinobu Fujita*, and Hiroshi Nakamura†
†The University of Tokyo *Toshiba Corporation
Normally-off Computing Project http://noff-pj.jp/en/
The Memory Forum 2014

Background
• STT-MRAM is considered the best candidate to replace SRAM in the last-level cache (LLC).
  • Low leakage, high density, high write endurance
• Write access energy has been regarded as the most critical problem in an STT-MRAM cache.
  • State-of-the-art MTJ cells reduce it dramatically.
  • 43nJ -> 1.1nJ (1MB)†
• On the other hand, we need to consider the leakage power of the STT-MRAM peripheral circuits.
  • Driving the write current requires high-performance but leaky transistors in the peripherals.
†E. Kitagawa et al., “Impact of Ultra Low Power and Fast Write Operation of Advanced Perpendicular MTJ on Power Reduction for High-Performance Mobile CPU,” in IEDM, 2012

Motivation
• In an STT-MRAM LLC, leakage accounts for 48% of the total LLC energy consumption.
  • It is mainly consumed in the peripherals.
  • The leakage of the memory cells is nearly zero.
• We need to reduce the peripheral leakage, and many techniques exist for reducing it.

Our goal and approaches
• The goal of this research
  • Leakage reduction for the STT-MRAM LLC's peripheral circuits while maintaining processor performance
• Our approaches
  • Fine-grained power-gating on peripheral circuits
    • Especially at the granularity of subarrays
  • Access control for further energy reduction
    • In particular, gathering the cache accesses

Subarray-level power-gating on STT-MRAM peripheral circuits
• We assume power-gating at the granularity of subarrays.
  • Finer granularity increases the chances to power-gate.
  • Granularity finer than a subarray is difficult to implement.
• A frequently accessed subarray should be kept awake.
  • It takes a few ns to wake a subarray up.
• To solve this problem, we adopt time-out control.
  • A subarray that has not been accessed for a while tends to stay idle for a long time.
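The time-out control described on this slide can be sketched as a small per-subarray state machine. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `SubarrayGate`, the cycle-based interface, and the 5-cycle wake-up penalty are hypothetical (the slide says only that wake-up takes a few ns), and the subarray is assumed to be awake at cycle 0. The 1K-cycle interval is the value used in the experiment slide.

```python
class SubarrayGate:
    """Time-out power-gating for one subarray's peripherals (illustrative only)."""

    def __init__(self, timeout_cycles=1000, wakeup_cycles=5):
        self.timeout_cycles = timeout_cycles  # idle cycles before the subarray is gated
        self.wakeup_cycles = wakeup_cycles    # assumed wake-up penalty, not from the slides
        self.last_access = 0                  # assumed: awake and just accessed at cycle 0
        self.sleep_cycles = 0                 # total cycles spent power-gated

    def access(self, now):
        """Register an access at cycle `now`; return the extra latency in cycles."""
        idle = now - self.last_access
        if idle > self.timeout_cycles:
            # The time-out expired, so the subarray slept for the rest of the idle period.
            self.sleep_cycles += idle - self.timeout_cycles
            penalty = self.wakeup_cycles
        else:
            # Still within the time-out interval: the subarray was kept awake.
            penalty = 0
        self.last_access = now
        return penalty


gate = SubarrayGate()
latencies = [gate.access(t) for t in (500, 3000, 3100)]
print(latencies, gate.sleep_cycles)
```

In this sketch only the access at cycle 3000 pays the wake-up penalty, because it follows an idle period longer than the time-out interval.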

Access control strategy for further leakage reduction
• We should improve subarray-level access locality for further leakage reduction.
  • While minimizing performance degradation
• Our methodologies
  • Locality-aware subarray mapping
    • For spatial locality enhancement
  • Write aggregation with buffers
    • For temporal locality enhancement

Locality-aware subarray mapping
• There are two types of subarray mapping in a cache.
  • Way-division and set-division
• Way-division is usually adopted.
• However, a set-division cache has better spatial locality for a sequence of successive data accesses.
  • Way-division: the index of a line is decided by its address, but its way is decided unpredictably, so successive lines are scattered and all the subarrays are awake.
  • Set-division: only one subarray is awake.
[Figure: a 256KB, 8-way LLC divided into 8 subarrays under each mapping; labels: 8 ways, 64 sets, LLC, subarray]
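The difference between the two mappings can be illustrated with the geometry from the figure labels (256KB, 8 ways, 8 subarrays, 64 sets per subarray). The following is a hedged sketch, not the authors' design: the 64-byte line size is an assumption (it is the value that makes the labelled numbers consistent), the function names are hypothetical, and choosing the subarray from the upper set-index bits is one possible set-division layout.

```python
LINE_BYTES = 64          # assumed line size, not stated on the slide
NUM_WAYS = 8
NUM_SUBARRAYS = 8
SETS_PER_SUBARRAY = 64
NUM_SETS = NUM_SUBARRAYS * SETS_PER_SUBARRAY  # 512 sets in total


def set_index(addr):
    """Set index of the cache line that holds byte address `addr`."""
    return (addr // LINE_BYTES) % NUM_SETS


def subarray_set_division(addr):
    """Set-division: each subarray holds all ways of 64 consecutive sets."""
    return set_index(addr) // SETS_PER_SUBARRAY


def subarray_way_division(way):
    """Way-division: each subarray holds one way of every set."""
    return way % NUM_SUBARRAYS


# 64 successive cache lines starting at address 0.
addrs = [i * LINE_BYTES for i in range(64)]
touched = {subarray_set_division(a) for a in addrs}
print(touched)
```

Under set-division the 64 successive lines all fall into one subarray, so the others can stay gated. Under way-division the subarray equals the way chosen by the replacement policy, which the address does not determine, so the same sequence may wake any of the 8 subarrays.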

Write aggregation for temporal locality enhancement
• Our technique gathers write accesses in buffers.
  • Each subarray has a set of buffers.
  • A buffer set is flushed when the corresponding subarray is read or when the buffer set is full.
• Write access latency is not critical for performance.
[Figure: the access sequence for a subarray over time t; labels: write (LLC misses and writebacks), read (demand LLC hits), sleep, active, buffering, time-out interval]
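The flush rule on this slide (flush when the subarray is read or when its buffer set is full) can be sketched as follows. This is an illustrative model only: the class `WriteAggregator` and its methods are hypothetical names, the 8-entry default follows the experiment slide, and the sketch ignores details such as serving reads from buffered lines or merging repeated writes to the same line.

```python
class WriteAggregator:
    """Per-subarray write buffers that delay writes to a sleeping subarray."""

    def __init__(self, num_subarrays=8, entries=8):
        self.entries = entries                        # capacity of each buffer set
        self.buffers = [[] for _ in range(num_subarrays)]
        self.flushes = []                             # log of (subarray, lines written)

    def _flush(self, sub):
        # Write all pending lines of this subarray in one burst.
        if self.buffers[sub]:
            self.flushes.append((sub, list(self.buffers[sub])))
            self.buffers[sub].clear()

    def write(self, sub, line_addr):
        """Buffer a write (LLC miss fill or writeback) instead of waking the subarray."""
        self.buffers[sub].append(line_addr)
        if len(self.buffers[sub]) == self.entries:
            self._flush(sub)                          # the buffer set is full

    def read(self, sub):
        """A read wakes the subarray anyway, so drain its pending writes."""
        self._flush(sub)


agg = WriteAggregator(num_subarrays=2, entries=3)
agg.write(0, 0x10)
agg.write(0, 0x11)
agg.read(0)                      # flush caused by a read
for line in (0x20, 0x21, 0x22):
    agg.write(1, line)           # the third write fills the buffer set
print(agg.flushes)
```

Each flush turns several scattered writes into one active period for the subarray, which is the temporal locality improvement the slide refers to.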

Experiment
• We evaluate the effect of our methodologies using the gem5 processor simulator.
• Environment
  [Table: environment parameters (LLC)]
• We set the time-out interval to 1K cycles and the number of entries in each buffer to 8.
  • These are the optimal values in our simulations.

Result
• The figure shows the leakage energy of an STT-MRAM LLC for each method.
• On average, our techniques reduce L2 cache leakage by more than 80%.
[Figure: leakage energy of the STT-MRAM LLC for each method; annotations: average, best case for each method, 67%, 83%, 40%, 30%, contribution of set-division, contribution of buffers]

Summary
• To reduce the leakage power of STT-MRAM's peripheral circuits, we propose subarray-level power-gating.
• We also propose two locality-aware access control methodologies to reduce leakage power further.
• Our experimental results show that, on average, our techniques reduce L2 cache leakage by more than 80%.

Performance Degradation
• The performance degradation caused by our methodologies is almost negligible.

Access control strategy for further leakage reduction
• Aim: improving the overall sleep rate of the subarrays.
• To achieve this, we only need to improve the spatial / temporal subarray-level locality of cache accesses.
[Figure: access timelines (t) of the subarrays; labels: access, sleep (almost 0 leakage), active, time-out interval; annotations: sleep rate 50%, temporal locality improvement, spatial locality improvement, 50%→75%, 75%→94%]
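The link between access locality and sleep rate can be made concrete with a small calculation. This is a sketch under simplifying assumptions, not the authors' evaluation method: the function `sleep_rate` is a hypothetical helper, the subarray is assumed to start asleep, wake-up latency is ignored, and the access times and the 100-cycle time-out are made-up numbers chosen for readability.

```python
def sleep_rate(access_cycles, total_cycles, timeout):
    """Fraction of the window [0, total_cycles) that one subarray spends gated."""
    asleep = 0
    awake_until = 0                       # assumed: the subarray starts asleep
    for t in sorted(access_cycles):
        if t > awake_until:
            asleep += t - awake_until     # gated between time-out expiry and this access
        awake_until = t + timeout         # each access restarts the time-out interval
    if total_cycles > awake_until:
        asleep += total_cycles - awake_until
    return asleep / total_cycles


# The same five accesses, first spread out and then gathered in time.
scattered = sleep_rate([0, 200, 400, 600, 800], total_cycles=1000, timeout=100)
gathered = sleep_rate([0, 50, 100, 150, 200], total_cycles=1000, timeout=100)
print(scattered, gathered)
```

With the same number of accesses, the gathered sequence keeps the subarray awake for one stretch instead of five, so its sleep rate is higher (0.7 versus 0.5 with these illustrative numbers).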

Implementation
• The buffers are constructed as an SRAM array.
• The buffer array is accessed just after an LLC access.
  • It is accessed only on an LLC miss.

The optimal time-out interval
• The figure shows the relationship between the time-out interval and performance degradation.
• We assume that a performance degradation of 1.5% (of all the benchmarks) is acceptable and choose 1K cycles as the time-out interval.

The optimal number of buffer entries
• The figure shows the total energy of the L2 cache and the write buffers for the subarrays.
• The graph shows that the optimal number of entries is 8.

Access distributions
• With our techniques, we can reduce the number of short access intervals.
[Figure: access interval distributions; labels: sleep rate, active rate]
