
Presentation Transcript


  1. Low-Power Cache Organization Through Selective Tag Translation for Embedded Processors with Virtual Memory Support Xiangrong Zhou and Peter Petrov Proceedings of the 16th ACM Great Lakes Symposium on VLSI (GLSVLSI '06) pp. 398-403, Apr. 2006 Citation Count: 6 Presenter: Chun-Hung Lai

2. Abstract • In this paper we present a novel cache architecture for energy-efficient data caches in embedded processors with virtual memory. Application knowledge regarding the nature of memory references is used to eliminate tag address translations for most of the cache accesses. We introduce a novel cache tagging scheme, where both virtual and physical tags co-exist in the cache tag arrays. Physical tags and special handling for the super-set cache index bits are used for references to shared data regions in order to avoid cache consistency problems. • By eliminating the need for address translation on cache access for the majority of references, a significant power reduction is achieved. We outline an efficient hardware architecture for the proposed approach, where the application information is captured in a reprogrammable way and the cache architecture is minimally modified. Our experimental results show energy reductions for the address translation hardware in the range of 90%, while the reduction for the entire cache architecture is within the range of 25%-30%.

3. What’s the Problem • Cache organization with virtual memory support is very power-consuming • The address translation (TLB lookup) is performed on every cache access (for PIPT and VIPT caches) • The TLB consumes 20-25% of the total cache power • Goal: reduce power by minimizing the number of address translations on cache accesses • Propose a selective tag translation cache architecture • For private data: can be handled with a virtual tag (w/o addr. translation) • For shared data: a physical tag is required (needs addr. translation) • Synonym problem: different virtual addresses are mapped to the same physical address

4. Background - VIVT & PIPT • Virtually Indexed Virtually Tagged (VIVT) cache • Pros: fast and low power (no address trans. on cache access) • Cons: synonyms • Different virtual addresses (from more than one task) are mapped to the same physical address (shared data) • Used for inter-process communication • Since virtual addresses are used to access the cache, the shared data will reside in different blocks - a cache consistency problem • Physically Indexed Physically Tagged (PIPT) cache • Pros: synonyms are no longer an issue • Cons: delay and power overhead (address trans. for each cache access)

5. Background - VIPT • Virtually Indexed Physically Tagged (VIPT) cache - the most typical cache architecture for general-purpose processors, and the one discussed in this paper • Pros: • Hides the address translation latency • Address translation is performed only for the tags • Cache indexing can be overlapped with the tag translation • Can eliminate the cache synonym problem • By imposing certain restrictions on the OS memory manager • Cons: • Power overhead (address trans. for each cache access)
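To make the three organizations concrete, here is a minimal C sketch of where the index and tag come from in each case. The parameters (4 KB pages, a 16 KB direct-mapped cache with 32 B lines) and the VA/PA pair are illustrative assumptions, not taken from the paper.

#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS   12u   /* 4 KB page (assumed)            */
#define OFFSET_BITS  5u   /* 32 B cache line (assumed)      */
#define INDEX_BITS   9u   /* 512 sets -> 16 KB direct-mapped */

static uint32_t index_of(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
}
static uint32_t tag_of(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}

int main(void) {
    uint32_t va = 0x00403ABCu;   /* virtual address               */
    uint32_t pa = 0x00017ABCu;   /* its translation (assumed)     */

    /* VIVT: both fields from the VA -- no TLB on the access path. */
    printf("VIVT index=%03x tag=%05x\n", (unsigned)index_of(va), (unsigned)tag_of(va));
    /* PIPT: both fields from the PA -- TLB needed before indexing. */
    printf("PIPT index=%03x tag=%05x\n", (unsigned)index_of(pa), (unsigned)tag_of(pa));
    /* VIPT: index from the VA (overlaps with translation),
     * tag from the PA (TLB still consulted on every access).     */
    printf("VIPT index=%03x tag=%05x\n", (unsigned)index_of(va), (unsigned)tag_of(pa));
    return 0;
}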

6. Selective Tag Translation Cache Architecture • Both virtual and physical tags are utilized at the same time • All cache lines are virtually-indexed • Non-shared data are tagged with virtual tags • Saves power, since no addr. translation is required • Shared data are tagged with physical tags • Shared data can be identified in advance, so they are physically tagged when placed in the cache • Different VAs mapped to the same PA need special care • A mode bit on each cache line marks whether its tag is virtual or physical (a sketch follows below)
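A minimal sketch of how such a per-line mode bit could steer the tag comparison, reusing the assumed address layout from the previous sketch; the struct and the translate callback are illustrative, not the paper's hardware description.

#include <stdbool.h>
#include <stdint.h>

struct cache_line {
    uint32_t tag;       /* holds a virtual OR a physical tag      */
    bool     phys_tag;  /* mode bit: true -> tag is physical      */
    bool     valid;
};

/* translate() stands in for the TLB / offset-adder path; it is
 * only invoked when the line is physically tagged (shared data). */
static bool tag_match(const struct cache_line *ln, uint32_t vtag,
                      uint32_t (*translate)(uint32_t)) {
    if (!ln->valid)
        return false;
    if (ln->phys_tag)                 /* shared: compare translated tag  */
        return ln->tag == translate(vtag);
    return ln->tag == vtag;           /* private: no translation at all  */
}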

7. The Proposed Technique Works Correctly When Synonyms are Aligned • What is an aligned synonym? • The superset bits of the virtual address are identical to the superset bits of the physical address • Thus, the virtual index is the same as the physical index - the VIPT cache behaves like a PIPT cache • What are the superset bits (or color bits)? • The bits in the intersection of the cache index and the VPN, i.e., the MSBs of the virtual index that overlap with the VPN • They exist when the cache way size is larger than the page size • To eliminate the synonym problem in VIPT: align synonyms in the OS memory manager
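The superset bits and the alignment test can be made concrete with a small sketch, again under the assumed 16 KB way / 4 KB page layout (which yields 2 superset bits); the function names are hypothetical.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS     12u
#define OFFSET_BITS    5u
#define INDEX_BITS     9u
/* Superset bits = index bits that spill above the page offset. */
#define SUPERSET_BITS (OFFSET_BITS + INDEX_BITS - PAGE_BITS)   /* = 2 */

static uint32_t superset_of(uint32_t addr) {
    return (addr >> PAGE_BITS) & ((1u << SUPERSET_BITS) - 1u);
}

/* A synonym pair is "aligned" when VA and PA agree on the superset
 * bits, so the virtual index equals the physical index.          */
static bool synonym_aligned(uint32_t va, uint32_t pa) {
    return superset_of(va) == superset_of(pa);
}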

8. However, When the Synonyms are Not Aligned • The virtual superset bits are not identical to the physical superset bits • This conflicts with other virtual indexes that don't belong to the same synonym group • Two virtual addresses can have the same virtual superset bits, and hence the same virtual index - in a VIPT cache they point to the same cache line • However, they have different PPNs • The physical tag part is the same; only the physical superset bits differ • With identical physical tags, the cache mistakes them for the same data

9. To Avoid the Previous Conflict When Synonyms are Not Aligned • Goal: translate the virtual superset bits to the physical superset bits with minimal cost • Add an offset to the virtual superset bits • This works because a shared data buffer is allocated at consecutive physical addresses and is also mapped to consecutive virtual addresses, so the VA-to-PA mapping is a constant offset • Superset offset adder: translates to the physical superset bits with little delay; the result is concatenated into the cache index • Page offset adder: replaces the TLB to translate the physical tag (power-efficient)
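A sketch of the offset-adjustment idea under the same assumed layout; the struct and function names are hypothetical, and the sketch treats the whole PPN as the physical tag for simplicity.

#include <stdint.h>

#define PAGE_BITS     12u
#define SUPERSET_BITS  2u

struct xlate_offsets {
    uint32_t superset_off;   /* added to the virtual superset bits */
    uint32_t page_off;       /* added to the VPN to obtain the PPN */
};

/* Physical superset bits: a SUPERSET_BITS-wide add (2 bits here),
 * narrow enough to sit on the cache-indexing path.               */
uint32_t phys_superset(uint32_t va, const struct xlate_offsets *o) {
    uint32_t vss = (va >> PAGE_BITS) & ((1u << SUPERSET_BITS) - 1u);
    return (vss + o->superset_off) & ((1u << SUPERSET_BITS) - 1u);
}

/* Physical tag: the page offset adder replaces the D-TLB lookup
 * for this buffer; an adder burns far less power than a CAM.     */
uint32_t phys_tag(uint32_t va, const struct xlate_offsets *o) {
    return (va >> PAGE_BITS) + o->page_off;   /* VPN + offset = PPN */
}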

10. Compiler and OS Support • To apply the proposed scheme • The shared data buffers and the hot-spots are identified • During the program profiling, compilation, and loading phases • Two extra bits are encoded in each memory reference instruction that accesses a shared data buffer • Case 1: for the most frequently accessed shared data buffers • Utilize the offset-adjustment address translation method - this is where the benefit comes from • The index into the offset table is encoded in the memory reference inst. as well • Case 2: for the less frequently accessed shared data buffers • Translate the physical tag via the D-TLB • Case 3: for non-shared data • Handle with a virtual tag • Offset table: an entry is reserved for each shared buffer, and the offset is determined by the OS (a sketch follows below)
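A sketch of the offset table and the per-instruction label described above; the case values, field names, and table size are assumptions for illustration, not the paper's encoding.

#include <stdint.h>

enum ref_kind {            /* the paper's three cases              */
    REF_PRIVATE = 0,       /* virtual tag, no translation          */
    REF_SHARED_TLB,        /* shared, translate via the D-TLB      */
    REF_SHARED_OFFSET      /* hot shared buffer, use offset adders */
};

struct label {             /* extra bits in the memory instruction */
    enum ref_kind kind;    /* the two extra bits                   */
    uint8_t       table_idx;   /* only for REF_SHARED_OFFSET       */
};

struct xlate_offsets { uint32_t superset_off, page_off; };  /* as in the previous sketch */

#define OFFSET_TABLE_ENTRIES 8   /* one entry per hot shared buffer (assumed size) */
struct xlate_offsets offset_table[OFFSET_TABLE_ENTRIES];    /* offsets written by the OS at load/map time */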

11. Hardware Support • First, one additional bit is associated with each cache line • Indicates a physical or virtual tag • Second, implement the offset table • Offset table access and cache access are pipelined -> not on the critical path, so no performance overhead • Third, the superset offset adder and the page offset adder • Translate the physical superset bits and the PPN for synonym accesses • The introduced delay is small • The superset offset adder on the cache access path is narrow (typically 2 bits) • Though the page offset adder is wider, its delay is still less than that of the TLB it replaces • The label part (L) of each memory inst. encodes • The index into the offset table • The synonym bits (distinguishing the previous 3 cases) - i.e., whether to use a virtual or physical tag

12. Overall Hardware Organization • The different address translation paths are controlled by the L field of the memory inst. (see the dispatch sketch below) • Case 1: for frequently used shared data - use the offset adjustment (virtual superset bits -> physical superset bits, VPN -> physical tag) • Case 2: for infrequently used shared data - default to the D-TLB (VPN -> physical tag) • Case 3: for non-shared data - use the virtual tag; no power is spent on address translation
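A sketch tying the three cases together; it reuses the types and helpers from the earlier sketches, and dtlb_lookup is an assumed stand-in for a D-TLB probe, not a function from the paper.

#include <stdint.h>

#define PAGE_BITS 12u

/* Types and helpers from the earlier sketches, redeclared here. */
struct xlate_offsets { uint32_t superset_off, page_off; };
enum ref_kind { REF_PRIVATE = 0, REF_SHARED_TLB, REF_SHARED_OFFSET };
struct label { enum ref_kind kind; uint8_t table_idx; };

extern struct xlate_offsets offset_table[];
uint32_t phys_tag(uint32_t va, const struct xlate_offsets *o);
uint32_t dtlb_lookup(uint32_t vpn);   /* assumed D-TLB stand-in */

/* The L field of the memory instruction steers which translation
 * path produces the tag used for the compare.                    */
uint32_t tag_for_access(uint32_t va, struct label l) {
    switch (l.kind) {
    case REF_SHARED_OFFSET:      /* case 1: hot shared buffer -> offset adders */
        return phys_tag(va, &offset_table[l.table_idx]);
    case REF_SHARED_TLB:         /* case 2: cold shared data -> default D-TLB  */
        return dtlb_lookup(va >> PAGE_BITS);
    case REF_PRIVATE:            /* case 3: private data -> virtual tag,       */
    default:                     /*         zero translation energy            */
        return va >> PAGE_BITS;
    }
}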

13. Experimental Results: Energy Reduction for Selective Tag Translation Cache - 1/2 • Assume the D-TLB is used for physical tag translation • Energy reduction for a direct-mapped cache • Address translation only: 77.8% ~ 99.3% • Entire cache including address translation: 22.1% ~ 29.4% • Legend: dm = direct-mapped cache; 2way = 2-way set-associative cache; in each pair of numbers, the 1st is the address translation only and the 2nd is the entire cache

14. Experimental Results: Energy Reduction for Selective Tag Translation Cache - 2/2 • Assume the offset-adjustment translation for the physical superset bits and PPNs is applied • Energy reduction for a direct-mapped cache • Address translation only: 82.1% ~ 99.9% • Entire cache including address translation: 23.6% ~ 29.6% • Legend: dm = direct-mapped cache; 2way = 2-way set-associative cache; in each pair of numbers, the 1st is the address translation only and the 2nd is the entire cache

15. Conclusions • This paper proposed a selectively tagged cache architecture for low-power processors with virtual memory support • References to private data • Virtual tags are used, and the power-consuming address translation is eliminated • References to shared data • Physical tags are used to avoid synonym problems • Furthermore, because shared buffers are allocated at consecutive addresses • The address translation can be performed by an adder instead of a TLB lookup • The energy reduction can be improved further • Results show that the proposed scheme achieves • An energy reduction for the entire cache of 25%~30%

16. Comments on This Paper • The instruction set extension may not be easy • The proposed scheme adds a label field to memory instructions • Are the unused bits in the instruction encoding sufficient for the label field? • The relationship between the related works and the proposed work is not well established • How the proposal goes beyond the related works is not made concrete • The experimental results lack a comparison with the related works • The area and performance overheads are not reported

17. Related Works • Techniques for minimizing the power/performance overhead of the TLB by reducing the amount of TLB activity: • Add a page-sharing table to the TLB [2] • Replace the TLB with a more scalable Synonym Lookaside Buffer [4] • A TLB that supports up to two pages per entry [7] • Redirect TLB accesses to a register that holds recent TLB entries [8] • This paper: a selective tag translation cache architecture
