Memory Optimizations Research at UNT

Presentation Transcript


  1. Memory Optimizations Research at UNT Krishna Kavi Professor Director of NSF Industry/University Cooperative Center for Net-Centric Software and Systems (Net-Centric IUCRC) Computer Science and Engineering The University of North Texas Denton, Texas 76203, USA kavi@cse.unt.edu http://csrl.unt.edu/~kavi

  2. Motivation • The memory subsystem plays a key role in achieving performance on multi-core processors • The memory subsystem accounts for a significant portion of the energy consumed • Pin limitations constrain bandwidth to off-chip memories • Shared caches may have non-uniform access behaviors • Shared caches may encounter inter-core conflicts and coherency misses • Different data types exhibit different locality and reuse behaviors • Different applications need different memory optimizations Memory Optimizations at UNT

  3. Our Research Focus Cache memory optimizations: software and hardware solutions, primarily at L-1, with some ideas at L-2. Memory management: intelligent allocation and user-defined layouts; hardware-supported allocation and garbage collection. Memory Optimizations at UNT

  4. Non-Uniformity of Cache Accesses Access to cache sets is non-uniform: some sets are accessed 100,000 times more often than other sets, causing more misses in those sets while other sets go largely unused. Non-Uniform Cache Accesses for Parser Memory Optimizations at UNT
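
This kind of skew is easy to measure from an address trace. Below is a minimal sketch, assuming a 32-byte line, 128 sets, and a text file of hex data addresses on stdin (all of these are illustrative assumptions, not the configuration used in the experiments):

    #include <inttypes.h>
    #include <stdio.h>

    #define LINE_SIZE 32    /* assumed block size in bytes */
    #define NUM_SETS  128   /* assumed number of L-1 sets  */

    static unsigned long set_hits[NUM_SETS];

    int main(void)
    {
        uint64_t addr;
        /* Read one hex data address per line, e.g. extracted from a
           Valgrind/Gleipnir-style trace. */
        while (scanf("%" SCNx64, &addr) == 1)
            set_hits[(addr / LINE_SIZE) % NUM_SETS]++;

        for (unsigned s = 0; s < NUM_SETS; s++)
            printf("set %3u: %lu accesses\n", s, set_hits[s]);
        return 0;
    }

Plotting the resulting counts gives histograms like the one shown for Parser.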

  5. Non-Uniformity of Cache Accesses But, not all applications exhibit “bad” access behavior Non-Uniform Cache Accesses for Selected Benchmarks Need different solutions for different applications Memory Optimizations at UNT

  6. Improving Uniformity of Cache Accesses • Possible solutions • Using fully associative caches with perfect replacement policies • Selecting optimal addressing schemes • Dynamically re-mapping addresses to new cache lines • Partitioning caches into smaller portions • Each partition used by a different data object • Using multiple address decoders • Static or dynamic data mapping and relocation Memory Optimizations at UNT

  7. Associative Caches Improve Uniformity Direct Mapped Cache 16-Way Associative Cache Memory Optimizations at UNT

  8. Data Memory Characteristics • Different object types exhibit different access behaviors • Arrays exhibit spatial locality • Linked lists and pointer data types are difficult to pre-fetch • Static data and scalars may exhibit temporal locality • Custom memory allocators and custom run-time support can be used to improve the locality of dynamically allocated objects • Pool Allocators (U of Illinois) • Regular Expressions to improve on Pool Allocators (Korea) • Profiling and reallocating objects (UNT) • Hardware support for intelligent memory management (UNT and Iowa State) Memory Optimizations at UNT

  9. ABC’s of Cache Memories Multiple levels of memory form the memory hierarchy: CPU and registers; L1 instruction cache and L1 data cache; L2 cache (combined data and instructions); DRAM (main memory); and disk. Memory Optimizations at UNT

  10. ABC’s of Cache Memories Consider a direct-mapped cache: an address can only be in one fixed cache line, as specified by the 6-bit line number of the address. Memory Optimizations at UNT
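
For concreteness, with the slide's 6-bit line number (64 lines) and an assumed 32-byte line, splitting an address into its fields looks like this sketch:

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: 64 lines (6 index bits, per the slide) and
       32-byte lines (5 offset bits, an assumption). */
    #define OFFSET_BITS 5
    #define INDEX_BITS  6

    int main(void)
    {
        uint32_t addr   = 0x12345678;
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }

The index (line number) alone decides where the block may live, which is exactly why a direct-mapped cache is vulnerable to the set-conflict behavior shown earlier.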

  11. ABC’s of Cache Memories Consider a 2-way set associative cache: an address is located in a fixed set of the cache, but can occupy either of the 2 lines of that set. We extend this idea to 4-way, 8-way, and eventually fully associative caches. Memory Optimizations at UNT

  12. ABC’s of Cache Memories Consider a fully associative cache: an address can be located in any line; equivalently, there is only one set in the cache. This is very expensive, since we must compare the address tag with every line's tag, and we also need a good replacement strategy. The address splits into only a tag and a byte offset. It can lead to more uniform access to cache lines. Memory Optimizations at UNT

  13. Programmable Associativity Can we provide higher associativity only when we need it? Consider a simple idea: heavily accessed cache lines are given alternate locations, indicated by a “partner index”. Memory Optimizations at UNT
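
A minimal software sketch of that idea follows; the structure and field names are ours, not the actual hardware design. Each set carries a partner index, and a miss in the home set is retried in the partner set before going to the next memory level.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SETS  64
    #define LINE_SIZE 32

    /* One direct-mapped line per set, plus a programmable partner index. */
    struct line {
        bool     valid;
        uint64_t tag;
        int      partner;   /* alternate set for a heavily used set, or -1 */
    };

    static struct line cache[NUM_SETS];

    static bool probe(int set, uint64_t tag)
    {
        return cache[set].valid && cache[set].tag == tag;
    }

    /* Look up an address: try its home set first, then the partner set. */
    static bool lookup(uint64_t addr)
    {
        uint64_t block = addr / LINE_SIZE;
        int      home  = (int)(block % NUM_SETS);
        uint64_t tag   = block / NUM_SETS;

        if (probe(home, tag))
            return true;                     /* hit in the home set    */
        int p = cache[home].partner;
        if (p >= 0 && probe(p, tag))
            return true;                     /* hit in the partner set */
        return false;                        /* miss: go to next level */
    }

    int main(void)
    {
        for (int s = 0; s < NUM_SETS; s++)
            cache[s].partner = -1;

        uint64_t addr = 69 * LINE_SIZE;      /* home set 5, tag 1      */
        cache[5].partner = 37;               /* set 5 borrows set 37   */
        cache[37].valid  = true;
        cache[37].tag    = 1;                /* the block lives there  */

        printf("hit = %d\n", lookup(addr));  /* prints 1, via partner  */
        return 0;
    }

The extra probe costs a second lookup, so the alternate location is only worth assigning to sets that the history tables (next slide) identify as heavily used.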

  14. Programmable Associativity Peir's adaptive cache uses two tables: the Set-reference History Table (SHT), which tracks heavily used cache lines, and the Out-of-position directory (OUT), which tracks alternate locations. Zhang's programmable associativity (B-Cache) divides the cache index into programmable and non-programmable indexes; the non-programmable index (NPI) allows for varying associativities. [Peir 98] J. Peir, Y. Lee, and W. Hsu, “Capturing Dynamic Memory Reference Behavior with Adaptive Cache Topology,” in Proc. of the 8th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, 1998, pp. 240–250. [Zhang 06] C. Zhang, “Balanced cache: Reducing conflict misses of direct-mapped caches,” ISCA, pp. 155–166, June 2006. Memory Optimizations at UNT

  15. Programmable Associativity Memory Optimizations at UNT

  16. Programmable Associativity Memory Optimizations at UNT

  17. Programmable Associativity Memory Optimizations at UNT

  18. Multiple Decoders Each decoder splits the address into its own tag, set index, and byte offset fields (the figure shows three decoders with different field widths addressing the same tag and data arrays). Different decoders may use different associativities. Memory Optimizations at UNT

  19. Multiple Decoders But how to select index bits? Memory Optimizations at UNT

  20. Index Selection Techniques Different approaches have been studied: Givargis' quality bits; XOR-ing some tag bits with the index bits; adding a multiple of the tag to the index; and prime-modulo indexing. [Givargis 03] T. Givargis, “Improved Indexing for Cache Miss Reduction in Embedded Systems,” in Proc. of the Design Automation Conference, 2003. [Kharbutli 04] M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee, “Using Prime Numbers for Cache Indexing to Eliminate Conflict Misses,” in Proc. Int'l Symp. on High Performance Computer Architecture, 2004. Memory Optimizations at UNT
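
Rough sketches of three of the listed index functions, applied to the block number, are shown below; the bit widths and the prime are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    #define INDEX_BITS 7
    #define NUM_SETS   (1u << INDEX_BITS)     /* 128 sets                 */
    #define PRIME_SETS 127u                   /* nearest prime below 128  */

    /* Conventional index: low-order bits of the block number. */
    static unsigned index_modulo(uint64_t block)
    {
        return (unsigned)(block & (NUM_SETS - 1));
    }

    /* XOR some tag bits into the index bits. */
    static unsigned index_xor(uint64_t block)
    {
        unsigned low = (unsigned)(block & (NUM_SETS - 1));
        unsigned tag = (unsigned)((block >> INDEX_BITS) & (NUM_SETS - 1));
        return low ^ tag;
    }

    /* Prime-modulo index: distributes strided accesses more evenly. */
    static unsigned index_prime(uint64_t block)
    {
        return (unsigned)(block % PRIME_SETS);
    }

    int main(void)
    {
        /* Blocks strided by NUM_SETS all collide under plain modulo,
           but are spread out by the XOR and prime-modulo schemes.   */
        for (uint64_t b = 0; b < 4 * NUM_SETS; b += NUM_SETS)
            printf("block %4llu: mod=%3u xor=%3u prime=%3u\n",
                   (unsigned long long)b,
                   index_modulo(b), index_xor(b), index_prime(b));
        return 0;
    }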

  21. Index Selection Techniques Memory Optimizations at UNT

  22. Multiple Decoders Odd multiplier method Different multipliers for each thread Memory Optimizations at UNT
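
A sketch of the odd-multiplier scheme with a different multiplier per thread; the multiplier values and bit widths are arbitrary examples, not the values used in the study.

    #include <stdint.h>
    #include <stdio.h>

    #define INDEX_BITS 7
    #define NUM_SETS   (1u << INDEX_BITS)

    /* One odd multiplier per thread (illustrative values only). */
    static const unsigned multiplier[4] = { 3, 7, 11, 13 };

    /* Multiply the block number by the thread's odd constant and keep
       the low-order index bits.  Different threads therefore scatter
       the same block addresses to different sets, reducing
       inter-thread conflicts in a shared cache.                      */
    static unsigned set_index(int thread, uint64_t block)
    {
        return (unsigned)((block * multiplier[thread]) & (NUM_SETS - 1));
    }

    int main(void)
    {
        uint64_t block = 0x1A2B;
        for (int t = 0; t < 4; t++)
            printf("thread %d -> set %u\n", t, set_index(t, block));
        return 0;
    }

Because the multipliers are odd, the mapping stays one-to-one within each thread's index space.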

  23. Multiple Decoders Here we split the cache into segments, one per thread, but use adaptive-cache techniques to “donate” underutilized sets to other threads. Memory Optimizations at UNT

  24. Other Cache Memory Research at UNT Use of a single data cache can lead to unnecessary cache misses. Arrays exhibit higher spatial locality while scalars may exhibit higher temporal locality, so they may benefit from different cache organizations (associativity, block size). If we use separate instruction and data caches, why not separate data caches, partitioned either statically or dynamically? And if separate array and scalar caches are included, how can we further improve their performance? Optimize the sizes of the array and scalar caches for each application. Memory Optimizations at UNT

  25. Reconfigurable Caches (Figure: the CPU accesses separate Array and Scalar L-1 caches, backed by a Secondary Cache and Main Memory.) Memory Optimizations at UNT

  26. Percentage reduction of power, area and cycles for the data cache Conventional cache configuration: 8 KB direct-mapped data cache, 32 KB 4-way unified level-2 cache. Scalar cache configuration: size variable, direct mapped with a 2-line victim cache. Array cache configuration: size variable, direct mapped. Memory Optimizations at UNT

  27. Summarizing For the instruction cache: 85% (average 62%) reduction in cache size, 72% (average 37%) reduction in cache access time, and 75% (average 47%) reduction in energy consumption. For the data cache: 78% (average 49%) reduction in cache size, 36% (average 21%) reduction in cache access time, and 67% (average 52%) reduction in energy consumption. All compared with an 8KB L-1 instruction cache and an 8KB L-1 data cache backed by a 32KB unified level-2 cache. Memory Optimizations at UNT

  28. Generalization Why not extend the array/scalar split caches to more than two partitions, with each partition customized to a specific object type? Partitioning can be achieved using multiple decoders over a single cache resource (virtual partitioning). Reconfigurable partitions are possible with programmable decoders: each decoder accesses a portion of the cache, either physically restricted to a segment of the cache or virtually limited in the number of lines it may access. Scratchpad memories can be viewed as cache partitions: dedicate a segment of the cache to the scratchpad. Memory Optimizations at UNT

  29. Scratch Pad Memories These are viewed as compiler-controlled memories: as fast as L-1 caches, but not managed as caches; the compiler decides which data will reside in the scratchpad memory. A new paper from Maryland proposes a way of compiling programs for unknown-sized scratchpad memories: only stack data (static and global variables) are placed in the SPM, and the compiler views the stack as two stacks, a potential SPM data stack and a DRAM data stack. Memory Optimizations at UNT

  30. Current and Future Research Extensive study of using multiple decoders: separate decoders for different data structures (partitioning of L-1 caches); separate decoders for different threads and cores at L-2 or last-level caches, to minimize conflicts, minimize coherency-related misses, and minimize losses due to non-uniform memory access delays. Investigate additional indexing and programmable-associativity ideas. Cooperative L-2 caches using adaptive caches. Memory Optimizations at UNT

  31. Program Analysis Tool • We need tools to profile and analyze • Data layout at various levels of the memory hierarchy • Data access patterns • Existing tools (Valgrind, Pin) do not provide fine-grained information • We want to relate each memory access back to source-level constructs • Source variable name and the function/thread that caused the access Memory Optimizations at UNT

  32. Gleipnir • Our tool is built on top of Valgrind • It can be used with any architecture supported by Valgrind • x86, PPC, MIPS, and ARM Memory Optimizations at UNT

  33. Gleipnir Memory Optimizations at UNT

  34. Gleipnir Memory Optimizations at UNT

  35. Gleipnir Memory Optimizations at UNT

  36. Gleipnir How can we use Gleipnir? Explore different data layouts and their impact on cache accesses. Memory Optimizations at UNT

  37. Gleipnir Standard layout Memory Optimizations at UNT

  38. Gleipnir Tiled matrices Memory Optimizations at UNT

  39. Gleipnir Matrices A and C combined Memory Optimizations at UNT
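
One way to read the "A and C combined" layout at the source level is to interleave the two matrices element by element, so a row of A and the row of C it produces share cache lines during the multiply. The sketch below follows that interpretation; it is not necessarily the exact layout evaluated with Gleipnir.

    #include <stdio.h>

    #define N 64

    /* Each element of A sits next to the corresponding element of C,
       so a row of A and the row of C it produces share cache lines. */
    struct ac { double a, c; };

    static struct ac AC[N][N];
    static double    B[N][N];

    static void matmul(void)
    {
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++) {
                double aik = AC[i][k].a;
                for (int j = 0; j < N; j++)
                    AC[i][j].c += aik * B[k][j];
            }
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                AC[i][j].a = 1.0;
                AC[i][j].c = 0.0;
                B[i][j]    = 1.0;
            }
        matmul();
        printf("C[0][0] = %f\n", AC[0][0].c);   /* should print 64 */
        return 0;
    }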

  40. Further Research • Restructuring memory allocation – currently in progress • Analyze cache set conflicts and relate them to data objects • Modify the data placement of these objects • Reorder variables, include dummy variables, … (see the padding sketch below) • Restructure code to improve data access patterns (SLO tool) • Loop fusion – combine loops that use the same data • Loop tiling – split loops into smaller loops to limit the data accessed • Similar techniques to ensure “common” data resides in L-2 (shared caches) • Similar techniques so that data is transferred to GPUs infrequently Memory Optimizations at UNT
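
As an example of "include dummy variables": when two power-of-two-sized arrays map to the same cache sets, a dummy padding variable between them shifts one array by a cache line. The sizes below are illustrative, and in practice the placement decisions would come from the profiled conflict analysis rather than hand-inserted padding.

    #include <stdio.h>

    #define N 4096              /* 4096 doubles = 32 KB, a power of two */

    /* Keeping both arrays in one struct fixes their relative placement;
       the dummy pad shifts Y by one 64-byte line so that X[i] and Y[i]
       no longer contend for the same set in a small cache.            */
    struct data {
        double X[N];
        char   pad[64];         /* dummy padding variable               */
        double Y[N];
    };

    static struct data d;

    static double dot(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            sum += d.X[i] * d.Y[i];
        return sum;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) { d.X[i] = 1.0; d.Y[i] = 2.0; }
        printf("%f\n", dot());  /* 8192.0 */
        return 0;
    }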

  41. Code Refactoring Loop Tiling Idea Too much data is accessed in the loop: double sum(…) { … for(int i=0; i<len; i++) result += X[i]; /* all cache misses occur here */ … } Memory Optimizations at UNT
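
The tiling idea can be sketched on a variant of the slide's sum loop. We add an outer repetition (purely an assumption, to make the reuse visible) and then restructure the traversal so that all passes over one tile of X complete before the next tile is touched.

    #include <stdio.h>

    #define LEN   (1 << 20)     /* 1M doubles: far larger than L-1      */
    #define STEPS 8             /* illustrative repeated passes over X  */
    #define TILE  1024          /* tile sized to fit comfortably in L-1 */

    static double X[LEN];

    /* Original structure: every pass streams all of X through the
       cache, so each access misses once LEN exceeds the cache.       */
    static double sum_untiled(void)
    {
        double result = 0.0;
        for (int s = 0; s < STEPS; s++)
            for (int i = 0; i < LEN; i++)
                result += X[i];
        return result;
    }

    /* Tiled structure: all STEPS passes over one tile finish before
       the next tile is touched, so each tile is loaded once and then
       reused while it is still cached.                               */
    static double sum_tiled(void)
    {
        double result = 0.0;
        for (int i0 = 0; i0 < LEN; i0 += TILE)
            for (int s = 0; s < STEPS; s++)
                for (int i = i0; i < i0 + TILE; i++)
                    result += X[i];
        return result;
    }

    int main(void)
    {
        for (int i = 0; i < LEN; i++) X[i] = 1.0;
        printf("%f %f\n", sum_untiled(), sum_tiled());  /* both 8388608 */
        return 0;
    }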

  42. Code Refactoring Loop Fusion Idea double inproduct(…) { … for(int i=0; i<len; i++) result += X[i]*Y[i]; /* previous use of X[i] occurs here */ … } double sum(…) { … for(int i=0; i<len; i++) result += X[i]; /* all cache misses occur here */ … } Memory Optimizations at UNT
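
A fused sketch of the slide's two loops, so that each X[i] feeds both the inner product and the sum while it is still in cache; the signatures are our own simplification of the elided code.

    #include <stdio.h>

    /* Fused version of inproduct() and sum(): X is streamed through
       the cache once instead of twice.                               */
    static void inproduct_and_sum(const double *X, const double *Y,
                                  int len, double *inprod, double *sum)
    {
        double ip = 0.0, s = 0.0;
        for (int i = 0; i < len; i++) {
            ip += X[i] * Y[i];   /* first use of X[i] ...                */
            s  += X[i];          /* ... reused immediately, still cached */
        }
        *inprod = ip;
        *sum    = s;
    }

    int main(void)
    {
        double X[4] = { 1, 2, 3, 4 }, Y[4] = { 1, 1, 1, 1 }, ip, s;
        inproduct_and_sum(X, Y, 4, &ip, &s);
        printf("inproduct=%f sum=%f\n", ip, s);   /* 10 and 10 */
        return 0;
    }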

  43. SLO Tool double inproduct(…) { … for(int i=0; i<len; i++) result += X[i]*Y[i]; … } double sum(…) { … for(int i=0; i<len; i++) result += X[i]; … } Memory Optimizations at UNT

  44. Extensions Planned Key factors influencing code and data refactoring: Reuse distance – reducing reuse distance improves data utilization; it can also be used with CPU-GPU configurations; fuse loops so that all computations using the “same” data are grouped. Conflict sets and conflict distances – the set of variables that fall into the same cache line (or group of lines), and the conflicts between pairs of conflicting variables; the goal is to increase the conflict distance. Memory Optimizations at UNT
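
As a concrete definition, the reuse distance of an access is the number of distinct cache lines touched since the previous access to the same line. A deliberately naive sketch for a single access over a tiny trace is shown below; real tools (SLO, or analyses built on Gleipnir traces) use far more efficient data structures.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE 32

    /* Reuse distance of access t: distinct lines touched since the
       previous access to the same line, or -1 on a first access.    */
    static int reuse_distance(const uint64_t *trace, int t)
    {
        uint64_t line = trace[t] / LINE_SIZE;
        for (int p = t - 1; p >= 0; p--) {
            if (trace[p] / LINE_SIZE != line)
                continue;
            int distinct = 0;
            for (int i = p + 1; i < t; i++) {
                uint64_t li = trace[i] / LINE_SIZE;
                bool seen = false;
                for (int j = p + 1; j < i; j++)
                    if (trace[j] / LINE_SIZE == li) { seen = true; break; }
                if (!seen && li != line)
                    distinct++;
            }
            return distinct;
        }
        return -1;
    }

    int main(void)
    {
        uint64_t trace[] = { 0x000, 0x100, 0x200, 0x000 };
        printf("%d\n", reuse_distance(trace, 3));  /* 2: lines 0x100, 0x200 */
        return 0;
    }

Loop fusion and tiling both work by shrinking these distances until the reused data fits in the cache.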

  45. Further Research We are currently investigating several of these ideas: using architectural simulators such as Simics to explore multiple decoders with multiple threads, cores, or different data types; further extending Gleipnir and exploring its use with compilers and with other tools such as SLO, and evaluating the effectiveness of custom allocators; and some hardware implementations of memory management using FPGAs. And we welcome collaborations. Memory Optimizations at UNT

  46. The End Questions? More information and papers at http://csrl.cse.unt.edu/~kavi Memory Optimizations at UNT

  47. Custom Memory Allocators Consider a typical pointer-chasing program: struct node { int key; … data; /* complex data part */ struct node *next; }; We will explore two possibilities: pool allocation and split structures. Memory Optimizations at UNT

  48. Custom Memory Allocators • Pool Allocator (Illinois) (Figure: an ordinary heap interleaves objects of data type A and data type B; a pool-allocated heap groups all type-A objects together, followed by all type-B objects.) Memory Optimizations at UNT
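
A minimal sketch of the pool idea follows; the allocator below is our own simplification, not the Illinois pool allocator's actual API. Every object of a given type is carved from one contiguous pool, so same-type objects pack into the same cache lines instead of being scattered across the general heap.

    #include <stdio.h>
    #include <stdlib.h>

    /* One pool per data type: a contiguous slab carved out
       sequentially, so objects of that type end up adjacent.        */
    struct pool {
        char  *base;
        size_t used, size;
    };

    static void pool_init(struct pool *p, size_t size)
    {
        p->base = malloc(size);
        p->used = 0;
        p->size = size;
    }

    static void *pool_alloc(struct pool *p, size_t n)
    {
        if (p->used + n > p->size)
            return NULL;            /* a real allocator would grow     */
        void *obj = p->base + p->used;
        p->used += n;
        return obj;
    }

    struct node { int key; struct node *next; };

    int main(void)
    {
        struct pool node_pool;
        pool_init(&node_pool, 1 << 16);

        /* List nodes now land back to back instead of interleaving
           with allocations of other types on the heap.              */
        struct node *head = NULL;
        for (int k = 0; k < 4; k++) {
            struct node *n = pool_alloc(&node_pool, sizeof *n);
            n->key  = k;
            n->next = head;
            head    = n;
        }
        for (struct node *n = head; n; n = n->next)
            printf("%d ", n->key);  /* 3 2 1 0 */
        printf("\n");

        free(node_pool.base);
        return 0;
    }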

  49. Custom Memory Allocators • Further Optimization Consider a typical pointer-chasing program over the original node: struct node { int key; … data; /* complex data part */ struct node *next; }; Now consider a different definition of the node that splits out the data: struct node { int key; struct node *next; data_node *data_ptr; }; The data part is accessed only if the key matches: while (…) { if (h->key == k) return h->data; h = h->next; } (Figure: a chain of small { key, *next, *data_ptr } nodes, each pointing to its own data_node.) Memory Optimizations at UNT

  50. Custom Memory Allocators • Profiling (UNT) • Using data profiling, “flatten” dynamic data into consecutive blocks • Make linked lists look like arrays! Memory Optimizations at UNT
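
One way to read "make linked lists look like arrays": once profiling identifies a hot list, copy its nodes into one contiguous block in traversal order and rebuild the next pointers, so pointer chasing becomes a sequential sweep. The sketch below does a one-shot copy, which is a simplification of the profile-driven allocation described on the slide.

    #include <stdio.h>
    #include <stdlib.h>

    struct node { int key; struct node *next; };

    /* Copy a list into one contiguous block, in traversal order, and
       rebuild the next pointers inside that block.                   */
    static struct node *flatten(struct node *head, size_t count)
    {
        struct node *flat = malloc(count * sizeof *flat);
        size_t i = 0;
        for (struct node *n = head; n && i < count; n = n->next, i++) {
            flat[i] = *n;                          /* copy the payload */
            flat[i].next = (i + 1 < count) ? &flat[i + 1] : NULL;
        }
        return flat;
    }

    int main(void)
    {
        /* Build a small list the usual way: nodes scattered on the heap. */
        struct node *head = NULL;
        for (int k = 3; k >= 0; k--) {
            struct node *n = malloc(sizeof *n);
            n->key  = k;
            n->next = head;
            head    = n;
        }

        struct node *flat = flatten(head, 4);
        for (struct node *n = flat; n; n = n->next)
            printf("%d ", n->key);                 /* 0 1 2 3, now adjacent */
        printf("\n");

        for (struct node *n = head; n; ) {         /* free the originals */
            struct node *t = n->next;
            free(n);
            n = t;
        }
        free(flat);
        return 0;
    }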
