
Software and Hardware Support for Locality Aware High Performance Computing


Presentation Transcript


  1. Software and Hardware Support for Locality Aware High Performance Computing Xiaodong Zhang National Science Foundation College of William and Mary This talk does not necessarily reflect NSF's official opinions.

  2. Acknowledgement • Participants of the project • David Bryan, Jefferson Labs (DOE) • Stefan Kubricht, Vsys Inc. • Song Jiang and Zhichun Zhu, William and Mary • Li Xiao, Michigan State University. • Yong Yan, HP Labs. • Zhao Zhang, Iowa State University. • Sponsors of the project • Air Force Office of Scientific Research • National Science Foundation • Sun Microsystems Inc.

  3. CPU-DRAM Gap [chart: processor performance improves about 60% per year, DRAM speed about 7% per year, and the gap widens about 50% per year]

  4. Cache Miss Penalty • A cache miss = executing hundreds of CPU instructions (thousands in the future). • 2 GHz, 2.5 avg. issue rate: issue 350 instructions in 70 ns access latency. • A small cache miss rate ⇒ a high memory stall time in total execution time. • On average, 62% memory stall time for SPEC2000.
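
A quick sanity check of the arithmetic on this slide (a minimal sketch; the 2 GHz clock, 2.5 issue rate, and 70 ns latency are the slide's example figures, not measurements):

```c
#include <stdio.h>

int main(void) {
    double clock_ghz  = 2.0;    /* 2 GHz processor                 */
    double issue_rate = 2.5;    /* average instructions per cycle  */
    double latency_ns = 70.0;   /* memory access latency           */

    /* cycles spent waiting = latency (ns) * clock rate (GHz) */
    double stall_cycles = latency_ns * clock_ghz;
    /* instructions that could have issued in that time       */
    double lost_instructions = stall_cycles * issue_rate;

    printf("stall cycles: %.0f, lost issue slots: %.0f\n",
           stall_cycles, lost_instructions);   /* 140 cycles, 350 instructions */
    return 0;
}
```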

  5. I/O Bottleneck is Much Worse • Disk access time is limited by mechanical delays. • A fast Seagate Cheetah X15 disk (15000 rpm): • average seek time: 3.9 ms, rotation latency: 2 ms • internal transfer time for a stripe unit (8KB): 0.16 ms • Total disk latency: 6.06 ms. • External transfer rate increases 40% per year. • from disk to DRAM: 160 MBps (UltraSCSI I/O bus) • To get 8KB from disk to DRAM takes 11.06 ms. • More than 22 million CPU cycles at 2 GHz!

  6. Memory Hierarchy with Multi-level Caching [diagram: CPU registers, TLB, L1, L2, L3, CPU-memory bus, DRAM row buffer, controller buffer, buffer cache, I/O bus and I/O controller, disk cache, disk; the levels are managed, from top to bottom, by algorithm implementation, the compiler, the microarchitecture, and the operating system]

  7. Other System Effects on Locality Locality exploitation is not guaranteed by the buffers! • Initial and runtime data placement. • static and dynamic data allocations, and interleaving. • Data replacement at different caching levels. • LRU is used but fails sometimes. • Locality-aware memory access scheduling. • reorder access sequences to use cached data.

  8. Outline • Cache optimization at the application level. • Designing fast and high-associativity caches. • Exploiting multiprocessor cache locality at runtime. • Exploiting locality in the DRAM row buffer. • Fine-grain memory access scheduling. • Efficient replacement in the buffer cache. • Conclusion.

  9. Application Software Effort: Algorithm Restructuring for Cache Optimization • Traditional algorithm design means: • giving a sequence of computing steps that minimizes CPU operations. • It ignores: • inherent parallelism and interactions (e.g. ILP, pipelining, and multiprogramming), • the memory hierarchy where data are laid out, and • the increasingly high data access cost.

  10. Mutually Adaptive Between Algorithms and Architecture • Restructuring commonly used algorithms • by effectively utilizing caches and the TLB, • minimizing cache and TLB misses. • A highly optimized application library is very useful. • Restructuring techniques • data blocking: grouping data in the cache for repeated use • data padding to avoid conflict misses • using registers as fast data buffers
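
To illustrate two of these techniques, here is a minimal sketch (not the project's library code) of a cache-blocked matrix transpose whose rows are padded so that a power-of-two dimension does not produce conflict misses; N, B, and PAD are made-up illustrative values:

```c
#include <stdio.h>

#define N    1024          /* logical matrix dimension (power of two: conflict-prone) */
#define PAD  8             /* extra columns to break the power-of-two stride          */
#define LDA  (N + PAD)     /* padded leading dimension                                */
#define B    32            /* tile edge chosen so two tiles fit comfortably in L1     */

static double a[N * LDA], t[N * LDA];

void transpose_blocked(void) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            /* work on one B x B tile so both a and t stay cache-resident */
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    t[j * LDA + i] = a[i * LDA + j];
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i * LDA + j] = i + 0.001 * j;
    transpose_blocked();
    printf("t[3][5] = %.3f\n", t[3 * LDA + 5]);   /* expect 5.003 */
    return 0;
}
```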

  11. Two Case Studies • Bit-Reversals: • basic operations in FFT and other applications • data layout and operations cause large conflict misses • Sorting: merge-, quick-, and insertion-sort. • TLB and cache misses are sensitive to the operations. • Our library outperforms systems approaches • We know exactly where to pad and block! • Usage of the two libraries (both are open source) • bit-reversals: an alternative in Sun's scientific library. • Sorting codes are used as a benchmark for testing compilers.
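
For reference, the unoptimized bit-reversal reordering that such a library tunes looks roughly as follows. This is a generic sketch of the operation (LOGN is a made-up size), not the padded and blocked library version; its large-stride swaps are exactly what causes the conflict misses mentioned above:

```c
#include <stdio.h>

#define LOGN 4
#define N    (1u << LOGN)

/* Reverse the low LOGN bits of x: e.g. for LOGN = 4, 0001 -> 1000. */
static unsigned bit_reverse(unsigned x) {
    unsigned r = 0;
    for (int i = 0; i < LOGN; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

int main(void) {
    double data[N];
    for (unsigned i = 0; i < N; i++) data[i] = (double)i;

    /* In-place bit-reversal permutation, as used to reorder FFT input. */
    for (unsigned i = 0; i < N; i++) {
        unsigned j = bit_reverse(i);
        if (j > i) { double tmp = data[i]; data[i] = data[j]; data[j] = tmp; }
    }
    printf("data[1] = %.0f\n", data[1]);   /* expect 8: index 0001 <-> 1000 */
    return 0;
}
```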

  12. Microarchitecture Effort: Exploit DRAM Row Buffer Locality • DRAM features: • High density and high capacity • Low cost but slow access (compared to SRAM) • Non-uniform access latency • The row buffer serves as a fast cache • the access patterns here have received little attention. • Reusing buffered data minimizes the DRAM latency.

  13. Locality Exploitation in Row Buffer [diagram: the memory hierarchy of slide 6, with the DRAM row buffer level highlighted]

  14. DRAM Access = Latency + Bandwidth Time [diagram: processor, bus, row buffer, and DRAM core; DRAM latency consists of precharge, row access, and column access, after which bus bandwidth time moves the data to the processor]

  15. Nonuniform DRAM Access Latency • Case 1: Row buffer hit (20+ ns): column access only. • Case 2: Row buffer miss, core is precharged (40+ ns): row access, then column access. • Case 3: Row buffer miss, not precharged (≈ 70 ns): precharge, row access, then column access. Row buffer misses come from a sequence of accesses to different pages in the same bank.
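
These three cases can be folded into a tiny latency model (a sketch using the approximate figures above, not datasheet numbers, and assuming for illustration that misses split evenly between the two miss cases):

```c
#include <stdio.h>

/* Approximate DRAM latencies (ns) for the three cases above (assumed figures). */
#define T_HIT        20.0   /* row buffer hit: column access only                 */
#define T_MISS_PRE   40.0   /* miss, bank precharged: row + column access         */
#define T_MISS_FULL  70.0   /* miss, not precharged: precharge + row + column     */

int main(void) {
    /* Expected latency as the row-buffer hit rate varies. */
    for (double hit = 0.0; hit <= 1.0; hit += 0.25) {
        double miss = (1.0 - hit) / 2.0;     /* share of each miss case */
        double t = hit * T_HIT + miss * (T_MISS_PRE + T_MISS_FULL);
        printf("hit rate %.2f -> average latency %.1f ns\n", hit, t);
    }
    return 0;
}
```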

  16. Amdahl’s Law applies in DRAM • Time (ns) to fetch a 128-byte cache block: [table] • As the bandwidth improves, DRAM latency will decide the cache miss penalty.
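
The missing table can be approximated with a simple model: time to fetch a 128-byte block = DRAM latency + 128 bytes / bus bandwidth. The sketch below assumes a 70 ns latency and a few illustrative bandwidth values; as bandwidth grows, the transfer term shrinks while the latency term stays, which is the point of this slide:

```c
#include <stdio.h>

int main(void) {
    double latency_ns = 70.0;                       /* DRAM access latency (assumed)   */
    double bw_gbps[]  = { 0.8, 1.6, 3.2, 6.4 };     /* bus bandwidths in GB/s (assumed) */
    int    block      = 128;                        /* cache block size in bytes        */

    for (int i = 0; i < 4; i++) {
        double transfer_ns = block / bw_gbps[i];    /* bytes / (GB/s) gives ns */
        printf("BW %.1f GB/s: transfer %6.1f ns, total %6.1f ns (latency share %4.1f%%)\n",
               bw_gbps[i], transfer_ns, latency_ns + transfer_ns,
               100.0 * latency_ns / (latency_ns + transfer_ns));
    }
    return 0;
}
```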

  17. Row Buffer Locality Benefit Objective: serve as many memory requests as possible without accessing the DRAM core. Reduces latency by up to 67%.

  18. Row Buffer Misses are Surprisingly High • Standard configuration • Conventional cache mapping • Page interleaving for DRAM memories • 32 DRAM banks, 2KB page size • SPEC95 and SPEC2000 • What is the reason behind this?

  19. Conventional Page Interleaving [diagram: pages 0-3 map to banks 0-3, pages 4-7 map to banks 0-3 again, and so on] Address format (high to low bits): page index (r bits) | bank (k bits) | page offset (p bits)
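
In code, the conventional mapping simply slices the physical address into fixed bit fields. A minimal sketch, using the example configuration above (32 banks, so k = 5, and 2 KB pages, so p = 11):

```c
#include <stdio.h>

#define P_BITS 11u                     /* 2 KB page size -> p = 11 offset bits */
#define K_BITS 5u                      /* 32 DRAM banks  -> k = 5 bank bits    */

/* Conventional page interleaving: addr = | page index | bank | page offset | */
static unsigned bank_of(unsigned long addr)      { return (addr >> P_BITS) & ((1u << K_BITS) - 1); }
static unsigned long page_of(unsigned long addr) { return addr >> (P_BITS + K_BITS); }

int main(void) {
    unsigned long addr = 0x12345678UL;
    printf("addr 0x%lx -> page index %lu, bank %u, offset %lu\n",
           addr, page_of(addr), bank_of(addr), addr & ((1u << P_BITS) - 1));
    return 0;
}
```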

  20. Address Mapping Symmetry • cache-conflicting: same cache index, different tags. • row-buffer conflicting: same bank index, different pages. • address mapping: the bank index bits fall inside the cache set index bits. • Property: for any x, y, if x and y conflict in the cache, they also conflict in the row buffer. Address formats: page: page index (r) | bank (k) | page offset (p); cache: cache tag (t) | cache set index (s) | block offset (b)

  21. Sources of Misses • Symmetry: invariance in results under transformations. • Address mapping symmetry propagates conflicts from the cache address space to the memory address space: • cache-conflicting addresses are also row-buffer-conflicting addresses • a cache write-back address conflicts with the address of the block to be fetched in the row buffer. • Cache conflict misses are also row-buffer conflict misses.

  22. Breaking the Symmetry by Permutation-based Page Interleaving [diagram: the k bank-index bits of the address are XOR-ed with k bits of the L2 cache tag to form a new bank index; the page index and page offset are unchanged]
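
A sketch of the permutation (the cache geometry, i.e. where the L2 tag starts, is an assumed value): the k bank-index bits are XOR-ed with k bits taken from the L2 tag, so cache-conflicting addresses (same set index, different tags) are spread over different banks, while addresses within one page, which share the same tag and bank bits, keep a single bank.

```c
#include <stdio.h>

#define P_BITS   11u    /* page offset bits (2 KB pages)                     */
#define K_BITS   5u     /* bank index bits (32 banks)                        */
#define TAG_LSB  19u    /* lowest bit of the L2 tag (assumed cache geometry) */

static unsigned conventional_bank(unsigned long addr) {
    return (addr >> P_BITS) & ((1u << K_BITS) - 1);
}

/* Permutation-based interleaving: XOR k tag bits into the bank index. */
static unsigned permuted_bank(unsigned long addr) {
    unsigned bank = conventional_bank(addr);
    unsigned tag  = (addr >> TAG_LSB) & ((1u << K_BITS) - 1);
    return bank ^ tag;
}

int main(void) {
    /* Two addresses that map to the same L2 set (they differ only in tag bits). */
    unsigned long a = 0x00012800UL;
    unsigned long b = a + (1UL << TAG_LSB);   /* same set index, next tag */

    printf("conventional: bank(a)=%u bank(b)=%u (conflict)\n",
           conventional_bank(a), conventional_bank(b));
    printf("permuted:     bank(a)=%u bank(b)=%u (spread out)\n",
           permuted_bank(a), permuted_bank(b));
    return 0;
}
```

With these assumed field positions, the two addresses in main() collide in both the cache and the row buffer under conventional interleaving, but land in different banks after the XOR.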

  23. Permutation Property (1) • Conflicting addresses are distributed onto different banks. [diagram: L2-conflicting addresses that share one bank index under conventional interleaving receive different bank indexes after the XOR with their distinct tag bits]

  24. Permutation Property (2) • The spatial locality of memory references is preserved. [diagram: addresses within one page share the same tag and bank bits, so they keep the same bank index under both schemes]

  25. Permutation Property (3) • Pages are uniformly mapped onto ALL memory banks. [diagram: within each cache-size (C) region, pages at offsets 0..3 in units of the page size P are assigned to banks 0-3 under a different permutation, so every bank receives an equal share of pages]

  26. Row-buffer Miss Rates

  27. Comparison of Memory Stall Time

  28. Improvement of IPC

  29. Where to Break the Symmetry? • Breaking the symmetry at the bottom level (the DRAM address) is most effective: • Far away from the critical path (little overhead) • Reduces both address conflicts and write-back conflicts. • Our experiments confirm this (30% difference).

  30. System Software Effort: Efficient Buffer Cache Replacement • The buffer cache borrows a variable-size space in DRAM. • Accessing I/O data in the buffer cache is about a million times faster than on disk. • The performance of data-intensive applications relies on exploiting locality in the buffer cache. • Buffer cache replacement is a key factor.

  31. Locality Exploitation in Buffer Cache [diagram: the memory hierarchy of slide 6, with the buffer cache level highlighted]

  32. The Problem of LRU Replacement Inability to cope with weak access locality • File scanning: one-time-accessed blocks are not replaced promptly; • Loop-like accesses: the blocks to be accessed soonest can unfortunately be replaced; • Accesses with distinct frequencies: frequently accessed blocks can unfortunately be replaced.
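
The loop case is easy to reproduce with a small simulation (a sketch with assumed sizes): an LRU-managed cache of 4 blocks serving a program that cycles over 5 blocks misses on every access, because the block needed next is always the one that was just evicted.

```c
#include <stdio.h>

#define CACHE 4        /* cache capacity in blocks (assumed) */
#define LOOP  5        /* the loop touches CACHE + 1 blocks  */

int main(void) {
    int cache[CACHE];                 /* cache[0] = LRU end, cache[used-1] = MRU end */
    int used = 0, hits = 0, misses = 0;

    for (int pass = 0; pass < 100; pass++) {
        for (int b = 0; b < LOOP; b++) {          /* loop-like access: 0,1,2,3,4,0,1,... */
            int pos = -1;
            for (int i = 0; i < used; i++) if (cache[i] == b) pos = i;
            if (pos >= 0) {                        /* hit: move to the MRU position */
                hits++;
                for (int i = pos; i < used - 1; i++) cache[i] = cache[i + 1];
                cache[used - 1] = b;
            } else {                               /* miss: evict the LRU block if full */
                misses++;
                if (used == CACHE) { for (int i = 0; i < used - 1; i++) cache[i] = cache[i + 1]; used--; }
                cache[used++] = b;
            }
        }
    }
    printf("hits %d, misses %d\n", hits, misses);  /* hits stay at 0 */
    return 0;
}
```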

  33. Reasons Why LRU Fails yet Remains Powerful • Why does LRU fail sometimes? • A recently used block will not necessarily be used again, or soon. • The prediction is based on a single source of information. • Why is it so widely used? • Simplicity: an easy and simple data structure. • Works well for accesses that follow the LRU assumption.

  34. Our Objectives and Contributions Significant efforts have been made to improve or replace LRU, but they are either • case by case, or • burdened with high runtime overhead. Our objectives: • Address the limits of LRU fundamentally. • Retain the low-overhead and strong-locality merits of LRU.

  35. Related Work • Aided by user-level hints • Application-hinted caching and prefetching [OSDI, SOSP, ...] • rely on users' understanding of data access patterns. • Detection and adaptation of access regularities • SEQ, EELRU, DEAR, AFC, UBM [OSDI, SIGMETRICS, ...] • case-by-case oriented approaches • Tracing and utilizing deeper history information • LRFU, LRU-k, 2Q [VLDB, SIGMETRICS, SIGMOD, ...] • high implementation cost and runtime overhead.

  36. Observation of Data Flow in LRU Stack • Blocks are ordered by recency in the LRU stack. • Blocks enter at the stack top and leave from its bottom. The stack is long and the bottom is the only exit: a block evicted from the bottom of the stack should have been evicted much earlier! [diagram: a long LRU stack holding blocks 5, 3, 2, 1, ..., 6]

  37. Inter-Reference Recency (IRR) IRR of a block: the number of other unique blocks accessed between two consecutive references to the block. Recency: the number of other unique blocks accessed from the last reference to the current time. Example access sequence: 1 2 3 4 3 1 5 6 5. For block 1, IRR = 3 (blocks 2, 3, 4 were accessed between its two references) and R = 2 (blocks 5, 6 have been accessed since its last reference).
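
The two definitions transcribe directly into code (a minimal sketch; the trace is the example sequence above, and the positions of block 1's references are hard-coded for brevity):

```c
#include <stdio.h>

/* Count distinct block numbers in trace[from..to) that differ from 'block'. */
static int distinct_others(const int *trace, int from, int to, int block) {
    int seen[64] = {0}, count = 0;
    for (int i = from; i < to; i++)
        if (trace[i] != block && !seen[trace[i]]) { seen[trace[i]] = 1; count++; }
    return count;
}

int main(void) {
    int trace[] = {1, 2, 3, 4, 3, 1, 5, 6, 5};    /* the slide's access sequence */
    int n = 9, block = 1;
    int first = 0, last = 5;                       /* positions of block 1's two references */

    /* IRR: other distinct blocks between two consecutive references to the block. */
    int irr = distinct_others(trace, first + 1, last, block);
    /* Recency: other distinct blocks from the last reference up to the current time. */
    int recency = distinct_others(trace, last + 1, n, block);

    printf("block %d: IRR = %d, recency = %d\n", block, irr, recency);   /* IRR = 3, R = 2 */
    return 0;
}
```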

  38. Basic Ideas of LIRS • A block with a high IRR is not likely to be frequently used. • High-IRR blocks are selected for replacement. • Recency is used as a second source of information. • LIRS: Low Inter-reference Recency Set algorithm • Keep low-IRR blocks in the buffer cache. • Foundations of LIRS: • effectively use multiple sources of access information. • responsively determine and change the status of each block. • low-cost implementation.

  39. Data Structure: Keep LIR Blocks in Cache Blocks are divided into the low-IRR (LIR) block set and the high-IRR (HIR) block set. [diagram: the physical cache of size L = Llirs + Lhirs holds the entire LIR block set (size Llirs) and the resident part of the HIR block set (size Lhirs)]

  40. Replacement Operations of LIRS Llirs = 2, Lhirs = 1. LIR block set = {A, B}, HIR block set = {C, D, E}. E becomes a resident HIR block, determined by its low recency.

  41. Which Block is Replaced? Replace an HIR Block. D is referenced at time 10; the resident HIR block E is replaced!

  42. How is the LIR Set Updated? LIR Block Recency is Used. HIR is a natural place for D, but this is not insightful.

  43. After D is Referenced at Time 10 D enters the LIR set, and B steps down to the HIR set, because D's IRR < Rmax of the LIR set.

  44. The Power of LIRS Replacement Capability to cope with weak access locality • File scanning: one-time-accessed blocks will be replaced promptly (due to their high IRRs); • Loop-like accesses: blocks to be accessed soonest will NOT be replaced (due to their low IRRs); • Accesses with distinct frequencies: frequently accessed blocks will NOT be replaced (dynamic status changes).

  45. LIRS Efficiency: O(1) The replacement decision compares the new IRR of an HIR block (IRRHIR) with Rmax, the maximum recency of the LIR blocks. Can O(LIRS) = O(LRU)? • Yes! This efficiency is achieved by our LIRS stack: • Both recencies and useful IRRs are automatically recorded. • Rmax of the block at the stack bottom is larger than the IRRs of the others. • No comparison operations are needed.

  46. LIRS Operations [diagram: a LIRS stack S and a small LRU stack Q for resident HIR blocks; cache size L = 5, Llir = 3, Lhir = 2] • Initialization: all referenced blocks are given LIR status until the LIR block set is full. • Resident HIR blocks are placed in a small LRU stack (Q). • Upon accessing an LIR block (a hit) • Upon accessing a resident HIR block (a hit) • Upon accessing a non-resident HIR block (a miss) The three cases are illustrated on the following slides; a simplified code sketch is given below.
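
Below is a simplified, self-contained sketch of these operations, written for this transcript rather than taken from the authors' implementation: it uses linear scans instead of the O(1) linked structures of the real algorithm, the trace in main() is an arbitrary made-up example, and the sizes Llir = 3, Lhir = 2 follow the example slides.

```c
#include <stdio.h>
#include <string.h>

#define NBLK   64                 /* largest block number handled          */
#define L_LIR  3                  /* cache slots for LIR blocks            */
#define L_HIR  2                  /* cache slots for resident HIR blocks   */

enum { HIR = 0, LIR = 1 };

static int S[1024], s_len = 0;    /* LIRS stack S: S[0] is the bottom      */
static int Q[1024], q_len = 0;    /* FIFO list Q of resident HIR blocks    */
static int status[NBLK];          /* LIR or HIR                            */
static int resident[NBLK];        /* 1 if the block is in the cache        */
static int lir_count = 0;

static int  in_S(int b)     { for (int i = 0; i < s_len; i++) if (S[i] == b) return 1; return 0; }
static void S_remove(int b) { int j = 0; for (int i = 0; i < s_len; i++) if (S[i] != b) S[j++] = S[i]; s_len = j; }
static void S_push(int b)   { S[s_len++] = b; }
static void Q_remove(int b) { int j = 0; for (int i = 0; i < q_len; i++) if (Q[i] != b) Q[j++] = Q[i]; q_len = j; }
static void Q_push(int b)   { Q[q_len++] = b; }

/* Stack pruning: drop HIR blocks from the bottom of S until an LIR block is at the bottom. */
static void prune(void) {
    while (s_len > 0 && status[S[0]] == HIR) { memmove(S, S + 1, --s_len * sizeof(int)); }
}

/* The LIR block at the bottom of S becomes a resident HIR block at the tail of Q. */
static void demote_bottom(void) {
    int b = S[0];
    status[b] = HIR; lir_count--;
    S_remove(b); Q_push(b);
    prune();
}

static void access_block(int b) {
    if (resident[b] && status[b] == LIR) {           /* hit on an LIR block             */
        S_remove(b); S_push(b); prune();
    } else if (resident[b]) {                        /* hit on a resident HIR block     */
        if (in_S(b)) {                               /* its new IRR < Rmax: promote     */
            S_remove(b); S_push(b);
            status[b] = LIR; lir_count++;
            Q_remove(b); demote_bottom();
        } else {                                     /* stays HIR: back of Q, top of S  */
            S_push(b); Q_remove(b); Q_push(b);
        }
    } else {                                         /* miss                            */
        if (lir_count < L_LIR && !in_S(b)) {         /* warm-up: fill the LIR set first */
            status[b] = LIR; lir_count++; resident[b] = 1; S_push(b); return;
        }
        if (q_len >= L_HIR) {                        /* cache full: evict the front of Q */
            int victim = Q[0]; Q_remove(victim); resident[victim] = 0;
        }
        resident[b] = 1;
        if (in_S(b)) {                               /* seen recently: promote to LIR   */
            S_remove(b); S_push(b);
            status[b] = LIR; lir_count++;
            demote_bottom();
        } else {                                     /* cold block: resident HIR        */
            status[b] = HIR; S_push(b); Q_push(b);
        }
    }
}

int main(void) {
    int trace[] = {1, 2, 3, 4, 5, 3, 5, 7, 9, 5};    /* an arbitrary example trace */
    for (int i = 0; i < 10; i++) access_block(trace[i]);

    printf("LIRS stack S (bottom to top):");
    for (int i = 0; i < s_len; i++) printf(" %d%s", S[i], status[S[i]] == LIR ? "(L)" : "(H)");
    printf("\nresident HIR queue Q (front to back):");
    for (int i = 0; i < q_len; i++) printf(" %d", Q[i]);
    printf("\n");
    return 0;
}
```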

  47. Access an LIR Block (a Hit) [diagram: the states of stack S and queue Q before and after accessing blocks 4 and 8; cache size L = 5, Llir = 3, Lhir = 2; LIR and resident HIR blocks are marked]

  48. Access an HIR Resident Block (a Hit) [diagram: the states of stack S and queue Q before and after accessing blocks 3 and 5]

  49. Access a Non-Resident HIR Block (a Miss) [diagram: the states of stack S and queue Q before and after accessing block 7]

  50. Access a Non-Resident HIR Block (a Miss) (Cont.) [diagram: the states of stack S and queue Q before and after accessing blocks 9 and 5]
