Designing On-chip Memory Systems for Throughput Architectures Ph.D. Proposal Jeff Diamond Advisor: Stephen Keckler
Turning to Heterogeneous Chips • “We'll be seeing a lot more than 2-4 cores per chip really quickly” – Bill Mark, 2005 • AMD – Trinity • nVIDIA Tegra 3 • Intel – Ivy Bridge
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Architectural Enhancements • Thread Scheduling • Cache Policies • Methodology • Proposed Work
Throughput Architectures (TA) • Key Features: • Use explicit parallelism to break the application into many threads • Optimize hardware for performance density, not single-thread performance • Benefits: • Drop voltage and peak frequency → quadratic improvement in power efficiency • Cores are smaller and more energy efficient • Less need for out-of-order execution, register renaming, branch prediction, fast synchronization, or low-latency ALUs • Further economize by multithreading each core • Amortize expense using SIMD
Scope – Highly Threaded TA • Architecture Continuum: • Multithreading • Large number of threads mask long latency • Small amount of cache primarily for bandwidth • Caching • Large amounts of cache to reduce latency • Small number of threads • Can we get benefits of both? Power 7 4 threads/core ~1MB/thread SPARC T4 8 threads/core ~80KB/thread GTX 580 48 threads/core ~2KB/thread
Problem - Technology Mismatch • Computation is cheap, data movement is expensive • Hit in L1 cache, 2.5x power of 64-bit FMADD • Move across chip, 50x power • Fetch from DRAM, 320x power • Exponential growth in cores saturates off-chip bandwidth • Performance capped • Latency to off-chip DRAM now hundreds of cycles • Need hundreds of threads in flight to cover latency
The Downward Spiral • Little's Law: the number of threads needed is proportional to average latency • On-chip resources have an opportunity cost: • Thread contexts • In-flight memory accesses • Too many threads – negative feedback: • Adding threads to cover latency increases latency • Slower register access, thread scheduling • Reduced locality • Reduces bandwidth and DRAM efficiency • Reduces effectiveness of caching • Parallel starvation
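A rough formalization of Little's Law in this setting (notation mine, chosen to match the model introduced later; assumes roughly one outstanding instruction per thread):

\[
N_T \;=\; \text{IPC}_{\text{chip}} \times L_{\text{avg}}
\]

so any growth in average latency must be bought back with proportionally more threads in flight.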
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Architectural Enhancements • Thread Scheduling • Cache Policies • Methodology • Proposed Work
Goal: Increase Parallel Efficiency • Problem: too many threads! • Increase parallel efficiency, i.e., reduce the number of threads needed to achieve a given level of performance • Improves throughput performance • Apply low-latency caches • Leverage the upward spiral • Difficult to mix multithreading and caching • Caches typically used just for bandwidth amplification • Important factors: • Thread scheduling • Instruction scheduling (per-thread parallelism)
Contributions • Quantifying the impact of single thread performance on throughput performance • Developing a mathematical analysis of throughput performance • Building a novel hybrid-trace based simulation infrastructure • Demonstrating unique architectural enhancements in thread scheduling and cache policies
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Architectural Enhancements • Thread Scheduling • Cache Policies • Methodology • Proposed Work
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work
Mathematical Analysis • Why take a mathematical approach? • Be very precise about what we want to optimize • Understand throughput performance's relationship and sensitivity to: • Single-thread performance • Cache improvements • Application characteristics • Suggest the most fruitful architectural improvements
Modeling Throughput Performance • N_T = total active threads • P_chip = total throughput performance • P_ST = single-thread performance • L_avg = average latency per instruction • Power_chip = E_avg (Joules per instruction) × P_chip
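A minimal sketch of how these quantities combine; the slide states only the power relation, and the first two identities are my reading of the standard throughput decomposition:

\[
P_{\text{chip}} \;=\; N_T \times P_{ST},
\qquad
P_{ST} \;=\; \frac{1}{L_{\text{avg}}},
\qquad
\text{Power}_{\text{chip}} \;=\; E_{\text{avg}} \times P_{\text{chip}}
\]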
Cache As A Performance Unit • FMADD: area ≈ 2-11KB SRAM or 8-40KB eDRAM; shared through pipelining; active power 20 pJ/op; leakage power 1 W/mm² • SRAM/eDRAM: active power 50 pJ per L1 access, 1.1 nJ per L2 access; leakage power 70 mW/mm² (SRAM ≈ 1.4 W/MB, eDRAM ≈ 350 mW/MB) • Caches make loads 150x faster and 300x more energy efficient, and use 10-15x less power/mm² than FPUs • One FPU ≈ 64KB SRAM / 256KB eDRAM in area • Key: how much cache does a thread need?
Performance From Caching • Ignore changes to DRAM latency & off-chip BW (we will simulate these) • Assume ideal caches • What is the maximum performance benefit? • N_T = total active threads on chip • A = arithmetic intensity of application (fraction of non-memory instructions) • M = 1 − A = memory intensity • L = average latency per instruction • For power, replace L with E, the average energy per instruction: qualitatively identical, but the differences are more dramatic
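A hedged sketch of the implied latency model under an ideal cache with hit rate h, ALU latency L_ALU, cache-hit latency L_hit, and DRAM latency L_mem (these symbols are mine; the proposal's exact formula is not reproduced on this slide):

\[
L_{\text{avg}} \;=\; A\,L_{\text{ALU}} \;+\; M\bigl(h\,L_{\text{hit}} + (1-h)\,L_{\text{mem}}\bigr),
\qquad
P_{\text{chip}} \;=\; \frac{N_T}{L_{\text{avg}}}
\]

Substituting energies E for latencies L gives the power-side version noted above.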
Ideal Cache = Frequency Cache • Hit rate depends on amount of cache, application working set • Store items used the most times • This is the concept of “frequency” • Once we know an application’s memory access characteristics, we can model throughput performance
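One way to formalize the “frequency cache” idea (notation mine): if line i of the working set is referenced f_i times, with lines sorted by decreasing f_i, an ideal cache holding C lines achieves

\[
h(C) \;=\; \frac{\sum_{i=1}^{C} f_i}{\sum_{i=1}^{W} f_i}
\]

where W is the number of distinct lines; measuring the f_i distribution per application is what lets the model predict throughput from cache size.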
Performance Per Thread [figure] • P_S(t) is a steep reciprocal
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work
Valley – Annotated [figure: annotated throughput curve showing the cache regime, the valley and its width, and the MT regime, with and without cache]
Prior Work • Hong et al., 2009, 2010 • Simple, cacheless GPU models • Used to predict the “MT peak” • Guz et al., 2008, 2010 • Graphed throughput performance with an assumed cache profile • Identified the “valley” structure • Validated against PARSEC benchmarks • No mathematical analysis • Did not analyze the bandwidth-limited regime • Focused on CMP benchmarks • Galal et al., 2011 • Excellent mathematical analysis • Focused on FPU + register design
Valley – Annotated [figure: annotated throughput curve showing the cache regime, the valley and its width, and the MT regime, with and without cache]
Energy vs. Latency [figure] * Source: Bill Dally, IPDPS Keynote, 2011
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work
Thread Throttling • We have real-time information: • Arithmetic intensity • Bandwidth utilization • Current hit rate • Can match these against an approximate/conservative locality profile • Approximate the optimum operating points • Shut down / activate threads to increase performance • Concentrate power and overclock
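The following is an illustrative sketch only of what such a controller could look like; the counters come from this slide, but the thresholds, step size, and decision rule are assumptions, not the proposal's actual policy:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative sketch: a periodic throttling controller driven by the
// run-time counters named on this slide. Thresholds and step size are assumed.
struct Counters {
    double arithmetic_intensity;  // fraction of non-memory instructions
    double bw_utilization;        // fraction of peak off-chip bandwidth in use
    double hit_rate;              // current cache hit rate
};

class ThreadThrottler {
public:
    explicit ThreadThrottler(uint32_t max_threads)
        : active_(max_threads), max_(max_threads) {}

    // Called once per sampling interval; returns the new active thread count.
    uint32_t Update(const Counters& c) {
        if (c.bw_utilization > 0.95) {
            // Off-chip bandwidth saturated: extra threads only add latency.
            active_ = std::max<uint32_t>(active_ / 2, kMinThreads);
        } else if (c.hit_rate < kHitTarget) {
            // Working sets overflow the cache: back off toward the cache regime.
            active_ = std::max<uint32_t>(active_ - kStep, kMinThreads);
        } else if (c.bw_utilization < 0.80) {
            // Latency not fully covered and bandwidth available: add threads.
            active_ = std::min<uint32_t>(active_ + kStep, max_);
        }
        return active_;  // the scheduler activates/parks contexts to match
    }

private:
    static constexpr uint32_t kMinThreads = 32;   // assumed floor
    static constexpr uint32_t kStep       = 32;   // assumed granularity
    static constexpr double   kHitTarget  = 0.5;  // assumed target hit rate
    uint32_t active_, max_;
};
```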
Prior Work • Several studies in the CMP and GPU areas scale back threads: • CMP – when miss rates get too high • GPU – when off-chip bandwidth is saturated • Prior attempts were simple and unidirectional • We have two complex operating points to hit and three different operating regimes • Mathematical analysis lets us approximate both points with as few as two samples • Both off-chip bandwidth and 1/hit-rate are nearly linear for a wide range of applications
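A sketch of how two samples could suffice, given the near-linearity stated above (notation mine): sample the hit rate at two thread counts, (N_1, H_1) and (N_2, H_2), and fit

\[
\frac{1}{H(N)} \;\approx\; a + bN,
\qquad
b = \frac{1/H_2 - 1/H_1}{N_2 - N_1},
\qquad
a = \frac{1}{H_1} - b\,N_1,
\]

then do the same for bandwidth utilization versus N; the two fitted curves locate the cache-regime and bandwidth-saturation operating points respectively.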
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology • Proposed Work
Mathematical Analysis • The analysis implies the cache should behave like an LFU cache • LFU is hard to implement in practice • Still very little cache per thread • Policies make big differences for small caches • Associativity is a big issue • Cannot cache every line referenced • Go beyond “dead line” prediction: stream lines with lower reuse past the cache
Cache Conflict Misses • Different addresses map to the same set • Programmers prefer power-of-2 array sizes • Power-of-2 strides are pathological • A prime number of banks/sets is thought ideal, but has no efficient implementation • Mersenne numbers (2^n − 1: 3, 7, 15, 31, 63, 127, 255) are not so convenient – few are prime (3, 7, 31, 127) • An early paper on prime strides for vector computers showed a 3x speedup • Kharbutli, HPCA 04 – showed prime-sets hashing works well for caches • Odd numbers of sets work as well, with the fastest implementation of div/mod • A “silver bullet”: e.g., allowed 1/4 the banks at the same conflict rate
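A minimal sketch of odd/prime set indexing, assuming 127 sets (the Mersenne prime 2^7 − 1) and 64-byte lines; the sizes and the shift-add folding trick are illustrative assumptions, not the proposal's implementation:

```cpp
#include <cstdint>

// Index a cache with an odd/prime number of sets instead of a power of two,
// so power-of-two strides no longer collapse onto a single set.
constexpr uint32_t kNumSets   = 127;  // Mersenne prime 2^7 - 1 (assumed)
constexpr uint32_t kLineBytes = 64;   // assumed line size

// Fast x mod (2^k - 1) by folding high bits into low bits -- no divider needed.
inline uint32_t mod_mersenne(uint64_t x, uint32_t k, uint32_t m /* = 2^k - 1 */) {
    while (x > m) x = (x >> k) + (x & m);
    return (x == m) ? 0 : static_cast<uint32_t>(x);
}

inline uint32_t set_index(uint64_t addr) {
    uint64_t line = addr / kLineBytes;       // cache line number
    return mod_mersenne(line, 7, kNumSets);  // set = line mod 127
}
```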
Early Study using PARSEC [figure: PARSEC L2 cache behavior with 64 threads]
(Re)placement Policies • Not all data should be cached • Recent papers target LLC caches • Hard-drive cache algorithms favor frequency over recency • Frequency is hard to implement; ARC is a good compromise • Direct-mapped replacement dominates • Look for explicit approaches: • Priority classes • Epochs
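A hypothetical sketch of what explicit, priority-class placement with epochs could look like; the class names, threshold rule, and constants are my assumptions for illustration, not the proposal's design:

```cpp
#include <cstdint>

// Each load carries a priority class (e.g., assigned per data structure); the
// cache only inserts missing lines whose class meets the current epoch's
// threshold, streaming the rest past the cache.
enum class Priority : uint8_t { Streaming = 0, Low = 1, Normal = 2, Critical = 3 };

struct EpochState {
    Priority insert_threshold = Priority::Normal;  // raised/lowered each epoch
    uint64_t epoch_hits = 0, epoch_misses = 0;
};

// Decide whether a missing line is inserted or bypasses the cache.
inline bool ShouldInsert(const EpochState& s, Priority p) {
    return static_cast<uint8_t>(p) >= static_cast<uint8_t>(s.insert_threshold);
}

// At an epoch boundary, adapt the threshold from the observed hit rate.
inline void EndEpoch(EpochState& s) {
    double hit_rate = static_cast<double>(s.epoch_hits) /
                      static_cast<double>(s.epoch_hits + s.epoch_misses + 1);
    if (hit_rate < 0.3 && s.insert_threshold < Priority::Critical)
        s.insert_threshold =
            static_cast<Priority>(static_cast<uint8_t>(s.insert_threshold) + 1);
    else if (hit_rate > 0.7 && s.insert_threshold > Priority::Streaming)
        s.insert_threshold =
            static_cast<Priority>(static_cast<uint8_t>(s.insert_threshold) - 1);
    s.epoch_hits = s.epoch_misses = 0;
}
```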
Prior Work • Belady – solved it all, but light on implementation details • Three hierarchies of methods; the best one utilized information about prior line usage • Approximations: • ARC cache – ghost entries, recency and frequency groups • Generational caches, multi-queue • Qureshi, 2006, 2007 – adaptive insertion policies
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology (Applications, Simulation) • Proposed Work
Benchmarks • Initially studied regular HPC kernels/applications in CMP environment • Dense Matrix Multiply • Fast Fourier Transform • Homme weather simulation • Added CUDA throughput benchmarks • Parboil – old school MPI, coarse grained • Rodinia – fine grained, varied • Benchmarks typical of historical GPGPU applications • Will add irregular benchmarks • SparseMM, Adaptive Finite Elements, Photon mapping
Preliminary Results • Most of the benchmarks should benefit: • Small working sets • Concentrated working sets • Hit rate curves easy to predict
Hybrid Simulator Design • Toolchain: C++/CUDA source → NVCC → PTX intermediate assembly listing → modified Ocelot functional simulator with a custom trace module → compressed trace data (dynamic trace blocks with attachment points) → custom simulator • Simulate a different architecture than the one traced • Goals: fast simulation; overcome compiler issues for a reasonable base case
Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology (Applications, Simulation) • Proposed Work
Phase 1 – HPC Applications • Looked at GEMM, FFT & Homme in CMP setting • Learned implementation algorithms, alternative algorithms • Expertise allows for credible throughput analysis • Valuable Lessons in multithreading and caching • Dense Matrix Multiply • Blocking to maximize arithmetic intensity • Enough contexts to cover latency • Fast Fourier Transform • Pathologically hard on memory system • Communication & synchronization • HOMME – weather modeling • Intra-chip scaling incredibly difficult • Memory system performance variation • Replacing data movement with computation • First author publications: • PPoPP 2008, ISPASS 2011 (Best Paper)
Phase 2 – Benchmark Characterization • Memory Access Characteristics of Rodinia and Parboil benchmarks • Apply Mathematical Analysis • Validate model • Find optimum operating points for benchmarks • Find optimum TA topology for benchmarks • NEARLY COMPLETE
Phase 3 – Evaluate Enhancements • Automatic Thread Throttling • Low latency hierarchical cache • Benefits of odd-sets/odd-banking • Benefits of explicit placement (Priority/Epoch) • NEED FINAL EVALUATION and explicit placement study
Final Phase – Extend Domain • Study regular HPC applications in throughput setting • Add at least two irregular benchmarks • Less likely to benefit from caching • New opportunities for enhancement • Explore impact of future TA topologies • Memory Cubes, TSV DRAM, etc.
Proposed Timeline • Phase 1 – HPC applications – completed • Phase 2 – Mathematical model & benchmark characterization – May–June • Phase 3 – Architectural enhancements – July–August • Phase 4 – Domain enhancement / new features – September–November
Conclusion • Dissertation Goals: • Quantify the degree to which single-thread performance affects throughput performance for an important class of applications • Improve parallel efficiency through thread scheduling, cache topology, and cache policies • Feasibility: • Regular benchmarks show promising memory behavior • Cycle-accurate simulator nearly complete