
Designing On-chip Memory Systems for Throughput Architectures



Presentation Transcript


  1. Designing On-chip Memory Systems for Throughput Architectures Ph.D. Proposal Jeff Diamond Advisor: Stephen Keckler

  2. Turning to Heterogeneous Chips “We’ll be seeing a lot more than 2-4 cores per chip really quickly” – Bill Mark, 2005 AMD Trinity, NVIDIA Tegra 3, Intel Ivy Bridge

  3. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Architectural Enhancements • Thread Scheduling • Cache Policies • Methodology • Proposed Work

  4. Throughput Architectures (TA) • Key Features: • Use explicit parallelism to break the application into threads • Optimize hardware for performance density, not single thread performance • Benefits: • Lower voltage and peak frequency for a quadratic improvement in power efficiency • Smaller, more energy-efficient cores • Less need for out-of-order execution, register renaming, branch prediction, fast synchronization, low-latency ALUs • Further economize by multithreading each core • Amortize expense using SIMD

  5. Scope – Highly Threaded TA • Architecture Continuum: • Multithreading: a large number of threads masks long latency; a small amount of cache, primarily for bandwidth • Caching: large amounts of cache to reduce latency; a small number of threads • Can we get the benefits of both? • POWER7: 4 threads/core, ~1MB/thread • SPARC T4: 8 threads/core, ~80KB/thread • GTX 580: 48 threads/core, ~2KB/thread

  6. Problem - Technology Mismatch • Computation is cheap, data movement is expensive • L1 cache hit: ~2.5x the energy of a 64-bit FMADD • Move across the chip: ~50x • Fetch from DRAM: ~320x • Exponential growth in cores saturates off-chip bandwidth • Performance capped • Latency to off-chip DRAM is now hundreds of cycles • Need hundreds of threads in flight to cover that latency
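  To make these ratios concrete, using the ~20pJ FMADD figure from slide 15 later in the deck (the arithmetic here is mine): an L1 hit costs roughly 2.5 x 20pJ ≈ 50pJ, a cross-chip move roughly 50 x 20pJ ≈ 1nJ, and a DRAM fetch roughly 320 x 20pJ ≈ 6.4nJ.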

  7. The Downward Spiral • Little’s Law • The number of threads needed is proportional to average latency • On-chip resources carry an opportunity cost • Thread contexts • In-flight memory accesses • Too many threads – negative feedback • Adding threads to cover latency increases latency • Slower register access, thread scheduling • Reduced locality • Reduces bandwidth and DRAM efficiency • Reduces effectiveness of caching • Parallel starvation
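  A worked instance of Little’s Law with illustrative numbers (the numbers are not from the slide): at a sustained demand of one memory request per cycle and an average memory latency of 400 cycles,

      in-flight requests = throughput x latency = 1 request/cycle x 400 cycles = 400

  so on the order of 400 threads must be runnable just to keep the memory system busy, before any contention effects.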

  8. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Architectural Enhancements • Thread Scheduling • Cache Policies • Methodology • Proposed Work

  9. Goal: Increase Parallel Efficiency • Problem: Too Many Threads! • Increase parallel efficiency, i.e. reduce the number of threads needed to achieve a given level of performance • Improves throughput performance • Apply low-latency caches • Leverage the upward spiral • Difficult to mix multithreading and caching • Caches typically used just for bandwidth amplification • Important factors • Thread scheduling • Instruction scheduling (per-thread parallelism)

  10. Contributions • Quantifying the impact of single thread performance on throughput performance • Developing a mathematical analysis of throughput performance • Building a novel hybrid-trace based simulation infrastructure • Demonstrating unique architectural enhancements in thread scheduling and cache policies

  11. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Architectural Enhancements • Thread Scheduling • Cache Policies • Methodology • Proposed Work

  12. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work

  13. Mathematical Analysis • Why take a mathematical approach? • Be very precise about what we want to optimize • Understand how throughput performance relates, and how sensitive it is, to: • Single thread performance • Cache improvements • Application characteristics • Suggest the most fruitful architectural improvements

  14. Modeling Throughput Performance • N_T = total active threads • P_CHIP = total chip throughput performance • P_ST = single thread performance • L_AVG = average latency per instruction • Power_CHIP = E_AVG (average energy per instruction, in Joules) x P_CHIP
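  A minimal sketch of how these quantities might combine, assuming per-thread performance is limited only by average instruction latency (that assumption is mine, for illustration):

      P_ST ≈ 1 / L_AVG        (instructions per cycle, per thread)
      P_CHIP ≈ N_T x P_ST ≈ N_T / L_AVG
      Power_CHIP = E_AVG x P_CHIP

  so adding threads raises throughput only while the extra threads do not inflate L_AVG, which is the feedback loop described on slide 7.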

  15. Cache As A Performance Unit • FMADD unit: area equivalent to roughly 2-11KB of SRAM or 8-40KB of eDRAM; shared through pipelining; active power ~20pJ/op; leakage power ~1 watt/mm^2 • Cache: active power ~50pJ per L1 access, ~1.1nJ per L2 access; leakage power ~70 milliwatts/mm^2 (~1.4 watts/MB SRAM, ~350 milliwatts/MB eDRAM) • Caches make loads ~150x faster and ~300x more energy efficient, and use 10-15x less power/mm^2 than FPUs • One FPU ≈ ~64KB SRAM / ~256KB eDRAM • Key question: how much cache does a thread need?

  16. Performance From Caching • Ignore changes to DRAM latency & off-chip bandwidth (we will simulate these) • Assume ideal caches • What is the maximum performance benefit? • N_T = total active threads on chip • A = arithmetic intensity of the application (fraction of non-memory instructions) • M = 1 - A = memory intensity • L = average latency per instruction • For power, replace L with E, the average energy per instruction; qualitatively identical, but the differences are more dramatic
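  One hedged way to write the average latency per instruction under these assumptions (the decomposition below is mine, using only the symbols defined above plus a hit rate h and the ALU, cache, and DRAM latencies):

      L = A x L_ALU + M x [ h x L_cache + (1 - h) x L_DRAM ]

  With an ideal cache, h approaches 1 and the memory term collapses toward L_cache, which bounds the maximum benefit caching can deliver.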

  17. Ideal Cache = Frequency Cache • Hit rate depends on amount of cache, application working set • Store items used the most times • This is the concept of “frequency” • Once we know an application’s memory access characteristics, we can model throughput performance
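  Below is a minimal C++ sketch of the ideal frequency-cache model described on this slide: given a per-line access-count histogram for an application, a cache of C lines is assumed to hold the C most frequently used lines, so the hit rate is simply the fraction of accesses that fall on those lines. The function name and input format are illustrative, not part of the proposal's simulator.

      #include <algorithm>
      #include <cstdint>
      #include <functional>
      #include <vector>

      // Ideal "frequency cache": hit rate when the cache holds exactly the
      // cache_lines most frequently accessed lines of the application.
      double frequency_cache_hit_rate(std::vector<uint64_t> line_access_counts,
                                      std::size_t cache_lines) {
          std::sort(line_access_counts.begin(), line_access_counts.end(),
                    std::greater<uint64_t>());            // hottest lines first
          uint64_t total = 0, cached = 0;
          for (std::size_t i = 0; i < line_access_counts.size(); ++i) {
              total += line_access_counts[i];
              if (i < cache_lines) cached += line_access_counts[i];
          }
          return total ? static_cast<double>(cached) / total : 0.0;
      }

  Sweeping cache_lines over a range turns this directly into the hit-rate-versus-capacity curves used in the following slides.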

  18. Modeling Cache Performance

  19. Performance Per Thread • P_S(t), the performance of a single thread as a function of the number of active threads, is a steep reciprocal

  20. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work

  21. Valley in Cache Space [figure]

  22. Valley – Annotated [figure; labeled regions: cache regime, valley (and its width), MT regime, cache vs. no-cache curves]

  23. Prior Work • Hong et al., 2009, 2010 • Simple, cacheless GPU models • Used to predict the “MT peak” • Guz et al., 2008, 2010 • Graphed throughput performance with an assumed cache profile • Identified the “valley” structure • Validated against PARSEC benchmarks • No mathematical analysis • Did not analyze the bandwidth-limited regime • Focused on CMP benchmarks • Galal et al., 2011 • Excellent mathematical analysis • Focused on FPU + register file design

  24. Valley – Annotated [same figure as slide 22: cache regime, valley, MT regime, valley width, cache vs. no-cache curves]

  25. Energy vs Latency * Bill Dally, IPDPS Keynote, 2011

  26. Valley – Energy Efficiency

  27. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work

  28. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies • Methodology • Proposed Work

  29. Thread Throttling • Have real-time information: • Arithmetic intensity • Bandwidth utilization • Current hit rate • Can match these against an approximate/conservative locality profile • Approximate the optimum operating points • Shut down / activate threads to increase performance • Concentrate power and overclock

  30. Prior Work • Several studies in the CMP and GPU areas scale back threads • CMP – when miss rates get too high • GPU – when off-chip bandwidth is saturated • Prior attempts were simple and unidirectional • We have two complex operating points to hit and three different operating regimes • Mathematical analysis lets us approximate both points with as few as two samples (see the sketch below) • Both off-chip bandwidth and 1/hit-rate are nearly linear for a wide range of applications
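  A minimal C++ sketch of the two-sample idea, assuming (as the slide does) that off-chip bandwidth use and 1/hit-rate are approximately linear in the active thread count; the function and variable names are illustrative only:

      // Fit y = a + b*t through two measured points (t1, y1) and (t2, y2),
      // then return the thread count at which y reaches a target value.
      double threads_at_target(double t1, double y1, double t2, double y2,
                               double y_target) {
          double b = (y2 - y1) / (t2 - t1);   // change per added thread
          double a = y1 - b * t1;
          return (y_target - a) / b;          // assumes b != 0
      }

      // Hypothetical use: measure bandwidth at 64 and 128 threads, estimate
      // where it would saturate, and cap the active thread count there.
      //   double t_sat = threads_at_target(64, bw_64, 128, bw_128, bw_peak);
      //   if (active_threads > t_sat) active_threads = t_sat;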

  31. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology • Proposed Work

  32. Mathematical Analysis • The cache needs to behave like an LFU (least frequently used) cache • Hard to implement in practice • Still very little cache per thread • Policies make big differences for small caches • Associativity is a big issue • Cannot cache every line referenced • Go beyond “dead line” prediction • Stream lines with lower reuse past the cache

  33. Cache Conflict Misses • Different addresses map to the same set • Programmers prefer power-of-2 array sizes • Power-of-2 strides are pathological • A prime number of banks/sets was long thought ideal • No efficient implementation • Mersenne primes allow cheap modulo but are not conveniently spaced: of the Mersenne numbers 3, 7, 15, 31, 63, 127, 255 (2^n - 1), only 3, 7, 31, and 127 are prime • An early paper on prime strides for vector computers showed a 3x speedup • Kharbutli, HPCA 04 – showed prime numbers of sets work well as a cache hash function • Odd numbers of sets work as well • Fastest implementation of DIV-MOD • A near “silver bullet”, e.g., achieving the same conflict rate with ¼ the banks
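  A small illustrative sketch of why a Mersenne (2^k - 1) number of sets is cheap to index: since 2^k ≡ 1 (mod 2^k - 1), the modulo reduces to shifting and adding, with no divider. This is the generic folding trick, not code from the proposal:

      #include <cstdint>

      // Set index = addr mod (2^k - 1), computed without DIV/MOD hardware
      // by repeatedly folding the high bits of the address into the low bits.
      uint32_t mersenne_set_index(uint32_t addr, unsigned k) {
          const uint32_t m = (1u << k) - 1;       // e.g. k = 5 gives 31 sets
          while (addr > m)
              addr = (addr & m) + (addr >> k);    // fold high bits down
          return (addr == m) ? 0 : addr;          // m itself is congruent to 0
      }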

  34. Early Study using PARSEC [figure: PARSEC L2 behavior with 64 threads]

  35. (Re)placement Policies • Not all data should be cached • Recent papers on LLC caches • Hard drive cache algorithms • Frequency over recency • Frequency is hard to implement • ARC is a good compromise • Direct-mapped replacement dominates • Look for explicit approaches • Priority classes • Epochs
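  As a hedged sketch of what “priority classes” could look like as an explicit placement policy (the priority field and the bypass rule below are my illustration, not the proposal's final design):

      #include <cstdint>
      #include <vector>

      struct Line { uint64_t tag; uint8_t priority; bool valid; };

      // On a miss, insert the incoming line only if its priority class is at
      // least that of the weakest resident line in the set; otherwise stream
      // it past the cache (bypass), preserving higher-priority working sets.
      bool insert_or_bypass(std::vector<Line>& cache_set, uint64_t tag, uint8_t prio) {
          std::size_t victim = 0;
          for (std::size_t i = 0; i < cache_set.size(); ++i) {
              if (!cache_set[i].valid) { victim = i; break; }         // free way
              if (cache_set[i].priority < cache_set[victim].priority) victim = i;
          }
          if (cache_set[victim].valid && prio < cache_set[victim].priority)
              return false;                                           // bypass
          cache_set[victim] = Line{tag, prio, true};
          return true;                                                // cached
      }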

  36. Prior Work • Belady (1966) – solved the theory, but light on implementation details • Three hierarchies of methods; the best utilized information about prior line usage • Approximations • ARC cache – ghost entries, recency and frequency groups • Generational caches, multi-queue • Qureshi, 2006, 2007 – adaptive insertion policies

  37. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology (Applications, Simulation) • Proposed Work

  38. Benchmarks • Initially studied regular HPC kernels/applications in a CMP environment • Dense matrix multiply • Fast Fourier transform • HOMME weather simulation • Added CUDA throughput benchmarks • Parboil – old-school MPI, coarse-grained • Rodinia – fine-grained, varied • Benchmarks typical of historical GPGPU applications • Will add irregular benchmarks • SparseMM, adaptive finite elements, photon mapping

  39. Subset of Benchmarks

  40. Preliminary Results • Most of the benchmarks should benefit: • Small working sets • Concentrated working sets • Hit rate curves easy to predict

  41. Typical Concentration of Locality

  42. Scratchpad Locality

  43. Hybrid Simulator Design • Tool flow: C++/CUDA source → NVCC → PTX intermediate assembly listing → modified Ocelot functional simulator with a custom trace module → compressed trace data (dynamic trace blocks + attachment points) → custom simulator • Simulates a different architecture than the one traced • Goals: fast simulation; overcoming compiler issues for a reasonable base case
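  Purely as an illustration of what a compressed dynamic trace block keyed to an attachment point might contain (the struct and field names are hypothetical, not the proposal's actual trace format):

      #include <cstdint>
      #include <vector>

      // One dynamic trace block: a straight-line run of instructions observed
      // by the functional simulator, keyed to a static attachment point in the
      // PTX listing so the timing simulator can re-inject it under a different
      // architectural model than the one traced.
      struct TraceBlock {
          uint32_t attach_point;            // index into the static PTX listing
          uint32_t warp_id;                 // which warp executed this block
          uint16_t inst_count;              // instructions in the block
          std::vector<uint64_t> addresses;  // memory addresses touched, in order
      };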

  44. Talk Outline • Introduction • The Problem • Throughput Architectures • Dissertation Goals • The Solution • Modeling Throughput Performance • Cache Performance • The Valley • Architectural Enhancements • Thread Throttling • Cache Policies (Indexing, replacement) • Methodology (Applications, Simulation) • Proposed Work

  45. Phase 1 – HPC Applications • Looked at GEMM, FFT & HOMME in a CMP setting • Learned the implementation algorithms and alternative algorithms • This expertise allows for credible throughput analysis • Valuable lessons in multithreading and caching • Dense matrix multiply • Blocking to maximize arithmetic intensity • Enough contexts to cover latency • Fast Fourier transform • Pathologically hard on the memory system • Communication & synchronization • HOMME – weather modeling • Intra-chip scaling incredibly difficult • Memory system performance variation • Replacing data movement with computation • First author publications: • PPoPP 2008, ISPASS 2011 (Best Paper)

  46. Phase 2 – Benchmark Characterization • Memory Access Characteristics of Rodinia and Parboil benchmarks • Apply Mathematical Analysis • Validate model • Find optimum operating points for benchmarks • Find optimum TA topology for benchmarks • NEARLY COMPLETE

  47. Phase 3 – Evaluate Enhancements • Automatic Thread Throttling • Low latency hierarchical cache • Benefits of odd-sets/odd-banking • Benefits of explicit placement (Priority/Epoch) • NEED FINAL EVALUATION and explicit placement study

  48. Final Phase – Extend Domain • Study regular HPC applications in throughput setting • Add at least two irregular benchmarks • Less likely to benefit from caching • New opportunities for enhancement • Explore impact of future TA topologies • Memory Cubes, TSV DRAM, etc.

  49. Proposed Timeline • Phase 1 – HPC applications – completed • Phase 2 – Mathematical model & benchmark characterization • May-June • Phase 3 – Architectural enhancements • July-August • Phase 4 – Domain enhancement / new features • September-November

  50. Conclusion • Dissertation Goals: • Quantify the degree to which single thread performance affects throughput performance for an important class of applications • Improve parallel efficiency through thread scheduling, cache topology, and cache policies • Feasibility • Regular benchmarks show promising memory behavior • Cycle-accurate simulator nearly complete
