
Performance Analysis, Modeling, and Optimization: Understanding the Memory Wall

This joint venture between UC Berkeley and LBNL applies performance analysis and modeling to understand and mitigate the memory wall in high-performance computing. Adaptable probes isolate specific performance limitations and provide feedback to both application developers and hardware designers. The study examines a range of commercial and research architectures and the impact of their architectural features on performance.


Presentation Transcript


  1. Performance Analysis, Modeling, and Optimization: Understanding the Memory Wall
  Leonid Oliker (LBNL) and Katherine Yelick (UCB and LBNL)

  2. Berkeley Institute for Performance Studies
  • Joint venture between U.C. Berkeley (Demmel & Yelick) and LBNL (Oliker, Strohmaier, Bailey, and others)
  • Three performance techniques:
    • Analysis (benchmarking)
    • Modeling (prediction)
    • Optimization (tuning)

  3. Investigating Architectural Balance using Adaptable Probes
  Kaushik Datta, Parry Husbands, Paul Hargrove, Shoaib Kamil, Leonid Oliker, John Shalf, Katherine Yelick

  4. Overview
  • Gap between peak and sustained performance is a well-known problem in HPC
  • Generally attributed to the memory system, but the specific bottleneck is difficult to identify
  • Application benchmarks are too complex to isolate specific architectural features
  • Microbenchmarks are too narrow to predict actual code performance
  • We use adaptable probes to isolate performance limitations:
    • Give application developers possible optimizations
    • Give hardware designers feedback on current and proposed architectures
  • Single-processor probes:
    • Sqmat captures regular and irregular memory access patterns (such as dense and sparse linear algebra)
    • Stencil captures nearest-neighbor computation (work in progress)
  • Architectures examined:
    • Commercial: Intel Itanium2, AMD Opteron, IBM Power3, IBM Power4, G5
    • Research: Imagine, VIRAM, DIVA

  5. Sqmat overview
  • Sqmat is based on matrix multiplication and linear solvers
  • A Java program generates optimally unrolled C code
  • Square a set of matrices M times (use enough matrices to exceed cache)
  • M controls computational intensity (CI) - the ratio between flops and memory accesses
  • Each matrix is of size NxN
  • N controls working-set size: 2N² registers required per matrix
  • Direct storage: Sqmat's matrix entries stored contiguously in memory
  • Indirect: entries accessed indirectly through a pointer; parameter S controls the degree of indirection - S matrix entries are stored contiguously, then a random jump in memory (see the kernel sketch below)
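To make the probe concrete, here is a minimal C sketch of the kind of kernel Sqmat measures (the actual probe runs Java-generated, optimally unrolled C; the function names, the identity index array, and the fixed N and M below are illustrative assumptions):

    #include <stdio.h>

    #define N 4  /* matrix dimension; working set is 2*N*N values per matrix */
    #define M 8  /* number of squarings: controls computational intensity */

    /* Square the N x N matrix 'a' in place, M times.  Each squaring costs
     * roughly 2*N^3 flops on 2*N^2 resident values, so M scales the ratio
     * of flops to memory traffic. */
    static void sqmat_direct(double *a)
    {
        double tmp[N * N];
        for (int s = 0; s < M; s++) {
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    double acc = 0.0;
                    for (int k = 0; k < N; k++)
                        acc += a[i * N + k] * a[k * N + j];
                    tmp[i * N + j] = acc;
                }
            for (int i = 0; i < N * N; i++)
                a[i] = tmp[i];
        }
    }

    /* Indirect variant: every entry is reached through an index array.  In
     * the real probe, parameter S lays out 'idx' so that S consecutive
     * entries are contiguous, followed by a random jump; the identity
     * layout used in main() below corresponds to indirect unit stride. */
    static void sqmat_indirect(double *entries, const int *idx)
    {
        double tmp[N * N];
        for (int s = 0; s < M; s++) {
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    double acc = 0.0;
                    for (int k = 0; k < N; k++)
                        acc += entries[idx[i * N + k]] * entries[idx[k * N + j]];
                    tmp[i * N + j] = acc;
                }
            for (int i = 0; i < N * N; i++)
                entries[idx[i]] = tmp[i];
        }
    }

    int main(void)
    {
        double a[N * N];
        int idx[N * N];
        for (int i = 0; i < N * N; i++) { a[i] = 1.0 / (i + 1); idx[i] = i; }
        sqmat_direct(a);
        sqmat_indirect(a, idx);
        printf("a[0] = %g\n", a[0]);
        return 0;
    }

The real probe squares enough matrices back to back that the working set exceeds cache, and varies the idx layout with S to dial in irregularity.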

  6. Unit Stride Algorithmic Peak
  [Figure: percent of algorithmic peak (0-100%) vs. computational intensity (CI, 1 to 10000) for Itanium2, Opteron, Power3, and Power4]
  • The curve increases until the memory system is fully utilized, then plateaus once the FPU units saturate
  • Itanium2 requires longer to reach its plateau due to a register-spill penalty
  • The SIMD nature of the Opteron's SSE2 inhibits a high algorithmic peak
  • Power3 effectively hides cache-access latency
  • Power4's deep pipeline inhibits finding sufficient ILP to saturate the FPUs
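As a rough sanity check on how M sweeps the x-axis above (assuming each squaring costs about 2N³ flops while each matrix moves about 2N² words between memory and registers): CI ≈ M·2N³ / 2N² = M·N flops per word, so for N=4 the probe covers CI ≈ 4M.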

  7. Slowdown due to Indirection
  [Figure: slowdown vs. M (1 to 512) for unit-stride access via indirection (S=∞) on Itanium2, Opteron, Power3, and Power4]
  • Opteron and Power3/4 show less than a 10% penalty once M>8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
  • Itanium2 shows a high penalty for indirection - the issue is currently under investigation

  8. Cost of Irregularity (1)
  [Figures: slowdown for irregular access vs. M (1 to 512) at irregularity levels from 100% (S=1) down to 0.78% (S=128); left panel: Itanium2, N=4; right panel: Opteron, N=4]
  • Itanium2 and Opteron perform well for irregular accesses due to:
    • Itanium2's L2 caching of FP values (reduces the cost of a cache miss)
    • Opteron's low memory latency from its on-chip memory controller

  9. Cost of Irregularity (2)
  [Figures: slowdown for irregular access vs. M (1 to 512) at irregularity levels from 100% (S=1) down to 0.39% (S=256), plus fully random accesses; left panel: Power3, N=4; right panel: Power4, N=4]
  • Power3 and Power4 perform poorly for irregular accesses due to:
    • Power3's high cache-miss penalty (35 cycles) and limited prefetch abilities
    • Power4 requiring 4 cache-line hits to activate prefetching

  10. Tolerating Irregularity
  • S50
    • Start with some M at S=∞ (indirect unit stride)
    • For a given M, how large must S be to achieve at least 50% of the original performance? (see the search sketch below)
  • M50
    • Start with M=1, S=∞
    • At S=1 (every access random), how large must M be to achieve 50% of the original performance?
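Here is a minimal sketch of how the S50 search could be automated (the perf() model is a purely illustrative stand-in for timing the real Sqmat kernel, and the power-of-two scan is an assumption):

    #include <stdio.h>

    /* Stand-in for a timed run of the Sqmat kernel at a given M and S,
     * returning performance in MFlop/s.  Purely illustrative: the real
     * probe would measure the generated kernel. */
    static double perf(int m, int s)
    {
        return 1000.0 * (double)m * s / ((double)m * s + 64.0);
    }

    /* S50: for a fixed M, the smallest S (entries between random jumps)
     * at which performance reaches half of the indirect unit-stride
     * baseline (a very large S stands in for S = infinity). */
    static int s50(int m)
    {
        double baseline = perf(m, 1 << 20);
        for (int s = 1; s <= (1 << 20); s *= 2)
            if (perf(m, s) >= 0.5 * baseline)
                return s;
        return -1;
    }

    int main(void)
    {
        for (int m = 1; m <= 512; m *= 8)
            printf("M = %3d  ->  S50 = %d\n", m, s50(m));
        return 0;
    }

Under this toy model, larger M tolerates smaller S, matching the intuition that high computational intensity hides the cost of irregular access.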

  11. Tolerating Irregularity
  • Interested in developing application-driven architectural probes for the evaluation of emerging petascale systems

  12. Emerging Architectures
  • General-purpose processors are badly suited to data-intensive ops:
    • Large caches are not useful if re-use is low
    • Low memory bandwidth, especially for irregular patterns
    • Superscalar methods of increasing ILP are inefficient
    • Power consumption
  • Application-specific ASICs: good, but expensive and slow to design
  • Solution: general-purpose "memory-aware" processors
    • Large number of ALUs: to exploit data parallelism
    • Huge memory bandwidth: to keep the ALUs busy
    • Concurrency: overlap memory access w/ computation

  13. VIRAM Overview
  • MIPS core (200 MHz)
  • Main memory system: 8 banks w/ 13 MB of on-chip DRAM; large 6.4 GB/s on-chip peak bandwidth
  • Cache-less vector unit
    • Energy-efficient way to express fine-grained parallelism and exploit bandwidth (see the loop sketch below)
    • Single issue, in order
  • Low power consumption: 2.0 W
  • Peak vector performance: 1.6/3.2/6.4 Gops; 1.6 Gflops (single precision)
  • Fabricated by IBM; deep pipelines mask DRAM latency
  • Cray's vcc compiler adapted to VIRAM
  • Simulator used for results
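For context, the kind of fine-grained data-parallel loop this design targets looks like the following generic C (an illustration of what a vectorizing compiler such as the adapted vcc could map onto the vector unit, not actual VIRAM code):

    #include <stdio.h>

    /* Every iteration is independent, so a vectorizing compiler can issue
     * this loop as vector instructions, streaming operands directly from
     * the on-chip DRAM banks with no cache in the way. */
    static void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        float x[256], y[256];
        for (int i = 0; i < 256; i++) { x[i] = (float)i; y[i] = 1.0f; }
        saxpy(256, 2.0f, x, y);
        printf("y[255] = %g\n", y[255]);
        return 0;
    }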

  14. VIRAM Power Efficiency
  • Comparable performance with a lower clock rate
  • Large power/performance advantage for VIRAM from PIM technology and the data-parallel execution model

  15. Imagine Overview
  • "Vector VLIW" processor; coprocessor to an off-chip host processor
  • 8 arithmetic clusters controlled in SIMD w/ VLIW instructions
  • Central 128 KB Stream Register File (SRF) @ 32 GB/s
    • The SRF can overlap computation w/ memory (double buffering)
    • The SRF can reuse intermediate results (producer-consumer locality)
  • Stream-aware memory system with 2.7 GB/s off-chip bandwidth
  • 544 GB/s inter-cluster communication
  • Host sends instructions to the stream controller; the SC issues commands to the on-chip modules

  16. VIRAM and Imagine
  • Imagine: an order of magnitude higher performance
  • VIRAM: twice the memory bandwidth, less power consumption
  • Notice the peak Flop/Word ratios

  17. What Does This Have to Do with PIMs?
  • Performance of Sqmat on PIMs and other architectures for 3x3 matrices, squared 10 times (high computational intensity!)
  • Imagine is much faster for long streams, slower for short ones

  18. SQMAT: Performance Crossover
  • Large number of ops per word (3x3 matrices squared 10 times)
  • Crossover point: L=64 (cycles), L=256 (MFlop)
  • Imagine's power becomes apparent at long stream lengths: almost 4x VIRAM at L=1024; codes at this end of the spectrum greatly benefit from the Imagine architecture
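A back-of-the-envelope estimate of why the ops/word ratio is so high here (counting roughly 2N³ flops per N×N matrix squaring and about 2N² words of memory traffic per matrix): 10 squarings of a 3x3 matrix cost about 10 · 2 · 3³ = 540 flops against roughly 2 · 3² = 18 words moved, i.e. on the order of 30 flops per word.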

  19. Stencil Probe
  • Stencil computations are the core of a wide range of scientific applications
    • Applications include Jacobi solvers, complex multigrid, and block-structured AMR
  • We are developing an adaptable stencil probe to model this range of computations (see the kernel sketch below)
  • Findings isolate the importance of streaming memory accesses, which engage automatic prefetch engines and thus greatly increase memory throughput
  • Previous L1 tiling techniques are mostly ineffective for stencil computations on modern microprocessors:
    • Small blocks inhibit automatic prefetching performance
    • Modern large on-chip L2/L3 caches have bandwidth similar to L1
  • Currently investigating tradeoffs between blocking and prefetching (paper in preparation)
  • Interested in exploring the potential benefits of enhancing commodity processors with explicitly programmable prefetching
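A minimal sketch of the kind of kernel such a stencil probe exercises (the 5-point 2-D Jacobi sweep, the grid size, and all names below are illustrative assumptions, not the probe's actual code):

    #include <stdio.h>
    #include <stdlib.h>

    #define NX 1024
    #define NY 1024

    /* One Jacobi-style sweep of a 5-point stencil: each interior point of
     * b becomes the average of its four neighbors in a.  The inner j-loop
     * walks memory with unit stride - exactly the streaming pattern that
     * engages hardware prefetch engines. */
    static void stencil_sweep(const double *a, double *b)
    {
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                b[i * NY + j] = 0.25 * (a[(i - 1) * NY + j] + a[(i + 1) * NY + j]
                                      + a[i * NY + j - 1] + a[i * NY + j + 1]);
    }

    int main(void)
    {
        double *a = calloc(NX * NY, sizeof *a);
        double *b = calloc(NX * NY, sizeof *b);
        if (!a || !b) return 1;
        a[(NX / 2) * NY + NY / 2] = 1.0;  /* point source */
        stencil_sweep(a, b);
        printf("neighbor of source: %g\n", b[(NX / 2) * NY + NY / 2 + 1]);
        free(a);
        free(b);
        return 0;
    }

L1 tiling would add blocking loops around i and j; per the findings above, small blocks can break the long unit-stride streams the prefetchers depend on, which is why the blocking/prefetching tradeoff is under investigation.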
