
Memory Intensive Benchmarks: IRAM vs. Cache Based Machines



Presentation Transcript


  1. Memory Intensive Benchmarks: IRAM vs. Cache Based Machines
  Parry Husbands (LBNL), Brian Gaeke, Xiaoye Li, Leonid Oliker, Katherine Yelick (UCB/LBNL), Rupak Biswas (NASA Ames)

  2. Motivation
  • Observation: Current cache-based supercomputers perform at a small fraction of peak for memory intensive problems (particularly irregular ones)
    • E.g. Optimized Sparse Matrix-Vector Multiplication runs at ~20% of peak on a 1.5 GHz P4
    • Even worse when parallel efficiency considered
    • Overall ~10% across application benchmarks
  • Is memory bandwidth the problem?
    • Performance directly related to how well memory system performs
    • But "gap" between processor performance and DRAM access times continues to grow (60%/yr vs. 7%/yr)

  3. Solutions?
  • Better Software
    • ATLAS, FFTW, Sparsity, PHiPAC
  • Power and packaging are important too!
    • New buildings and infrastructure needed for many recent/planned installations
  • Alternative Architectures
    • One idea: Tighter integration of processor and memory
    • BlueGene/L (~25 cycles to main memory)
    • VIRAM
      • Uses PIM technology in attempt to take advantage of large on-chip bandwidth available in DRAM

  4. VIRAM Overview
  (Die photo: 14.5 mm × 20.0 mm)
  • MIPS core (200 MHz)
  • Main memory system
    • 13 MB of on-chip DRAM
    • Large on-chip bandwidth: 6.4 GBytes/s peak to vector unit
  • Vector unit
    • Energy efficient way to express fine-grained parallelism and exploit bandwidth
  • Typical power consumption: 2.0 W
  • Peak vector performance
    • 1.6/3.2/6.4 Gops
    • 1.6 Gflops (single-precision)
  • Fabrication by IBM
    • Tape-out in O(1 month)
  • Our results use simulator with Cray's vcc compiler

  5. Our Task
  • Evaluate use of processor-in-memory (PIM) chips as a building block for high performance machines
    • For now focus on serial performance
  • Benchmark VIRAM on Scientific Computing kernels
    • Originally for multimedia applications
  • Can we use on-chip DRAM for vector processing vs. the conventional SRAM? (DRAM denser)
  • Isolate performance limiting features of architectures
    • More than just memory bandwidth

  6. Benchmarks Considered
  • Transitive-closure (small & large data set)
  • NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)
    • Fetch-and-increment a stream of "random" addresses
  • Sparse matrix-vector product:
    • Order 10000, #nonzeros 177820
  • Computing a histogram
    • Different algorithms investigated: 64-element sorting kernel; privatization; retry
  • 2D unstructured mesh adaptation

  7. The Results
  • Comparable performance with lower clock rate

  8. Power Efficiency
  • Large power/performance advantage for VIRAM from:
    • PIM technology
    • Data parallel execution model

  9. Ops/Cycle

  10. GUPS
  • 1 op, 2 loads, 1 store per step
  • Mix of indexed and unit stride operations
  • Address generation key here (only 4 per cycle on VIRAM)
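
  A minimal scalar sketch of a GUPS-style update loop makes the per-step count above concrete (names and types are illustrative assumptions; the index stream is taken to be precomputed and in range):

      #include <stddef.h>
      #include <stdint.h>

      /* GUPS-style kernel (64-bit variant): fetch-and-increment a stream of
       * "random" table entries.  Each step is 1 op (the add), 2 loads
       * (index[i] and table[index[i]]) and 1 store; the table access is an
       * indexed gather/scatter, the index stream itself is unit stride. */
      void gups(uint64_t *table, const size_t *index, size_t n)
      {
          for (size_t i = 0; i < n; i++)
              table[index[i]] += 1;
      }

  On a vector machine the loop becomes a gather, an add, and a scatter, so throughput depends on how many addresses can be generated per cycle, which is exactly the limit noted above.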

  11. Histogram
  • 1 op, 2 loads, 1 store per step
  • Like GUPS, but duplicates restrict available parallelism and make it more difficult to vectorize
  • Sort method performs best on VIRAM on real data
  • Competitive when histogram doesn't fit in cache
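
  For contrast with GUPS, a minimal scalar histogram loop (illustrative only; values are assumed to be already reduced to bin indices) shows where the difficulty comes from: two elements in the same vector can map to the same bin, so a naive gather-add-scatter would lose updates.

      #include <stddef.h>
      #include <stdint.h>

      /* Scalar histogram: 1 op, 2 loads (data[i] and hist[b]), 1 store per step.
       * Unlike GUPS, duplicate bin indices within one vector of elements would
       * collide on the same hist entry, so the loop cannot be vectorized as a
       * simple scatter; hence the sort / privatization / retry variants. */
      void histogram(uint32_t *hist, const uint32_t *data, size_t n)
      {
          for (size_t i = 0; i < n; i++) {
              uint32_t b = data[i];   /* assumed to be a valid bin index */
              hist[b] += 1;
          }
      }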

  12. Which Problems are Limited by Bandwidth?
  • What is the bottleneck in each case?
    • Transitive and GUPS are limited by bandwidth (near 6.4 GB/s peak)
    • SPMV and Mesh limited by address generation, bank conflicts, and parallelism
    • For Histogram, lack of parallelism, not memory bandwidth

  13. Summary and Future Directions
  • Performance advantage
    • Large on applications limited only by bandwidth
    • More address generators/sub-banks would help irregular performance
  • Performance/Power advantage
    • Over both low power and high performance processors
    • Both PIM and data parallelism are key
  • Performance advantage for VIRAM depends on application
    • Need fine-grained parallelism to utilize on-chip bandwidth
  • Future steps
    • Validate our work on real chip!
    • Extend to multi-PIM systems
    • Explore system balance issues
      • Other memory organizations (banks, bandwidth vs. size of memory)
      • # of vector units
      • Network performance vs. on-chip memory

  14. The Competition

                SPARC IIi      MIPS R10K      P III          P 4            Alpha EV6
  Make          Sun Ultra 10   Origin 2000    Intel Mobile   Dell           Compaq DS10
  Clock         333 MHz        180 MHz        600 MHz        1.5 GHz        466 MHz
  L1 cache      16+16 KB       32+32 KB       32 KB          12+8 KB        64+64 KB
  L2 cache      2 MB           1 MB           256 KB         256 KB         2 MB
  Memory        256 MB         1 GB           128 MB         1 GB           512 MB

  15. Transitive Closure (Floyd-Warshall)
  • 2 ops, 2 loads, 1 store per step
  • Good for vector processors:
    • Abundant, regular parallelism and unit stride
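
  A minimal sketch of the kernel (written here in its shortest-path form with illustrative names) shows why it suits a vector unit: the inner loop over j is unit stride and its iterations are independent.

      #include <math.h>
      #include <stddef.h>

      /* Floyd-Warshall on a dense n x n distance matrix, stored row-major.
       * Per inner-loop step: 2 ops (add, min), 2 loads (dist[k][j] and
       * dist[i][j]) and 1 store (dist[i][j]); dist[i][k] is loop-invariant. */
      void floyd_warshall(float *dist, size_t n)
      {
          for (size_t k = 0; k < n; k++)
              for (size_t i = 0; i < n; i++) {
                  float dik = dist[i*n + k];
                  for (size_t j = 0; j < n; j++)
                      dist[i*n + j] = fminf(dist[i*n + j], dik + dist[k*n + j]);
              }
      }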

  16. SPMV
  • 2 ops, 3 loads per step
  • Mix of indexed and unit stride operations
  • Good performance for ELLPACK, but only when rows have the same number of non-zeros
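
  A minimal ELLPACK-style SpMV sketch (layout and names are assumptions for illustration, written as a row-major scalar loop rather than in strip-mined vector form): every row is padded to a common width, so the matrix value and index arrays are read with unit stride while the source vector is gathered through the column indices.

      #include <stddef.h>

      /* y = A*x with A in an ELLPACK-like layout: val and col hold
       * nrows*width entries, with short rows padded by val = 0 and a valid
       * column index.  Per inner step: 2 ops (multiply, add) and 3 loads
       * (val, col, x[col]); val/col are unit stride, x is an indexed gather. */
      void spmv_ellpack(float *y, const float *val, const int *col,
                        size_t nrows, size_t width, const float *x)
      {
          for (size_t i = 0; i < nrows; i++) {
              float sum = 0.0f;
              for (size_t j = 0; j < width; j++)
                  sum += val[i*width + j] * x[col[i*width + j]];
              y[i] = sum;
          }
      }

  The padding is wasted work when row lengths vary widely, which is the caveat above: ELLPACK only pays off when rows have roughly the same number of non-zeros.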

  17. Mesh Adaptation
  • Single level of refinement of mesh with 4802 triangular elements, 2500 vertices, and 7301 edges
  • Extensive reorganization required to take advantage of vectorization
  • Many indexed memory operations (limited again by address generation)
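
  A hypothetical fragment of unstructured-mesh code (not the benchmark itself) illustrates the kind of indexed access involved: gathering vertex data through an edge-to-vertex list turns every coordinate read into an indexed load, so the kernel is bound by address generation rather than raw bandwidth.

      #include <stddef.h>

      /* Hypothetical helper: squared edge lengths for a 2D mesh, gathering
       * endpoint coordinates through an edge list.  All four coordinate
       * reads are indexed (gather) operations; only the edge list and the
       * output are unit stride. */
      void edge_lengths_sq(float *len2, const int (*edge)[2],
                           const float *x, const float *y, size_t nedges)
      {
          for (size_t e = 0; e < nedges; e++) {
              int v0 = edge[e][0], v1 = edge[e][1];
              float dx = x[v0] - x[v1];
              float dy = y[v0] - y[v1];
              len2[e] = dx*dx + dy*dy;
          }
      }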
