
Scientific Kernels on VIRAM and Imagine



  1. Scientific Kernels on VIRAM and Imagine Leonid Oliker Future Technologies Group NERSC/LBNL www.nersc.gov/~oliker Xiaoye Li, Parry Husbands, Adam Janin, Manikandan Narayanan, Kathy Yelick

  2. Motivation • Observation: current cache-based supercomputers perform at a small fraction of peak for memory-intensive problems (particularly irregular ones) • E.g., optimized sparse matrix-vector multiplication runs at ~20% of peak on a 1.5 GHz Pentium 4 • Even worse when parallel efficiency is considered: overall ~10% across application benchmarks • Is memory bandwidth the problem? • Performance is directly related to how well the memory system performs • But the "gap" between processor performance and DRAM access times continues to grow (60%/yr vs. 7%/yr)

  3. Solutions? • Better software: ATLAS, FFTW, Sparsity, PHiPAC • Power and packaging are important too! New buildings and infrastructure are needed for many recent/planned installations • Alternative architectures • One idea: tighter integration of processor and memory • BlueGene/L (~25 cycles to main memory) • VIRAM: uses PIM technology in an attempt to take advantage of the large on-chip bandwidth available in DRAM • Imagine: a stream-aware memory hierarchy to support SIMD-controlled VLIW clusters

  4. VIRAM Overview (14.5 mm x 20.0 mm die) • MIPS core (200 MHz) • Main memory system: 13 MB of on-chip DRAM • Large on-chip bandwidth: 6.4 GB/s peak to the vector unit • Vector unit: an energy-efficient way to express fine-grained parallelism and exploit bandwidth • Typical power consumption: 2.0 W • Peak vector performance: 1.6/3.2/6.4 Gops, 1.6 GFlop/s (single-precision) • Fabrication by IBM; tape-out in O(1 month) • Our results use a simulator with Cray's vcc compiler
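The vector unit rewards loops with abundant fine-grained, dependence-free parallelism. As a minimal sketch (hypothetical code, not from the study), this is the kind of loop a vectorizing compiler such as Cray's vcc can map directly onto vector hardware:

```c
/* Minimal sketch of a vectorizable loop (hypothetical names). There are
   no loop-carried dependences, so iterations can run in vector lanes. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* one multiply-add per element */
}
```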

  5. Our Task • Evaluate the use of processor-in-memory (PIM) chips as a building block for high-performance machines • For now, focus on serial performance • Benchmark VIRAM on scientific computing kernels (the chip was originally designed for multimedia applications) • Can on-chip DRAM be used for vector processing instead of conventional SRAM? (DRAM is denser) • Isolate the performance-limiting features of the architectures: more than just memory bandwidth

  6. Benchmarks Considered • Transitive closure (small & large data sets; sketched below) • NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit): fetch-and-increment a stream of "random" addresses • Sparse matrix-vector product: order 10000, 177820 nonzeros • Computing a histogram, with several algorithms investigated: 64-element sorting kernel; privatization; retry • 2D unstructured mesh adaptation
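For orientation, here is a minimal transitive-closure sketch. This is an assumption on my part: the slides do not give the benchmark's exact formulation, so this uses a standard Warshall-style update over a dense boolean adjacency matrix, whose inner loop is dependence-free and vectorizes with long unit-stride accesses:

```c
/* Hypothetical transitive-closure kernel in the Warshall style; the
   benchmark's actual formulation may differ. adj is an n x n boolean
   adjacency matrix stored row-major. */
void transitive_closure(unsigned char *adj, int n)
{
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            if (adj[i*n + k])                 /* i reaches k... */
                for (int j = 0; j < n; j++)   /* dependence-free inner loop */
                    adj[i*n + j] |= adj[k*n + j];  /* ...so i reaches what k reaches */
}
```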

  7. The Results [performance chart] • Comparable performance with a lower clock rate

  8. Power Efficiency • Large power/performance advantage for VIRAM from: PIM technology and the data-parallel execution model

  9. Ops/Cycle

  10. GUPS • 1 op, 2 loads, 1 store per step • Mix of indexed and unit-stride operations • Address generation is the key constraint here (only 4 addresses per cycle on VIRAM)
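The GUPS access pattern reduces to a one-line loop; a minimal sketch follows (hypothetical names; real GUPS also masks the random value into the table's index range):

```c
#include <stdint.h>

/* GUPS-style update loop: per step, one unit-stride load of the index,
   one indexed load of the table entry, one op, and one indexed store.
   Indexed (gather/scatter) address generation dominates. */
void gups(uint64_t *table, const uint64_t *index, int n)
{
    for (int i = 0; i < n; i++)
        table[index[i]] += 1;
}
```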

  11. Histogram • 1 op, 2 loads, 1 store per step • Like GUPS, but duplicate addresses restrict the available parallelism and make the loop more difficult to vectorize • The sort method performs best on VIRAM on real data • Competitive with cache-based systems when the histogram doesn't fit in cache
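A minimal sketch (hypothetical names) makes the hazard visible: if two indices in the same vector hit the same bin, a naive vectorized read-modify-write loses updates, which is what the sort, privatization, and retry variants work around:

```c
#include <stdint.h>

/* Histogram update loop: structurally identical to GUPS, but duplicate
   bin values within one vector create a read-modify-write conflict. */
void histogram(uint32_t *hist, const uint32_t *bin, int n)
{
    for (int i = 0; i < n; i++)
        hist[bin[i]] += 1;   /* duplicates must be serialized or combined */
}
```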

  12. Which Problems Are Limited by Bandwidth? • What is the bottleneck in each case? • Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak) • SPMV and Mesh are limited by address generation, bank conflicts, and parallelism • For Histogram, the limit is lack of parallelism, not memory bandwidth

  13. VIRAM: Summary and Future Directions • Performance advantage: large on applications limited only by bandwidth; more address generators/sub-banks would help irregular performance • Performance/power advantage over both low-power and high-performance processors; both PIM and data parallelism are key • The performance advantage for VIRAM depends on the application: fine-grained parallelism is needed to utilize the on-chip bandwidth • Future steps: validate our work on the real chip! Extend to multi-PIM systems; explore system-balance issues, including other memory organizations (banks, bandwidth vs. size of memory), the number of vector units, and network performance vs. on-chip memory

  14. Imagine: The Need for Stream Processors • General-purpose processors are badly suited: large caches are not useful, memory bandwidth is low, superscalar methods of increasing ILP are inefficient, and power consumption is high • Application-specific ASICs are good, but expensive and slow to design • Solution: general-purpose "stream processors" • Exploit producer-consumer locality (illustrated below) • High arithmetic requirement • Homogeneous computation (SIMD-controlled VLIW clusters) • Unique (but limited) control logic
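A minimal sketch of producer-consumer locality (hypothetical kernels): the intermediate stream is consumed as soon as it is produced, so a stream processor can keep it in the on-chip stream register file instead of round-tripping through off-chip SDRAM:

```c
#define LEN 1024

/* Kernel 1 produces an intermediate stream... */
void square(const float *in, float *tmp)
{
    for (int i = 0; i < LEN; i++)
        tmp[i] = in[i] * in[i];
}

/* ...which kernel 2 consumes element-for-element. On a stream machine,
   tmp can stay on chip between the two kernels. */
void bias(const float *tmp, float *out)
{
    for (int i = 0; i < LEN; i++)
        out[i] = tmp[i] + 1.0f;
}
```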

  15. Bandwidth Hierarchy [diagram: four off-chip SDRAMs feed a stream register file, which feeds eight ALU clusters under SIMD/VLIW control] • Peak bandwidth at each level: 4 GB/s (off-chip SDRAM), 32 GB/s (stream register file), 544 GB/s (local registers inside the ALU clusters)

  16. Imagine Arithmetic Clusters • Each cluster contains 3 ADD, 2 MULT, 1 DIV/SQRT, 1 scratch-pad register unit, and 1 cluster communication unit • 32-bit operations; subword operations support 16- and 8-bit data • Local registers on the functional units hold 16 words each (1.5 KB of local registers per cluster in total) • Clusters receive VLIW-style instructions broadcast from the microcontroller

  17. VIRAM vs. IMAGINE

  18. Sqmat Microbenchmark: A Scalable Synthetic Probe • Used to gain insight into the architectures and capture the performance crossover point • Sqmat contains abundant fine-grained parallelism, no data dependencies, and "multi-word" records • Sqmat squares each of a set of L matrices of size NxN repeatedly, M times (sketched below) • Varying N and M controls the size of the computational kernel and the ops/word ratio • Varying L controls the vector/stream length • Start at the low end of the performance spectrum and work up to high efficiency
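A minimal Sqmat sketch follows (a hypothetical implementation; the probe is defined by its parameters rather than by specific code). N and M set the flops performed per word of memory traffic, while L sets the vector/stream length presented to the hardware:

```c
#define N 3   /* matrix dimension; N and M scale the kernel size */

/* Square each of L independent N x N matrices M times in place. */
void sqmat(float a[][N][N], int L, int M)
{
    float tmp[N][N];
    for (int l = 0; l < L; l++)            /* L independent work items */
        for (int m = 0; m < M; m++) {      /* repeated squaring */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    float s = 0.0f;
                    for (int k = 0; k < N; k++)
                        s += a[l][i][k] * a[l][k][j];
                    tmp[i][j] = s;
                }
            for (int i = 0; i < N; i++)    /* copy back for the next pass */
                for (int j = 0; j < N; j++)
                    a[l][i][j] = tmp[i][j];
        }
}
```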

  19. Sqmat: Low ops/word [charts: % of algorithmic peak for varying N with M=1 and L=16; % of algorithmic peak for varying L with M=1 and N=3]

  20. Sqmat: High ops/word [chart: % of algorithmic peak for varying M with N=3 and L=1024] • Imagine achieves less than 50% efficiency even though there are 30 multiplications for each memory access • High efficiency requires long streams and heavy computation: reaching 90% efficiency on Imagine takes a large (N=5) computational kernel and many ops per word

  21. Sqmat: Performance Crossover [chart: performance crossover for N=3 and M=10] • VIRAM performance flattens while Imagine's continues to grow; at L=1024, Imagine's raw power becomes apparent, requiring 33% fewer cycles and delivering a 4x improvement in MFlop/s • Which architecture suits a code better depends on its computational characteristics

  22. Low Computational Intensity Example: SPMV • Performance is low using the original matrix due to lack of parallelism (only 8 and 18 nonzeros per row), but VIRAM achieves a much higher fraction of peak • The Ellpack (filled) version performs better on both architectures, with VIRAM still showing better performance characteristics (see the sketch below) • Note that padding matrices to create equal row lengths can make the fraction of useful operations arbitrarily low (though equally across both architectures)
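A minimal Ellpack-style SPMV sketch (hypothetical names): every row is padded to the same length K, so the inner loop has a fixed trip count and vectorizes cleanly, at the cost of wasted arithmetic on the zero fill:

```c
/* Ellpack sparse matrix-vector product. val and col are nrows x K arrays,
   row-major, with short rows padded by zeros (val) and a valid dummy
   column index (col). */
void spmv_ellpack(int nrows, int K, const double *val, const int *col,
                  const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = 0; k < K; k++)                  /* fixed-length inner loop */
            sum += val[i*K + k] * x[col[i*K + k]];   /* gather from x */
        y[i] = sum;
    }
}
```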

  23. High Computational Intensity Example: Complex QR Decomposition • Both use block variants of Householder QR, rich in BLAS3 operations • The use of complex elements increases computational intensity (ops/word), as sketched below • The VIRAM version is a port of CLAPACK; it involves inserting vectorization directives into the BLAS routines and minimizing strided access • Imagine uses blocks of 8 columns and requires complicated indexing logic • VIRAM sustains only 34% of peak due to large-stride memory access • Imagine performs at 65% of peak, an impressive rate of over 13 GFlop/s • Demonstrates the significant performance achievable on Imagine for applications with many operations per memory access
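To see why complex elements raise ops/word, consider a complex AXPY (an illustrative sketch modeled on the BLAS zaxpy operation, not code from the study): one complex multiply-add costs 8 real flops (4 multiplies, 4 adds/subtracts) versus 2 for the real case:

```c
#include <complex.h>

/* y := a*x + y over complex doubles: 8 real flops per element on 2x the
   data of the real case, i.e., twice the ops/word of a real AXPY. */
void zaxpy_sketch(int n, double complex a,
                  const double complex *x, double complex *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```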

  24. VIRAM and Imagine: Current Usability Problems • VIRAM: the compiler is poor at basic-block scheduling; there is no automatic unrolling or software pipelining, so assembly must be hand-tweaked; and there is no ISA support for multi-word records, which degrades memory performance and increases program complexity • Imagine: the programmer is exposed to the memory hierarchy (two levels of programming); the number of clusters (8) is exposed in the ISA, which is not portable and makes streams with (size % clusters) != 0 hard to handle; complicated control logic places a burden on the programmer • The Brook language attempts to address these issues

  25. Observations: VIRAM and Imagine • Relative performance depends on the computational requirements per data element (bytes/flop) • The two chips make different memory-organization trade-offs • Programming complexity is high for both approaches, although VIRAM is based on established vector technology • For well-suited applications, the Imagine processor can sustain over 10 GFlop/s (simulated results) • A large volume of homogeneous computation is required to saturate Imagine, while VIRAM can operate on small vector sizes • Imagine can take advantage of producer-consumer locality • Both offer significant reductions in power and space
