
A benchmark for sparse matrix-vector multiplication



  1. A benchmark for sparse matrix-vector multiplication • Hormozd Gahvari and Mark Hoemmen {hormozd|mhoemmen}@eecs • http://mhoemmen.arete.cc/Report/ • Research made possible by: NSF, Argonne National Lab, a gift from Intel, National Energy Research Scientific Computing Center, and Tyler Berry tyler@arete.cc

  2. Topics for today: • Sparse matrix-vector multiplication (SMVM) and the Sparsity optimization • Preexisting SMVM benchmarks vs. ours • Results: Performance predictors • Test case: Desktop SIMD

  3. Sparse matrix-vector multiplication • Sparse vs. dense matrix * vector • Dense: Can take advantage of temporal, spatial locality (BLAS level 2,3) • Sparse: “Stream through” matrix one value at a time • Index arrays: Lose locality • Compressed sparse row (CSR) format
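  For reference, a minimal CSR SMVM sketch in C (the array names row_ptr, col_idx, and val are generic, not taken from the benchmark's code):

      /* y = A*x with A in compressed sparse row (CSR) format. */
      void smvm_csr(int m, const int *row_ptr, const int *col_idx,
                    const double *val, const double *x, double *y)
      {
          for (int i = 0; i < m; i++) {
              double sum = 0.0;
              /* Stream through the nonzeros of row i; the indirect
                 access x[col_idx[k]] is what loses locality. */
              for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
                  sum += val[k] * x[col_idx[k]];
              y[i] = sum;
          }
      }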

  4. Register block optimization • Many matrices have small blocks • FEM matrices especially • 2x2, 3x3, 6x6 common • Register blocking: Like unrolling a loop (circumvent latencies) • Sparsity: • Automatic heuristic optimal block size selection
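  To make the optimization concrete, here is a sketch of a 2x2 register-blocked (BSR) multiply, assuming blocks stored row-major within the block as Sparsity does (see slide 19); the accumulators y0 and y1 stay in registers across a row of blocks, so each loaded x value is reused and multiply-add latencies can be overlapped:

      /* y = A*x with A in 2x2 block sparse row (BSR) format. */
      void smvm_bsr_2x2(int mb, const int *block_ptr, const int *block_col,
                        const double *val, const double *x, double *y)
      {
          for (int ib = 0; ib < mb; ib++) {
              double y0 = 0.0, y1 = 0.0;   /* register accumulators */
              for (int k = block_ptr[ib]; k < block_ptr[ib+1]; k++) {
                  const double *b  = val + 4*k;          /* 2x2 block, row-major */
                  const double *xp = x + 2*block_col[k]; /* source vector slice  */
                  y0 += b[0]*xp[0] + b[1]*xp[1];
                  y1 += b[2]*xp[0] + b[3]*xp[1];
              }
              y[2*ib]   = y0;
              y[2*ib+1] = y1;
          }
      }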

  5. SMVM benchmarks: Three strategies • Actually do SMVM with test cases • Simpler ops “simulating” SMVM • Analytical / heuristic model

  6. 1) Actually do SMVM • SparseBench: Iterative Krylov solvers • Tests other things besides SMVM! • SciMark 2.0: • Fixed problem size • Uses unoptimized CSR (no reg. blocks) • Doesn't capture potential performance with many types of matrices • Register blocking: Large impact (as we'll see)

  7. 2) Microbenchmarks “simulating” SMVM • Goal: capture SMVM behavior with simple set of operations • STREAM http://www.streambench.org/ • “Sustained memory bandwidth” • Copy, Scale, Add, Triad • Triad: like dense level-1 BLAS DAXPY • Rich Vuduc's indirect indexed variants • Resemble sparse matrix addressing • Still not predictive
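  For concreteness, STREAM Triad and an indirect-indexed variant in the spirit of Vuduc's microbenchmarks (a sketch; the function and array names are ours):

      /* STREAM Triad: a[i] = b[i] + s * c[i]. */
      void triad(int n, double *a, const double *b, const double *c, double s)
      {
          for (int i = 0; i < n; i++)
              a[i] = b[i] + s * c[i];
      }

      /* Indirect-indexed variant: the c[idx[i]] access mimics the
         x[col_idx[k]] addressing pattern of sparse matrix-vector
         multiplication. */
      void triad_indirect(int n, double *a, const double *b,
                          const double *c, const int *idx, double s)
      {
          for (int i = 0; i < n; i++)
              a[i] = b[i] + s * c[idx[i]];
      }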

  8. 3) Analytical models of SMVM performance • Account for miss rates, latencies and bandwidths • Sparsity: bounds as heuristic to predict best block dimensions for a machine • Upper and lower bounds not tight, so difficult to use for performance prediction • Sparsity's goal: optimization, not performance prediction

  9. Our SMVM benchmark • Do SMVM with BSR matrix: randomly scattered blocks • BSR format: Typically less structured matrices anyway • “Best” block size, 1x1 • Characterize different matrix types • Take advantage of potential optimizations (unlike current benchmarks), but in a general way

  10. Dense matrix in sparse format • Test this with optimal block size: • To show that fill doesn't affect performance much • Fill: affects locality of accesses to source vector

  11. Data set sizing • Size vectors to fit in largest cache, matrix out of cache • Tests “streaming in” of matrix values • Natural scaling to machine parameters! • “Inspiration” SPECfp92 (small enough so manufacturers could size cache to fit all data) vs. SPECfp95 (data sizes increased) • Fill now machine-dependent: • Tests show fill (locality of source vector accesses) has little effect
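  A minimal sketch of the sizing rule, assuming the source and destination vectors must together fit in the largest cache (the function name and the two-vector assumption are ours); the matrix is then made several times larger than the cache, so its values stream from memory:

      #include <stddef.h>

      /* Largest vector length n such that two n-element double
         vectors fit in a cache of cache_bytes bytes. */
      size_t vector_length(size_t cache_bytes)
      {
          return cache_bytes / (2 * sizeof(double));
      }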

  12. Results: “Best” block size • Highest Mflops/s value for the block sizes tested, for: • Sparse matrix (fill chosen as above) • Dense matrix in sparse format (4096 x 4096) • Compare with Mflops/s for STREAM Triad (a[i] = b[i] + s * c[i])

  13. Ranking processors according to benchmarks: • For optimized (best block size) SMVM: • Peak memory bandwidth is a good predictor of the Itanium 2 / Pentium 4 / Pentium M relationship • STREAM mispredicts these • STREAM: • Better predicts unoptimized (1x1) SMVM • Peak bandwidth no longer helpful

  14. Our benchmark: Useful performance indicator • Comparison with results for “real-life” matrices: • Works well for FEM matrices • Not always as well for non-FEM matrices • More wasted space in block data structure: directly proportional to slowdown

  15. Comparison of Benchmark with Real Matrices • The following two graphs show the Mflops/s rate of matrices generated by our benchmark vs. matrices from the BeBOP group and a dense matrix in sparse format • Plots compare by block size; the matrix “number” is given in parentheses. Matrices 2-9 are FEM matrices. • A comprehensive list of the BeBOP test suite matrices can be found in Vuduc et al., “Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply,” 2002.

  16. Comparison Conclusions • Our benchmark does a good job modeling real data • Dense matrix in sparse format looks good on Ultra 3, but is noticeably inferior to our benchmark for large block sizes on Itanium 2

  17. Evaluating SIMD instructions • SMVM benchmark: • Tool to evaluate arch. features • e.g.: Desktop SIMD floating-point • SSE-2 ISA: • Pentium 4, M; AMD Opteron • Parallel ops on 2 floating-point doubles • {ADD|MUL|DIV}PD: arithmetic • MOVAPD: load aligned pair

  18. Vectorizing DAXPY • Register block: small dense matrix * vector • Depends on matrix data ordering: • Column-major (Fortran-style): • Need scalar * vector operation • Row-major (C-style): • Need “reduce” (dot product)
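  The layout dependence in miniature, for a 2x2 register block (a sketch; function names are ours): column-major proceeds by scalar * column updates, while row-major needs a dot-product reduction per row.

      /* Column-major (Fortran-style): y += x[j] * (column j of b). */
      void block_colmajor(const double *b, const double *x, double *y)
      {
          for (int j = 0; j < 2; j++)
              for (int i = 0; i < 2; i++)
                  y[i] += b[2*j + i] * x[j];
      }

      /* Row-major (C-style): y[i] += dot(row i of b, x),
         a reduction per row. */
      void block_rowmajor(const double *b, const double *x, double *y)
      {
          for (int i = 0; i < 2; i++)
              for (int j = 0; j < 2; j++)
                  y[i] += b[2*i + j] * x[j];
      }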

  19. Sparsity register block layout • Row-major order within block • Vs. Sparse BLAS proposal (col-major)! • Vector reductions change associativity (results may differ from scalar version, due to roundoff) • We chose to keep row-major for now • Can't just switch algorithm: orientation affects stride of vector loads • Need a good vector reduction

  20. Vector reduce • e.g. C. Kozyrakis' recent UC Berkeley Ph.D. thesis on multimedia vector ops • “vhalf” instruction: • Copy lower half of src vector reg. --> upper half of dest. • Iterate (vhalf, vector add) to reduce.

  21. SSE-2 has “vhalf”!

      # Sum the 2 elements of %xmm1:
      # --------------------------------
      # Low 8B of %xmm1 --> high 8B of %xmm0
      SHUFPD %xmm0, %xmm1
      # High 8B of %xmm0 gets the sum
      ADDPD %xmm0, %xmm1
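  The same two-element reduction written with SSE-2 intrinsics (our sketch, not the benchmark's code):

      #include <emmintrin.h>

      /* Sum the two doubles in v: swap halves with SHUFPD, add the
         low elements, and extract the scalar result. */
      static inline double reduce2(__m128d v)
      {
          __m128d swapped = _mm_shuffle_pd(v, v, 1);    /* (hi, lo) */
          return _mm_cvtsd_f64(_mm_add_sd(v, swapped));
      }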

  22. One possible SSE-2 6x6 A*x • %xmm0 <- (dest(0), 0) • 6 MOVAPD: interleave matrix row pairs and src vector pairs • Update indices • 3x (MULPD, then ADDPD to %xmm0) • Sum elems of %xmm0 • (SHUFPD and ADDPD) • Extract and store sum
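  A sketch of that recipe with SSE-2 intrinsics, for one row of the 6x6 block (our illustration; it assumes row and x are 16-byte aligned, as MOVAPD requires):

      #include <emmintrin.h>

      /* One row of a 6x6 register block times x, accumulated into y0. */
      static double row6_times_x(const double *row, const double *x, double y0)
      {
          __m128d acc = _mm_set_sd(y0);                 /* (y0, 0)           */
          for (int j = 0; j < 6; j += 2) {
              __m128d a = _mm_load_pd(row + j);         /* MOVAPD row pair   */
              __m128d v = _mm_load_pd(x + j);           /* MOVAPD src pair   */
              acc = _mm_add_pd(acc, _mm_mul_pd(a, v));  /* MULPD then ADDPD  */
          }
          __m128d hi = _mm_shuffle_pd(acc, acc, 1);     /* SHUFPD: swap      */
          return _mm_cvtsd_f64(_mm_add_sd(acc, hi));    /* sum and extract   */
      }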

  23. SSE-2: gcc and Intel C compilers won't vectorize! • They use the SIMD registers for scalar math instead! • SSE-2 latency: 1 cycle less than x87 • x87 uses the same functional unit as SIMD anyway • Vector reduce sub-optimal? • Fewer ops: less latency-hiding potential • Only 8 XMM regs: can't unroll • Col-major suboptimal: • No scalar * vector instruction! • Or the alignment issue...

  24. “Small matrix library” • From Intel: matrix * vector • Optimized for 6x6 or smaller • Idea: • Replace Sparsity's explicit (BLAS-1-like) register block multiplication... • ...with an optimized function (BLAS-2-like) • We're working on this; it's needed to say whether SIMD is valuable

  25. SIMD load: alignment • Possible reason for no automatic vectorization • Load pair needs alignment on 16-byte boundaries • Non-aligned load: slower • Compiler can't guarantee alignment • Itanium 2: Same issue reappears...

  26. SSE-2 results: Disappointing • Pentium M: gains nothing • Pentium 4: actually gains a little • SSE-2 1 cycle lower latency than x87 • Small blocks: latency dominates • x87 ISA harder to schedule • AMD Opteron not available for testing • 16 XMM regs (vs. 8): better unrolling capability?

  27. How SSE-2 should look: STREAM Scale • b[0:N-1] = scalar * c[0:N-1] (speedup 1.72)

      Loop: movapd  c(%eax), %xmm4
            mulpd   %xmm0, %xmm4
            movntpd %xmm4, b(%eax)
            addl    $16, %eax
            cmpl    $16000000, %eax
            jl      Loop

  28. NetBurst microarchitecture (Pentium 4/M) (Note: diagram used without permission)

  29. Can NetBurst keep up with DAXPY? • In one cycle: • 1 load aligned pair, 1 store aligned pair, 1 SIMD flop (alternate ADDPD/MULPD) • DAXPY (in row-major): Triad-like • y(i) = y(i) + A(i,j) * x(j) • If y(i) is already loaded: 2 loads, 1 mul, 1 add, 1 store • Ratio of loads to stores inadequate? • Itanium 2 changes this...

  30. Itanium 2: Streaming floating-point • NO SSE-2 support!!! • BUT: In 1 cycle: 2 MMF bundles: • 2 load-pair (4 loads), 2 stores • 2 FMACs (a + s * b) • (Or MFI: load pair, FMAC, update index) • 1 cycle: theoretically 2x DAXPY!

  31. Itanium 2: Alignment strikes again! • Intel C Compiler won't generate “load pair” instructions!!! • Why? • ldfpd (“load pair”) needs aligned data • Compiler doesn't see underlying dense BLAS 2 structure? • Register pressure?

  32. SIMD conclusions: • STREAM Triad suggests modest potential speedup • Multiple scalar functional units: • More flexible than SIMD: Speedup independent of orientation • Code scheduling difficult • Pragmas to tell compiler data is aligned • Encapsulate block A*x in hand-coded routine

  33. Conclusions: • Our benchmark: • Good SMVM performance prediction • Scales for any typical uniprocessor • With “optimal” block sizes: • Performance tied to memory bandwidth • With 1x1 blocks: • Performance related more to latency

  34. Conclusions (2): • SIMD: Need to test custom mini dense matrix * vector routines • Development will continue after this semester: • More testing • Parallelization
