### CSE 690: GPGPU Lecture 7: Matrix Multiplications

Klaus Mueller

Computer Science, Stony Brook University

Basic Concept
• Triple loop over rows, columns, and the shared dimension (see the sketch below)
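
A minimal CPU sketch of that triple loop, for reference. The slides only name the concept; the row-major layout and names below are illustrative assumptions:

```cuda
// Baseline triple loop: C = A * B for N x N row-major matrices.
void matmul_cpu(const float* A, const float* B, float* C, int N) {
    for (int i = 0; i < N; ++i)           // row of C
        for (int j = 0; j < N; ++j) {     // column of C
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)   // shared dimension
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}
```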
GPU Algorithms
• First algorithm (see the sketch after this list):
  • render a rectangle of size NxN
  • represent the matrices as NxN textures
  • each (i,j) is then a fragment
  • each fragment program is a loop or an unrolled loop -> may get too long
  • must pull in the same data many times -> poor data reuse, needs bandwidth
  • makes no use of 4-way RGBA parallelism -> wastes potential speedup
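
A hypothetical modern analogue of this first algorithm (the lecture targets Cg/GLSL-era fragment programs over an NxN rectangle; here one CUDA thread plays the role of one fragment):

```cuda
// One thread = one fragment (i,j); the k-loop is the per-fragment loop the
// slide warns may get too long. Each thread re-reads a whole row of A and
// column of B -> the poor data reuse noted above. Names are illustrative.
__global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // fragment x = column
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // fragment y = row
    if (i >= N || j >= N) return;
    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += A[i * N + k] * B[k * N + j];  // two fetches per multiply-add
    C[i * N + j] = sum;
}
```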
GPU Algorithms
• Better algorithm (see the sketch after this list):
  • use the RGBA channels to pack a 2x2 submatrix per texel
  • use swizzling to facilitate data reuse
  • swizzling shortens the fragment code by a factor of 2
  • may need multiple passes for larger matrices
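
A hedged sketch of the 2x2 packing, again in CUDA terms rather than the original fragment-program code:

```cuda
// Each float4 holds a 2x2 submatrix: x=m00, y=m01, z=m10, w=m11.
// One "fragment" now produces a 2x2 block of C, so every fetched texel
// feeds four multiply-adds. n = N/2 is the texture size in texels.
__global__ void matmul_2x2(const float4* A, const float4* B, float4* C, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= n || j >= n) return;
    float4 c = make_float4(0.f, 0.f, 0.f, 0.f);
    for (int k = 0; k < n; ++k) {
        float4 a = A[i * n + k];  // 2x2 block of A
        float4 b = B[k * n + j];  // 2x2 block of B
        // 2x2 block product; a fragment program would express the operand
        // pairs with swizzles, e.g. a.xxzz * b.xyxy + a.yyww * b.zwzw
        c.x += a.x * b.x + a.y * b.z;
        c.y += a.x * b.y + a.y * b.w;
        c.z += a.z * b.x + a.w * b.z;
        c.w += a.z * b.y + a.w * b.w;
    }
    C[i * n + j] = c;
}
```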
GPU Algorithms
• Using multi-texturing (see the sketch after this list)
  • requires l passes for l x l matrices, each pass accumulating one term of the dot products
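
One way to read the multi-pass scheme, emulated on the CPU for clarity. The mechanics are my assumption: in the pipeline, each k below would be one rectangle draw whose term is added into the framebuffer by blending.

```cuda
// Multi-pass view: pass k adds the rank-1 update (column k of A) times
// (row k of B) into C. C must be zero-initialized before the first pass.
void matmul_multipass(const float* A, const float* B, float* C, int N) {
    for (int k = 0; k < N; ++k)                 // l = N passes in total
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * N + k] * B[k * N + j];  // blend-add
}
```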
GPU Algorithms
• Can use RGBA parallelism as well
  • each texel represents a 2x2 submatrix
  • use swizzling as usual
  • needs l/2 passes
GPU Algorithms
• Instead of a 2x2 submatrix, pack 4x1 column vectors (see the sketch after this list)
  • makes 4-fold reuse of texels read from B, but uses texels from A only once
  • 6 fetches are needed for 4 MADs (multiply-adds) -> 1.5 times more than before
  • but fewer rows and columns are accessed per pass -> improves cache-hit frequency
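
A hedged sketch of one k-step of this 4x1 scheme (memory layout and names are assumptions): four texels of A, one texel of B, plus the running block of C, which the multi-pass original re-fetches every pass; that is the 6 fetches feeding 4 MADs counted above.

```cuda
// B and C texels hold 4x1 column vectors; A texels hold the 4x1 columns of
// a 4x4 block. The single B texel is reused by all four MADs; the four A
// texels are each used once, matching the reuse pattern on the slide.
__global__ void matmul_4x1(const float4* A, const float4* B, float4* C, int N) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // 4-row block of C
    if (i >= N / 4 || j >= N) return;
    float4 c = C[i * N + j];  // running total (re-fetched each pass originally)
    for (int k = 0; k < N / 4; ++k) {
        float4 b  = B[k * N + j];          // 1 fetch from B, used 4 times
        float4 a0 = A[i * N + 4 * k + 0];  // 4 fetches from A, used once each
        float4 a1 = A[i * N + 4 * k + 1];
        float4 a2 = A[i * N + 4 * k + 2];
        float4 a3 = A[i * N + 4 * k + 3];
        // four 4-component MADs: c += a0*b.x + a1*b.y + a2*b.z + a3*b.w
        c.x += a0.x * b.x + a1.x * b.y + a2.x * b.z + a3.x * b.w;
        c.y += a0.y * b.x + a1.y * b.y + a2.y * b.z + a3.y * b.w;
        c.z += a0.z * b.x + a1.z * b.y + a2.z * b.z + a3.z * b.w;
        c.w += a0.w * b.x + a1.w * b.y + a2.w * b.z + a3.w * b.w;
    }
    C[i * N + j] = c;
}
```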
GPU Algorithms
• Originally, only one product is computed per shader
  • in practice the loop can be unrolled 3-6 times (computing 3-6 products); see the sketch after this list
  • the maximum fragment-program length is the limit
  • reduces the number of passes required
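
An illustrative unroll of the naive kernel above by a factor of 4 (the slide puts the practical range at 3-6):

```cuda
// Same naive kernel with the k-loop unrolled by 4: more products per
// shader invocation, fewer iterations/passes. Assumes N % 4 == 0.
__global__ void matmul_unrolled(const float* A, const float* B, float* C, int N) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= N || j >= N) return;
    float sum = 0.0f;
    for (int k = 0; k < N; k += 4) {
        sum += A[i * N + k    ] * B[(k    ) * N + j];
        sum += A[i * N + k + 1] * B[(k + 1) * N + j];
        sum += A[i * N + k + 2] * B[(k + 2) * N + j];
        sum += A[i * N + k + 3] * B[(k + 3) * N + j];
    }
    C[i * N + j] = sum;
}
```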
Reality Check
• Would like to compare CPU and GPU efficiencies for GPGPU tasks
• The task of matrix multiplication is insightful here
  • it features much data reuse
  • graphics programs are generally more stream-like and have less data reuse
  • this may lead to some limitations
Platforms
• Pentium 4, 3 GHz, 512 KB L2 cache
  • 12 GFLOPS peak compute
  • 44.1 GB/sec cache bandwidth
  • using the sgemm routine from the ATLAS package
• NVIDIA
  • GeForce 5900 Ultra
  • GeForce 6800 Ultra
• ATI
Analysis
• Currently:
  • GPUs can fetch 16 floats and perform 16 4-component MADs per clock
  • our app fetches 8 floats to perform one 4-component MAD -> not enough computation per fetch
  • need more math ops per float fetched (> 8)
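
Made concrete with the slide's numbers (one 4-component MAD = 4 multiplies + 4 adds = 8 flops):

$$
\text{hardware peak: } \frac{16 \times 8 \text{ flops}}{16 \text{ floats}} = 8 \text{ flops/float},
\qquad
\text{this app: } \frac{1 \times 8 \text{ flops}}{8 \text{ floats}} = 1 \text{ flop/float},
$$

so a shader needs more than 8 flops per fetched float before compute, rather than bandwidth, becomes the limit.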
Analysis
• Pentium processors have large L1 caches that provide high memory bandwidth (BW)
  • their BW-to-compute ratio is better
  • this is the main reason why GPUs achieve only a small performance gain for matrix multiplications
Analysis
• Expectations:
  • make sure that there is enough arithmetic per data item fetched
  • lots of data reuse in the algorithm/task will make the CPU look better
  • streaming data are OK -> they do not depend on reuse
  • matrix multiplication is an excellent reality-check example
Analysis
• What do GPUs need?
  • bigger caches to enable larger blocks
    • currently there are enough registers to store a 6x6 submatrix
    • but shaders can currently produce only a small number of outputs -> this limits the amount of blocking
  • full floating-point accumulator registers
  • a wider path between the texture and register files
References
• E. Larsen and D. McAllister, “Fast matrix multiplies using graphics hardware,” Supercomputing 2001.
• J. Hall, N. Carr, and J. Hart, “Cache and bandwidth aware matrix multiplication on the GPU,” Tech. Report UIUCDCS-R-2003-2328-1, 2003.
• K. Fatahalian, J. Sugerman, and P. Hanrahan, “Understanding the efficiency of GPU algorithms for matrix-matrix multiplication,” Graphics Hardware 2004.