
CSE 690: GPGPU
Lecture 7: Matrix Multiplications

Klaus Mueller

Computer Science, Stony Brook University

Basic Concept
  • Triple loop: iterate over rows i, columns j, and the inner dimension k, accumulating C(i,j) += A(i,k) * B(k,j)
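
A minimal CPU sketch of this triple loop (assuming square NxN matrices stored row-major; the function name is illustrative):

    /* Naive O(N^3) matrix multiply: C = A * B, all N x N, row-major. */
    void matmul(int N, const float *A, const float *B, float *C)
    {
        for (int i = 0; i < N; i++)               /* row of C */
            for (int j = 0; j < N; j++) {         /* column of C */
                float sum = 0.0f;
                for (int k = 0; k < N; k++)       /* inner (dot-product) loop */
                    sum += A[i*N + k] * B[k*N + j];
                C[i*N + j] = sum;
            }
    }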
GPU Algorithms
  • First algorithm (sketched below):
    • render a rectangle of size NxN
    • represent the matrices as NxN textures
    • each output position (i,j) is then a fragment
    • each fragment program is a loop (or unrolled loop) over k -> may exceed program-length limits
    • must pull in the same data many times -> poor data reuse, high bandwidth demand
    • makes no use of 4-way RGBA parallelism -> wastes a potential 4x speedup
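
A CPU-side sketch of what one fragment effectively computes (array reads stand in for texture fetches; an illustration, not the original shader):

    /* One "fragment" per output element (i,j); the fragment program is
       the k loop. Every fragment re-reads a full row of A and a full
       column of B, hence the poor data reuse noted above. */
    float fragment_ij(int N, const float *A, const float *B, int i, int j)
    {
        float acc = 0.0f;
        for (int k = 0; k < N; k++)          /* loop, or unrolled loop */
            acc += A[i*N + k] * B[k*N + j];  /* two fetches, one mad */
        return acc;
    }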
GPU Algorithms
  • Better algorithm (sketched below):
    • use the RGBA channels to pack a 2x2 submatrix per texel
    • use swizzling to facilitate data reuse
    • swizzling shortens the fragment program by a factor of 2
    • may need multiple passes for larger matrices
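
A sketch of the 2x2 packing; the channel layout (r=m00, g=m01, b=m10, a=m11) is an assumption for illustration:

    /* Multiply-accumulate two 2x2 blocks packed into 4-channel texels.
       On the GPU the products below map to swizzled 4-component mads. */
    typedef struct { float r, g, b, a; } texel;

    void mad_2x2(texel *c, texel A, texel B)
    {
        c->r += A.r * B.r + A.g * B.b;   /* c00 += a00*b00 + a01*b10 */
        c->g += A.r * B.g + A.g * B.a;   /* c01 += a00*b01 + a01*b11 */
        c->b += A.b * B.r + A.a * B.b;   /* c10 += a10*b00 + a11*b10 */
        c->a += A.b * B.g + A.a * B.a;   /* c11 += a10*b01 + a11*b11 */
    }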
GPU Algorithms
  • Using multi-texturing (sketched below)
    • requires l passes, accumulating partial products in the framebuffer
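
A CPU emulation of the multi-pass idea, in its simplest form with one inner-loop step per pass (so l = N here; multi-texturing lets several steps share a pass):

    /* Each "pass" p adds one outer-product term into C, emulating
       additive blending into the framebuffer. C is assumed zeroed. */
    void multipass(int N, const float *A, const float *B, float *C)
    {
        for (int p = 0; p < N; p++)          /* one render pass per step */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    C[i*N + j] += A[i*N + p] * B[p*N + j];
    }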
GPU Algorithms
  • Can use RGBA parallelism as well
    • each texel represents a 2x2 submatrix
    • use swizzling as usual
    • needs only l/2 passes, since each texel fetched covers two steps of the inner loop
GPU Algorithms
  • Instead of a 2x2 submatrix, pack 4x1 column vectors (sketched below)
    • reuses each texel read from B four times, but uses each texel from A only once
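
A sketch of the 4x1 packing; the exact texel layout is an assumption here: a texel of A holds a 4x1 column block, a texel of B holds four consecutive elements of one column of B:

    /* Accumulate a 4x1 block of C: c += sum_k A_block[k] * b[k].
       Each scalar b[k] (from one B texel) drives a 4-component mad,
       i.e. it is reused across 4 rows; each A texel is used once. */
    void step_4x1(const float A_block[4][4], const float b[4], float c[4])
    {
        for (int k = 0; k < 4; k++)           /* four mads */
            for (int r = 0; r < 4; r++)
                c[r] += A_block[k][r] * b[k]; /* b[k] reused 4 times */
    }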
GPU Algorithms
  • Instead of a 2x2 submatrix, pack 4x1 column vectors
    • 6 fetches are needed per 4 mads (mult-adds) -> 1.5 fetches per mad, 1.5 times more than before
    • but fewer rows and columns are accessed per pass -> improves cache-hit frequency
GPU Algorithms
  • Originally, each shader computes only one product
    • in practice the loop can be unrolled 3-6 times (computing 3-6 products per pass), as sketched below
    • the maximal fragment program length is the limit
    • unrolling reduces the number of passes required
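
A sketch of 4-way unrolling of the per-fragment loop (assuming N is a multiple of 4; the unroll factor of 3-6 quoted above depends on the program-length limit):

    /* Four multiply-adds per pass instead of one, quartering the
       number of passes at the cost of a longer fragment program. */
    float fragment_unrolled(int N, const float *A, const float *B, int i, int j)
    {
        float acc = 0.0f;
        for (int k = 0; k < N; k += 4) {
            acc += A[i*N + k    ] * B[(k    )*N + j];
            acc += A[i*N + k + 1] * B[(k + 1)*N + j];
            acc += A[i*N + k + 2] * B[(k + 2)*N + j];
            acc += A[i*N + k + 3] * B[(k + 3)*N + j];
        }
        return acc;
    }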
Reality Check
  • Would like to compare CPU and GPU efficiencies on GPGPU tasks
  • Matrix multiplication is insightful here
    • it features much data reuse
    • graphics workloads are generally more stream-like, with less data reuse
    • this difference may expose GPU limitations on reuse-heavy tasks
Platforms
  • Pentium 4, 3 GHz, 512 KB L2 cache
    • 12 GFLOPS peak compute
    • 44.1 GB/s cache bandwidth
    • using the sgemm routine from the ATLAS package
  • NVIDIA
    • GeForce 5900 Ultra
    • GeForce 6800 Ultra
  • ATI
    • Radeon 9800 XT
    • Radeon X800 XT PE
Analysis
  • Currently:
    • GPUs can fetch 16 floats and perform 16 4-component mads per clock
    • our app fetches 8 floats to perform one 4-component mad -> not enough computation per fetch
    • need more math ops per float fetched (> 8) to become compute-bound (see the arithmetic below)
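
Back-of-the-envelope check (counting one mad as 2 flops per component):

    peak:     16 mads x 4 components x 2 flops = 128 flops per 16 floats fetched -> 8 flops/float
    this app:  1 mad  x 4 components x 2 flops =   8 flops per  8 floats fetched -> 1 flop/float

The kernel delivers one flop per float fetched where the hardware balance point is eight, so it is bandwidth-bound.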
Analysis
  • Pentium processors have large L1 caches that sustain high memory bandwidth (BW)
    • their BW-to-compute ratio is better
    • this is the main reason GPUs achieve only a small performance gain for matrix multiplications
Analysis
  • Expectations
    • make sure there is enough arithmetic per data item fetched
    • lots of data reuse in the algorithm/task will make the CPU look better
    • streaming tasks are fine -> they don’t “suffer” from reuse
    • matrix multiplication is an excellent reality-check example
Analysis
  • What GPUs need:
    • bigger caches to enable larger blocks
    • currently there are enough registers to store a 6x6 submatrix
    • but shaders can currently produce only a small number of outputs -> limits the amount of blocking
    • full floating-point accumulator registers
    • a wider path between the texture and register files
References
  • E. Larsen and D. McAllister, “Fast matrix multiplies using graphics hardware,” Supercomputing 2001.
  • J. Hall, N. Carr, and J. Hart, “Cache and bandwidth aware matrix multiplication on the GPU,” Tech. Report UIUCDCS-R-2003-2328-1.
  • K. Fatahalian, J. Sugerman, and P. Hanrahan, “Understanding the efficiency of GPU algorithms for matrix-matrix multiplication,” Graphics Hardware Workshop 2004.