## Notes on Homework 1


Optimizations to consider:

- Blocking for multiple levels (registers/SIMD, L1, L2)
- Unrolling the loops
- Copy optimizations
- Aligning memory and padding
- Adding SIMD intrinsics
- Tuning parameters
- Using compiler flags
- Checking assembly code for SSE instructions

Remember: the write-up DUE DATE changed; it is now due Monday, Feb 17th at 11:59pm.
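
The first item, blocking, can be sketched as follows. This is a minimal illustration, not the assignment's required kernel: it assumes column-major storage with leading dimension `n`, and `BLOCK` is a tile size we picked for the sketch, with fringe handling for sizes that are not multiples of the tile.

```c
#include <assert.h>

#define BLOCK 32   /* illustrative tile size; tune per cache level */

/* Multiply an M x K tile of A by a K x N tile of B into an M x N tile of C.
   All matrices are column-major with leading dimension n. */
static void do_block(int n, int M, int N, int K,
                     const double *A, const double *B, double *C) {
    for (int j = 0; j < N; j++)
        for (int k = 0; k < K; k++)
            for (int i = 0; i < M; i++)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
}

/* One level of blocking: each (i,j,k) tile triple touches sub-matrices
   small enough to stay in cache while do_block runs. */
void dgemm_blocked(int n, const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j += BLOCK)
        for (int k = 0; k < n; k += BLOCK)
            for (int i = 0; i < n; i += BLOCK) {
                int M = (n - i < BLOCK) ? n - i : BLOCK;  /* fringe sizes */
                int N = (n - j < BLOCK) ? n - j : BLOCK;
                int K = (n - k < BLOCK) ? n - k : BLOCK;
                do_block(n, M, N, K,
                         A + i + k*n, B + k + j*n, C + i + j*n);
            }
}
```

Only a single level is shown; the same pattern nests, with one tile size per level (registers/SIMD, L1, L2).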

## 2x2 Matrix Multiply

- `C00 += A00*B00 + A01*B10`
- `C10 += A10*B00 + A11*B10`
- `C01 += A00*B01 + A01*B11`
- `C11 += A10*B01 + A11*B11`

Rewritten as SIMD algebra, where each name like `C00_C10` is a 2-element vector holding one column:

- `C00_C10 += A00_A10 * B00_B00`
- `C01_C11 += A00_A10 * B01_B01`
- `C00_C10 += A01_A11 * B10_B10`
- `C01_C11 += A01_A11 * B11_B11`
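
A quick way to sanity-check the rewrite with plain doubles: treat each `X00_X10` name as a two-element array holding one column (column-major 2x2 inputs). This emulates the vector algebra only; it is not SSE code.

```c
/* Emulate the SIMD rewrite with scalar code: each 2-element array plays
   the role of one __m128d holding a column. A, B, C are column-major 2x2
   matrices stored as 4 doubles: X[0]=X00, X[1]=X10, X[2]=X01, X[3]=X11. */
void simd_style_2x2(const double *A, const double *B, double *C) {
    double C00_C10[2] = {C[0], C[1]};   /* first column of C  */
    double C01_C11[2] = {C[2], C[3]};   /* second column of C */

    /* C00_C10 += A00_A10 * B00_B00 */
    C00_C10[0] += A[0] * B[0];  C00_C10[1] += A[1] * B[0];
    /* C01_C11 += A00_A10 * B01_B01 */
    C01_C11[0] += A[0] * B[2];  C01_C11[1] += A[1] * B[2];
    /* C00_C10 += A01_A11 * B10_B10 */
    C00_C10[0] += A[2] * B[1];  C00_C10[1] += A[3] * B[1];
    /* C01_C11 += A01_A11 * B11_B11 */
    C01_C11[0] += A[2] * B[3];  C01_C11[1] += A[3] * B[3];

    C[0] = C00_C10[0]; C[1] = C00_C10[1];
    C[2] = C01_C11[0]; C[3] = C01_C11[1];
}
```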

## Summary of SSE intrinsics

- `#include <emmintrin.h>`
- Vector data type: `__m128d`

Load and store operations:

- `_mm_load_pd`
- `_mm_store_pd`
- `_mm_loadu_pd`
- `_mm_storeu_pd`

Load and broadcast across vector:

- `_mm_load1_pd`

Arithmetic:

- `_mm_add_pd`
- `_mm_mul_pd`

## Example: multiplying 2x2 matrices

```c
#include <emmintrin.h>

__m128d a, b1, b2, c1, c2;

c1 = _mm_loadu_pd( C+0*lda );    // load unaligned block of C
c2 = _mm_loadu_pd( C+1*lda );

for( int i = 0; i < 2; i++ )
{
    a  = _mm_load_pd( A+i*lda );     // load aligned i-th column of A
    b1 = _mm_load1_pd( B+i+0*lda );  // load i-th row of B
    b2 = _mm_load1_pd( B+i+1*lda );

    c1 = _mm_add_pd( c1, _mm_mul_pd( a, b1 ) );  // rank-1 update
    c2 = _mm_add_pd( c2, _mm_mul_pd( a, b2 ) );
}

_mm_storeu_pd( C+0*lda, c1 );    // store unaligned block of C
_mm_storeu_pd( C+1*lda, c2 );
```
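
Wrapped into a self-contained function, the example can be checked against the scalar formulas. This sketch assumes an x86-64 target with SSE2, column-major 2x2 matrices with `lda = 2`, and a 16-byte aligned A (required by `_mm_load_pd`).

```c
#include <emmintrin.h>

/* 2x2 update C += A*B, column-major with leading dimension lda.
   A's columns must start 16-byte aligned for _mm_load_pd. */
static void dgemm_2x2(int lda, const double *A, const double *B, double *C) {
    __m128d c1 = _mm_loadu_pd(C + 0*lda);            /* (C00, C10) */
    __m128d c2 = _mm_loadu_pd(C + 1*lda);            /* (C01, C11) */
    for (int i = 0; i < 2; i++) {
        __m128d a  = _mm_load_pd(A + i*lda);         /* i-th column of A */
        __m128d b1 = _mm_load1_pd(B + i + 0*lda);    /* broadcast B(i,0) */
        __m128d b2 = _mm_load1_pd(B + i + 1*lda);    /* broadcast B(i,1) */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));      /* rank-1 update */
        c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    }
    _mm_storeu_pd(C + 0*lda, c1);
    _mm_storeu_pd(C + 1*lda, c2);
}
```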

## General suggestions for optimizations

- Changing the order of the loops (i, j, k) changes which matrix stays resident in fast memory and the type of product computed in the inner loop.
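
For instance (a sketch assuming column-major storage, not the required kernel): the jki ordering keeps the scalar B(k,j) in a register while walking down columns of A and C with unit stride, so the inner loop is a rank-1-style column update.

```c
/* jki loop order, column-major storage with leading dimension n:
   the inner i-loop streams one column of A and one column of C with
   unit stride while B(k,j) stays in a register. */
void dgemm_jki(int n, const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j++)           /* one column of C at a time */
        for (int k = 0; k < n; k++) {
            double bkj = B[k + j*n];      /* held in a register */
            for (int i = 0; i < n; i++)   /* unit stride over A and C */
                C[i + j*n] += A[i + k*n] * bkj;
        }
}
```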

- Blocking for L3 is not important: most matrices will fit in L3, and optimizations there are not likely to bring more benefit.

- Blocking brings a large benefit when done for the sub-matrix that is kept in fast memory; be careful not to overdo it (the overhead is high for every level).

- Aligning memory and padding can help with SIMD performance by eliminating special cases and their poor performance.
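
A minimal sketch of the idea, using hypothetical helper names of our own (not from the assignment): pad the leading dimension to an even number of doubles and allocate a 16-byte aligned buffer, so that every column starts aligned for `_mm_load_pd` and the 2x2 kernel needs no special cases at the edges.

```c
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign */
#include <stdlib.h>
#include <stdint.h>

/* Round the leading dimension up to a multiple of 2 doubles (16 bytes),
   so every column offset lda*j*sizeof(double) is a multiple of 16. */
static size_t padded_lda(size_t n) {
    return (n + 1) & ~(size_t)1;
}

/* Allocate an lda x n column-major matrix with a 16-byte aligned base
   pointer; combined with an even lda, every column is 16-byte aligned. */
static double *alloc_aligned_matrix(size_t lda, size_t n) {
    void *p = NULL;
    if (posix_memalign(&p, 16, lda * n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
}
```

The padded rows are never read by the kernel; they exist only so aligned loads and fixed-size register blocks work for every column.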

- You cannot get over 50% of peak performance without explicit intrinsics or auto-vectorization.

- Try various block sizes and register-block sizes (perhaps automate the search).

- Check the compilers: options are available for PGI, PathScale, GNU, and Cray.
- The Cray compiler seems to have an issue with explicit intrinsics.

- Remember that you are not allowed to use matrix-multiply-specific flags (e.g., -matmul).

- Use the -S flag to see the generated assembly.
- It should contain mainly ADDPD & MULPD ops; ADDSD & MULSD are scalar computations.

- The write-up should explain which optimizations you tried and what worked and what failed.
- Try to have an incremental design and show the performance of multiple iterations.
