cs4961 parallel programming lecture 11 data locality cont mary hall september 28 2010
Download
Skip this Video
Download Presentation
CS4961 Parallel Programming Lecture 11: Data Locality, cont. Mary Hall September 28, 2010

Loading in 2 Seconds...

play fullscreen
1 / 17

CS4961 Parallel Programming Lecture 11: Data Locality, cont. Mary Hall September 28, 2010 - PowerPoint PPT Presentation


  • 115 Views
  • Uploaded on

CS4961 Parallel Programming Lecture 11: Data Locality, cont. Mary Hall September 28, 2010. Programming Assignment 2: Due 11:59 PM, Friday October 8. Combining Locality, Thread and SIMD Parallelism:

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' CS4961 Parallel Programming Lecture 11: Data Locality, cont. Mary Hall September 28, 2010' - manchu


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
cs4961 parallel programming lecture 11 data locality cont mary hall september 28 2010

CS4961 Parallel ProgrammingLecture 11: Data Locality, cont.Mary HallSeptember 28, 2010

CS4961

programming assignment 2 due 11 59 pm friday october 8
Programming Assignment 2: Due 11:59 PM, Friday October 8

Combining Locality, Thread and SIMD Parallelism:

The following code excerpt is representative of a common signal processing technique called convolution. Convolution combines two signals to form a third signal. In this example, we slide a small (32x32) signal around in a larger (4128x4128) signal to look for regions where the two signals have the most overlap.

for (l=0; l<N; l++) {

for (k=0; k<N; k++) {

C[k][l] = 0.0;

for (j=0; j<W; j++) {

for (i=0; i<W; i++) {

C[k][l] += A[k+i][l+j]*B[i][j];

}

}

}

}

CS4961

programming assignment 2 cont
Programming Assignment 2, cont.
  • Your goal is to take the code as written, and by either changing the code or changing the way you invoke the compiler, improve its performance.
  • You will need to use an Intel platform that has the “icc” compiler installed.
    • You can use the CADE Windows lab, or another Intel platform with the appropriate software.
    • The Intel compiler is available for free on Linux platforms for non-commercial use.
    • The version of the compiler, the specific architecture and the flags you give to the compiler will drastically impact performance.
  • You can discuss strategies with other classmates, but do not copy code. Also, do not copy solutions from the web. This must be your own work.

CS4961

programming assignment 2 cont1
Programming Assignment 2, cont.
  • How to compile
    • OpenMP (all versions): icc –openmp conv-assign.c
  • Measure and report performance for five versions of the code, and turn in all variants:
    • Baseline: compile and run code as provided (5 points)
    • Thread parallelism only (using OpenMP): icc –openmp conv-omp.c (5 points)
    • SSE-3 only: icc –openmp –msse3 –vec-report=3 conv-sse.c (10 points)
    • Locality only: icc –openmp conv-loc.c (10 points)
    • Combined: icc –openmp –msse3 –vec-report=3 conv-all.c (15 points)
  • Explain results and observations (15 points)
  • Extra credit (10 points): improve results by optimizing for register reuse

CS4961

targets of memory hierarchy optimizations
Targets of Memory Hierarchy Optimizations
  • Reduce memory latency
    • The latency of a memory access is the time (usually in cycles) between a memory request and its completion
  • Maximize memory bandwidth
    • Bandwidth is the amount of useful data that can be retrieved over a time interval
  • Manage overhead
    • Cost of performing optimization (e.g., copying) should be less than anticipated gain

CS4961

reuse and locality
Reuse and Locality
  • Consider how data is accessed
    • Data reuse:
      • Same or nearby data used multiple times
      • Intrinsic in computation
    • Data locality:
      • Data is reused and is present in “fast memory”
      • Same data or same data transfer
    • If a computation has reuse, what can we do to get locality?
      • Appropriate data placement and layout
      • Code reordering transformations

CS4961

temporal reuse in sequential code
Temporal Reuse in Sequential Code
  • Same data used in distinct iterations I and I’
  • for (i=1; i<N; i++)
    • for (j=1; j<N; j++)
    • A[j]= A[j+1]+A[j-1]
  • A[j]has self-temporal reuse in loopi

CS4961

spatial reuse
Spatial Reuse
  • Same data transfer (usually cache line) used in distinct iterations I and I’
  • for (i=1; i<N; i++)
    • for (j=1; j<N; j++)
    • A[j]= A[j+1]+A[j-1];
  • A[j]has self-spatial reuse in loopj
  • Multi-dimensional array note:
    • C arrays are stored in row-major order
    • Fortran arrays are stored in column-major order

CS4961

can use reordering transformations
Can Use Reordering Transformations!
  • Analyze reuse in computation
  • Apply loop reordering transformations to improve locality based on reuse
  • With any loop reordering transformation, always ask
    • Safety? (doesn’t reverse dependences)
    • Profitablity? (improves locality)

CS4961

loop permutation an example of a reordering transformation

for (i= 0; i<3; i++)

  • for (j=0; j<6; j++)
    • A[i][j+1]=A[i][j]+B[j];

i

i

new traversal order!

j

j

Loop Permutation:An Example of a Reordering Transformation

Permute the order of the loops to modify the traversal order

  • for (j=0; j<6; j++)
  • for (i= 0; i<3; i++)
    • A[i][j+1]=A[i][j]+B[j];

Which one is better for row-major storage?

CS4961

permutation has many goals
Permutation has many goals
  • Locality optimization
      • Particularly, for spatial locality (like in your SIMD assignment)
  • Rearrange loop nest to move parallelism to appropriate level of granularity
    • Inward to exploit fine-grain parallelism (like in your SIMD assignment)
    • Outward to exploit coarse-grain parallelismx
  • Also, to enable other optimizations

CS4961

tiling blocking another loop reordering transformation

I

I

J

Tiling (Blocking):Another Loop Reordering Transformation
  • Blocking reorders loop iterations to bring iterations that reuse data closer in time
  • Goal is to retain in cache between reuse

J

CS4961

tiling is fundamental
Tiling is Fundamental!
  • Tiling is very commonly used to manage limited storage
    • Registers
    • Caches
    • Software-managed buffers
    • Small main memory
  • Can be applied hierarchically

CS4961

tiling example
Tiling Example
  • for (j=1; j<M; j++)
    • for (i=1; i<N; i++)
    • D[i] = D[i] +B[j,i]
  • for (j=1; j<M; j++)
    • for (i=1; i<N; i+=s)
    • for (ii=i; ii<min(i+s-1,N); ii++)
    • D[ii] = D[ii] +B[j,ii]

Strip

mine

  • for (i=1; i<N; i+=s)
  • for (j=1; j<M; j++)
    • for (ii=i; ii<min(i+s-1,N); ii++)
    • D[ii] = D[ii] +B[j,ii]

Permute

CS4961

how to determine safety and profitability
How to Determine Safety and Profitability?
  • Safety
    • A step back to Lecture 2
    • Notion of reordering transformations
    • Based on data dependences
    • Intuition: Cannot permute two loops i and j in a loop nest if doing so reverses the direction of any dependence.
      • Tiling safety test is similar.
  • Profitability
    • Reuse analysis (and other cost models)
    • Also based on data dependences, but simpler

CS4961

simple examples 2 d loop nests
Simple Examples: 2-d Loop Nests
  • Ok to permute?
  • for (i= 0; i<3; i++)
    • for (j=0; j<6; j++)
    • A[i+1][j-1]=A[i][j]
    • +B[j]
  • for (i= 0; i<3; i++)
  • for (j=0; j<6; j++)
    • A[i][j+1]=A[i][j]+B[j]

CS4961

legality of tiling
Legality of Tiling
  • Tiling = strip-mine and permutation
    • Strip-mine does not reorder iterations
    • Permutation must be legal

OR

    • strip size less than dependence distance

CS4961

ad