Parallel and Pipeline Programming


Super-scalar, pipelined with vector instruction support.

Definitions

  • Super-scalar - multiple integer or floating-point ALUs
  • Pipeline - executes instructions in steps like an assembly line
  • Stall - instruction execution state that delays a pipeline step
    • If an add takes 2 steps and there are two ALUs, then 3 adds in a row could cause a stall
Access Times

Hierarchy Access Times

To Where       CPU Cycles
Register       <= 1
L1d cache      ~3
L2 cache       ~14
L3 cache       ~30
Main Memory    ~240
Disk           ~7,000,000

Disk Transfer Time
  • Fujitsu MHS2060AT 60GB Laptop Hard Drive
  • 4200 RPM, 420 sectors/track, 512 B/sector
  • 1 track per revolution: 1/4200 min = 1/70 s
  • 1 track/(1/70 s) x 420 sectors/track x 512 B/sector
  • = 15,052,800 B/s, or about 15.05 MB/second transfer rate
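
The arithmetic above can be checked with a few lines of C. This is only a sketch: the function name `disk_bytes_per_second` is made up for illustration, and it assumes exactly one track is read per revolution, as the slide does.

```c
#include <assert.h>

/* Sustained transfer rate in bytes per second, assuming one full track
   passes under the head per revolution (the slide's assumption).
   The function name is hypothetical. */
long disk_bytes_per_second(long rpm, long sectors_per_track, long bytes_per_sector) {
    assert(rpm > 0 && rpm % 60 == 0);      /* keeps the integer arithmetic exact */
    long revolutions_per_second = rpm / 60; /* 4200 RPM -> 70 tracks per second */
    return revolutions_per_second * sectors_per_track * bytes_per_sector;
}
```

Calling `disk_bytes_per_second(4200, 420, 512)` reproduces the slide's figure of 15,052,800 B/s.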
Key to effective cache utilization is program locality
  • Temporal locality refers to the reuse of the same address within relatively small time durations.
  • Spatial locality refers to the use of data within relatively "close" storage locations.
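
A common way to see spatial locality in C is to traverse a two-dimensional array in two different orders. The sketch below is illustrative only; the array size and function names are assumptions, not taken from the slides. C stores arrays row by row, so the first function touches adjacent addresses while the second jumps a full row between accesses.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 512, COLS = 512 };   /* assumed size, chosen only for illustration */

/* Good spatial locality: the inner loop walks consecutive addresses. */
long sum_row_order(int m[ROWS][COLS]) {
    assert(m != NULL);
    long total = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            total += m[i][j];
    return total;
}

/* Poor spatial locality: the inner loop strides COLS ints per access. */
long sum_column_order(int m[ROWS][COLS]) {
    assert(m != NULL);
    long total = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            total += m[i][j];
    return total;
}
```

Both functions return the same sum; only the order of memory accesses differs, which is what the cache reacts to.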
Strict Consistency
  • Reads and writes are seen in the same order by all processors
  • In hardware, it is implemented by atomic instructions

Intel Compare-Exchange Semantics

if (EAX == DEST) {
    ZF = 1;  DEST = SRC;
} else {
    ZF = 0;  EAX = DEST;
}

GCC Optimization Options

-falign-functions=n -falign-jumps=n -falign-labels=n -falign-loops=n -falign-loops-max-skip=n -falign-jumps-max-skip=n

-fbounds-check -fmudflap -fmudflapth -fmudflapir -fbranch-probabilities -fprofile-values -fvpt -fbranch-target-load-optimize

-fbranch-target-load-optimize2 -fbtr-bb-exclusive -fcaller-saves -fcprop-registers -fcreate-profile -fcse-follow-jumps

-fcse-skip-blocks -fcx-limited-range -fdata-sections -fdelayed-branch -fdelete-null-pointer-checks -fearly-inlining

-fexpensive-optimizations -ffast-math -ffloat-store -fforce-addr -ffunction-sections -fgcse -fgcse-lm -fgcse-sm -fgcse-las

-fgcse-after-reload -fcrossjumping -fif-conversion -fif-conversion2 -finline-functions -finline-functions-called-once

-finline-limit=n -fkeep-inline-functions -fkeep-static-consts -flocal-alloc (APPLE ONLY) -fmerge-constants

-fmerge-all-constants -fmodulo-sched -fno-branch-count-reg -fno-default-inline -fno-defer-pop -fmove-loop-invariants

-fno-function-cse -fno-guess-branch-probability -fno-inline -fno-math-errno -fno-peephole -fno-peephole2

-funsafe-math-optimizations -funsafe-loop-optimizations -ffinite-math-only -fno-toplevel-reorder

-fno-trapping-math -fno-zero-initialized-in-bss -mstackrealign -fomit-frame-pointer -foptimize-register-move

-foptimize-sibling-calls -fprefetch-loop-arrays -fprofile-generate -fprofile-use -fregmove -frename-registers -freorder-blocks

-freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -frounding-math -frtl-abstract-sequences

-fschedule-insns -fschedule-insns2 -fno-sched-interblock -fno-sched-spec -fsched-spec-load

-fsched-spec-load-dangerous -fsched-stalled-insns=n -fsched-stalled-insns-dep=n

-fsched2-use-superblocks -fsched2-use-traces -fsee -freschedule-modulo-scheduled-loops

-fsection-anchors -fsignaling-nans -fsingle-precision-constant -fstack-protector -fstack-protector-all -fstrict-aliasing

-fstrict-overflow -ftracer -fthread-jumps -funroll-all-loops -funroll-loops -fpeel-loops -fsplit-ivs-in-unroller -funswitch-loops

-fvariable-expansion-in-unroller -ftree-pre -ftree-ccp -ftree-dce -ftree-loop-optimize

-ftree-loop-linear -ftree-loop-im -ftree-loop-ivcanon -fivopts -ftree-dominator-opts

-ftree-dse -ftree-copyrename -ftree-sink -ftree-ch -ftree-sra -ftree-ter -ftree-lrs -ftree-fre

-ftree-vectorize -ftree-vect-loop-version -ftree-salias -fuse-profile -fipa-pta -fweb -ftree-copy-prop

-ftree-store-ccp -ftree-store-copy-prop -fwhole-program --param name=value

-O -O0 -O1 -O2 -O3 -Os -Oz   (these optimization levels are the most important options)

Code Improvement Options

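Constant folding (evaluate constant expressions at compile time)

x = 23;   (what remains after a constant expression has been folded)

Constant propagation (replace uses of a variable known to hold a constant with that constant)
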

Assign variables to registers in C/C++

register int x,y;

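Operator strength reduction (replace an expensive operation with a cheaper equivalent, for example a multiplication by a power of two with a shift)

Peephole optimization (use architecture-specific instructions)

a += 1;   (can be emitted as a single increment instruction where the architecture has one)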


Compiler option to target different architectures

(386, 486, Pentium, i5)

Aligning data structures on natural boundaries

(unaligned data accesses fault on some architectures and are generally slower on the others)

Common sub-expression optimization

(compute a repeated expression once into a temporary t, then use t)

Inline functions

(treat function definition as a macro and substitute the text at every call)

Invariant code motion out of loops

while (x++ < Y) {
    z += p + n*6;
}

n*6 never changes inside the loop, so it can be computed once before the loop.

Loop fusion (make one loop out of two or more adjacent loops; the OpenMP collapse clause is related but merges nested loops into a single iteration space)

Loop unrolling (reduce the iteration count by a factor of n by replicating the loop body n times)

Loop interchange (change the nesting order of loops, which may enable other optimizations)

Loop blocking or tiling (replace array processing by two loops that divide the iteration space into smaller blocks to minimize cache misses)

Omit frame pointers

(procedure entry-exit code can be simplified when procedure call chain is deterministic)
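
Three of the transformations listed above can be applied by hand to show what the compiler does. The sketch below is illustrative only: all function names are hypothetical, and each pair consists of a "before" and an "after" version that compute the same result.

```c
#include <assert.h>

/* Invariant code motion: p + n*6 does not depend on the loop variable. */
long loop_before(int x, int Y, int p, int n) {
    long z = 0;
    while (x++ < Y) z += p + n * 6;
    return z;
}
long loop_after(int x, int Y, int p, int n) {
    long z = 0;
    const int t = p + n * 6;          /* hoisted out of the loop */
    while (x++ < Y) z += t;
    return z;
}

/* Common sub-expression: a + b is computed once into t, then t is used. */
int cse_before(int a, int b, int c) { return (a + b) * c + (a + b) / 2; }
int cse_after(int a, int b, int c) {
    const int t = a + b;
    return t * c + t / 2;
}

/* Operator strength reduction: multiply by 8 becomes a shift by 3. */
unsigned scale_before(unsigned i) { return i * 8u; }
unsigned scale_after(unsigned i) {
    assert(i <= 0xFFFFFFFFu);          /* unsigned, so the shift is well defined */
    return i << 3;
}
```

An optimizing compiler usually performs these rewrites itself at -O1 and above, so writing them by hand mainly helps when reading generated assembly.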

Max Function

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000

/* Return the index of the largest element of a[0..n-1]. */
int array_int_max(int a[], int n) {
    int i, max=0;
    for (i=1; i<n; i++) if (a[max]<a[i]) max=i;
    return max;
}

int test[N];

int main(int argc, char *argv[]) {
    int i;
    clock_t j;                      /* clock() returns clock_t, not int */
    for (i=0; i<N; i++) test[i]=rand();
    j=clock(); i=array_int_max(test,N);
    printf("clock=%ld index=%d max=%d\n",
           (long)(clock()-j), i, test[i]);
    return 0;
}

OUTPUT
clock=96782 index=1310 max=2147483531


Microsoft Visual Studio Timings for Max Function

Build Options                                                                      Clock() Timing
Debug                                                                              131
Release, Optimization disabled, Favor small code, no whole program optimization    127
Release, Minimize size, Favor small code, no whole program optimization            42
Release, Minimize size, Favor fast code, no whole program optimization             29
Release, Minimize size, Favor fast code, Whole program optimization                29
Release, Maximize speed, Favor fast code, Whole program optimization               31
#pragma omp sections, 2 threads                                                    37

Pipeline Hazards
  • Structural hazard
    • Hardware resource conflicts prevent overlapped execution.
  • Control hazard
    • Occurs when an instruction, such as a branch, changes the instruction pointer register (IP). The choices are to stall after the branch's instruction fetch (IF), to undo instructions fetched from the path that was not taken, or to predict where each branch is going.
  • Data hazard
    • An instruction produces output or an action that is needed by a later instruction's pipeline stage before that result is available.
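
A data hazard is easy to create in a reduction loop, where every addition needs the result of the previous one. The sketch below contrasts a single dependency chain with four independent chains; the function names are hypothetical, and whether the second version is actually faster depends on the processor and on what the compiler already does.

```c
#include <assert.h>
#include <stddef.h>

/* One dependency chain: each add must wait for the previous add. */
long sum_one_chain(const int *a, size_t n) {
    assert(a != NULL);
    long s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Four independent chains: adds to s0..s3 do not depend on each other,
   so a super-scalar pipeline may overlap them. */
long sum_four_chains(const int *a, size_t n) {
    assert(a != NULL);
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];   /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}
```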
Loop Unrolling

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512   /* assumed size (not given on the slide); must be a multiple of 4 */

int A[N][N], B[N][N], C[N][N];

int main(int argc, char *argv[]) {
    int i, j, k;
    clock_t z;

    for (i=0; i<N; i++) for (j=0; j<N; j++) {
        A[i][j]=rand(); B[i][j]=A[i][j]+1; C[i][j]=A[i][j]-1;
    }

    z = clock();   /* start timing the multiply loop */
    for (i=0; i<N; i+=4)   // increment by unrolling factor
        for (j=0; j<N; j++)
            for (k=0; k<N; k++) {
                // 8301 clocks, no unrolling (i++ with only this statement)
                A[i][j] = A[i][j] + B[i][k] * C[k][j];
                // 4281 clocks, 2 statements
                A[i+1][j] = A[i+1][j] + B[i+1][k] * C[k][j];
                // 3251 clocks, 3 statements
                A[i+2][j] = A[i+2][j] + B[i+2][k] * C[k][j];
                // 3063 clocks, 4 statements
                A[i+3][j] = A[i+3][j] + B[i+3][k] * C[k][j];
            }

    printf("clock=%ld\n", (long)(clock()-z));
    return 0;
}

Software Pipelining
  • Consider a loop over statements in which each statement depends on the previous statement.
    • a_i, b_i, c_i
  • Loop unrolling would result in
    • a_i, b_i, c_i, a_(i+1), b_(i+1), c_(i+1)
  • However, the dependencies (data hazards) between b and a and between c and b still exist.
  • Software pipelining changes the loop to contain
    • a_i, a_(i+1), b_i, b_(i+1), c_i, c_(i+1)
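
The reordering above can be written out by hand. In this sketch the three dependent statements a, b and c are arbitrary stand-ins (multiply, add, multiply) and the function names are hypothetical. The second function interleaves two iterations in the order given on the slide, so that each statement is followed by an independent one.

```c
#include <assert.h>
#include <stddef.h>

/* Original order: a_i, b_i, c_i. Each statement waits on the one before it. */
void chain_plain(const int *x, int *y, size_t n) {
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        int a = x[i] * 2;
        int b = a + 1;      /* depends on a */
        int c = b * 3;      /* depends on b */
        y[i] = c;
    }
}

/* Reordered: a_i, a_(i+1), b_i, b_(i+1), c_i, c_(i+1). */
void chain_interleaved(const int *x, int *y, size_t n) {
    assert(x != NULL && y != NULL);
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        int a0 = x[i] * 2;
        int a1 = x[i + 1] * 2;   /* independent of a0 */
        int b0 = a0 + 1;
        int b1 = a1 + 1;
        int c0 = b0 * 3;
        int c1 = b1 * 3;
        y[i] = c0;
        y[i + 1] = c1;
    }
    if (i < n) y[i] = (x[i] * 2 + 1) * 3;   /* odd leftover iteration */
}
```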
Vector Max Function

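/* vInt32, vCopy, vMax_int, vSplat_int and vExtract_int come from a vector
   support header that is not shown on the slides. */
#define N 20000000

int array_int_max(vInt32 a[], int n) {
    int i; vInt32 max, temp, temp1;
    vCopy(max, a[0]);
    for (i=1; i+3<n; i+=4) {   // bound keeps a[i+3] inside the array
        vMax_int(temp,a[i],a[i+1]); vMax_int(temp1,a[i+2],a[i+3]);
        vMax_int(max,temp,max); vMax_int(max,temp1,max);
    }
    for (; i<n; i++) vMax_int(max,a[i],max);   // leftover vectors

    // fold the lanes of max into a single result
    vSplat_int(temp,max,0); vMax_int(max,temp,max);
    vSplat_int(temp,max,1); vMax_int(max,temp,max);
    vSplat_int(temp,max,2); vMax_int(max,temp,max);
    return vExtract_int(max,3);
}

int test[N];

int main(int argc, char *argv[]) {
    int i; clock_t j;
    for (i=0; i<N; i++) test[i]=rand();
    j=clock();
    i=array_int_max((vInt32 *) test, N/4);
    printf("clock=%ld max=%d\n", (long)(clock()-j), i);
    return 0;
}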
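
Because the vector macros used above are not shown, here is a self-contained sketch of the same idea using the GCC/Clang vector extension (`__attribute__((vector_size))`). This is a substitute technique, not the author's macro library; the names `v4si`, `vmax4` and `vector_int_max` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four 32-bit ints in one 128-bit vector (GCC/Clang extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Lane-wise maximum: the comparison yields -1 in lanes where x > y, else 0. */
static v4si vmax4(v4si x, v4si y) {
    v4si mask = x > y;
    return (x & mask) | (y & ~mask);
}

/* Largest value in a[0..n-1], n >= 1. */
int vector_int_max(const int *a, size_t n) {
    assert(a != NULL && n >= 1);
    int best = a[0];
    size_t i = 0;
    if (n >= 4) {
        v4si vmax, v;
        memcpy(&vmax, a, sizeof vmax);          /* unaligned-safe load */
        for (i = 4; i + 4 <= n; i += 4) {
            memcpy(&v, a + i, sizeof v);
            vmax = vmax4(vmax, v);
        }
        for (int lane = 0; lane < 4; lane++)    /* reduce the four lanes */
            if (vmax[lane] > best) best = vmax[lane];
    }
    for (; i < n; i++)                          /* scalar tail */
        if (a[i] > best) best = a[i];
    return best;
}
```

The scalar tail handles array lengths that are not a multiple of four, which is the case the fixed loop bound in the slide code also has to cover.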


OpenCL Execution Model
  • Context
    • Defines the target execution environment for a Program. A Context can include multiple GPUs and a CPU.
  • Kernel
    • A C-like method executed on a streaming processor (also referred to as a processing element).
    • Kernel code only uses registers, no stack and no heap. Kernel code that uses more registers than are available may fail to load or execute inefficiently.
    • No nested kernel calls, no recursion.
    • Kernels are compiled for every device in a context.
  • Kernel Arguments
    • Scalar
    • Vector (128 bits, 4 floats or ints, 2 doubles)
    • Pointer to a 1-D sequence of values, regardless of the shape of the data.
  • Program
    • A collection of kernels. It must be dynamically loaded onto one or more CPUs/GPUs.
OpenCL Vector Addition

const char * sProgramSource =
"__kernel void vectorAdd(               \n"
"    __global const float * a,          \n"
"    __global const float * b,          \n"
"    __global float * c)                \n"
"{                                      \n"
"   // Vector element index             \n"
"   int nIndex = get_global_id(0);      \n"
"   c[nIndex] = a[nIndex] + b[nIndex];  \n"
"}                                      \n";

OpenCL Vector Addition
  • No use of private or local storage
  • References to __global memory are slow
  • Too little computation per processing element (PE)
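
To make the indexing in the vectorAdd kernel concrete, the plain-C sketch below emulates the execution model serially. It does not use the OpenCL API: `emu_get_global_id` and `run_ndrange` are hypothetical stand-ins for `get_global_id(0)` and the runtime's dispatch, and a real device would run the work items in parallel rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the work item currently being "executed" (emulation only). */
static size_t g_global_id;

/* Stand-in for get_global_id(0) in the kernel. */
static size_t emu_get_global_id(void) { return g_global_id; }

/* Same body as the vectorAdd kernel: one addition per work item. */
static void vector_add_kernel(const float *a, const float *b, float *c) {
    size_t nIndex = emu_get_global_id();
    c[nIndex] = a[nIndex] + b[nIndex];
}

/* Serial stand-in for launching global_size work items. */
void run_ndrange(size_t global_size, const float *a, const float *b, float *c) {
    assert(a != NULL && b != NULL && c != NULL);
    for (g_global_id = 0; g_global_id < global_size; g_global_id++)
        vector_add_kernel(a, b, c);
}
```

Seen this way, each work item performs two global reads, one add and one global write, which is the imbalance the three points above describe.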