


Algorithmic Skeletons for Stream Programming in Embedded Heterogeneous Parallel Image Processing Applications

IPDPS 2006

Wouter Caarls, Pieter Jonker, Henk Corporaal

Quantitative Imaging Group, Department of Imaging Science and Technology


Overview

  • Stream programming

  • Writing stream kernels

  • Algorithmic skeletons

  • Writing algorithmic skeletons

  • Skeleton merging

  • Results

  • Conclusion & Future work


Stream programming

  • FIFO-connected kernels processing series of data elements

    • Well suited to signal processing applications

  • Explicit communication and task decomposition

    • Ideal for distributed-memory systems

  • Each data element processed (mostly) independently

    • Ideal for data-parallel systems such as SIMDs
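The bullets above can be illustrated with a minimal C sketch of two kernels connected by a bounded FIFO. All names (`Fifo`, `scale_kernel`, `threshold_kernel`, `run_pipeline`) and the buffer depth are assumptions made for illustration; they are not part of the presented system. Each element is processed independently, and the only communication between the kernels is the explicit FIFO.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAPACITY 8  /* hypothetical buffer depth */

/* A bounded FIFO connecting a producer kernel to a consumer kernel. */
typedef struct {
    int data[FIFO_CAPACITY];
    size_t head, tail, count;
} Fifo;

/* Returns 1 on success, 0 if the FIFO is full (producer must wait). */
static int fifo_push(Fifo *f, int v)
{
    if (f->count == FIFO_CAPACITY) return 0;
    f->data[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAPACITY;
    f->count++;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty (consumer must wait). */
static int fifo_pop(Fifo *f, int *v)
{
    if (f->count == 0) return 0;
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return 1;
}

/* Kernel 1: scales each element independently. */
static int scale_kernel(int v) { return 2 * v; }

/* Kernel 2: thresholds each element independently. */
static int threshold_kernel(int v) { return v > 10; }

/* Runs the two-kernel pipeline over n input elements. */
void run_pipeline(const int *in, int *out, size_t n)
{
    Fifo link = {0};
    size_t produced = 0, consumed = 0;
    assert(in != NULL && out != NULL);
    while (consumed < n) {
        /* Producer side: fill the FIFO while there is input and room. */
        while (produced < n && fifo_push(&link, scale_kernel(in[produced])))
            produced++;
        /* Consumer side: drain whatever is available. */
        int v;
        while (fifo_pop(&link, &v))
            out[consumed++] = threshold_kernel(v);
    }
}
```

In this sketch both kernels run in one thread and simply alternate; on a distributed-memory system each kernel would run on its own processor, with the FIFO as the only shared state.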


Kernel examples from image processing

In order of increasing generality and architectural requirements:

  • Pixel processing (color space conversion)

    • Perfect match

  • Local neighborhood processing (convolution)

    • Requires 2D access

  • Recursive neighborhood processing (distance transform)

    • Regular data dependencies

  • Stack processing (region growing)

    • Irregular data dependencies


Writing kernels

  • The language for writing kernels should be restricted

    • To allow efficient compilation to constrained architectures

  • But also general

    • So many different algorithms can be specified

  • Solution: a different language for each type of kernel

    • User selects the most restricted language that supports his kernel

      • Retargetability

      • Efficiency

      • Ease-of-use


Algorithmic skeletons* as kernel languages

  • An algorithmic skeleton captures a pattern of computation

  • Is conceptually a higher-order function that repeatedly calls a kernel function with certain parameters

    • Iteration strategy may be parallel

    • Kernel parameters restrict dependencies

  • Provides the environment in which the kernel runs, and can be seen as a very restricted DSL

*M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation, 1989
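The "higher-order function" view can be sketched in plain C with a function pointer. This only illustrates the concept: the presentation implements skeletons by transforming kernel source code, not by calling through a pointer at run time. The names `NeighborhoodKernel`, `neighborhood_to_pixel_op` and `average_kernel`, and the tiny image size, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define WIDTH  4   /* hypothetical image size, for illustration only */
#define HEIGHT 4

/* Kernel signature: reads the neighborhood centered on (y, x) and
 * returns one output pixel. */
typedef float (*NeighborhoodKernel)(float in[HEIGHT][WIDTH], int y, int x);

/* Skeleton: owns the iteration strategy and the border handling.
 * Here the iteration is sequential and the one-pixel border is skipped. */
void neighborhood_to_pixel_op(NeighborhoodKernel kernel,
                              float in[HEIGHT][WIDTH],
                              float out[HEIGHT][WIDTH])
{
    assert(kernel != NULL);
    for (int y = 1; y < HEIGHT - 1; y++)
        for (int x = 1; x < WIDTH - 1; x++)
            out[y][x] = kernel(in, y, x);
}

/* Kernel: 3x3 average. It only sees relative offsets -1..1, which is
 * the dependency restriction the skeleton relies on. */
float average_kernel(float in[HEIGHT][WIDTH], int y, int x)
{
    float acc = 0;
    for (int ky = -1; ky <= 1; ky++)
        for (int kx = -1; kx <= 1; kx++)
            acc += in[y + ky][x + kx];
    return acc / 9;
}
```

Because the kernel cannot reach outside its neighborhood, the skeleton is free to change the iteration order or run iterations in parallel without changing the result.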


Sequential neighborhood skeleton

Kernel definition, written against the NeighborhoodToPixelOp() skeleton:

    NeighborhoodToPixelOp()
    Average(in stream float i[-1..1][-1..1],
            out stream float *o)
    {
      int ky, kx;
      float acc=0;
      for (ky=-1; ky <=1; ky++)
        for (kx=-1; kx <=1; kx++)
          acc += i[ky][kx];
      *o = acc/9;
    }

Resulting operation, after the skeleton has been applied to the kernel:

    void Average(float **i, float **o)
    {
      for (int y=1; y < HEIGHT-1; y++)
        for (int x=1; x < WIDTH-1; x++)
        {
          float acc=0;
          acc += i[y-1][x-1];
          acc += i[y-1][x  ];
          acc += i[y-1][x+1];
          acc += i[y  ][x-1];
          acc += i[y  ][x  ];
          acc += i[y  ][x+1];
          acc += i[y+1][x-1];
          acc += i[y+1][x  ];
          acc += i[y+1][x+1];
          o[y][x] = acc/9;
        }
    }


Skeleton tasks

  • Implement structure

    • Outer loop, border handling, buffering, parallel implementation

    • Just write C code

  • Transform kernel

    • Stream access, translation to target language

    • Term rewriting

  • How to combine in a single language?

    • Partial evaluation


Term rewriting (1)

Input

*o = acc/9;

Rewrite Rule (applied topdown to all nodes)

replace(`o`, `&o[y][x]`);

Output

o[y][x] = acc/9;


Term rewriting (2): using Stratego*

Input

acc += i[ky][kx];

Rewrite Rule (applied topdown to all nodes)

RelativeToAbsolute:

|[ i[~e1][~e2] ]| ->

|[ i[y + ~e1][x + ~e2] ]|

Output

acc += i[y+ky][x+kx];

*E. Visser. Stratego: A language for program transformation based on rewriting strategies, 2001
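The RelativeToAbsolute rule can be mimicked in C on a toy expression tree. This is a sketch under stated assumptions, not how Stratego or PEPCI works internally: the `Node` type, the fixed node pool, and the helpers `id`, `add`, `idx`, `topdown` and `print_expr` are all hypothetical. The traversal tries the rule at each node and then descends into the children, in the spirit of a top-down strategy that leaves non-matching nodes unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { ID, ADD, INDEX } Kind;

typedef struct Node {
    Kind kind;
    const char *name;        /* used by ID only */
    struct Node *lhs, *rhs;  /* ADD: operands; INDEX: array, subscript */
} Node;

/* A fixed pool keeps the sketch free of malloc/free bookkeeping. */
static Node pool[64];
static int pool_used = 0;

static Node *mk(Kind kind, const char *name, Node *lhs, Node *rhs)
{
    assert(pool_used < 64);
    Node *n = &pool[pool_used++];
    n->kind = kind; n->name = name; n->lhs = lhs; n->rhs = rhs;
    return n;
}

Node *id(const char *name)      { return mk(ID, name, NULL, NULL); }
Node *add(Node *a, Node *b)     { return mk(ADD, NULL, a, b); }
Node *idx(Node *arr, Node *sub) { return mk(INDEX, NULL, arr, sub); }

/* Rule: stream[e1][e2] -> stream[y + e1][x + e2].
 * Returns the node unchanged when the pattern does not match. */
static Node *relative_to_absolute(Node *n, const char *stream)
{
    if (n->kind == INDEX && n->lhs->kind == INDEX &&
        n->lhs->lhs->kind == ID && strcmp(n->lhs->lhs->name, stream) == 0)
        return idx(idx(n->lhs->lhs, add(id("y"), n->lhs->rhs)),
                   add(id("x"), n->rhs));
    return n;
}

/* Top-down traversal: try the rule here, then visit the children. */
Node *topdown(Node *n, const char *stream)
{
    n = relative_to_absolute(n, stream);
    if (n->kind != ID) {
        n->lhs = topdown(n->lhs, stream);
        n->rhs = topdown(n->rhs, stream);
    }
    return n;
}

/* Appends the C text of an expression to buf (assumed large enough). */
void print_expr(const Node *n, char *buf)
{
    switch (n->kind) {
    case ID:
        strcat(buf, n->name);
        break;
    case ADD:
        print_expr(n->lhs, buf); strcat(buf, " + "); print_expr(n->rhs, buf);
        break;
    case INDEX:
        print_expr(n->lhs, buf); strcat(buf, "[");
        print_expr(n->rhs, buf); strcat(buf, "]");
        break;
    }
}
```

Applied to the tree for `i[ky][kx]`, the traversal yields the text `i[y + ky][x + kx]`. The rewritten node is not matched a second time, because its inner index now has a plain identifier rather than another index as its array operand.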


PEPCI (1): rule composition and code generation in C

Rule definition:

    stratego RelativeToAbsolute(code i, code body)
    {
      main = <topdown(RelativeToAbsolute')>(body)

      RelativeToAbsolute':
        |[ ~i[~e1][~e2] ]| ->
        |[ ~i[y + ~e1][x + ~e2] ]|
    }

Rule composition:

    for (a=0; a < arguments; a++)
      if (args[a].type == ARG_STREAM_IN)
        body = RelativeToAbsolute(args[a].id, body);
      else if (args[a].type == ARG_STREAM_OUT)
        body = DerefToArrayIndex(args[a].id, body);

Code generation:

    for (y=1; y < HEIGHT-1; y++)
      for (x=1; x < WIDTH-1; x++)
        @body;


PEPCI (2): combining rule composition and code generation

  • How to distinguish rule composition from code generation?

    for (a=0; a < arguments; a++)
      body = DerefToArrayIndex(args[a].id, body);

    for (x=0; x < stride; x++)
      @body;

  • Partial evaluation: evaluate only the parts of the program that are known. Output the rest

    • arguments is known, DerefToArrayIndex is known, args[a].id is known, body is known -> evaluate

    • stride is unknown -> output


PEPCI (3): partial evaluation by interpretation

Input

    double n, x=1;
    int ii, iterations=3;

    scanf("%lf", &n);
    for (ii=0; ii < iterations; ii++)
      x = (x + n/x)/2;
    printf("sqrt(%f) = %f\n", n, x);

Output

    double n;
    double x;
    int ii;
    int iterations;
    x = 1;
    iterations = 3;
    scanf("%lf", &n);
    ii = 0;
    x = (1 + n/1)/2;
    ii = 1;
    x = (x + n/x)/2;
    ii = 2;
    x = (x + n/x)/2;
    ii = 3;
    printf("sqrt(%f) = %f\n", n, x);

Symbol table (successive states during interpretation; ? marks an unknown value)

    double n   double x   int ii   int iterations
    ?          1
    ?          1          ?        3
    ?          1          0        3
    ?          ?          0        3
    ?          ?          1        3
    ?          ?          2        3
    ?          ?          3        3
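The input and output programs of this slide can be compared directly in C. In the sketch below, `sqrt_original` follows the input program and `sqrt_residual` follows the emitted output; both function names are assumptions, and `scanf`/`printf` are replaced by a parameter and a return value so the two versions can be checked against each other.

```c
#include <assert.h>

/* Original program logic: loop control and arithmetic all happen at
 * run time. */
double sqrt_original(double n)
{
    double x = 1;
    int iterations = 3;
    for (int ii = 0; ii < iterations; ii++)
        x = (x + n / x) / 2;
    return x;
}

/* Residual program: iterations and ii were known to the partial
 * evaluator, so the loop was evaluated away. Only the statements that
 * depend on the unknown n remain. */
double sqrt_residual(double n)
{
    double x;
    x = (1 + n / 1) / 2;  /* x was still known (1), so its value was substituted */
    x = (x + n / x) / 2;  /* x now depends on n, so the statement is emitted as is */
    x = (x + n / x) / 2;
    return x;
}
```

Both functions perform the same three Newton steps, so they agree for any positive `n`; the residual version simply no longer carries the loop counter or the comparison against `iterations`.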


Kernelization overheads

  • Kernelizing an application impacts performance

    • Mapping

    • Scheduling

    • Buffer management

    • Lost ILP

  • Merge kernels

    • Extract static kernel sequences

    • Statically schedule at compile-time

    • Replace sequence with merged kernel
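A minimal C sketch of what merging a static sequence of two pixel kernels buys. The kernels `scale` and `binarize` and the drivers `run_separate` and `run_merged` are hypothetical; the presented system performs the merge at the skeleton level rather than by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Two pixel kernels that always run back to back. */
static int scale(int v)                   { return 3 * v; }
static int binarize(int v, int threshold) { return v > threshold; }

/* Unmerged: each kernel has its own loop, and the intermediate stream
 * must be buffered in tmp between the two. */
void run_separate(const int *in, int *tmp, int *out, size_t n, int threshold)
{
    assert(in != NULL && tmp != NULL && out != NULL);
    for (size_t k = 0; k < n; k++)
        tmp[k] = scale(in[k]);
    for (size_t k = 0; k < n; k++)
        out[k] = binarize(tmp[k], threshold);
}

/* Merged: one loop, no intermediate buffer, and both kernel bodies are
 * visible to the compiler's scheduler in the same iteration. */
void run_merged(const int *in, int *out, size_t n, int threshold)
{
    assert(in != NULL && out != NULL);
    for (size_t k = 0; k < n; k++)
        out[k] = binarize(scale(in[k]), threshold);
}
```

The merged version removes the buffer and the second loop, and gives an instruction scheduler more independent work per iteration, which is where the lost ILP can be partly regained.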


Skeleton merging

  • Skeletons are completely general functions

    • Cannot be properly analyzed or reasoned about

  • Restrict skeleton generality by using metaskeletons

    • Skeletons using the same metaskeleton can be merged

    • Merged operation still uses the original metaskeleton, and can be recursively merged


Example

  • Philips Inca+ smart camera

    • 640x480 sensor

    • XeTaL 16MHz, 320-way SIMD

    • TriMedia 180MHz, 5-issue VLIW

  • Ball detection

    • Filtering, Segmentation, Hough transform


Results

[Chart not reproduced; its annotations read "Buffers, Scheduling, ILP" and "ILP not fully recovered".]

Conclusion

  • Stream programming is a natural fit for running image processing applications on distributed-memory systems

  • Algorithmic skeletons efficiently exploit data parallelism by letting the user select the most restricted skeleton that supports the kernel

    • Extensible (new skeletons)

    • Retargetable (new skeleton implementations)

  • PEPCI combines the two ingredients needed to implement algorithmic skeletons efficiently

    • Term rewriting (by embedding Stratego)

    • Partial evaluation (to automatically separate rule composition and code generation)


Future work

  • Better merging of kernels

    • Merge more efficiently

    • Merge different metaskeletons

  • Implement on a more general architecture

  • Implement more demanding applications

    • And more involved skeletons


End


Partial evaluation (2): free optimizations

  • Loop unrolling

    • If the conditions are known, and the body isn’t

  • Function inlining

  • Aggressive constant folding

    • Including external “pure” functions


Kernel translation

  • SIMD processors are not programmed in C, but in parallel derivatives

  • Skeleton should translate kernel to target language

  • Extend PEPCI with C derivative syntax

    • Though only minimally interpreted


Example: local neighborhood operation in XTC

Kernel definition:

    NeighbourhoodToPixelOp()
    sobelx(in stream unsigned char i[-1..1][-1..1],
           out stream int *o)
    {
      int x, y, temp;

      temp = 0;
      for (y=-1; y < 2; y++)
        for (x=-1; x < 2; x=x+2)
          temp = temp + x*i[y][x];
      *o = temp;
    }

Resulting XTC code:

    static lmem _in2;
    static lmem _in1;

    {
      lmem temp;
      temp = (0)+((-1)*(_in2[-1 .. 0]));
      temp = (temp)+((1)*(_in2[1 .. 2]));
      temp = (temp)+((-1)*(_in1[-1 .. 0]));
      temp = (temp)+((1)*(_in1[1 .. 2]));
      temp = (temp)+((-1)*(larg0[-1 .. 0]));
      temp = (temp)+((1)*(larg0[1 .. 2]));
      larg1 = temp;
    }

    _in2 = _in1;
    _in1 = larg0;
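The line buffering visible in the XTC code (`_in2 = _in1; _in1 = larg0;`) can be sketched in sequential C. This is an illustration under assumptions: `LineBuffers`, `sobelx_line` and the line width are hypothetical, and the XTC range expressions such as `_in2[-1 .. 0]` are read here as "left neighbor" and `_in2[1 .. 2]` as "right neighbor", which the slide does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WIDTH 6  /* hypothetical line width */

/* The two most recent input lines, kept between calls. */
typedef struct {
    unsigned char in1[WIDTH];  /* previous line */
    unsigned char in2[WIDTH];  /* line before that */
} LineBuffers;

/* Consumes one input line and produces the horizontal Sobel-like
 * response of the kernel on the slide: for each of the three lines in
 * the window, right neighbor minus left neighbor. The result belongs
 * to the previous line, which is the center row of the window.
 * Border columns are set to 0. */
void sobelx_line(LineBuffers *lb, const unsigned char *cur, int *out)
{
    assert(lb != NULL && cur != NULL && out != NULL);
    out[0] = 0;
    out[WIDTH - 1] = 0;
    for (int x = 1; x < WIDTH - 1; x++) {
        int temp = 0;
        temp += lb->in2[x + 1] - lb->in2[x - 1];
        temp += lb->in1[x + 1] - lb->in1[x - 1];
        temp += cur[x + 1] - cur[x - 1];
        out[x] = temp;
    }
    /* Shift the line buffers, mirroring _in2 = _in1; _in1 = larg0; */
    memcpy(lb->in2, lb->in1, WIDTH);
    memcpy(lb->in1, cur, WIDTH);
}
```

On a SIMD array the inner loop over `x` disappears, since every processing element handles one column; the C loop stands in for that parallel step.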


Stream program

    int main(int argc, char **argv)
    {
      STREAM a, b, c;
      int maxval, dummy, maxc;

      scInit(argc, argv);

      while (1) {
        capture(&a);
        interpolate(&a, &a);
        sobelx(&a, &b);
        sobely(&a, &c);
        magnitude(&b, &c, &a);
        direction(&b, &c, &b);
        mask(&b, &a, &a, scint(128));
        hough(&a, &a);
        display(&a);
        imgMax(&a, scint(0), &maxval, scint(0), &dummy, scint(0),
               &maxc);
        _block(&maxc, &maxval);
        printf("Ball found at %d with strength %d\n", maxc, maxval);
      }

      return scExit();
    }


Programming with algorithmic skeletons (1)

    PixelToPixelOp()
    binarize(in stream int *i, out stream int *o, in int *threshold)
    {
      *o = (*i > *threshold);
    }

    NeighbourhoodToPixelOp()
    average(in stream int i[-1..1][-1..1], out stream int *o)
    {
      int x, y;

      *o = 0;
      for (y=-1; y < 2; y++)
        for (x=-1; x < 2; x++)
          *o += i[y][x];
      *o /= 9;
    }


Programming with algorithmic skeletons (2)

    StackOp(in stream int *init)
    propagate(in stream int *i[-1..1][-1..1], out stream int *o)
    {
      int x, y;

      for (y=-1; y < 2; y++)
        for (x=-1; x < 2; x++)
          if (i[y][x] && !*o)
          {
            *o = 1;
            push(y, x);
          }
    }
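For readers unfamiliar with stack-based region growing, here is a conventional sequential version in C. It is not a translation of the `propagate` kernel, whose exact semantics depend on the StackOp skeleton; the function `region_grow`, the image size and the 8-connectivity are assumptions made for illustration. It shows why the data dependencies are irregular: which pixel is visited next depends on the image content.

```c
#include <assert.h>

#define W 5  /* hypothetical image size */
#define H 4

/* Grows a region from the seed (sy, sx) over the nonzero pixels of in,
 * using 8-connectivity, and marks every reached pixel in out. The
 * caller must zero-initialize out. Returns the number of pixels in the
 * region. */
int region_grow(int in[H][W], int out[H][W], int sy, int sx)
{
    int stack_y[H * W], stack_x[H * W];  /* each pixel is pushed at most once */
    int top = 0, count = 0;

    assert(sy >= 0 && sy < H && sx >= 0 && sx < W);
    if (!in[sy][sx] || out[sy][sx])
        return 0;
    out[sy][sx] = 1;
    stack_y[top] = sy; stack_x[top] = sx; top++;

    while (top > 0) {
        top--;
        int cy = stack_y[top], cx = stack_x[top];
        count++;
        for (int y = -1; y <= 1; y++)
            for (int x = -1; x <= 1; x++) {
                int ny = cy + y, nx = cx + x;
                if (ny < 0 || ny >= H || nx < 0 || nx >= W)
                    continue;
                if (in[ny][nx] && !out[ny][nx]) {
                    out[ny][nx] = 1;  /* mark before pushing, so it is pushed only once */
                    stack_y[top] = ny; stack_x[top] = nx; top++;
                }
            }
    }
    return count;
}
```

In the skeleton version, the stack, the bounds checks and the loop over popped pixels would belong to the skeleton, leaving only the neighborhood test and the `push` call to the kernel.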

    AssocPixelReductionOp()
    max(in stream int *i, out int *res)
    {
      if (*i > *res)
        *res = *i;
    }
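The associativity requirement in the name AssocPixelReductionOp is what allows a skeleton to split the work. The C sketch below makes that concrete; `assoc_pixel_reduction`, its chunking scheme and the use of `INT_MIN` as the starting value are assumptions, not the presented implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Kernel: folds one pixel into a running result, like max on the slide. */
static void max_kernel(const int *i, int *res)
{
    if (*i > *res)
        *res = *i;
}

/* Reduction skeleton sketch: splits the pixels into `parts` chunks,
 * reduces each chunk on its own (these loops could run on different
 * processing elements), then combines the partial results with the
 * same kernel. This is only valid because max is associative. */
int assoc_pixel_reduction(const int *pixels, size_t n, size_t parts)
{
    int result = INT_MIN;
    assert(pixels != NULL && parts > 0);
    size_t chunk = (n + parts - 1) / parts;  /* chunk size, rounded up */
    for (size_t p = 0; p < parts; p++) {
        int partial = INT_MIN;
        size_t begin = p * chunk;
        size_t end = begin + chunk < n ? begin + chunk : n;
        for (size_t k = begin; k < end; k++)
            max_kernel(&pixels[k], &partial);
        max_kernel(&partial, &result);  /* combine step reuses the kernel */
    }
    return result;
}
```

The result is the same for any number of parts, which is exactly the freedom a parallel implementation of the skeleton needs.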


Algorithmic skeletons

[Diagram not reproduced; its surviving labels are the operators "<=t", ">t", "+" and "=", each appearing three times.]


Term rewriting (1): from code to abstract syntax tree

Code:

    acc += i[ky][kx];

Abstract syntax tree:

    Stat
      AssignPlus
        Id "acc"
        ArrayIndex
          ArrayIndex
            Id "i"
            Id "ky"
          Id "kx"

As a term:

    Stat(AssignPlus(Id("acc"),ArrayIndex(ArrayIndex(Id("i"),Id("ky")),Id("kx"))))