A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms

Kang Chen and Jeremy Johnson

Department of Mathematics and Computer Science

Drexel University



Motivation and Overview

  • High performance implementation of critical signal processing kernels

  • A self-optimizing parallel package for computing fast signal transforms

    • Prototype transform (WHT)

    • Build on existing sequential package

    • SMP implementation using OpenMP

  • Part of SPIRAL project

    • http://www.ece.cmu.edu/~spiral


Outline

Outline

  • Walsh-Hadamard Transform (WHT)

  • Sequential performance and optimization using dynamic programming

  • A parallel implementation of the WHT

  • Parallel performance and optimization including parallelism in the search


Walsh hadamard transform

Walsh-Hadamard Transform

Fast WHT algorithms are obtained by factoring the WHT matrix
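For reference, the factorization family the package searches is the standard recursive WHT rule (notation as in the Johnson–Püschel paper cited below): for an ordered partition n = n_1 + ... + n_t,

\[
WHT_{2^n} \;=\; \prod_{i=1}^{t} \left( I_{2^{n_1 + \cdots + n_{i-1}}} \otimes WHT_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}} \right),
\qquad
WHT_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.
\]

Each ordered partition of n, applied recursively, yields a different fast algorithm; a partition tree records these choices.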



SPIRAL WHT Package

  • All WHT algorithms have the same arithmetic cost, O(N lg N), but different data access patterns

  • Different factorizations lead to varying amounts of recursion and iteration

  • Transforms of small sizes (2^1 to 2^8) are implemented in straight-line code to reduce overhead

  • The WHT package allows exploration of different algorithms and implementations

  • Optimization/adaptation to architectures is performed by searching for the fastest algorithm

Johnson and Püschel: ICASSP 2000



Dynamic Programming

  • Exhaustive search: searching all possible algorithms

    • Cost is Θ(4^n / n^(3/2)) for binary factorizations

  • Dynamic programming: searching among algorithms generated from previously determined best algorithms (sketched below)

    • Cost is Θ(n^2) for binary factorizations

[Figure: a possibly best algorithm at size 2^13 is assembled from the best algorithm at size 2^9 and the best algorithm at size 2^4.]
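The DP recurrence is easy to sketch. Below is a minimal illustration in C of the Θ(n^2) search over binary splits; time_leaf and time_split are hypothetical stand-ins for the package's timing of a candidate tree, and the real package searches richer tree shapes than plain binary splits.

/* Sketch: DP over binary WHT factorizations. best[k] is the runtime of the
 * best tree found for size 2^k; the best tree for 2^k is assembled only from
 * previously determined best trees for 2^l and 2^(k-l). */
#include <float.h>
#define MAX_N 26

double best[MAX_N + 1];   /* runtime of best tree for WHT of size 2^k */
int    split[MAX_N + 1];  /* left exponent of winning split; 0 = straight-line leaf */

extern double time_leaf(int k);          /* hypothetical: time straight-line code, k <= 8 */
extern double time_split(int l, int r);  /* hypothetical: time best(2^l) combined with best(2^r) */

void dp_search(int n)
{
    for (int k = 1; k <= n; k++) {
        best[k]  = (k <= 8) ? time_leaf(k) : DBL_MAX;
        split[k] = 0;
        for (int l = 1; l < k; l++) {    /* k - 1 candidate binary splits */
            double t = time_split(l, k - l);
            if (t < best[k]) { best[k] = t; split[k] = l; }
        }
    }
}

Total measurements are about n^2/2, matching the Θ(n^2) cost above.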



Performance of WHT Algorithms

  • Iterative algorithms have less overhead

  • Recursive algorithms have better data locality

  • The best WHT algorithms are a compromise between low overhead and a good data flow pattern


Architecture Dependency

The best WHT algorithms also depend on architecture characteristics such as the memory hierarchy, cache structure, and cache miss penalty.

[Figure: best WHT partition trees of size 2^22 on UltraSPARC v9, POWER3 II, and PowerPC RS64 III. Legend: a DDL split node; an IL=1 straight-line WHT_32 node, written 2^5,(1).]


Improved Data Access Patterns

  • The stride tensor makes the WHT access data outside its block, losing locality

  • A large stride introduces more conflict cache misses

[Figure: order of accesses to x0..x7 over time for the stride tensor vs. the union tensor.]


Dynamic Data Layout

DDL uses an in-place pseudo transpose to swap data in a special way, so that a stride tensor is changed into a union tensor.

[Figure: the pseudo transpose reorders x0 x1 x2 x3 / x4 x5 x6 x7 into x0 x4 x2 x6 / x1 x5 x3 x7.]

N. Park and V. K. Prasanna: ICASSP 2001
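For intuition, the underlying reordering is a stride permutation: viewed as an R × S array, element (i, j) moves to (j, i). A minimal out-of-place sketch in C (the package's pseudo transpose works in place and orders elements somewhat differently; see the Park–Prasanna reference above):

/* Transpose an R x S row-major view of x into y (out-of-place sketch). */
void stride_permute(const double *x, double *y, int R, int S)
{
    for (int i = 0; i < R; i++)
        for (int j = 0; j < S; j++)
            y[(long)j * R + i] = x[(long)i * S + j];  /* (i, j) -> (j, i) */
}

After the permutation, sub-transforms that previously read at stride S read contiguous blocks instead.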


Loop Interleaving

IL maximizes the use of cache pre-fetching by interleaving multiple WHT transforms into one transform.

[Figure: order of accesses (1st through 4th) to x0..x7 for WHT_2 ⊗ I_4, WHT_2(IL=1) ⊗ I_{4/2}, and WHT_2(IL=2) ⊗ I_{4/4}.]

Gatlin and Carter: PACT 2000, Implemented by Bo Hong
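As a concrete illustration (a hand-written sketch, not the package's generated code): for WHT_2 ⊗ I_4, IL=1 performs two of the four stride-4 butterflies per loop iteration, so both elements of each fetched cache line are used before it is evicted.

/* WHT_2 (tensor) I_4 with interleaving level 1: butterflies on (x[i], x[i+4])
 * and (x[i+1], x[i+5]) are fused into one loop iteration. */
void wht2_tensor_I4_il1(double *x)
{
    for (int i = 0; i < 4; i += 2) {
        double a0 = x[i],     b0 = x[i + 4];
        double a1 = x[i + 1], b1 = x[i + 5];
        x[i]     = a0 + b0;   x[i + 4] = a0 - b0;
        x[i + 1] = a1 + b1;   x[i + 5] = a1 - b1;
    }
}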


Best WHT Partition Trees

Environment: PowerPC RS64 III/12, 450 MHz, 128/128 KB L1 cache, 8 MB L2 cache, 8 GB RAM, AIX 4.3.3, cc 5.0.5

[Figure: three best partition trees of size 2^16 — the standard best tree, the best tree with DDL (marked with a DDL split node), and the best tree with IL (containing an IL=3 straight-line WHT_32 node, written 2^5,(3)).]


Effect of IL and DDL on Performance

DDL and IL improve performance when the data size is larger than the L1 cache, 128 KB = 2^14 × 8 bytes. IL level 4 reaches the maximal use of a cache line, 128 bytes = 2^4 × 8 bytes.



Parallel WHT Package

  • SMP implementation obtained using OpenMP

  • WHT partition tree is parallelized at the root node

    • Simple to insert OpenMP directives

    • Better performance obtained with manual scheduling

  • DP decides when to use parallelism

  • DP builds the partition tree from the best sequential subtrees

  • DP decides the best parallel root node

    • Parallel split

    • Parallel split with DDL

      • Parallel pseudo-transpose

    • Parallel split with IL



OpenMP Implementation

/* Version 1: OpenMP directives inserted into the sequential iterative WHT.
 * N is the transform size, t the number of factors, and
 * WHT(N(i)) * x(j, k, S, N(i)) denotes applying a WHT of size N(i) to the
 * corresponding subvector of x, as in the sequential package. */

R = N; S = 1;
for (i = 0; i < t; i++) {
    R = R / N(i);
    #pragma omp parallel for
    for (j = 0; j < R; j++)
        for (k = 0; k < S; k++)
            WHT(N(i)) * x(j, k, S, N(i));
    S = S * N(i);
}

/* Version 2: manual scheduling. A single parallel region; the R * S
 * independent sub-transforms of each stage are dealt out cyclically to the
 * threads, with a barrier between stages. get_total_threads() and
 * get_thread_id() stand for omp_get_num_threads() and omp_get_thread_num(). */

#pragma omp parallel
{
    total = get_total_threads();
    id = get_thread_id();
    R = N; S = 1;
    for (i = 0; i < t; i++) {
        R = R / N(i);
        for (m = id; m < R * S; m += total) {
            j = m / S;
            k = m % S;
            WHT(N(i)) * x(j, k, S, N(i));
        }
        S = S * N(i);
        #pragma omp barrier
    }
}


Parallel DDL

In WHT_{RS} = L^{RS}_R (I_S ⊗ WHT_R) L^{RS}_S (I_R ⊗ WHT_S), the pseudo transpose L can be parallelized at different granularities.

[Figure: an R × S array divided among threads 1–4 in three ways — coarse-grained pseudo transpose, fine-grained pseudo transpose, and fine-grained pseudo transpose with ID shift.]



Comparison of Parallel Schemes


Best Tree of Parallel DDL Schemes

[Figure: best partition trees of size 2^26 under coarse-grained DDL, fine-grained DDL, and fine-grained with ID shift DDL. Legend: a parallel DDL split node; a DDL split node.]


Normalized Runtime of PowerPC RS64

The three plateaus in the figure are due to the L1 and L2 caches. A good binary partition of a large parallel tree node tends to be built from subtree nodes within the first plateau.

[Figure: normalized runtime on PowerPC RS64 III.]



Overall Parallel Speedup



Parallel Performance

A. PowerPC RS64 III

B. POWER3 II

C. UltraSPARC v8plus

Data size is 2^25 for Table A, and 2^23 for Tables B and C.



Conclusion and Future Work

  • Parallel WHT package provides efficient parallel performance across multiple SMP platforms using OpenMP

    • Self-adapts to different architectures using search

    • Must take into account data access pattern

    • Parallel implementation should not constrain search

    • Package is available for download at SPIRAL website http://www.ece.cmu.edu/~spiral

  • Working on a distributed memory version using MPI



Effect of Scheduling Strategy


Parallel Split Node with IL and DDL

Parallel IL utilizes pre-fetched data on the same cache line and eliminates data contention among threads, so it has better parallel efficiency on some architectures.


Modified Scheduling

Choice in scheduling WHT tasks for (WHT_R ⊗ I_S) and (I_R ⊗ WHT_S):

  • small granularity: tasks of size R or S

  • large granularity: tasks of size R × S / (number of threads)

[Figure: tasks assigned to threads 1–4 under each granularity.]
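A sketch of the two granularities for the (I_R ⊗ WHT_S) factor, whose R independent size-S sub-transforms act on contiguous blocks (wht_apply is a hypothetical stand-in for the package's codelet call):

#include <omp.h>

extern void wht_apply(double *x, int S);  /* hypothetical: in-place WHT of size S */

/* Small granularity: units of one size-S task, dealt out cyclically. */
void schedule_small(double *x, int R, int S)
{
    #pragma omp parallel for schedule(static, 1)
    for (int j = 0; j < R; j++)
        wht_apply(x + (long)j * S, S);
}

/* Large granularity: one contiguous chunk of about R*S/threads points each. */
void schedule_large(double *x, int R, int S)
{
    #pragma omp parallel
    {
        int total = omp_get_num_threads();
        int id = omp_get_thread_num();
        int chunk = (R + total - 1) / total;
        int hi = (id + 1) * chunk < R ? (id + 1) * chunk : R;
        for (int j = id * chunk; j < hi; j++)
            wht_apply(x + (long)j * S, S);
    }
}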

