A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms

Kang Chen and Jeremy Johnson

Department of Mathematics and Computer Science

Drexel University

- High performance implementation of critical signal processing kernels
- A self-optimizing parallel package for computing fast signal transforms
- Prototype transform (WHT)
- Build on existing sequential package
- SMP implementation using OpenMP

- Part of SPIRAL project
- http://www.ece.cmu.edu/~spiral

- Walsh-Hadamard Transform (WHT)
- Sequential performance and optimization using dynamic programming
- A parallel implementation of the WHT
- Parallel performance and optimization including parallelism in the search

Fast WHT algorithms are obtained by factoring the WHT matrix
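The factorization itself did not survive in the text. As a hedged reconstruction following the cited Johnson and Püschel paper (the index convention here is an assumption), for any composition $n = n_1 + \cdots + n_t$ it can be written as:

$$
\mathrm{WHT}_{2^n} \;=\; \prod_{i=1}^{t} \left( I_{2^{\,n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{\,n_{i+1} + \cdots + n_t}} \right),
\qquad
\mathrm{WHT}_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
$$

Each choice of $t$ and $n_1, \ldots, n_t$, applied recursively to the factors $\mathrm{WHT}_{2^{n_i}}$, gives a different algorithm.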

- All WHT algorithms have the same arithmetic cost, $O(N \lg N)$, but different data access patterns
- Different factorizations lead to varying amounts of recursion and iteration
- Transforms of small sizes ($2^1$ to $2^8$) are implemented in straight-line code to reduce overhead
- The WHT package allows exploration of different algorithms and implementations
- Optimization/adaptation to architectures is performed by searching for the fastest algorithm
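A minimal sketch of how one factorization turns into loops, assuming an iterative (one-level) split with straight-line leaves of size 2 and 4 only. The names `wht2`, `wht4`, and `wht_apply` are hypothetical and are not the package's API; the real package also supports recursion and leaves up to $2^8$.

```c
#include <assert.h>
#include <stddef.h>

/* Straight-line WHT of size 2 on x[0] and x[stride]. */
void wht2(double *x, size_t stride)
{
    double a = x[0], b = x[stride];
    x[0] = a + b;
    x[stride] = a - b;
}

/* Straight-line WHT of size 4 on x[0], x[stride], x[2*stride], x[3*stride]. */
void wht4(double *x, size_t stride)
{
    double a = x[0] + x[stride], b = x[0] - x[stride];
    double c = x[2 * stride] + x[3 * stride];
    double d = x[2 * stride] - x[3 * stride];
    x[0] = a + c;
    x[stride] = b + d;
    x[2 * stride] = a - c;
    x[3 * stride] = b - d;
}

/* In-place WHT of size 2^n for the split n = parts[0] + ... + parts[t-1],
   each part being 1 or 2. Stage i applies I_R (x) WHT_{2^parts[i]} (x) I_S.
   Returns 0 on success, -1 if the split is invalid. */
int wht_apply(double *x, int n, const int *parts, int t)
{
    int sum = 0;
    assert(x != NULL && parts != NULL);
    for (int i = 0; i < t; i++) {
        if (parts[i] != 1 && parts[i] != 2) return -1;
        sum += parts[i];
    }
    if (sum != n) return -1;

    size_t R = (size_t)1 << n, S = 1;
    for (int i = 0; i < t; i++) {
        size_t Ni = (size_t)1 << parts[i];
        R /= Ni;
        for (size_t j = 0; j < R; j++) {
            for (size_t k = 0; k < S; k++) {
                double *sub = x + j * Ni * S + k;   /* stride-S subvector */
                if (parts[i] == 1) wht2(sub, S);
                else wht4(sub, S);
            }
        }
        S *= Ni;
    }
    return 0;
}
```

Calling `wht_apply` on the same input with the splits `{1, 1, 1}` and `{2, 1}` gives the same transform with the same operation count; only the order of memory accesses differs.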

Johnson and Püschel: ICASSP 2000

[Figure: WHT partition trees with nodes of size $2^{13}$, $2^9$, $2^5$, and $2^4$]

- Exhaustive search: searching all possible algorithms
- Cost is $\Theta(4^n / n^{3/2})$ for binary factorizations

- Dynamic programming: searching among algorithms generated from previously determined best algorithms
- Cost is $\Theta(n^2)$ for binary factorizations
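The two search costs can be illustrated with a small sketch. It assumes binary splits only, and it replaces the measured run time used by a real search with a toy arithmetic-count model (`arith_leaf_cost`), so all names below are hypothetical and the resulting "best" trees carry no performance meaning.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOG 32   /* largest supported n (transform size 2^n) */

/* Exhaustive search space: number of fully expanded binary partition
   trees of size 2^n, T(1) = 1, T(n) = sum_k T(k) * T(n - k). */
unsigned long long count_binary_trees(int n)
{
    unsigned long long t[MAX_LOG + 1] = {0};
    assert(n >= 1 && n <= MAX_LOG);
    t[1] = 1;
    for (int m = 2; m <= n; m++)
        for (int k = 1; k < m; k++)
            t[m] += t[k] * t[m - k];
    return t[n];
}

/* Toy cost model (an assumption, not a timer): arithmetic operations of a
   straight-line leaf of size 2^m, available only for m <= 8. */
double arith_leaf_cost(int m)
{
    return (m <= 8) ? (double)m * (double)(1ULL << m) : 1e300;
}

/* Dynamic programming over sizes 2^1 .. 2^n. best[m] is the lowest cost
   found for size 2^m; split[m] is the chosen k (0 means a leaf).
   Both arrays need n + 1 entries. Returns the number of splits evaluated. */
int dp_search(int n, double (*leaf_cost)(int), double best[], int split[])
{
    int evaluated = 0;
    assert(n >= 1 && n <= MAX_LOG && best != NULL && split != NULL);
    for (int m = 1; m <= n; m++) {
        best[m] = leaf_cost(m);
        split[m] = 0;
        for (int k = 1; k < m; k++) {
            /* WHT(2^m) = (WHT(2^k) (x) I)(I (x) WHT(2^(m-k))):
               2^(m-k) transforms of size 2^k, 2^k of size 2^(m-k). */
            double c = best[k] * (double)(1ULL << (m - k))
                     + best[m - k] * (double)(1ULL << k);
            evaluated++;
            if (c < best[m]) { best[m] = c; split[m] = k; }
        }
    }
    return evaluated;
}
```

For $n = 13$ the exhaustive space already holds 208012 fully expanded binary trees, while the dynamic-programming loop evaluates $n(n-1)/2 = 78$ splits. Because dynamic programming only combines the best subtrees already found, its result is not guaranteed to be the overall best.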

[Figure labels: possibly best algorithm at size $2^{13}$; best algorithm at size $2^9$; best algorithm at size $2^4$]

Performance of WHT Algorithms

- Iterative algorithms have less overhead
- Recursive algorithms have better data locality
- The best WHT algorithms are a compromise between low overhead and a good data flow pattern

[Figure: best WHT partition trees for size $2^{22}$. Legend: a DDL split node; $2^5,(1)$ denotes an IL=1 straight-line WHT32 node.]

The best WHT algorithms also depend on architecture characteristics such as the memory hierarchy, cache structure, and cache miss penalty.

[Figure panels: UltraSPARC v9; POWER3 II; PowerPC RS64 III]

[Figure: data access over time on $x_0, \ldots, x_7$ for a stride tensor and a union tensor]

- A stride tensor makes the WHT access data outside the current block, which loses locality
- A large stride introduces more conflict cache misses

[Figure: pseudo transpose of $x_0, \ldots, x_7$. The array with rows ($x_0\ x_1\ x_2\ x_3$) and ($x_4\ x_5\ x_6\ x_7$) becomes the array with rows ($x_0\ x_4\ x_2\ x_6$) and ($x_1\ x_5\ x_3\ x_7$).]

DDL (dynamic data layout) uses an in-place pseudo transpose to reorder the data so that a stride tensor becomes a union tensor.

N. Park and V. K. Prasanna: ICASSP 2001
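A minimal sketch of the idea, inferred from the 8-element example on the slide: the data is viewed as an $R \times S$ row-major array and each square $R \times R$ block is transposed in place. The restriction that $R$ divides $S$, the reading of the stride tensor as $\mathrm{WHT}_2 \otimes I_S$ and the union tensor as $I_S \otimes \mathrm{WHT}_2$, and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* In-place pseudo transpose of an R x S row-major array, R dividing S:
   the array is cut into S/R square R x R blocks and each block is
   transposed. Applying it twice restores the original order. */
int pseudo_transpose(double *x, size_t R, size_t S)
{
    assert(x != NULL);
    if (R == 0 || S % R != 0) return -1;
    for (size_t b = 0; b < S; b += R)            /* first column of a block */
        for (size_t r = 0; r < R; r++)
            for (size_t c = r + 1; c < R; c++) {
                double tmp = x[r * S + b + c];
                x[r * S + b + c] = x[c * S + b + r];
                x[c * S + b + r] = tmp;
            }
    return 0;
}

/* Stride form, WHT_2 (x) I_S: butterflies on x[k] and x[k + S]. */
void wht2_stride(double *x, size_t S)
{
    for (size_t k = 0; k < S; k++) {
        double a = x[k], b = x[k + S];
        x[k] = a + b;
        x[k + S] = a - b;
    }
}

/* Union form, I_S (x) WHT_2: butterflies on adjacent pairs. */
void wht2_union(double *x, size_t S)
{
    for (size_t k = 0; k < S; k++) {
        double a = x[2 * k], b = x[2 * k + 1];
        x[2 * k] = a + b;
        x[2 * k + 1] = a - b;
    }
}

/* Same result as wht2_stride, computed as L (I_S (x) WHT_2) L,
   where L is the pseudo transpose. */
int wht2_stride_ddl(double *x, size_t S)
{
    if (pseudo_transpose(x, 2, S) != 0) return -1;
    wht2_union(x, S);
    return pseudo_transpose(x, 2, S);
}
```

On the 8-element example, `pseudo_transpose(x, 2, 4)` swaps positions 1 and 4 and positions 3 and 6, which matches the reordering in the figure. The two extra passes over the data only pay off when the stride accesses they replace would otherwise miss in the cache.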

[Figure: access order (1st to 4th) on $x_0, \ldots, x_7$ for three variants: $\mathrm{WHT}_2 \otimes I_4$; $\mathrm{WHT}_2$ with IL=1 $\otimes\, I_{4/2}$; $\mathrm{WHT}_2$ with IL=2 $\otimes\, I_{4/4}$]

IL maximizes the use of cache pre-fetching by interleaving multiple WHT transforms into one transform.

Gatlin and Carter: PACT 2000, Implemented by Bo Hong
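A minimal sketch of interleaving for $\mathrm{WHT}_2 \otimes I_S$, assuming that IL level `il` means one call handles $2^{il}$ neighbouring transforms, so that $S / 2^{il}$ calls remain. The function names are hypothetical, and the inner loop stands in for what would be straight-line code in a real implementation.

```c
#include <assert.h>
#include <stddef.h>

/* One interleaved call: 2^il adjacent WHT_2 transforms at stride S.
   It reads and writes x[0 .. 2^il - 1] and x[S .. S + 2^il - 1], so
   consecutive accesses fall on neighbouring addresses. */
void wht2_il(double *x, size_t S, unsigned il)
{
    size_t width = (size_t)1 << il;
    for (size_t c = 0; c < width; c++) {
        double a = x[c], b = x[c + S];
        x[c] = a + b;
        x[c + S] = a - b;
    }
}

/* WHT_2 (x) I_S computed with S / 2^il interleaved calls.
   Returns the number of calls, or -1 if 2^il does not divide S. */
long wht2_tensor_il(double *x, size_t S, unsigned il)
{
    size_t width;
    long calls = 0;
    assert(x != NULL);
    if (il > 16) return -1;
    width = (size_t)1 << il;
    if (S % width != 0) return -1;
    for (size_t k = 0; k < S; k += width) {
        wht2_il(x + k, S, il);
        calls++;
    }
    return calls;
}
```

With $S = 4$, levels 0, 1, and 2 need 4, 2, and 1 calls respectively and produce identical results, which mirrors the $I_4$, $I_{4/2}$, $I_{4/4}$ labels of the figure.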

[Figure: WHT partition trees for size $2^{16}$. Legend: a DDL split node; $2^5,(3)$ denotes an IL=3 straight-line WHT32 node.]

Environment: PowerPC RS64 III/12 450 MHz, 128/128KB L1 cache, 8 MB L2 cache, 8 GB RAM, AIX 4.3.3, cc 5.0.5

[Figure panels: standard best tree; best tree with DDL; best tree with IL]

Effect of IL and DDL on Performance

DDL and IL improve performance when the data size is larger than the L1 cache, 128 KB = $2^{14} \times 8$ bytes. IL level 4 makes full use of a cache line, 128 bytes = $2^4 \times 8$ bytes.

- SMP implementation obtained using OpenMP
- WHT partition tree is parallelized at the root node
- Simple to insert OpenMP directives
- Better performance obtained with manual scheduling

- DP decides when to use parallelism
- DP builds the partition with the best sequential subtrees
- DP decides the best parallel root node
  - Parallel split
  - Parallel split with DDL
    - Parallel pseudo-transpose
  - Parallel split with IL

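/* Parallel split with OpenMP work sharing; N = N(0) * ... * N(t-1).
   x(j, k, S, N(i)) is the subvector of length N(i) that starts at
   j * N(i) * S + k and has stride S. */
#pragma omp parallel private(i, j, k, R, S)
{
    R = N; S = 1;
    for (i = 0; i < t; i++) {
        R = R / N(i);
        /* the threads share the j loop; implicit barrier at its end */
        #pragma omp for
        for (j = 0; j < R; j++) {
            for (k = 0; k < S; k++) {
                WHT(N(i)) * x(j, k, S, N(i));
            }
        }
        S = S * N(i);
    }
}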

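/* Parallel split with manual scheduling: the R * S tasks of each stage
   are dealt out cyclically, thread id taking tasks id, id + total, ... */
#pragma omp parallel private(i, j, k, m, R, S, id, total)
{
    total = omp_get_num_threads();
    id = omp_get_thread_num();
    R = N; S = 1;
    for (i = 0; i < t; i++) {
        R = R / N(i);
        for (m = id; m < R * S; m += total) {
            j = m / S;
            k = m % S;
            WHT(N(i)) * x(j, k, S, N(i));
        }
        S = S * N(i);
        /* all threads finish stage i before stage i + 1 starts */
        #pragma omp barrier
    }
}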
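The manual-scheduling loop above can be turned into a compilable sketch. It assumes a generic loop-based leaf (`wht_leaf`) in place of the package's straight-line code, and it falls back to a single thread when compiled without OpenMP; the function names and the return-code convention are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* In-place WHT of size 2^n on x[0], x[s], ..., x[(2^n - 1) * s]. */
void wht_leaf(double *x, int n, long s)
{
    long len = 1L << n;
    for (long h = 1; h < len; h *= 2)
        for (long i = 0; i < len; i += 2 * h)
            for (long j = i; j < i + h; j++) {
                double a = x[j * s], b = x[(j + h) * s];
                x[j * s] = a + b;
                x[(j + h) * s] = a - b;
            }
}

/* Parallel in-place WHT of size 2^n for the split
   n = parts[0] + ... + parts[t-1]. Returns 0 on success, -1 otherwise. */
int wht_parallel(double *x, int n, const int *parts, int t)
{
    int sum = 0;
    assert(x != NULL && parts != NULL);
    if (n < 1 || n > 30) return -1;
    for (int i = 0; i < t; i++) {
        if (parts[i] < 1) return -1;
        sum += parts[i];
    }
    if (sum != n) return -1;

    #pragma omp parallel
    {
#ifdef _OPENMP
        long total = omp_get_num_threads();
        long id = omp_get_thread_num();
#else
        long total = 1, id = 0;              /* sequential fallback */
#endif
        long R = 1L << n, S = 1;
        for (int i = 0; i < t; i++) {
            long Ni = 1L << parts[i];
            R /= Ni;
            /* tasks of one stage touch disjoint data, so no locking */
            for (long m = id; m < R * S; m += total) {
                long j = m / S, k = m % S;
                wht_leaf(x + j * Ni * S + k, parts[i], S);
            }
            S *= Ni;
            #pragma omp barrier
        }
    }
    return 0;
}
```

With GCC or Clang the OpenMP path is enabled by `-fopenmp`. Applying the transform twice returns the input scaled by $2^n$, which is a convenient correctness check for any split and any thread count.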

[Figure: assignment of pseudo-transpose work (array dimensions labeled $R$ and $S$) to threads 1 to 4 in three variants: coarse-grained pseudo transpose; fine-grained pseudo transpose; fine-grained pseudo transpose with ID shift]

In $\mathrm{WHT}_{RS} = L\,(I_S \otimes \mathrm{WHT}_R)\,L\,(I_R \otimes \mathrm{WHT}_S)$, the pseudo transpose $L$ can be parallelized at different granularities.

[Figure: parallel WHT partition trees for size $2^{26}$. Legend: a parallel DDL split node; a DDL split node.]

[Figure: performance of coarse-grained DDL, fine-grained DDL, and fine-grained DDL with ID shift]

The three plateaus in the figure are due to the L1 and L2 caches. A good binary partition of a large parallel tree node tends to be built from subtree nodes within the first plateau.

[Figure panel: PowerPC RS64 III]

[Tables (contents not recovered): A. PowerPC RS64 III; B. POWER3 II; C. UltraSPARC v8plus]

Data size is $2^{25}$ for Table A and $2^{23}$ for Tables B and C.

- Parallel WHT package provides efficient parallel performance across multiple SMP platforms using OpenMP
- Self-adapts to different architectures using search
- Must take into account data access pattern
- Parallel implementation should not constrain search
- Package is available for download at SPIRAL website http://www.ece.cmu.edu/~spiral

- Working on a distributed memory version using MPI

Parallel IL uses pre-fetched data on the same cache line and eliminates data contention among threads, so it has better parallel efficiency on some architectures.

[Figure: WHT tasks assigned to threads 1 to 4]

Choice in scheduling WHT tasks for $(\mathrm{WHT}_R \otimes I_S)$ and $(I_R \otimes \mathrm{WHT}_S)$:

- Small granularity: tasks of size $R$ or $S$
- Large granularity: tasks of size $RS$ / number of threads