
### A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms

Kang Chen and Jeremy Johnson

Department of Mathematics and Computer Science

Drexel University

Motivation and Overview

- High performance implementation of critical signal processing kernels
- A self-optimizing parallel package for computing fast signal transforms
- Prototype transform (WHT)
- Build on existing sequential package
- SMP implementation using OpenMP

- Part of SPIRAL project
- http://www.ece.cmu.edu/~spiral

Outline

- Walsh-Hadamard Transform (WHT)
- Sequential performance and optimization using dynamic programming
- A parallel implementation of the WHT
- Parallel performance and optimization including parallelism in the search

Walsh-Hadamard Transform

Fast WHT algorithms are obtained by factoring the WHT matrix
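The factorization did not survive in the transcript. As a reference point, the form used in the cited Johnson and Püschel work can be written as below, where $n = n_1 + \cdots + n_t$ and $I_m$ is the $m \times m$ identity. The symbols $n_i$ and $t$ are notation assumed here, not taken from the slide.

```latex
\mathrm{WHT}_{2^n} \;=\; \prod_{i=1}^{t}
  \left( I_{2^{\,n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{\,n_{i+1} + \cdots + n_t}} \right),
\qquad
\mathrm{WHT}_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
```

Each choice of $n_1, \ldots, n_t$, applied recursively to the smaller transforms, gives a different algorithm for the same transform.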

SPIRAL WHT Package

- All WHT algorithms have the same arithmetic cost, $O(N \log N)$, but different data access patterns
- Different factorizations lead to varying amounts of recursion and iteration
- Transforms of small sizes ($2^1$ to $2^8$) are implemented in straight-line code to reduce overhead
- The WHT package allows exploration of different algorithms and implementations
- Optimization/adaptation to architectures is performed by searching for the fastest algorithm

Johnson and Püschel: ICASSP 2000
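The points above can be sketched as a small sequential routine. This is a minimal illustration, not the package's code: the names `wht_small`, `wht_apply`, and `parts` are hypothetical, and a plain butterfly loop stands in for the unrolled straight-line codelets. The array `parts` holds the exponents $n_1, \ldots, n_t$ of one flat (non-recursive) factorization.

```c
#include <stddef.h>

/* In-place WHT of size 2^m on elements x[0], x[stride], x[2*stride], ...
   A plain radix-2 butterfly loop; the package described above would use
   unrolled straight-line code for these small sizes instead. */
static void wht_small(double *x, size_t m, size_t stride)
{
    size_t n = (size_t)1 << m;
    for (size_t half = 1; half < n; half <<= 1) {
        for (size_t base = 0; base < n; base += 2 * half) {
            for (size_t off = 0; off < half; off++) {
                double a = x[(base + off) * stride];
                double b = x[(base + off + half) * stride];
                x[(base + off) * stride] = a + b;
                x[(base + off + half) * stride] = a - b;
            }
        }
    }
}

/* Apply the WHT of size 2^(parts[0] + ... + parts[t-1]) in place,
   one factor (I_R (x) WHT_{N_i} (x) I_S) at a time. */
void wht_apply(double *x, const size_t *parts, size_t t)
{
    size_t n = 0;
    for (size_t i = 0; i < t; i++)
        n += parts[i];
    size_t R = (size_t)1 << n;  /* product of the factor sizes not yet applied */
    size_t S = 1;               /* product of the factor sizes already applied */
    for (size_t i = 0; i < t; i++) {
        size_t Ni = (size_t)1 << parts[i];
        R /= Ni;
        for (size_t j = 0; j < R; j++)
            for (size_t k = 0; k < S; k++)
                wht_small(x + j * Ni * S + k, parts[i], S);
        S *= Ni;
    }
}
```

Calling `wht_apply` with `parts = {3}`, `{1, 2}`, or `{1, 1, 1}` on the same 8-element input produces the same output. Only the loop structure and the order of memory accesses differ, which is what the search exploits.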

Dynamic Programming

- Exhaustive search: searches all possible algorithms
- Cost is on the order of $4^n / n^{3/2}$ for binary factorizations

- Dynamic programming: searches among algorithms generated from previously determined best algorithms
- Cost is on the order of $n^2$ for binary factorizations

[Figure: partition tree for a possibly best algorithm at size $2^{13}$, built from the best algorithm at size $2^9$ and the best algorithm at size $2^4$. Node labels: $2^{13}$, $2^9$, $2^5$, $2^4$.]
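The dynamic programming search over binary factorizations can be sketched as follows. This is an illustration under stated assumptions: the real package times candidate algorithms on the machine, whereas here a made-up cost model (`toy_leaf_cost` and the `overhead` parameter, both hypothetical) stands in for measurement. The `evaluated` counter shows why the cost grows like $n^2$: size $2^m$ only considers $m - 1$ splits built from already known best subtrees.

```c
#include <stddef.h>

#define MAX_N 30      /* largest exponent the table can hold */
#define MAX_LEAF 8    /* straight-line codelets cover 2^1 .. 2^8 */

typedef struct {
    double cost[MAX_N + 1];    /* best cost found for the WHT of size 2^m */
    size_t split[MAX_N + 1];   /* left exponent of the best split, 0 = leaf */
    size_t evaluated;          /* number of candidate algorithms examined */
} dp_table;

/* Hypothetical stand-in for timing a straight-line codelet of size 2^m. */
static double toy_leaf_cost(size_t m)
{
    double flops = (double)m * (double)(1UL << m);
    return m <= 5 ? flops : 1.5 * flops;  /* pretend large codelets are slower */
}

/* Fill tab for exponents 1..n (n <= MAX_N). overhead is a made-up
   per-element cost charged at every split node. */
void dp_search(dp_table *tab, size_t n, double overhead)
{
    tab->evaluated = 0;
    for (size_t m = 1; m <= n; m++) {
        int have = 0;
        double best = 0.0;
        size_t best_split = 0;
        if (m <= MAX_LEAF) {            /* candidate: a single leaf codelet */
            best = toy_leaf_cost(m);
            have = 1;
            tab->evaluated++;
        }
        /* candidate: WHT_{2^m} = (WHT_{2^k} (x) I_{2^(m-k)}) (I_{2^k} (x) WHT_{2^(m-k)}),
           reusing the best algorithms already found for sizes 2^k and 2^(m-k) */
        for (size_t k = 1; k < m; k++) {
            double c = (double)(1UL << (m - k)) * tab->cost[k]
                     + (double)(1UL << k) * tab->cost[m - k]
                     + overhead * (double)(1UL << m);
            tab->evaluated++;
            if (!have || c < best) {
                best = c;
                best_split = k;
                have = 1;
            }
        }
        tab->cost[m] = best;
        tab->split[m] = best_split;
    }
}
```

For `n = 13` the table is filled after examining 8 leaf candidates and 78 split candidates. The result is only "possibly best", because the best subtree in isolation need not be the best subtree inside a larger tree.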

Performance of WHT Algorithms

- Iterative algorithms have less overhead
- Recursive algorithms have better data locality
- The best WHT algorithms are a compromise between low overhead and a good data flow pattern

Architecture Dependency

The best WHT algorithms also depend on architecture characteristics such as the memory hierarchy, cache structure, and cache miss penalty.

[Figure: best partition trees for size $2^{22}$ on UltraSPARC v9, POWER3 II, and PowerPC RS64 III. Legend: a DDL split node; an IL=1 straight-line WHT32 node, labeled $2^5,(1)$.]

Improved Data Access Patterns

- The stride tensor makes the WHT access data outside the current block, which loses locality
- A large stride introduces more cache conflict misses

[Figure: accesses to x0 through x7 over time under the stride tensor and under the union tensor, for a size $2^3$ transform split into $2^1$ and $2^2$.]

Dynamic Data Layout

DDL uses an in-place pseudo transpose to reorder the data so that the stride tensor becomes a union tensor.

[Figure: pseudo transpose of x0 through x7. The layout

x0 x1 x2 x3
x4 x5 x6 x7

becomes

x0 x4 x2 x6
x1 x5 x3 x7]

N. Park and V. K. Prasanna: ICASSP 2001
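One reading of the pseudo transpose that is consistent with the 2 x 4 example in the figure is an in-place transpose of each square block of the data array. The sketch below follows that reading. The function name and the assumption that `S` is a multiple of `R` are illustrative, and the layout used in the cited work may be more general.

```c
#include <stddef.h>

/* View x as an R x S row-major matrix, with S a multiple of R, and
   transpose each R x R block in place. Applying it twice restores x. */
void pseudo_transpose(double *x, size_t R, size_t S)
{
    for (size_t b = 0; b < S; b += R) {          /* first column of the block */
        for (size_t i = 0; i < R; i++) {
            for (size_t j = i + 1; j < R; j++) {
                double tmp = x[i * S + b + j];
                x[i * S + b + j] = x[j * S + b + i];
                x[j * S + b + i] = tmp;
            }
        }
    }
}
```

With `R = 2` and `S = 4`, the pairs that the stride tensor would access 4 elements apart (x0 with x4, x1 with x5, and so on) end up adjacent in memory, so the same butterflies can run at unit stride before the data is swapped back.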

Loop Interleaving

IL maximizes the use of cache prefetching by interleaving multiple WHT transforms into one transform.

[Figure: access order (1st to 4th) over x0 through x7 for $WHT_2 \otimes I_4$, $WHT_2^{IL=1} \otimes I_{4/2}$, and $WHT_2^{IL=2} \otimes I_{4/4}$.]

Gatlin and Carter: PACT 2000, Implemented by Bo Hong
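A minimal sketch of the interleaving idea, assuming that IL level `il` merges $2^{il}$ adjacent strided transforms into one codelet call (which is what the $I_{4/2}$ and $I_{4/4}$ labels suggest). The function names are hypothetical, and a loop stands in for the unrolled straight-line code.

```c
#include <stddef.h>

/* WHT of size 2^m applied to 2^il adjacent columns at once. The innermost
   loop walks over neighbouring elements, so each cache line that is
   fetched serves several transforms. */
static void wht_small_il(double *x, unsigned m, size_t stride, unsigned il)
{
    size_t n = (size_t)1 << m;
    size_t width = (size_t)1 << il;
    for (size_t half = 1; half < n; half <<= 1) {
        for (size_t base = 0; base < n; base += 2 * half) {
            for (size_t off = 0; off < half; off++) {
                for (size_t c = 0; c < width; c++) {  /* interleaved transforms */
                    size_t lo = (base + off) * stride + c;
                    size_t hi = lo + half * stride;
                    double a = x[lo];
                    double b = x[hi];
                    x[lo] = a + b;
                    x[hi] = a - b;
                }
            }
        }
    }
}

/* Compute (WHT_{2^m} (x) I_S) in place with S / 2^il codelet calls.
   S must be a multiple of 2^il. */
void wht_tensor_identity(double *x, unsigned m, size_t S, unsigned il)
{
    size_t width = (size_t)1 << il;
    for (size_t k = 0; k < S; k += width)
        wht_small_il(x + k, m, S, il);
}
```

With `m = 1` and `S = 4`, levels `il = 0`, `1`, and `2` correspond to the three cases in the figure: four, two, and one codelet calls, all giving the same result.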

Best WHT Partition Trees

Environment: PowerPC RS64 III/12, 450 MHz, 128/128 KB L1 cache, 8 MB L2 cache, 8 GB RAM, AIX 4.3.3, cc 5.0.5

[Figure: best partition trees for size $2^{16}$: standard best tree, best tree with DDL, and best tree with IL. Legend: a DDL split node; an IL=3 straight-line WHT32 node, labeled $2^5,(3)$.]

Effect of IL and DDL on Performance

DDL and IL improve performance when the data size is larger than the L1 cache, 128 KB = $2^{14} \times 8$ bytes. IL level 4 reaches the maximal use of a cache line, 128 bytes = $2^4 \times 8$ bytes.
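The cache arithmetic can be checked directly, assuming 8-byte (double-precision) elements, which is what the factor of 8 bytes suggests:

```latex
128\ \text{KB} = 2^{17}\ \text{bytes} = 2^{14} \times 8\ \text{bytes},
\qquad
128\ \text{bytes} = 2^{7}\ \text{bytes} = 2^{4} \times 8\ \text{bytes}
```

A vector of $2^{14}$ elements therefore exactly fills the L1 cache, and one cache line holds $2^4 = 16$ elements, the number of transforms that IL level 4 would interleave if level $k$ interleaves $2^k$ transforms.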

Parallel WHT Package

- SMP implementation obtained using OpenMP
- WHT partition tree is parallelized at the root node
- Simple to insert OpenMP directives
- Better performance obtained with manual scheduling

- DP decides when to use parallelism
- DP builds the partition with best sequential subtrees
- DP decides the best parallel root node
- Parallel split
- Parallel split with DDL
- Parallel pseudo-transpose

- Parallel split with IL

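OpenMP Implementation

    /* Version 1: work sharing through OpenMP directives */
    #pragma omp parallel private(i, k, R, S)
    {
        R = N; S = 1;
        for (i = 0; i < t; i++) {
            R = R / N(i);
            /* iterations over j are independent; implicit barrier at loop end */
            #pragma omp for
            for (j = 0; j < R; j++) {
                for (k = 0; k < S; k++) {
                    WHT(N(i)) * x(j, k, S, N(i));
                }
            }
            S = S * N(i);
        }
    }

    /* Version 2: manual scheduling, tasks dealt round-robin to threads */
    #pragma omp parallel private(i, j, k, task, id, total, R, S)
    {
        total = get_total_threads();
        id = get_thread_id();
        R = N; S = 1;
        for (i = 0; i < t; i++) {
            R = R / N(i);
            /* restart from this thread's id at every stage */
            for (task = id; task < R * S; task += total) {
                j = task / S;
                k = task % S;
                WHT(N(i)) * x(j, k, S, N(i));
            }
            S = S * N(i);
            /* all threads must finish stage i before stage i + 1 starts */
            #pragma omp barrier
        }
    }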
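A compilable sketch of the manually scheduled variant follows. It is an illustration rather than the package's code: `wht_small`, `wht_parallel`, and `parts` are hypothetical names, a butterfly loop replaces the straight-line codelets, and the standard `omp_get_num_threads` and `omp_get_thread_num` routines are assumed to play the role of the slide's `get_total_threads` and `get_thread_id`. Serial fallbacks let it build without OpenMP.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP. */
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void) { return 0; }
#endif

/* In-place WHT of size 2^m on elements x[0], x[stride], x[2*stride], ... */
static void wht_small(double *x, size_t m, size_t stride)
{
    size_t n = (size_t)1 << m;
    for (size_t half = 1; half < n; half <<= 1)
        for (size_t base = 0; base < n; base += 2 * half)
            for (size_t off = 0; off < half; off++) {
                double a = x[(base + off) * stride];
                double b = x[(base + off + half) * stride];
                x[(base + off) * stride] = a + b;
                x[(base + off + half) * stride] = a - b;
            }
}

/* WHT of size 2^(parts[0] + ... + parts[t-1]), parallelized at the root. */
void wht_parallel(double *x, const size_t *parts, size_t t)
{
    size_t n = 0;
    for (size_t i = 0; i < t; i++)
        n += parts[i];
    const size_t N = (size_t)1 << n;

    #pragma omp parallel
    {
        /* Variables declared inside the region are private to each thread. */
        size_t total = (size_t)omp_get_num_threads();
        size_t id = (size_t)omp_get_thread_num();
        size_t R = N, S = 1;
        for (size_t i = 0; i < t; i++) {
            size_t Ni = (size_t)1 << parts[i];
            R /= Ni;
            /* The R * S small transforms of this stage touch disjoint data. */
            for (size_t task = id; task < R * S; task += total) {
                size_t j = task / S;
                size_t k = task % S;
                wht_small(x + j * Ni * S + k, parts[i], S);
            }
            S *= Ni;
            /* Stage i + 1 reads what other threads wrote in stage i. */
            #pragma omp barrier
        }
    }
}
```

Because the WHT matrix satisfies $WHT_N \cdot WHT_N = N \cdot I$, calling `wht_parallel` twice on the same array should return the input scaled by `N`, which gives a quick correctness check for any thread count.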

Parallel DDL

In $WHT_{RS} = L\,(I_S \otimes WHT_R)\,L\,(I_R \otimes WHT_S)$, the pseudo transpose $L$ can be parallelized at different granularities.

[Figure: a data array with dimensions labeled R and S, divided among threads 1 to 4 under three schemes: coarse-grained pseudo transpose, fine-grained pseudo transpose, and fine-grained pseudo transpose with ID shift.]

Comparison of Parallel Schemes

Best Tree of Parallel DDL Schemes

[Figure: best partition trees for size $2^{26}$ under coarse-grained DDL, fine-grained DDL, and fine-grained DDL with ID shift. Legend: a parallel DDL split node; a DDL split node.]

The three plateaus in the figure are due to the L1 and L2 caches. A good binary partition of a large parallel tree node tends to be built from subtree nodes within the first plateau.

Normalized Runtime of PowerPC RS64

[Figure: normalized runtime plot, PowerPC RS64 III]

Overall Parallel Speedup

Parallel Performance

A. PowerPC RS64 III

B. POWER3 II

C. UltraSPARC v8plus

Data size is $2^{25}$ for Table A and $2^{23}$ for Tables B and C.

Conclusion and Future Work

- Parallel WHT package provides efficient parallel performance across multiple SMP platforms using OpenMP
- Self-adapts to different architectures using search
- Must take into account data access pattern
- Parallel implementation should not constrain search
- Package is available for download at SPIRAL website http://www.ece.cmu.edu/~spiral

- Working on a distributed memory version using MPI

Effect of Scheduling Strategy

Parallel IL uses prefetched data on the same cache line and eliminates data contention among threads, so it achieves better parallel efficiency on some architectures.

Parallel Split Node with IL and DDL

[Figure: work of a parallel split node with IL and DDL, distributed over threads 1 to 4.]

Modified Scheduling

Choice in scheduling WHT tasks for $(WHT_R \otimes I_S)$ and $(I_R \otimes WHT_S)$:

- small granularity, size R or S
- large granularity, size RS / number of threads
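The large-granularity choice can be read as giving each thread one contiguous block of the small transforms instead of dealing them out one at a time. The helper below sketches that block assignment. The names `task_range` and `block_range` are hypothetical, and the slide does not say how a remainder is handled, so spreading it over the first threads is an assumption.

```c
#include <stddef.h>

typedef struct {
    size_t begin;   /* first task index owned by the thread */
    size_t end;     /* one past the last task index owned by the thread */
} task_range;

/* Contiguous block of tasks for thread id (0 <= id < total). The first
   ntasks % total threads receive one extra task. */
task_range block_range(size_t ntasks, size_t id, size_t total)
{
    size_t chunk = ntasks / total;
    size_t extra = ntasks % total;
    task_range r;
    r.begin = id * chunk + (id < extra ? id : extra);
    r.end = r.begin + chunk + (id < extra ? 1 : 0);
    return r;
}
```

A thread would then loop over `task = r.begin` to `r.end - 1` and derive `j` and `k` from `task` as in the round-robin version. Larger blocks mean fewer scheduling steps per thread, at the cost of coarser load balancing.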
