A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms
Kang Chen and Jeremy Johnson
Department of Mathematics and Computer Science
Drexel University
Motivation and Overview
• High performance implementation of critical signal processing kernels
• A self-optimizing parallel package for computing fast signal transforms
• Prototype transform (WHT)
• Build on existing sequential package
• SMP implementation using OpenMP
• Part of SPIRAL project
• http://www.ece.cmu.edu/~spiral

Outline
• Walsh-Hadamard Transform (WHT)
• Sequential performance and optimization using dynamic programming
• A parallel implementation of the WHT
• Parallel performance and optimization including parallelism in the search

Walsh-Hadamard Transform
Fast WHT algorithms are obtained by factoring the WHT matrix.
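
The formulas on this slide did not survive extraction. As a reminder of what "factoring the WHT matrix" refers to, the standard definition and the general factorization used in the WHT package literature can be written as below. The index names n_1, ..., n_t are a notational assumption, not taken from the slide.

```latex
\mathrm{WHT}_{2} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},
\qquad
\mathrm{WHT}_{2^n} = \underbrace{\mathrm{WHT}_2 \otimes \cdots \otimes \mathrm{WHT}_2}_{n\ \text{factors}}

\mathrm{WHT}_{2^n} = \prod_{i=1}^{t}
  \left( I_{2^{\,n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{\,n_{i+1} + \cdots + n_t}} \right),
\qquad n = n_1 + \cdots + n_t
```

Each choice of t and of the n_i, applied recursively to the smaller transforms, gives a different algorithm with the same operation count.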

SPIRAL WHT Package
• All WHT algorithms have the same arithmetic cost, O(N lg N), but different data access patterns
• Different factorizations lead to varying amounts of recursion and iteration
• Transforms of small sizes (2^1 to 2^8) are implemented in straight-line code to reduce overhead
• The WHT package allows exploration of different algorithms and implementations
• Optimization/adaptation to architectures is performed by searching for the fastest algorithm
Johnson and Püschel: ICASSP 2000
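
To make the straight-line versus loop distinction concrete, here is a minimal sketch in C. The function names `wht4_codelet` and `wht_loop` are hypothetical and are not the package's actual identifiers; both operate in place on a strided vector.

```c
#include <assert.h>
#include <stddef.h>

/* Straight-line WHT_4 on x[0], x[s], x[2s], x[3s]: no loops, no index
   arithmetic beyond the four fixed offsets. */
void wht4_codelet(double *x, size_t s)
{
    double a0 = x[0]     + x[s];
    double a1 = x[0]     - x[s];
    double a2 = x[2 * s] + x[3 * s];
    double a3 = x[2 * s] - x[3 * s];
    x[0]     = a0 + a2;
    x[s]     = a1 + a3;
    x[2 * s] = a0 - a2;
    x[3 * s] = a1 - a3;
}

/* Generic loop-based WHT of size n (a power of two) on the strided
   vector x[0], x[s], ..., x[(n-1)s]. Same arithmetic, more overhead. */
void wht_loop(double *x, size_t n, size_t s)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t h = 1; h < n; h *= 2)
        for (size_t b = 0; b < n; b += 2 * h)
            for (size_t j = b; j < b + h; j++) {
                double u = x[j * s], v = x[(j + h) * s];
                x[j * s]       = u + v;
                x[(j + h) * s] = u - v;
            }
}
```

Calling `wht4_codelet(x, 2)` and `wht_loop(x, 4, 2)` on copies of the same array should give identical results; the codelet simply avoids the loop control.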

Dynamic Programming
• Exhaustive search: searching all possible algorithms
• Cost is on the order of 4^n / n^(3/2) for binary factorizations
• Dynamic programming: searching among algorithms generated from previously determined best algorithms
• Cost is on the order of n^2 for binary factorizations
[Figure: partition trees over the sizes 2^13, 2^9, 2^5 and 2^4. Labels: possibly best algorithm at size 2^13; best algorithm at size 2^9; best algorithm at size 2^4.]
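
The dynamic programming search over binary factorizations can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `dp_search`, `dp_result`, `measure_fn` and `toy_measure` are hypothetical names, and `toy_measure` is an invented cost model standing in for the real step of timing each candidate on the target machine.

```c
#include <assert.h>
#include <float.h>

#define MAX_EXP  30   /* largest exponent n handled by this sketch */
#define MAX_LEAF 8    /* straight-line code exists for 2^1 .. 2^8 */

/* Returns the runtime of one candidate for size 2^n: a leaf when
   left == 0, otherwise the split 2^n = 2^left * 2^(n-left) built from
   the best subtrees found so far, whose runtimes are in best[]. */
typedef double (*measure_fn)(int n, int left, const double *best);

typedef struct {
    double best[MAX_EXP + 1];   /* best runtime found for size 2^n */
    int    split[MAX_EXP + 1];  /* exponent of the left child, 0 = leaf */
    long   evaluated;           /* number of candidates measured */
} dp_result;

void dp_search(int n_max, measure_fn measure, dp_result *r)
{
    assert(n_max >= 1 && n_max <= MAX_EXP);
    r->evaluated = 0;
    for (int n = 1; n <= n_max; n++) {
        r->best[n] = DBL_MAX;
        r->split[n] = 0;
        /* left == 0 is the leaf candidate, only available for small n */
        for (int left = (n <= MAX_LEAF ? 0 : 1); left < n; left++) {
            double c = measure(n, left, r->best);
            r->evaluated++;
            if (c < r->best[n]) {
                r->best[n] = c;
                r->split[n] = left;
            }
        }
    }
}

/* Hypothetical stand-in for timing: operation count plus a fixed
   per-element overhead for every split. */
double toy_measure(int n, int left, const double *best)
{
    double size = (double)(1UL << n);
    if (left == 0)
        return n * size;
    return (double)(1UL << (n - left)) * best[left]
         + (double)(1UL << left) * best[n - left]
         + 0.1 * size;
}
```

Each size n costs at most n candidates, so the whole search measures on the order of n^2 candidates. The price is that DP only finds the true optimum if the best algorithm for a size is also best when used as a subtree, which is why the slide calls the result "possibly best".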

Performance of WHT Algorithms
• Iterative algorithms have less overhead
• Recursive algorithms have better data locality
• The best WHT algorithms are a compromise between low overhead and a good data flow pattern

Architecture Dependency
The best WHT algorithms also depend on architecture characteristics such as the memory hierarchy, cache structure and cache miss penalty.
[Figure: best partition trees for size 2^22 on UltraSPARC v9, POWER3 II and PowerPC RS64 III. Legend: a marked 2^22 node is a DDL split node; a node labeled "2^5,(1)" is an IL=1 straight-line WHT32 node.]

Improved Data Access Patterns
• A stride tensor makes the WHT access data outside the current block, which loses locality
• A large stride introduces more conflict cache misses
[Figure: elements of x0 ... x7 accessed over time for the stride tensor and for the union tensor.]

Dynamic Data Layout
DDL uses an in-place pseudo transpose to swap data so that a stride tensor is changed into a union tensor.
N. Park and V. K. Prasanna: ICASSP 2001
[Figure: two pseudo transposes applied to the elements x0 ... x7.]

Loop Interleaving
IL maximizes the use of cache pre-fetching by interleaving multiple WHT transforms into one transform.
Gatlin and Carter: PACT 2000; implemented by Bo Hong
[Figure: access order (1st to 4th) over x0 ... x7 for WHT_2 ⊗ I_4, for WHT_2 with IL=1 ⊗ I_(4/2), and for WHT_2 with IL=2 ⊗ I_(4/4).]
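
The idea of loop interleaving can be sketched for the smallest case, WHT_2 ⊗ I_S. This is a simplified illustration with hypothetical function names (`wht2_il`, `wht2_tensor_id`); the package's interleaved codelets cover larger sizes and are generated as straight-line code.

```c
#include <assert.h>
#include <stddef.h>

/* 2^il WHT_2 transforms interleaved into one call: for each of the two
   strided inputs, 2^il neighbouring elements are touched back to back,
   so one fetched cache line serves several transforms. */
void wht2_il(double *x, size_t s, unsigned il)
{
    size_t width = (size_t)1 << il;
    for (size_t k = 0; k < width; k++) {
        double u = x[k], v = x[k + s];
        x[k]     = u + v;
        x[k + s] = u - v;
    }
}

/* (WHT_2 (x) I_s) applied in place to x[0 .. 2s-1]. With il = 0 this is
   s separate stride-s transforms; with il = m it is s / 2^m calls. */
void wht2_tensor_id(double *x, size_t s, unsigned il)
{
    size_t width = (size_t)1 << il;
    assert(width <= s && s % width == 0);
    for (size_t k = 0; k < s; k += width)
        wht2_il(x + k, s, il);
}
```

The result does not depend on `il`; only the number of calls and the order in which memory is touched change.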

Best WHT Partition Trees
Environment: PowerPC RS64 III/12 450 MHz, 128/128 KB L1 cache, 8 MB L2 cache, 8 GB RAM, AIX 4.3.3, cc 5.0.5
[Figure: partition trees for size 2^16: standard best tree, best tree with DDL, best tree with IL. Legend: a marked 2^16 node is a DDL split node; a node labeled "2^5,(3)" is an IL=3 straight-line WHT32 node.]

Effect of IL and DDL on Performance
DDL and IL improve performance when the data size is larger than the L1 cache, 128 KB = 2^14 × 8 bytes. IL level 4 makes maximal use of a cache line, 128 bytes = 2^4 × 8 bytes.

Parallel WHT Package
• SMP implementation obtained using OpenMP
• WHT partition tree is parallelized at the root node
• Simple to insert OpenMP directives
• Better performance obtained with manual scheduling
• DP decides when to use parallelism
• DP builds the partition with best sequential subtrees
• DP decides the best parallel root node
• Parallel split
• Parallel split with DDL
• Parallel pseudo-transpose
• Parallel split with IL
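
OpenMP Implementation

    /* Variant 1: OpenMP directive on the loop over j.
       x(j, k, S, N(i)) denotes the N(i) elements of x that start at
       index j*N(i)*S + k and are S apart. */
    R = N; S = 1;
    for (i = 0; i < t; i++) {
        R = R / N(i);
        #pragma omp parallel for private(k)
        for (j = 0; j < R; j++) {
            for (k = 0; k < S; k++) {
                WHT(N(i)) * x(j, k, S, N(i));
            }
        }
        S = S * N(i);
    }

    /* Variant 2: manual scheduling. The R*S tasks of each stage are
       dealt out round-robin; a barrier separates the stages. */
    #pragma omp parallel private(total, id, task, R, S, i, j, k)
    {
        total = omp_get_num_threads();
        id = omp_get_thread_num();
        R = N; S = 1;
        for (i = 0; i < t; i++) {
            R = R / N(i);
            for (task = id; task < R * S; task += total) {
                j = task / S; k = task % S;
                WHT(N(i)) * x(j, k, S, N(i));
            }
            S = S * N(i);
            #pragma omp barrier
        }
    }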
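
A compilable sketch of the manually scheduled variant is given here for reference. It is an illustration, not the package's source: `wht_strided` and `wht_parallel` are hypothetical names, the leaf transform is a plain loop rather than a tuned subtree, and the `_OPENMP` guard lets the same file build sequentially when OpenMP is disabled.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* In-place WHT of size n (a power of two) on x[0], x[s], ..., x[(n-1)s]. */
void wht_strided(double *x, size_t n, size_t s)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t h = 1; h < n; h *= 2)
        for (size_t b = 0; b < n; b += 2 * h)
            for (size_t j = b; j < b + h; j++) {
                double u = x[j * s], v = x[(j + h) * s];
                x[j * s]       = u + v;
                x[(j + h) * s] = u - v;
            }
}

/* WHT of size N = n[0] * ... * n[t-1], parallelized at the root node.
   Within a stage the R*S tasks touch disjoint elements, so they can be
   dealt out round-robin; the barrier keeps the stages in order. */
void wht_parallel(double *x, size_t N, const size_t *n, size_t t)
{
    size_t prod = 1;
    for (size_t i = 0; i < t; i++)
        prod *= n[i];
    assert(prod == N);

    #pragma omp parallel
    {
#ifdef _OPENMP
        size_t total = (size_t)omp_get_num_threads();
        size_t id    = (size_t)omp_get_thread_num();
#else
        size_t total = 1, id = 0;
#endif
        size_t R = N, S = 1;            /* private to each thread */
        for (size_t i = 0; i < t; i++) {
            R /= n[i];
            for (size_t task = id; task < R * S; task += total) {
                size_t j = task / S, k = task % S;
                wht_strided(x + j * n[i] * S + k, n[i], S);
            }
            S *= n[i];
            #pragma omp barrier
        }
    }
}
```

For example, `wht_parallel(x, 16, (size_t[]){4, 2, 2}, 3)` should produce the same vector as `wht_strided(x, 16, 1)`. Keeping the task counter separate from the thread id matters: every stage has to start again from the thread's own id.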

Parallel DDL
In WHT_RS = L (I_S ⊗ WHT_R) L (I_R ⊗ WHT_S), the pseudo transpose, L, can be parallelized at different granularities.
[Figure: R × S data blocks assigned to threads 1 to 4 under a coarse-grained pseudo transpose, a fine-grained pseudo transpose, and a fine-grained pseudo transpose with ID shift.]
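
The structure of the DDL split can be checked with a small sequential sketch. Note the substitution: an ordinary out-of-place transpose with a temporary buffer is used here in place of the in-place pseudo transpose, and all function names (`wht_strided`, `transpose`, `wht_ddl_split`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* In-place WHT of size n (a power of two) on x[0], x[s], ..., x[(n-1)s]. */
void wht_strided(double *x, size_t n, size_t s)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t h = 1; h < n; h *= 2)
        for (size_t b = 0; b < n; b += 2 * h)
            for (size_t j = b; j < b + h; j++) {
                double u = x[j * s], v = x[(j + h) * s];
                x[j * s]       = u + v;
                x[(j + h) * s] = u - v;
            }
}

/* out (cols x rows, row-major) = transpose of in (rows x cols). */
void transpose(const double *in, double *out, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            out[c * rows + r] = in[r * cols + c];
}

/* WHT_RS = L (I_S (x) WHT_R) L (I_R (x) WHT_S) on x of length R*S.
   Both transform steps run at unit stride; the transposes carry the
   cost of the reordering. Returns 0 on success, -1 if malloc fails. */
int wht_ddl_split(double *x, size_t R, size_t S)
{
    double *tmp = malloc(R * S * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    for (size_t j = 0; j < R; j++)      /* I_R (x) WHT_S: R contiguous rows */
        wht_strided(x + j * S, S, 1);
    transpose(x, tmp, R, S);            /* L: columns become contiguous rows */
    for (size_t k = 0; k < S; k++)      /* I_S (x) WHT_R: S contiguous rows */
        wht_strided(tmp + k * R, R, 1);
    transpose(tmp, x, S, R);            /* L: restore the original layout */
    free(tmp);
    return 0;
}
```

In a parallel version, the two row loops and the two transposes are the natural places to divide work among threads, which is where the choice of granularity arises.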

Best Tree of Parallel DDL Schemes
[Figure: best partition trees for size 2^26 under coarse-grained DDL, fine-grained DDL, and fine-grained DDL with ID shift. Legend: a parallel DDL split node (2^26); a DDL split node (2^17).]

Normalized Runtime on PowerPC RS64 III
The three plateaus in the figure are due to the L1 and L2 caches. A good binary partition of a large parallel tree node tends to be built from subtree nodes within the first plateau.

Parallel Performance
A. PowerPC RS64 III
B. POWER3 II
C. UltraSPARC v8plus
Data size is 2^25 for Table A and 2^23 for Tables B and C.

Conclusion and Future Work
• Parallel WHT package provides efficient parallel performance across multiple SMP platforms using OpenMP
• Self-adapts to different architectures using search
• Must take into account data access pattern
• Parallel implementation should not constrain search
• Package is available for download at SPIRAL website http://www.ece.cmu.edu/~spiral
• Working on a distributed memory version using MPI

Parallel Split Node with IL and DDL
Parallel IL uses pre-fetched data on the same cache line and eliminates data contention among threads, so it has better parallel efficiency on some architectures.

Modified Scheduling
Choice in scheduling WHT tasks for (WHT_R ⊗ I_S) and (I_R ⊗ WHT_S):
• small granularity, size R or S
• large granularity, size R × S / thread number
[Figure: assignment of WHT tasks to threads 1 to 4.]