Massively Parallel LDPC Decoding on GPU


Presentation Transcript


  1. Massively Parallel LDPC Decoding on GPU. Vivek Tulsidas Bhat, Priyank Gupta

  2. “Workload Partitioning” • Priyank: Motivation and LDPC introduction; analysis of the sequential algorithm and build-up to the parallelization strategy; Lessons Learned: Part 1 • Vivek: Parallelization strategy; Results and Discussion; Lessons Learned: Part 2; Conclusion

  3. Motivation • Forward error correction (FEC) codes are used extensively in various applications to ensure reliable communication. • Current application trends demand increased data rates. • Given the Shannon limit, low-complexity encoders and decoders are necessary. • Enter LDPC: Low-Density Parity-Check codes.

  4. LDPC: Quick Overview • Iterative approach. • Inherently data-parallel. • Computationally expensive. • Therefore, a strong candidate for parallelization.

  5. Our Initial Approach

  6. Parallel Code Flow: Likelihood Ratio Initialization -> Probability Ratio Initialization -> Likelihood Ratio Recomputation -> Probability Ratio Recomputation -> Next Guess Calculation -> Found Codeword or Max Iter.? (Yes: Report Results; No: iterate again)
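
The "Found Codeword" decision in this flow is, in standard LDPC terms, a parity check: a guess x is a codeword when H·x = 0 (mod 2) for the parity-check matrix H. A minimal C sketch of that test follows. The function name `is_codeword` and the dense row-major layout of H are assumptions chosen for brevity; the slides use a sparse representation instead.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if x satisfies every parity check of H (H*x = 0 mod 2), else 0.
   H is a dense rows-by-cols 0/1 matrix stored row-major. The dense layout is
   an assumption for illustration, not the sparse format from the slides. */
int is_codeword(const unsigned char *H, size_t rows, size_t cols,
                const unsigned char *x)
{
    assert(H != NULL && x != NULL);
    for (size_t r = 0; r < rows; r++) {
        unsigned parity = 0;
        for (size_t c = 0; c < cols; c++)
            parity ^= (unsigned)(H[r * cols + c] & x[c]);  /* add mod 2 */
        if (parity != 0)
            return 0;  /* check r is violated, so x is not a codeword */
    }
    return 1;
}
```

A decoder loop would call this after each next-guess step and stop either when it returns 1 or when the iteration limit is reached.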

  7. Analysis of Sequential Code

  8. Sparse Matrix Representation

       typedef struct mod2entry  /* Structure representing a non-zero entry, */
       {                         /* or the header for a row or column */
         int row, col;                    /* Row and column indexes */
         struct mod2entry *left, *right,  /* Pointers to adjacent entry in row */
                          *up, *down;     /* and column, or to headers. Free */
                                          /* entries are linked by 'left'. */
         double pr, lr;  /* Probability and likelihood ratios - not used */
                         /* by the mod2sparse module itself */
       } mod2entry;

       typedef struct mod2block mod2block;  /* Definition not shown on the slide */

       typedef struct  /* Representation of a sparse matrix */
       {
         int n_rows;            /* Number of rows in the matrix */
         int n_cols;            /* Number of columns in the matrix */
         mod2entry *rows;       /* Ptr to array of row headers */
         mod2entry *cols;       /* Ptr to array of column headers */
         mod2block *blocks;     /* Allocated blocks */
         mod2entry *next_free;  /* Next free entry */
       } mod2sparse;

  9. Likelihood Ratio Computation

       LR_estimator = 1 (initial)

       Forward Transition:
         element_LR(nth) = LR_estimator(nth)
         LR_estimator(n+1th) = LR_estimator(nth) * 2/element_PR(n+1th) - 1

       Reverse Transition:
         temp = element_LR(nth) * LR_estimator(nth)
         element_LR(n-1th) = (1 - temp) / (1 + temp)
         LR_estimator(n-1th) = LR_estimator(nth) * 2/element_PR(n-1th) - 1
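
The forward and reverse transitions on this slide can be sketched for one parity-check row stored as plain arrays. This is a hedged illustration, not the authors' code. It assumes the update `* 2/element_PR - 1` means multiplying the estimator by the factor 2/(1 + PR) - 1 = (1 - PR)/(1 + PR), which is the usual form when PR is a ratio P(1)/P(0), and that the estimator is reset to 1 before the reverse pass. The names `lr_factor` and `row_lr_update` are hypothetical.

```c
#include <assert.h>

/* Assumed per-element factor: (1 - pr) / (1 + pr), written as 2/(1+pr) - 1. */
static double lr_factor(double pr)
{
    return 2.0 / (1.0 + pr) - 1.0;
}

/* Updates lr[0..n-1] for one row from the probability ratios pr[0..n-1].
   Afterwards lr[i] = (1 - t) / (1 + t), where t is the product of
   lr_factor(pr[j]) over every j != i in the row. */
void row_lr_update(const double *pr, double *lr, int n)
{
    assert(pr != 0 && lr != 0 && n > 0);

    /* Forward pass: store the product of the factors to the left of i. */
    double est = 1.0;
    for (int i = 0; i < n; i++) {
        lr[i] = est;
        est *= lr_factor(pr[i]);
    }

    /* Reverse pass: combine with the product of the factors to the right. */
    est = 1.0;  /* assumed reset, not stated on the slide */
    for (int i = n - 1; i >= 0; i--) {
        double t = lr[i] * est;
        lr[i] = (1.0 - t) / (1.0 + t);
        est *= lr_factor(pr[i]);
    }
}
```

The two passes give every element the product over all other elements in the row without any division, which is why the computation runs in both directions.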

  10. Probability Ratio Computation

       PR_estimator(nth) = Likelihood_Ratio(nth) (initial)

       Top-Down Transition:
         element_PR(nth) = PR_estimator(nth)
         PR_estimator(n+1th) = PR_estimator(nth) * element_LR(nth)

       Bottom-Up Transition:
         element_PR(n-1th) = element_PR(nth) * PR_estimator(nth)
         PR_estimator(n-1th) = PR_estimator(nth) * element_LR(nth)
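
The column-wise counterpart can be sketched the same way. The sketch below treats one column of the parity-check matrix as plain arrays and assumes that the bottom-up pass restarts its estimator at 1 and multiplies it into the value stored by the top-down pass; the slide does not spell out either detail. The name `col_pr_update` and the parameter `channel_lr` (the bit's initial likelihood ratio) are hypothetical.

```c
#include <assert.h>

/* Updates pr[0..n-1] for one column from the likelihood ratios lr[0..n-1].
   Afterwards pr[i] = channel_lr * product of lr[j] over every j != i. */
void col_pr_update(double channel_lr, const double *lr, double *pr, int n)
{
    assert(lr != 0 && pr != 0 && n > 0);

    /* Top-down pass: channel ratio times the ratios above entry i. */
    double est = channel_lr;
    for (int i = 0; i < n; i++) {
        pr[i] = est;
        est *= lr[i];
    }

    /* Bottom-up pass: fold in the ratios below entry i. */
    est = 1.0;  /* assumed reset, not stated on the slide */
    for (int i = n - 1; i >= 0; i--) {
        pr[i] *= est;
        est *= lr[i];
    }
}
```

For example, with `channel_lr = 2` and `lr = {0.5, 4, 0.25}`, the entry at index 1 ends up as 2 * 0.5 * 0.25 = 0.25.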

  11. Lessons Learned : Part 1 "entities must not be multiplied beyond necessity"

  12. Parallelization Strategy

  13. Transformation: Codewords i-2, i-1, i, i+1, i+2 processed together: Likelihood Ratio Computation -> Probability Ratio Recomputation -> Next Guess Calculation -> Found Codeword or Max Iter.? (No: iterate again; Yes: Report Results)

  14. Use 1-D arrays • BSC channel data (N M-bit codewords read at a time) • BSC data array with N codewords aligned • Likelihood ratios for all MN bits • Bit probabilities for MN bits • Decoded blocks (N M-bit codewords) • Each thread does the computation for one bit, so N M-bit codewords need MN threads for the likelihood ratio, probability ratio and decoded block computations.
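
The one-thread-per-bit mapping can be illustrated on the host in plain C. The sketch assumes the N codewords are stored back to back in the 1-D array (codeword-major order) and that each thread derives its index the way a CUDA kernel typically does, as `blockIdx.x * blockDim.x + threadIdx.x`. The type `bit_pos` and both function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    int codeword;  /* which of the N codewords this thread works on */
    int bit;       /* which of the M bits inside that codeword */
} bit_pos;

/* Host-side stand-in for blockIdx.x * blockDim.x + threadIdx.x. */
int global_thread_id(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Maps a flat thread id to (codeword, bit), assuming codeword-major layout.
   Returns 1 and fills *out when tid addresses a real bit, 0 otherwise, so
   surplus threads in the last block can simply do nothing. */
int map_thread_to_bit(int tid, int n_codewords, int bits_per_codeword,
                      bit_pos *out)
{
    assert(out != 0 && n_codewords > 0 && bits_per_codeword > 0);
    if (tid < 0 || tid >= n_codewords * bits_per_codeword)
        return 0;
    out->codeword = tid / bits_per_codeword;
    out->bit = tid % bits_per_codeword;
    return 1;
}
```

With M = 7 and N = 1024, thread 3 of block 1 (256 threads per block) has flat id 259 and would handle bit 0 of codeword 37.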

  15. Likelihood Ratio Computation: Revisited [Figure panels: Likelihood Ratio Estimator, Forward Estimation; Likelihood Ratio Estimator, Reverse Estimation] The likelihood ratio estimators for forward and reverse estimation are calculated on the host before the likelihood ratio kernel is launched. Note: the illustration shows just one codeword; this is done for N codewords at a time.

  16. Probability Ratio Computation: Revisited [Figure panels: Probability Ratio Estimator, Top-Down Transition; Probability Ratio Estimator, Bottom-Up Transition] Likewise for the probability ratio computation, only this time the operations are done on a column basis.

  17. Salient Features of our implementation • Efficient sparse matrix representation of the standard parity-check matrix. • Simple mathematical model for the likelihood ratio and probability ratio computations. • Dedicated data structures for the likelihood ratio and probability ratio kernels. • Code is easily customizable for different code rates. • Supports a larger number of codewords without any major change to the program architecture.

  18. Experimental Setup

  19. Results (1/3) • Tested extensively for a code rate of (3,7) on a BSC channel with an error probability of 0.05. • Optimal execution configuration: numThreadsPerBlock = 256, numBlocks = 7 * mul_factor, where mul_factor depends on the number of codewords to be decoded: mul_factor = num_codewords / numThreadsPerBlock. • Bit error rate is evaluated as the percentage change of the decoded output with respect to the original source file.
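
Read literally, the execution configuration on this slide can be computed as below. This is a sketch under stated assumptions: the division is integer division, so `num_codewords` is taken to be a multiple of 256, and the constant 7 is read as the number of bits per codeword, which would make the total thread count equal MN. The struct `launch_config` and the function name are hypothetical.

```c
#include <assert.h>

typedef struct {
    int threads_per_block;  /* numThreadsPerBlock on the slide */
    int num_blocks;         /* numBlocks on the slide */
} launch_config;

/* Applies the slide's formulas: mul_factor = num_codewords / 256 and
   numBlocks = 7 * mul_factor. Integer division truncates, so codeword
   counts that are not a multiple of 256 would leave bits uncovered. */
launch_config make_launch_config(int num_codewords)
{
    assert(num_codewords > 0);
    launch_config cfg;
    cfg.threads_per_block = 256;
    int mul_factor = num_codewords / cfg.threads_per_block;
    cfg.num_blocks = 7 * mul_factor;
    return cfg;
}
```

For 1024 codewords this yields 28 blocks of 256 threads, or 7168 threads in total, which matches one thread per bit for 7-bit codewords.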

  20. Results (2/3) : Software Execution Time

  21. Results (3/3) : Bit Error Rate Curve
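
A bit error rate of the kind plotted here is commonly computed by counting the positions where the decoded bits differ from the original source bits. The sketch below does that for bits stored one per byte and returns a percentage; the storage format and the name `bit_error_rate_percent` are assumptions, and the authors' exact measurement procedure is not shown in the transcript.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the percentage of positions where decoded differs from source.
   Both arrays hold one bit (0 or 1) per element. */
double bit_error_rate_percent(const unsigned char *source,
                              const unsigned char *decoded, size_t n_bits)
{
    assert(source != NULL && decoded != NULL && n_bits > 0);
    size_t errors = 0;
    for (size_t i = 0; i < n_bits; i++) {
        if ((source[i] & 1u) != (decoded[i] & 1u))
            errors++;
    }
    return 100.0 * (double)errors / (double)n_bits;
}
```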

  22. Lessons Learned: Part 2 • High occupancy does not guarantee better performance. • Although the GPU implementation provides considerable speedup, its BER results are not attractive (in fact, worse than the CPU-based implementation). • The absence of a double-precision floating-point unit in the GPU impacted the results: the probability ratio and likelihood ratio computations are based on double-precision arithmetic. • Reliability? Random bit flips? These could be catastrophic depending on the application for which LDPC decoding is being used. • Other programming paradigms: OpenMP? Not as attractive in terms of speedup compared to the GPU, but a better BER curve. • Case for built-in ECC features within the GPU architecture: NVIDIA Fermi architecture!

  23. Future Work • Trying this for an AWGN channel with different error probabilities. • How does this perform on better GPU architectures? Tesla? Fermi? • Any other parallelization strategies? cuBLAS routines for sparse matrix computations on the GPU?

  24. Acknowledgement • We would like to thank Prof. Ali Akoglu and Murat Arabaci (OCSL Lab) for guiding us throughout the course of this project.

  25. References • Gabriel Falcao, Leonel Sousa, Vitor Silva, “How GPUs can outperform ASICs for Fast LDPC Decoding”, ICS’09. • Gabriel Falcao, Leonel Sousa, Vitor Silva, “Parallel LDPC Decoding on the Cell/B.E. Processor”, HiPEAC 2009. • Gregory M. Striemer, Ali Akoglu, “An Adaptive LDPC Engine for Space Based Communication Systems”.

  26. Questions : Ask!

  27. Backup Slides

  28. Code Transformation: Likelihood ratio Init Kernel

  29. Code Transformation: Initprp Decode Kernel

  30. Code Transformation: Likelihood Ratio Kernel

  31. Code Transformation: Probability Ratio Kernel

  32. Code Transformation: Next Guess Kernel
