Combining Data Parallelism and Task Parallelism for Efficient Performance on Hybrid CPU and GPU Systems. Center for Visual Information Technology, International Institute of Information Technology Hyderabad. Aditya Deshpande. Adviser: Prof. P J Narayanan.


Presentation Transcript


  1. Combining Data Parallelism and Task Parallelism for Efficient Performance on Hybrid CPU and GPU Systems. Center for Visual Information Technology, International Institute of Information Technology, Hyderabad. Aditya Deshpande. Adviser: Prof. P J Narayanan.

  2. Background: Why Parallel?
  • Early computer systems had only a single core.
  • The number of transistors doubled every two years (Moore's Law).
  • Computer architects used the extra transistors to deliver speedup through frequency scaling.
  • Power increases with frequency.
  • After the Pentium 4 (May 2004), Intel shifted focus to multi-core processors.
  • With the end of frequency scaling, parallel algorithms are the only means of speedup.

  3. As of today …
  • Commodity computers have a multi-core CPU and a many-core GPU.
  • Multi-core CPUs (coarse/task parallelism):
    • LU, QR and Cholesky decomposition.
    • Random number and probability distribution generators.
    • FFT, PBzip2, string processing, bioinformatics, data structures etc.
    • Intel MKL and other libraries.
  • Many-core GPUs (fine/data parallelism):
    • Scan, sort, hashing, SpMV, lists, linear algebra etc.
    • Graph algorithms: BFS, SSSP, APSP, SCC, MST.
    • cuBLAS, cuFFT, NPP, Magma, cuSparse, CUDPP, Thrust etc.

  4. Past Work
  • Earlier data-parallel algorithms constitute only portions of end-to-end applications, e.g. linear algebra, matrix and list primitives.
  • Earlier algorithms also had some inherent data parallelism, e.g. BFS, image processing (filtering, color conversion), FFT.
  In this thesis …
  • Design principles for challenging data-parallel algorithms:
    • Breaking sequentiality
    • Addressing irregularity
  • Combining data and task parallelism for end-to-end applications:
    • Work sharing
    • Pipelining

  5. Outline: Data Parallelism + Task Parallelism on CPU and GPU
  • BREAKING SEQUENTIALITY: Floyd-Steinberg Dithering
  • WORK SHARING: Floyd-Steinberg Dithering (Hybrid and Handover Algorithms)
  • ADDRESSING IRREGULARITY: String Sorting + Burrows Wheeler Transform
  • PIPELINING: Burrows Wheeler Compression (Hybrid and All-Core Algorithms)

  6. Outline: Data Parallelism + Task Parallelism on CPU and GPU
  • BREAKING SEQUENTIALITY: Floyd-Steinberg Dithering
  • WORK SHARING: Floyd-Steinberg Dithering (Hybrid and Handover Algorithms)
  • ADDRESSING IRREGULARITY: String Sorting + Burrows Wheeler Transform
  • PIPELINING: Burrows Wheeler Compression (Hybrid and All-Core Algorithms)

  7. Error Diffusion Dithering
  • A technique to create the illusion of higher color depth.
  • Floyd-Steinberg Dithering (FSD) processes each pixel as follows (see the sketch below):
    1. Sum the incoming error and the pixel value.
    2. Find the nearest representable color to the sum.
    3. Output that nearest color.
    4. Diffuse the residual error to the unprocessed neighbors with weights 7/16 (right), 3/16 (below-left), 5/16 (below) and 1/16 (below-right).
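
  The sequential baseline can be sketched as follows. This is a minimal illustration, not the thesis code; a grayscale input, two output levels and the 128 threshold are assumptions.

    #include <vector>
    #include <cstdint>

    // Minimal sequential Floyd-Steinberg dithering of a grayscale image to black/white.
    std::vector<uint8_t> dither(const std::vector<float>& in, int w, int h) {
        std::vector<float> work(in);          // working copy holds pixel value + diffused error
        std::vector<uint8_t> out(w * h);
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {     // scan-line order: left-to-right, top-to-bottom
                float old = work[y * w + x];
                float q   = old < 128.0f ? 0.0f : 255.0f;   // nearest of the two output colors
                out[y * w + x] = static_cast<uint8_t>(q);
                float err = old - q;                        // residual error to diffuse
                auto add = [&](int dx, int dy, float wgt) {
                    int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && nx < w && ny < h)
                        work[ny * w + nx] += err * wgt;
                };
                add(+1, 0, 7.0f / 16);   // right
                add(-1, 1, 3.0f / 16);   // below-left
                add( 0, 1, 5.0f / 16);   // below
                add(+1, 1, 1.0f / 16);   // below-right
            }
        }
        return out;
    }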

  8. Problem of Sequentiality
  • Error distribution dictates the order of processing pixels.
  • Scan-line order of processing.
  • Inherently sequential O(mn) algorithm for an m x n image.
  • Long chain of dependency (from the last pixel back to the first).
  • Previous work:
    • Metaxas: groups of 3 pixels processed on an N-processor array.
    • Zhang et al.: found FSD too hard to parallelize and used the more inherently parallel pin-wheel error diffusion algorithm instead.

  9. Data Dependency in FSD
  • Each pixel receives incoming errors from already-processed neighbors and sends outgoing errors (with weights 7/16, 3/16, 5/16, 1/16) to unprocessed ones, giving a trapezoidal region of dependency.
  • The data dependency imposes a scheduling constraint.
  • T(i,j): iteration in which pixel (i,j) is processed.
  • T(i,j) > max( T(i-1,j), T(i,j-1), T(i-1,j-1), T(i-1,j+1) ).

  10. Optimal Scheduling in FSD
  • Optimal scheduling: T(i,j) = 1 + max( T(i-1,j), T(i,j-1), T(i-1,j-1), T(i-1,j+1) ).
  • This yields a knight's-move order of processing pixels; for a 6x7 image the iteration labels are:
       1  2  3  4  5  6  7
       3  4  5  6  7  8  9
       5  6  7  8  9 10 11
       7  8  9 10 11 12 13
       9 10 11 12 13 14 15
      11 12 13 14 15 16 17
  • Each pixel depends only on the previous three iterations.
  • Pixels with the same iteration label can be processed in parallel.
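
  A minimal sketch of the label computation: iterating the recurrence reproduces the knight's-move wavefront shown above (missing neighbors outside the image contribute 0).

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        const int m = 6, n = 7;
        std::vector<std::vector<int>> T(m, std::vector<int>(n, 0));
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < n; ++j) {
                auto t = [&](int a, int b) {            // 0 outside the image
                    return (a >= 0 && b >= 0 && b < n) ? T[a][b] : 0;
                };
                T[i][j] = 1 + std::max({t(i - 1, j), t(i, j - 1),
                                        t(i - 1, j - 1), t(i - 1, j + 1)});
                // Closed form: every pixel ends up with label 2*i + j + 1 (0-based indices).
            }
        }
        for (auto& row : T) {
            for (int v : row) std::printf("%3d", v);
            std::printf("\n");
        }
    }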

  11. Coarse Parallel FSD on CPU
  • To increase computation per thread, group pixels into blocks (width a, height b).
  • Pixels within a block are processed sequentially.
  • Trapezoidal blocks satisfy the data dependency of the last pixel within a block.
  • Trapezoidal blocks adhere to the knight's-move ordering; blocks with the same label are processed in parallel.

  12. Fine-Grained Data Parallel GPU FSD
  • GPUs have many more physical cores than multi-core CPUs.
  • It is favorable to have a larger number of light-weight threads.
  • Grouping pixels into blocks reduces the amount of parallelism.
  • Processing at the pixel level itself exposes more parallelism.
  • Pixels of the same iteration (knight's-move label) are scheduled in parallel.

  13. Data Re-ordering for Optimal Performance
  • Naïve storage: row-major or column-major order.
  • The threads of one iteration then access image values that are far apart in memory, i.e. un-coalesced access.

  14. Data Re-ordering for Optimal Performance (contd.)
  • Store the image and errors in knight's-move order.
  • Pixels of the same iteration then occupy consecutive memory locations, so accesses are coalesced.
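
  A minimal sketch of the re-ordering, assuming the 0-based label 2*i + j and a simple row-major linear index: listing pixels label by label makes each iteration contiguous in memory.

    #include <vector>

    // order[k] = row-major index of the k-th pixel in knight's-move storage order.
    std::vector<int> knights_order(int m, int n) {
        std::vector<int> order;
        order.reserve(static_cast<size_t>(m) * n);
        int max_label = 2 * (m - 1) + (n - 1);      // largest label in an m x n image
        for (int L = 0; L <= max_label; ++L) {
            for (int i = 0; i < m; ++i) {           // pixels (i, j) with 2*i + j == L
                int j = L - 2 * i;
                if (j >= 0 && j < n)
                    order.push_back(i * n + j);
            }
        }
        return order;                               // gather with this permutation before dithering
    }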

  15. Outline: Data Parallelism + Task Parallelism on CPU and GPU
  • BREAKING SEQUENTIALITY: Floyd-Steinberg Dithering
  • WORK SHARING: Floyd-Steinberg Dithering (Hybrid and Handover Algorithms)
  • ADDRESSING IRREGULARITY: String Sorting + Burrows Wheeler Transform
  • PIPELINING: Burrows Wheeler Compression (Hybrid and All-Core Algorithms)

  16. Work Sharing for FSD
  • Drawbacks of a pure (GPU-only) approach:
    • The GPU needs a lot of threads to be efficient.
    • Dithering does not offer many parallel pixels in the initial and final iterations; the width of the wavefront (the number of pixels available for parallel processing) first grows and then shrinks over time.
    • Launching only a few threads on the GPU does not make sense.
  • The GPU should be used only when parallelism is above a threshold.
  • Split the data-parallel step across CPU and GPU using work sharing.

  17. Handover and Hybrid FSD
  • Handover FSD: the CPU processes the early iterations where parallelism is low and hands the remaining work over to the GPU at the handover point.
  • Hybrid FSD: CPU and GPU work concurrently on every iteration; the "width" of the portion assigned to the CPU controls load balancing between CPU and GPU.
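
  A minimal host-side sketch of the handover idea: an iteration is sent to the GPU only when its wavefront is wide enough. The two process_iteration_* routines are hypothetical stubs standing in for the CPU path and the CUDA kernel launch, and the threshold value is an assumption.

    #include <cstdio>

    void process_iteration_cpu(int label, int count) {   // stub for the CPU path
        std::printf("CPU : label %d, %d pixels\n", label, count);
    }
    void process_iteration_gpu(int label, int count) {   // stub for the GPU kernel launch
        std::printf("GPU : label %d, %d pixels\n", label, count);
    }

    void handover_fsd(int m, int n, int gpu_threshold) {
        int max_label = 2 * (m - 1) + (n - 1);            // labels are 2*i + j (0-based)
        for (int L = 0; L <= max_label; ++L) {
            int count = 0;                                // pixels (i, j) with 2*i + j == L
            for (int i = 0; i < m; ++i) {
                int j = L - 2 * i;
                if (j >= 0 && j < n) ++count;
            }
            if (count >= gpu_threshold)
                process_iteration_gpu(L, count);          // wide wavefront: worth a kernel launch
            else
                process_iteration_cpu(L, count);          // narrow wavefront: keep it on the CPU
        }
    }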

  18. Results: Speedup by Work Sharing
  • (Plots: optimal handover point for Handover FSD and optimal width for Hybrid FSD.)

  19. Results: Runtime Performance
  • Pure GPU FSD (8600M): 1024x768 – 48 ms, 6200x8000 – 576 ms.

  20. Outline: Data Parallelism + Task Parallelism on CPU and GPU
  • BREAKING SEQUENTIALITY: Floyd-Steinberg Dithering
  • WORK SHARING: Floyd-Steinberg Dithering (Hybrid and Handover Algorithms)
  • ADDRESSING IRREGULARITY: String Sorting + Burrows Wheeler Transform
  • PIPELINING: Burrows Wheeler Compression (Hybrid and All-Core Algorithms)

  21. Sorting
  • Textbooks teach us many popular sorting methods: Quicksort, Mergesort, Radixsort. The data is always numbers!
  • Real data goes beyond just numbers:
    • Dictionary words or sentences.
    • DNA sequences, multi-dimensional database records.
    • File paths.

  22. Can we sort strings efficiently?

  23. Irregularity in String Sorting
  • Number sorting (fixed-length keys):
    • Fixed-length keys (8 to 128 bits).
    • Standard containers: float, int, double etc.
    • Keys fit into registers.
    • Comparisons take O(1) time.
  • String sorting (variable/long-length keys):
    • Keys have no restriction on length.
    • Keys are loaded iteratively from main memory.
    • Comparisons do *not* take O(1) time.
    • Suffix sort: 1M strings of 1M length!
  • Variable work per thread and arbitrary memory accesses: IRREGULARITY.

  24. Can we sort strings efficiently? Yes, we can, if we limit the number of iterative comparisons performed.

  25. String Sorting: Prior Work
  • CPU: Multi-key Quicksort [Bentley and Sedgewick, SODA'97], Burstsort [Sinha et al., JEA'07], MSD Radix Sort [Kärkkäinen and Rantala, SPIRE'08].
  • GPU: Thrust Merge Sort [Satish et al., IPDPS'09], Fixed/Var. Merge Sort [Davidson et al., InPar'12], Hybrid Merge Sort [Banerjee et al., AsHES'13].
  • Our String Sort (radix sort based, GPU).

  26. CPU: Burstsort (Sinha et al.)
  • Input: {bat, barn, bark, by, byte, bytes, wane, way, wall, west}
  • Partition the strings into small buckets using a burst trie on their leading characters.
  • Sort the small buckets in the CPU cache.
  • No merging of sorted buckets is needed; the trie keeps them already ordered.

  27. CPU: MSD Radix Sort
  • Sort: {bat, barn, bark, by, byte, bytes, wane, way, wall}
  • Distribute strings into buckets by their most significant characters; don't explicitly use pointers:
    • Counting methods (two-pass).
    • Dynamic methods (one-pass), using std::vector, std::list.
  • Algorithmic caching: a fixed number of the next few characters is stored with each string.
  • Supra-alphabets: use 2-character granularity.
  • On the GPU (our approach): fastest parallel radix sort, maximum successive characters loaded, adaptive granularity.
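
  For reference, a minimal MSD radix sort sketch for strings (the textbook two-pass counting variant, recursing one character position at a time); this is an illustration, not the optimized Kärkkäinen-Rantala implementation named above.

    #include <string>
    #include <vector>

    namespace {
    constexpr int R = 256;                               // byte alphabet

    int char_at(const std::string& s, size_t d) {
        return d < s.size() ? static_cast<unsigned char>(s[d]) : -1;   // -1 = end of string
    }

    void msd_sort(std::vector<std::string>& a, std::vector<std::string>& aux,
                  size_t lo, size_t hi, size_t d) {
        if (hi - lo <= 1) return;                        // 0 or 1 string: already sorted
        std::vector<size_t> count(R + 2, 0);
        for (size_t i = lo; i < hi; ++i) count[char_at(a[i], d) + 2]++;     // frequencies
        for (int r = 0; r < R + 1; ++r) count[r + 1] += count[r];           // bucket starts
        for (size_t i = lo; i < hi; ++i) aux[count[char_at(a[i], d) + 1]++] = a[i];
        for (size_t i = lo; i < hi; ++i) a[i] = aux[i - lo];
        for (int r = 0; r < R; ++r)                      // recurse into each character bucket
            msd_sort(a, aux, lo + count[r], lo + count[r + 1], d + 1);
    }
    }  // namespace

    void msd_string_sort(std::vector<std::string>& a) {
        std::vector<std::string> aux(a.size());
        msd_sort(a, aux, 0, a.size(), 0);
    }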

  28. GPU: Davidson et al. String Sort (3-Stage Merge Sort)
  • Three-stage merge sort: stable bitonic sort, parallel merge, co-operative merge.
  • Keys: first few characters. Values: index of the successive characters.
  • An iterative comparator resolves ties by loading further characters.
  • Prefers register packing over over-utilization.
  • 2.5x faster than Thrust merge sort.

  29. Can we sort strings efficiently? Yes, we can, if we limit the number of iterative comparisons performed. Do previous methods do this? CPU: ✔ (small buckets). GPU: ✖ (not quite!).

  30. Merge Sort: Iterative Comparisons
  • All previous GPU string sorting approaches are based on merge sort.
  • Ties are resolved by repetitively loading characters in every merge step.
  • Davidson et al. observe that "after every merge step comparisons are between more similar strings".
  • Iterative comparisons = high-latency global memory access = divergence.
  • We develop a radix sort based string sort to mitigate this.

  31. Radix Sort for String Sorting
  • First sort: the first k characters of each string form the key.
  • Future sorts: the key is a segment ID (a proxy for the already-sorted prefix, stored in the most significant bytes) followed by the next k characters.

  32. Our GPU String Sort
  • Entire strings are not shuffled; we move only indices and prefixes.
  • The fastest GPU radix sort primitive is used to perform each sort.
  • GPU radix sort by Merrill and Grimshaw:
    • Optimized to be compute bound (hides the irregularity of memory).
    • Many operations performed in shared memory/registers.
  • Unlike merge sort, each part of a string is loaded only once.
  • Parallel segment ID generation: a parallel scan over the segment array of the sorted keys, followed by a parallel lookup, assigns the new segment IDs.
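
  A minimal CPU sketch of one step of the segment-ID + prefix scheme. std::stable_sort stands in for the GPU radix sort primitive, and the 2-byte segment ID / 6-byte character payload split is an assumption, not the thesis layout.

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Pack: [16-bit segment ID | next 6 characters]; equal keys mean
    // "same bucket so far and same next 6 characters".
    uint64_t make_key(uint16_t seg, const std::string& s, size_t offset) {
        uint64_t key = static_cast<uint64_t>(seg) << 48;
        for (int c = 0; c < 6; ++c) {
            uint64_t ch = offset + c < s.size()
                              ? static_cast<unsigned char>(s[offset + c]) : 0;
            key |= ch << (8 * (5 - c));
        }
        return key;
    }

    void sort_step(const std::vector<std::string>& strings,
                   std::vector<uint32_t>& order,      // permutation of string indices
                   std::vector<uint16_t>& seg,        // per-position segment IDs (<= 65536 segments assumed)
                   size_t offset) {                   // characters consumed so far
        const size_t n = order.size();
        std::vector<std::pair<uint64_t, uint32_t>> kv(n);
        for (size_t i = 0; i < n; ++i)
            kv[i] = {make_key(seg[i], strings[order[i]], offset), order[i]};
        std::stable_sort(kv.begin(), kv.end());       // stand-in for the GPU radix sort
        // Segment ID generation: a new segment starts wherever the key changes.
        uint16_t next_seg = 0;
        for (size_t i = 0; i < n; ++i) {
            if (i > 0 && kv[i].first != kv[i - 1].first) ++next_seg;
            order[i] = kv[i].second;
            seg[i] = next_seg;
        }
    }

  Repeating sort_step with offset advanced by six characters resolves progressively longer ties; the real pipeline also applies the adaptive segment-ID and singleton-elimination optimizations of the next slide.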

  33. Additional Optimizations
  • Adaptive Segment ID:
    • The number of segments is limited, so the number of segment-ID bytes in the key can be limited too.
    • Apart from the minimum required segment-ID bytes, the remaining key bytes carry the next characters.
    • This allows the maximum number of characters to be compared per sort step.
  • Singleton Elimination: remove size-1 buckets. A stencil marks the singletons, a parallel scan over the destination array computes compacted positions, and a parallel scatter produces a smaller sorting problem.
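
  A minimal CPU sketch of singleton elimination as stencil + scan + scatter: strings that are alone in their segment are already in final position and can be dropped from subsequent sort steps. The data layout mirrors the sketch above and is an assumption.

    #include <cstdint>
    #include <vector>

    std::vector<uint32_t> eliminate_singletons(const std::vector<uint32_t>& order,
                                               const std::vector<uint16_t>& seg) {
        const size_t n = order.size();
        // Stencil: keep[i] = 1 if element i shares its segment with a neighbour.
        std::vector<uint32_t> keep(n);
        for (size_t i = 0; i < n; ++i) {
            bool same_prev = i > 0 && seg[i] == seg[i - 1];
            bool same_next = i + 1 < n && seg[i] == seg[i + 1];
            keep[i] = (same_prev || same_next) ? 1u : 0u;
        }
        // Exclusive scan of the stencil gives each kept element its destination.
        std::vector<uint32_t> dest(n);
        uint32_t running = 0;
        for (size_t i = 0; i < n; ++i) { dest[i] = running; running += keep[i]; }
        // Scatter the survivors into the smaller problem.
        std::vector<uint32_t> compacted(running);
        for (size_t i = 0; i < n; ++i)
            if (keep[i]) compacted[dest[i]] = order[i];
        return compacted;
    }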

  34. Results: Datasets
  • We use the datasets created by Sinha et al. (Burstsort).
  • We also create two practical datasets of our own (pc-filelist and sentences).
  • After-sort tie length indicates the difficulty of sorting a dataset.
  • (Table: details of the datasets.)

  35. Runtime Speedup • Good speedups on tough datasets (url, pc-filelist, sentences). • Max. speedup on genome demonstrates scalability of radix sort. • Outperforms both previous GPU and CPU methods.

  36. Analytical Estimate of Runtime
  • For a thrust radix sort throughput of p Mkeys/s and a problem of N Mkeys, one sort takes t = (1000/p) x N ms.
  • Time per iteration: α x t, where α = 1.5 – 2 for our approach.
  • With k the maximum tie length (number of sort iterations), total time = α x (1000/p) x N x k ms.
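
  As an illustrative plug-in (the numbers here are assumed, not thesis measurements): with p = 250 Mkeys/s, N = 10 Mkeys, k = 8 and α = 1.75, the estimate gives 1.75 x (1000/250) x 10 x 8 = 560 ms for the full string sort.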

  37. Effect of Optimizations
  • Speedup from singleton elimination: 0.9x to 4.5x.
  • Speedup from adaptive segment ID: 1.4x to 3.3x.

  38. Runtime on Standard Primitives
  • Standard primitives improve in performance over time:
    • Tuned by vendors for new architectures.
    • New algorithms/improvements developed over time (GPU sorts have continually improved).
  • Our string sort can inherit these improvements without re-design.

  39. String Sorting Summary
  We built a GPU string sort that:
  • outperforms the state of the art,
  • adapts to future architectures,
  • is the first radix sort based string sort,
  • scales to challenging inputs.
  Code is available at http://web.iiit.ac.in/~adity.deshapandeug08/stringSort/ and is also part of CUDPP (a standard GPU library).

  40. Outline: Data Parallelism + Task Parallelism on CPU and GPU
  • BREAKING SEQUENTIALITY: Floyd-Steinberg Dithering
  • WORK SHARING: Floyd-Steinberg Dithering (Hybrid and Handover Algorithms)
  • ADDRESSING IRREGULARITY: String Sorting + Burrows Wheeler Transform
  • PIPELINING: Burrows Wheeler Compression (Hybrid and All-Core Algorithms)

  41. Burrows Wheeler Transform
  • Input string: I[1…N].
  • Sort all cyclically shifted strings of I[1…N] (see the sketch below).
  • The last column of the sorted rotations, together with the index of the original string, is the BWT.
  • O(N) strings are sorted, each of length O(N).
  • Our GPU string sort works when ties span only a few characters (~100s).
  • The suffix sort in BWT has longer ties, 10^3 to 10^5 characters.
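
  A minimal sketch of the definition above: sort all cyclic rotations and read off the last column. This O(N^2 log N) worst-case version only makes the transform concrete; it is not the GPU algorithm.

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <string>
    #include <utility>
    #include <vector>

    // Returns (last column, index of the row holding the original string).
    std::pair<std::string, size_t> bwt(const std::string& s) {
        const size_t n = s.size();
        std::vector<size_t> rot(n);                    // rot[i] = start of the i-th rotation
        std::iota(rot.begin(), rot.end(), 0);
        std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
            for (size_t k = 0; k < n; ++k) {           // compare rotations character by character
                char ca = s[(a + k) % n], cb = s[(b + k) % n];
                if (ca != cb) return ca < cb;
            }
            return false;                              // identical rotations
        });
        std::string last(n, '\0');
        size_t primary = 0;
        for (size_t i = 0; i < n; ++i) {
            if (rot[i] == 0) primary = i;
            last[i] = s[(rot[i] + n - 1) % n];         // last column = char preceding the rotation
        }
        return {last, primary};
    }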

  42. Modified String Sort for BWT: Doubling MCU Length
  • The MCU length determines the number of sort steps.
  • Long ties require a large number of sort steps and thus a longer runtime.
  • Use a fixed-length MCU initially, then double it.
  • Fixed-length sorts are inexpensive initially; doubling curtails the number of sort steps eventually.
  • 1.5 to 2.5x speedup (e.g. enwik8: 1.06s down to 0.58s per block).

  43. Modified String Sort for BWT: Partial GPU Sort and CPU Merge
  • Cyclically shifted strings have a special property.
  • We can sort only 2/3 of the strings and synthesize the rest without iterative sorting.
  • Sort all strings starting at positions with (index mod 3) ≠ 0 iteratively.
  • For a (mod 3) = 0 string, its first character plus the rank of its successor in the 2/3 sort is enough to sort the remaining 1/3 of the strings.
  • A non-iterative overlapped merge of the two sets is also possible (on the CPU).

  44. GPU BWT: Datasets
  • enwik8: first 10^8 bytes of the English Wikipedia dump (96MB).
  • wiki-xml: Wikipedia XML dump (151MB).
  • linux-2.6.11.tar: publicly available Linux kernel sources (199MB).
  • Silesia corpus: data-compression benchmark (208MB).
  • (Plot: tie length vs. block size.)

  45. Runtime: GPU BWT vs. Bzip2 BWT
  • No speedup for small blocks; the GPU is not utilized sufficiently.
  • Speedup on large blocks.
  • The GPU is still slow on the worst-case linux dataset.

  46. String Perturbation
  • A large number of sort steps results from repeated substrings/long ties.
  • Runtime reduces greatly if we break the ties.
  • Perturbation: add random characters at a fixed interval to break ties.
  • Useful for applications where the BWT-transformed string itself is irrelevant and BWT + IBWT are used in pairs (viz. BW compression).
  • The fixed-interval perturbation can be removed after the IBWT.
  • linux, 9MB blocks: 8.2x speedup with 0.1% perturbation.
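
  A minimal sketch of the perturbation idea: inject one pseudo-random byte after every P input bytes before the BWT, and strip exactly those positions again after the inverse BWT. The interval P and the RNG are assumptions, not the thesis parameters.

    #include <random>
    #include <string>

    std::string perturb(const std::string& in, size_t P, unsigned seed) {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<int> byte(0, 255);
        std::string out;
        out.reserve(in.size() + in.size() / P + 1);
        for (size_t i = 0; i < in.size(); ++i) {
            out.push_back(in[i]);
            if ((i + 1) % P == 0)                            // after every P-th input byte
                out.push_back(static_cast<char>(byte(rng))); // tie-breaking random byte
        }
        return out;
    }

    std::string unperturb(const std::string& in, size_t P) {
        std::string out;
        out.reserve(in.size());
        for (size_t i = 0; i < in.size(); ++i)
            if ((i + 1) % (P + 1) != 0)                      // drop the injected positions
                out.push_back(in[i]);
        return out;
    }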

  47. Outline: Data Parallelism + Task Parallelism on CPU and GPU
  • BREAKING SEQUENTIALITY: Floyd-Steinberg Dithering
  • WORK SHARING: Floyd-Steinberg Dithering (Hybrid and Handover Algorithms)
  • ADDRESSING IRREGULARITY: String Sorting + Burrows Wheeler Transform
  • PIPELINING: Burrows Wheeler Compression (Hybrid and All-Core Algorithms)

  48. Burrows Wheeler Compression
  • A three-step procedure: the file is divided into blocks and the following steps are applied to each block.
    • Burrows Wheeler Transform: suffix sort and use the last column (most compute intensive).
    • Move-to-Front Transform: similar to run-length encoding (~10% of runtime).
    • Huffman Encoding: standard character-frequency based encoding (~10% of runtime).
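
  A minimal sketch of the move-to-front step used in the middle of the pipeline (byte alphabet assumed): each output value is the current rank of the input symbol, and that symbol then moves to the front of the rank list.

    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <string>
    #include <vector>

    std::vector<uint8_t> move_to_front(const std::string& in) {
        std::vector<uint8_t> table(256);
        std::iota(table.begin(), table.end(), 0);        // initial ranks 0..255
        std::vector<uint8_t> out;
        out.reserve(in.size());
        for (unsigned char c : in) {
            auto it = std::find(table.begin(), table.end(), c);
            uint8_t rank = static_cast<uint8_t>(it - table.begin());
            out.push_back(rank);                         // emit current rank of the symbol
            table.erase(it);                             // move the symbol to the front
            table.insert(table.begin(), c);
        }
        return out;                                      // runs of equal BWT symbols become runs of zeros
    }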

  49. BWC Pipelining: Hybrid BWC
  • Patel et al. did all three steps on the GPU and reported a 2.78x slowdown.
  • Map each operation to the appropriate compute platform: the GPU performs the sorts of the BWT, while the CPU does the sequential merge, MTF and Huffman coding.
  • Pipeline the blocks so that CPU computation overlaps with GPU computation: while the GPU sorts block i (2/3 sort, then 1/3 sort), the CPU merges and applies MTF + Huffman to block i-1.
  • Throughput of BWC equals that of BWT alone, barring the offset of the first and last blocks.
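
  A minimal two-stage pipeline sketch of the overlap described above: while the GPU stage transforms block i, the CPU stage finishes block i-1. gpu_bwt and cpu_merge_mtf_huffman are hypothetical stand-ins for the thesis kernels/routines, stubbed here so the skeleton compiles.

    #include <future>
    #include <string>
    #include <vector>

    struct SortedBlock { std::string data; };            // placeholder intermediate result
    struct CompressedBlock { std::string data; };        // placeholder output

    SortedBlock gpu_bwt(const std::string& block) {      // stub: would launch the GPU sorts
        return {block};
    }
    CompressedBlock cpu_merge_mtf_huffman(const SortedBlock& b) {   // stub: CPU back end
        return {b.data};
    }

    std::vector<CompressedBlock> hybrid_bwc(const std::vector<std::string>& blocks) {
        std::vector<CompressedBlock> out;
        std::future<CompressedBlock> pending;            // CPU work for the previous block
        for (const auto& block : blocks) {
            SortedBlock sorted = gpu_bwt(block);          // GPU stage for block i
            if (pending.valid()) out.push_back(pending.get());      // collect block i-1
            pending = std::async(std::launch::async,      // CPU stage runs concurrently
                                 cpu_merge_mtf_huffman, sorted);
        }
        if (pending.valid()) out.push_back(pending.get());
        return out;
    }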

  50. BWC Pipelining: All-Core BWC
  • The system is viewed as a set of CoSts:
    • The GPU together with its controlling CPU thread is one CoSt.
    • Every other CPU core is a CoSt.
  • Blocks are split across CoSts, dequeued from a shared work queue.
  • The GPU CoSt runs Hybrid BWC.
  • Each CPU CoSt runs the best CPU BWC, by Seward (i.e. Bzip2).
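
  A minimal sketch of the all-core scheme: each CoSt (one GPU-controlling thread plus one thread per remaining CPU core) pulls the next block index from a shared atomic counter until the work queue is exhausted. The two per-block routines are hypothetical stubs, not the thesis code.

    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    void run_hybrid_bwc(std::size_t block) { (void)block; }   // stub: GPU CoSt path
    void run_bzip2_bwc(std::size_t block) { (void)block; }    // stub: CPU CoSt path

    void all_core_bwc(std::size_t num_blocks, unsigned cpu_costs) {
        std::atomic<std::size_t> next{0};                      // shared work queue of block indices
        auto worker = [&](bool is_gpu_cost) {
            for (std::size_t b = next.fetch_add(1); b < num_blocks; b = next.fetch_add(1))
                is_gpu_cost ? run_hybrid_bwc(b) : run_bzip2_bwc(b);
        };
        std::vector<std::thread> pool;
        pool.emplace_back(worker, true);                       // GPU CoSt
        for (unsigned i = 0; i < cpu_costs; ++i)
            pool.emplace_back(worker, false);                  // CPU CoSts
        for (auto& t : pool) t.join();
    }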
