
Sorting and Searching

Sorting and Searching. Timothy J. Purcell (Stanford / NVIDIA); updated by Gary J. Katz (U. of Pennsylvania), based on GPUTeraSort (MSR TR-2005-183). Topics: sorting, sorting networks, search, binary search, nearest neighbor search. Assumptions: data organized into 1D arrays.


Presentation Transcript


  1. Sorting and Searching • Timothy J. Purcell, Stanford / NVIDIA • Updated by Gary J. Katz, U. of Pennsylvania, based on GPUTeraSort (MSR TR-2005-183)

  2. Topics • Sorting • Sorting networks • Search • Binary search • Nearest neighbor search

  3. Assumptions • Data organized into 1D arrays • Rendering pass == screen aligned quad • Not using vertex shaders • PS 2.0 GPU • No data dependent branching at fragment level

  4. Sorting

  5. Sorting • Given an unordered list of elements, produce a list ordered by key value • Kernel: compare and swap • The GPU's constrained programming environment limits the viable algorithms • Bitonic merge sort [Batcher 68] • Periodic balanced sorting networks [Dowd 89]

  6. Bitonic Merge Sort Overview • Repeatedly build bitonic lists and then sort them • Bitonic list is two monotonic lists concatenated together, one increasing and one decreasing. • List A: (3, 4, 7, 8) monotonically increasing • List B: (6, 5, 2, 1) monotonically decreasing • List AB: (3, 4, 7, 8, 6, 5, 2, 1) bitonic
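
The definition above can be checked with a few lines of Python. This is a minimal sketch: the function name `is_bitonic` is hypothetical, and it tests only the form used on this slide (an increasing run followed by a decreasing run), not the more general definition that also admits rotations.

```python
def is_bitonic(seq):
    """True if seq is a non-decreasing run followed by a non-increasing run.

    Assumption: only the "up then down" form from the slide is tested.
    """
    n = len(seq)
    i = 0
    # Walk up the increasing part.
    while i + 1 < n and seq[i] <= seq[i + 1]:
        i += 1
    # Walk down the decreasing part.
    while i + 1 < n and seq[i] >= seq[i + 1]:
        i += 1
    # Bitonic if the two runs together cover the whole sequence.
    return i >= n - 1


list_a = [3, 4, 7, 8]   # monotonically increasing
list_b = [6, 5, 2, 1]   # monotonically decreasing
print(is_bitonic(list_a + list_b))
```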

  7. Bitonic Merge Sort • Input: 3 7 4 8 6 2 1 5 • 8x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5) • 4x bitonic lists: (3,7) (4,8) (6,2) (1,5)

  8. Bitonic Merge Sort • Input: 3 7 4 8 6 2 1 5 • Sort the bitonic lists

  9. Bitonic Merge Sort • After pass 1: 3 7 8 4 2 6 5 1 • 4x monotonic lists: (3,7) (8,4) (2,6) (5,1) • 2x bitonic lists: (3,7,8,4) (2,6,5,1)

  10. Bitonic Merge Sort • After pass 1: 3 7 8 4 2 6 5 1 • Sort the bitonic lists

  11. Bitonic Merge Sort • After pass 2: 3 4 8 7 5 6 2 1 • Sort the bitonic lists

  12. Bitonic Merge Sort • After pass 2: 3 4 8 7 5 6 2 1 • Sort the bitonic lists

  13. Bitonic Merge Sort • After pass 3: 3 4 7 8 6 5 2 1 • 2x monotonic lists: (3,4,7,8) (6,5,2,1) • 1x bitonic list: (3,4,7,8,6,5,2,1)

  14. Bitonic Merge Sort • After pass 3: 3 4 7 8 6 5 2 1 • Sort the bitonic list

  15. Bitonic Merge Sort • After pass 4: 3 4 2 1 6 5 7 8 • Sort the bitonic list

  16. Bitonic Merge Sort • After pass 4: 3 4 2 1 6 5 7 8 • Sort the bitonic list

  17. Bitonic Merge Sort • After pass 5: 2 1 3 4 6 5 7 8 • Sort the bitonic list

  18. Bitonic Merge Sort • After pass 5: 2 1 3 4 6 5 7 8 • Sort the bitonic list

  19. Bitonic Merge Sort • After pass 6: 1 2 3 4 5 6 7 8 • Done!

  20. Bitonic Merge Sort Summary • Separate rendering pass for each set of swaps • O(log² n) passes • Each pass performs n compare/swaps • Total compare/swaps: O(n log² n) • Limitations of the GPU cost us a factor of log n over the best CPU-based sorting algorithms
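
The pass structure summarized above can be sketched on the CPU. This is an illustrative Python sketch, not the shader from the talk: the function name `bitonic_sort_passes` is hypothetical, the input length is assumed to be a power of two, and each "pass" reads one buffer and writes another, loosely mimicking one rendering pass in which every output element performs one comparison.

```python
def bitonic_sort_passes(data):
    """Sort data with a bitonic network; return (sorted_list, list_of_passes).

    Assumption: len(data) is a power of two.
    """
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = list(data)
    passes = []
    k = 2                      # size of the bitonic lists being merged
    while k <= n:
        j = k // 2             # distance between compared elements
        while j >= 1:
            b = a[:]           # output buffer for this pass
            for i in range(n):             # one "pixel" per element
                partner = i ^ j
                ascending = (i & k) == 0   # sort direction of this sub-list
                lo = min(a[i], a[partner])
                hi = max(a[i], a[partner])
                keep_min = (i < partner) == ascending
                b[i] = lo if keep_min else hi
            a = b
            passes.append(a)
            j //= 2
        k *= 2
    return a, passes


result, passes = bitonic_sort_passes([3, 7, 4, 8, 6, 2, 1, 5])
for number, state in enumerate(passes, start=1):
    print(number, state)
```

For the 8-element example from the slides this performs 6 passes, which is log n (log n + 1) / 2 for n = 8, in line with the O(log² n) bound.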

  21. Limitations to GPU Sorting • Data Size: Limited to databases that fit in GPU memory • Limit on Key Size: Sort keys limited to 32-bit floating point operands. • Efficiency: Not fast enough to match disk array IO bandwidth.

  22. GPUTeraSort • Created by University of North Carolina and Microsoft • Overcomes previous limitations • Won the Pennysort competition • Outperformed prior CPU or GPU algorithms by 3-10 times

  23. GPUTeraSort • Hybrid sorting algorithm • Reader – Reads the input file into a main memory buffer • Key Generator – Computes the (key, record pointer) pairs from the input buffer • Sorter – Reads and sorts the key-pointer pairs • Reorder – Rearranges the input buffer based on the sorted key-pointer pairs to generate a sorted output buffer • Writer – Asynchronously writes the run to the disk
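
The key generator, sorter, and reorder stages can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration: the function name `sort_run`, the fixed record and key sizes, and the use of Python's built-in sort as a stand-in for the GPU sorter. The reader and writer stages are omitted.

```python
def sort_run(buffer, record_size, key_size):
    """Sort fixed-size records in `buffer` by their leading key bytes.

    Assumptions: records are record_size bytes long and the key is the
    first key_size bytes of each record.
    """
    # Key generator: build (key, record pointer) pairs from the input buffer.
    pairs = [
        (buffer[offset:offset + key_size], offset)
        for offset in range(0, len(buffer), record_size)
    ]
    # Sorter: only the small key-pointer pairs are sorted, not the records.
    # (CPU stand-in for the GPU sorting stage.)
    pairs.sort()
    # Reorder: copy whole records into the output in sorted pointer order.
    return b"".join(buffer[offset:offset + record_size] for _, offset in pairs)


run = sort_run(b"cc11bb22aa33", record_size=4, key_size=2)
print(run)
```

Sorting pairs instead of full records keeps the data moved during sorting small; the records themselves are touched only once, in the reorder step.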

  24. Data Representation • Single-array representation – Texture is represented as a stretched 2D array. A texture of (W, H) can be represented in 2D array form as (4W, H) • Four-array representation – Texture composed of 4 sub-arrays, each sub-array corresponding to a single channel • [Figure: one array with elements a01 to a20; four sub-arrays a, b, c, d stored per texel as (a01 b01 c01 d01) (a02 b02 c02 d02) (a03 b03 c03 d03) (a04 b04 c04 d04) (a05 b05 c05 d05)]

  25. Data Representation • The single-array representation is faster • Mapping: Data transfer operations from CPU to GPU map directly to the single-array representation • Efficient Sorting: Reduces memory accesses for early steps of the algorithm, i.e. steps 1 and 2 can be performed with one texture fetch instead of two.
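
The two layouts can be contrasted with a short Python sketch that packs a flat list of values into 4-channel texels. The function names and the exact packing order are assumptions for illustration; the slides do not spell out the index mapping.

```python
def pack_single_array(values):
    """Single-array layout: 4 consecutive values share one texel.

    Assumption: len(values) is a multiple of 4.
    """
    return [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]


def pack_four_array(values):
    """Four-array layout: the data is split into 4 sub-arrays, one per channel.

    Texel t then holds element t of each sub-array.
    Assumption: len(values) is a multiple of 4.
    """
    quarter = len(values) // 4
    sub_arrays = [values[c * quarter:(c + 1) * quarter] for c in range(4)]
    return list(zip(*sub_arrays))


values = list(range(8))
print(pack_single_array(values))
print(pack_four_array(values))
```

In the single-array layout, neighboring elements sit in the same texel, which is consistent with the slide's point that the earliest compare/swap steps need only one texture fetch.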

  26. Searching

  27. Types of Search • Search for specific element • Binary search • Search for nearest element(s) • k-nearest neighbor search • Both searches require ordered data

  28. Binary Search • Find a specific element in an ordered list • Implement just like the CPU algorithm • Assuming hardware supports long enough shaders • Finds the first element of a given value v • If v does not exist, finds the smallest element > v • Why use the GPU then? • Search algorithm is sequential, but many searches can be executed in parallel • Number of pixels drawn determines number of searches executed in parallel • 1 pixel == 1 search

  29. Binary Search • Search for v0 • Search starts at center of sorted array • v2 >= v0 so search left half of sub-array • Position: Initialize 4 • Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

  30. Binary Search • Search for v0 • v0 >= v0 so search left half of sub-array • Position: Initialize 4, Step 1: 2 • Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

  31. Binary Search • Search for v0 • v0 >= v0 so search left half of sub-array • Position: Initialize 4, Step 1: 2, Step 2: 1 • Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

  32. Binary Search • Search for v0 • At this point, we either have found v0 or are 1 element too far left • One last step to resolve • Position: Initialize 4, Step 1: 2, Step 2: 1, Step 3: 0 • Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

  33. Binary Search • Search for v0 • Done! • Position: Initialize 4, Step 1: 2, Step 2: 1, Step 3: 0, Step 4: 0 • Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

  34. Binary Search • Search for v0 and v2 • Search starts at center of sorted array • Both searches proceed to the left half of the array • Positions (v0, v2): Initialize 4, 4 • Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

  35. Binary Search • Search for v0 and v2 • The search for v0 continues as before • The search for v2 overshot, so go back to the right • Positions (v0, v2): Initialize 4, 4; Step 1: 2, 2 • Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

  36. Binary Search • Search for v0 and v2 • We’ve found the proper v2, but are still looking for v0 • Both searches continue • Positions (v0, v2): Initialize 4, 4; Step 1: 2, 2; Step 2: 1, 3 • Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

  37. Binary Search • Search for v0 and v2 • Now, we’ve found the proper v0, but overshot v2 • The cleanup step takes care of this • Positions (v0, v2): Initialize 4, 4; Step 1: 2, 2; Step 2: 1, 3; Step 3: 0, 2 • Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

  38. Binary Search • Search for v0 and v2 • Done! Both v0 and v2 are located properly • Positions (v0, v2): Initialize 4, 4; Step 1: 2, 2; Step 2: 1, 3; Step 3: 0, 2; Step 4: 0, 3 • Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

  39. Binary Search Summary • Single rendering pass • Each pixel drawn performs independent search • O(log n) steps
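
The fixed-step search traced on the slides can be sketched in Python. This is a reconstruction under assumptions, not the original shader: the function name `fixed_step_search` is hypothetical, the list length is assumed to be a power of two, and the keys v0, v2, v5 are modeled as the integers 0, 2, 5. Every search runs the same number of steps regardless of the data, which fits a GPU without data-dependent branching.

```python
def fixed_step_search(sorted_list, v):
    """Return (index, trace): index of the first element >= v.

    Assumption: len(sorted_list) is a power of two.
    If every element is smaller than v, index == len(sorted_list).
    """
    n = len(sorted_list)
    pos = n // 2               # search starts at the center
    stride = n // 2
    trace = [pos]
    while stride > 1:          # halve the stride and step left or right
        stride //= 2
        pos += -stride if sorted_list[pos] >= v else stride
        trace.append(pos)
    # One more unit step: afterwards we are at the answer or one too far left.
    pos += -1 if sorted_list[pos] >= v else 1
    pos = min(max(pos, 0), n - 1)
    trace.append(pos)
    # Cleanup step: move right by one if we ended up too far left.
    if sorted_list[pos] < v:
        pos += 1
    trace.append(pos)
    return pos, trace


data = [0, 0, 0, 2, 2, 2, 5, 5]
# One independent search per "pixel".
for query in (0, 2):
    print(query, fixed_step_search(data, query))
```

For the two queries above, the recorded traces follow the positions on the slides: 4, 2, 1, 0, 0 for v0 and 4, 2, 3, 2, 3 for v2.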

  40. Nearest Neighbor Search

  41. Nearest Neighbor Search • Given a sample point p, find the k points nearest p within a data set • On the CPU, this is easily done with a heap or priority queue • Can add or reject neighbors as search progresses • Don’t know how to build one efficiently on GPU • kNN-grid • Can only add neighbors…
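
For contrast, the CPU approach mentioned above (a heap that can both add and reject neighbors) might look like the following Python sketch. The function name `knn_heap` and the use of 2D tuples for points are assumptions for illustration.

```python
import heapq
import math


def knn_heap(points, sample, k):
    """Return the k points nearest to sample, closest first.

    A max-heap (via negated distances) holds the current best k, so the
    farthest kept neighbor can be rejected when a closer point arrives.
    """
    heap = []
    for p in points:
        d = math.dist(p, sample)
        if len(heap) < k:
            heapq.heappush(heap, (-d, p))        # add a neighbor
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, p))     # reject the farthest, add p
    return [p for _, p in sorted(heap, reverse=True)]


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.0, 2.0)]
print(knn_heap(points, (0.0, 0.0), 2))
```

The `heapreplace` call is the "reject" operation that the kNN-grid method on the following slides does without.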

  42. kNN-grid Algorithm • Want 4 neighbors • [Figure legend: sample point, candidate neighbor, neighbors found]

  43. kNN-grid Algorithm • Want 4 neighbors • Candidate neighbors must be within max search radius • Visit voxels in order of distance to sample point

  44. kNN-grid Algorithm • Want 4 neighbors, 1 found • If current number of neighbors found is less than the number requested, grow search radius

  45. kNN-grid Algorithm • Want 4 neighbors, 2 found • If current number of neighbors found is less than the number requested, grow search radius

  46. kNN-grid Algorithm • Want 4 neighbors, 2 found • Don’t add neighbors outside maximum search radius • Don’t grow search radius when neighbor is outside maximum radius

  47. kNN-grid Algorithm • Want 4 neighbors, 3 found • Add neighbors within search radius

  48. kNN-grid Algorithm • Want 4 neighbors, 4 found • Add neighbors within search radius

  49. kNN-grid Algorithm • Want 4 neighbors, 4 found • Don’t expand search radius if enough neighbors already found

  50. kNN-grid Algorithm • Want 4 neighbors, 5 found • Add neighbors within search radius
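
The rules shown across the kNN-grid slides can be collected into a short Python sketch. This is an approximation under stated assumptions, not the original GPU implementation: the function names, the 2D uniform grid with cell size `cell`, and the way voxel distance is measured (sample point to the nearest edge of the voxel) are all illustrative choices.

```python
import math
from collections import defaultdict


def voxel_distance(index, sample, cell):
    """Distance from sample to the closest point of the voxel at `index`."""
    total = 0.0
    for i, s in zip(index, sample):
        lo, hi = i * cell, (i + 1) * cell
        d = max(lo - s, 0.0, s - hi)
        total += d * d
    return math.sqrt(total)


def knn_grid(points, sample, k, max_radius, cell=1.0):
    """Return (neighbors, search_radius). Neighbors are only ever added."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(c / cell) for c in p)].append(p)
    # Visit voxels in order of distance to the sample point.
    voxels = sorted((voxel_distance(idx, sample, cell), idx) for idx in grid)
    found, radius = [], 0.0
    for vdist, idx in voxels:
        if vdist > max_radius:
            break                       # remaining voxels are farther still
        for p in grid[idx]:
            d = math.dist(p, sample)
            if d > max_radius:
                continue                # never add or grow past max radius
            if len(found) < k:
                radius = max(radius, d)  # too few neighbors: grow the radius
                found.append(p)
            elif d <= radius:
                found.append(p)          # enough neighbors: add, do not grow
    return found, radius


pts = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.6), (3.5, 3.5), (9.0, 9.0)]
print(knn_grid(pts, (0.6, 0.6), k=2, max_radius=2.0))
```

Because neighbors can be added but never rejected, the result may hold more than k points, as on the last slide, where 5 neighbors are found although 4 were requested.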
