1 / 24

Yuanyuan Sun Feiteng Yang

Using SIMD Registers and instructions to Enable Instruction-Level Parallelism in Sorting Algorithms. Yuanyuan Sun Feiteng Yang. Source. Source ACM Symposium on Parallel Algorithms and Architectures

krysta
Download Presentation

Yuanyuan Sun Feiteng Yang

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Using SIMD Registers and instructions to Enable Instruction-Level Parallelism in Sorting Algorithms Yuanyuan Sun Feiteng Yang

  2. Source • Source ACM Symposium on Parallel Algorithms and Architectures Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures • Authors Timothy Furtak   José Nelson Amaral Robert Niewiadomski

  3. Outline • Introduction • Sorting network • Sorting algorithms • Experimental evaluation • Contributions

  4. Introduction • Use SIMD resources to improve the performance of sorting algorithms for short sequence. • Initial inspiration: need for Fast sorting of short sequences implementation of Graphics rendering in interactive video game • SIMD machineries

  5. Introduction SIMD machineries • X86-64’s SSE2 (Streaming SIMD Extensions 2) • G5’s AltiVec AltiVec,SSE2: SIMD instruction sets, both feature 128-bit vector registers

  6. Sorting network • a comparator network produces a sorted output for any possible input sequence. • COMP(a, b) — the inputs are two storage units: memory locations, registers, or vector-register elements — a and b, each containing a numerical input.

  7. Sorting network • Size: the total number of comparators in the network. • Depth: the length of the critical path in its dependence graph.

  8. Sorting network

  9. Sorting network • A comparator moves the larger value to the left, and the smaller value to the right. • For instance, Figure1 size=5,width=3; Inputs: a = 7, b = 2, c = 5, d = 9 Output: a = 9, b = 7, c = 5, d = 2.

  10. Supporting hardware for Sorting Network • The comparator required by a sorting network is easily constructed using these two operations, a copy instruction, and a temporary variable. • Min and max instructions min(a, b) = a : a ≤ b b : otherwise max(a, b) = a : a ≥ b b : otherwise

  11. Supporting hardware for Sorting Network • x86-64 architectures supports the SSE2 min and max operations that return the minimum (maximum) packed single-precision floating-point values.

  12. Supporting hardware for Sorting Network • Width: the number of vectors being sorted. x86-64 has 16 XMM vector registers, and each register can hold 4 floating-point values. Sorting the values in n XMM registers using a sorting network produces 4 sorted streams of data of length n. 1 ≤ n < 16, one register must be reserved as temporary storage for the swap of values.

  13. Three sorting methods • Two pass sorting with insertion sorting • Two pass sorting with merge sorting • One pass sorting (Register sorting)

  14. Tow pass sorting • In the first phase the SIMD registers and instructions are used to generate a partially-sorted output. • In the second phase a standard sorting algorithm — insertion sort and mergesort are investigated in this paper — finishes the sorting.

  15. First phase: SIMD sort Vector registers …… After SIMD sort:

  16. Second phase • Insertion sort • Merge sort A1<A5<A9 A2<A6<A10 A3<A7<A11 A4<A8<A12 A1<A2<A3 A4<A5<A6 A7<A8<A9 A10<A11<A12

  17. One pass sorting (Register sorting) • Algorithm input • Initial state • Align a set of comparators • Write values back to memory

  18. 4-elements example • P1={comp(a,c) comp(b,d)} • P2={comp(a,b) comp(c,d)} • P3={comp(b,c)}

  19. One concrete example

  20. SSE2 instructions used

  21. The method is also applied to sort Key-pointer pairs and D-heaps.

  22. Evaluation

  23. Contributions • Effectively use SIMD resources to improve performance of sorting short sequence through the reduction of memory references and increases in ILP.

  24. Contributions • 1.three algorithms that use the SIMD machinery for efficient in-register sorting of short sequences • 2.a method to use iterative-deepening search to find fast instruction sequences to move data within the SIMD registers • 3.an extensive experimental study that indicates the elimination of loads, stores, branches correlates well with improvement performance.

More Related