
Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping



Presentation Transcript


  1. Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping
  Nathan Clark, Amir Hormati, Scott Mahlke, Sami Yehia*, Krisztián Flautner*
  University of Michigan, *ARM Ltd.

  2. Computational Efficiency
  • Low power envelope
  • More useful work/transistors
  • Hardware accelerators
  • Niagara II encryption engine
  Source: AMD Analyst Day 12/14/06

  3. How Are Accelerators Used?
  • Control statically placed in binary
  [Figure: program binary directing control to the CPU and the accelerator]

  4. Problem With Static Control
  • Not forward/backward compatible
  [Figure: one binary cannot target CPUs with different accelerators]

  5. Solution: Virtualization
  • Statically identify accelerated computation
  • Abstract accelerator features
  • Dynamically retarget binary
  [Figure: engineer/compiler emits one program; a translator on each processor maps it onto that processor's accelerator]

  6. Liquid SIMD
  • Virtualize SIMD accelerators
  • Why virtualize SIMD?
  • Intel MMX to SSE2
  • ARM v6 to Neon
  • Wide vectors useful [Lin 06]

  7. SIMD Accelerator Assumptions
  • Same instruction stream
  • Separate pipeline – memory interface
  [Figure: shared Fetch/Decode/Retire stages with parallel Scalar Exec and SIMD Exec units]

  8. How to Virtualize
  • Use scalar ISA to represent SIMD operations
  • Compatibility, low overhead
  • Key: easy to translate

  9. Virtualization Architecture
  [Figure: pipeline with Fetch, Decode, Execute, Retire; a translator fills a uCode cache feeding the accelerator]

  10. 1. Data Parallel Operations
  for (i = 0; i < 8; i++) {
      r1 = A[i];
      r2 = B[i];
      r3 = r1 + r2;
      r4 = r3 & constant;
      C[i] = r4;
  }
  [Figure: elements of A and B flowing through +, then & constant, into C]
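The loop on slide 10 can be written as runnable C. A minimal sketch; the function name and the concrete value of the slide's `constant` are assumptions. Because each iteration is independent, a translator can map groups of iterations directly onto SIMD lanes.

```c
#include <stdint.h>

enum { N = 8, MASK = 0x0F }; /* MASK stands in for the slide's "constant" */

/* Scalar form of the data-parallel loop from slide 10: no iteration
 * reads a value written by another, so 4 or 8 iterations can execute
 * as one SIMD operation per statement. */
void add_and_mask(const uint8_t *A, const uint8_t *B, uint8_t *C) {
    for (int i = 0; i < N; i++) {
        uint8_t r1 = A[i];
        uint8_t r2 = B[i];
        uint8_t r3 = r1 + r2;          /* lane-wise add  */
        uint8_t r4 = r3 & MASK;        /* lane-wise and  */
        C[i] = r4;
    }
}
```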

  11. 1a. What If There’s No Scalar Equivalent?
  for (i = 0; i < 8; i++) {
      r1 = A[i];
      r2 = B[i];
      r3 = r1 + r2;
      cmp r3, #FF;
      r3 = movgt #FF;
      ...
  }
  • Idioms can always be constructed
  [Figure: saturating add (SADD) of A and B]
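The cmp/movgt pair on slide 11 is a compare-and-clamp idiom standing in for a saturating add (SADD), which has no single scalar instruction. A C sketch of that idiom; the function name is an assumption:

```c
#include <stdint.h>

/* Scalar idiom for an unsigned 8-bit saturating add: the add followed
 * by compare-and-clamp mirrors the slide's "cmp r3, #FF; movgt r3, #FF"
 * sequence, which a translator can recognize and map back to SADD. */
uint32_t sadd_u8(uint32_t a, uint32_t b) {
    uint32_t r3 = a + b;   /* r3 = r1 + r2   */
    if (r3 > 0xFF)         /* cmp  r3, #FF   */
        r3 = 0xFF;         /* movgt r3, #FF  */
    return r3;
}
```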

  12. 2. Scalarizing Permutations
  for (i = 0; i < 8; i++) {
      …
      r1 = r2 + r3;
      tmp[i] = r1;
  }
  for (i = 0; i < 8; i++) {
      r1 = offset[i];
      r2 = tmp[r1 + i];
      r3 = r2 & const;
      …
  }
  offset = {4, 4, 4, 4, -4, -4, -4, -4}
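The offset table on slide 12 encodes the permutation in scalar form: reading `tmp[offset[i] + i]` with the table {4, 4, 4, 4, -4, -4, -4, -4} swaps the two 4-element halves of the vector. A minimal runnable sketch of just that read step; the function name is an assumption:

```c
#include <stdint.h>

enum { N = 8 };

/* Offset table from slide 12: element i reads from i + offset[i],
 * which exchanges the low and high 4-element halves. A translator
 * can invert this table back into a single SIMD shuffle. */
static const int offset[N] = {4, 4, 4, 4, -4, -4, -4, -4};

void permute_halves(const uint8_t *tmp, uint8_t *out) {
    for (int i = 0; i < N; i++) {
        int r1 = offset[i];      /* r1 = offset[i]   */
        out[i] = tmp[r1 + i];    /* r2 = tmp[r1 + i] */
    }
}
```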

  13. 3. Scalarizing Reductions
  for (i = 0; i < 8; i++) {
      …
      r1 = A[i];
      r2 = r2 + r1;
      …
  }
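Slide 13's accumulator loop carries a dependence through r2, so a SIMD version cannot simply widen it. A sketch of how a translation might proceed, with per-lane partial sums plus one final horizontal add; the lane-splitting shown here is an illustrative assumption, not the paper's exact rule:

```c
#include <stdint.h>

enum { N = 8, LANES = 4 };

/* Scalar reduction as on slide 13: r2 accumulates every A[i]. */
int32_t reduce_scalar(const int32_t *A) {
    int32_t r2 = 0;
    for (int i = 0; i < N; i++)
        r2 = r2 + A[i];
    return r2;
}

/* Possible SIMD translation: keep LANES independent partial sums
 * (one per vector lane), then combine them with a horizontal add. */
int32_t reduce_lanes(const int32_t *A) {
    int32_t lane[LANES] = {0, 0, 0, 0};
    for (int i = 0; i < N; i += LANES)
        for (int l = 0; l < LANES; l++)
            lane[l] += A[i + l];           /* vector add per step */
    return lane[0] + lane[1] + lane[2] + lane[3]; /* horizontal add */
}
```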

  14. Applied to ARM Neon
  • All instructions supported except…
  • VTBL – indirect indexing: v1 = vtbl v2, v3
  • Interleaved memory accesses
  • Not needed in evaluated benchmarks
  [Figure: VTBL selecting elements of v2 into v1 using indices {1, 0, 1, 3} held in v3]
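To see why VTBL resists the scalarization rules: its lane indices are data held in a register, not constants a translator can invert. A C model of the lookup semantics (VTBL yields 0 for out-of-range indices); the 8-byte width and function name are assumptions for illustration:

```c
#include <stdint.h>

enum { VLEN = 8 };

/* C model of v1 = vtbl v2, v3: each result byte is v2 indexed by the
 * corresponding byte of v3. Because the indices are runtime data, there
 * is no fixed offset table to recognize, unlike slide 12's permutation. */
void vtbl_model(uint8_t *v1, const uint8_t *v2, const uint8_t *v3) {
    for (int i = 0; i < VLEN; i++)
        v1[i] = (v3[i] < VLEN) ? v2[v3[i]] : 0; /* out of range -> 0 */
}
```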

  15. Translation to SIMD
  • Update induction variable: i += 4
  • Use inverse of defined translation rules
  Scalar loop:
  for (i = 0; i < 8; i++) {
      r1 = A[i];
      r2 = B[i];
      r3 = r1 + r2;
      r4 = offset[i];
      C[i + r4] = r3;
  }
  Translated SIMD loop:
  for (i = 0; i < 8; i += 4) {
      v1 = A[i];
      v2 = B[i];
      v3 = v1 + v2;
      v3 = shuffle v3;
      C[i] = v3;
  }
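Slide 15's transformation can be checked end to end in C: the scalar loop's offset-table store and the 4-wide loop's shuffle-then-contiguous-store must fill C identically. The slide leaves its offset table unspecified, so this sketch assumes {1, -1, …} (an adjacent-pair swap, a permutation a 4-wide in-vector shuffle can express); function names are also assumptions.

```c
#include <stdint.h>

enum { N = 8, W = 4 };

/* Assumed offset table: swaps adjacent elements, so the permutation
 * stays within one 4-element vector and a shuffle can express it. */
static const int offset[N] = {1, -1, 1, -1, 1, -1, 1, -1};

/* Scalar form: compute, then store through the offset table. */
void scalar_version(const int32_t *A, const int32_t *B, int32_t *C) {
    for (int i = 0; i < N; i++) {
        int32_t r3 = A[i] + B[i];
        C[i + offset[i]] = r3;         /* C[i + r4] = r3 */
    }
}

/* Translated form: i += 4; the offset store becomes an in-vector
 * shuffle followed by a contiguous store (vectors modeled as arrays). */
void simd_version(const int32_t *A, const int32_t *B, int32_t *C) {
    for (int i = 0; i < N; i += W) {
        int32_t v3[W], s[W];
        for (int l = 0; l < W; l++)
            v3[l] = A[i + l] + B[i + l];   /* v3 = v1 + v2   */
        for (int l = 0; l < W; l++)
            s[l + offset[i + l]] = v3[l];  /* v3 = shuffle v3 */
        for (int l = 0; l < W; l++)
            C[i + l] = s[l];               /* C[i] = v3      */
    }
}
```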

  16. Translator Design
  • Translator goals: efficiency, speed, flexibility
  [Figure: engineer/compiler produces one program; per-processor translators retarget it to each processor's accelerator]

  17. Evaluation
  • Trimaran ARM compiler
  • Hand-SIMDized loops
  • SimpleScalar model of ARM926 w/ Neon SIMD
  • VHDL translator, 130nm std. cell

  18. Liquid SIMD Issues
  • Code bloat: <1% overhead beyond baseline
  • Register pressure: not a problem
  • Translator cost: 0.2 mm² + 2KB cache
  • Translation overhead

  19. Translation Overhead
  [Chart: translation overhead across MediaBench, kernel, and SPECfp benchmarks]

  20. Summary
  • Accelerators are more common and evolving
  • Binary migration is costly
  • SIMD virtualization using the scalar ISA
  • One binary: forward/backward compatibility
  • Negligible overhead

  21. Questions?
