
Floating Point Analysis Using Dyninst




  1. Floating Point Analysis Using Dyninst Mike Lam University of Maryland, College Park Jeff Hollingsworth, Advisor

  2. Background
• Floating point represents real numbers as ±f × 2^exp
  • Sign bit
  • Exponent (exp)
  • Significand f ("mantissa" or "fraction")
• Finite precision
  • Single precision: 24-bit significand (~7 decimal digits)
  • Double precision: 53-bit significand (~16 decimal digits)
(Figure: IEEE bit layouts — single: 8-bit exponent, 23-bit significand field; double: 11-bit exponent, 52-bit significand field)

  3. Motivation
• Finite precision causes round-off error
  • Compromises certain calculations
  • Hard to detect and diagnose
• Increasingly important as HPC scales
  • Computation on streaming processors is faster in single precision
  • Data movement in double precision is a bottleneck
• Need to balance speed (singles) and accuracy (doubles)

  4. Our Goal
Automated analysis techniques to inform developers about floating point behavior and make recommendations regarding the use of floating point arithmetic.

  5. Framework
CRAFT: Configurable Runtime Analysis for Floating-point Tuning
• Static binary instrumentation
  • Read configuration settings
  • Replace floating-point instructions with new code
  • Rewrite modified binary
• Dynamic analysis
  • Run modified program on representative data set
  • Produce results and recommendations

  6. Previous Work
• Cancellation detection
  • Reports loss of precision due to subtraction
  • Paper appeared in WHIST '11
• Range tracking
  • Reports min/max values
• Replacement
  • Implements mixed-precision configurations
  • Paper to appear in ICS '13

  7. Mixed Precision
• Use double precision where necessary
• Use single precision everywhere else
• Can be difficult to implement

Mixed-precision linear solver algorithm:
 1: LU ← PA
 2: solve Ly = Pb
 3: solve Ux_0 = y
 4: for k = 1, 2, ... do
 5:   r_k ← b − Ax_{k−1}
 6:   solve Ly = Pr_k
 7:   solve Uz_k = y
 8:   x_k ← x_{k−1} + z_k
 9:   check for convergence
10: end for
Red text in the original slide indicates steps performed in double precision (all other steps are single precision).

  8. Configuration

  9. Implementation
• In-place replacement
  • Narrowed focus: doubles → singles
  • In-place downcast conversion
  • Flag in the high bits to indicate replacement
(Figure: double → replaced double downcast conversion; the replaced double holds the non-signalling NaN flag 0x7FF4DEAD in its high 32 bits and the single-precision value in its low 32 bits)

  10. Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar

1 movsd 0x601e38(%rax,%rbx,8) → %xmm0
2 mulsd -0x78(%rsp) → %xmm0
3 addsd -0x4f02(%rip) → %xmm0
4 movsd %xmm0 → 0x601e38(%rax,%rbx,8)

  11. Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar

1 movsd 0x601e38(%rax,%rbx,8) → %xmm0
  check/replace -0x78(%rsp) and %xmm0
2 mulss -0x78(%rsp) → %xmm0
  check/replace -0x4f02(%rip) and %xmm0
3 addss -0x20dd43(%rip) → %xmm0
4 movsd %xmm0 → 0x601e38(%rax,%rbx,8)

  12. Block Editing (PatchAPI)
(Figure: block-editing diagram — the block is split around the original instruction, with inserted initialization, check/replace, double → single conversion, and cleanup snippets)

  13. Automated Search
• Manual mixed-precision analysis
  • Hard to use without intuition regarding potential replacements
• Automatic mixed-precision analysis
  • Try lots of configurations (empirical auto-tuning)
  • Test with user-defined verification routine and data set
  • Exploit program control structure: replace larger structures (modules, functions) first
  • If coarse-grained replacements fail, try finer-grained subcomponent replacements

  14. System Overview

  15. NAS Results

  16. AMGmk Results
• Algebraic MultiGrid microkernel
  • Multigrid method is highly adaptive
  • Good candidate for replacement
• Automatic search
  • Complete conversion (100% replacement)
• Manually-rewritten version
  • Speedup: 175 sec to 95 sec (1.8×)
  • Conventional x86_64 hardware

  17. SuperLU Results
• Package for LU decomposition and linear solves
  • Reports final error residual
  • Both single- and double-precision versions
• Verified manual conversion via automatic search
  • Used error from provided single-precision version as threshold
  • Final config matched single-precision profile (99.9% replacement)

  18. Retrospective
• Twofold original motivation
  • Faster computation (raw FLOPs)
  • Decreased storage footprint and memory bandwidth
  • Domains vary in sensitivity to these parameters
• Computation-centric analysis
  • Less insight for memory-constrained domains
  • Sometimes difficult to translate instruction-level recommendations into source-level transformations
• Data-centric analysis
  • Focus on data motion, which is closer to source-level structures

  19. Current Project
• Memory-based replacement
  • Perform all computation in double precision
  • Save storage space by storing single-precision values in some cases
• Implementation
  • Register-based computation remains double precision
  • Replace movement instructions (movsd)
  • Memory to register: check and upcast
  • Register to memory: downcast if configured
• Search for replaceable writes instead of replaceable computations

  20. Preliminary Results
All benchmarks were single-core versions compiled with the Intel Fortran compiler with optimization enabled. Tests were performed on an Intel workstation with 48 GB of RAM running 64-bit Linux.

  21.–27. (Figures; no transcript text)

  28. Future Work
• Case studies
• Search convergence study

  29. Conclusion
Automated binary instrumentation techniques can be used to implement mixed-precision configurations for floating-point code, and memory-based replacement provides actionable results.

  30. Thank you! sf.net/p/crafthpc
