
Automated Floating-Point Precision Analysis






Presentation Transcript


  1. Automated Floating-Point Precision Analysis
     Michael O. Lam
     Ph.D. Defense, 6 Jan 2014
     Jeff Hollingsworth, Advisor

  2. Context
     • Floating-point arithmetic is ubiquitous

  3. Context
     Floating-point arithmetic represents real numbers as ±1.frac × 2^exp
     • Sign bit
     • Exponent
     • Significand ("mantissa" or "fraction")
     Single precision: 8-bit exponent, 23-bit significand
     Double precision: 11-bit exponent, 52-bit significand

  4. Context
     Representing 2.0:
       Single: 0x40000000    Double: 0x4000000000000000

  5. Context
     Representing 2.625 (= 1.0101 binary × 2^1):
       Single: 0x40280000    Double: 0x4005000000000000

  6. Context
     Representing 0.1:
       Single: 0x3DCCCCCD    Double: 0x3FB999999999999A

  7. Context
     Representing 1.234:
       Single: 0x3F9DF3B6    Double: 0x3FF3BE76C8B43958
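These encodings are easy to check directly. The following small C sketch (added here for illustration, not part of the slides) prints the raw bit patterns of a float and a double; memcpy is used to reinterpret the bits without aliasing problems:

    #include <stdio.h>
    #include <string.h>
    #include <inttypes.h>

    int main(void) {
        float  f = 2.625f;
        double d = 2.625;
        uint32_t fbits;
        uint64_t dbits;
        memcpy(&fbits, &f, sizeof fbits);   /* view the raw bits safely */
        memcpy(&dbits, &d, sizeof dbits);
        printf("single: 0x%08"  PRIX32 "\n", fbits);  /* 0x40280000 */
        printf("double: 0x%016" PRIX64 "\n", dbits);  /* 0x4005000000000000 */
        return 0;
    }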

  8. Context
     • Floating-point is ubiquitous but problematic
     • Rounding error accumulates after many operations and is not always intuitive (e.g., addition is non-associative)
     • Naïve approach: use higher precision everywhere
     • Lower precision is preferable:
       • Tesla K20X is 2.3X faster in single precision
       • Xeon Phi is 2.0X faster in single precision
       • Single precision uses 50% of the memory bandwidth
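The non-associativity point can be seen in a few lines of C (an illustrative example added here, not from the deck):

    #include <stdio.h>

    int main(void) {
        float a = 1e8f, b = -1e8f, c = 1.0f;
        /* (a + b) + c == 1.0, but a + (b + c) == 0.0: adding 1.0 to
         * 1e8 is below half an ulp and is lost in single precision */
        printf("(a+b)+c = %f\n", (a + b) + c);
        printf("a+(b+c) = %f\n", a + (b + c));
        return 0;
    }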

  9. Problem
     • Current analysis solutions are lacking:
       • Numerical analysis methods are difficult
       • Static analysis is too conservative
       • Trial-and-error is time-consuming
     • We need better analysis solutions that:
       • Produce easy-to-understand results
       • Incorporate runtime effects
       • Are automated or semi-automated

  10. Thesis Automated runtime analysis techniques can inform application developers regarding floating-point behavior, and can provide insights to guide developers towards reducing precision with minimal impact on accuracy.

  11. Contributions
      • Floating-point software analysis framework
      • Cancellation detection
      • Mixed-precision configuration
      • Reduced-precision analysis
      Initial emphasis on capability over performance.

  12. Example: Sum2PI_X

      /* SUM2PI_X - approximate pi*x in a computationally-
       * heavy way to demonstrate various CRAFT analyses */

      /* constants */
      #define PI  3.14159265359
      #define EPS 1e-7

      /* loop iterations; OUTER is X */
      #define OUTER 2000
      #define INNER 30

      int sum2pi_x() {
          int i, j, k;
          real x, y, acc, sum;
          real final = PI * OUTER;         /* correct answer */
          sum = 0.0;
          for (i=0; i<OUTER; i++) {
              acc = 0.0;
              for (j=1; j<INNER; j++) {
                  /* calculate 2^j */
                  x = 1.0;
                  for (k=0; k<j; k++)
                      x *= 2.0;            /* 870K execs */
                  /* approximately calculate pi */
                  y = (real)PI / x;        /* 58K execs */
                  acc += y;                /* 58K execs */
              }
              sum += acc;                  /* 2K execs */
          }
          real err = fabs(final-sum)/fabs(final);
          if (err < EPS) printf("SUCCESSFUL!\n");
          else           printf("FAILED!!!\n");
      }

  13. Contribution 1 of 4: Software Framework

  14. Framework
      CRAFT: Configurable Runtime Analysis for Floating-point Tuning

  15. Framework
      • Dyninst: a binary analysis library
        • Parses executable files (InstructionAPI & ParseAPI)
        • Inserts instrumentation (DyninstAPI)
        • Supports full binary modification (PatchAPI)
        • Rewrites binary executable files (SymtabAPI)
      • Binary-level analysis benefits:
        • Programming language-agnostic
        • Supports closed third-party libraries
        • Sensitive to compiler transformations

  16. Framework
      • CRAFT framework:
        • Dyninst-based binary mutator (C/C++)
        • Swing-based GUI viewers (Java)
        • Automated search scripts (Ruby)
      • Proof-of-concept analyses (NaN detection sketched below):
        • Instruction counting
        • Not-a-Number (NaN) detection
        • Range tracking (from Brown et al. 2007)
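As a flavor of the simplest proof-of-concept analysis, NaN detection amounts to inserting a check like the following after each floating-point operation. This is a hedged C sketch added here, not CRAFT's actual instrumentation; check_nan is a hypothetical callback name:

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical callback inserted after each floating-point
     * instruction: report a NaN result and where it occurred. */
    static void check_nan(double result, void *addr) {
        if (isnan(result))
            fprintf(stderr, "NaN produced at %p\n", addr);
    }

    int main(void) {
        volatile double zero = 0.0;
        double x = zero / zero;    /* deliberately produce a NaN */
        check_nan(x, NULL);        /* address would come from instrumentation */
        return 0;
    }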

  17. Sum2PI_X: no NaNs detected

  18. Contribution 2 of 4: Cancellation Detection

  19. Cancellation
      • Loss of significant digits due to subtraction
      • Cancellation detection: instrument every addition and subtraction, report cancellation events
      Examples (digits of precision in parentheses):

          2.491264 (7)            1.613647 (7)
        - 2.491252 (7)          - 1.613647 (7)
          0.000012 (2)            0.000000 (0)
        (5 digits cancelled)    (all digits cancelled)
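The detection criterion can be sketched in C: a subtraction cancels roughly (largest operand exponent - result exponent) leading bits. This is my reconstruction of the idea, not CRAFT's instrumentation code, and cancelled_bits is a hypothetical helper name:

    #include <math.h>
    #include <stdio.h>

    /* Estimate bits cancelled in an addition/subtraction: if the
     * result's exponent is much smaller than the larger operand's,
     * that many leading significand bits were cancelled. */
    static int cancelled_bits(double a, double b, double result) {
        if (result == 0.0) return 53;       /* total cancellation */
        int ea, eb, er;
        frexp(a, &ea);
        frexp(b, &eb);
        frexp(result, &er);
        int max_e = ea > eb ? ea : eb;
        return (max_e > er) ? (max_e - er) : 0;
    }

    int main(void) {
        double a = 2.491264, b = 2.491252;
        /* prints 18 bits, i.e. about 5 decimal digits, matching the slide */
        printf("%d bits cancelled\n", cancelled_bits(a, b, a - b));
        return 0;
    }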

  20. Cancellation: GUI

  21. Cancellation: GUI

  22. Cancellation: Sum2PI_X

  23. Cancellation: Results
      • Gaussian elimination:
        • Detect effects of a small pivot value
        • Highlight algorithmic differences
      • Domain-specific insights:
        • Dense point fields
        • Color saturations
      • Error checking: larger cancellations are better

  24. Cancellation: Conclusions
      • Automated analysis can detect cancellation
      • Cancellation detection serves a wide variety of purposes
      • Later work expanded the ability to identify problematic cancellation [Benz et al. 2012]

  25. Contribution 3 of 4: Mixed Precision

  26. Mixed Precision
      • Tradeoff: single (32 bits) vs. double (64 bits)
      • Single precision is faster:
        • 2X+ computational speedup on recent hardware
        • 50% reduction in memory storage and bandwidth
      • Double precision is more accurate: roughly 16 significant digits vs. 7
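The digit counts are easy to demonstrate (a small C example added for illustration):

    #include <stdio.h>

    int main(void) {
        float  s = 1.0f / 3.0f;
        double d = 1.0  / 3.0;
        printf("single: %.17f\n", s);   /* accurate to ~7 digits  */
        printf("double: %.17f\n", d);   /* accurate to ~16 digits */
        return 0;
    }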

  27. Mixed Precision
      • Most operations use single precision
      • Crucial operations use double precision
      Mixed-precision linear solver [Buttari 2008]:
         1: LU ← PA
         2: solve Ly = Pb
         3: solve Ux_0 = y
         4: for k = 1, 2, ... do
         5:   r_k ← b - A·x_{k-1}
         6:   solve Ly = P·r_k
         7:   solve U·z_k = y
         8:   x_k ← x_{k-1} + z_k
         9:   check for convergence
        10: end for
      Steps 5 and 8 (the residual and the solution update, shown in red on the original slide) run in double precision; all other steps run in single precision.
      • 50% speedup on average (12X in special cases)
      • Difficult to prototype by hand
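For readers who want the shape of this in code, here is an illustrative C sketch of mixed-precision iterative refinement: factorization and triangular solves in single precision, residual and update in double. It omits pivoting for brevity (a real solver would use PA = LU), and lu_factor_s, lu_solve_s, and mixed_solve are hypothetical names standing in for LAPACK-style kernels:

    #include <stdio.h>
    #include <stdlib.h>

    /* Single-precision LU factorization without pivoting
     * (illustrative stand-in for LAPACK's sgetrf). */
    static void lu_factor_s(int n, float *A) {
        for (int k = 0; k < n; k++)
            for (int i = k + 1; i < n; i++) {
                A[i*n+k] /= A[k*n+k];
                for (int j = k + 1; j < n; j++)
                    A[i*n+j] -= A[i*n+k] * A[k*n+j];
            }
    }

    /* Single-precision forward/back substitution (stand-in for sgetrs). */
    static void lu_solve_s(int n, const float *LU, float *x) {
        for (int i = 1; i < n; i++)            /* Ly = b */
            for (int j = 0; j < i; j++)
                x[i] -= LU[i*n+j] * x[j];
        for (int i = n - 1; i >= 0; i--) {     /* Ux = y */
            for (int j = i + 1; j < n; j++)
                x[i] -= LU[i*n+j] * x[j];
            x[i] /= LU[i*n+i];
        }
    }

    /* Mixed-precision iterative refinement: factor/solve in single,
     * residual and update in double. */
    static void mixed_solve(int n, const double *A, const double *b,
                            double *x, int iters) {
        float *A32 = malloc(sizeof(float) * n * n);
        float *v   = malloc(sizeof(float) * n);
        for (int i = 0; i < n * n; i++) A32[i] = (float)A[i];
        lu_factor_s(n, A32);                       /* LU <- A (single) */
        for (int i = 0; i < n; i++) v[i] = (float)b[i];
        lu_solve_s(n, A32, v);                     /* x_0 (single) */
        for (int i = 0; i < n; i++) x[i] = v[i];
        for (int k = 0; k < iters; k++) {
            for (int i = 0; i < n; i++) {          /* r = b - A*x (double) */
                double r = b[i];
                for (int j = 0; j < n; j++) r -= A[i*n+j] * x[j];
                v[i] = (float)r;
            }
            lu_solve_s(n, A32, v);                 /* z_k (single) */
            for (int i = 0; i < n; i++) x[i] += v[i];  /* x_k = x_{k-1} + z_k */
            /* a real implementation would test ||r|| for convergence */
        }
        free(A32); free(v);
    }

    int main(void) {
        double A[4] = { 4, 1, 2, 3 }, b[2] = { 1, 2 }, x[2];
        mixed_solve(2, A, b, x, 3);
        printf("x = (%.15f, %.15f)\n", x[0], x[1]);   /* expect (0.1, 0.6) */
        return 0;
    }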

  28. Mixed Precision
      (Workflow diagram: CRAFT takes the original double-precision binary plus a mixed-precision configuration and produces a modified, mixed-precision binary.)

  29. Mixed Precision
      • Simulate single precision by storing a 32-bit version inside the 64-bit double-precision field
      • A replaced (down-cast) double is flagged by writing the non-signalling NaN pattern 0x7FF4DEAD into its high word; the single-precision value occupies the low word
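In C terms, the tagging scheme looks roughly like the following sketch (my reconstruction using the 0x7FF4DEAD pattern from the slide; encode_replaced, is_replaced, and decode_replaced are hypothetical helper names):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define CRAFT_FLAG 0x7FF4DEAD00000000ULL   /* non-signalling NaN pattern */
    #define HIGH_MASK  0xFFFFFFFF00000000ULL

    /* Down-cast a double and store the float bits in the low word,
     * with the flag pattern in the high word. */
    static uint64_t encode_replaced(double d) {
        float f = (float)d;
        uint32_t fbits;
        memcpy(&fbits, &f, sizeof fbits);
        return CRAFT_FLAG | fbits;
    }

    /* A slot holds a replaced value if its high word matches the flag. */
    static int is_replaced(uint64_t bits) {
        return (bits & HIGH_MASK) == CRAFT_FLAG;
    }

    /* Recover the single-precision value from the low word. */
    static float decode_replaced(uint64_t bits) {
        uint32_t fbits = (uint32_t)bits;
        float f;
        memcpy(&f, &fbits, sizeof f);
        return f;
    }

    int main(void) {
        uint64_t slot = encode_replaced(2.625);
        if (is_replaced(slot))
            printf("replaced value = %f\n", decode_replaced(slot));
        return 0;
    }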

  30. Mixed Precision
      gvec[i,j] = gvec[i,j] * lvec[3] + gvar

      1  movsd 0x601e38(%rax,%rbx,8) → %xmm0
      2  mulsd -0x78(%rsp) * %xmm0 → %xmm0
      3  addsd -0x4f02(%rip) + %xmm0 → %xmm0
      4  movsd %xmm0 → 0x601e38(%rax,%rbx,8)

  31. Mixed Precision
      gvec[i,j] = gvec[i,j] * lvec[3] + gvar

      1  movsd 0x601e38(%rax,%rbx,8) → %xmm0
         check/replace -0x78(%rsp) and %xmm0
      2  mulss -0x78(%rsp) * %xmm0 → %xmm0
         check/replace -0x4f02(%rip) and %xmm0
      3  addss -0x4f02(%rip) + %xmm0 → %xmm0
      4  movsd %xmm0 → 0x601e38(%rax,%rbx,8)

  32. Mixed Precision

  33. Mixed Precision
      push %rax
      push %rbx
      <for each input operand>
        <copy input into %rax>
        mov %rbx, 0xffffffff00000000
        and %rax, %rbx               # extract high word
        mov %rbx, 0x7ff4dead00000000
        cmp %rax, %rbx               # check for flag
        je next                      # skip if already replaced
        <copy input into %rax>
        cvtsd2ss %rax, %rax          # down-cast value
        or %rax, %rbx                # set flag
        <copy %rax back into input>
      next:
      <next operand>
      pop %rbx
      pop %rax
      <replaced instruction>         # e.g. addsd => addss

  34. Mixed Precision
      • Question: which parts should be replaced?
      • Answer: automatic search
        • Empirical, iterative feedback loop
        • User-defined verification routine
        • Heuristic search optimization

  35. Automated Search

  36. Automated Search

  37. Automated Search
      • Keys to the search algorithm (sketched below):
        • Depth-first search: look for replaceable larger structures first (modules, functions, blocks, etc.)
        • Prioritization: inspect highly-executed routines first
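In outline, the search strategy might look like this C sketch (my reconstruction of the strategy described on the slides, not CRAFT's Ruby driver; node and passes_verification are hypothetical names):

    #include <stddef.h>

    /* A node is a module, function, basic block, or instruction. */
    typedef struct node {
        struct node **children;   /* assumed pre-sorted by execution */
        size_t nchildren;         /* count, so hot routines go first */
    } node;

    /* Stub: build a configuration with this subtree in single
     * precision, run the program, apply the user's verification. */
    static int passes_verification(const node *n) { (void)n; return 0; }

    /* Try the largest structures first; descend only on failure. */
    static void search(node *n) {
        if (passes_verification(n))
            return;                      /* whole subtree can be single */
        for (size_t i = 0; i < n->nchildren; i++)
            search(n->children[i]);      /* refine: test the children */
    }

    int main(void) {
        node leaf = { NULL, 0 };
        node *kids[] = { &leaf };
        node root = { kids, 1 };
        search(&root);
        return 0;
    }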

  38. Mixed Precision: Sum2PI_X (failed single-precision replacement)

  39. Mixed Precision: Sum2PI_X

      /* SUM2PI_X - approximate pi*x in a computationally-
       * heavy way to demonstrate various CRAFT analyses */

      /* constants */
      #define PI  3.14159265359
      #define EPS 1e-7

      /* loop iterations; OUTER is X */
      #define OUTER 2000
      #define INNER 30

      int sum2pi_x() {
          int i, j, k;
          real x, y, acc;
          sum_type sum;
          real final = PI * OUTER;
          sum = 0.0;
          for (i=0; i<OUTER; i++) {
              acc = 0.0;
              for (j=1; j<INNER; j++) {
                  x = 1.0;
                  for (k=0; k<j; k++)
                      x *= 2.0;
                  y = (real)PI / x;
                  acc += y;
              }
              sum += acc;
          }
          real err = fabs(final-sum)/fabs(final);
          if (err < EPS) printf("SUCCESSFUL!\n");
          else           printf("FAILED!!!\n");
      }

  40. Mixed Precision: Sum2PI_X (same listing as slide 39)

  41. Mixed Precision: Results
      • SuperLU: a lower error threshold yields fewer replacements

  42. Mixed Precision: Results
      • AMGmk: highly-adaptive multigrid microkernel with built-in error tolerance
      • Search found a complete replacement; the conversion was then performed manually
      • Speedup: 175s to 95s (1.8X) on conventional x86_64 hardware

  43. Mixed Precision: Results

  44. Mixed Precision: Results
      • Memory-based analysis: replacement candidates are output operands
      • Generally higher replacement rates
      • Analysis found several valid variable-level replacements

  45. Mixed Precision: Conclusions
      • Automated tools can prototype mixed-precision configurations
      • Automated search can provide precision-level replacement insights
      • Precision analysis could provide another "knob" for application tuning
      • Even if computation requires double precision, storage/communication may not

  46. Contribution 4 of 4: Reduced Precision

  47. Reduced Precision
      • Simulate reduced precision with truncation: truncate the result after every operation
      • Allows anywhere from zero bits up to full double (64-bit) precision
      • Less overhead than mixed-precision simulation (fewer added operations)
      • Search routine identifies component-level precision requirements
      (Figure: precision scale running from 0 bits through single up to double.)
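Truncation itself is a cheap bit mask. Here is a minimal C sketch of the idea (truncate_sig is a hypothetical name, not the tool's code):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* Keep only the top 'bits' significand bits of a double (0..52),
     * zeroing the rest to simulate a reduced-precision result. */
    static double truncate_sig(double d, int bits) {
        uint64_t u;
        memcpy(&u, &d, sizeof u);
        if (bits < 52)
            u &= ~((UINT64_C(1) << (52 - bits)) - 1);  /* keep sign+exp+top bits */
        memcpy(&d, &u, sizeof d);
        return d;
    }

    int main(void) {
        double x = 1.0 / 3.0;
        printf("%.17f\n", truncate_sig(x, 23));  /* ~single precision */
        printf("%.17f\n", truncate_sig(x, 0));   /* exponent only */
        return 0;
    }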

  48. Reduced Precision: GUI
      • Bit-level precision requirements, displayed on a scale from 0 bits through single to double

  49. Reduced Precision: Sum2PI_X
      • 0 bits (single: exponent only)
      • 22 bits (single)
      • 27 bits (double: overly conservative)
      • 32 bits (double)

  50. Reduced Precision
      • Faster search convergence compared to mixed-precision analysis
