Automated Floating Point Analysis with CRAFT: Balancing Precision and Performance

Floating Point Analysis Using Dyninst
Mike Lam, University of Maryland, College Park
Jeff Hollingsworth, Advisor

Background
• Floating point represents real numbers as (± sgnf × 2^exp)
  • Sign bit
  • Exponent
  • Significand (“mantissa” or “fraction”)
• Finite precision
  • Single-precision: 24 bits (~7 decimal digits)
  • Double-precision: 53 bits (~16 decimal digits)
[Figure: bit layouts. IEEE Single (32 bits): exponent (8 bits), significand (23 bits). IEEE Double (64 bits): exponent (11 bits), significand (52 bits).]
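The field layout above can be inspected directly. The following is a minimal sketch using Python's standard `struct` module; the function names `decompose_single` and `decompose_double` are illustrative and do not come from the slides.

```python
import struct

def decompose_double(x):
    """Split an IEEE double into (sign, biased exponent, stored significand)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF          # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)        # 52 stored significand bits
    return sign, exponent, fraction

def decompose_single(x):
    """Round x to an IEEE single, then split it the same way."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF           # 8 exponent bits
    fraction = bits & ((1 << 23) - 1)        # 23 stored significand bits
    return sign, exponent, fraction

def round_to_single(x):
    """Round a double to the nearest single and widen it back to a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(decompose_double(1.0))
print(decompose_single(-2.0))
print(round_to_single(0.1) == 0.1)
```

The last line illustrates finite precision: 0.1 is not exactly representable in binary, and the single-precision approximation differs from the double-precision one.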

Motivation
• Finite precision causes round-off error
  • Compromises certain calculations
  • Hard to detect and diagnose
• Increasingly important as HPC scales
  • Computation on streaming processors is faster in single precision
  • Data movement in double precision is a bottleneck
  • Need to balance speed (singles) and accuracy (doubles)

Our Goal
Automated analysis techniques to inform developers about floating-point behavior and make recommendations regarding the use of floating-point arithmetic.

Framework
CRAFT: Configurable Runtime Analysis for Floating-point Tuning
• Static binary instrumentation
  • Read configuration settings
  • Replace floating-point instructions with new code
  • Rewrite modified binary
• Dynamic analysis
  • Run modified program on representative data set
  • Produce results and recommendations

Previous Work
• Cancellation detection
  • Reports loss of precision due to subtraction
  • Paper appeared in WHIST'11
• Range tracking
  • Reports min/max values
• Replacement
  • Implements mixed-precision configurations
  • Paper to appear in ICS'13

Mixed Precision
• Use double precision where necessary
• Use single precision everywhere else
• Can be difficult to implement
 1: LU ← PA
 2: solve Ly = Pb
 3: solve U x_0 = y
 4: for k = 1, 2, ... do
 5:   r_k ← b - A x_{k-1}
 6:   solve Ly = P r_k
 7:   solve U z_k = y
 8:   x_k ← x_{k-1} + z_k
 9:   check for convergence
10: end for
Mixed-precision linear solver algorithm. Red text indicates steps performed in double-precision (all other steps are single-precision).
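The solver above can be sketched in plain Python. This is a minimal illustration and not the code behind the slides. It assumes that the residual (step 5) and the update (step 8) are the double-precision steps, which is the conventional choice for iterative refinement. Single precision is emulated by rounding every intermediate result through a hypothetical helper `f32`.

```python
import struct

def f32(x):
    """Emulate single precision: round a double to the nearest single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def lu_factor_single(A):
    """Step 1: LU <- PA with partial pivoting, all arithmetic rounded to single."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))  # pivot row
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])            # multiplier (entry of L)
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU, perm

def lu_solve_single(LU, perm, b):
    """Solve Ly = Pb, then Ux = y, all arithmetic rounded to single."""
    n = len(b)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):                  # forward substitution (unit diagonal L)
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    for i in reversed(range(n)):        # back substitution
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
        y[i] = f32(y[i] / LU[i][i])
    return y

def refine(A, b, max_iters=10, tol=1e-12):
    """Mixed-precision iterative refinement (steps 1-10 of the algorithm)."""
    n = len(b)
    LU, perm = lu_factor_single(A)                 # step 1 (single)
    x = lu_solve_single(LU, perm, b)               # steps 2-3 (single)
    for _ in range(max_iters):
        # Step 5: residual in double precision (plain Python floats).
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) < tol:           # step 9
            break
        z = lu_solve_single(LU, perm, r)           # steps 6-7 (single)
        x = [x[i] + z[i] for i in range(n)]        # step 8 (double)
    return x

A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
x_true = [0.1, 0.2, 0.3]
b = [sum(A[i][j] * x_true[j] for j in range(3)) for i in range(3)]
print(refine(A, b))
```

The expensive factorization runs once in single precision, and only the cheap residual and update steps need double precision to recover a double-precision answer on a well-conditioned system.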

Configuration

Implementation
• In-place replacement
  • Narrowed focus: doubles → singles
  • In-place downcast conversion
  • Flag in the high bits to indicate replacement
[Figure: downcast conversion of a 64-bit double into a "replaced double". The high 32 bits hold the flag 0x7FF4DEAD (non-signalling NaN) and the low 32 bits hold the single-precision value.]
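The in-place encoding can be modeled at the bit level. The sketch below is illustrative only: it operates on 64-bit integers rather than machine registers, and the function names are hypothetical. The flag value is the one shown on the slide.

```python
import math
import struct

FLAG = 0x7FF4DEAD  # high-word flag shown on the slide

def double_bits(x):
    """Return the 64-bit pattern of a double as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def downcast_in_place(bits64):
    """Replace a double's bit pattern with flag (high 32) + single (low 32)."""
    value = struct.unpack("<d", struct.pack("<Q", bits64))[0]
    # struct raises OverflowError if the value is too large for a single.
    single_bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return (FLAG << 32) | single_bits

def is_replaced(bits64):
    """Check whether the high 32 bits carry the replacement flag."""
    return (bits64 >> 32) == FLAG

def read_value(bits64):
    """Decode a slot as a single if flagged, otherwise as a double."""
    if is_replaced(bits64):
        return struct.unpack("<f", struct.pack("<I", bits64 & 0xFFFFFFFF))[0]
    return struct.unpack("<d", struct.pack("<Q", bits64))[0]

original = double_bits(1.5)
replaced = downcast_in_place(original)
print(hex(replaced), read_value(replaced))
# Interpreted as a plain double, the replaced pattern is a NaN.
print(math.isnan(struct.unpack("<d", struct.pack("<Q", replaced))[0]))
```

Because the flagged pattern decodes as a NaN when read as an ordinary double, a replaced value that slips into uninstrumented double-precision code produces NaN results rather than a plausible-looking wrong number.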

Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar
1 movsd 0x601e38(%rax, %rbx, 8) → %xmm0
2 mulsd -0x78(%rsp) → %xmm0
3 addsd -0x4f02(%rip) → %xmm0
4 movsd %xmm0 → 0x601e38(%rax, %rbx, 8)

Example gvec[i,j] = gvec[i,j] * lvec[3] + gvar 1 movsd0x601e38(%rax, %rbx, 8)  %xmm0 check/replace -0x78(%rsp) and %xmm0 2 mulss-0x78(%rsp)  %xmm0 check/replace -0x4f02(%rip) and %xmm0 3 addss-0x20dd43(%rip)  %xmm0 4 movsd %xmm0 0x601e38(%rax, %rbx, 8) 11

Block Editing (PatchAPI)
[Figure labels: original instruction in block; block splits; initialization; check/replace; double → single conversion; cleanup]

Automated Search
• Manual mixed-precision analysis
  • Hard to use without intuition regarding potential replacements
• Automatic mixed-precision analysis
  • Try lots of configurations (empirical auto-tuning)
  • Test with user-defined verification routine and data set
  • Exploit program control structure: replace larger structures (modules, functions) first
  • If coarse-grained replacements fail, try finer-grained subcomponent replacements
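The coarse-to-fine strategy can be sketched as a breadth-first search over the program structure. The `Node` class, the `passes` callback (standing in for "instrument, run, and apply the user-defined verification routine"), and the toy program tree are all hypothetical. This is one plausible reading of the bullets above, not CRAFT's actual search code.

```python
from collections import deque

class Node:
    """A program structure: module, function, basic block, or instruction."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def leaves(node):
    """Names of all instructions (leaf nodes) under a structure."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaves(child)]

def search(root, passes):
    """Try the largest structures first and descend only where a test fails."""
    accepted = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if passes(node):
            accepted.append(node.name)      # whole structure can be replaced
        else:
            queue.extend(node.children)     # retry at a finer granularity
    return accepted

# Toy program: instruction i2 is assumed to need double precision.
program = Node("app", [
    Node("mod_a", [Node("f1", [Node("i1"), Node("i2")]),
                   Node("f2", [Node("i3")])]),
    Node("mod_b", [Node("f3", [Node("i4")])]),
])
sensitive = {"i2"}

def passes(node):
    """Stand-in for a verification run: fail if a sensitive instruction is replaced."""
    return not any(name in sensitive for name in leaves(node))

print(search(program, passes))  # ['mod_b', 'f2', 'i1']
```

Each accepted structure passed verification on its own, so a final configuration that combines them would still need its own verification run.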

System Overview

NAS Results

AMGmk Results
• Algebraic MultiGrid microkernel
  • Multigrid method is highly adaptive
  • Good candidate for replacement
• Automatic search
  • Complete conversion (100% replacement)
• Manually-rewritten version
  • Speedup: 175 sec to 95 sec (1.8X)
  • Conventional x86_64 hardware

SuperLU Results
• Package for LU decomposition and linear solves
  • Reports final error residual
  • Both single- and double-precision versions
• Verified manual conversion via automatic search
  • Used error from provided single-precision version as threshold
  • Final config matched single-precision profile (99.9% replacement)

Retrospective
• Twofold original motivation
  • Faster computation (raw FLOPs)
  • Decreased storage footprint and memory bandwidth
  • Domains vary in sensitivity to these parameters
• Computation-centric analysis
  • Less insight for memory-constrained domains
  • Sometimes difficult to translate instruction-level recommendations to source code-level transformations
• Data-centric analysis
  • Focus on data motion, which is closer to source code-level structures

Current Project
• Memory-based replacement
  • Perform all computation in double precision
  • Save storage space by storing single-precision values in some cases
• Implementation
  • Register-based computation remains double-precision
  • Replace movement instructions (movsd)
    • Memory to register: check and upcast
    • Register to memory: downcast if configured
  • Searching for replaceable writes instead of computes
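The load/store behavior described above can be modeled with a toy memory. In this sketch the `Memory` class, the notion of a named write "site", and the `downcast_sites` configuration are hypothetical stand-ins for instrumented `movsd` instructions. The flag value is reused from the Implementation slide.

```python
import struct

FLAG = 0x7FF4DEAD  # replacement flag from the Implementation slide

class Memory:
    """Toy model of 64-bit slots accessed through instrumented moves."""

    def __init__(self, downcast_sites=()):
        self.slots = {}
        # Hypothetical configuration: write sites allowed to store singles.
        self.downcast_sites = set(downcast_sites)

    def store(self, addr, value, site):
        """Register to memory: downcast only if this write site is configured."""
        if site in self.downcast_sites:
            low = struct.unpack("<I", struct.pack("<f", value))[0]
            self.slots[addr] = (FLAG << 32) | low
        else:
            self.slots[addr] = struct.unpack("<Q", struct.pack("<d", value))[0]

    def load(self, addr):
        """Memory to register: check for the flag and upcast to double."""
        bits = self.slots[addr]
        if (bits >> 32) == FLAG:
            return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]
        return struct.unpack("<d", struct.pack("<Q", bits))[0]

mem = Memory(downcast_sites={"write_1"})
mem.store(0, 0.1, site="write_1")   # stored as a flagged single
mem.store(8, 0.1, site="write_2")   # stored as a full double
# Arithmetic on loaded values is ordinary double-precision arithmetic.
total = mem.load(0) + mem.load(8)
print(mem.load(0), mem.load(8), total)
```

Only the stored value loses precision. Every computation still runs in double precision, so the search space is the set of writes rather than the set of arithmetic instructions.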

Preliminary Results
All benchmarks were single-core versions compiled by the Intel Fortran compiler with optimization enabled. Tests were performed on an Intel workstation with 48 GB of RAM running 64-bit Linux.

Future Work
• Case studies
• Search convergence study

Conclusion
Automated binary instrumentation techniques can be used to implement mixed-precision configurations for floating-point code, and memory-based replacement provides actionable results.

Thank you!
sf.net/p/crafthpc
