Floating Point Analysis

Using Dyninst

Mike Lam

University of Maryland, College Park

Jeff Hollingsworth, Advisor


Background

  • Floating point represents real numbers as ±(significand × 2^exponent)

    • Sign bit

    • Exponent

    • Significand (“mantissa” or “fraction”)

  • Finite precision

    • Single-precision: 24 bits (~7 decimal digits)

    • Double-precision: 53 bits (~16 decimal digits)

[Figure: IEEE bit layouts — Single: sign (1 bit), exponent (8 bits), significand (23 bits); Double: sign (1 bit), exponent (11 bits), significand (52 bits)]

2
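The field widths above can be checked by pulling a double apart with bit operations. A minimal Python sketch (function and variable names are illustrative, not from the deck):

```python
import struct

def decode_double(x):
    """Split an IEEE double into its sign, biased exponent, and significand fields."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits (biased by 1023)
    significand = bits & ((1 << 52) - 1)   # 52 bits (implicit leading 1 not stored)
    return sign, exponent, significand

# 1.0 = +1.0 × 2^0, so the biased exponent is 1023 and the stored significand is 0
print(decode_double(1.0))   # → (0, 1023, 0)
```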


Motivation

  • Finite precision causes round-off error

    • Compromises certain calculations

    • Hard to detect and diagnose

  • Increasingly important as HPC scales

    • Computation on streaming processors is faster in single precision

    • Data movement in double precision is a bottleneck

    • Need to balance speed (singles) and accuracy (doubles)

3


Our Goal

Automated analysis techniques to inform developers about floating point behavior and make recommendations regarding the use of floating point arithmetic.

4


Framework

CRAFT: Configurable Runtime Analysis for Floating-point Tuning

  • Static binary instrumentation

    • Read configuration settings

    • Replace floating-point instructions with new code

    • Rewrite modified binary

  • Dynamic analysis

    • Run modified program on representative data set

    • Produce results and recommendations

5


Previous Work

  • Cancellation detection

    • Reports loss of precision due to subtraction

    • Paper appeared in WHIST '11

  • Range tracking

    • Reports min/max values

  • Replacement

    • Implements mixed-precision configurations

    • Paper to appear in ICS’13

6


Mixed Precision

  • Use double precision where necessary

  • Use single precision everywhere else

  • Can be difficult to implement

1: LU ← PA

2: solve Ly = Pb

3: solve Ux_0 = y

4: for k = 1, 2, ... do

5:   r_k ← b − A·x_{k−1}

6:   solve Ly = P·r_k

7:   solve U·z_k = y

8:   x_k ← x_{k−1} + z_k

9:   check for convergence

10: end for

Mixed-precision linear solver algorithm (iterative refinement). In the original slide, red text marked the steps performed in double precision; all other steps are single precision.

7
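The algorithm above can be sketched in NumPy. This is a simplified illustration that uses `numpy.linalg.solve` in place of the factored L/U triangular solves (steps 1–3 and 6–7), so it shows the precision mixing rather than a real factorization:

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-12, max_iter=50):
    """Solve Ax = b cheaply in single precision, then refine in double."""
    A32 = A.astype(np.float32)
    # Steps 1-3: initial solve entirely in single precision
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):                 # step 4
        r = b - A @ x                         # step 5: residual in DOUBLE precision
        # Steps 6-7: correction solve in single precision
        z = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x = x + z                             # step 8: update in double
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):  # step 9
            break
    return x
```

The key point is that only the residual and update (steps 5 and 8) need double precision; the expensive solves run in single.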



Implementation

  • In-place replacement

    • Narrowed focus: doubles → singles

    • In-place downcast conversion

    • Flag in the high bits to indicate replacement

[Figure: in-place downcast conversion — a 64-bit double is replaced by a single-precision value stored in its low 32 bits, with the non-signalling NaN pattern 0x7FF4DEAD written to the high 32 bits to flag the replacement]

9
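The flagging scheme can be mimicked in pure Python with `struct`. The 0x7FF4DEAD pattern is from the slides; `replace_in_place`, `is_replaced`, and `read_value` are illustrative names, not CRAFT's actual API:

```python
import struct

FLAG = 0x7FF4DEAD  # NaN pattern placed in the high 32 bits (from the slides)

def replace_in_place(d):
    """Downcast a double to single, store it in the low 32 bits,
    and write the flag pattern into the high 32 bits."""
    single_bits = struct.unpack('<I', struct.pack('<f', d))[0]
    return (FLAG << 32) | single_bits

def is_replaced(bits64):
    return (bits64 >> 32) == FLAG

def read_value(bits64):
    """Interpret a 64-bit pattern: flagged values hold a single in the low bits."""
    if is_replaced(bits64):
        return struct.unpack('<f', struct.pack('<I', bits64 & 0xFFFFFFFF))[0]
    return struct.unpack('<d', struct.pack('<Q', bits64))[0]
```

Because the exponent field of the flag is all ones with a nonzero significand, any instrumented code that accidentally consumes a replaced slot as a double sees a NaN rather than a silently wrong value.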


Example

gvec[i,j] = gvec[i,j] * lvec[3] + gvar

1 movsd 0x601e38(%rax,%rbx,8) → %xmm0

2 mulsd -0x78(%rsp) → %xmm0

3 addsd -0x4f02(%rip) → %xmm0

4 movsd %xmm0 → 0x601e38(%rax,%rbx,8)

10


Example

gvec[i,j] = gvec[i,j] * lvec[3] + gvar

1 movsd 0x601e38(%rax,%rbx,8) → %xmm0

check/replace -0x78(%rsp) and %xmm0

2 mulss -0x78(%rsp) → %xmm0

check/replace -0x4f02(%rip) and %xmm0

3 addss -0x20dd43(%rip) → %xmm0

4 movsd %xmm0 → 0x601e38(%rax,%rbx,8)

11


Block Editing (PatchAPI)

[Figure: PatchAPI block editing — the block containing the original instruction is split, and snippets for initialization, double → single conversion, check/replace, and cleanup are inserted around it]

12


Automated Search

  • Manual mixed-precision analysis

    • Hard to use without intuition regarding potential replacements

  • Automatic mixed-precision analysis

    • Try lots of configurations (empirical auto-tuning)

    • Test with user-defined verification routine and data set

    • Exploit program control structure: replace larger structures (modules, functions) first

    • If coarse-grained replacements fail, try finer-grained subcomponent replacements

13
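The coarse-to-fine strategy above can be sketched as a simple tree search. `Node` and the `passes` predicate are hypothetical stand-ins for CRAFT's program-structure hierarchy and its user-defined verification run:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A unit of program structure: module, function, or instruction."""
    name: str
    children: list = field(default_factory=list)

def search(node, passes):
    """Coarse-to-fine search: accept a whole subtree if replacing it
    verifies; otherwise recurse into its finer-grained children."""
    if passes(node):
        return [node.name]          # whole subtree can run in single precision
    found = []
    for child in node.children:
        found.extend(search(child, passes))
    return found
```

A real auto-tuner must actually rewrite and rerun the program at each candidate, which is why exploiting the structure to prune early matters: one passing module-level test avoids testing every function inside it.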




AMGmk Results

  • Algebraic MultiGrid microkernel

    • Multigrid method is highly adaptive

    • Good candidate for replacement

  • Automatic search

    • Complete conversion (100% replacement)

  • Manually-rewritten version

    • Speedup: 175 sec to 95 sec (1.8X)

    • Conventional x86_64 hardware

16


SuperLU Results

  • Package for LU decomposition and linear solves

    • Reports final error residual

    • Both single- and double-precision versions

  • Verified manual conversion via automatic search

    • Used error from provided single-precision version as threshold

    • Final config matched single-precision profile (99.9% replacement)

17


Retrospective

  • Twofold original motivation

    • Faster computation (raw FLOPs)

    • Decreased storage footprint and memory bandwidth

      • Domains vary in sensitivity to these parameters

  • Computation-centric analysis

    • Less insight for memory-constrained domains

    • Sometimes difficult to translate instruction-level recommendations to source code-level transformations

  • Data-centric analysis

    • Focus on data motion, which is closer to source code-level structures

18


Current Project

  • Memory-based replacement

    • Perform all computation in double precision

    • Save storage space by storing single-precision values in some cases

  • Implementation

    • Register-based computation remains double-precision

    • Replace movement instructions (movsd)

      • Memory to register: check and upcast

      • Register to memory: downcast if configured

    • Searching for replaceable writes instead of computes

19
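The movement-instruction replacement can be modeled abstractly. `store`, `load`, and the tagged `mem` dict are illustrative stand-ins for the instrumented `movsd` instructions, not CRAFT's actual mechanism:

```python
import struct

def round_to_single(x):
    # emulate the precision loss of a single-precision store
    return struct.unpack('<f', struct.pack('<f', x))[0]

def store(mem, addr, value, downcast):
    """Register-to-memory move: downcast if this store is configured."""
    mem[addr] = ('f32', round_to_single(value)) if downcast else ('f64', value)

def load(mem, addr):
    """Memory-to-register move: check the tag and upcast to double."""
    _kind, v = mem[addr]
    return v  # computation always continues in double precision
```

Unlike the earlier in-place scheme, all arithmetic stays in double; only the stored representation shrinks, which targets memory footprint and bandwidth rather than raw FLOPs.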


Preliminary Results

All benchmarks were single-core versions compiled with the Intel Fortran compiler with optimization enabled. Tests were performed on an Intel workstation with 48 GB of RAM running 64-bit Linux.

20









Future Work

  • Case studies

  • Search convergence study

28


Conclusion

Automated binary instrumentation techniques can be used to implement mixed-precision configurations for floating point code, and memory-based replacement provides actionable results.

29


Thank you!

sf.net/p/crafthpc

30

