Numerical Error Minimizing Floating-Point to Fixed-Point ANSI C Compilation

Tor Aamodt and Paul Chow
University of Toronto

Presentation Outline
• Background / Motivation
• Floating-to-Fixed-Point Conversion
• Architectural Support
• Experimental Results
• Summary / Future Directions

Background: University of Toronto DSP Project
• Motivation: DSP Compiler/Architecture Co-design
• First Generation Silicon (Sean Peng's M.A.Sc. Thesis) taped out Sept. 30, 1999: 108 pin PGA / 0.35 µm CMOS / 63 MHz
• 16-bit Fixed-Point VLIW with Two-Level Instruction Fetching
• Harvard Memory Architecture
• 5 stage pipeline: IF1 → IF2 → ID → EX → WB
• 7 function units:
  • 2 integer units: 16.0 multiply & 1.15 multiply operations
  • 2 address units: modulo addressing
  • 2 memory units: each tied to one data memory bank
  • 1 control unit

Background: Fixed-Point versus Floating-Point
• 32 bit Floating-Point (IEEE): sign bit | 8 bit exponent (excess 127) | 23+1 bit normalized mantissa
• Fixed-Point: sign bit | integer part (IWL bits) | fractional part

Background: Fixed-Point versus Floating-Point

                                 Fixed-Point                      Floating-Point
Dynamic range of $|x|$           $[0, 2^{IWL})$                   $(2^{-126}, 2^{127})$
Precision $|\Delta x / x|$       $|x|^{-1}\, 2^{(1 + IWL - WL)}$  $2^{-23}$
Function unit cost               significantly less

The function unit cost motivates us to find ways of coping with the shortcomings of fixed-point representations.
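As a worked instance of the precision entry, assuming it reads $|\Delta x / x| = |x|^{-1}\, 2^{(1 + IWL - WL)}$ for fixed-point, take a 16-bit word (WL = 16) with IWL = 0. The two sample magnitudes are chosen only for illustration and do not come from the slides.

```latex
\begin{align*}
\left|\frac{\Delta x}{x}\right|_{|x| = 2^{-1}} &= 2^{1} \cdot 2^{(1 + 0 - 16)} = 2^{-14}, \\
\left|\frac{\Delta x}{x}\right|_{|x| = 2^{-8}} &= 2^{8} \cdot 2^{(1 + 0 - 16)} = 2^{-7}, \\
\left|\frac{\Delta x}{x}\right|_{\text{floating-point}} &= 2^{-23} \quad \text{for any normalized } x.
\end{align*}
```

The fixed-point relative error grows as the signal magnitude shrinks, which is why an IWL that is larger than necessary costs precision.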

Motivation
• Why convert floating-point code to fixed-point code? Saves area and power.
• Why automate the process? Manual conversion is time-consuming and error-prone.
• What qualities are we looking for in an automated conversion system? Good signal quality. Fast code.

Background: Fixed-Point Numerical Representations in Signal Processing
• Consider a program P with associated inputs x(k) ∈ S_P. Example: P an IIR filter, S_P the set of all human speech samples x(k).
• Signal Scaling: Integer Word Length (IWL)
• Definition: an IWL is assigned to each signal (input, program variable, intermediate result, output) and must hold for all definitions of that signal and all inputs x. The definition adds an infinitesimally small number to the magnitude. Why? E.g. log2 2 = 1.
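A minimal sketch of how an IWL could be derived from a profiled maximum magnitude. It assumes the convention IWL = floor(log2(max|x|)) + 1, which places max|x| strictly inside the range [0, 2^IWL); the slides' exact formula is not reproduced here, and the helper name `iwl_from_max` is hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: smallest IWL such that max_abs < 2^IWL.
 * Assumed convention: IWL = floor(log2(max_abs)) + 1. */
static int iwl_from_max(double max_abs)
{
    int exponent;
    assert(max_abs > 0.0);
    /* frexp returns m in [0.5, 1) with max_abs == m * 2^exponent, so
     * exponent == floor(log2(max_abs)) + 1 without log2 rounding issues. */
    (void)frexp(max_abs, &exponent);
    return exponent;
}
```

Under this convention a maximum magnitude of exactly 2 needs two integer bits, even though log2 2 = 1, and magnitudes below 0.5 yield a negative IWL.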

Background: Fixed-Point Arithmetic Operations
• Addition / Subtraction: the operand with the smaller IWL (B) is shifted >> n (binary point alignment)
• Overflow Guard Bits: >> 1 (IWL + 1)
• Multiplication: A has IWL_A, B has IWL_B, and A*B has IWL_A + IWL_B
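The two operations can be sketched in C for a 16-bit word. This is an illustration under stated assumptions, not the UTDSP datapath: the names `fx_add` and `fx_mul` are hypothetical, raw values are held in `int16_t` with the IWL tracked separately, right shifts of negative values are assumed to be arithmetic, and no overflow handling or guard bit is included.

```c
#include <assert.h>
#include <stdint.h>

#define WL 16  /* word length in bits */

/* Add two values whose IWLs satisfy iwl_a >= iwl_b. B is shifted right by
 * n = iwl_a - iwl_b to align the binary points; the result carries iwl_a. */
static int16_t fx_add(int16_t a, int iwl_a, int16_t b, int iwl_b)
{
    int n = iwl_a - iwl_b;
    assert(n >= 0 && n < WL);
    return (int16_t)(a + (b >> n));
}

/* Fractional multiply: keep the upper bits of the 32-bit product.
 * The result has IWL equal to IWL_A + IWL_B. */
static int16_t fx_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (int16_t)(p >> (WL - 1));
}
```

For example, with raw value 16384 the number 1.0 at IWL 1 plus the number 0.5 at IWL 0 gives raw 24576 at IWL 1, which is 1.5.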

Presentation Outline
• Background Material / Motivation
• Floating-to-Fixed-Point Conversion
• Architecture Support
• Experimental Results
• Summary / Future Directions

Conversion Process: Previous Work
• ‘Worst-Case Evaluation’: Markus Willems et al. FRIDGE: An Interactive Code Generation Environment for HW/SW CoDesign. ICASSP, April 1997.
• A ‘Statistical’ Approach: Ki-Il Kum, Jiyang Kang, and Wonyong Sung. A Floating-Point to Fixed-Point C Converter for Fixed-Point Digital Signal Processors. In Proc. 2nd SUIF Compiler Workshop, August 1997.

Conversion Process: Overview
“sin(x)” → “utdsp_sin(x)”

    float *p, x, y, A[N], B[N];
    for( int i=0; i < N; i++ ){
        p = (condition) ? A : B;
        y += x*p[i];
    }

    float fubar( float *p ) {
        float sum = 0.0;
        for( int i=0; i < N; i++)
            sum += p[i];
        return sum;
    }

Conversion Process: Collecting Dynamic Range Information
Consider the ANSI C code:

    float a, b, x[N];
    y = a*x[i] + b*x[i+1];

Equivalent Expression Tree: y is the “+” node whose children are the “*” nodes a*x[i] and b*x[i+1]:

    tmp_1 = a*x[i];
    tmp_2 = b*x[i+1];
    y = tmp_1 + tmp_2;

ID Assignment: “0” : y, “1” : tmp_1, “2” : tmp_2
Code Instrumentation:

    profile(tmp_1,1);
    profile(tmp_2,2);
    profile(y,0);
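A minimal sketch of what the instrumentation could record, assuming `profile(value, id)` simply tracks the largest magnitude seen for each node ID. The slides do not show the body of `profile`, so this implementation, the `max_abs` table, and the wrapper `instrumented` are hypothetical.

```c
#include <assert.h>

#define NUM_IDS 3  /* one slot per profiled expression-tree node */

static double max_abs[NUM_IDS];  /* largest |value| seen per node ID */

/* Record the magnitude of one intermediate result (assumed behaviour). */
static void profile(double value, int id)
{
    double mag = (value < 0.0) ? -value : value;
    assert(id >= 0 && id < NUM_IDS);
    if (mag > max_abs[id])
        max_abs[id] = mag;
}

/* Instrumented form of y = a*x[i] + b*x[i+1]. */
static double instrumented(double a, double b, const double *x, int i)
{
    double tmp_1 = a * x[i];
    double tmp_2 = b * x[i + 1];
    double y = tmp_1 + tmp_2;
    profile(tmp_1, 1);
    profile(tmp_2, 2);
    profile(y, 0);
    return y;
}
```

After running representative inputs, each entry of `max_abs` gives the measured dynamic range of one node, from which a measured IWL can be assigned.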

Conversion Process: Desired Result
Continuation of Previous Example:

    float a, b, x[N];
    y = a*x[i] + b*x[i+1];

becomes

    int a, b, x[N];
    y = (a•x[i] >> 2) + b•x[i+1];

1. Type Conversion (float → int)
2. Scaling Operations (>> 2)
3. Fractional Fixed-Point Operations (•)

Conversion Process: Type Conversion / Scaling Operation Generation
• Type conversion: {float, double} → int
• Scaling operations are added to expression trees using a post-order traversal.
• Two previous algorithms from the literature generate scaling operations.
• Neither uses intermediate result profile data; instead, they combine range information from leaf nodes in a bottom-up fashion.
• Is useful information lost?

Conversion Process: IRP: Using Intermediate Result Profile Data
• ‘Worst-Case Evaluation’: Markus Willems et al. FRIDGE: An Interactive Code Generation Environment for HW/SW CoDesign. ICASSP, April 1997.
• A ‘Statistical’ Approach: Ki-Il Kum, Jiyang Kang, and Wonyong Sung. A Floating-Point to Fixed-Point C Converter for Fixed-Point Digital Signal Processors. In Proc. 2nd SUIF Compiler Workshop, August 1997.
• UTDSP Algorithms: IRP, IRP-SA
• Each expression-tree node has a measured IWL and a current IWL
  • Measured: IWL as determined by profiling
  • Current: IWL due to scaling operations within the node

Scaling Operation Generation
Example: “A op B”
• Converted sub-expressions A and B each carry a measured and a current IWL: IWL_A measured, IWL_A current, IWL_B measured, IWL_B current.
• For the result, IWL_{A op B} measured is known; IWL_{A op B} current = ?

IRP: Additive Operations
For example, assume |A| > |B|, and $IWL_{A+B}^{measured} \le IWL_{A}^{measured}$. Then B is shifted right by n to align it with A:

“A ± B” → “(A << n_A) ± (B >> [n - n_B])”

where:
  $n_A = IWL_{A}^{current} - IWL_{A}^{measured}$
  $n_B = IWL_{B}^{current} - IWL_{B}^{measured}$
  $n = IWL_{A}^{measured} - IWL_{B}^{measured}$

⇒ $IWL_{A+B}^{current} = IWL_{A}^{measured}$
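The shift amounts for the additive case can be computed as in the sketch below. It takes n_B to be IWL_B current minus IWL_B measured (B's own redundant integer bits), which is an assumption about the intended formula; the struct and function names are hypothetical.

```c
#include <assert.h>

/* Per-node IWL bookkeeping used by this sketch. */
struct iwl { int measured; int current; };

/* Shifts for "A +/- B" -> "(A << a_left) +/- (B >> b_right)". */
struct add_shifts { int a_left; int b_right; int result_current; };

/* Assumes IWL_A measured >= IWL_B measured and that the profiled sum
 * needs no more integer bits than A. */
static struct add_shifts irp_add(struct iwl a, struct iwl b)
{
    struct add_shifts s;
    int n_b = b.current - b.measured;   /* redundant integer bits in B */
    int n   = a.measured - b.measured;  /* binary point alignment */
    assert(n >= 0);
    s.a_left = a.current - a.measured;  /* n_A */
    s.b_right = n - n_b;                /* negative means a net left shift */
    s.result_current = a.measured;
    return s;
}
```

For instance, with A at (measured 2, current 3) and B at (measured 0, current 1), A is shifted left by 1, B right by 1, and the sum has a current IWL of 2.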

IRP: Multiplication

“A • B” → “(A << n_A) • (B << n_B)”

where:
  $n_A = IWL_{A}^{current} - IWL_{A}^{measured}$
  $n_B = IWL_{B}^{current} - IWL_{B}^{measured}$

⇒ $IWL_{A \cdot B}^{current} = IWL_{A}^{measured} + IWL_{B}^{measured}$

Note: typo in the notes, which give $IWL_{A \cdot B}^{current} = n_A + n_B$.
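The multiplicative case can be sketched the same way. As before, n_B is taken to be IWL_B current minus IWL_B measured, and the struct and function names are hypothetical.

```c
#include <assert.h>

/* Per-node IWL bookkeeping used by this sketch. */
struct iwl { int measured; int current; };

/* Shifts for "A * B" -> "(A << a_left) * (B << b_left)". */
struct mul_shifts { int a_left; int b_left; int result_current; };

static struct mul_shifts irp_mul(struct iwl a, struct iwl b)
{
    struct mul_shifts s;
    assert(a.current >= a.measured && b.current >= b.measured);
    s.a_left = a.current - a.measured;             /* n_A */
    s.b_left = b.current - b.measured;             /* n_B */
    s.result_current = a.measured + b.measured;    /* IWL of the product */
    return s;
}
```

Both operands are normalized to their measured IWLs before the multiply, so the product's current IWL is the sum of the two measured values.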

IRP-SA: Using ‘Shift Absorption’
Problem: y = (a*x[i] + (b*x[i+1] >> 1)) << 1
Question: Is information discarded unnecessarily here?
Answer: Yes! Consider the following alternative: y = (a*x[i] << 1) + b*x[i+1]
Assuming 2's-complement arithmetic, this expression results in a more precise answer, because the least significant bit of b*x[i+1] is no longer shifted out.
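The effect can be checked on 16-bit values. In this sketch p and q stand for the two products a*x[i] and b*x[i+1]; the function names are hypothetical, and wraparound is modelled explicitly rather than relying on signed overflow, which C leaves undefined.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce a wider value to 16 bits with two's-complement wraparound. */
static int16_t wrap16(int32_t v)
{
    uint16_t u = (uint16_t)v;  /* well-defined reduction modulo 2^16 */
    return (u & 0x8000u) ? (int16_t)((int32_t)u - 65536) : (int16_t)u;
}

/* IRP form: (p + (q >> 1)) << 1, which drops the least significant bit of q. */
static int16_t sum_irp(int16_t p, int16_t q)
{
    assert(q >= 0);  /* keeps >> portable in this sketch */
    return wrap16(2 * (int32_t)wrap16((int32_t)p + (q >> 1)));
}

/* IRP-SA form: (p << 1) + q; the intermediate p << 1 is allowed to wrap. */
static int16_t sum_irp_sa(int16_t p, int16_t q)
{
    int16_t p2 = wrap16(2 * (int32_t)p);
    return wrap16((int32_t)p2 + q);
}
```

With p = 100 and q = 7 the IRP form yields 206 while the shift-absorbed form yields the exact 207. With p = 20000 and q = -30000 the intermediate p << 1 wraps, yet the final sum is still the correct 10000, which is where the 2's-complement assumption is used.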

Architectural Support
• Fractional Multiplication with integrated Left Shift (FMLS): the left shift is applied to the product A*B as part of the multiply operation.
• Common occurrence (using IRP-SA): A•B << n
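One way to see the benefit is to compare a fractional multiply followed by a separate left shift with a multiply that shifts the full-width product before truncating. The sketch assumes a 16-bit word and a 32-bit intermediate product, and restricts itself to non-negative products so the shifts are portable; the exact UTDSP instruction semantics are not given in the slides, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define WL 16  /* word length in bits */

/* Fractional multiply, then a separate left shift: the n low bits
 * of the result are always zero. The caller ensures the result fits. */
static int16_t mul_then_shift(int16_t a, int16_t b, int n)
{
    int32_t p = (int32_t)a * (int32_t)b;
    assert(n >= 0 && n < WL - 1 && p >= 0);
    return (int16_t)((p >> (WL - 1)) << n);
}

/* Integrated left shift: select the result window n bits lower in the
 * 32-bit product, so n extra product bits survive. */
static int16_t fmls(int16_t a, int16_t b, int n)
{
    int32_t p = (int32_t)a * (int32_t)b;
    assert(n >= 0 && n < WL - 1 && p >= 0);
    return (int16_t)(p >> (WL - 1 - n));
}
```

For a = b = 1000 and n = 2, the separate shift gives 120 while the integrated shift gives 122, because two more bits of the product are retained.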

Experimental Results
• Four test cases presented in paper:
  (1) 4th Order IIR Filter
  (2) 1024 Point Radix 2 Decimation in Time FFT
  (3) Nonlinear Feedback Control System
  (4) 16th Order Lattice Filter
• Look at (1) in detail, summarize results for others.
• Explore some interesting properties exhibited in (4) that are indicative of possible future improvements.

Experimental Results: 4th Order IIR Filter
• 4th Order Chebyshev Type II Low-Pass Filter
• Designed using MATLAB's cheby2 command
• Transfer Function: [Figure: magnitude (dB) and phase (degrees) versus normalized frequency (×π rad/sample)]

Experimental Results: 4th Order IIR Filter (cont'd)
• Filter Realization:
  • MATLAB's tf2sos command (pole-zero pairing)
  • 2 Cascaded Direct-Form IIR filters

             14 Bit                 16 Bit
Algorithm    w/o FMLS   w/ FMLS    w/o FMLS   w/ FMLS
SNU-4        44.7 dB    44.7 dB    56.4 dB    56.4 dB
WC           45.6 dB    45.6 dB    57.1 dB    57.1 dB
IRP          49.2 dB    49.3 dB    60.9 dB    62.0 dB
IRP-SA       48.8 dB    53.5 dB    61.0 dB    66.9 dB

Experimental Results: 4th Order IIR Filter (cont'd)
IRP:    (A2[0]*t2 - A2[1]*D2[0] << 1) + (A2[2]*D2[1] << 1 ) << 2
IRP-SA: (A2[0]*t2 << 3) - (A2[1]*D2[0] << 3) + (A2[2]*D2[1] << 3)

Experimental Results: 1024-Point Radix-2 FFT

             14 Bit                 16 Bit
Algorithm    w/o FMLS   w/ FMLS    w/o FMLS   w/ FMLS
SNU-4        28.7 dB    28.7 dB    36.7 dB    36.7 dB
WC           28.7 dB    28.7 dB    36.7 dB    36.7 dB
IRP          28.7 dB    34.9 dB    36.7 dB    44.6 dB
IRP-SA       28.7 dB    34.9 dB    36.7 dB    44.6 dB

Experimental Results: Rotational Inverted Pendulum
U of T System Control Group Non-linear Testbench

Experimental Results: Rotational Inverted Pendulum

             14 Bit                 16 Bit
Algorithm    w/o FMLS   w/ FMLS    w/o FMLS   w/ FMLS
IRP          53.1 dB    58.4 dB    65.8 dB    71.8 dB
IRP-SA       52.8 dB    59.4 dB    64.4 dB    72.0 dB
SNU-4 4.0 dB 30.7 dB 54.9 dB 42.7 dB 54.3 dB 66.1 dB WC 47.3 dB 59.2 dB

Experimental Results: Rotational Inverted Pendulum - 12-bit Controller Comparison
• WC: 32.8 dB
• IRP-SA: 41.1 dB
• IRP-SA w/ FMLS: 48.0 dB

Experimental Results: 16th Order Lattice Filter
• 16th Order Elliptic Bandpass Filter
• Transfer Function: [Figure: magnitude (dB) and phase (degrees) versus normalized frequency (×π rad/sample)]

Experimental Results: Lattice Filter

             32 Bit w/o Loop Unrolling    16 Bit w/ Loop Unrolling
Algorithm    w/o FMLS   w/ FMLS          w/o FMLS   w/ FMLS
IRP          36.1 dB    36.2 dB          51.3 dB    51.3 dB
IRP-SA       36.1 dB    36.2 dB          51.3 dB    50.9 dB
SNU-4 22.8 dB 47.1 dB 47.0 dB 22.8 dB 28.1 dB 48.3 dB WC 28.1 dB 48.3 dB

Experimental Results: Lattice Filter

    #define N 16    /* filter order */
    double state[N+1], K[N], V[N+1];

    double lattice( double x )
    {
        double y = 0.0;
        for( int i=0; i < N; i++ ) {
            x = x - K[N-i-1] * state[N-i-1];
            state[N-i] = state[N-i-1] + K[N-i-1]*x;
            y = y + V[N-i]*state[N-i];    /* accumulate the output */
        }
        state[0] = x;
        return y + V[0]*state[0];
    }

Experimental Results: Lattice Filter
• Observation: Wide dynamic ranges of “state”, “V”, “x”, and “y” are due to ‘name dependencies’ of array elements and accumulators when assigning integer word lengths.
• Can use loop unrolling + renaming to break dependencies and achieve far better results (iteration-dependent analysis is mentioned in the FRIDGE paper, but no experimental results are reported).
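The idea can be illustrated on a small instance of the lattice() routine. The sketch below uses order 2 instead of 16 to stay short, keeps everything in double, and packs the arrays into a hypothetical `struct lattice`; it only shows that unrolling and renaming give each intermediate its own name while computing the same values.

```c
#include <assert.h>

#define ORDER 2  /* small order chosen for the sketch; the slides use 16 */

struct lattice { double state[ORDER + 1], K[ORDER], V[ORDER + 1]; };

/* Rolled form: x and y are reused by every iteration. */
static double lattice_rolled(struct lattice *f, double x)
{
    double y = 0.0;
    for (int i = 0; i < ORDER; i++) {
        x = x - f->K[ORDER-i-1] * f->state[ORDER-i-1];
        f->state[ORDER-i] = f->state[ORDER-i-1] + f->K[ORDER-i-1] * x;
        y = y + f->V[ORDER-i] * f->state[ORDER-i];
    }
    f->state[0] = x;
    return y + f->V[0] * f->state[0];
}

/* Unrolled and renamed: x_1, x_2, y_1, y_2, s_1, s_2 are distinct names,
 * so each could be profiled and given its own IWL. */
static double lattice_unrolled(struct lattice *f, double x)
{
    assert(ORDER == 2);
    double x_1 = x - f->K[1] * f->state[1];
    double s_2 = f->state[1] + f->K[1] * x_1;
    double y_1 = f->V[2] * s_2;
    double x_2 = x_1 - f->K[0] * f->state[0];
    double s_1 = f->state[0] + f->K[0] * x_2;
    double y_2 = y_1 + f->V[1] * s_1;
    f->state[2] = s_2;
    f->state[1] = s_1;
    f->state[0] = x_2;
    return y_2 + f->V[0] * x_2;
}
```

In the rolled form a single IWL must cover x and y across all iterations; after renaming, a converter is free to scale each iteration's values to their own measured range.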

Summary
• Intermediate result profile data can be used to reduce the numerical error of fixed-point code.
• A fractional multiply with integrated left shift operation can improve the results, especially when combined with the IRP-SA algorithm.
• Improvements between 3.0 dB and 12.8 dB have been observed so far.

Future Directions
• Structural Transformations
• Extended Precision Arithmetic
• Overflows due to accumulated rounding error: use two profiling phases to estimate the effect of ‘second-order’ interactions.
