
A case for 16-bit floating point data: FPGA image and media processing



Presentation Transcript


  1. A case for 16-bit floating point data: FPGA image and media processing
  Daniel Etiemble and Lionel Lacassagne
  University Paris Sud, Orsay (France)
  de@lri.fr

  2. Summary
  • Graphics and media applications
  • Integer versus FP computations
  • Accuracy
  • Execution speed
  • Compilation issues
  • A niche for the 16-bit floating-point format (F16 or "half")
  • Methodology and benchmarks
  • Hardware support
  • Customization of SIMD 16-bit FP operators on an FPGA soft core (Altera NIOS II CPU)
  • The SIMD 16-bit FP instructions
  • Results
  • Conclusion

  3. Integer or FP computations?
  • Both formats are used in graphics and media processing
  • Example: the Apple vImage library has four image types, based on four pixel types:
  • Unsigned char (0-255) or float (0.0-1.0) for color or alpha values
  • Set of 4 unsigned chars or floats for Alpha, Red, Green, Blue
  • Trade-offs:
  • Precision and dynamic range
  • Memory occupation and cache footprint
  • Hardware cost (embedded applications): chip area, power dissipation

  4. Integer or FP computations? (2)
  • General trend to replace FP computations by fixed-point computations
  • Intel GPP library: "Using Fixed-Point Instead of Floating-Point for Better 3D Performance" (G. Kolli), Intel Optimizing Center, http://www.devx.com/Intel/article/16478
  • Techniques for automatic floating-point to fixed-point conversion for DSP code generation (Menard et al.)

  5. Menard et al. approach (LASTI, Lannion, France)
  • Methodology: start from an FP algorithm and derive a correct algorithm for fixed-point hardware
  • SW design (DSP): optimize the "mapping" of the algorithm onto a fixed architecture; minimize execution time and code size, maximize precision
  • HW design (ASIC/FPGA): optimize the data path width; minimize chip area

  6. Integer or FP computations? (3)
  • Opposite option: customized FP formats
  • "Lightweight" FP arithmetic (Fang et al.) to avoid conversions
  • With IDCT: FP numbers with a 5-bit exponent and an 8-bit mantissa are sufficient to get a PSNR similar to 32-bit FP numbers
  • To compare with the "half" format

  7. Integer or FP computations? (4): How to help a compiler to "vectorize"?
  • Integers: different input and output formats
  • N bits + N bits => N+1 bits
  • N bits * N bits => 2N bits
  • FP numbers: same input and output formats
  • Example: a Deriche filter on a size*size-point image

  #define byte unsigned char
  byte **X, **Y;
  int32 b0, a1, a2;
  for (i=0; i<size; i++) {
    for (j=0; j<size; j++) {
      Y[i][j] = (byte) ((b0 * X[i][j] + a1 * Y[i-1][j] + a2 * Y[i-2][j]) >> 8);}}
  for (i=size-1; i>=0; i--) {
    for (j=0; j<size; j++) {
      Y[i][j] = (byte) ((b0 * X[i][j] + a1 * Y[i+1][j] + a2 * Y[i+2][j]) >> 8);}}

  • Compiler vectorization is impossible. With 8-bit coefficients, this benchmark can be manually vectorized, but only if the programmer has a detailed knowledge of the parameters used.
  • The float version is easily vectorized by the compiler (a sketch follows below).
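As a hedged illustration (not part of the original slides), here is what the float version might look like: with a single uniform format there is no byte-to-int widening and no >>8 narrowing, so the compiler can vectorize the inner loop directly. The names fb0/fa1/fa2 and the float planes are our assumptions; boundary rows are elided, as in the integer code.

  /* Hypothetical float rewrite of the Deriche filter above. */
  float **Xf, **Yf;       /* float image planes, assumed allocated */
  float fb0, fa1, fa2;    /* coefficients, now plain floats */
  for (i = 0; i < size; i++) {
      for (j = 0; j < size; j++) {   /* independent across j: vectorizable */
          Yf[i][j] = fb0 * Xf[i][j] + fa1 * Yf[i-1][j] + fa2 * Yf[i-2][j];
      }
  }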

  8. Cases for 16-bit FP formats
  • Computation when the data range exceeds the "16-bit integer" range without needing the "32-bit FP float" range
  • Graphics and media applications
  • Not for GPUs (F16 is already used in NVidia GPUs)
  • For embedded applications
  • Advantages of the 16-bit FP format
  • Reduced memory occupation (cache footprint) versus 32-bit integer or FP formats
  • CPUs without SIMD extensions (low-end embedded CPUs)
  • 2x wider SIMD instructions compared to float SIMD
  • CPUs with SIMD extensions (high-end embedded CPUs)
  • Huge advantage of SIMD float operations versus SIMD integer operations, both for compiler and manual vectorization

  9. Example: Points of Interest

  10. Points of interest (PoI) in images: the Harris algorithm
  [Pipeline figure: image (byte) → 3 x 3 gradient (Sobel) → Ix, Iy (short) → products Ix*Ix, Ix*Iy, Iy*Iy (int) → 3 x 3 Gauss filters → Sxx, Sxy, Syy (int) → coarsity (Sxx*Syy - Sxy^2) - 0.05*(Sxx+Syy)^2 → threshold (byte)]
  • Integer computation mixes char, short and int, and prevents an efficient use of SIMD parallelism
  • F16 computations would profit from SIMD parallelism with a uniform 16-bit format (see the sketch below)
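As a hedged sketch (not in the original slides), the final Harris stage can be written directly from the formula in the pipeline above; the 0.05 coefficient and the stage order come from the slide, while the function name and the threshold handling are illustrative:

  #include <stddef.h>

  /* Scalar sketch of the last Harris stage:
     K = (Sxx*Syy - Sxy^2) - 0.05*(Sxx+Syy)^2,
     where Sxx, Sxy, Syy are the Gauss-filtered gradient products. */
  void harris_coarsity(const float *Sxx, const float *Sxy, const float *Syy,
                       unsigned char *out, size_t n, float threshold)
  {
      for (size_t i = 0; i < n; i++) {
          float trace = Sxx[i] + Syy[i];
          float det   = Sxx[i] * Syy[i] - Sxy[i] * Sxy[i];
          float k     = det - 0.05f * trace * trace;
          out[i] = (k > threshold) ? 255 : 0;   /* threshold stage (byte) */
      }
  }

With F16 (or F32) data, every intermediate value in this kernel shares one format, which is exactly what makes uniform SIMD processing possible.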

  11. 16-bit floating-point formats
  • Some have been defined in DSPs but rarely used
  • Example: TMS320C32
  • Internal FP type (immediate operands): 1 sign bit, 4-bit exponent field and 11-bit fraction
  • External FP type (storage purposes): 1 sign bit, 8-bit exponent field and 7-bit fraction
  • "Half" format (a decoding sketch follows below)
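As a hedged aside (not in the slides), the "half" layout is 1 sign bit, a 5-bit exponent (bias 15) and a 10-bit fraction. A minimal decoder, with the function name ours and Inf/NaN handling omitted for brevity:

  #include <math.h>
  #include <stdint.h>

  /* Minimal half -> float decoder; illustrative only. */
  float half_to_float(uint16_t h)
  {
      int sign = (h >> 15) & 1;
      int exp  = (h >> 10) & 0x1F;   /* 5-bit exponent field */
      int frac = h & 0x3FF;          /* 10-bit fraction */
      float v;
      if (exp == 0)                  /* zero or denormal */
          v = frac * (1.0f / 1024.0f) * (1.0f / 16384.0f);  /* frac * 2^-24 */
      else                           /* normal: (1 + frac/1024) * 2^(exp-15) */
          v = (1.0f + frac * (1.0f / 1024.0f)) * ldexpf(1.0f, exp - 15);
      return sign ? -v : v;
  }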

  12. "Half" format
  • 16-bit counterpart of the IEEE 754 single and double precision formats
  • Introduced by ILM for the OpenEXR format
  • Defined in Cg (NVidia)
  • Motivation: "16-bit integer-based formats typically represent color component values from 0 (black) to 1 (white), but don't account for over-range values (e.g. a chrome highlight) that can be captured by film negative or other HDR displays... Conversely, 32-bit floating-point TIFF is often overkill for visual effects work. 32-bit FP TIFF provides more than sufficient precision and dynamic range for VFX images, but it comes at the cost of storage, both on disk and in memory."

  13. Validation of the F16 approach
  • Accuracy
  • Results presented at ODES-3 (2005) and CAMP'05 (2005); see the next slides
  • Performance with general-purpose CPUs (Pentium 4 and PowerPC G4/G5)
  • Results presented at ODES-3 (2005) and CAMP'05 (2005)
  • Performance with FPGAs (this presentation)
  • Execution time
  • Hardware cost (and power dissipation)
  • Other embedded hardware (to be done another time)
  • SoC
  • Customizable CPU (e.g. the Tensilica approach)

  14. Accuracy
  • Comparison of F16 computation results with F32 computation results
  • Specificities of FP formats:
  • Rounding?
  • Denormals?
  • NaN?

  15. Impact of F16 accuracy and dynamic range
  [Figure: the "half" layout (5-bit exponent, values 1-31; 10-bit fraction) embedded in the F32 layout (1 sign bit, 8-bit exponent, 23-bit fraction), with the low 13 fraction bits zeroed]
  • Simulation of the "half" format with the "float" format on actual benchmarks or applications
  • Impact of reduced accuracy and range on results
  • F32-computed and F16-computed images are compared with PSNR measures
  • Four different functions, ftd, frd, ftn and frn, simulate the F16 (a sketch follows below):
  • Fraction: truncation or rounding
  • With or without denormals
  • For any benchmark, manual insertion of one function (ftd / frd / ftn / frn):
  • Function call before any use of a "float" value
  • Function call after any operation producing a "float" value
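A hedged guess at what one of these helpers might look like; the slides give only the names, so this sketch implements the truncation-without-denormals variant (the "minimum hardware" case of slide 17):

  #include <stdint.h>
  #include <string.h>

  /* Illustrative "ftd"-style helper: constrain a float to half precision
     and range (truncate the fraction to 10 bits, flush values below the
     half normal range to zero), while keeping the float representation. */
  float ftd(float x)
  {
      uint32_t u;
      memcpy(&u, &x, sizeof u);                 /* bit view of the float */
      u &= 0xFFFFE000u;                         /* keep 10 fraction bits */
      int e = (int)((u >> 23) & 0xFF) - 127;    /* unbiased exponent */
      if (e < -14)
          u &= 0x80000000u;                     /* below half range: signed zero */
      else if (e > 15)
          u = (u & 0x80000000u) | 0x7F800000u;  /* above half range: infinity */
      memcpy(&x, &u, sizeof x);
      return x;
  }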

  16. Impact of F16 accuracy and dynamic range
  • Benchmark 1: zooming (A. Montanvert, Grenoble)
  • "Spline" technique for x1, x2 and x4 zooms
  • Benchmark 2: JPEG (Mediabench)
  • 4 different DCT/IDCT functions: integer / fast integer / F32 / F16
  • Benchmark 3: wavelet transform (L. Lacassagne, Orsay)
  • SPIHT (Set Partitioning in Hierarchical Trees)

  17. Accuracy (1): Zooming benchmark
  • Denormals are useless
  • No significant difference between truncation and rounding for the mantissa
  • Minimum hardware (no denormals, truncation) is OK

  18. Accuracy (2): JPEG (Mediabench)
  [Charts: difference (dB) between the compressed-then-uncompressed final image and the original image, for 512 x 512 and 256 x 256 images]

  19. Accuracy (3): Wavelet transform
  [Charts: results for 512 x 512 and 1024 x 1024 images]

  20. Accuracy (4): Wavelet transforms
  [Charts: results for 256 x 256 images]

  21. Benchmarks
  • Convolution operators
  • Horizontal-vertical version of the Deriche filter
  • Deriche gradient
  • Image stabilization
  • Points of Interest: Achard, Harris
  • Optical flow
  • FDCT (JPEG 6-a)

  Deriche, horizontal-vertical version:
  for (i=0; i<size-1; i++) {
    for (j=0; j<size; j++) {
      Y[i][j] = (byte) ((b0 * X[i][j] + a1 * Y[i-1][j] + a2 * Y[i-2][j]) >> 8);}}
  for (i=size-1; i>=0; i--) {
    for (j=0; j<size; j++) {
      Y[i][j] = (byte) ((b0 * X[i][j] + a1 * Y[i+1][j] + a2 * Y[i+2][j]) >> 8);}}

  22. Altera NIOS development kit (Cyclone edition)
  • HW and SW support
  • EP1C20F400C7 FPGA device
  • NIOS II/f CPU (50 MHz)
  • Altera IDE, GCC tool chain (-O3 option)
  • High_res_timer (number of clock cycles for the execution time)
  • VHDL description of all the F16 operators (arithmetic and data-handling operators), Quartus II design software
  • NIOS II/f fixed features: 32-bit RISC CPU, dynamic branch prediction, barrel shifter, customized instructions
  • NIOS II/f parameterized features: HW integer multiplication and division, 4 KB instruction cache, 2 KB data cache

  23. Customization of SIMD F16 instructions
  • Data manipulation
  • ADD/SUB, MUL, DIV
  • With a 32-bit CPU, it makes sense to implement F16 instructions as SIMD 2 x 16-bit instructions

  24. SIMD F16 instructions
  • Data conversions: 1 cycle
  • Bytes to/from F16 (B2F16L, B2F16H)
  • Shorts to/from F16
  • Conversions and shifts: 1 cycle (B2FSRL, B2FSRH)
  • Accesses to (i, i-1) or (i+2, i+1) and conversions
  • Arithmetic instructions
  • ADD/SUB: 2 cycles (4 for F32)
  • MULF: 2 cycles (3 for F32)
  • DIVF: 5 cycles
  • DP2: 1 cycle
  [Figure: bytes i..i+3 of a 32-bit word converted to packed F16 pairs by B2F16L/B2F16H and B2FSRL/B2FSRH]
  A usage sketch follows below.
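To make the mechanism concrete, here is a hedged sketch of how such customized instructions are typically invoked from C on a NIOS II (GCC exposes custom instructions through the __builtin_custom_* intrinsics; the opcode selector value and the addf16 wrapper name are our assumptions, not the authors' code):

  #include <stdint.h>

  /* Hypothetical opcode selector for the packed F16 add custom
     instruction; the actual value is assigned in the Quartus project. */
  #define OP_ADDF16 0

  /* Each 32-bit word holds two packed F16 values (SIMD 2 x 16 bits). */
  static inline uint32_t addf16(uint32_t a, uint32_t b)
  {
      /* __builtin_custom_inii: int result, two int operands (NIOS II GCC). */
      return (uint32_t)__builtin_custom_inii(OP_ADDF16, (int)a, (int)b);
  }

  /* Vector add: two F16 pixels processed per iteration. */
  void vec_addf16(const uint32_t *a, const uint32_t *b, uint32_t *c, int n2)
  {
      for (int i = 0; i < n2; i++)   /* n2 = number of F16 pairs */
          c[i] = addf16(a[i], b[i]);
  }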

  25. Execution time: basic vector operations
  [Charts: execution times of copy, vector add and mul, and vector-scalar add and mul, for F32, I32 and F16; table of instruction latencies]

  26. Execution time: basic vector operations
  • Speedup: SIMD F16 versus scalar I32 or F32
  • Smaller cache footprint for F16 compared to I32/F32
  • F16 latencies are smaller than F32 latencies

  27. Benchmark speedups
  • Speedup greater than 2.5 versus F32
  • Speedup from 1.3 to 3 versus I32
  • Depends on the add/mul ratio and the amount of data manipulation
  • Even scalar F16 can be faster than I32 (1.3 speedup for the JPEG DCT)
  [Chart: per-benchmark speedups; the "no mul" case is annotated]

  28. Hardware cost
  [Charts: FPGA resource usage of the F32 and F16 operators]

  29. Concluding remarks
  • Intermediate-level graphics benchmarks generally need more than the I16 (short) or I32 (int) dynamic ranges without needing the F32 (float) dynamic range
  • On our benchmarks, graphical results are not significantly different when using F16 instead of F32
  • A limited set of SIMD F16 instructions has been customized for the NIOS II CPU
  • The hardware cost is limited and compatible with today's FPGA technologies
  • The speedups range from 1.3 to 3 (generally 1.5) versus I32 and are greater than 2.5 versus F32
  • Similar results have been found for general-purpose CPUs (Pentium 4, PowerPC)
  • Tests should be extended to other embedded approaches
  • SoCs
  • Customizable CPUs (Tensilica approach)

  30. References
  • OpenEXR, http://www.openexr.org/details.html
  • W.R. Mark, R.S. Glanville, K. Akeley and M.J. Kilgard, "Cg: A System for Programming Graphics Hardware in a C-like Language"
  • NVIDIA, Cg User's Manual, http://developer.nvidia.com/view.asp?IO=cg_toolkit
  • Apple, "Introduction to vImage", http://developer.apple.com/documentation/Performance/Conceptual/vImage/
  • G. Kolli, "Using Fixed-Point Instead of Floating-Point for Better 3D Performance", Intel Optimizing Center, http://www.devx.com/Intel/article/16478
  • D. Menard, D. Chillet, F. Charot and O. Sentieys, "Automatic Floating-point to Fixed-point Conversion for DSP Code Generation", International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES 2002)
  • F. Fang, Tsuhan Chen and Rob A. Rutenbar, "Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform", EURASIP Journal on Signal Processing, Special Issue on Applied Implementation of DSP and Communication Systems
  • R. Deriche, "Using Canny's Criteria to Derive a Recursively Implemented Optimal Edge Detector", The International Journal of Computer Vision, 1(2):167-187, May 1987
  • A. Kumar, "SSE2 Optimization - OpenGL Data Stream Case Study", Intel application notes, http://www.intel.com/cd/ids/developer/asmo-na/eng/segments/games/resources/graphics/19224.htm
  • Sample code for the benchmarks is available at http://www.lri.fr/~de/F16/codetsi
  • Multi-Chip Projects, "Design Kits", http://cmp.imag.fr/ManChap4.html
  • J. Detrey and F. de Dinechin, "A VHDL Library of Parametrisable Floating-Point and LNS Operators for FPGA", http://www.ens-lyon.fr/~jdetrey/FPLibrary

  31. Back slides
  • F16 SIMD instructions on general-purpose CPUs

  32. Microarchitectural assumptions for Pentium 4 and PowerPC G5
  • The new F16 instructions are compatible with the present implementation of the SIMD ISA extensions
  • 128-bit SIMD registers
  • Same number of SIMD registers
  • Most SIMD 16-bit integer instructions can be used for F16 data
  • Transfers
  • Logical instructions
  • Pack/unpack, shuffle, permutation instructions
  • New instructions
  • F16 arithmetic: add, sub, mul, div, sqrt
  • Conversion instructions: 16-bit integer to/from 16-bit FP, 8-bit integer to/from 16-bit FP

  33. Some P4 instruction examples
  Latencies and throughput values are similar to those of the corresponding P4 FP instructions:

  Instruction   Latency
  ADDF16        4
  MULF16        6
  CBL2F16       4
  CBH2F16       4
  CF162BL       4
  CF162BH       4

  [Figure: byte-to-half conversion instructions. CBL2F16 converts the low 8 bytes of an XMM register into 8 F16 values; CBH2F16 converts the high 8 bytes. A scalar model follows below.]
  Smaller latencies: ADDF16 2, MULF16 4, CONV 2
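A hedged scalar model of what the proposed CBL2F16 conversion would compute; these instructions are the authors' proposed extensions, not actual SSE opcodes, so the code below only illustrates the semantics, with all names ours:

  #include <stdint.h>

  typedef uint16_t f16;   /* half, stored as raw bits */

  /* Exact byte -> half encoding (every value 0..255 is representable). */
  static f16 byte_to_f16(uint8_t b)
  {
      if (b == 0) return 0;
      int e = 7;
      unsigned m = b;
      while (m < 128) { m <<= 1; e--; }   /* normalize to 1.xxxxxxx */
      /* exponent bias 15; top 7 fraction bits shifted into the 10-bit field */
      return (f16)(((e + 15) << 10) | ((m & 0x7Fu) << 3));
  }

  /* Semantic model of CBL2F16: convert the low 8 bytes of a 16-byte
     XMM-like register into 8 F16 values; CBH2F16 would take bytes 8..15. */
  void cbl2f16(const uint8_t src[16], f16 dst[8])
  {
      for (int i = 0; i < 8; i++)
          dst[i] = byte_to_f16(src[i]);
  }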

  34. Measures
  • Hardware "simulator"
  • IA-32: 2.4 GHz Pentium 4 with 768 MB, running Windows 2000
  • Intel C++ 8 compiler with the QxW option, "maximize speed"
  • Execution time measured with the RDTSC instruction
  • PowerPC: 1.6 GHz PowerPC G5 with 768 MB DDR400, running Mac OS X.3
  • Xcode programming environment including gcc 3.3
  • Measures: average values of at least 10 executions (excluding abnormal ones)

  35. SIMD execution time (1): Deriche benchmarks
  [Chart: execution times; asterisks mark incorrect results]
  • SIMD integer results are incorrect (insufficient dynamic range)
  • F16 results are close to the "incorrect" SIMD integer results
  • F16 results are significantly better than 32-bit FP results

  36. SIMD execution time (2): Scan benchmarks
  Cumulative sum (+scan) and sum of squares (+*scan) of the preceding pixel values; execution time according to the input-output formats. A sketch of the two scans follows below.
  [Chart: execution times; asterisks mark incorrect results]
  • Copy corresponds to the lower bound on execution time (memory-bound)
  • Byte-short for +scan, and byte-short and byte-integer for +*scan, give incorrect results (insufficient dynamic range)
  • Same findings as for the Deriche benchmarks
  • F16 results are close to the incorrect SIMD integer results
  • F16 results show a significant speedup compared to float-float for both scans, and compared to byte-float and float-float for +*scan
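A hedged sketch of the two scan kernels as we read them from the slide, written for one of the benchmarked input-output combinations (byte input, float output); the function names are ours:

  #include <stddef.h>

  /* +scan: cumulative sum of the preceding pixel values. */
  void plus_scan(const unsigned char *in, float *out, size_t n)
  {
      float acc = 0.0f;
      for (size_t i = 0; i < n; i++) {
          acc += in[i];
          out[i] = acc;
      }
  }

  /* +*scan: cumulative sum of squares of the preceding pixel values. */
  void plus_mul_scan(const unsigned char *in, float *out, size_t n)
  {
      float acc = 0.0f;
      for (size_t i = 0; i < n; i++) {
          acc += (float)in[i] * (float)in[i];
          out[i] = acc;
      }
  }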

  37. SIMD execution time (3): OpenGL data stream case
  • Compute, for each triangle, the min and max values of the vertex coordinates
  • Most of the computation time is spent in the AoS-to-SoA conversion (a sketch follows below)
  • Results: AltiVec is far better, but the relative F16/F32 speedup is similar
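A hedged scalar sketch of the kernel described here; the structure layout and all names are our assumptions:

  #include <stddef.h>

  typedef struct { float x, y, z; } Vertex;   /* AoS layout */

  /* AoS -> SoA conversion: per the slide, this is where most time goes. */
  void aos_to_soa(const Vertex *v, float *xs, float *ys, float *zs, size_t n)
  {
      for (size_t i = 0; i < n; i++) {
          xs[i] = v[i].x; ys[i] = v[i].y; zs[i] = v[i].z;
      }
  }

  /* Per-triangle min/max of one coordinate on the SoA array
     (3 consecutive entries per triangle). */
  void tri_minmax(const float *c, size_t ntri, float *mn, float *mx)
  {
      for (size_t t = 0; t < ntri; t++) {
          const float *p = c + 3 * t;
          float lo = p[0], hi = p[0];
          if (p[1] < lo) lo = p[1]; else if (p[1] > hi) hi = p[1];
          if (p[2] < lo) lo = p[2]; else if (p[2] > hi) hi = p[2];
          mn[t] = lo; mx[t] = hi;
      }
  }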

  38. Overall comparison (1/2/3)
  [Charts: speedup of the F16 version versus the float version (left), and of the F16 version versus the "incorrect" 16-bit integer version (right)]

  39. SIMD execution time (4): Wavelet transform
  [Chart: Pentium 4 F32/F16 speedup versus image size, for the horizontal, vertical and overall transforms]

  40. SIMD execution time (4): Wavelet transform
  [Chart: PowerPC F32/F16 execution time versus image size, for the horizontal, vertical and overall transforms]

  41. Chip area "rough" evaluation
  • Same approach as used by Talla et al. for the Mediabreeze architecture
  • VHDL models of the FP operators
  • J. Detrey and F. de Dinechin (ENS Lyon)
  • Non-pipelined and pipelined versions
  • Adder: close path and far path for the exponent values
  • Divider: radix-4 SRT algorithm
  • SQRT: radix-2 SRT algorithm
  • Cell-based library
  • ST 0.18 µm HCMOS8D technology
  • Cadence 4.4.3 synthesis tool (before placement and routing)
  • Limitations
  • Full-custom VLSI ≠ VHDL + cell-based library
  • The actual implementation in the P4 (G5) data path is not considered

  42. 16-bit and 64-bit operators
  • The two-path approach is too "costly" for a 16-bit FP adder; a straightforward approach would be sufficient

  43. Chip area evaluation
  [Table: chip area (mm²) of the FP functional units]
  • The 16-bit FP FU chip area is about 5.5% of the 64-bit FP FU
  • Eight such units would be 11% of the four corresponding 64-bit ones
