Overview September 2004

High Speed Energy Efficient Architecture for Finite Ridgelet Transform
Shrutisagar Chandrasekaran and Abbes Amira

Outline • Research Objectives • Introduction • Discrete Ridgelet Transform • Finite Radon Transform • Discrete Wavelet Transform • FRIT Architecture • FPGA Implementations and Results • Conclusions • Future Work and Acknowledgements

Research Objectives • To evaluate and model the power consumption of FPGA-based designs at various levels of abstraction, and to develop and implement strategies for low-power, energy-efficient design • To develop a high-level framework for FPGA-based implementation of matrix algorithms used in image and signal processing applications, such as the Ridgelet transform, matrix multiplication, SVD, DCT and DWT • To efficiently implement the Finite Ridgelet Transform (FRIT) on FPGA using Handel-C, for satellite-based onboard image compression, within the ongoing framework development

Research Objectives • Estimating performance measures (power, area, maximum frequency, etc.) • Capturing platform features at a higher level • [Framework diagram; labels: Application User, System Architect; RLC, DWT, CSC; FPGA Configuration, Implementation, Reconfiguration, Compilation; MM, DCT (1D, 2D), FFT (1D, 2D), DWT (1D, 2D), FRAT (1D, 2D), FRIT (1D, 2D), SVD, QR; VHDL, Handel-C, Schematic, Hybrid; EDIF, Bitstream]

Introduction • Discrete Wavelet Transforms (DWT) have become powerful tools in a wide range of applications including • Image/Video Compression (JPEG2000, MPEG-4) • Aerospace applications (Data denoising, Satellite/Astronomical image compression, analysis) • Image/Video Enhancement, Segmentation • Telecommunication • The advantage of DWT over existing transforms such as Discrete Fourier Transform (DFT) and Discrete Cosine Transform (DCT) is that the DWT performs a multiresolution analysis of a signal with localization in both time and frequency

Introduction • The wavelet transform has many limitations when it comes to representing straight lines and edges in images • To overcome the weakness of wavelets in higher dimensions, Candès recently proposed the Ridgelet transform, which deals effectively with line singularities in 2-D • However, the complexity of its implementation remains a heavy burden on standard microprocessors when large amounts of data have to be processed • Therefore, VLSI/FPGA implementations of the Ridgelet transform are needed for real-time applications

The Finite Ridgelet Transform • The FRIT provides a sparse representation for functions defined on the continuum plane • The transform allows edges and other singularities along curves to be represented more efficiently • The basic idea is to map a line singularity in the two-dimensional (2-D) domain into a point by means of the Radon transform • Then, a one-dimensional (1-D) wavelet transform is applied to deal with the point singularity in the Radon domain

The Finite Ridgelet Transform

The Finite Ridgelet Transform • The two fundamental building blocks of the FRIT are the FRAT and the DWT • The FRAT pseudocode is mapped onto hardware after performing energy and speed optimisations, including parallelism and pipelining • Experimental results in Matlab have shown that simple lower-order wavelets yield better compression (lower entropy) when transforming from the FRAT domain to the FRIT domain • The Haar wavelet gives better results than the CDF 2.2 and other higher-order wavelets in terms of minimising the entropy in the Ridgelet domain

The Finite Ridgelet Transform • The Radon transform maps two-dimensional images containing lines into a domain of possible line parameters, where each line in the image gives a peak positioned at the corresponding line parameters • Numerous discretisations of the Radon transform have been devised to approximate the continuous formulae • However, most of them were not designed to be invertible transforms for digital images • As an alternative, the theory of the Finite Radon Transform (FRAT), a transform for finite-length signals, was developed
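The peak behaviour described above can be checked on a small finite example. The following Python sketch is an illustration only: the helper name `frat`, the line convention i = k*j + l (mod p) (read off from the FRAT pseudocode in this deck) and the 7 x 7 test image are assumptions, not the authors' implementation.

```python
def frat(f, p):
    """Unnormalised finite Radon transform of a p x p image (p prime)."""
    out = [[0] * p for _ in range(p + 1)]
    for i in range(p):
        for j in range(p):
            for k in range(p):
                out[k][(i - k * j) % p] += f[i][j]
            out[p][j] += f[i][j]
    return out

p = 7
k0, l0 = 3, 2   # hypothetical line parameters
image = [[0] * p for _ in range(p)]
for j in range(p):
    image[(k0 * j + l0) % p][j] = 1   # draw the line i = k0*j + l0 (mod p)

r = frat(image, p)
# Largest coefficient together with its position (value, k, l)
peak = max((r[k][l], k, l) for k in range(p + 1) for l in range(p))
print(peak)   # (7, 3, 2)
```

All p pixels of the line fall into the single coefficient (k0, l0). Every line of a different direction crosses it exactly once, so the remaining projections each hold p coefficients equal to 1.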

The Finite Radon Transform • The FRAT is defined as summations of image pixels over a certain set of lines • L_{k,l} denotes the set of points that make up a line on the lattice Z_p^2 • To compute the kth Radon projection, i.e., the kth row of the array, we need to pass over all pixels of the original image once and use p histogrammers, one for every pixel in the row
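The line sets can be written out explicitly. The form below is read off from the index arithmetic of the FRAT pseudocode in this deck, so the axis convention and the absence of a normalisation factor are assumptions carried over from that pseudocode, not a reproduction of the slide's own formula.

```latex
\begin{align*}
L_{k,l} &= \{(i,j) : i \equiv kj + l \pmod{p},\ j \in Z_p\}, \quad 0 \le k < p,\\
L_{p,l} &= \{(i,l) : i \in Z_p\},\\
\mathrm{FRAT}(k,l) &= \sum_{(i,j) \in L_{k,l}} f(i,j), \quad 0 \le k \le p,\ 0 \le l < p.
\end{align*}
```

For each of the p + 1 directions k, the p lines indexed by l partition the lattice, which is why one pass over the image is enough to fill a whole projection.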

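The Finite Radon Transform • FRAT pseudocode (MATLAB syntax; loop indices are zero-based as on the original slide, with +1 offsets for MATLAB's one-based arrays; f is a p-by-p image with p prime):

    FRAT = zeros(p+1, p);            % (p+1) projections of length p
    for k = 0:(p-1)                  % one projection per direction k
        n = k;
        for j = 0:(p-1)
            n = n - k;               % n = (-k*j) mod p
            if n < 0
                n = n + p;
            end
            l = n - 1;
            for i = 0:(p-1)
                l = l + 1;           % l = (i - k*j) mod p
                if l >= p
                    l = l - p;
                end
                FRAT(k+1, l+1) = FRAT(k+1, l+1) + f(i+1, j+1);
            end
        end
    end
    for j = 0:(p-1)                  % last projection (k = p)
        for i = 0:(p-1)
            FRAT(p+1, j+1) = FRAT(p+1, j+1) + f(i+1, j+1);
        end
    end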
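The same computation can be expressed with modular arithmetic instead of running counters. This is a minimal Python sketch, not the authors' hardware mapping; the function name `frat` and the 3 x 3 sample image are illustrative assumptions.

```python
def frat(f, p):
    """Unnormalised FRAT of a p x p image f (list of rows), p prime.

    Row k < p accumulates pixels along i = k*j + l (mod p);
    row p accumulates f(i, j) over i for each fixed j.
    """
    out = [[0] * p for _ in range(p + 1)]
    for k in range(p):
        for j in range(p):
            for i in range(p):
                out[k][(i - k * j) % p] += f[i][j]
    for j in range(p):
        for i in range(p):
            out[p][j] += f[i][j]
    return out

p = 3
f = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
r = frat(f, p)
for row in r:
    print(row, sum(row))
```

Each projection visits every pixel exactly once, so every row of the result sums to the image total (45 for this sample).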

Discrete Wavelet Transform • The work by Daubechies and Mallat led to the discrete filter based interpretation of wavelets • Wavelets can be implemented as a set of filter banks comprising a high-pass and a low-pass filter, each followed by down-sampling by two
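As a small illustration of the filter-bank view, the Python sketch below applies a two-tap low-pass and high-pass filter pair and keeps every second output. The Haar taps and the function name `haar_analysis` are chosen here for illustration; the slide does not specify filter coefficients.

```python
import math

def haar_analysis(x):
    """One level of a two-channel Haar filter bank on an even-length signal.

    Each filter output is evaluated only at every second position, which is
    equivalent to filtering followed by down-sampling by two.
    """
    s = 1 / math.sqrt(2)
    low = [s * (x[n] + x[n + 1]) for n in range(0, len(x), 2)]
    high = [s * (x[n] - x[n + 1]) for n in range(0, len(x), 2)]
    return low, high

signal = [4, 4, 2, 6]
low, high = haar_analysis(signal)
energy_in = sum(v * v for v in signal)
energy_out = sum(v * v for v in low + high)
```

The constant pair (4, 4) produces a zero high-pass coefficient, and with these orthonormal taps the energy of the signal is preserved across the two channels.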

Discrete Wavelet Transform • Though it is the simplest wavelet, the Haar DWT gives the best performance in terms of entropy reduction • An integer-to-integer lifting version of the Haar DWT is used to ensure that it is fully invertible • An in-place transform is performed to reduce the number and size of on-chip buffers
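One common integer-to-integer lifting form of the Haar transform (often called the S-transform) is sketched below in Python. The slides do not list the lifting steps, so the exact predict and update formulas and the function names are assumptions; the point is only to show in-place operation and exact invertibility.

```python
def haar_lift_forward(x):
    """In-place integer Haar lifting on a list of ints of even length.

    After the call, odd positions hold details d and even positions hold
    approximations s.
    """
    for n in range(0, len(x) - 1, 2):
        x[n + 1] -= x[n]          # predict: d = odd - even
        x[n] += x[n + 1] >> 1     # update:  s = even + floor(d / 2)

def haar_lift_inverse(x):
    """Undo haar_lift_forward by running the lifting steps in reverse."""
    for n in range(0, len(x) - 1, 2):
        x[n] -= x[n + 1] >> 1
        x[n + 1] += x[n]

data = [5, 8, 3, 3, 10, 1]
work = list(data)
haar_lift_forward(work)       # work is now [6, 3, 3, 0, 5, -9]
restored = list(work)
haar_lift_inverse(restored)   # back to the original samples
```

No second buffer is needed because each step overwrites one sample of a pair using only the other sample, and the inverse subtracts exactly what the forward step added, so the rounding never loses information.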

FRIT Architecture • Once the Radon and wavelet transforms have been implemented, the Ridgelet transform is straightforward • Each output of the Radon projection, i.e., each row of the Radon-transformed image, is simply passed through the wavelet transform • A dual output buffer configuration is used so that the FRAT and the DWT can be performed simultaneously on the chip • The in-place lifting DWT is performed in the second output buffer, which contains the FRAT vectors
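The composition described above can be sketched in software as follows. This Python example is a behavioural illustration under stated assumptions, not the hardware design: it uses an unnormalised FRAT, a single level of integer Haar lifting, and it leaves the odd trailing sample of each length-p row untouched (the slides do not say how odd lengths are handled). All function names are hypothetical.

```python
def frat(f, p):
    """Unnormalised finite Radon transform of a p x p image (p prime)."""
    out = [[0] * p for _ in range(p + 1)]
    for k in range(p):
        for j in range(p):
            for i in range(p):
                out[k][(i - k * j) % p] += f[i][j]
    for j in range(p):
        for i in range(p):
            out[p][j] += f[i][j]
    return out

def haar_lift(row):
    """One in-place integer Haar lifting level.

    Assumption: an odd trailing sample is left unchanged.
    """
    for n in range(0, len(row) - 1, 2):
        row[n + 1] -= row[n]
        row[n] += row[n + 1] >> 1

def frit(f, p):
    """FRAT followed by a 1-D Haar step on each projection (each row)."""
    r = frat(f, p)
    for row in r:
        haar_lift(row)
    return r

p = 7
flat = [[9] * p for _ in range(p)]   # constant test image
coeffs = frit(flat, p)
```

For the constant image every FRAT coefficient equals 7 * 9 = 63, so all detail coefficients come out as zero and each row becomes [63, 0, 63, 0, 63, 0, 63].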

FRIT Architecture • One input pixel is processed on each clock cycle • No clock edges are wasted in buffering the input tile • Fully pipelined input section • The controller has (p+1) counters, which generate the address and read/write status of the output vectors • Double-buffered output section to perform the DWT in parallel
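A software model can make the counter-based addressing concrete. The Python sketch below is one hypothetical scheme consistent with the bullets above (one pixel per step, p + 1 address counters); the actual controller is not shown in the slides, and the update rules here are derived from the FRAT pseudocode rather than from the hardware.

```python
def frat_streaming(pixels, p):
    """Accumulate the FRAT from a pixel stream, one pixel per loop step.

    pixels yields f(i, j) with i varying fastest, as in the FRAT pseudocode.
    addr[k] plays the role of the k-th address counter.
    """
    acc = [[0] * p for _ in range(p + 1)]
    addr = [0] * (p + 1)
    i = 0
    for value in pixels:
        for k in range(p + 1):
            acc[k][addr[k]] += value          # all p+1 vectors updated together
        i += 1
        if i < p:
            for k in range(p):
                addr[k] = (addr[k] + 1) % p   # step along the current pass
        else:                                 # end of one pass over i
            i = 0
            for k in range(p):
                addr[k] = (addr[k] + 1 - k) % p
            addr[p] = (addr[p] + 1) % p
    return acc

p = 3
f = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
stream = [f[i][j] for j in range(p) for i in range(p)]
result = frat_streaming(stream, p)
```

In this model counter k always holds (i - k*j) mod p without any multiplication, which is the kind of property that lets every projection be updated from the same incoming pixel in a single step.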

FRIT Architecture • The p+1 FRAT vectors are decomposed in parallel, where p is the block size • A lifting architecture is used to perform the 1-D Haar wavelet transform • In-place decomposition is performed to reduce the internal buffer size

FRIT Architecture

FPGA Implementations and Results • To verify the performance of the proposed architectures, the designs have been prototyped on the Celoxica RC1000 board, which contains the Xilinx XCV2000E FPGA • Available on-chip logic resources include: Slices: 19,200; CLB array: 80 x 120; Block RAM: 655,360 bits; Distributed RAM: 614,400 bits • The RC1000 has four memory banks, which communicate with the host by means of DMA transfers

FPGA Implementations and Results • The design has also been synthesised on the Radiation Hardened QPro Virtex-II FPGA, as it is the preferred Xilinx FPGA for deployment onboard satellites • Industry-first radiation-hardened platform FPGA solution • Guaranteed total ionizing dose to 200 krad(Si) and latch-up immune to LET > 160 MeV-cm^2/mg • SEU upsets < 1.5E-6 per device day achievable with the recommended redundancy implementation • Certified to the MIL-PRF-38535 standard • Guaranteed over the full military temperature range (-55 °C to +125 °C)

FPGA Implementations and Results Design Flow

FPGA Implementations and Results • Handel-C adds constructs to ANSI-C to enable DK to directly implement hardware • Fully synthesizable hardware programming language based on ANSI-C • Implements a C algorithm directly as optimized FPGA hardware or outputs RTL from C • Handel-C additions for hardware: Parallelism, Timing, Interfaces, Clocks, Macro pre-processor, RAM/ROM, Shared expression, Communications, Handel-C libraries, FP library, Bit manipulation • Majority of ANSI-C constructs supported by DK: Control statements (if, switch, case, etc.), Integer arithmetic, Functions, Pointers, Basic types (structures, arrays, etc.), #define, #include • Software-only ANSI-C constructs: Recursion, Side effects, Standard libraries, malloc

FPGA Implementations and Results: FRAT Implementation • An empirical study has shown that a block size of p = 7 gives the best balance of power and performance for the FRAT

FPGA Implementations and Results • Comparison of performance metrics of the FRAT sub-block with existing work [1] • [1] C. A. Rahman and W. Badawy, “Architectures the Finite Radon Transform”, IEE Electronics Letters, Vol. 40, No. 15, July 2004 • * Implemented using Matlab on a 1.8 GHz Pentium 4 workstation equipped with 1 GB DDR RAM

FPGA Implementations and Results Various performance metrics of the FRIT implemented on the Virtex-E and the QPro Virtex-II FPGAs

FPGA Implementations and Results • FRIT achieves the best results in terms of reducing the entropy of the image • This means that better compression can be achieved
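The link between entropy and compressibility can be illustrated with a short calculation. The Python sketch below computes first-order Shannon entropy of a list of coefficients; the slides do not state which entropy measure was used, so this choice, the function name `entropy_bits` and the two sample sequences are assumptions for illustration.

```python
import math
from collections import Counter

def entropy_bits(values):
    """First-order Shannon entropy, in bits per symbol, of a sequence."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Eight distinct values: every symbol is equally likely
spread = entropy_bits([0, 1, 2, 3, 4, 5, 6, 7])
# Mostly zeros with one large coefficient, as a sparse transform might give
sparse = entropy_bits([0, 0, 0, 0, 0, 0, 9, 0])
```

The evenly spread sequence needs 3 bits per symbol, while the sparse one needs roughly 0.54 bits per symbol, which is why a transform that concentrates the image into few significant coefficients allows an entropy coder to achieve better compression.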

FPGA Implementations and Results [Figure panels: Source 1; FRIT Domain]

Conclusions • The Ridgelet transform was recently introduced to overcome the weakness of wavelet transforms • An architecture for the Finite Ridgelet transform and its efficient FPGA implementation have been proposed • The implementations have been carried out for different input image sources • The implementation results show that the proposed implementation outperforms existing work in terms of both area and system speed

Future Work and Acknowledgements • Develop a complete on-chip compression engine for satellite images • Explore the effect of algorithmic, architectural and RTL-level optimisations to minimise power consumption • Acknowledgements: Celoxica (Mr. Roger Gook) and EPSRC for supporting this work
