Wavelet “Block-Processing” for Reduced Memory Transfers

MAPLD 2005 Conference Presentation. William Turri (wturri@systranfederal.com), Ken Simone (kcsim07@yahoo.com). Systran Federal Corp., 4027 Colonel Glenn Highway, Suite 210, Dayton, OH 45431-1672, 937-429-9008 x104

Research Goals • Develop, test, and implement an efficient wavelet transform algorithm for fast hardware compression of SAR images • Algorithm should make optimal use of available memory, and minimize the number of memory access operations required to transform an image

Wavelet Transform Background (figure panels: Original Image; Row Transformed Image; Wavelet Transformed Image. Figure labels: Low Frequency (Scaling) Coefficients; High Frequency (Wavelet) Coefficients)

Multiple Resolution Levels (figure: Wavelet Transformed Images at MR-Level = 1, MR-Level = 2, and MR-Level = 3)

Results of Applying Wavelet Transform (figure labels: Scaling Coefficients; Wavelet Coefficients, MR-Level = 1; Wavelet Coefficients, MR-Level = 2; Wavelet Coefficients, MR-Level = 3)

Preliminary Investigation • Prior work has made use of the Integer Haar Wavelet Transform • This wavelet is computationally simple • Each filter (low and high pass) has only two taps • Haar does not provide a sharp separation between high and low frequencies • More complex wavelets provide generally better quality, at the cost of increased computational complexity
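To illustrate why a two-tap integer wavelet is computationally simple, here is a minimal sketch. It assumes the common S-transform form of the integer Haar (floor-average low pass, difference high pass); the slides do not give the exact integer formulas, so this variant and the names `haar_forward` and `haar_inverse` are illustrative assumptions.

```python
def haar_forward(row):
    """One level of an integer Haar (S-transform) on an even-length list."""
    low, high = [], []
    for i in range(0, len(row), 2):
        a, b = row[i], row[i + 1]
        high.append(a - b)          # two-tap high-pass: difference
        low.append((a + b) >> 1)    # two-tap low-pass: floor of the average
    return low, high


def haar_inverse(low, high):
    """Exact integer inverse of haar_forward."""
    row = []
    for s, d in zip(low, high):
        a = s + ((d + 1) >> 1)      # recover the first pixel of the pair
        b = a - d                   # recover the second pixel
        row.extend([a, b])
    return row


low, high = haar_forward([10, 12, 7, 3])
restored = haar_inverse(low, high)
```

Each output coefficient touches only two input pixels and needs one addition or subtraction plus a shift, which is what makes the filter cheap in hardware.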

Standard Memory Requirements • The most basic implementation of the transform requires that all rows be transformed by the wavelet filters before the columns • This approach requires an intermediate storage area, most likely SRAM or SDRAM on a hardware implementation • Writing row results to this intermediate area and reading them back for the column pass adds redundant memory access operations that greatly reduce the performance of the overall implementation

Standard Memory Requirements (figure: Original Image; Intermediate Image with L and H halves; Transformed Image with LL, HL, LH, and HH quadrants) • Intermediate Image Creates Redundant Memory Access…

Block-Based Approach • This approach processes rows and columns together, in a single operation, thus eliminating the need for the intermediate storage area • All memory writes to this intermediate area are eliminated • All memory reads from this intermediate area are eliminated • Performance is increased considerably
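The saving can be estimated with simple arithmetic. The sketch below is a rough count for one transform level, assuming every pixel is read once and written once per pass and ignoring word packing and caching; the function name `memory_accesses` is hypothetical.

```python
def memory_accesses(width, height):
    """Pixel-level reads and writes for one transform level."""
    pixels = width * height
    standard = {
        "reads": 2 * pixels,   # original image, then intermediate image
        "writes": 2 * pixels,  # intermediate image, then transformed image
    }
    block = {
        "reads": pixels,       # original image only
        "writes": pixels,      # transformed image only
    }
    return standard, block


standard, block = memory_accesses(512, 512)
```

Under these assumptions the block-based approach halves the external memory traffic, since the write and read of the intermediate image disappear.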

Block-Based Processing (1) Standard transform operations can be algebraically simplified…

Block-Based Processing (2) …to produce four fully transformed coefficients: LL, HL, LH, and HH.
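The equations from these two slides did not survive extraction, so the following is only a sketch of the idea. It assumes an unnormalized Haar (plain sum and difference, no division) so that the algebra is exact; the authors' integer formulas may differ. `block_haar` and `separable_haar` are hypothetical names.

```python
def block_haar(a, b, c, d):
    """2x2 block [[a, b], [c, d]] -> (LL, HL, LH, HH) in a single step."""
    ll = a + b + c + d
    hl = a - b + c - d   # high-pass along the row, low-pass down the column
    lh = a + b - c - d   # low-pass along the row, high-pass down the column
    hh = a - b - c + d
    return ll, hl, lh, hh


def separable_haar(a, b, c, d):
    """Reference: row pass into intermediate values, then column pass."""
    # Row pass: these four values are the intermediate image
    # that the block form never has to store.
    l_top, h_top = a + b, a - b
    l_bot, h_bot = c + d, c - d
    # Column pass over the intermediate values.
    return (l_top + l_bot, h_top + h_bot, l_top - l_bot, h_top - h_bot)


same = block_haar(5, 3, 2, 8) == separable_haar(5, 3, 2, 8)
```

Both functions give identical results; the block form simply substitutes the row-pass expressions into the column pass, so only the four input pixels are ever read.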

Other Wavelets? • The Integer Haar wavelet is easy to reduce algebraically • Only 4 pixels need to be read into the processor, managed, and transformed • SFC’s actual SAR compression solution uses the more complex 5/3 wavelet transform • 5/3 was found experimentally to preserve more quality when used to compress SAR images • This transform requires 5 pixels per row/column operation, and will require 25 pixels to be fetched, managed, and transformed • Reducing the 5/3 to a “block processing” approach incurs significant overhead for tracking the current location within the image • It is more feasible to seek a pipelined solution than to apply the “block processing” approach • Pipelining will reduce the inefficiencies introduced when managing intermediate transform data
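To show where the 5-pixel dependency comes from, here is a minimal one-dimensional sketch. It assumes the reversible 5/3 lifting form used in JPEG 2000 with symmetric border extension; the slides do not say which formulation SFC uses, and `lift_53` is a hypothetical name.

```python
def lift_53(x):
    """One level of the reversible 5/3 transform on an even-length list."""
    n = len(x)
    half = n // 2

    def mirror(i):
        # Symmetric extension past the right edge: x[n] -> x[n - 2].
        return x[i] if i < n else x[2 * (n - 1) - i]

    # Predict step: each high-pass output uses 3 input pixels.
    high = [x[2 * k + 1] - ((x[2 * k] + mirror(2 * k + 2)) >> 1)
            for k in range(half)]

    def d(k):
        # Symmetric extension on the left edge: d[-1] -> d[0].
        return high[0] if k < 0 else high[k]

    # Update step: two neighbouring high-pass values feed each low-pass
    # output, so it depends on 5 input pixels in total.
    low = [x[2 * k] + ((d(k - 1) + d(k) + 2) >> 2) for k in range(half)]
    return low, high


low, high = lift_53([1, 2, 3, 4, 5, 6, 7, 8])
```

Applying the same 5-pixel dependency along rows and columns gives the 5 x 5 = 25 pixel footprint mentioned above.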

Prefetch/Pipeline Approach • Another approach to improving performance is the pipelining of row and column transform operations • For a wavelet transform, the column transform operations can begin as soon as a minimum number of rows have been transformed • For the 5/3 operation, the column transform operations can begin once the first three rows have been transformed • Intermediate memory transfers are greatly reduced, although not eliminated • This approach will be dependent upon the specific processor and memory configuration being used for implementation
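The three-row start condition can be sketched as a streaming loop. This is a simplified model, not the hardware design: it applies only a 3-tap predict step down the columns (assumed here to be the 5/3 high-pass step), omits the update step and border handling, and uses the hypothetical name `column_highpass_stream`.

```python
def column_highpass_stream(transformed_rows):
    """Yield (rows_seen, high_row) as soon as each 3-row window is complete."""
    window = []
    for count, row in enumerate(transformed_rows, start=1):
        window.append(row)
        if len(window) == 3:
            top, mid, bot = window
            # Column step across every column of the three buffered rows.
            high = [m - ((t + b) >> 1) for t, m, b in zip(top, mid, bot)]
            yield count, high
            window = [bot]   # the bottom row is the top of the next window


rows = [[0, 0], [10, 20], [4, 8], [1, 1], [6, 2]]
results = list(column_highpass_stream(rows))
```

The first column result appears after the third row arrives, and only three rows are buffered at any time rather than a full intermediate image.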

Architecture • Our board, the Nallatech BenNUEY-PCI-4E, provides opportunities for parallel processing and pipelining • Nallatech BenNUEY-PCI-4E: Xilinx Virtex-II Pro (2VP50), 4 MB ZBT SRAM, Ethernet connectivity • Nallatech BenDATA-DD module: Xilinx Virtex-II (2V3000), 1 GB SDRAM

Design Challenges • Image data will be stored in the 1 GB SDRAM • Original data can occupy up to half the total space • Transformed data will occupy the other half • Memory is addressable as 32-bit words • Each memory read/write will involve four “packed” 8-bit pixel values • Row Challenge: each transform requires three pixels, but they can only be read in groups of four across a single row • Column Challenge: each transform requires three pixels, but they are packed by rows, not by columns
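The packing constraint can be made concrete with a short sketch. The byte order (first pixel in the most significant byte) is an assumption, since the slides do not specify it, and `unpack_word` and `pack_word` are hypothetical helper names.

```python
def unpack_word(word):
    """Split a 32-bit word into four 8-bit pixels, first pixel in the top byte."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def pack_word(pixels):
    """Pack four 8-bit pixels back into one 32-bit word."""
    word = 0
    for p in pixels:
        word = (word << 8) | (p & 0xFF)
    return word


pixels = unpack_word(0x0A0B0C0D)
word = pack_word(pixels)
```

Because every access delivers four horizontally adjacent pixels, a three-pixel window will regularly straddle two words along a row, and a three-pixel column window always spans three separate words.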

Row Solution • Prefetching/pipelining uses two 32-bit registers, WordA and WordB, which allow transform data to be prefetched so that the “pipe” is always full • Prefetching/pipelining enables efficient utilization of available resources • Prefetching/pipelining produces a deterministic, repeating pattern after only four operations, as shown on the following slides…

Row Operations (1) (in the original figure, shading marks the values active in the current operation)
• 1st Row Op: 1. Fetch 2 words (p0 p1 p2 p3 and p4 p5 p6 p7); 2. Load Word A = p0 p1 p2 p3; 3. Load Word B = p4 p5 p6 p7; 4. Perform 1st transform
• 2nd Row Op: 1. Don’t fetch; 2. Preserve Word A = p0 p1 p2 p3; 3. Preserve Word B = p4 p5 p6 p7; 4. Perform 2nd transform
• 3rd Row Op: 1. Don’t fetch; 2. Preserve Word A = p0 p1 p2 p3; 3. Preserve Word B = p4 p5 p6 p7; 4. Perform 3rd transform
• 4th Row Op: 1. Fetch next word (p8 p9 p10 p11); 2. Load Word A = p8 p9 p10 p11; 3. Preserve Word B = p4 p5 p6 p7; 4. Perform 4th transform

Row Operations (2) (in the original figure, shading marks the values active in the current operation)
• 5th Row Op: 1. Don’t fetch; 2. Preserve Word A = p8 p9 p10 p11; 3. Preserve Word B = p4 p5 p6 p7; 4. Perform 5th transform
• 6th Row Op: 1. Fetch next word (p12 p13 p14 p15); 2. Preserve Word A = p8 p9 p10 p11; 3. Load Word B = p12 p13 p14 p15; 4. Perform 6th transform
• 7th Row Op: 1. Don’t fetch; 2. Preserve Word A = p8 p9 p10 p11; 3. Preserve Word B = p12 p13 p14 p15; 4. Perform 7th transform
• 8th Row Op: 1. Fetch next word (p16 p17 p18 p19); 2. Load Word A = p16 p17 p18 p19; 3. Preserve Word B = p12 p13 p14 p15; 4. Perform 8th transform
• Etc…
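The fetch pattern in the two Row Operations slides can be reproduced with a small simulation. It assumes that row op k works on the three-pixel window p[2k-2] to p[2k] (an inference from the slides, not something they state), that the first op primes both registers, and that even-numbered words go to Word A and odd-numbered words to Word B. `row_fetch_schedule` is a hypothetical name.

```python
def row_fetch_schedule(num_ops, pixels_per_word=4):
    """Return, for each row op, the list of (word_index, register) loads."""
    resident = {"A": 0, "B": 1}
    schedule = [[(0, "A"), (1, "B")]]      # 1st op fetches two words
    for op in range(2, num_ops + 1):
        last_pixel = 2 * op                # window is p[2*op - 2 .. 2*op]
        word = last_pixel // pixels_per_word
        reg = "A" if word % 2 == 0 else "B"
        loads = []
        if resident[reg] != word:          # fetch only when the word is missing
            resident[reg] = word
            loads.append((word, reg))
        schedule.append(loads)
    return schedule


schedule = row_fetch_schedule(8)
```

Under these assumptions the schedule fetches on ops 1, 4, 6, and 8, alternating between Word A and Word B, which matches the slides: from the fourth operation on, the pattern is one fetch every other operation.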

Column Solution • Since four pixels must be read from across four columns, an efficient solution is to process four columns in parallel • Rather than transforming one column completely, we will transform four columns partially • For efficiency, column processing will begin as soon as three rows have been fully transformed • Row processing will continue after column processing has begun!

Column Operations (figure: grid of row-transformed coefficients labeled d1 and r1, repeated once per operation) • Operation 1: Process first coefficients for columns 0 - 3 • Operation 2: Process first coefficients for columns 4 - 7 • Operation 3: Process first coefficients for columns 8 - 11 • Operation 4: Process first coefficients for columns 12 - 15
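One column operation on four columns at once might look like the sketch below. It assumes one packed word from each of three consecutive rows, first pixel in the most significant byte, and, purely for simplicity, coefficients that still fit in unsigned 8 bits (real row-transformed coefficients may be signed or wider). The 3-tap step shown is the assumed 5/3 predict step, and `column_op` is a hypothetical name.

```python
def column_op(word_top, word_mid, word_bot):
    """Apply one 3-tap column step to four adjacent columns at once."""
    def unpack(word):
        # Four 8-bit values per 32-bit word, first column in the top byte.
        return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]

    top, mid, bot = unpack(word_top), unpack(word_mid), unpack(word_bot)
    # Each output column needs one value from each of the three rows.
    return [m - ((t + b) >> 1) for t, m, b in zip(top, mid, bot)]


coeffs = column_op(0x00000000, 0x0A141E28, 0x04080C10)
```

Three word reads yield partial results for four columns, so no fetched pixel is discarded; moving to columns 4 - 7 repeats the same step on the next word of each row.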

Ideal Implementation • An ideal implementation would use only the resources (registers and memory) available internally on the FPGA • Eliminates slow interfaces between chips and external memory • Provides great flexibility in how memory is managed • Impractical in today’s FPGA devices • Internal resources are too limited in single devices • Sufficient resources across multiple devices are prohibitively expensive

Alternate Implementation 1 • One alternate implementation would use only the FPGA and memory available on the BenDATA-DD module • Simplifies the interface between the FPGA and the memory, since the FPGA and SDRAM are located in close physical proximity and without intermediate devices • A complete implementation of wavelet compression may not fit into the single FPGA on the BenDATA-DD module

Alternate 1, Level 1 (figure: WPT Pass 1a; WPT Pass 1b)

Alternate Implementation 2 • A second alternate implementation would distribute processing between multiple FPGAs (on the motherboard and the module) and between the SRAM (motherboard) and SDRAM (module) • Allows larger design to be distributed among multiple devices • Increases opportunities for parallel processing • Increases design complexity and decreases performance • Data must be shared between the two FPGAs via some transport mechanism (such as a FIFO)

Wavelet Transform Level 1 (figure: WPT Pass 1a; WPT Pass 1b; the row-transformed coefficients are the source for the column transform)

Wavelet Transform Level 2 (figure: WPT Pass 2a; WPT Pass 2b)

Wavelet Transform Level 3 (figure: WPT Pass 3a; WPT Pass 3b)

Conclusions/Suggestions • A prefetch/pipelined implementation of the 5/3 wavelet transform effectively removes the need for redundant access to intermediate data • This implementation can be extended to other wavelet transforms (more or less complex than the 5/3) • Final implementation of prefetching/pipelining will depend upon the architecture being used, and details such as memory bus width
