Wavelet “Block-Processing” for Reduced Memory Transfers
MAPLD 2005 Conference Presentation

William Turri (wturri@systranfederal.com)
Ken Simone (kcsim07@yahoo.com)

Systran Federal Corp.
4027 Colonel Glenn Highway, Suite 210
Dayton, OH 45431-1672
937-429-9008 x104

Research Goals
  • Develop, test, and implement an efficient wavelet transform algorithm for fast hardware compression of SAR images
    • Algorithm should make optimal use of available memory, and minimize the number of memory access operations required to transform an image
Wavelet Transform Background

[Figure: Original Image, Row Transformed Image, and Wavelet Transformed Image, with regions labeled Low Frequency (Scaling) Coefficients and High Frequency (Wavelet) Coefficients]

Multiple Resolution Levels

[Figure: Wavelet Transformed Images at MR-Level = 1, MR-Level = 2, and MR-Level = 3]

Results of Applying Wavelet Transform

[Figure: transformed image with regions labeled Scaling Coefficients and Wavelet Coefficients for MR-Level = 1, MR-Level = 2, and MR-Level = 3]
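As a rough illustration of how the regions in such a figure shrink with each resolution level, the sketch below computes subband sizes. It assumes a dyadic decomposition in which only the scaling (low frequency) quadrant is transformed again at each level, and image dimensions divisible by 2**levels; the function name `subband_shapes` is made up for this example and does not come from the slides.

```python
def subband_shapes(height, width, levels):
    """Return the wavelet-subband shape at each MR level and the final scaling shape.

    Assumption: only the scaling quadrant is decomposed again at each level.
    """
    if height % (1 << levels) or width % (1 << levels):
        raise ValueError("dimensions must be divisible by 2**levels")
    detail = {}
    h, w = height, width
    for level in range(1, levels + 1):
        h, w = h // 2, w // 2      # each level halves both dimensions
        detail[level] = (h, w)     # shape of each wavelet-coefficient quadrant
    return detail, (h, w)          # (h, w) is the remaining scaling block


detail, scaling = subband_shapes(512, 512, 3)
```

For a 512 x 512 image and three levels, the wavelet quadrants measure 256, 128, and 64 pixels on a side, and the remaining scaling block is 64 x 64.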

Preliminary Investigation
  • Prior work has made use of the Integer Haar Wavelet Transform
    • This wavelet is computationally simple
    • Each filter (low and high pass) has only two taps
    • Haar does not provide a sharp separation between high and low frequencies
  • More complex wavelets provide generally better quality, at the cost of increased computational complexity
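The slides do not spell out the Integer Haar filters. A minimal sketch, assuming the common reversible form (truncated average for the low pass, difference for the high pass), is given below; the function names are illustrative only.

```python
def haar_forward(a, b):
    """Two-tap integer Haar step (assumed form): truncated average and difference."""
    s = (a + b) // 2   # low pass (scaling) coefficient
    d = a - b          # high pass (wavelet) coefficient
    return s, d


def haar_inverse(s, d):
    """Exactly undo haar_forward using integer arithmetic only."""
    a = s + (d + 1) // 2
    b = a - d
    return a, b


def haar_forward_1d(samples):
    """Transform an even-length row: scaling coefficients first, then wavelet coefficients."""
    pairs = [haar_forward(samples[i], samples[i + 1])
             for i in range(0, len(samples), 2)]
    return [s for s, _ in pairs] + [d for _, d in pairs]


row = haar_forward_1d([10, 12, 200, 100])
```

Each output coefficient depends on only two input pixels, which is what makes this wavelet so cheap to compute.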
Standard Memory Requirements
  • The most basic implementation of the transform requires that all rows be transformed by the wavelet filters before the columns
    • This approach requires an intermediate storage area, most likely SRAM or SDRAM on a hardware implementation
    • These redundant memory access operations greatly reduce the performance of the overall implementation
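To make the cost of the intermediate area concrete, the following sketch runs a rows-then-columns transform in software and counts pixel accesses. It is only an illustration: it assumes the integer Haar step (truncated average and difference), models memory as Python lists, and uses made-up names such as `two_pass_haar`.

```python
def haar_pair(a, b):
    """Integer Haar step (assumed form): truncated average and difference."""
    return (a + b) // 2, a - b


def two_pass_haar(image):
    """Row pass into an intermediate buffer, then column pass; count pixel accesses."""
    h, w = len(image), len(image[0])
    counts = {"reads": 0, "writes": 0, "intermediate": 0}
    intermediate = [[0] * w for _ in range(h)]
    for r in range(h):                      # pass 1: transform every row
        for c in range(0, w, 2):
            lo, hi = haar_pair(image[r][c], image[r][c + 1])
            intermediate[r][c // 2] = lo
            intermediate[r][w // 2 + c // 2] = hi
            counts["reads"] += 2
            counts["writes"] += 2
            counts["intermediate"] += 2     # writes into the intermediate area
    out = [[0] * w for _ in range(h)]
    for c in range(w):                      # pass 2: transform every column
        for r in range(0, h, 2):
            lo, hi = haar_pair(intermediate[r][c], intermediate[r + 1][c])
            out[r // 2][c] = lo
            out[h // 2 + r // 2][c] = hi
            counts["reads"] += 2
            counts["writes"] += 2
            counts["intermediate"] += 2     # reads back from the intermediate area
    return out, counts


out, counts = two_pass_haar([[8] * 4 for _ in range(4)])
```

For an H x W image this model performs 2·H·W reads and 2·H·W writes, and half of that traffic (2·H·W accesses) touches only the intermediate area.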
Standard Memory Requirements

[Figure: Original Image, Intermediate Image (L and H halves), and Transformed Image (LL, HL, LH, HH quadrants)]

Intermediate Image Creates Redundant Memory Access…

Block-Based Approach
  • This approach processes rows and columns together, in a single operation, thus eliminating the need for the intermediate storage area
    • All memory writes to this intermediate area are eliminated
    • All memory reads from this intermediate area are eliminated
    • Performance is increased considerably
Block-Based Processing (1)

Standard transform operations can be algebraically simplified…

Block-Based Processing (2)

…to produce four fully transformed coefficients: LL, HL, LH, HH.
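The simplified equations from the slide are not reproduced in this transcript. As a hedged sketch of the idea, the code below reads one 2 x 2 block, applies the row and column steps in local variables, and writes the four finished coefficients, with no intermediate image. It assumes the integer Haar step (truncated average and difference) and a quadrant layout of LL top-left, HL top-right, LH bottom-left, HH bottom-right; all names are illustrative.

```python
def haar_pair(a, b):
    """Integer Haar step (assumed form): truncated average and difference."""
    return (a + b) // 2, a - b


def block_haar(image):
    """Transform each 2x2 block in one operation; count pixel reads and writes."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    reads = writes = 0
    for i in range(h // 2):
        for j in range(w // 2):
            a, b = image[2 * i][2 * j], image[2 * i][2 * j + 1]
            c, d = image[2 * i + 1][2 * j], image[2 * i + 1][2 * j + 1]
            reads += 4
            l0, h0 = haar_pair(a, b)          # row step, upper row of the block
            l1, h1 = haar_pair(c, d)          # row step, lower row of the block
            ll, lh = haar_pair(l0, l1)        # column step on the low-pass pair
            hl, hh = haar_pair(h0, h1)        # column step on the high-pass pair
            out[i][j] = ll
            out[i][w // 2 + j] = hl
            out[h // 2 + i][j] = lh
            out[h // 2 + i][w // 2 + j] = hh
            writes += 4
    return out, reads, writes


def two_pass_reference(image):
    """Rows-then-columns version with an intermediate image, for comparison."""
    h, w = len(image), len(image[0])
    rows = []
    for row in image:
        pairs = [haar_pair(row[c], row[c + 1]) for c in range(0, w, 2)]
        rows.append([lo for lo, _ in pairs] + [hi for _, hi in pairs])
    out = [[0] * w for _ in range(h)]
    for c in range(w):
        for r in range(0, h, 2):
            out[r // 2][c], out[h // 2 + r // 2][c] = haar_pair(rows[r][c], rows[r + 1][c])
    return out


image = [[(7 * r + 13 * c) % 256 for c in range(6)] for r in range(4)]
result, reads, writes = block_haar(image)
```

In this model each pixel is read once and each coefficient is written once (H·W reads and H·W writes), and the output matches the two-pass reference exactly.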

Other Wavelets?
  • The Integer Haar wavelet is easy to reduce algebraically
    • Only 4 pixels need to be read into the processor, managed, and transformed
  • SFC’s actual SAR compression solution uses the more complex 5/3 wavelet transform
    • 5/3 was found experimentally to preserve more quality when used to compress SAR images
    • This transform requires 5 pixels per row/column operation, and will require 25 pixels to be fetched, managed, and transformed
    • Reducing the 5/3 to a “block processing” approach incurs significant overhead for tracking the current location within the image
  • It is more feasible to seek a pipelined solution than to apply the “block processing” approach
    • Pipelining will reduce the inefficiencies introduced when managing intermediate transform data
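The slides name the 5/3 wavelet without giving its equations. The sketch below assumes the reversible integer 5/3 lifting form used in JPEG 2000, with mirrored samples at the edges; whether SFC's implementation uses exactly this form is not stated in the source. In this form each high pass coefficient is computed from three neighbouring samples, and each low pass coefficient ultimately depends on five.

```python
def fwd53(x):
    """Forward 1-D integer 5/3 lifting (assumed JPEG 2000 style) on an even-length list."""
    n = len(x)
    half = n // 2
    d = []
    for k in range(half):
        # Predict: odd sample minus the truncated mean of its two even neighbours.
        right = x[2 * k + 2] if 2 * k + 2 < n else x[2 * k]   # mirror at the right edge
        d.append(x[2 * k + 1] - (x[2 * k] + right) // 2)
    s = []
    for k in range(half):
        # Update: even sample plus a rounded quarter of the two nearest high pass values.
        left = d[k - 1] if k > 0 else d[0]                    # mirror at the left edge
        s.append(x[2 * k] + (left + d[k] + 2) // 4)
    return s, d


def inv53(s, d):
    """Inverse of fwd53; reconstructs the original samples exactly."""
    half = len(s)
    n = 2 * half
    x = [0] * n
    for k in range(half):                                     # undo the update step
        left = d[k - 1] if k > 0 else d[0]
        x[2 * k] = s[k] - (left + d[k] + 2) // 4
    for k in range(half):                                     # undo the predict step
        right = x[2 * k + 2] if 2 * k + 2 < n else x[2 * k]
        x[2 * k + 1] = d[k] + (x[2 * k] + right) // 2
    return x


samples = [10, 12, 14, 200, 90, 7, 0, 255]
low, high = fwd53(samples)
```

Because neighbouring windows overlap, a block-style reduction must track where each window sits in the image, which is the overhead the bullets above refer to.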
Prefetch/Pipeline Approach
  • Another approach to improving performance is the pipelining of row and column transform operations
    • For a wavelet transform, the column transform operations can begin as soon as a minimum number of rows have been transformed
    • For the 5/3 operation, the column transform operations can begin once the first three rows have been transformed
  • Intermediate memory transfers are greatly reduced, although not eliminated
  • This approach will be dependent upon the specific processor and memory configuration being used for implementation
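A software sketch of this pipelining idea follows. It assumes the integer 5/3 lifting form used in JPEG 2000 with mirrored edges and an image with even height and width; the names `column_step` and `pipelined_53` are invented for the example. Rows are row-transformed one at a time, and a column step fires as soon as three transformed rows are buffered, so the full intermediate image is never held.

```python
def fwd53(x):
    """Forward 1-D integer 5/3 lifting (assumed form) on an even-length list."""
    n, half = len(x), len(x) // 2
    d = [x[2 * k + 1] - (x[2 * k] + (x[2 * k + 2] if 2 * k + 2 < n else x[2 * k])) // 2
         for k in range(half)]
    s = [x[2 * k] + ((d[k - 1] if k else d[0]) + d[k] + 2) // 4 for k in range(half)]
    return s, d


def column_step(top, mid, bot, d_prev):
    """One column lifting step applied across whole rows (one value per column)."""
    d = [m - (t + b) // 2 for t, m, b in zip(top, mid, bot)]
    left = d if d_prev is None else d_prev        # mirror at the top edge
    s = [t + (l + c + 2) // 4 for t, l, c in zip(top, left, d)]
    return s, d


def pipelined_53(image):
    """Start column steps once three row-transformed rows are available."""
    low, high, window, d_prev, peak = [], [], [], None, 0
    for raw in image:
        s, d = fwd53(raw)                         # row transform of the incoming row
        window.append(s + d)
        peak = max(peak, len(window))
        if len(window) == 3:
            s_row, d_prev = column_step(window[0], window[1], window[2], d_prev)
            low.append(s_row)
            high.append(d_prev)
            window = window[2:]                   # the last row is reused as the next top row
    # Bottom edge: mirror the row above the last one.
    s_row, d_row = column_step(window[0], window[1], window[0], d_prev)
    low.append(s_row)
    high.append(d_row)
    return low, high, peak


def two_pass_53(image):
    """Reference: transform all rows first, then all columns."""
    rows = [s + d for s, d in map(fwd53, image)]
    cols = [fwd53([row[c] for row in rows]) for c in range(len(rows[0]))]
    low = [[col[0][k] for col in cols] for k in range(len(rows) // 2)]
    high = [[col[1][k] for col in cols] for k in range(len(rows) // 2)]
    return low, high


image = [[(31 * r + 17 * c * c) % 256 for c in range(8)] for r in range(6)]
low, high, peak = pipelined_53(image)
```

In this model at most three transformed rows are buffered at any time, and the output matches the rows-then-columns reference. How such a buffer maps onto registers, block RAM, or external memory depends on the hardware, as the last bullet notes.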
Architecture

Our board, the Nallatech BenNUEY-PCI-4E, provides opportunities for parallel processing and pipelining
  • Nallatech BenDATA-DD Module
    • Xilinx Virtex-II (2V3000)
    • 1 GB SDRAM
  • Nallatech BenNUEY-PCI-4E
    • Xilinx Virtex-II Pro (2VP50)
    • 4 MB ZBT SRAM
    • Ethernet Connectivity

Design Challenges
  • Image data will be stored in the 1 GB SDRAM
    • Original data can occupy up to half the total space
    • Transformed data will occupy the other half
  • Memory is addressable as 32-bit words
    • Each memory read/write will involve four “packed” 8-bit pixel values
      • Row Challenge: each transform requires three pixels, but they can only be read in groups of four across a single row
      • Column Challenge: each transform requires three pixels, but they are packed by rows, not by columns
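The packing constraint can be illustrated with a few lines of code. The byte order is an assumption (first pixel in the most significant byte); the slides do not say how pixels are ordered within a word, and the helper names are invented for this sketch.

```python
def unpack_pixels(word):
    """Split a 32-bit word into four 8-bit pixels (assumed order: first pixel in the top byte)."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def pack_pixels(pixels):
    """Pack four 8-bit pixels into one 32-bit word, using the same assumed order."""
    word = 0
    for p in pixels:
        word = (word << 8) | (p & 0xFF)
    return word


def row_window(words, start):
    """Return the three pixels start..start+2 of a row stored as packed words."""
    pixels = [p for w in words for p in unpack_pixels(w)]
    return pixels[start:start + 3]


words = [pack_pixels([0, 1, 2, 3]), pack_pixels([4, 5, 6, 7])]
window = row_window(words, 2)
```

The window starting at pixel 2 spans both words, which shows the row challenge: a three-pixel transform does not line up with four-pixel reads. The column challenge is similar: three vertically adjacent pixels sit in three different words, one from each row.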
Row Solution
  • Prefetching/pipelining uses two 32-bit registers, WordA and WordB, which allow transform data to be prefetched so that the “pipe” is always full
  • Prefetching/pipelining enables efficient utilization of available resources
  • Prefetching/pipelining produces a deterministic, repeating pattern after only four operations, as shown on the following slides…
Row Operations (1)

[Figure: for each row operation, the 32-bit Data_In word and the registers Word A (WordA_Pix1 to WordA_Pix4) and Word B (WordB_Pix1 to WordB_Pix4); shading in the original slide marks the values active in the current operation]

  • 1st Row Op
    • 1. Fetch 2 words (Data_In: p0 p1 p2 p3, then p4 p5 p6 p7)
    • 2. Load Word A: p0 p1 p2 p3
    • 3. Load Word B: p4 p5 p6 p7
    • 4. Perform 1st transform
  • 2nd Row Op
    • 1. Don’t fetch (Data_In: p4 p5 p6 p7)
    • 2. Preserve Word A: p0 p1 p2 p3
    • 3. Preserve Word B: p4 p5 p6 p7
    • 4. Perform 2nd transform
  • 3rd Row Op
    • 1. Don’t fetch (Data_In: p4 p5 p6 p7)
    • 2. Preserve Word A: p0 p1 p2 p3
    • 3. Preserve Word B: p4 p5 p6 p7
    • 4. Perform 3rd transform
  • 4th Row Op
    • 1. Fetch next word (Data_In: p8 p9 p10 p11)
    • 2. Load Word A: p8 p9 p10 p11
    • 3. Preserve Word B: p4 p5 p6 p7
    • 4. Perform 4th transform
Row Operations (2)

[Figure: continuation of the previous slide, same layout; shading in the original slide marks the values active in the current operation]

  • 5th Row Op
    • 1. Don’t fetch (Data_In: p8 p9 p10 p11)
    • 2. Preserve Word A: p8 p9 p10 p11
    • 3. Preserve Word B: p4 p5 p6 p7
    • 4. Perform 5th transform
  • 6th Row Op
    • 1. Fetch next word (Data_In: p12 p13 p14 p15)
    • 2. Preserve Word A: p8 p9 p10 p11
    • 3. Load Word B: p12 p13 p14 p15
    • 4. Perform 6th transform
  • 7th Row Op
    • 1. Don’t fetch (Data_In: p12 p13 p14 p15)
    • 2. Preserve Word A: p8 p9 p10 p11
    • 3. Preserve Word B: p12 p13 p14 p15
    • 4. Perform 7th transform
  • 8th Row Op
    • 1. Fetch next word (Data_In: p16 p17 p18 p19)
    • 2. Load Word A: p16 p17 p18 p19
    • 3. Preserve Word B: p12 p13 p14 p15
    • 4. Perform 8th transform
  • Etc…
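The fetch pattern in the row operation slides can be reproduced with a small scheduling model. This is a sketch under one assumption not stated in the slides: that the k-th transform uses the three pixels starting at pixel 2(k-1), so consecutive windows overlap by one pixel. The function name and the action strings are illustrative.

```python
PIXELS_PER_WORD = 4


def row_fetch_schedule(num_ops):
    """Return (op, action, word held in Word A, word held in Word B) for each row op."""
    word_a, word_b = 0, 1                          # op 1 prefetches words 0 and 1
    schedule = []
    for op in range(1, num_ops + 1):
        first = 2 * (op - 1)                       # assumed window: pixels first..first+2
        needed = (first + 2) // PIXELS_PER_WORD    # word that holds the last pixel
        if op == 1:
            action = "fetch 2 words, load A and B"
        elif needed not in (word_a, word_b):
            # Overwrite whichever register holds the older (lower-numbered) word.
            if word_a < word_b:
                word_a, action = needed, "fetch next word, load A"
            else:
                word_b, action = needed, "fetch next word, load B"
        else:
            action = "don't fetch"
        schedule.append((op, action, word_a, word_b))
    return schedule


schedule = row_fetch_schedule(8)
actions = [action for _, action, _, _ in schedule]
```

Under that assumption the model fetches on operations 1, 4, 6, and 8, alternating between Word A and Word B, and from the fourth operation on it settles into one fetch every second operation.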
Column Solution
  • Since four pixels must be read from across four columns, an efficient solution is to process four columns in parallel
    • Rather than transforming one column completely, we will transform four columns partially
  • For efficiency, column processing will begin as soon as three rows have been fully transformed
    • Row processing will continue after column processing has begun!
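A minimal sketch of the four-columns-at-a-time idea is shown below. It assumes the column step is the 5/3 predict step (middle value minus the truncated mean of the values above and below) and models each memory read as a slice of four values; bit widths and packing are ignored, and the function names are invented for the example.

```python
def column_predict_x4(top, mid, bot):
    """High pass (predict) step for four adjacent columns from three vertically aligned reads."""
    return [m - (t + b) // 2 for t, m, b in zip(top, mid, bot)]


def first_column_coefficients(rows):
    """Compute the first high pass coefficient of every column, four columns per operation."""
    top, mid, bot = rows                      # the first three row-transformed rows
    out = []
    for c in range(0, len(top), 4):           # operation 1: columns 0-3, operation 2: columns 4-7, ...
        out.extend(column_predict_x4(top[c:c + 4], mid[c:c + 4], bot[c:c + 4]))
    return out


rows = [
    [0, 1, 2, 3, 4, 5, 6, 7],
    [10, 10, 10, 10, 10, 10, 10, 10],
    [2, 3, 4, 5, 6, 7, 8, 9],
]
coeffs = first_column_coefficients(rows)
```

Each operation consumes one four-value read from each of three rows, so every value fetched is used, instead of three quarters of each word being discarded as would happen when walking down a single column.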
Column Operations

[Figure: grid of coefficients with cells labeled d1 and r1, repeated for each operation]
  • Operation 1: Process first coefficients for columns 0 - 3
  • Operation 2: Process first coefficients for columns 4 - 7
  • Operation 3: Process first coefficients for columns 8 - 11
  • Operation 4: Process first coefficients for columns 12 - 15

Ideal Implementation
  • An ideal implementation would use only the resources (registers and memory) available internally on the FPGA
    • Eliminates slow interfaces between chips and external memory
    • Provides great flexibility in how memory is managed
  • Impractical in today’s FPGA devices
    • Internal resources are too limited in single devices
    • Sufficient resources across multiple devices are prohibitively expensive
Alternate Implementation 1
  • One alternate implementation would use only the FPGA and memory available on the BenDATA-DD module
    • Simplifies the interface between the FPGA and the memory, since the FPGA and SDRAM are located in close physical proximity and without intermediate devices
  • A complete implementation of wavelet compression may not fit into the single FPGA on the BenDATA-DD module
Alternate 1, Level 1

[Figure: WPT Pass 1a and WPT Pass 1b]

Alternate 1, Level 2

[Figure: WPT Pass 2a and WPT Pass 2b]

Alternate 1, Level 3

[Figure: WPT Pass 3a and WPT Pass 3b]

Alternate Implementation 2
  • A second alternate implementation would distribute processing between multiple FPGAs (on the motherboard and the module) and between the SRAM (motherboard) and SDRAM (module)
    • Allows larger design to be distributed among multiple devices
    • Increases opportunities for parallel processing
  • Increases design complexity and decreases performance
    • Data must be shared between the two FPGAs via some transport mechanism (such as a FIFO)
Wavelet Transform Level 1

[Figure: WPT Pass 1a and WPT Pass 1b, with the row-transformed coefficients marked as the source for the column transform]

Wavelet Transform Level 2

[Figure: WPT Pass 2a and WPT Pass 2b]

Wavelet Transform Level 3

[Figure: WPT Pass 3a and WPT Pass 3b]

Conclusions/Suggestions
  • A prefetch/pipelined implementation of the 5/3 wavelet transform effectively removes the need for redundant access to intermediate data
  • This implementation can be extended to other wavelet transforms (more or less complex than the 5/3)
  • Final implementation of prefetching/pipelining will depend upon the architecture being used, and details such as memory bus width