Implementation of a High Rate Modular JPEG2000 Encoder in a Virtex2 FPGA

Presented by Damon Van Buren, SEAKR Engineering. MAPLD 2003, Paper P72.

Introduction • SEAKR Engineering provides data storage and on-board processing solutions for satellites and spacecraft. • We strive to provide key technologies and capabilities to our customers. • Commercial imaging satellites are experiencing rapid growth in imaging capacity, leading to higher data rates. • Many upcoming systems produce image data at several Gbits per second. • This is driving an increase in on-board data storage capacity and downlink bandwidth. • Compression is an excellent solution: • Compression improves operational efficiency, reducing overall system cost for the same imaging capability. • Roughly a 2 to 1 reduction in image size for lossless compression. • Collection time, storage capacity, and downlink capability are all effectively doubled by lossless compression. • Compression of images on-board the satellite has unique challenges.

Desired Compression Features • Commercial Satellite Imaging has unique requirements which vary significantly from other applications. • High Data Rate • Compression must often be performed in real time, prior to storage. • Excellent Compression Performance • Purchasers of satellite imagery are sensitive to reductions in image quality caused by lossy compression. • Scientific users prefer undistorted data (bit true). • Flexible Compressed Data Format • Allows system operators to make the best use of limited downlink capacity. • Space-Qualified • Must survive hazards of space, including radiation. • Low Risk • Errors are difficult to fix after launch. • Low Cost • Commercial customers require cost effective solutions.

JPEG2000 Features • The JPEG2000 standard meets the requirements for excellent compression performance in lossy or lossless modes, while providing a flexible compressed data format. • International Standard • Wavelet based • High quality lossy images with compression ratios > 100:1 • Flexible Compression • Many encoding options. • Packet oriented • Allows random access to the code stream, depending on user requirements. • Makes compressed data more robust in the presence of bit errors. • Allows selection of image quality, spatial region, resolution, and color component after compression.

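JPEG2000 Compression Examples [Figure: original image beside the same image after lossless compression at 3.6:1.]

Compression Examples (continued) [Figure: original image beside the same image compressed at 50:1.]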

JPEG2000 Coding Steps • Image is broken into tiles • Tiles are wavelet transformed • 5/3 reversible or 9/7 irreversible, also user defined. • Selectable number of transform levels. • Each subband from the transform is further broken up into code blocks (typically 32x32 or 64x64) for entropy coding. • Each code block is entropy coded, starting from the top bit plane and working down. • The current bit of each pixel is passed to an arithmetic coder, along with context information. • The MQ encoder takes advantage of any skewing of the probability for each context, and adapts contexts as the coding progresses. • Packets are formed by combining the entropy coder outputs from a single resolution.

Tiling • Breaks up image into smaller regions for coding. • Allows user to select and decode a specific region of the image. • Tiles are further divided into code blocks. • For each tile, packets are formed which include information for a single resolution, layer, and color. • Packets may be selectively decoded to uncompress portions of a tile, without decoding the entire image. [Figure: an image divided into 25 tiles, T0 through T24.]
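
The tiling described above can be sketched in a few lines of Python. This is only an illustration, not the encoder's logic: the function names, the raster-order tile numbering, and the 1280x1280 image size in the usage note are assumptions chosen so that 256x256 tiles give 25 tiles, as in the figure.

```python
def tile_grid(width, height, tile_size=256):
    """Return (index, x0, y0, x1, y1) rectangles for each tile, in raster order.

    Tiles on the right and bottom edges are clipped to the image size.
    """
    tiles = []
    index = 0
    for y0 in range(0, height, tile_size):
        for x0 in range(0, width, tile_size):
            x1 = min(x0 + tile_size, width)
            y1 = min(y0 + tile_size, height)
            tiles.append((index, x0, y0, x1, y1))
            index += 1
    return tiles


def tiles_covering(region, width, height, tile_size=256):
    """Return the indices of the tiles that overlap region = (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = region
    hits = []
    for index, x0, y0, x1, y1 in tile_grid(width, height, tile_size):
        # Two half-open rectangles overlap when they overlap on both axes.
        if x0 < rx1 and rx0 < x1 and y0 < ry1 and ry0 < y1:
            hits.append(index)
    return hits
```

For a hypothetical 1280x1280 image, `tile_grid(1280, 1280)` yields 25 tiles, and `tiles_covering((200, 0, 300, 100), 1280, 1280)` returns the two tiles a decoder would need for that region.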

Wavelet Transform • Separates image into high and low frequency components in the horizontal and vertical directions. • Vertical and horizontal wavelet transforms are applied to each tile “n” times. • “n” transform levels produce “n+1” different resolutions. • Resolutions are coded in separate packets or tile parts, allowing the user to choose the resolution for the tile during decoding. [Figure: a tile after 2 transform levels, showing Resolution 0, Resolution 1, and Resolution 2.]
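
The relation between transform levels and resolutions can be checked with a short calculation. The sketch below assumes square tiles and that each level halves the tile dimension (rounding up); the function name is illustrative.

```python
def resolution_sizes(tile_size, levels):
    """Return the side length of each resolution, lowest resolution first.

    n transform levels give n + 1 resolutions; resolution r is the tile
    reduced by a factor of 2 ** (levels - r), rounded up.
    """
    sizes = []
    for r in range(levels + 1):
        factor = 2 ** (levels - r)
        sizes.append(-(-tile_size // factor))  # ceiling division
    return sizes
```

With two levels, `resolution_sizes(256, 2)` gives `[64, 128, 256]`, matching Resolution 0 to Resolution 2 in the figure.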

Entropy Coding • Following the wavelet transform, the wavelet transform coefficients are entropy coded. • Transform subbands are divided into smaller blocks called code blocks, each of which is entropy coded separately. • Entropy coding starts with the highest bit plane which is significant (non-zero) and works down. • Three passes are made for each bit plane: • Significance propagation • Magnitude refinement • Cleanup • For the first (highest) bit plane, only the cleanup pass is performed. • For each bit, a context is computed, which depends on the pass type, signs, and significance of its neighbors. • The value of each bit is arithmetic coded, using the context information.
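
The pass ordering described above can be written out as a small sketch. It only lists the passes for one code block; it does not perform context formation or arithmetic coding, and the function name is an assumption for illustration.

```python
def coding_passes(coefficients):
    """List (bit_plane, pass_name) in coding order for one code block.

    Bit plane 0 is the least significant. Coding starts at the highest
    bit plane that contains a non-zero bit; that plane gets only a
    cleanup pass, and every lower plane gets all three passes.
    """
    max_magnitude = max(abs(c) for c in coefficients)
    if max_magnitude == 0:
        return []  # nothing significant to code
    top = max_magnitude.bit_length() - 1
    passes = [(top, "cleanup")]
    for plane in range(top - 1, -1, -1):
        passes.append((plane, "significance propagation"))
        passes.append((plane, "magnitude refinement"))
        passes.append((plane, "cleanup"))
    return passes
```

A block with P significant bit planes therefore needs 3P - 2 passes; for example, coefficients with a maximum magnitude of 5 (three bit planes) need seven.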

JPEG2000 Code Stream • Code stream is divided into packets, allowing random access to contents of the compressed data stream. • Groups of packets form tile-parts. • Image data from a single tile, layer, resolution, and color component are encoded to form a packet. • Packets have a header giving basic packet info: • Zero length packet. • Code block inclusion. • Number of “insignificant” upper bit planes for each code block. • Number of coding passes (bit planes) for each block in this packet. • Length of data for each code block. • After the header, the data for each included code block is placed in the packet in raster-scan order.

JPEG2000 Decode Flexibility [Figure: a JPEG2000 encoder produces a single compressed data stream; separate JPEG2000 decoders extract from that same stream a choice of resolution, component, region, or quality.]

Flexible Downlink • The flexible data stream from JPEG2000 can be used to give operators better control of downlink usage. • Operator can select the tile parts to give the desired resolution, quality, spatial area, and color component prior to downlinking. • Downlink bandwidth is reduced to what is needed for the desired region(s) of the image, at the desired quality and resolution!
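
As a rough illustration of this selection step, the sketch below filters a list of packet descriptors before downlink. The `Packet` record, its field names, and the selection function are hypothetical; the slides do not describe the operator interface, and a real system would select tile parts rather than Python objects.

```python
from collections import namedtuple

# Hypothetical packet descriptor; field names are illustrative only.
Packet = namedtuple("Packet", "tile layer resolution component length")


def select_packets(packets, tiles=None, max_layer=None,
                   max_resolution=None, components=None):
    """Keep only the packets needed for the requested region and quality.

    A filter left as None is not applied.
    """
    chosen = []
    for p in packets:
        if tiles is not None and p.tile not in tiles:
            continue
        if max_layer is not None and p.layer > max_layer:
            continue
        if max_resolution is not None and p.resolution > max_resolution:
            continue
        if components is not None and p.component not in components:
            continue
        chosen.append(p)
    return chosen


def downlink_bytes(packets):
    """Total length of the selected packets."""
    return sum(p.length for p in packets)
```

Selecting one tile at a reduced resolution and a single quality layer shrinks the downlink to the sum of just those packet lengths.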

JPEG2000 Implementation Challenges • Extremely operation intensive • Each bit must be processed many times, for the wavelet transform, entropy coding, MQ coding, packet generation. • Complex algorithm • Some operations cannot be pipelined, because of feedback paths between functional blocks. • Behaviorally efficient (cycle efficient) implementation leads to very large combinatorial clouds between some synchronous registers. • Many parameters must be tracked individually for each code block. • Very memory intensive • Each pixel must be accessed many times, so many small buffers are needed to get good throughput. • Xilinx Virtex-II FPGAs are the only solution that meets all the requirements: • Huge amounts of memory and logic. • Space Qualified. • In-system re-programmable. • Fast.

Xilinx Virtex-II 6000 • Highest possible performance in a space-qualified programmable part. • Exceeds DSP processor performance by 1 to 3 orders of magnitude, depending on application. • 144 18kbit block RAMs. • 144 dedicated 18x18 bit multipliers. • 33,792 logic slices. • System clock rates up to several hundred MHz. • High-rate, flexible I/O. • Reconfigurable • Configuration time is ~50 ms.

JPEG2000 Architecture Drivers • To achieve high data rates (600 Mbps), the processing must be parallelized as much as possible. • The “tall pole in the tent” is the arithmetic coding, because the coding of a single data bit with its context can take several clock cycles. • Significance propagation coding is also a challenge, because each coefficient must be accessed many times, as each bit plane is processed. • Other operations, such as wavelet transform, code block loading, and packet generation are much more efficient, and require fewer parallel paths. • A pipelined architecture with many entropy coders in parallel was used to achieve the required throughput.

Architecture Description • Processes 256x256 tiles. • Pipelined architecture, using separate external memories for image, tile, and compressed data storage. • 19 Entropy coders working in parallel to improve throughput, one for each code block. • Most code blocks are 64x64. • For a 3 level transform, the two lowest resolutions are 32x32. • FIFO buffering between the stages improves data flow efficiency. • A rolling wavelet transform is used to reduce memory accesses and improve efficiency. • Entropy coder outputs are formed into layers, giving each tile a progressive output format. • Performs lossy or lossless compression.
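
The figure of 19 entropy coders follows from the tile and code block sizes. The sketch below counts code blocks per resolution for a square tile, assuming three detail subbands per transform level plus one final low-pass subband, and a code block size equal to the subband size when the subband is smaller than 64x64. The function name and defaults are illustrative.

```python
def code_blocks_per_resolution(tile_size=256, levels=3, max_block=64):
    """Return {resolution: number of code blocks} for one square tile."""

    def blocks_in_subband(subband_size):
        block = min(subband_size, max_block)
        return (subband_size // block) ** 2

    counts = {}
    # Resolution 0 is the single low-pass subband left after the last level.
    counts[0] = blocks_in_subband(tile_size >> levels)
    # Each higher resolution adds three detail subbands (HL, LH, HH).
    for r in range(1, levels + 1):
        subband_size = tile_size >> (levels - r + 1)
        counts[r] = 3 * blocks_in_subband(subband_size)
    return counts


counts = code_blocks_per_resolution()
print(counts, sum(counts.values()))  # {0: 1, 1: 3, 2: 3, 3: 12} 19
```

For a 256x256 tile with three levels this gives 1 + 3 + 3 + 12 = 19 code blocks, one per entropy coder, with 32x32 blocks in the two lowest resolutions.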

Architecture Block Diagram

Rolling Wavelet Transform • Rolling wavelet transform performs vertical and horizontal 5/3 wavelet transforms in one step. • Prior to the first transform, the DC offset is subtracted. • Three levels of transform are performed. • Vertical transform is performed serially on pixels, from top to bottom of each column. • Several columns of vertical coefficients are buffered by FIFOs. • Horizontal transform is performed using five previous columns. • Output coefficients are buffered in external memory. [Figure labels: Horizontal 5/3, Vertical 5/3.]
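
For reference, one level of the reversible 5/3 transform on a single column (or row) can be sketched with integer lifting steps. This is a software illustration of the arithmetic only, not the rolling FIFO-based hardware described above. It assumes an even-length input with symmetric extension at the ends; the function names and the `bits` parameter are illustrative.

```python
def remove_dc_offset(samples, bits):
    """Subtract half the unsigned range so samples are centered on zero."""
    return [s - (1 << (bits - 1)) for s in samples]


def forward_53(x):
    """One level of the reversible 5/3 lifting transform on an even-length list.

    Returns (low, high): low-pass and high-pass integer coefficients.
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0
    half = n // 2
    high = []
    for i in range(half):
        left = x[2 * i]
        # Symmetric extension at the end of the signal: x[n] mirrors x[n - 2].
        right = x[2 * i + 2] if 2 * i + 2 < n else x[2 * i]
        high.append(x[2 * i + 1] - ((left + right) >> 1))
    low = []
    for i in range(half):
        # Symmetric extension at the start: high[-1] mirrors high[0].
        prev = high[i - 1] if i > 0 else high[0]
        low.append(x[2 * i] + ((prev + high[i] + 2) >> 2))
    return low, high


def inverse_53(low, high):
    """Undo forward_53 exactly, recovering the original integers."""
    half = len(low)
    x = [0] * (2 * half)
    for i in range(half):
        prev = high[i - 1] if i > 0 else high[0]
        x[2 * i] = low[i] - ((prev + high[i] + 2) >> 2)
    for i in range(half):
        left = x[2 * i]
        right = x[2 * i + 2] if i + 1 < half else x[2 * i]
        x[2 * i + 1] = high[i] + ((left + right) >> 1)
    return x
```

Because every step uses integer arithmetic that the inverse repeats in reverse order, `inverse_53(*forward_53(x))` returns `x` unchanged, which is what makes the 5/3 path suitable for lossless compression.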

Rolling Wavelet Block Diagram

Code Block Loader • The code block loader reads the coefficients for each code block from the external tile buffer and loads the code block buffers of the entropy coders. • The loader formats the data into the correct word sizes, and converts the coefficients from two's complement into sign and magnitude. • The loader is pipelined with the wavelet transform and the entropy coders, so that it loads code blocks as soon as the source coefficients and the destination memory are ready.
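
The format conversion performed by the loader can be illustrated as follows. The 18-bit width is taken from the block RAM word size mentioned elsewhere in the slides, but the exact bit layout (sign in the top bit, magnitude below it) and the function names are assumptions.

```python
def from_twos_complement(word, width=18):
    """Interpret a raw width-bit two's complement word as a signed integer."""
    word &= (1 << width) - 1
    if word & (1 << (width - 1)):
        return word - (1 << width)
    return word


def to_sign_magnitude(value, width=18):
    """Pack a signed integer as sign bit (MSB) plus magnitude (assumed layout)."""
    magnitude = abs(value)
    if magnitude >= 1 << (width - 1):
        # The most negative two's complement value has no sign-magnitude form.
        raise ValueError("magnitude does not fit in %d bits" % (width - 1))
    sign = 1 if value < 0 else 0
    return (sign << (width - 1)) | magnitude
```

For example, the 18-bit word `0x3FFFF` is -1 in two's complement and becomes sign bit 1 with magnitude 1. Sign and magnitude form suits bit plane coding because the magnitude bits can be scanned plane by plane, independent of the sign.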

Entropy Coder • The entropy coder performs the bulk of the work in doing the compression. • Because the entropy coding is the most cycle-intense part of the compression, efficient use of each clock cycle is a must! • There are two stages: the context formation stage and the arithmetic coder stage. • FIFO buffers are used to facilitate data flow between blocks. • Maximum input pixel size of 13 bits is determined by the Xilinx Virtex-II block RAM data width of 18 bits. • 12 data bits (after DC subtraction) + 4 bits for growth during the wavelet transform + 1 guard bit + 1 sign bit = 18 bits. • Currently the entropy coder requires ~1.8 cycles to code each bit of the incoming code block.

Context Formation • Context formation uses the adjacent coefficients to create a context for the current coefficient. • For this design, it was important to make the throughput as high as possible, so several buffers were used. • Significance information for the code block was stored in a separate buffer. • Significance and sign information for the previous row were stored in a small FIFO, eliminating the need to access the previous row again. • Sign and data bits, along with their contexts, fit into a single 9-bit word. • The output context and data information were FIFO buffered as well. • The context formation function requires 1.5 cycles to execute all three pass types for each bit.
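
The neighbor information that feeds context formation can be sketched as three counts of significant neighbors: horizontal, vertical, and diagonal. The sketch below stops at the counts; the standard then maps them to a context label through per-subband tables, which are not reproduced here. Neighbors outside the code block are treated as insignificant, and the function name is illustrative.

```python
def neighbor_counts(sig, row, col):
    """Count significant horizontal, vertical, and diagonal neighbors.

    sig is a 2-D list of 0/1 significance flags for one code block.
    Positions outside the block count as insignificant (0).
    """
    rows, cols = len(sig), len(sig[0])

    def at(r, c):
        if 0 <= r < rows and 0 <= c < cols:
            return sig[r][c]
        return 0

    h = at(row, col - 1) + at(row, col + 1)
    v = at(row - 1, col) + at(row + 1, col)
    d = (at(row - 1, col - 1) + at(row - 1, col + 1)
         + at(row + 1, col - 1) + at(row + 1, col + 1))
    return h, v, d
```

Keeping the significance flags in their own buffer, as the design does, means these eight lookups never have to touch the coefficient memory itself.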

Context Formation Block Diagram

MQ Encoder • The MQ encoder performs arithmetic coding, using the contexts supplied. • For this design, a cycle-efficient approach was chosen, rather than a multi-cycle approach with a faster clock. • Fewer clock cycles to encode each bit, on average. • Many logic levels (~15), resulting in slower clock frequencies (~55 MHz). • The context parser separates the data bit and its context from the sign bit and its context, and supplies them to the MQ coding function. • Control values are also read from the context FIFO, causing the MQ coder to terminate the code stream, or perform other functions. • The index for all 18 contexts is tracked and adjusted. • Qe values are stored in a single lookup table.

MQ Encoder Block Diagram

Packet Generators • Packets are formed from the code blocks of a single resolution. • Wavelet transform produces n+1 resolutions, where n is the number of transform levels. • Four separate packet generators are used for this design. • Packet generators read the contributions from each entropy coder for each layer, and generate packet headers, followed by the data for each code block. • Output packets are FIFO buffered.

Packet Generator Block Diagram

Layer Formation, Tile Header and Main Header Generation • The layer formation function waits for a packet contribution from each resolution toward a given layer, and then outputs the packets in the correct order. • The main image header is generated at the command of the JPEG2000 controller. • The tile header is also generated at the command of the JPEG2000 controller. Tile length information is updated after the body of the tile is completed for lossless compression. • For lossy compression, not all packets are included. The last packet is truncated to give the appropriate data size.
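
The waiting behavior of the layer formation function can be modeled in software as a small reordering buffer. This is a sketch under assumptions: the class and method names are hypothetical, and "correct order" is taken to mean layer by layer, with resolutions in ascending order inside each layer, which the slides do not state explicitly.

```python
class LayerFormer:
    """Collect one packet per resolution for each layer, then emit in order."""

    def __init__(self, num_resolutions=4):
        self.num_resolutions = num_resolutions
        self.pending = {}    # layer -> {resolution: packet}
        self.next_layer = 0  # lowest layer not yet emitted

    def submit(self, layer, resolution, packet):
        """Accept a packet and return any packets that are now ready to output."""
        self.pending.setdefault(layer, {})[resolution] = packet
        out = []
        # Emit complete layers strictly in layer order.
        while len(self.pending.get(self.next_layer, {})) == self.num_resolutions:
            ready = self.pending.pop(self.next_layer)
            out.extend(ready[r] for r in range(self.num_resolutions))
            self.next_layer += 1
        return out
```

A later layer that completes early is simply held until every earlier layer has been emitted, so the output stays progressive.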

Virtex-II Implementation Results • The current version of the JPEG2000 coder was targeted to the V2-6000 FPGA. • Simulation and Routing Results: • Block RAMs: 119 out of 144, 82% • Slices: 23545 out of 33792, 69% • Max system clock ~50 MHz with -4 speed grade • Data Rate: Lossless ~450 Mbps, Lossy >1 Gbps. • Up to 13 bit data. • Initial results look promising: • Cycles per bit are in line with predicted performance. • System clock is nearly what is needed. • Layer formation works correctly. • JPEG2000-compliant output files are generated. • Some improvements in pipelining and system clock are needed to achieve 600 Mbps data rate for lossless compression.
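
The utilization figures can be reproduced directly, and a rough clock target for the 600 Mbps goal can be estimated. The estimate assumes the lossless data rate scales linearly with the system clock and that nothing else changes; that is a simplification made here for illustration, not a claim from the slides, which also call for pipelining improvements.

```python
def utilization_percent(used, total):
    """Resource utilization as a percentage."""
    return 100.0 * used / total


def clock_for_rate(target_mbps, measured_mbps=450.0, measured_clock_mhz=50.0):
    """Clock needed for target_mbps, assuming rate is proportional to clock."""
    return measured_clock_mhz * target_mbps / measured_mbps


print(round(utilization_percent(119, 144), 1))      # block RAMs
print(round(utilization_percent(23545, 33792), 1))  # slices
print(round(clock_for_rate(600.0), 1))              # MHz, linear-scaling estimate
```

The percentages agree with the slide to within a percentage point. Under the linear-scaling assumption the clock would have to rise from about 50 MHz to about 67 MHz.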

JPEG2000 Floorplan

Future Improvements • Implement selective arithmetic coding bypass. • Greatly improves throughput with little reduction in compression efficiency (typically < 1% reduction in compression performance) • Bypasses the arithmetic coder starting in the fifth significant bit plane of each code block. • Encode tile data in tile parts. • Enables progressive decoding on the image scale, rather than the tile scale. • Necessary for flexible downlink capability. • Increase system clock frequency by hand routing critical sections.
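
The effect of selective arithmetic coding bypass on the pass schedule can be sketched as below. It follows the description above, with the additional assumption (from the JPEG2000 bypass mode as commonly described) that only the significance propagation and magnitude refinement passes are written raw, while cleanup passes remain arithmetic coded. Names are illustrative.

```python
def pass_modes(num_bit_planes, bypass_from_plane=5):
    """Return (plane, pass_name, mode) for each coding pass of a code block.

    plane 1 is the most significant coded bit plane. mode is "MQ" for
    arithmetic coded passes and "raw" for bypassed passes.
    """
    modes = []
    for plane in range(1, num_bit_planes + 1):
        if plane == 1:
            names = ["cleanup"]  # the first plane has only a cleanup pass
        else:
            names = ["significance propagation",
                     "magnitude refinement",
                     "cleanup"]
        for name in names:
            bypassed = plane >= bypass_from_plane and name != "cleanup"
            modes.append((plane, name, "raw" if bypassed else "MQ"))
    return modes
```

For a code block with six significant bit planes, the first ten passes stay arithmetic coded and four of the remaining six passes bypass the MQ coder, which is where the throughput gain would come from.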
