This presentation describes hardware-optimized DCT/IDCT implementations in Verilog HDL, prepared for ECE734: VLSI Array Structures for DSP. It compares the performance of four input methods (serial, and 2-, 4-, and 8-pixel parallel) and reports synthesis results on an ALTERA Cyclone IV GX FPGA using Quartus II. It covers the algorithm, implementations, performance, results, and conclusions.
Hardware Optimized DCT-IDCT Implementation on Verilog HDL • RAHUL SRIKUMAR • ECE734: VLSI ARRAY STRUCTURES FOR DSP • 05/10/13
Contents • Algorithm • Implementations • Performance • Results • Conclusion • Future Work
Algorithm • 8-point DCT • 2D DCT = C*X*Transpose(C) • C – coefficient matrix, X – 8*8 input block
Algorithm (Cont’d) • 1D DCT = C*X • 2D DCT = (1D DCT) * Transpose(C) • 1D IDCT = Transpose(C) * 2D DCT • 2D IDCT = (1D IDCT) * C
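The two-pass structure can be checked numerically. The sketch below is a floating-point software model, not the Verilog design: it follows the definition 2D DCT = C*X*Transpose(C) from the Algorithm slide and assumes an orthonormal DCT-II scaling for C, which the slides do not specify. The helper names (`dct_matrix`, `matmul`, `dct_2d`, `idct_2d`) and the test block are illustrative only.

```python
import math

N = 8  # 8-point transform, as on the Algorithm slide


def dct_matrix(n=N):
    """Orthonormal DCT-II coefficient matrix C (scaling is an assumption)."""
    c = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        c.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                  for j in range(n)])
    return c


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def dct_2d(x, c):
    stage1 = matmul(c, x)                # first pass: 1D DCT = C * X
    return matmul(stage1, transpose(c))  # second pass: (C * X) * Transpose(C)


def idct_2d(y, c):
    stage1 = matmul(transpose(c), y)     # first pass: Transpose(C) * Y
    return matmul(stage1, c)             # second pass: (Transpose(C) * Y) * C


C = dct_matrix()
# Hypothetical 8x8 block of signed 8-bit samples
block = [[(r * 8 + col) % 256 - 128 for col in range(N)] for r in range(N)]
coeffs = dct_2d(block, C)
recovered = idct_2d(coeffs, C)
max_error = max(abs(recovered[r][col] - block[r][col])
                for r in range(N) for col in range(N))
print("max round-trip error:", max_error)
```

In floating point the inverse recovers the input to within rounding error. A fixed-point design with the word lengths listed on the Implementations slide would show a larger, quantization-dependent error.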
Implementations Part 1 • Input word length – 8 bits • 1D DCT internal word length – 11 bits • 2D DCT output word length – 9 bits • 2D IDCT output word length – 8 bits • 4 implementations were evaluated: • Serial In (SI) – 1 pixel at a time • 2 Parallel In (2PI) – 2 pixels at a time • 4 Parallel In (4PI) – 4 pixels at a time • 8 Parallel In (8PI) – 8 pixels at a time
Implementations Part 2 • 8 registers of 8 bits each are used for coefficient storage, far fewer than the 64 registers otherwise required for an 8*8 DCT/IDCT computation. • 2 RAMs, each with 64 locations (8 bits wide), are used. • The RAMs are enabled in the order en_ram1_write -> (en_ram1_read, en_ram2_write) -> en_ram2_read
Performance 1 • Serial In (1 pixel at a time) • Read 8 inputs = 8 cycles • Register 8 inputs + sign extension = 1 cycle • Add/Sub = 1 cycle • Absolute value = 1 cycle • Multiplication = 1 cycle • Final addition = 2 cycles • Total = 14 cycles
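One common way to arrive at an add/sub stage, a multiply stage, and a short final-addition stage is to use the even/odd symmetry of the 8-point DCT matrix. The slides do not state which decomposition the design uses, so the sketch below is an assumption: a floating-point model with hypothetical function names that omits the absolute-value stage. It forms four sums and four differences, multiplies four terms per output, and adds the products in a two-level tree.

```python
import math

N = 8


def coeff(k, j):
    """Orthonormal DCT-II coefficient (scaling is an assumption)."""
    scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    return scale * math.cos((2 * j + 1) * k * math.pi / (2 * N))


def dct_1d_direct(x):
    # Reference: 8 multiplications per output
    return [sum(coeff(k, j) * x[j] for j in range(N)) for k in range(N)]


def dct_1d_even_odd(x):
    # Add/Sub stage: pair sample j with sample 7 - j
    sums = [x[j] + x[N - 1 - j] for j in range(N // 2)]
    diffs = [x[j] - x[N - 1 - j] for j in range(N // 2)]
    out = []
    for k in range(N):
        # Even outputs use the sums, odd outputs use the differences
        operands = sums if k % 2 == 0 else diffs
        # Multiply stage: 4 products per output instead of 8
        p = [coeff(k, j) * operands[j] for j in range(N // 2)]
        # Final addition as a two-level adder tree
        out.append((p[0] + p[1]) + (p[2] + p[3]))
    return out


samples = [12, -7, 33, 90, -128, 127, 5, -64]  # hypothetical signed 8-bit inputs
direct = dct_1d_direct(samples)
fast = dct_1d_even_odd(samples)
worst = max(abs(a - b) for a, b in zip(direct, fast))
print("largest difference:", worst)
```

The symmetry holds because the coefficient for sample 7 - j equals the coefficient for sample j times (-1)^k, so each output needs only half of the products.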
Performance 2 • 2 Parallel In (2 pixels at a time) • Register 8 inputs + sign extension = 4 cycles • Add/Sub = 1 cycle • Absolute value = 1 cycle • Multiplication = 1 cycle • Final addition = 2 cycles • Total = 9 cycles
Performance 3 • 4 Parallel In (4 pixels at a time) • Register 8 inputs + sign extension = 2 cycles • Add/Sub = 1 cycle • Absolute value = 1 cycle • Multiplication = 1 cycle • Final addition = 2 cycles • Total = 7 cycles
Performance 4 • 8 Parallel In (8 pixels at a time) • Register 8 inputs + sign extension = 1 cycle • Add/Sub = 1 cycle • Absolute value = 1 cycle • Multiplication = 1 cycle • Final addition = 2 cycles • Total = 6 cycles
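The four cycle tallies above follow one pattern: the input stage shrinks with the input width while the remaining stages stay fixed at 5 cycles. The short script below only reproduces the slide numbers; the function and constant names are hypothetical.

```python
POINTS = 8  # samples per 1D transform

# Fixed pipeline stages after the inputs are registered (cycles, from the slides)
ADD_SUB = 1
ABSOLUTE_VALUE = 1
MULTIPLY = 1
FINAL_ADDITION = 2


def total_cycles(pixels_per_cycle):
    """Cycle total for one 8-point pass, as tallied on the Performance slides."""
    if pixels_per_cycle == 1:
        # Serial In: 8 read cycles, then 1 cycle to register and sign-extend
        input_cycles = POINTS + 1
    else:
        # Parallel In: registering 8 inputs takes 8 / width cycles
        input_cycles = POINTS // pixels_per_cycle
    return input_cycles + ADD_SUB + ABSOLUTE_VALUE + MULTIPLY + FINAL_ADDITION


for name, width in [("SI", 1), ("2PI", 2), ("4PI", 4), ("8PI", 8)]:
    print(name, total_cycles(width))
```

Each doubling of the input width saves fewer cycles, because the fixed 5-cycle tail dominates once loading takes only 1 or 2 cycles.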
Synthesis • Target Platform : ALTERA Cyclone IV GX FPGA • Tool Used : Quartus II • Language Used : Verilog
Results 1 • Serial In has the lowest synthesized combinational area because it needs the fewest wires to feed in the data.
Results 2 • Serial In has the lowest synthesized area because it needs the fewest storage elements and counters to process the data.
Results 3 • 8 Parallel In takes 236 cycles, compared with 246 for Serial In.
Conclusion • Serial In occupies ~6% less area than 8 Parallel In, at the cost of a comparatively smaller performance degradation (~4%).
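The ~4% figure can be reproduced from the cycle counts on the Results 3 slide. The slides do not say which design is the baseline, so this sketch computes the slowdown both ways; the variable names are illustrative.

```python
serial_cycles = 246     # Serial In, from Results 3
parallel8_cycles = 236  # 8 Parallel In, from Results 3

extra_cycles = serial_cycles - parallel8_cycles

# Slowdown of Serial In, relative to each possible baseline
slowdown_vs_parallel = 100.0 * extra_cycles / parallel8_cycles
slowdown_vs_serial = 100.0 * extra_cycles / serial_cycles

print(f"relative to 8 Parallel In: {slowdown_vs_parallel:.1f}%")
print(f"relative to Serial In:     {slowdown_vs_serial:.1f}%")
```

Both baselines give a value that rounds to about 4%, consistent with the Conclusion slide.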