Context-Aware Fast 3D DCT/IDCT Algorithm for Low-power Video Codec in Mobile Embedded Systems

STREAMING DAY 2010, Udine. Sergio Saponara, Luca Fanucci, University of Pisa, Italy. Contact: sergio.saponara@iet.unipi.it

Presentation Transcript


  1. STREAMING DAY 2010, Udine. Context-Aware Fast 3D DCT/IDCT Algorithm for Low-power Video Codec in Mobile Embedded Systems. Sergio Saponara, Luca Fanucci, University of Pisa, Italy. Contact: sergio.saponara@iet.unipi.it. STDAY2010, Udine, Sept. 2010

  2. Outline • Application of multidimensional DCT in video coding • Fast algorithm for 3D DCT • Fast techniques based on radix factorization • Fast techniques based on context-aware processing • Algorithmic results • VLSI architectures for 3D DCT • CMOS implementation results • Conclusion

  4. 2D DCT for video coding. The 2D DCT reduces spatial data redundancy. - It is the conventional transform adopted in the H.26x (ITU-T) and MPEG-x (ISO/IEC) video CoDecs (encoder/decoder). - The 2D DCT is applied to image blocks of N×N pixels (usually N = 8). Figure: core of a motion-compensated H.26x encoder.
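As a minimal illustration of the separable 2D DCT on this slide, the Python sketch below applies an orthonormal N-point DCT-II first to every row and then to every column of a block. The function names and the floating-point orthonormal scaling are assumptions for illustration; this is not the fixed-point transform of any particular codec.

```python
import math

def dct_1d(x):
    """Orthonormal N-point DCT-II of a sequence of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        # alpha(k) makes the basis orthonormal: sqrt(1/N) for DC, sqrt(2/N) otherwise
        alpha = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        out.append(alpha * s)
    return out

def dct_2d(block):
    """Separable 2D DCT of an N x N block: 1D DCT on each row, then on each column."""
    n = len(block)
    rows = [dct_1d(r) for r in block]                                  # horizontal pass
    cols = [dct_1d([rows[i][j] for i in range(n)]) for j in range(n)]  # vertical pass
    # cols[j][k] is coefficient (k, j); transpose back to row-major order
    return [[cols[j][k] for j in range(n)] for k in range(n)]

if __name__ == "__main__":
    flat = [[128] * 8 for _ in range(8)]   # uniform 8x8 block
    coeffs = dct_2d(flat)
    print(round(coeffs[0][0], 6))          # DC coefficient, equal to 8 * 128
```

For a uniform block all the energy lands in the DC coefficient and every AC coefficient is zero, which is the kind of redundancy the quantizer then exploits.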

  5. 3D DCT for video coding (1/2). The 3D DCT extends the spatial compression properties of the DCT to the time dimension. With respect to H.26x/MPEG-x CoDecs, the 3D DCT offers: - Lower cost: a 3D DCT (spatio-temporal compression) replaces the 2D DCT (spatial compression) plus motion estimation (temporal compression). - Symmetric encoder and decoder complexity, much lower than that of motion-compensated H.26x/MPEG-x encoders. - An optimal solution for applications requiring real-time coding and decoding in the same terminal: interactive TV and web services, video telephony, video conferencing, face recognition. - The same coding efficiency for slow-motion videos or small/medium image formats, and higher error resilience.

  6. 3D DCT for video coding (2/2). The 3D DCT is applied to cubes of N×N×N pixels (usually N = 8). As in H.26x/MPEG-x, each video frame is divided into blocks: one N×N×N cube consists of the N×N image blocks belonging to N consecutive frames.
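The cube organization can be sketched in Python as follows, assuming frames stored as lists of pixel rows and the usual arrangement in which the co-located N×N blocks of N consecutive frames are stacked along time. `extract_cube` and its parameters are hypothetical names chosen for illustration.

```python
def extract_cube(frames, bx, by, n=8):
    """Return the n x n x n cube at block column bx, block row by.

    frames: sequence of at least n frames, each a list of pixel rows.
    The result is indexed cube[t][y][x]: t = frame in the group of n,
    (y, x) = pixel inside the co-located n x n block.
    """
    if len(frames) < n:
        raise ValueError("need n consecutive frames")
    x0, y0 = bx * n, by * n
    height, width = len(frames[0]), len(frames[0][0])
    if y0 + n > height or x0 + n > width:
        raise ValueError("block lies outside the frame")
    return [[row[x0:x0 + n] for row in frame[y0:y0 + n]] for frame in frames[:n]]

if __name__ == "__main__":
    # 8 synthetic 16x16 frames; each pixel value encodes (t, y, x) for easy checking
    frames = [[[t * 1000 + y * 16 + x for x in range(16)] for y in range(16)]
              for t in range(8)]
    cube = extract_cube(frames, bx=1, by=0)
    print(cube[3][2][5])   # frame 3, row 2, frame column 8 + 5 -> 3045
```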

  8. 3D-DCT radix factorization (1/2) • Equation of an N³-point 3D DCT • A direct implementation of the equation requires N³ multiply-accumulate operations (MACs) per output coefficient • The N³-point 3D DCT is instead implemented by three N-point 1D DCTs plus appropriate transposition matrices • Complexity: 3N MACs per output coefficient • Memory cost: N² words for T1 plus N³ words for T2
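The slide's equation did not survive in the transcript. For reference, a common orthonormal form of the N³-point 3D DCT-II is given below (the normalization on the original slide may differ), followed by its row-column evaluation and the resulting cost per cube:

```latex
F(u,v,w) = C(u)\,C(v)\,C(w)\sum_{t=0}^{N-1}\sum_{y=0}^{N-1}\sum_{x=0}^{N-1}
  f(x,y,t)\cos\frac{(2x+1)u\pi}{2N}\cos\frac{(2y+1)v\pi}{2N}\cos\frac{(2t+1)w\pi}{2N},
\qquad C(0)=\sqrt{1/N},\ C(k)=\sqrt{2/N}\ \text{for } k>0

\begin{aligned}
G_1(u,y,t) &= C(u)\sum_{x=0}^{N-1} f(x,y,t)\cos\frac{(2x+1)u\pi}{2N} \\
G_2(u,v,t) &= C(v)\sum_{y=0}^{N-1} G_1(u,y,t)\cos\frac{(2y+1)v\pi}{2N} \\
F(u,v,w)   &= C(w)\sum_{t=0}^{N-1} G_2(u,v,t)\cos\frac{(2t+1)w\pi}{2N}
\end{aligned}

\text{direct: } N^3 \cdot N^3 = N^6 \text{ MACs per cube}, \qquad
\text{separable: } 3 \cdot N \cdot N^3 = 3N^4 \text{ MACs per cube}
```

For N = 8 this is 262,144 versus 12,288 MACs per cube, i.e. 512 versus 24 MACs per output coefficient. The intermediate results G₁ and G₂ correspond to the data held between stages in the transposition memories T1 and T2.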

  9. 3D-DCT radix factorization (2/2). Figure: blocks 0, ..., N-1 → 1D DCT → T1 → 1D DCT → T2 → 1D DCT. • Each N-point 1D DCT is factorized into simpler radix-2 butterflies.
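A minimal Python sketch of the radix-2 idea, assuming the unnormalized DCT-II (scale factors omitted). The 1D engine uses Lee's recursive factorization, one well-known radix-2 fast DCT and not necessarily the factorization used by the authors: for N = 8 it needs 12 constant multiplications instead of the 64 of the direct matrix-vector product. `dct_3d` then applies the same engine along x, y and t, with the intermediate arrays standing in for T1 and T2. All names are illustrative.

```python
import math

def dct_direct(x):
    """Unnormalized DCT-II, X[k] = sum_n x[n] cos(pi (2n+1) k / 2N): N*N products."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
            for k in range(n)]

def dct_fast(x):
    """Same transform via Lee's radix-2 recursion; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return [x[0]]
    half = n // 2
    # butterflies: sums feed the even outputs, scaled differences the odd outputs
    g = [x[i] + x[n - 1 - i] for i in range(half)]
    h = [(x[i] - x[n - 1 - i]) / (2 * math.cos(math.pi * (2 * i + 1) / (2 * n)))
         for i in range(half)]
    G, H = dct_fast(g), dct_fast(h)
    H.append(0.0)                       # H[N/2] is zero by construction
    out = [0.0] * n
    for k in range(half):
        out[2 * k] = G[k]
        out[2 * k + 1] = H[k] + H[k + 1]
    return out

def dct_3d(cube, dct1d=dct_fast):
    """Separable 3D DCT of cube[t][y][x] via three passes of the same 1D engine."""
    n = len(cube)
    # pass 1: along x, giving s1[t][y][u]
    s1 = [[dct1d(cube[t][y]) for y in range(n)] for t in range(n)]
    # pass 2: along y, reading s1 column-wise (the job of transposition memory T1)
    s2 = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for t in range(n):
        for u in range(n):
            col = dct1d([s1[t][y][u] for y in range(n)])
            for v in range(n):
                s2[t][v][u] = col[v]
    # pass 3: along t, reading across frames (the job of transposition memory T2)
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for v in range(n):
        for u in range(n):
            tube = dct1d([s2[t][v][u] for t in range(n)])
            for w in range(n):
                out[w][v][u] = tube[w]
    return out  # out[w][v][u]: temporal, vertical, horizontal frequency
```

Passing `dct1d=dct_direct` gives the same result with the slower 1D engine, which is a convenient way to check the fast path.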

  10. 3D-DCT/IDCT data correlation. Figure: switching bits between consecutive input samples. With Miss America, Akiyo, Foreman, and Coastguard, up to 60–70% of the rows are null in IDCT mode. Figure: distribution of the amplitude of the AC coefficients for Foreman vs. the coefficient number (1 to 512 in the 8×8×8 cube).
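The two statistics on this slide can be measured with a short Python sketch. Assumptions: samples are integers stored as `width`-bit two's complement words (16 bits by default, a hypothetical word length), and a row is one line of N samples along x of a cube indexed `cube[t][y][x]`. The function names are illustrative.

```python
def switching_bits(samples, width=16):
    """Total number of bit toggles between consecutive samples.

    Samples are treated as `width`-bit two's complement words, so
    negative values also toggle their sign-extension bits.
    """
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(samples, samples[1:]))

def null_row_fraction(cube):
    """Fraction of rows (along x) of cube[t][y][x] whose samples are all zero."""
    rows = [row for plane in cube for row in plane]
    return sum(all(v == 0 for v in row) for row in rows) / len(rows)

if __name__ == "__main__":
    print(switching_bits([0, 1, 3, 2]))                       # 3 toggles
    print(null_row_fraction([[[0, 0], [1, 0]], [[0, 0], [0, 0]]]))  # 0.75
```

Fewer toggles mean less switching activity, and hence less dynamic power, in the datapath; the all-zero rows are the ones a pre-processor can skip outright.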

  11. Context-aware 3D DCT • Before each 1D stage, insert a pre-processor that, for each row Xi of N samples: • analyzes the statistics of the DCT/IDCT input samples in that computing stage • decides, based on heuristic rules, whether the DCT/IDCT computation can be avoided • If A = 0 and SAD = 0, or if A ≠ 0 and SAD < TH1, the transform result is forced to zero • In these cases the transform result is estimated to have small residual energy and would most likely be cancelled by the quantizer
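A minimal Python sketch of such a pre-processor wrapped around one 1D stage. The decision rule is the one stated on the slide; the definitions of A and SAD are not given in the transcript, so the defaults below (A = sum of the samples, SAD = sum of absolute differences between consecutive samples) are hypothetical placeholders that can be swapped through `a_of` and `sad_of`. `transform` stands for any 1D DCT or IDCT function, and all names are illustrative.

```python
def default_a(row):
    # HYPOTHETICAL definition: the transcript does not define A; here A = sum of the samples
    return sum(row)

def default_sad(row):
    # HYPOTHETICAL definition: SAD = sum of absolute differences of consecutive samples
    return sum(abs(b - a) for a, b in zip(row, row[1:]))

def skip_row(a, sad, th1):
    """Decision rule from slide 11: force the 1D result to zero if
    (A = 0 and SAD = 0) or (A != 0 and SAD < TH1)."""
    return (a == 0 and sad == 0) or (a != 0 and sad < th1)

def context_aware_stage(rows, transform, th1, a_of=default_a, sad_of=default_sad):
    """One 1D DCT/IDCT stage with a pre-processor in front of it.

    rows: list of N-sample rows; transform: any 1D DCT or IDCT function.
    Returns (output rows, number of skipped rows).
    """
    out, skipped = [], 0
    for row in rows:
        if skip_row(a_of(row), sad_of(row), th1):
            out.append([0.0] * len(row))   # transform avoided, result forced to zero
            skipped += 1
        else:
            out.append(transform(row))
    return out, skipped
```

The ratio `skipped / len(rows)` is the fraction of 1D transforms avoided in that stage, which is the kind of computation saving reported on slide 12; TH1 trades that saving against the PSNR loss discussed on slide 13.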

  12. Percentage computation saving: context-aware vs. classic 3D DCT/IDCT (figure)

  13. Rate-distortion performance of the context-aware fast 3D DCT. Figures: rate-distortion curve for Akiyo; PSNR variation at fixed bit rate, context-aware vs. classic 3D DCT.
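The PSNR variation on this slide is the difference between two PSNR values measured against the same original frames, one for the context-aware and one for the classic 3D DCT. A minimal Python sketch, assuming 8-bit samples (peak value 255) and frames stored as lists of rows; `psnr` is a hypothetical helper name.

```python
import math

def psnr(ref, test, peak=255):
    """PSNR in dB between two equally sized frames given as lists of rows."""
    if len(ref) != len(test) or any(len(a) != len(b) for a, b in zip(ref, test)):
        raise ValueError("frames must have the same size")
    errs = [(a - b) ** 2 for ra, rb in zip(ref, test) for a, b in zip(ra, rb)]
    mse = sum(errs) / len(errs)
    # identical frames have zero error and unbounded PSNR
    return math.inf if mse == 0 else 10 * math.log10(peak * peak / mse)
```

With hypothetical decoded frames `dec_ctx` and `dec_classic` of an original `orig`, the plotted quantity would be `psnr(orig, dec_ctx) - psnr(orig, dec_classic)` at the same bit rate. An error of one grey level on every pixel (MSE = 1) gives about 48.13 dB.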

  15. Why VLSI HW design? An optimized SW implementation of the 3D DCT/IDCT reaches real-time VGA at 24 Hz on an Intel Core 2 6300 @ 1.86 GHz [T. Fryza et al.]. The Core 2 6300 processor, in 65 nm CMOS, integrates two cores and up to 4 MB of L2 cache. Its die size is 143 mm² for 290 M transistors; at 1.86 GHz the power consumption is up to 65 W. For battery-powered terminals, a VLSI HW design is needed.
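To put the VGA figure in context, the required arithmetic rate can be sketched in Python, assuming one 3D-DCT output coefficient per luminance pixel (chroma ignored) and the per-coefficient costs from the radix-factorization slide (N³ MACs direct, 3N MACs separable). `required_rates` is a hypothetical helper.

```python
def required_rates(width, height, fps, n=8):
    """Samples/s and MAC/s for real-time 3D DCT, one output coefficient per pixel."""
    samples = width * height * fps
    return {
        "samples_per_s": samples,
        "direct_mac_per_s": samples * n ** 3,     # N^3 MACs per coefficient
        "separable_mac_per_s": samples * 3 * n,   # 3N MACs per coefficient
    }

if __name__ == "__main__":
    print(required_rates(640, 480, 24))   # VGA at 24 Hz
```

For VGA at 24 Hz this gives 7,372,800 samples/s, i.e. about 3.77 GMAC/s for the direct form and about 177 MMAC/s for the separable one, a load that a dedicated datapath can sustain far more efficiently than a general-purpose processor.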

  16. Distributed arithmetic: RAC (ROM + Accumulator) instead of MAC (Multiplier + Accumulator). Figure: 1D-DCT circuit engine.
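A minimal Python sketch of distributed arithmetic for one fixed-coefficient inner product, the operation inside each output of a 1D-DCT engine. Assumptions: integer (fixed-point) coefficients, inputs in `bits`-bit two's complement, and one input bit-plane processed per step; the example coefficients are a hypothetical 64×-scaled cosine row (k = 1) of an 8-point DCT, not values from the authors' design. The ROM has 2^K entries for K inputs, so K is usually kept small, for example by applying DCT butterflies first so that each ROM sees only 4 inputs.

```python
import math

def build_rom(coeffs):
    """ROM[addr] = sum of coeffs[i] for every bit i set in addr (2**K entries)."""
    k = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1) for addr in range(1 << k)]

def da_inner_product(rom, xs, bits):
    """sum(c_i * x_i) using only ROM look-ups, shifts and additions.

    xs are `bits`-bit two's complement integers; step j addresses the ROM with
    bit j of every input, and the sign bit-plane (j = bits - 1) is subtracted.
    """
    acc = 0
    for j in range(bits):
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> j) & 1) << i    # gather bit j of every input
        term = rom[addr] << j              # weight 2**j applied by shifting
        acc = acc - term if j == bits - 1 else acc + term
    return acc

if __name__ == "__main__":
    coeffs = [round(64 * math.cos(math.pi * (2 * i + 1) / 16)) for i in range(8)]
    rom = build_rom(coeffs)
    xs = [-128, 127, 0, 5, -7, 100, -1, 64]
    print(da_inner_product(rom, xs, 8) == sum(c * x for c, x in zip(coeffs, xs)))
```

No multiplier appears in `da_inner_product`: the products are precomputed in the ROM and the accumulator only shifts and adds, which is the RAC-for-MAC substitution named on the slide.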

  17. 3D-DCT architectures: schemes. 3D architectures with different degrees of parallelism and power vs. area trade-offs: FULL PARALLEL (PA), CASCADE (CS), ITERATIVE (IT). Figure: block diagrams built from 1D DCT/IDCT units and the transposition memories T1 and T2, processing blocks 0, ..., N-1.

  18. 3D-DCT architectures: performance and complexity

  20. CMOS implementation results: 0.18 µm, 1.6 V, 6 metal levels, standard-cell. Figure: circuit complexity (kgates) vs. power consumption (mW) of the FULL PARALLEL, CASCADE, and ITERATIVE architectures for the QCIF, CIF, 4CIF, and 16CIF formats; dotted lines refer to the processing of the same video format.

  21. Power consumption with context-aware saving. Figures: power consumption of the CS architecture; power consumption of the PA architecture.

  22. Conclusions: the 3D DCT/IDCT is a promising solution for the real-time, low-power, low-complexity implementation of video encoders and decoders in battery-powered terminals.

  23. Thanks for your attention!!!
