Introduction to H.264/SVC: Differences, Possibilities, and Limits • Yao-Chung Lin • Image, Video, and Multimedia Systems Group, Information Systems Laboratory • May 10, 2006
Scalable Video Coding • A research topic for over 20 years • A single bitstream serves diverse clients • Display resolutions (QCIF, CIF, …, HDTV) • Frame rates (15 Hz, 30 Hz, …) • Bit rates/qualities • Standard under development • October 2003: MPEG Call for Proposals • March 2004: 14 proposals submitted and evaluated • 12 proposals are wavelet-based • 2 proposals are extensions of H.264/AVC • October 2004: MPEG selected the HHI proposal as the starting point for H.264/MPEG-4 AVC Amd. 1 • ~2007: final draft to be released
Current Draft • Based on H.264 main profile • MCTF/Hierarchical B-picture (MCTF w/o update step) for temporal scalability • Layered pyramid prediction structure for spatial scalability • Layered, sub bit-plane, and (run,level) coding for SNR scalability
Overall Architecture of SVC (figure: a two-layer example)
Outline • Introduction • Scalabilities • Temporal Scalability • Spatial Scalability • SNR (Quality) Scalability • Other Details • Simulation Results • Conclusion & Discussion
Temporal Scalability • Group of pictures (GOP) • Concepts of motion compensated temporal filtering (MCTF) • Hierarchical B-pictures
Group of Pictures • Instantaneous decoding refresh (IDR) pictures • Intra coded • Also key pictures • Form a GOP with only one picture • Provide random access • Key pictures • The last picture in a GOP • Intra coded, or inter coded from the previous key picture • Provide the lowest temporal resolution • Non-key pictures • Hierarchically predicted B pictures • High-pass signal of MCTF • Provide the various temporal resolutions • Note: the number of reference frames cannot exceed 16
Group of Pictures (figure: an example of a group of pictures; dyadic, 4 temporal levels) [ITU, 2006 January, R202]
Concepts of MCTF • Based on the lifting scheme • Ensures perfect reconstruction • Even if non-linear operations are used • Open loop • Non-recursive temporal decomposition • Prevents drift error • Improves the efficiency of scalable coding, especially with FGS
Lifting Scheme (figure) • r: reference index • m: motion vector • Figure annotations: "similar to P-picture", "similar to B-picture" [ITU, 2006 January, R202]
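The perfect-reconstruction property of the lifting scheme can be illustrated with a minimal sketch. It assumes a Haar-style filter pair, treats each "frame" as a single integer sample, and leaves out motion compensation entirely; the function names and the `update` flag are illustrative and are not taken from the SVC draft. The floor shift in the update step is a non-linear operation, yet the inverse still recovers the input exactly. Setting `update=False` mimics MCTF without the update step.

```python
def lift_forward(frames, update=True):
    """Split an even-length sequence into (low, high) temporal bands."""
    if len(frames) % 2:
        raise ValueError("need an even number of frames")
    even, odd = frames[0::2], frames[1::2]
    # Prediction step: each odd frame is predicted from its even neighbour.
    high = [o - e for e, o in zip(even, odd)]
    # Update step (optional): add half of the high band back, with floor rounding.
    low = [e + (h >> 1) if update else e for e, h in zip(even, high)]
    return low, high


def lift_inverse(low, high, update=True):
    """Undo lift_forward by running the same steps in reverse order."""
    even = [l - (h >> 1) if update else l for l, h in zip(low, high)]
    odd = [h + e for e, h in zip(even, high)]
    frames = []
    for e, o in zip(even, odd):
        frames.extend([e, o])
    return frames


frames = [10, 13, 7, 200, 0, 255, 31, 30]
low, high = lift_forward(frames)
print(low, high)                        # the two temporal bands
print(lift_inverse(low, high) == frames)
```

Because the inverse subtracts exactly the quantity the forward step added, reconstruction is exact whatever rounding is used inside the step.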
Motion Modes • Variable block-size inter modes from 16x16 to 4x4 • Intra modes: 16x16, 8x8, 4x4 • Direct mode: 16x16, 8x8
Decomposition Structure [HHI Webpage: Scalable Extension of H.264/AVC]
Decomposition Structure • A dyadic decomposition structure incurs a delay of 2^N − 1 frames, where N is the number of temporal decomposition levels • Update steps do not cross the GOP border [HHI Webpage: Scalable Extension of H.264/AVC]
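As a rough illustration of the dyadic structure, the sketch below assigns a temporal level to each picture of a GOP with 2^N pictures and reads the delay expression on this slide as 2^N − 1. The 1-based indexing, the convention that level 0 is the key picture, and the function names are assumptions made for this example.

```python
def temporal_level(index, num_levels):
    """Temporal level of picture `index` (1-based) in a GOP of 2**num_levels pictures."""
    gop_size = 1 << num_levels
    pos = index % gop_size
    if pos == 0:
        return 0            # last picture of the GOP: key picture
    level = num_levels
    while pos % 2 == 0:     # every factor of two moves one level down
        pos //= 2
        level -= 1
    return level


def structural_delay(num_levels):
    """Delay in frames of the dyadic structure, read as 2**N - 1."""
    return (1 << num_levels) - 1


def decodable(num_levels, max_level):
    """Picture indices of one GOP that remain when higher levels are dropped."""
    gop_size = 1 << num_levels
    return [i for i in range(1, gop_size + 1)
            if temporal_level(i, num_levels) <= max_level]


print([temporal_level(i, 3) for i in range(1, 9)])
print(structural_delay(3))
print(decodable(3, 2))      # dropping the top level halves the frame rate
```

With N = 3 this gives a GOP of 8 pictures and four temporal levels (0 to 3); discarding the highest level keeps every second picture.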
Low Delay Support [ITU, 2006 January, R202]
Removal of the Update Step • The update step adds considerable complexity to the decoder • Derivation of the motion information for the update step • Smaller block sizes • 9-bit residual motion compensation • It provides no significant coding efficiency gain over closed-loop coding with hierarchical B pictures (HB) • The rate-distortion performance of closed-loop coding with HB is higher than or similar to that of MCTF-based coding for all test sequences • Except the ‘City’ sequence, where MCTF has a 0.5 dB gain • After temporal pre-filtering of the sequence, the MCTF gain becomes insignificant [ITU, 2005 July, P059]
Two Closed Loops (figure; label: FGS layer) [ITU, 2005 July, P059]
Spatial Scalability • Layered pyramid prediction structure • Inter-layer intra texture prediction • Inter-layer motion prediction • Inter-layer residual prediction • Extended Spatial Scalability • Cropping • Generic upsampling (non-dyadic spatial resampling)
Layered Pyramid Prediction Structure • Same concept as in H.262/MPEG-2, H.263, and MPEG-4, with additional inter-layer prediction • Each spatial resolution is coded as a new layer with texture and motion refinement • Same mechanism as for coarse grain SNR scalability (spatial downsampling ratio = 1)
Inheritance of Modes (figure: previous spatial layer to current layer, for spatial scaling ratio = 2)
Inter-layer Intra Texture Prediction • Unrestricted inter-layer intra texture prediction • Decode and predict from all lower layers in the bitstream • Not supported in the standard • Constrained inter-layer intra texture prediction • For MBs in non-key pictures • Requires that the co-located block in the previous layer is intra coded • Not supported in the standard • Constrained inter-layer intra texture prediction for single-loop decoding • For MBs in all pictures (including key pictures) • Requires that the co-located block in the previous layer is intra coded • Allows decoding with motion compensation in the current layer only • Supported by the current SVC draft
Generation of Inter-layer Texture Prediction • Apply the de-blocking filter directly • 4-sample border extension • Interpolation • 2x: half-pel interpolation filter of AVC • Otherwise: quarter-pel interpolation filter [Schwarz, ICIP 2005]
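The 2x case can be sketched in one dimension. The example below uses the AVC half-sample luma filter taps (1, −5, 20, 20, −5, 1)/32 on a single row of 8-bit samples. Extending the border by replicating the edge sample, copying integer positions unchanged, and the function name are assumptions of this sketch; the draft's exact sample-phase alignment is not reproduced here.

```python
def upsample_2x_row(row, border=4):
    """Upsample one row of 8-bit samples by 2 with the 6-tap half-pel filter."""
    taps = (1, -5, 20, 20, -5, 1)
    # Border extension (assumed here to replicate the edge samples).
    ext = [row[0]] * border + list(row) + [row[-1]] * border
    out = []
    for i in range(border, border + len(row)):
        out.append(ext[i])                      # integer position: copy
        # Half-sample position between ext[i] and ext[i + 1].
        acc = sum(t * ext[i - 2 + k] for k, t in enumerate(taps))
        half = (acc + 16) >> 5                  # divide by 32 with rounding
        out.append(min(255, max(0, half)))      # clip to the 8-bit range
    return out


print(upsample_2x_row([100, 100, 100, 100]))
print(upsample_2x_row([0, 0, 255, 255]))
```

A full 2-D upsampler would apply the same filter separably to rows and then columns; the over- and undershoot near the edge in the second call is why the result is clipped.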
Inter-layer Motion Prediction • Intra base layer • If the previous layer is inter coded, use the scaled partitioning and motion vectors of the base layer • If the previous layer is intra coded, predict from the previous layer • Quarter-pel refinement • Only for reduced spatial resolution • Refine the scaled motion vector of the previous layer by +1, 0, or −1 in quarter-sample precision • Send the refinement • None • Motion vector prediction from neighboring blocks • Motion vector prediction from the previous layer
Inter-layer Residual Prediction • Predict the residual from the previous layer's residual • Upsample the residual • 2x: separable bi-linear filter [1,1]/2 • Otherwise: quarter-pel interpolation • Helpful when the motion information is unchanged or only slightly changed from the previous layer
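For a spatial scaling ratio of 2, the scaling and quarter-pel refinement of a base-layer motion vector can be sketched as follows. Motion vectors are taken to be (x, y) pairs in quarter-sample units; the function names and the tuple representation are illustrative assumptions, not the draft's syntax.

```python
def scale_partition(size, ratio=2):
    """Scale a base-layer partition size (width, height) to the current layer."""
    return (size[0] * ratio, size[1] * ratio)


def predict_motion_vector(base_mv, refinement=(0, 0), ratio=2):
    """Scale a base-layer motion vector and apply a quarter-sample refinement.

    base_mv and the result are in quarter-sample units of their own layer;
    each refinement component must be -1, 0 or +1.
    """
    if any(r not in (-1, 0, 1) for r in refinement):
        raise ValueError("refinement components must be -1, 0 or +1")
    return tuple(ratio * c + r for c, r in zip(base_mv, refinement))


print(scale_partition((8, 8)))                     # 8x8 becomes 16x16
print(predict_motion_vector((3, -5)))              # scaled only
print(predict_motion_vector((3, -5), (1, -1)))     # scaled and refined
```

Only the small refinement has to be transmitted, which is cheaper than coding a full motion vector difference.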
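A one-dimensional sketch of residual prediction for the 2x case is given below. It upsamples a base-layer residual row with the [1,1]/2 bilinear filter and subtracts it from the enhancement-layer residual. Repeating the last sample at the border, the floor rounding, and the function names are assumptions of this example.

```python
def upsample_residual_2x(res):
    """Upsample a residual row by 2 with the bilinear filter [1, 1] / 2."""
    out = []
    for i, r in enumerate(res):
        nxt = res[min(i + 1, len(res) - 1)]   # repeat the last sample at the border
        out.append(r)                          # co-located sample
        out.append((r + nxt) >> 1)             # average of the two neighbours (floor)
    return out


def residual_after_prediction(enh_residual, base_residual):
    """What remains to be coded once the upsampled base residual is subtracted."""
    pred = upsample_residual_2x(base_residual)
    return [e - p for e, p in zip(enh_residual, pred)]


base = [4, -2, 0]
enh = [4, 1, -2, -2, 1, 0]
print(upsample_residual_2x(base))
print(residual_after_prediction(enh, base))
```

When the motion in both layers is nearly the same, the two residuals are strongly correlated and the remaining signal is close to zero, as in this toy input.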
SNR Scalability • Coarse grain scalability (CGS) • Layered coding • The same mechanism as spatial scalability • Re-quantizes the coefficients with a finer step • Fine grain scalability (FGS) • Sub-bitplane arithmetic coding • Re-quantizes the coefficients with a finer step • Provides continuous refinement from a quality base layer
Coarse Grain Scalability • Same mechanism as spatial scalability • Except without upsampling • Provides discrete quality refinement • Close to single-layer RD performance if dQP > 6
Fine Grain SNR Scalability • Represents the residual between the original prediction error and the base layer representation • Quantized with a bisected step size (dQP ≈ 6) • Coded in the transform domain, so the decoder needs only a single inverse transform • Adaptive reference FGS (AR-FGS) provides leaky prediction, attenuating drift error
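The link between a QP difference of 6 and a halved step size follows from the H.264 quantizer design, in which the step size doubles for every increase of 6 in QP. The sketch below tabulates the step size and applies one refinement pass to a single coefficient. The scalar rounding quantizer and the function names are simplifying assumptions; the real FGS coding operates on transform coefficients with sub-bitplane arithmetic coding.

```python
# Step sizes for QP 0..5; the pattern doubles every 6 QP values.
BASE_QSTEP = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)


def qstep(qp):
    """Quantizer step size for an H.264 QP in the range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in 0..51")
    return BASE_QSTEP[qp % 6] * (1 << (qp // 6))


def quantize(value, step):
    """Simplified scalar quantizer: nearest multiple of the step."""
    return int(round(value / step))


def fgs_refine(coeff, base_qp):
    """Reconstruct at the base QP, then refine the remaining error at QP - 6."""
    base_step = qstep(base_qp)
    base_rec = quantize(coeff, base_step) * base_step
    fine_step = qstep(base_qp - 6)            # half of the base step
    refined = base_rec + quantize(coeff - base_rec, fine_step) * fine_step
    return base_rec, refined


print(qstep(28), qstep(22))       # step size halves for dQP = 6
print(fgs_refine(37.0, 28))       # base and refined reconstruction
```

In this toy case the reconstruction error of the coefficient 37 shrinks from 5 to 3 after one refinement pass.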
Illustration of AR-FGS (figure; label: zero-coefficient block) [ITU, 2006 Jan., R202]
Outline • Introduction • Scalabilities • Temporal Scalability • Spatial Scalability • SNR (Quality) Scalability • Other details • Simulation Results • Discussion
Other Details • Fidelity range extensions (FRExt) • Supports the 8x8 transform (High Profile) • Increases coding efficiency, especially for high-resolution sources • Motion search block size down to 8x8 only • Weighted prediction • Scales the reference pictures for prediction • The weights are found at the encoder • Explicitly sent in the syntax • Or implicitly derived from temporal distance (an option for B-pictures)
Other Details • FGS motion • A progressive refinement slice (FGS slice) contains motion data • Provides better prediction • Adaptive GOP Structure (AGS) • Divides a GOP into several sub-GOPs by appropriate mode decision • Decreases the distance between two low-pass pictures • 0.62 dB gain • Details in [ITU O018] • Loss-aware rate-distortion optimization • The mode/parameter decision takes packet loss into account • Details in [ITU P057]
JSVM • Written in C++ • Accessible via CVS • Current version: 5.2 • Last update: May 2, 2006
Simulation Results • Temporal Scalability • GOP sizes [ITU, 2005 July, P014] • Open-loop MCTF vs. closed-loop HB [ITU, 2005 July, P059] • Spatial • Given the same base layer • Examine the inter-layer prediction • SNR • CGS, dQP = 2 or 6 • FGS • Key pictures predict from the base representation • FGS motion optimized at 1/3 bit rate • Is open-loop MCTF helpful? [ITU, P059]
Summary of Temporal Scalability Features • Hierarchical B pictures • B pictures give 0.5~1 dB (IPP -> IBBPBBP) • Hierarchical prediction gives an additional 0.5~1 dB • MCTF • Only ‘CITY’ has a 0.5 dB gain compared to closed-loop HB • The gain is diminished by encoder MCTF pre-filtering • The improvement comes from the hierarchical prediction structure
Simulation Results • Temporal Scalability • GOP sizes [ITU, 2005 July, P014] • Open-loop MCTF vs. closed-loop HB [ITU, 2005 July, P059] • Spatial • Given the same base layer, examine the inter-layer prediction • Multiple-loop decoding vs. single-loop decoding (constrained inter-layer prediction) [ITU, O074] • SNR • CGS, dQP = 2 or 6 • FGS • Key pictures predict from the base representation • FGS motion optimized at 1/3 bit rate
Spatial Scalability (two figure slides) [Schwarz, Marpe, and Wiegand, IWSSIP 05]
Constrained Inter-Layer Prediction (figure: Foreman, Munich test points; labels: CIF@30, CIF@15, QCIF@15, QCIF@7.5, CIF@15)
Constrained Inter-Layer Prediction (figure: Crew, Munich test points; labels: 4CIF@60, CIF@30, QCIF@15, 4CIF@30, CIF@30, QCIF@15)
Summary of Inter-layer Prediction Tools • Inter-layer prediction brings ~2 dB gain • Intra prediction: ~1 dB • Motion prediction: 0.5~1 dB • Residual prediction: ~0.5 dB • Constrained inter-layer intra prediction for single-loop decoding • Provides low-complexity decoding • Costs less than 0.5 dB
Simulation Results • Temporal Scalability • GOP sizes [ITU, 2005 July, P014] • Open-loop MCTF vs. closed-loop HB [ITU, 2005 July, P059] • Spatial • Given the same base layer, examine the inter-layer prediction • Multiple-loop decoding vs. single-loop decoding (constrained inter-layer prediction) [ITU, O074] • SNR • CGS, dQP = 2 or 6 • FGS • Key pictures predict from the base representation • FGS motion optimized at 1/3 bit rate • Is open-loop MCTF helpful? [ITU, P059]
SNR Scalability (two figure slides) [Schwarz, Marpe, and Wiegand, IWSSIP 05]