T325: Technologies for digital media


1. T325: Technologies for digital media
Second semester – 2011/2012, Tutorial 6 – Video and Audio Coding (2-2)
Arab Open University – Spring 2012

2. The coding of moving pictures: MPEG
• The JPEG standards were derived for the coding of still pictures.
• JPEG has been used to code series of individual fields in video sequences for transmission over digital links.
• Moving pictures can be coded simply by using a separate JPEG image for each picture (Motion-JPEG).
• Further compression can be introduced by exploiting features of the motion itself.

3. The coding of moving pictures: MPEG
• When pictures are produced at a rate of 25 per second, there cannot be much change between one picture and the next as a scene evolves in time, except, occasionally, when the camera or video editor cuts abruptly from one scene to another.
• There is, therefore, a considerable amount of temporal redundancy, which is exploited in MPEG by, in effect, transmitting only the differences between one picture and the next.

4. Motion compensation
• In most cases, as a scene evolves, many parts of the scene, such as the background, do not change.
• Furthermore, if there is movement, then the objects that move may not, themselves, change much.
• What is more likely to change is their location in the picture.
• Thus, although movement may destroy the correlation between corresponding locations in consecutive pictures, a high degree of correlation is likely to be maintained between the successive locations occupied by a moving object.

5. Motion compensation (figure)

6. Motion compensation
• In order to take this form of correlation into account, MPEG coding involves estimating the motion of objects between pictures.
• Different objects in a scene are likely to move by different amounts and in different directions, so motion is estimated for relatively small areas of a picture.
• The 8 × 8 pixel blocks used for spatial compression are rather too small for this purpose, so sets of four 8 × 8 blocks, called macroblocks, are used.

7. Motion compensation (figure)

8. Motion compensation
• Motion is estimated by taking a macroblock in one picture and comparing it with each of a group of macroblocks around the same region in a subsequent picture.
• This is called block matching: the block most like the reference block is taken to indicate the motion of that reference block.

9. Motion compensation (figure)

10. Motion compensation
• The motion of the block between the two pictures is expressed as a displacement (Δx, Δy), called a motion vector in MPEG processing.
• The evaluation of motion vectors is known as motion estimation (see the sketch below).
• In most cases, there is a gap of several pictures between the pair of pictures used for motion estimation, which saves on processing.
• The motion within the intervening pictures is estimated by interpolation.
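
To make block matching concrete, here is a minimal Python sketch (an illustration, not code from the course or any MPEG reference implementation): it exhaustively compares a macroblock of the current picture with candidate blocks in a search window of a reference picture and returns the displacement (Δx, Δy) with the smallest sum of absolute differences (SAD). The 16 × 16 macroblock and the ±8 pixel search range are illustrative assumptions.

```python
import numpy as np

def block_match(ref, cur, top, left, size=16, search=8):
    """Exhaustive block matching: find (dx, dy) such that the macroblock at
    (top, left) in `cur` best matches the block at (top + dy, left + dx)
    in `ref`, judged by the sum of absolute differences (SAD)."""
    block = cur[top:top + size, left:left + size].astype(int)
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            if t < 0 or l < 0 or t + size > ref.shape[0] or l + size > ref.shape[1]:
                continue  # candidate block falls outside the reference picture
            sad = np.abs(ref[t:t + size, l:l + size].astype(int) - block).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad

# Toy pictures: pic2 is pic1 shifted 3 px right and 2 px down, so a block in
# pic2 matches content 3 px to the left and 2 px up in pic1.
rng = np.random.default_rng(0)
pic1 = rng.integers(0, 256, (64, 64))
pic2 = np.roll(pic1, shift=(2, 3), axis=(0, 1))
print(block_match(pic1, pic2, top=24, left=24))   # ((-3, -2), 0)
```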

11. Picture types and motion compensation
There are three types of MPEG picture:
• I (Intra) pictures: coded independently of other pictures, using the coding techniques described earlier; they form the starting point of a series of predicted and interpolated pictures.
• P (Predicted) pictures: coded using predictions based on the previous I or P picture, with motion compensation.
• B (Bi-directional) pictures: obtained by interpolation between the I or P pictures that precede and follow them.

12. Picture types and motion compensation (figure)

13. Picture types and motion compensation
• Motion estimation between two pictures establishes pairs of matching blocks in the two pictures and a motion vector for each of these blocks.
• From this information it is possible to make an initial estimate of the second picture, by using the motion vectors to relocate the matching blocks.
• The difference between this initial estimate and the actual pixel values of the second picture can then be evaluated.
• Transmission of the motion vectors for each block in the first picture enables the receiver to derive the initial estimate.
• This is then corrected on reception of the differences between the estimate and the actual second-picture values (see the sketch below).
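
The estimate-and-correct loop on this slide can be sketched in a few lines of Python (a toy illustration, not the standard's actual procedure): the motion vectors relocate blocks of the first picture to form an estimate of the second, and the transmitted residual corrects whatever the estimate gets wrong.

```python
import numpy as np

def motion_compensate(ref, vectors, size=16):
    """Initial estimate of the next picture: relocate each macroblock of
    `ref` by its motion vector. `vectors` maps the (top, left) corner of a
    block in `ref` to its motion vector (dx, dy)."""
    est = np.zeros_like(ref)
    for (top, left), (dx, dy) in vectors.items():
        est[top + dy:top + dy + size, left + dx:left + dx + size] = \
            ref[top:top + size, left:left + size]
    return est

ref = np.arange(32 * 32, dtype=np.int16).reshape(32, 32)
actual = np.roll(ref, shift=(8, 8), axis=(0, 1))     # the real second picture
est = motion_compensate(ref, {(0, 0): (8, 8)})       # receiver's estimate
residual = actual - est                              # transmitted corrections
assert np.array_equal(est + residual, actual)        # decoder reconstruction
```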

14. Picture types and motion compensation
• Because the size of a moving object is often greater than that of the macroblocks, there is often a strong correlation between the motion vectors of contiguous macroblocks: the difference between the motion vectors of neighboring macroblocks tends to be small.
• So-called differential coding is used to send just these differences, keeping the number of bits required comparatively low (see the sketch below).
• Motion estimation for P pictures is obtained from the preceding I or P picture and, because there may be several B pictures in between, the correction values to be transmitted can be quite large.
• In order to compress this information, it is coded using the same steps as for spatial coding: DCT, requantization, zigzag scan, run-length and Huffman encoding.
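
Differential coding amounts to sending the first motion vector in full and then only neighbor-to-neighbor differences; because adjacent macroblocks tend to move alike, the differences are mostly small numbers that compress well. A minimal sketch (not MPEG's exact syntax):

```python
def diff_encode(vectors):
    """Replace each motion vector by its difference from the previous one."""
    out, prev = [], (0, 0)
    for dx, dy in vectors:
        out.append((dx - prev[0], dy - prev[1]))
        prev = (dx, dy)
    return out

def diff_decode(diffs):
    """Rebuild the motion vectors by accumulating the differences."""
    out, prev = [], (0, 0)
    for ddx, ddy in diffs:
        prev = (prev[0] + ddx, prev[1] + ddy)
        out.append(prev)
    return out

mvs = [(5, 2), (5, 2), (6, 2), (6, 3)]   # neighboring macroblocks move alike
print(diff_encode(mvs))                  # [(5, 2), (0, 0), (1, 0), (0, 1)]
assert diff_decode(diff_encode(mvs)) == mvs
```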

15. (figure)

16. http://www.tomshardware.com/reviews/video-guide-part-3,130-6.html

17. Picture types and motion compensation
• In most cases, changes affect relatively few macroblocks, so information is sent only when there are changes in a macroblock.
• The degree of compression for P pictures is significantly higher than that for I pictures.
• B pictures use interpolated values of the motion vectors obtained from the motion estimation for the following P picture.
• Because B pictures do not require separate motion estimation, considerably less processing is needed.
• The interpolation for B pictures is carried out in three ways:
  • forward, by applying the interpolated motion vectors to the macroblocks in the previous I or P picture;
  • backward, from the macroblocks of the following I or P picture;
  • both ways, taking the average of the two directions.
• The option giving the smallest error is retained (see the sketch below).
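
The "keep whichever direction gives the smallest error" rule can be sketched as follows (illustrative only: the SAD error measure and the 16 × 16 block size are assumptions):

```python
import numpy as np

def choose_b_mode(block, fwd_pred, bwd_pred):
    """Pick the B-picture prediction with the smallest absolute error:
    forward, backward, or the average of the two."""
    candidates = {
        "forward": fwd_pred.astype(int),
        "backward": bwd_pred.astype(int),
        "average": (fwd_pred.astype(int) + bwd_pred.astype(int)) // 2,
    }
    errors = {m: np.abs(block.astype(int) - p).sum() for m, p in candidates.items()}
    mode = min(errors, key=errors.get)
    return mode, candidates[mode]

rng = np.random.default_rng(1)
blk = rng.integers(0, 256, (16, 16))
fwd = blk + rng.integers(-2, 3, (16, 16))    # close forward estimate
bwd = rng.integers(0, 256, (16, 16))         # unrelated backward estimate
print(choose_b_mode(blk, fwd, bwd)[0])       # forward
```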

18. Picture types and motion compensation
• The term group of pictures is used in MPEG for a sequence of pictures starting with an I picture and including all subsequent P and B pictures up to the next I picture.
• The structure of a group is specified in terms of two parameters: N, the total number of pictures in the group, and M, the number of adjacent B pictures plus 1.
• Thus M = 3 and N = 12 for the group I B B P B B P B B P B B.
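
The relationship between N, M and the picture sequence (in display order) is easy to make explicit; this short function assumes, as on the slide, that the I picture opens the group:

```python
def gop_pattern(N=12, M=3):
    """Display order of a group of pictures: an I picture, a P picture at
    every Mth position, and B pictures in between."""
    return "".join("I" if i == 0 else ("P" if i % M == 0 else "B") for i in range(N))

print(gop_pattern(N=12, M=3))                                    # IBBPBBPBBPBB
print(gop_pattern(N=12, M=3).count("B"), "B pictures per group") # 8
```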

19. Picture types and motion compensation
• MPEG material may be recorded for editing or other purposes which require random access.
• Access points must coincide with the start of a group of pictures, because of the interpolation and prediction processes.
• Long groups of pictures could therefore lead to excessive distances between access points.
• With N = 12 and a picture rate of 25 per second, the interval between successive groups of pictures is 12/25 s, just under half a second. More than this would be unacceptable in many cases.
• The number of B pictures between two consecutive P pictures is limited by the accuracy of the interpolation process, as well as by the processing delays.
• Groups of pictures with N = 12 and M = 3 are commonly used.

20. MPEG-2
• The processing techniques described so far are all available in MPEG-1 systems.
• MPEG-2 systems can carry out all of these, but they can also operate at the higher bit rates and higher resolutions required for broadcast television.
• MPEG-2 has been described as a tool box which allows for the provision of a whole range of picture-quality parameters.
• The selection of tools available in any particular implementation depends on the trade-off made between complexity (and, hence, cost) and picture quality.

21. MPEG-2 – Encoder
• The digitized video input is reordered into macroblocks for motion estimation.
• The forward path, left to right across the figure, represents the spatial compression process.
• The loop involving the predictor provides the corrections to the transmitted picture that ensure a correct picture is reconstructed at the receiver when it applies the motion vectors to the received macroblocks.

22. MPEG-2 – Encoder
• The motion vectors and mode-control information are combined in the output multiplexer.
• The multiplexer output is stored in a buffer.
• If the buffer contents grow too fast, the requantization steps are temporarily made wider to reduce the bit rate of the input to the multiplexer (see the sketch below).
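
This buffer feedback is a simple control loop. A toy version in Python (the 25%/75% occupancy thresholds are invented for illustration; the 1-31 range matches the quantizer scale codes used in MPEG video):

```python
def adjust_quantizer(buffer_fill, capacity, step, step_min=1, step_max=31):
    """Widen the requantization step when the output buffer is filling
    (fewer bits out); narrow it again when the buffer empties."""
    occupancy = buffer_fill / capacity
    if occupancy > 0.75:
        return min(step_max, step + 1)   # coarser steps -> lower bit rate
    if occupancy < 0.25:
        return max(step_min, step - 1)   # finer steps -> better quality
    return step

print(adjust_quantizer(buffer_fill=900, capacity=1000, step=8))   # 9
print(adjust_quantizer(buffer_fill=100, capacity=1000, step=8))   # 7
```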

23. MPEG-2 – Decoder
• The spatial compression is reversed, and the motion estimation data is used to reconstruct the interpolated pictures.
• The sizes of the steps used for inverse requantization are controlled using data extracted from the multiplexed coded input.

24. MPEG audio coding
• The basic features of MPEG-1 and MPEG-2 audio coding are the same.
• The main difference is the number of audio channels provided in MPEG-2 as additional options, such as 'surround sound', which requires more than the two channels used for conventional stereo reproduction.
• Three levels of compression have been specified.
• They are referred to as coding layers and numbered I, II and III.

25. MPEG audio coding
• Layer I provides the smallest compression and layer III the greatest.
• The layers are upwardly compatible: a layer III decoder can decode data compressed using layers I or II, and a layer II decoder can decode data compressed using layer I.
• The precise details differ from layer to layer.
• The MPEG audio coding described here is based primarily on MPEG-1/2 layer II, which is part of the terrestrial digital video broadcasting standard (DVB-T) and of many current DAB systems.

26. MPEG audio coding
• The compression of audio messages depends on several aspects of our perception of sounds.
• There are many aspects of an audio signal which we do not perceive, although they carry information in the communications sense of the term.
• We are hardly, if at all, sensitive to the phase of the frequency components that make up an audio signal, so any phase information may be omitted from the coding process.
• The main feature of the process is that the message is split into 32 equal and adjacent frequency bands, called sub-bands.
• This is done digitally, by means of filters.
• MPEG standards allow for various sampling frequencies to be used.

27. MPEG audio coding – example: the 48 kHz option
• Each sub-band has a width of 0.75 kHz, giving a total bandwidth of 0.75 × 32 = 24 kHz, half the sampling frequency, this being the assumed bandwidth of the original signal.
• The filtering process reduces the number of samples for each sub-band, so that the effective sampling frequency is 1.5 kHz, that is, 1500 samples per second (twice the bandwidth of each sub-band).
• The total number of samples is 32 × 1500 = 48 000 per second, the same as for the original signal.
• The filtering process, by itself, does not reduce the number of bits to be processed, but it does allow this to be done, in several ways.
• We are more sensitive to frequencies in the range 1 to 5 kHz than we are outside this range.
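
The arithmetic of the 48 kHz option can be checked directly:

```python
fs = 48_000                        # sampling frequency, 48 kHz option
n_subbands = 32

bandwidth = fs / 2                 # 24 kHz: assumed bandwidth of the signal
subband_width = bandwidth / n_subbands
subband_rate = 2 * subband_width   # Nyquist rate for one sub-band

print(subband_width)               # 750.0 Hz per sub-band
print(subband_rate)                # 1500.0 samples per second per sub-band
print(n_subbands * subband_rate)   # 48000.0: no net change in sample count
```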

28. MPEG audio coding
• Relative audio sensitivity of humans: signal levels above the curve are audible.
• The curve itself represents the perception threshold.
• Below this threshold, we do not hear the sounds.

29. MPEG audio coding
• If the signal level in a sub-band is below the threshold for the frequencies covered by that sub-band, then that part of the signal will not be perceived, and the samples for that sub-band do not need to be transmitted.
• The next step relies on a perception phenomenon known as noise masking, the result of two effects: frequency masking and temporal masking.

30. MPEG audio coding
• The frequency masking effect arises because a relatively loud sound at a particular frequency reduces our sensitivity (raises the threshold) for neighboring frequencies.
• The masking effect decreases as we move away from the frequency of the sound that causes the masking.

31. MPEG audio coding
• The temporal masking effect: sensitivity to sounds in a narrow frequency range is reduced for a short period, of the order of a few milliseconds, before and after the presence of a relatively strong sound in that frequency range.

32. MPEG audio coding – noise masking in a critical frequency band
• Within a relatively narrow frequency band, a signal component in that band will mask noise components with adjacent frequencies, to an extent that decreases on either side of the masking signal.
• This effect takes place in what are called critical bands, whose width increases with frequency.

33. (figure: a masker raises the masked threshold above the threshold in quiet, so that nearby masked sounds become inaudible)

34. MPEG audio coding
• The width of the critical bands ranges from 100 Hz at low audio frequencies to about twice that above 500 Hz.
• Research indicates that there are about 24 critical bands in the audio range.
• They do not coincide exactly with the 32 MPEG sub-bands, but the match is good enough for advantage to be taken of the effect in the following way.
• The number of bits used per sample needs to be sufficient to keep the quantization noise below the noise threshold.
• If the number of bits is reduced, then the quantization noise will increase.
• This noise will be distributed over the whole signal frequency spectrum.

35. MPEG audio coding
• However, if the noise masking effect causes a sufficient increase in the thresholds, then it may be possible to decrease the number of bits needed for the samples in some sub-bands without causing a perceivable increase in noise.
• If noise masking causes the threshold to rise above the sample values in a sub-band, then these samples no longer need to be transmitted.
• Information about the characteristics of audio perception (the variation of the threshold with frequency, and the threshold increases caused by frequency and temporal masking) is codified in the form of a psycho-acoustic model which is incorporated into MPEG encoders.

36. MPEG audio encoder (figure)

37. MPEG audio coding
• The output of each of the 32 sub-band filters is requantized under the control of the psycho-acoustic model element, which is fed with the relevant features of the overall signal.
• The psycho-acoustic model generates a masking curve.
• Samples from sub-bands that lie below their respective thresholds on the masking curve are suppressed altogether.
• The number of bits per sample from the other sub-bands is reduced to the extent allowed by the noise masking effects, for which the relevant information is built into the model (see the sketch below).
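
A toy bit-allocation rule captures the idea (a sketch only: the signal levels and thresholds are invented, and the "about 6 dB of signal-to-noise ratio per quantization bit" figure is the usual rule of thumb, not the encoder's actual algorithm):

```python
import math

def allocate_bits(levels_db, thresholds_db, db_per_bit=6.0, max_bits=16):
    """Per sub-band: suppress samples below the masking threshold, otherwise
    use just enough bits to push quantization noise under the threshold."""
    bits = []
    for level, threshold in zip(levels_db, thresholds_db):
        if level <= threshold:
            bits.append(0)                    # inaudible: not transmitted
        else:
            snr_needed = level - threshold
            bits.append(min(max_bits, math.ceil(snr_needed / db_per_bit)))
    return bits

levels = [60, 40, 12, 55]       # signal level in each sub-band (dB)
masks = [20, 45, 15, 30]        # masking threshold in each sub-band (dB)
print(allocate_bits(levels, masks))   # [7, 0, 0, 5]
```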

38. MPEG audio coding (figure)

39. MPEG audio coding
• Further compression is obtained by taking into account the range of levels covered by the signal in each sub-band.
• For instance, suppose use of the psycho-acoustic model indicated that 6-bit samples in a sub-band would be adequate. If 000001 were used to represent the quantized value of a 0.1 mV sample, then 111111 would represent a sample having 63 times this value, that is, 6.3 mV.
• If 000001 represented a 10 mV sample, then 111111 would represent a 630 mV sample.

40. MPEG audio coding
• In order to allow for the variation in the range of values which signals in different sub-bands may take, further information is needed.
• This is conveyed in the form of scaling factors.
• A 6-bit number is used as the scaling factor for each sub-band, giving 64 possible values of the factor.
• The magnitude ratio between any two consecutive scaling-factor values corresponds to a 2 dB difference in the audio signals, thus covering a 128 dB dynamic range.
• The scaling factor indicates the magnitude of the step sizes for the quantized samples (see the sketch below).
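
The numbers on these two slides can be reproduced directly (a sketch: the 0.1 mV step is the earlier slide's example, and the 2 dB ratio between consecutive scale-factor values is converted to a linear gain):

```python
def scale_factor_gain(index):
    """Linear gain of scale-factor index `index`, consecutive indices
    being 2 dB apart: 10 ** (2 * index / 20)."""
    return 10 ** (2 * index / 20)

step_mv = 0.1                          # step size fixed by the scaling factor
print(round(0b111111 * step_mv, 1))    # 6.3 mV for the largest 6-bit code
print(round(scale_factor_gain(1) / scale_factor_gain(0), 3))  # 1.259: one 2 dB step
```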

41. MPEG audio coding
• For each sub-band, besides the scaling factor, the receiver needs to know the number of bits used to quantize the samples. This is conveyed as a 4-bit number, which allows up to 16 bits per sample to be used.
• We are less sensitive to level differences at high frequencies than at low ones, so the number of bits used to quantize samples is decreased for the higher-frequency sub-bands.
• At the receiving end, the decoder reverses the requantization process by applying the scaling factor for each sub-band to the samples in that band.
• A bank of digital filters is then used to recombine the samples of all 32 sub-bands into a single decoded audio signal.

42. MPEG audio decoder
• In addition to the audio data itself (including samples and scaling factors), frames carry additional control information (including error control).
• A number of audio, video and synchronization signals are combined in a transport stream.

43. Why so many audio standards?
The most important characteristics of the three MPEG-1/2 coding layers are as follows.
• Layer I is known as 'pre-MUSICAM'. The encoder can operate at one of 14 fixed output bit rates, ranging from 32 to 448 kbit/s. The rate required for a hi-fi audio channel is 192 kbit/s. The encoder and decoder are relatively simple.
• Layer II, the standard decoder for the European digital video broadcasting system, uses the algorithm known as MUSICAM with the same psycho-acoustic model as layer I.
  • For the same perceived audio quality, its output bit rate is 30-50% of that of the layer I encoder, requiring 128 kbit/s per channel for hi-fi quality.
  • Layer II uses a decreasing number of sample quantization bits with increasing sub-band frequency.
  • Layer II can encode input streams sampled at 32, 44.1 and 48 kHz. The output bit rate options are 64, 96, 128 and 192 kbit/s.

44. Why so many audio standards?
• Layer III (MP3) uses a different psycho-acoustic model, with many more sub-bands than layers I and II, and Huffman coding for better compression.
• It also uses a DCT in addition to the sub-band coding of the other two layers.
• Compressed hi-fi requires only 64 kbit/s per channel.
• The compression is roughly twice that of the layer II encoder, but its structure is much more complex.
• In MPEG-1, two audio channels are coded. They can be used independently or as a stereo pair.
• MPEG-2 can handle five channels, which allows for the transmission of surround sound.

45. Why so many audio standards?
• Another standard gaining increasing importance is Advanced Audio Coding (AAC).
• AAC was developed from 1994 onwards as an MPEG-2 option providing better compression and quality than even MPEG-1/2 layer III (MP3).
• It has a lot in common with MP3, although it is not backwards compatible with it.
• It includes many features which offer significantly lower bit rates for the same audio quality.
• AAC is also used in MPEG-4.

46. MPEG audio coding
• A further development, known as AAC+, will replace layer II encoding in the forthcoming new DAB system.
• Other, non-MPEG, audio coding standards at the time of writing are Windows Media Audio and Ogg Vorbis, both of which were also designed to improve on MP3.
• So, why so many different standards?
• One reason is that the continuing improvement in processor performance and the lowering of costs have made it possible to develop more and more complex algorithms and to include additional features.
• There is also a need to maintain 'legacy' techniques.

47. Source multiplexing
• The audio and video bitstreams often need to be combined.
• Timing information must also be provided, to control the scanning process in the display device and to synchronise the sound and picture: lip movement with speech in particular!

48. Source multiplexing
• The bitstreams at the output of the video and audio encoders are known as elementary streams (ESs).
• Each stream is segmented into packets to form the video and audio packetized elementary streams (PESs).

49. Source multiplexing
• The information carried by the packet headers includes a leading fixed start code, used for framing the packet, and fields indicating the stream identification, the packet length and whether the data is scrambled (see the sketch below).
• In the case of a single program, the packets are multiplexed to form a program stream.
• In DVB, a group of programs is multiplexed and modulated onto a radio-frequency carrier for transmission over a terrestrial or satellite radio link, or via a cable.
• The multiplexed stream is known as the transport stream.
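
A much-simplified packetizer illustrates the segmentation (a sketch only: real PES headers carry further fields, such as scrambling and timestamp flags, and define the length field over more than just the payload; stream id 0xC0 does denote an MPEG audio stream, but the 184-byte payload size here is an arbitrary illustrative choice):

```python
import struct

START_CODE_PREFIX = b"\x00\x00\x01"   # fixed start code used for framing

def packetize(elementary_stream: bytes, stream_id: int, payload_size: int = 184):
    """Segment an elementary stream into simplified PES-like packets:
    start-code prefix, stream id, 2-byte payload length, then the payload."""
    packets = []
    for i in range(0, len(elementary_stream), payload_size):
        payload = elementary_stream[i:i + payload_size]
        header = START_CODE_PREFIX + struct.pack(">BH", stream_id, len(payload))
        packets.append(header + payload)
    return packets

audio_es = bytes(1000)                     # stand-in for coded audio data
pes = packetize(audio_es, stream_id=0xC0)  # 0xC0: an MPEG audio stream id
print(len(pes), "packets;", len(pes[0]), "bytes in the first")  # 6 packets; 190 bytes
```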

50. Source multiplexing (figure)
