Concepts of Multimedia Processing and Transmission

Concepts of Multimedia Processing and Transmission IT 481, Lecture 6 Dennis McCaughey, Ph.D. 26 February, 2007

Conventional Audio Signal Format
• On vinyl and audio cassettes, the audio waveform is recorded as an analogue signal. Therefore any imperfections will be heard as noise (hiss) or other defects.
• To reduce these defects, CDs use Pulse Code Modulation (PCM), the simplest of digital coding technologies.
Slide: Courtesy, Hung Nguyen Dennis Mccaughey, IT 481, Spring 2007

Pulse Code Modulation (PCM)
• Using PCM technology, samples of the analogue waveform are taken at regular intervals and stored as numbers. The example below shows the conversion of an analogue waveform (which could be part of an audio signal) to digital by representing each sample by a number (from 0 to 100 in this simple example).
Slide: Courtesy, Hung Nguyen

Sampling for Audio Signal
• In practice the range of values and the sampling rate must be high enough to ensure accurate reproduction of the original analogue waveform.
• The upper limit for the human ear is about 20 kHz, so the audio must be sampled at least 40,000 times per second (at least two samples are needed per cycle of the highest frequency, one for each half of the sine wave).
• To reduce distortion and quantization noise, each sample must be represented by at least a 16-bit number, giving 65,536 values or levels (0 to 65,535) per sample.
Slide: Courtesy, Hung Nguyen
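
The sampling and quantization steps described above can be sketched in a few lines. This is a minimal illustration only, assuming samples normalised to the range -1.0 to 1.0 and a uniform quantizer; the function names `pcm_encode` and `pcm_decode` and the 1 kHz test tone are hypothetical choices, not part of the lecture.

```python
import math

def pcm_encode(samples, bits=16):
    """Map samples in [-1.0, 1.0] to integer codes 0 .. 2**bits - 1."""
    levels = 2 ** bits
    codes = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip to the valid input range
        codes.append(round((x + 1.0) / 2.0 * (levels - 1)))
    return codes

def pcm_decode(codes, bits=16):
    """Map integer codes back to amplitudes in [-1.0, 1.0]."""
    levels = 2 ** bits
    return [c / (levels - 1) * 2.0 - 1.0 for c in codes]

fs = 44_100        # samples per second, above the 40,000 minimum
tone_hz = 1_000    # assumed test tone
samples = [math.sin(2 * math.pi * tone_hz * n / fs) for n in range(441)]

codes = pcm_encode(samples)
decoded = pcm_decode(codes)

step = 2.0 / (2 ** 16 - 1)  # amplitude spacing between adjacent levels
max_error = max(abs(a - b) for a, b in zip(samples, decoded))
print(max_error <= step / 2 + 1e-12)
```

With 16 bits the worst-case rounding error is half a quantization step, which is why more bits per sample mean less quantization noise.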

CD Digital Audio Parameters
• Audio is stored on Compact Discs with the following parameters:
Parameter                       Value
Sample rate                     44.1 kHz
Channels                        2 (stereo)
Bits per sample, per channel    16
Levels per sample               65,536
Total data rate                 1.4112 Mb/s
Slide: Courtesy, Hung Nguyen
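
The levels-per-sample and total-data-rate entries follow directly from the other three parameters. A short check (the variable names are illustrative only):

```python
sample_rate_hz = 44_100
channels = 2
bits_per_sample = 16

levels = 2 ** bits_per_sample
bit_rate = sample_rate_hz * channels * bits_per_sample  # bits per second

print(levels)            # 65536
print(bit_rate)          # 1411200
print(bit_rate / 1e6)    # 1.4112
```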

Data Integrity in Audio-CD
• Digital encoding allows the use of error correction codes, which are necessary to correct errors resulting from the manufacturing process and minor damage or marks which may occur from handling and use.
• The result is that the amount of data stored on a CD is nearly four times the data needed to represent the audio only. But this is a small price to pay for a robust format that allows recordings to be played back free of clicks, hiss and other defects associated with analog media.
Slide: Courtesy, Hung Nguyen

CD Error Correction and Modulation
• Error correction is provided by CIRC (Cross-Interleaved Reed-Solomon Code), which adds two-dimensional parity information and also interleaves the data on the disc to protect against burst errors.
• CIRC corrects error bursts of up to 3,500 bits (2.4 mm in length) and compensates for error bursts of up to 12,000 bits (8.5 mm), such as those caused by minor scratches.
• EFM (Eight-to-Fourteen Modulation): CD-ROM discs use a 14-bit byte, a modification made necessary by the way data is stored and read with lasers, using the pits (indentations) and lands (spaces between indentations) on the disc.
• In transferring from magnetic to optical media, each 8-bit byte is modulated and stored on the optical medium as a 14-bit byte. This reduces the effect of jitter and other distortions on the error rate.
• When the computer reads the CD-ROM, an interface card demodulates the 14-bit optical code back to 8-bit code.
Slide: Courtesy, Hung Nguyen

CD Data Format
Slide: Courtesy, Hung Nguyen

DVD Coding Format
Item | Audio Object | Video Object
Encoding methods (mandatory) | Linear PCM (scalable), Packed PCM (lossless encoding) | Linear PCM, Dolby AC3
Encoding methods (optional) | none | MPEG Audio, DTS, SDDS
Audio specifications for Linear PCM and Packed PCM encoding schemes:
Sampling frequency | 48/96/192 kHz, 44.1/88.2/176.4 kHz | 48/96 kHz
Quantization depth | 16/20/24 bits | 16/20/24 bits
Maximum number of channels | 6ch (fs: 48/96/44.1/88.2 kHz) or 2ch (fs: 192/176.4 kHz) | 8ch (2ch for stereo + 6ch for multichannel)
Maximum bit rate | 9.6 Mbps (Linear PCM / Packed PCM) | 6.144 Mbps (Linear PCM)
Frame rate | 1200 Hz (fs: 48/96/192 kHz), 1102.5 Hz (fs: 44.1/88.2/176.4 kHz) | 600 Hz (fs: 48/96 kHz)

Dynamic Range of CD and DVD
Slide: Courtesy, Hung Nguyen

Delta Modulation
• In delta modulation, the differences between successive speech samples are encoded, and the original signal is recovered by the decoder at the receiving end.
• The analog signal is approximated with a series of segments.
• Each segment of the approximated signal is compared to the original analog wave to determine the increase or decrease in relative amplitude.
• The decision process for establishing the state of successive bits is determined by this comparison.
• Only the change of information is sent: only an increase or decrease of the signal amplitude from the previous sample is sent, whereas a no-change condition causes the modulated signal to remain at the same 0 or 1 state as the previous sample.
Slide: Courtesy, Hung Nguyen
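
As a rough illustration of the comparison-and-step process described above, here is a minimal sketch of the common textbook form of linear delta modulation, in which every bit moves the approximation one fixed step up or down. The fixed step size, the function names and the test signal are assumptions for illustration, not the encoder drawn in the lecture figures.

```python
import math

def delta_encode(samples, step):
    """Emit 1 if the input is at or above the running approximation, else 0."""
    approx = 0.0
    bits = []
    for x in samples:
        bit = 1 if x >= approx else 0
        approx += step if bit else -step  # staircase follows the input
        bits.append(bit)
    return bits

def delta_decode(bits, step):
    """Rebuild the staircase approximation from the bit stream alone."""
    approx = 0.0
    out = []
    for bit in bits:
        approx += step if bit else -step
        out.append(approx)
    return out

# A slowly varying sine: its largest change per sample (about 0.031)
# is below the step size, so the staircase can keep up with it.
signal = [math.sin(2 * math.pi * n / 200) for n in range(400)]
step = 0.05
bits = delta_encode(signal, step)
decoded = delta_decode(bits, step)
max_error = max(abs(a - b) for a, b in zip(signal, decoded))
print(len(bits), max_error <= step + 1e-9)
```

If the input changes faster than one step per sample, the staircase falls behind (slope overload), which is the problem that adaptive step sizes address.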

Delta-Mod Encoder

Delta-Mod Decoder

Delta Modulation - example

Delta Modulation Variants
• Examples of delta modulation variants are continuously variable slope delta modulation and sigma-delta modulation.
• Continuously variable slope delta (CVSD) modulation: A type of delta modulation in which the size of the steps of the approximated signal is progressively increased or decreased as required to make the approximated signal closely match the input analog wave.
• Sigma-delta modulation: Delta modulation in which the integral of the input signal is encoded rather than the signal itself. Note: Sigma-delta modulation may be achieved by including a digital integrator preceding the quantizer in a delta-modulation encoder.
• An important concept in “state-of-the-art” A/D converters.
Slide: Courtesy, Hung Nguyen
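
To make the role of the integrator concrete, the following is a sketch of one common first-order sigma-delta loop: the difference between the input and the fed-back 1-bit output is integrated, and the sign of the integrator gives the next bit. This particular loop structure is an assumption and may differ from the encoder in the lecture figure; the name `sigma_delta` is hypothetical.

```python
def sigma_delta(samples):
    """First-order sigma-delta modulator for inputs in (-1.0, 1.0)."""
    integrator = 0.0
    feedback = 0.0  # previous output expressed as +1.0 or -1.0
    bits = []
    for x in samples:
        integrator += x - feedback          # accumulate the error
        bit = 1 if integrator >= 0 else 0   # 1-bit quantizer
        feedback = 1.0 if bit else -1.0
        bits.append(bit)
    return bits

# For a constant input, the average of the +/-1 output approaches the input.
level = 0.3
bits = sigma_delta([level] * 1000)
average = sum(1.0 if b else -1.0 for b in bits) / len(bits)
print(abs(average - level) < 0.01)
```

Averaging (low-pass filtering) the 1-bit stream recovers the input, which is the principle such A/D converters rely on.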

Sigma-Delta-Mod Encoder

G.721 Adaptive Differential Pulse Code Modulation (ADPCM)
• PCM does not attempt to remove speech signal redundancy; this is done by the ADPCM encoder
• The CCITT standard G.721 ADPCM algorithm for 32 kbps speech coding is used in CT2 and DECT cordless phone systems
• In practice, ADPCM encoders are implemented using a linear predictor for the current sample, and the difference between the predicted and actual sample (the prediction error) is encoded for transmission
• Prediction is based on knowledge of the autocorrelation properties of speech
Slide: Courtesy, Hung Nguyen

Adaptive PCM Example
• In an adaptive PCM system for speech coding, the input signal is sampled at 8 kHz and each sample is represented by 8 bits. The quantizer step size is recomputed every 10 msec and is encoded for transmission using 5 bits. What would be the transmission bit rate of such a speech coder?
• Sampling frequency = fs = 8 kHz
• Number of bits per sample = n = 8 bits
• Number of information bits per second = 8,000 x 8 = 64,000 bits/sec
• Since the quantization step size is recomputed every 10 msec, 100 step-size values must be transmitted every second
• Therefore, the number of overhead bits = 100 x 5 = 500 bits/sec, and the effective transmission bit rate is 64,000 + 500 = 64,500 bits/sec
Slide: Courtesy, Hung Nguyen
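
The arithmetic of this example can be checked directly; it works out to 64,500 bits/sec. The variable names below are illustrative only.

```python
fs_hz = 8_000                 # sampling frequency
bits_per_sample = 8
step_update_interval_ms = 10  # step size recomputed every 10 msec
step_size_bits = 5

info_rate = fs_hz * bits_per_sample                     # speech bits per second
updates_per_second = 1000 // step_update_interval_ms    # step sizes per second
overhead_rate = updates_per_second * step_size_bits     # side-information bits
total_rate = info_rate + overhead_rate

print(info_rate)       # 64000
print(overhead_rate)   # 500
print(total_rate)      # 64500
```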

ADPCM Encoder used in CT2
Slide: Courtesy, Hung Nguyen

DPCM Encoder (Simplified)
Neglecting the quantizer, it is easy to show: e(n) = s(n) - a·s(n-1), where s(n) is the current input sample, a is the predictor coefficient and e(n) is the prediction error.
The coder may be a Huffman (entropy) encoder.
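
A minimal sketch of the first-order relation e(n) = s(n) - a·s(n-1), with the quantizer and entropy coder left out. To keep the round trip exact on integer samples, the prediction is rounded to an integer; that rounding, the value a = 0.95 and the function names are assumptions made for this illustration.

```python
import math

def dpcm_encode(samples, a):
    """Return prediction errors e(n) = s(n) - round(a * s(n-1))."""
    prev = 0
    errors = []
    for s in samples:
        errors.append(s - round(a * prev))
        prev = s
    return errors

def dpcm_decode(errors, a):
    """Rebuild s(n) = e(n) + round(a * s(n-1))."""
    prev = 0
    out = []
    for e in errors:
        s = e + round(a * prev)
        out.append(s)
        prev = s
    return out

# Integer test signal with strong sample-to-sample correlation.
samples = [round(1000 * math.sin(2 * math.pi * n / 50)) for n in range(100)]
a = 0.95
errors = dpcm_encode(samples, a)
decoded = dpcm_decode(errors, a)

print(decoded == samples)
print(max(abs(s) for s in samples), max(abs(e) for e in errors))
```

The prediction errors span a much smaller range than the samples themselves, so they can be coded with fewer bits.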

DPCM Decoder (Simplified)

DPCM Encoder Schematic

DPCM Decoder Schematic

Increased Predictor Order
• Compression performance can be improved by basing the prediction on more than just the previous sample
• In the example a 3rd-order predictor is used
• The previous three samples, held in R1, R2 and R3, are weighted by C1, C2 and C3 and added to form the overall prediction
• C1, C2 and C3 are functions of the correlation between the first sample and the following two
• e.g. for a Markov process, C2 = (C1)^2 and C3 = (C1)^3
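
The register-and-weight structure described above can be sketched as follows. The coefficient values (C1 = 0.5, with C2 and C3 set to its square and cube as in the Markov example) are purely illustrative, the prediction is rounded so that integer samples round-trip exactly, and the function names are hypothetical.

```python
def predict(registers, coeffs):
    """Weighted sum C1*R1 + C2*R2 + C3*R3, rounded to an integer."""
    return round(sum(c * r for c, r in zip(coeffs, registers)))

def dpcm3_encode(samples, coeffs):
    registers = [0] * len(coeffs)  # R1 (most recent), R2, R3
    errors = []
    for s in samples:
        errors.append(s - predict(registers, coeffs))
        registers = [s] + registers[:-1]  # shift the new sample into R1
    return errors

def dpcm3_decode(errors, coeffs):
    registers = [0] * len(coeffs)
    out = []
    for e in errors:
        s = e + predict(registers, coeffs)
        out.append(s)
        registers = [s] + registers[:-1]
    return out

c1 = 0.5                          # assumed value for illustration
coeffs = [c1, c1 ** 2, c1 ** 3]   # C2 = (C1)^2, C3 = (C1)^3
samples = [10, 12, 15, 19, 22, 24, 25, 25, 23, 20]
errors = dpcm3_encode(samples, coeffs)
print(dpcm3_decode(errors, coeffs) == samples)
```

The decoder runs the same predictor on its own reconstructed samples, so encoder and decoder stay in step without any side information.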

DPCM: Third Order Predictor Encoder

DPCM: Third Order Decoder Schematic

Sub-band Coding (SBC)
• Quantization typically produces distortion that is broad in spectrum, but the human ear does not detect distortion equally well at all frequencies
• Thus it is possible to achieve a substantial improvement in quality by coding speech in narrower bands
• Speech is typically divided into four or eight sub-bands by a bank of filters, and each sub-band is sampled at its band-pass Nyquist rate and encoded according to a perceptual criterion
• SBC can be thought of as a method of controlling and distributing quantization noise across the signal spectrum
Slide: Courtesy, Hung Nguyen

An SBC Encoder
Slide: Courtesy, Hung Nguyen

An SBC Decoder
Slide: Courtesy, Hung Nguyen

Example of SBC
• This table gives the frequency range of each band with the number of bits used to encode each band
• Assuming that no side information needs to be transmitted, compute the minimum encoding rate of this SBC encoder
Slide: Courtesy, Hung Nguyen

Example of SBC (cont’d)
• For perfect reconstruction of band-pass signals, each band must be sampled at its Nyquist rate, which is twice the band's bandwidth
• Band 1: 2 x (450 - 225) = 450 samples/sec
• Band 2: 2 x (900 - 450) = 900 samples/sec
• Band 3: 2 x (1,500 - 1,000) = 1,000 samples/sec
• Band 4: 2 x (2,700 - 1,800) = 1,800 samples/sec
• Total encoding rate is 450 x 4 + 900 x 3 + 1,000 x 2 + 1,800 x 1 = 8,300 bits/s
Slide: Courtesy, Hung Nguyen
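
The same calculation as a short script. The band edges and bits per sample are the ones used in the worked figures above; the tuple layout and variable names are illustrative.

```python
# (lower edge in Hz, upper edge in Hz, bits per sample) for each sub-band
bands = [
    (225, 450, 4),
    (450, 900, 3),
    (1000, 1500, 2),
    (1800, 2700, 1),
]

sample_rates = [2 * (high - low) for low, high, _ in bands]  # band-pass Nyquist
bit_rates = [rate * bits for rate, (_, _, bits) in zip(sample_rates, bands)]
total_rate = sum(bit_rates)

print(sample_rates)  # [450, 900, 1000, 1800]
print(total_rate)    # 8300
```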

G.722 Adaptive DPCM
• Better sound quality than G.721
• Employs subband coding
• Input speech bandwidth is extended to cover 50 Hz to 7 kHz
• Divides the frequency band into two subbands
  • 50 Hz to 3.5 kHz
  • 3.5 kHz to 7 kHz
• Each subband is sampled and encoded independently using ADPCM
• Operating bit rate can be 64, 56 or 48 kbps
  • e.g. at 64 kbps, the lower band is coded at 48 kbps and the upper band at 16 kbps

G.722 Adaptive DPCM (ADPCM) Subband Encoder

G.722 Adaptive DPCM (ADPCM) Subband Decoder

Linear Predictive Coding
• LPC analyzes the audio waveform to determine a selection of the perceptual features it contains
• These are then quantized and sent to the destination, which uses them, together with a sound synthesizer, to regenerate a sound that is perceptually comparable with the original
• Although the result sounds synthetic, very high compression ratios can be obtained

LPC Features
• Perceptual
  • Pitch: closely related to the frequency of the signal; important since the ear is more sensitive in the frequency range of 2-5 kHz
  • Period: the duration of the signal
  • Loudness: the average energy in the signal
• Vocal tract excitation parameters
  • Voiced sounds: generated through the vocal cords, such as those related to the letters m, v and l
  • Unvoiced sounds: the vocal cords are open, such as those related to the letters f and s

Linear Predictive Coding (LPC) Signal Encoder

Linear Predictive Coding (LPC) Signal Decoder

Perceptual Properties of the Ear: Sensitivity as a Function of Frequency
The ear is most sensitive in the range of 2-5 kHz. Tone A is audible while Tone B is not.

Perceptual Properties of the Ear: Frequency Masking
A loud tone suppresses a quieter one. Tone B masks Tone A: Tone B is audible while Tone A is not, even though Tone A would be audible by itself.

Variation with Frequency of the Effect of Frequency Masking
The masking effect is a function of frequency band. The width of each curve at a particular sound level is known as the critical bandwidth. Experiments show that the critical bandwidth increases approximately linearly with frequency, in steps of 100 Hz, e.g. for a signal of 1 kHz (2 x 500 Hz) the critical bandwidth is about 200 Hz.

Temporal Masking Caused by a Loud Signal
After the ear hears a loud sound, there is a delay before it can hear a quieter sound.

MPEG Perceptual Audio Coding
• Perceptual encoding is a lossy compression technique, i.e. the decoded data is not an exact replica of the original digital audio data.
• Instead, digital audio data is compressed in such a way that, despite the high compression rate, the decoded audio sounds exactly, or as closely as possible, like the original audio.
• This is achieved by adapting the encoding process to the characteristics of the human perception of sound:
  • The parts of the audio signal that humans perceive distinctly are coded with high accuracy,
  • The less distinctive parts are coded less accurately, and parts of the sound we do not hear at all are mostly discarded or replaced by quantization noise.

MPEG-1&2 Encoder Psychoacoustic Model

New Features for Layer 3 (MP3)
• Modified DCT (MDCT)
  • DCT with overlap
• Long/short window switching
  • Short for better temporal resolution (to prevent pre-echoes)
  • Long for better frequency resolution
• Non-uniform quantization
• Entropy coding
  • Run-length and Huffman coding
• Bit reservoir (buffer)
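
To show what "DCT with overlap" means in practice, here is a small sketch of the MDCT and its inverse using 50% overlapping blocks. It is deliberately simplified: no window is applied (MP3 itself windows each block), the block size is tiny, and the direct O(N^2) sums and function names are illustrative assumptions. Each block of 2N samples yields only N coefficients, and the time-domain aliasing in each inverse block cancels when neighbouring blocks are overlap-added.

```python
import math

def mdct(block):
    """MDCT of a block of 2N samples, returning N coefficients."""
    n_half = len(block) // 2
    return [
        sum(x * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n, x in enumerate(block))
        for k in range(n_half)
    ]

def imdct(coeffs):
    """Inverse MDCT: N coefficients to 2N time-aliased samples."""
    n_half = len(coeffs)
    return [
        sum(c * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
            for k, c in enumerate(coeffs)) / n_half
        for n in range(2 * n_half)
    ]

def mdct_roundtrip(signal, n_half):
    """Transform 50%-overlapping blocks and overlap-add the inverses."""
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n_half + 1, n_half):
        block = signal[start:start + 2 * n_half]
        for i, y in enumerate(imdct(mdct(block))):
            out[start + i] += y
    return out

signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(32)]
n_half = 4
reconstructed = mdct_roundtrip(signal, n_half)

# Samples covered by two overlapping blocks are recovered; the first and
# last n_half samples belong to only one block and remain aliased.
interior = range(n_half, len(signal) - n_half)
print(all(abs(reconstructed[i] - signal[i]) < 1e-9 for i in interior))
```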

MPEG 1 Layer 3 (MP3) Encoder

MP3 Components
• Perceptual model: An estimate of the actual (time and frequency dependent) masking threshold is computed by using rules known from psychoacoustics.
• Filter bank: A hybrid polyphase / MDCT filter bank is used to decompose the input signal into sub-sampled spectral components. Together with the corresponding inverse filter bank in the decoder it forms an analysis/synthesis system.
• Quantization and coding: The spectral components are quantized and coded with the aim of keeping the noise introduced by the quantization below the masking threshold.
  • Distortion Control Loop
  • Non-uniform Quantization Control Loop
  • Huffman Coding
• Multiplexing: A bit stream formatter is used to assemble the bit stream, which consists of the quantized and coded spectral coefficients and some side information, e.g. bit allocation information.

Perceptual Model
• The perceptual model outputs values for the masking threshold, or allowed noise, for each coder partition.
• In Layer 3, these coder partitions are roughly equivalent to the critical bands of human hearing.
• If the quantization noise can be kept below the masking threshold for each coder partition, the compression result should be indistinguishable from the original signal.

Psychoacoustic Model
• Time-align audio data
  • The psychoacoustic model must account for both the delay of the audio data through the filter bank and a data offset so that the relevant data is centered within its analysis window
• Convert audio to spectral domain
  • The psychoacoustic model uses a time-to-frequency mapping such as a 512- or 1,024-point Fourier transform
  • A standard Hanning window, applied to the audio data before the Fourier transform, conditions the data to reduce the edge effects of the transform window
• Partition spectral values into critical bands
  • To simplify the psychoacoustic calculations, the model groups the frequency values into perceptual quanta
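
The "convert audio to spectral domain" step can be sketched as a Hanning-windowed 512-point transform. This is only an illustration of the windowing and time-to-frequency mapping, not the MPEG psychoacoustic model itself: it uses a direct DFT rather than an FFT, the periodic form of the window, and an assumed 1 kHz test tone at a 44.1 kHz sample rate. All names are hypothetical.

```python
import cmath
import math

def hann(n_points):
    """Hanning (Hann) window, periodic form."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points)
            for n in range(n_points)]

def dft_magnitudes(frame):
    """Magnitudes of the first N/2 DFT bins (direct O(N^2) evaluation)."""
    n_points = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
                for n, x in enumerate(frame)))
        for k in range(n_points // 2)
    ]

fs = 44_100       # assumed sample rate
n_points = 512    # transform size named in the slide
tone_hz = 1_000   # assumed test tone

frame = [math.sin(2 * math.pi * tone_hz * n / fs) for n in range(n_points)]
windowed = [w * x for w, x in zip(hann(n_points), frame)]
magnitudes = dft_magnitudes(windowed)

peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin * fs / n_points
print(peak_bin, peak_hz)
```

Each bin is about 86 Hz wide at these settings, so the peak lands within one bin of the tone; grouping bins into critical bands would be the next step of the model.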

MPEG Audio Filter Bank Boundaries
Finer resolution at lower frequencies
