
A Tutorial on MPEG/Audio Compression


Presentation Transcript


  1. A Tutorial on MPEG/Audio Compression Davis Pan, IEEE MultiMedia, Summer 1995 Presented by: Randeep Singh Gakhal CMPT 820, Spring 2004

  2. Outline • Introduction • Technical Overview • Polyphase Filter Bank • Psychoacoustic Model • Coding and Bit Allocation • Conclusions and Future Work

  3. Introduction • What does MPEG-1 Audio provide? A lossy audio compression system that is perceptually transparent, built around the weaknesses of the human ear. • Can compress by a factor of about 6 while retaining sound quality. • One part of a three-part standard that covers audio, video, and audio/video synchronization.

  4. Technical Overview

  5. MPEG-I Audio Features • PCM sampling rate of 32, 44.1, or 48 kHz • Four channel modes: • Monophonic and Dual-monophonic • Stereo and Joint-stereo • Three modes (layers in MPEG-I speak): • Layer I: Computationally cheapest, bit rates > 128 kbps • Layer II: Bit rate ~ 128 kbps, used in VCD • Layer III: Most complicated encoding/decoding, bit rates ~ 64 kbps, originally intended for streaming audio

  6. Human Audio System (ear + brain) • Human sensitivity to sound is non-linear across the audible range (20 Hz – 20 kHz) • Audible range is broken into regions within which humans cannot perceive a difference in frequency • called the critical bands

  7. MPEG-I Encoder Architecture[1]

  8. MPEG-I Encoder Architecture • Polyphase Filter Bank: Transforms PCM samples to frequency domain signals in 32 subbands • Psychoacoustic Model: Calculates acoustically irrelevant parts of signal • Bit Allocator: Allots bits to subbands according to input from psychoacoustic calculation. • Frame Creation: Generates an MPEG-I compliant bit stream.

  9. The Polyphase Filter Bank

  10. Polyphase Filter Bank • Divides audio signal into 32 equal width subband streams in the frequency domain. • Inverse filter at decoder cannot recover signal without some, albeit inaudible, loss. • Based on work by Rothweiler[2]. • Standard specifies 512 coefficient analysis window, C[n]

  11. Polyphase Filter Bank • Buffer of 512 PCM samples, with 32 new samples X[n] shifted in every computation cycle • Calculate window samples for i = 0…511: Z[i] = C[i]·X[i] • Partial calculation for i = 0…63: Y[i] = Σ_{j=0…7} Z[i + 64j] • Calculate 32 subband samples for i = 0…31: S[i] = Σ_{k=0…63} M[i][k]·Y[k], where M[i][k] = cos((2i + 1)(k − 16)π/64)
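One computation cycle of the analysis filter bank can be sketched as below. The 512-coefficient window C comes from tables in the standard and is taken here as a plain input parameter; the cosine matrixing formula is the one from the paper.

```python
import numpy as np

def analysis_filterbank(buffer, C):
    """One cycle of the MPEG-1 analysis filter bank (sketch).

    buffer: the 512 most recent PCM samples (X[0] newest, per the standard)
    C:      the 512-coefficient analysis window from the standard's tables
    Returns the 32 subband samples S[0..31].
    """
    # Window the input: Z[i] = C[i] * X[i], i = 0..511
    Z = C * buffer
    # Partial sums: Y[i] = sum_{j=0..7} Z[i + 64j], i = 0..63
    Y = Z.reshape(8, 64).sum(axis=0)
    # Matrixing: S[i] = sum_{k=0..63} M[i][k] * Y[k], i = 0..31
    i = np.arange(32)[:, None]
    k = np.arange(64)[None, :]
    M = np.cos((2 * i + 1) * (k - 16) * np.pi / 64)
    return M @ Y
```

The three steps mirror the slide exactly; the 32×64 matrix multiply at the end accounts for the 32×64 = 2048 of the 2560 multiplies counted on slide 13.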

  12. Polyphase Filter Bank • Visualization of the filter[1]:

  13. Polyphase Filter Bank • The net effect: each subband signal is the input filtered by a cosine-modulated version of a single prototype lowpass filter, decimated by 32 • Analysis matrix: M[i][k] = cos((2i + 1)(k − 16)π/64) • Requires 512 + 32×64 = 2560 multiplies per cycle • Each subband has bandwidth π/32T, centered at odd multiples of π/64T

  14. Polyphase Filter Bank • Shortcomings: • Equal width filters do not correspond with critical band model of auditory system. • Filter bank and its inverse are NOT lossless. • Frequency overlap between subbands.

  15. Polyphase Filter Bank • Comparison of filter banks and critical bands[1]:

  16. Polyphase Filter Bank • Frequency response of one subband[1]:

  17. Psychoacoustic Model

  18. The Weakness of the Human Ear • Frequency dependent resolution: • We do not have the ability to discern minute differences in frequency within the critical bands. • Auditory masking: • When two signals of very close frequency are both present, the louder will mask the softer. • A masked signal must be louder than some threshold for it to be heard, which gives us room to introduce inaudible quantization noise.

  19. MPEG-I Psychoacoustic Models • MPEG-I standard defines two models: • Psychoacoustic Model 1: • Less computationally expensive • Makes some serious compromises in what it assumes a listener cannot hear • Psychoacoustic Model 2: • Provides more features suited for Layer III coding, assuming of course, increased processor bandwidth.

  20. Psychoacoustic Model • Convert samples to frequency domain • Use a Hann weighting and then a DFT • Gives a frequency-domain representation free of edge artifacts (from the finite window size). • Model 1 uses a 512 (Layer I) or 1024 (Layers II and III) sample window. • Model 2 uses a 1024 sample window and two calculations per frame.
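The Hann-plus-DFT step is straightforward; a minimal sketch for the 512-sample Model 1 / Layer I window:

```python
import numpy as np

def perceptual_spectrum(frame):
    """Hann-windowed DFT of one analysis frame (sketch).

    frame: 512 PCM samples (the Model 1, Layer I window size).
    Returns the power spectrum in dB, which the psychoacoustic
    model then splits into tone and noise components.
    """
    window = np.hanning(len(frame))          # Hann weighting
    spectrum = np.fft.rfft(frame * window)   # DFT of the windowed frame
    power = np.abs(spectrum) ** 2
    # Small floor avoids log(0) on silent bins
    return 10 * np.log10(power + 1e-12)
```

The Hann taper drives the frame to zero at both ends, which is what suppresses the edge (spectral leakage) artifacts the slide refers to.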

  21. Psychoacoustic Model • Need to separate sound into “tones” and “noise” components • Model 1: • Local peaks are tones, lump remaining spectrum per critical band into noise at a representative frequency. • Model 2: • Calculate “tonality” index to determine likelihood of each spectral point being a tone • based on previous two analysis windows

  22. Psychoacoustic Model • “Smear” each signal within its critical band • Use either a masking (Model 1) or a spreading function (Model 2). • Adjust calculated threshold by incorporating a “quiet” mask – masking threshold for each frequency when no other frequencies are present.

  23. Psychoacoustic Model • Calculate a masking threshold for each subband in the polyphase filter bank • Model 1: • Selects minima of masking threshold values in range of each subband • Inaccurate at higher frequencies – recall how subbands are linearly distributed, critical bands are NOT! • Model 2: • If subband wider than critical band: • Use minimal masking threshold in subband • If critical band wider than subband: • Use average masking threshold in subband

  24. Psychoacoustic Model • The hard work is done – now, we just calculate the signal-to-mask ratio (SMR) per subband • SMR = signal energy / masking threshold • We pass our result on to the coding unit which can now produce a compressed bitstream
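Since both quantities are normally carried in dB, the energy/threshold ratio becomes a simple per-subband difference:

```python
import numpy as np

def signal_to_mask_ratio(signal_energy_db, mask_threshold_db):
    """SMR per subband, in dB (sketch).

    In dB, SMR = signal energy / masking threshold turns into a
    subtraction. A large positive SMR means the subband needs many
    bits to keep quantization noise below the mask; a negative SMR
    means the subband is entirely masked.
    """
    return np.asarray(signal_energy_db) - np.asarray(mask_threshold_db)
```

Example: a subband with 60 dB of energy against a 45 dB mask has SMR = 15 dB, while one at 40 dB against a 50 dB mask has SMR = -10 dB and can be coded very coarsely.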

  25. Psychoacoustic Model (example) • Input[1]:

  26. Psychoacoustic Model (example) • Transformation to perceptual domain[1]:

  27. Psychoacoustic Model (example) • Calculation of masking thresholds[1]:

  28. Psychoacoustic Model (example) • Signal-to-mask ratios[1]:

  29. Psychoacoustic Model (example) • What we actually send[1]:

  30. Coding and Bit Allocation

  31. Layer Specific Coding • Layer specific frame formats[1]:

  32. Layer Specific Coding • Stream of samples is processed in groups[1]:

  33. Layer I Coding • Group 12 samples from each of the 32 subbands and encode them in each frame (12 × 32 = 384 samples) • Each group encoded with 0–15 bits/sample • Each group has a 6-bit scale factor
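A deliberately simplified sketch of coding one 12-sample subband group. The real standard draws the scale factor from a 63-entry table (indexed by the 6 bits) and uses quantizers with 2^b − 1 levels; here the group's peak magnitude stands in for the table lookup, which is an assumption made only to show the idea.

```python
import numpy as np

def encode_group(samples, bits):
    """Simplified Layer I-style coding of one subband group (sketch)."""
    assert len(samples) == 12
    # Stand-in scale factor: the group's peak magnitude
    # (the standard uses a 63-entry lookup table instead)
    scale = float(np.max(np.abs(samples))) or 1.0
    half = 2 ** (bits - 1) - 1               # gives 2^bits - 1 levels
    q = np.round(samples / scale * half).astype(int)
    return scale, q

def decode_group(scale, q, bits):
    """Inverse of encode_group: rescale the quantized values."""
    half = 2 ** (bits - 1) - 1
    return q / half * scale
```

The reconstruction error is bounded by half a quantizer step, scale / (2 · (2^(bits−1) − 1)), which is exactly the noise the bit allocator must keep under the masking threshold.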

  34. Layer II Coding • Similar to Layer I except: • Groups are now 3 of 12 samples per-subband = 1152 samples per frame • Can have up to 3 scale factors per subband to avoid audible distortion in special cases • Called scale factor selection information (SCFSI)

  35. Layer III Coding • Further subdivides subbands using the Modified Discrete Cosine Transform (MDCT), a transform that is itself lossless (invertible via overlap-add) • Larger frequency resolution => smaller time resolution • possibility of pre-echo • Layer III encoder can detect and reduce pre-echo by “borrowing bits” from future encodings
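The textbook MDCT definition can be written directly; this is a slow direct sketch, not the standard's fast implementation. Layer III applies it to 36-sample (long) or 12-sample (short) windows of each subband's output, yielding 18 or 6 coefficients.

```python
import numpy as np

def mdct(x):
    """Direct (slow) MDCT of a 2N-sample block, returning N coefficients.

    X[k] = sum_{n=0}^{2N-1} x[n] * cos(pi/N * (n + 1/2 + N/2) * (k + 1/2))

    Critically sampled: 2N inputs give only N outputs, but overlap-add
    of 50%-overlapping blocks reconstructs the signal exactly, which is
    why the transform itself is lossless.
    """
    N = len(x) // 2
    n = np.arange(2 * N)[None, :]
    k = np.arange(N)[:, None]
    basis = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return basis @ x
```

With a 36-sample window, each of the 32 subbands yields 18 MDCT lines, giving Layer III its 576-line frequency resolution per granule.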

  36. Bit Allocation • Determine number of bits to allot for each subband given SMR from psychoacoustic model. • Layers I and II: • Calculate mask-to-noise ratio: • MNR = SNR – SMR (in dB) • SNR given by MPEG-I standard (as function of quantization levels) • Now iterate until no bits to allocate left: • Allocate bits to subband with lowest MNR. • Re-calculate MNR for subband allocated more bits.
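The Layer I/II iteration above can be sketched as a greedy loop. The SNR-vs-bits table here is a stand-in parameter for the table given in the standard, and the "one bit at a time" pool is a simplification of the real frame budget.

```python
import numpy as np

def allocate_bits(smr_db, snr_table, bit_pool):
    """Greedy Layer I/II-style bit allocation (sketch).

    smr_db:    SMR per subband from the psychoacoustic model (dB)
    snr_table: snr_table[b] = SNR in dB achieved with b bits/sample
               (stand-in for the table in the MPEG-I standard)
    bit_pool:  total extra bits/sample available to hand out
    """
    bits = np.zeros(len(smr_db), dtype=int)
    max_bits = len(snr_table) - 1
    for _ in range(bit_pool):
        # MNR = SNR - SMR per subband, at the current allocation
        mnr = np.array([snr_table[b] for b in bits]) - np.asarray(smr_db)
        # Only subbands that can still take more bits are candidates
        candidates = np.where(bits < max_bits)[0]
        if len(candidates) == 0:
            break
        worst = candidates[np.argmin(mnr[candidates])]
        bits[worst] += 1                     # give a bit to the worst MNR
    return bits
```

Each added bit raises that subband's SNR (hence its MNR), so the loop naturally spreads bits until the most audible quantization noise has been pushed down the furthest.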

  37. Bit Allocation • Layer III: • Employs “noise allocation” • Quantizes each spectral value and employs Huffman coding • If Huffman encoding results in noise in excess of allowed distortion for a subband, encoder increases resolution on that subband • Whole process repeats until one of three specified stop conditions is met.

  38. Conclusions and Future Work

  39. Conclusions • MPEG-I provides tremendous compression for relatively cheap computation. • Not suitable for archival or audiophile grade music as very seasoned listeners can discern distortion. • Modifying or searching MPEG-I content requires decompression and is not cheap!

  40. Future Work • MPEG-1 audio lays the foundation for all modern audio compression techniques • Lots of progress since then (1994!) • MPEG-2 (1996) extends MPEG audio compression to support 5.1 channel audio • MPEG-4 (1998) attempts to code based on perceived audio objects in the stream • Finally, MPEG-7 (2001) operates at an even higher level of abstraction, focusing on meta-data coding to make content searchable and retrievable

  41. References [1] D. Pan, “A Tutorial on MPEG/Audio Compression,” IEEE MultiMedia, Summer 1995. [2] J. H. Rothweiler, “Polyphase Quadrature Filters – A New Subband Coding Technique,” Proc. IEEE ICASSP, pp. 1280–1283, Boston, 1983.
