Audio Codecs Dan Mechanic CS W4995
Why are there different codecs? Each tries to find the best balance between: • Fast Processing • Good Compression • Quality (accurate) decoding
The best balance can depend on application: Music: WAV encoder compromises compression • lossless • ~1.4 Mbit/s • Sacrifice: Compression AAC encoder compromises fast processing • technically lossy, but still quality decoding • via sophisticated compression algorithms, 320 kbit/s • Sacrifice: Processing Compact Disc: 16-bit 44.1 kHz
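The ~1.4 Mbit/s figure follows from the Compact Disc format. A small sketch of the arithmetic, assuming two channels (stereo is not stated on the slide) and a hypothetical helper name `pcm_bitrate`:

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels):
    """Raw (uncompressed) PCM bitrate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# Compact Disc: 16-bit samples at 44.1 kHz; two channels is an assumption.
cd_bps = pcm_bitrate(44_100, 16, 2)
print(cd_bps)  # 1411200, i.e. about 1.4 Mbit/s

# Ratio between uncompressed CD audio and a 320 kbit/s lossy stream.
print(cd_bps / 320_000)
```

The same helper gives the bitrate of 8-bit, 8000 samples-per-second telephone audio as `pcm_bitrate(8000, 8, 1)`.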
Why are there different codecs? Standards • Recommendations from the ITU (International Telecommunication Union) Existing Technologies • G.711 was created in the early seventies for PSTN lines supporting 8-bit samples at 8000 samples per second • Now G.711 can be a good choice for VoIP because it sounds like a traditional landline and has low latency (less processing at the media gateways) Patents End User Expectations
Other constraints… Nyquist Theorem - “When converting from an analog signal to digital (or otherwise sampling a signal at discrete intervals), the sampling frequency must be greater than twice the highest frequency of the input signal in order to be able to reconstruct the original perfectly from the sampled version.” source: http://www.fact-index.com/n/ny/nyquist_shannon_sampling_theorem.html
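A short illustration of what goes wrong below the Nyquist rate. This is a sketch with assumed values: at 8000 samples per second, a 5 kHz tone (above half the sampling rate) produces the same samples as a 3 kHz tone, so the two cannot be told apart after sampling.

```python
import math


def sample_cosine(freq_hz, sample_rate_hz, n_samples):
    """Sample cos(2*pi*f*t) at discrete instants t = n / sample_rate_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


fs = 8000                               # telephone-style sampling rate
tone_3k = sample_cosine(3000, fs, 16)   # below fs / 2: representable
tone_5k = sample_cosine(5000, fs, 16)   # above fs / 2: aliases onto 3 kHz

max_diff = max(abs(a - b) for a, b in zip(tone_3k, tone_5k))
print(max_diff < 1e-9)  # True: the sampled sequences are indistinguishable
```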
What methods do codecs meant for speech use? • Many, many codecs… • only a handful of methodologies.
Pulse Code Modulation image source: http://en.wikipedia.org/wiki/Pulse-code_modulation
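PCM in its simplest form: sample the waveform at regular intervals and round each sample to the nearest of a fixed set of levels. A minimal sketch, assuming input samples normalized to [-1.0, 1.0]; the function names are illustrative.

```python
import math


def pcm_encode(samples, bits):
    """Uniformly quantize samples in [-1.0, 1.0] to signed integer codes."""
    max_code = 2 ** (bits - 1) - 1
    return [round(max(-1.0, min(1.0, s)) * max_code) for s in samples]


def pcm_decode(codes, bits):
    """Map integer codes back to values in [-1.0, 1.0]."""
    max_code = 2 ** (bits - 1) - 1
    return [c / max_code for c in codes]


fs = 8000  # samples per second
signal = [math.sin(2 * math.pi * 440 * n / fs) for n in range(80)]

codes = pcm_encode(signal, bits=8)
decoded = pcm_decode(codes, bits=8)

# The quantization error never exceeds half a step.
worst_error = max(abs(a - b) for a, b in zip(signal, decoded))
print(worst_error <= 0.5 / 127 + 1e-12)  # True
```

More bits per sample shrink the step (and the error) but raise the bitrate, which is the trade-off the following slides address.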
Pulse Code Modulation can require a high bitrate. G.711 uses “companding” (compressing + expanding) algorithms to reduce bitrate. • Compression (at the encoder) - reduces the dynamic range, so peaks take fewer bits • Expansion (at the decoder) - restores the original range • Actually performed via a logarithmic transformation of a 13- or 14-bit number to an 8-bit number
μ-law and A-law algorithms μ-law • Used in North America and Japan • specifically for turning 14-bit encoding to 8 A-law • Used in Europe • converts 13 bit to 8 bit
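The logarithmic transformation can be sketched with the continuous μ-law formula, y = sgn(x) · ln(1 + μ|x|) / ln(1 + μ) with μ = 255. Note the assumptions: this works on samples normalized to [-1.0, 1.0] and rounds to a signed code, whereas G.711 itself specifies a piecewise-linear approximation with its own bit layout, which is not reproduced here.

```python
import math

MU = 255  # value used by mu-law in North America and Japan


def mu_compress(x):
    """Compress x in [-1, 1] logarithmically; output is also in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def mu_encode_8bit(x):
    """Illustrative 8-bit code in -127..127 (not the G.711 bit layout)."""
    return round(mu_compress(x) * 127)


def mu_decode_8bit(code):
    return mu_expand(code / 127)


quiet = 0.01  # a quiet sample, where companding helps most
companded_error = abs(mu_decode_8bit(mu_encode_8bit(quiet)) - quiet)
uniform_error = abs(round(quiet * 127) / 127 - quiet)
print(companded_error < uniform_error)  # True
```

Quiet samples get finer steps than loud ones, which is why 8 companded bits can stand in for 13 or 14 linear bits on speech.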
Differential Pulse Code Modulation • Waveforms act fairly predictably • We can look at a previous sample and predict the value of the next one. • If coder and decoder agree on what algorithm to predict with, only the difference between prediction and actual needs to be transmitted.
Differential Pulse Code Modulation image from “Speech Compression” by Mark Handley: www.cs.columbia.edu/~hgs/teaching/ais/slides/04-speech-coding.pdf
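A minimal sketch of the DPCM idea, assuming the simplest possible predictor ("the next sample equals the previous one") and unquantized differences. Real coders quantize the difference and predict from the reconstructed signal; that step is left out here to keep the round trip exact.

```python
def dpcm_encode(samples):
    """Transmit only the difference between each sample and its prediction."""
    prediction = 0
    diffs = []
    for s in samples:
        diffs.append(s - prediction)
        prediction = s  # predictor: next sample equals the current one
    return diffs


def dpcm_decode(diffs):
    """Run the same predictor and add the received differences back."""
    prediction = 0
    samples = []
    for d in diffs:
        s = prediction + d
        samples.append(s)
        prediction = s
    return samples


samples = [0, 12, 25, 37, 48, 57, 64, 69, 71, 70]
diffs = dpcm_encode(samples)
print(diffs)  # [0, 12, 13, 12, 11, 9, 7, 5, 2, -1]
print(dpcm_decode(diffs) == samples)  # True
```

The differences span a much smaller range than the samples themselves, so they can be sent with fewer bits.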
Adaptive Differential Pulse Code Modulation • Algorithms for next-sample prediction can be dynamic to more accurately represent the waveform we are encoding/decoding. • Vary the predictor to adapt to the changing characteristics of the audio being recorded. • G.721 uses the previous 8 samples, and quantizes the difference to 4 bits (32 kbit/s)
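A toy sketch of the adaptive idea, not the G.721 algorithm: the predictor is just the previous reconstructed sample, and what adapts is the quantizer step size. The growth and shrink factors and the step limits are arbitrary assumptions. The key point is that encoder and decoder update their state from the transmitted codes only, so they stay in sync without side information.

```python
def _quantize(diff, step):
    """Quantize a difference to a 4-bit signed code (-8..7)."""
    return max(-8, min(7, round(diff / step)))


def _next_step(step, code):
    """Grow the step after large codes, shrink it after small ones."""
    if abs(code) >= 6:
        step *= 2.0
    elif abs(code) <= 1:
        step *= 0.75
    return max(1.0, min(2048.0, step))


def adpcm_encode(samples, step=4.0):
    prediction = 0.0
    codes = []
    for s in samples:
        code = _quantize(s - prediction, step)
        codes.append(code)
        prediction += code * step  # same reconstruction the decoder performs
        step = _next_step(step, code)
    return codes


def adpcm_decode(codes, step=4.0):
    prediction = 0.0
    samples = []
    for code in codes:
        prediction += code * step
        samples.append(prediction)
        step = _next_step(step, code)
    return samples


codes = adpcm_encode([100] * 10)
decoded = adpcm_decode(codes)
print(decoded[:4])  # [28.0, 84.0, 100.0, 100.0]
```

At 8000 samples per second, 4 bits per sample gives the 32 kbit/s quoted above.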
Sub-Band Differential Pulse Code Modulation “not all frequencies created equal” • Lower frequencies (50Hz-3.5kHz) are important to understanding speech, and are more sensitive to quantization errors. • Higher frequencies (3.5kHz-7kHz) are used for conveying emotion and recognition of the speaker
Sub-Band Differential Pulse Code Modulation “not all frequencies created equal” …so don’t treat them the same • Sample the input at 16 kHz and split it into two bands, each down-sampled to 8 kHz • Lower frequencies (50 Hz-3.5 kHz) get more bits per sample • Higher frequencies (3.5 kHz-7 kHz), less important, get fewer bits per sample • mux these together to get 64 kbit/s… same bitrate as G.711, better decoding quality, at the price of processing • G.722
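A sketch of the band-splitting step, under loud assumptions: a two-tap average/difference filter stands in for the real band-splitting filters, and the 6-bit/2-bit allocation is the one used by G.722 at 64 kbit/s rather than anything stated on the slide.

```python
def split_bands(samples):
    """Split a signal into low and high bands, each at half the input rate."""
    pairs = range(0, len(samples) - 1, 2)
    low = [(samples[i] + samples[i + 1]) / 2 for i in pairs]   # local average
    high = [(samples[i] - samples[i + 1]) / 2 for i in pairs]  # local detail
    return low, high


def merge_bands(low, high):
    """Inverse of split_bands: interleave the reconstructed sample pairs."""
    out = []
    for l, h in zip(low, high):
        out.extend([l + h, l - h])
    return out


signal = [1, 5, 2, 8, 3, 3, 7, 1]  # stands for audio sampled at 16 kHz
low, high = split_bands(signal)    # each band now runs at 8 kHz
print(merge_bands(low, high) == signal)  # True: the split itself loses nothing

# Spend more bits on the band that matters more for intelligibility.
LOW_BITS, HIGH_BITS = 6, 2         # assumed allocation (as in G.722)
band_rate = 8000                   # samples per second in each band
print(band_rate * LOW_BITS + band_rate * HIGH_BITS)  # 64000
```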
Linear Predictive: Source-Filter Speech Model • An algorithm that models speech image source: http://mtg.upf.edu/~xserra/cursos/TDP/referencies/Park-LPC-tutorial.pdf
Linear Predictive Based on a simple model of human speech • Buzzer - your glottis or vocal cords, provides pitch • Tube - builds resonance and gives rise to ‘formants’ • Hiss and pops - tongue, lips and throat make sibilants and plosives (“s”, “k”, “p”)
Linear Predictive Formants • peaks in the frequency spectrum caused by acoustic resonance. image of the frequency response of the typical vowel sound source: http://mtg.upf.edu/~xserra/cursos/TDP/referencies/Park-LPC-tutorial.pdf
Linear Predictive Encoding • operates on a short frame of sound (around 20 ms) • remove formants, and leave ‘residue’ sound (buzz), determine tone of ‘residue’ • Determine whether sound is voiced or unvoiced • voiced - tonal “m” “v” • unvoiced - sibilance and plosives “s” “k” • formants are described by a series of linear predictive coefficients
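One common way to find the linear predictive coefficients for a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which method is meant, so treat this as a sketch of one option; the function names are illustrative. The residue is what remains after subtracting the prediction from each sample.

```python
def autocorrelation(frame, max_lag):
    """r[lag] = sum over n of frame[n] * frame[n + lag]."""
    return [sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for a[1..order] such that x[n] is approx. sum(a[k] * x[n-k])."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error                       # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)                # remaining prediction error
    return a[1:], error


def residue(frame, coeffs):
    """What is left of each sample after removing the predicted part."""
    out = []
    for n, x in enumerate(frame):
        predicted = sum(a * frame[n - k]
                        for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(x - predicted)
    return out


# Test frame with a known structure: each sample is 0.9 times the previous one.
frame = [0.9 ** n for n in range(200)]
coeffs, _ = levinson_durbin(autocorrelation(frame, 1), order=1)
print(round(coeffs[0], 6))  # 0.9
```

For speech, a frame of around 20 ms at 8000 samples per second is 160 samples, and the order is typically around 10 rather than 1.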
Linear Predictive Decoding image source: www.cs.columbia.edu/~hgs/teaching/ais/slides/04-speech-coding.pdf “Speech Compression” Mark Handley
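The decoder side of the buzzer-and-tube model can be sketched as an excitation signal fed through an all-pole filter built from the received coefficients. Assumptions: a pulse train stands for voiced sound, uniform noise for unvoiced sound, and the pitch period, gain and function names are illustrative.

```python
import random


def make_excitation(n_samples, voiced, pitch_period=80, seed=0):
    """Pulse train for voiced frames (the buzzer), noise for unvoiced (hiss)."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def lpc_synthesize(excitation, coeffs, gain=1.0):
    """All-pole filter (the tube): y[n] = gain * e[n] + sum(a[k] * y[n-k])."""
    out = []
    for n, e in enumerate(excitation):
        y = gain * e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                y += a * out[n - k]
        out.append(y)
    return out


# A single pulse through a one-coefficient filter rings and decays.
print(lpc_synthesize([1.0, 0.0, 0.0, 0.0], [0.5]))  # [1.0, 0.5, 0.25, 0.125]

# One 20 ms voiced frame at 8000 samples per second, 100 Hz pitch.
frame = lpc_synthesize(make_excitation(160, voiced=True), [0.5])
```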
Linear Predictive Encoding What’s the limitation? Our speech creation is not in fact so simple. For some sounds, nasal passages create a ‘side-branch’ to our tube.
Code Excited Linear Predictive (CELP) • Instead of sending the residue itself, coder and decoder agree on a ‘codebook’ of candidate excitation signals, and the coder sends only a reference to the code it is using. • Don’t need a codebook for every pitch. The same excitation can be repeated at a longer delay for lower pitches. • Speex (open source, patent-free)
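The "send a reference instead of the values" idea can be sketched as a nearest-vector search over a shared codebook. The codebook entries below are made up for illustration, and real CELP coders choose the entry by analysis-by-synthesis (running each candidate through the synthesis filter and comparing the result) rather than by the direct comparison shown here.

```python
def nearest_codebook_index(target, codebook):
    """Return the index of the codebook vector closest to target."""
    def squared_error(candidate):
        return sum((t - c) ** 2 for t, c in zip(target, candidate))
    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i]))


# Hypothetical codebook known to both coder and decoder.
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, -1.0, 0.0],
    [1.0, 1.0, -1.0, -1.0],
    [0.5, -0.5, 0.5, -0.5],
]

target = [0.9, 0.8, -1.1, -0.7]
index = nearest_codebook_index(target, CODEBOOK)
print(index)            # 2
print(CODEBOOK[index])  # what the decoder reconstructs from the index alone
```

With four entries the coder sends 2 bits instead of four sample values.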
Linear Predictive - Other Variants • Regular-Pulse Excitation Long-Term Predictor (GSM) • Low-Delay Code Excited Linear Prediction (G.728) • Conjugate-Structure Algebraic Code Excited Linear Prediction (G.729)
References • http://www.cs.columbia.edu/~hgs/audio/codecs.html • http://www.fact-index.com/p/pu/pulse_code_modulation_1.html • http://www.fact-index.com/n/ny/nyquist_shannon_sampling_theorem.html • http://en.wikipedia.org/wiki/Pulse-code_modulation • http://www1.cs.columbia.edu/~sedwards/classes/2004/4840/reports/manic.pdf • http://www-mobile.ecs.soton.ac.uk/speech_codecs/standards/adpcm.html • http://www.cs.columbia.edu/~hgs/teaching/ais/slides/04-speech-coding.pdf “Speech Compression” Mark Handley • http://www.myspace.com/growing_up_is_hard_2_do - Speak & Spell image • A good introduction to LPC: Dr. Sung-won Park, Texas A&M University-Kingsville • http://en.wikipedia.org/wiki/G.711 • ITU-T Recommendation G.711 • http://en.wikipedia.org/wiki/%CE%9C-law_algorithm • Sound files: www.Data-Compression.com • http://www.otolith.com/otolith/olt/lpc.html