Front-end Audio Processing: Reflections on Issues, Requirements, and Solutions Tomas Gaensler mh acoustics www.mhacoustics.com Summit NJ/Burlington VT USA
Front-end Audio Processing: processing to enhance perceived and/or measured sound quality in communication and recording devices [Figure labels: Then, Now]
Not So Famous Quotes (Acoustic Jewelry/Bluetooth Headset) • Gary Elko (mh/Bell Labs colleague) • At IWAENC 1995: “Acoustic Echo cancellation will not be needed in the future when people wear acoustic jewelry” • Arno Penzias (1978 Nobel Prize laureate) • “No one would want acoustic jewelry because people would think the users talking to themselves are crazy” • I’m glad the success of Bluetooth headsets shows that both were completely wrong!
Classical Front-end Architectures - POTS • Large coupling loss in handset mode • Switch loss in speakerphone-supporting telephones • Carbon microphone with expansion effect that reduces noise [Figure label: Switch Loss]
Cellphones and Handsfree • Common problems: • Far-end listener does not hear near-end talker • Near-end listener does not understand far-end talker • Why? • Form factor (size) • Limited understanding of physics and acoustics(?)
RX/TX Levels, Coupling and Doubletalk • Far-end: 95-100 dBSPL at loudspeaker, 85-90 dBSPL at mic • Near-end talker: 55-70 dBSPL at mic • Echo louder than near-end • Linear AEC: ERLE 20-30 dB • After cancellation, Residual Echo to Near-end Ratio (RENR): 90 - 20 - 70 = 0 dB • >20 dB of residual echo suppression required • Duplexness suffers
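The RENR arithmetic on this slide can be checked in a few lines. This is a minimal sketch: the function and argument names are illustrative assumptions, and the inputs are the slide's example values (echo at 90 dBSPL at the mic, 20 dB of ERLE, near-end speech at 70 dBSPL at the mic).

```python
def residual_echo_to_near_end_ratio(echo_at_mic_db, erle_db, near_end_at_mic_db):
    """RENR in dB: echo level at the mic, reduced by the linear AEC's ERLE,
    relative to the near-end speech level at the mic."""
    return echo_at_mic_db - erle_db - near_end_at_mic_db


if __name__ == "__main__":
    # Slide example: loud near-end talker (70 dBSPL at the mic).
    print("RENR, loud talker :", residual_echo_to_near_end_ratio(90.0, 20.0, 70.0), "dB")
    # Quiet end of the slide's near-end range (55 dBSPL at the mic).
    print("RENR, quiet talker:", residual_echo_to_near_end_ratio(90.0, 20.0, 55.0), "dB")
```

With a quiet near-end talker the residual echo ends up above the near-end speech, which is consistent with the slide's point that more than 20 dB of additional residual echo suppression is required.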
TX: Dynamic Range and Noise • Echo level: 90 dBSPL; peak echo 105-110 dB • Near-end speech level: 70 dBSPL • No saturation of echo in TX path • Actual speech to room noise ratio is only about 27 dB at best • Gain is required to get loud enough output • Perceived noise level is ~20 dB above normal room noise level
TX: Fixed-point Processing and Quantization Noise • Q-noise increases by 6 log2(N) dB! • N = 64: Q-noise increases by 36 dB • Double-precision “required”
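The 6 log2(N) rule is easy to tabulate. The sketch below only evaluates the slide's formula; the function name is an illustrative assumption, and N is whatever quantity the slide denotes by N.

```python
import math


def q_noise_increase_db(n, db_per_doubling=6.0):
    """Quantization-noise growth per the slide's rule: 6*log2(N) dB."""
    return db_per_doubling * math.log2(n)


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(f"N = {n:4d}: Q-noise increases by {q_noise_increase_db(n):.0f} dB")
```

At roughly 6 dB per bit, the 36 dB for N = 64 corresponds to about 6 bits of lost precision, which is why the slide calls double precision “required” (block scaling or floating point being the alternatives listed later in the deck).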
RX: Dynamic Range and Distortion • Small loudspeakers have a rather high cut-off frequency (high-pass) • EQ is often required to get acceptable “sound” (frequency response) • However, EQ means: • Loss of signal loudness and dynamic range • Increased (analog) distortion • Many manufacturers compensate for the loss of signal level with excessive digital gain and therefore get (digital) saturation [Figure labels: Analog gain, Digital gain, To AEC]
What Can or Should be Done? • Minimize acoustical coupling by good physical design • TX • Use noise suppression but not excessively • Double-precision, block scaling, or floating-point • RX • Compression instead of fixed gain • 10% or less loudspeaker/driver THD is desired
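To illustrate “compression instead of fixed gain” in the RX path, here is a rough sketch comparing a fixed digital gain (which saturates at full scale) with a simple frame-wise compressor. Everything here is an assumption for illustration: the function names, the threshold, the ratio, and the frame length. A practical compressor would also smooth the gain with attack and release time constants.

```python
import numpy as np


def fixed_gain(x, gain_db):
    """Apply a fixed gain and saturate at digital full scale (+/-1.0)."""
    y = np.asarray(x, dtype=float) * 10 ** (gain_db / 20)
    return np.clip(y, -1.0, 1.0)


def compress(x, gain_db, threshold_db=-12.0, ratio=4.0, frame=160):
    """Apply the same gain, then reduce it per frame once the frame peak
    exceeds the threshold (illustrative static compressor, no smoothing)."""
    y = np.asarray(x, dtype=float) * 10 ** (gain_db / 20)
    out = np.empty_like(y)
    for start in range(0, len(y), frame):
        seg = y[start:start + frame]
        peak_db = 20 * np.log10(max(np.max(np.abs(seg)), 1e-12))
        over_db = max(peak_db - threshold_db, 0.0)
        reduction_db = over_db * (1.0 - 1.0 / ratio)
        out[start:start + frame] = seg * 10 ** (-reduction_db / 20)
    return out


if __name__ == "__main__":
    t = np.arange(1600) / 8000.0
    loud = 0.5 * np.sin(2 * np.pi * 1000 * t)
    clipped = fixed_gain(loud, 12.0)
    print("samples at full scale with fixed gain:", int(np.sum(np.abs(clipped) >= 1.0)))
    print("peak after compression:", float(np.max(np.abs(compress(loud, 12.0)))))
```

Quiet signals still receive the full gain, while loud signals are held below full scale, so the signal passed on as the AEC reference is not clipped.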
What about Non-linear AEC Algorithms? • Interesting problem proposed and worked on for many years • Not practical in most AEC applications since: • Complicated model • Gain and therefore saturation possibly in both TX and RX paths • Added complexity and system cost • Often slow convergence • Difficult to fine-tune in field • Even when non-linear cancellation works perfectly, the user still perceives a distorted loudspeaker signal!
Classical Front-end Architectures – Cellphone 2005 - 2010 Why RX NS? Why TX NS?
Single Channel Noise Suppression Basic single channel noise suppressor An extremely successful signal processing invention by Manfred Schroeder in the 1960s Musical tones – is it a (solved) problem? How do we evaluate and improve quality? How about convergence rate?
Background to Single Channel Noise Suppressors • Block processing • Frequency domain model • Linear time-varying filter • Wiener filter [Figure: speech + noise into NS, “enhanced” speech out]
Background to Single Channel Noise Suppressors • Estimation of spectra is often done recursively, with the noise spectrum updated when speech is “not” present • Frequency smoothing
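The ingredients named on these two slides (block processing, a frequency-domain model, a time-varying Wiener-type gain, and recursive spectrum estimation) can be put together in a short sketch. It assumes the textbook forms Y = S + N, gain H = 1 - P_n/P_y, and P(k,m) = beta*P(k,m-1) + (1-beta)*|Y(k,m)|^2; these are standard choices, not necessarily the exact equations of the talk. The function name, the crude energy-based speech detector, the assumption that the first few frames are noise only, and all parameter values are illustrative. Frequency smoothing is omitted.

```python
import numpy as np


def wiener_suppress(y, frame=256, hop=128, beta=0.9, floor_db=-12.0,
                    init_frames=8, vad_ratio=2.0):
    """Single-channel noise suppressor sketch (overlap-add, Wiener-type gain)."""
    y = np.asarray(y, dtype=float)
    # sqrt-Hann analysis/synthesis window: overlap-add is exact at 50% overlap.
    win = np.sqrt(np.hanning(frame + 1)[:frame])
    g_min = 10 ** (floor_db / 20)          # limits suppression to |floor_db| dB
    out = np.zeros(len(y))
    p_y = p_n = None
    for m in range(1 + (len(y) - frame) // hop):
        sl = slice(m * hop, m * hop + frame)
        spec = np.fft.rfft(y[sl] * win)
        inst = np.abs(spec) ** 2
        # Recursive estimate of the noisy-signal spectrum.
        p_y = inst if p_y is None else beta * p_y + (1 - beta) * inst
        if p_n is None:
            p_n = inst.copy()              # assumption: recording starts with noise only
        elif m < init_frames or inst.sum() < vad_ratio * p_n.sum():
            # Update the noise spectrum only when speech is judged absent.
            p_n = beta * p_n + (1 - beta) * inst
        gain = np.maximum(1.0 - p_n / np.maximum(p_y, 1e-12), g_min)
        out[sl] += np.fft.irfft(gain * spec, n=frame) * win
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = 0.01 * rng.standard_normal(16000)
    noisy[8000:] += np.sin(2 * np.pi * 1000 * np.arange(8000) / 8000)
    enhanced = wiener_suppress(noisy)
    print("noise-only segment power change (dB):",
          10 * np.log10(np.sum(enhanced[256:7000] ** 2) / np.sum(noisy[256:7000] ** 2)))
```

The gain floor is what turns this into a “12 dB” suppressor: in noise-only segments the gain sits at the floor rather than fluctuating toward zero, which is one simple way to keep the variance of the suppression gains under control.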
Musical Tones – Is it a (Solved) Problem? • Examples: • Original (“Sally Sievers’ reel, June-Sept. 1964” by Manfred Schroeder and Mohan Sondhi at Bell Labs) • Original + noise (iSNR ~ 6 dB) • Schroeder – 1960s • “Generic spectral subtraction” – Boll 1979 • IS-127 – 1995 • “A problem of last century”, only a constraint in design • Controlling variance of suppression gains • Any NS algorithm should be constrained not to have musical tones • Must only have a small impact on voice quality
Quality Metrics • Most importantly: Listen! • SNR: total, segmental, during speech • Distortion metrics: ISD (Itakura-Saito distance) • ITU-T P.862: PESQ/MOS-LQO
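The three SNR variants listed above can be sketched as follows, assuming a time-aligned clean reference is available. The function names, the frame length, the clamping range of the per-frame SNR, and the energy threshold used to decide which frames count as “during speech” are all illustrative assumptions.

```python
import numpy as np


def total_snr_db(clean, processed):
    """SNR over the whole signal, treating (processed - clean) as noise."""
    clean = np.asarray(clean, dtype=float)
    err = np.asarray(processed, dtype=float) - clean
    return float(10 * np.log10(np.sum(clean ** 2) / np.sum(err ** 2)))


def segmental_snr_db(clean, processed, frame=160, lo=-10.0, hi=35.0,
                     active_only=False, active_db=-40.0):
    """Mean of per-frame SNRs, each clamped to [lo, hi] dB.
    With active_only=True, only frames whose clean energy is within
    |active_db| dB of the strongest frame are averaged ("during speech")."""
    clean = np.asarray(clean, dtype=float)
    processed = np.asarray(processed, dtype=float)
    n = (len(clean) // frame) * frame
    sig = np.sum(clean[:n].reshape(-1, frame) ** 2, axis=1)
    err = np.sum((processed[:n] - clean[:n]).reshape(-1, frame) ** 2, axis=1)
    snr = np.clip(10 * np.log10((sig + 1e-12) / (err + 1e-12)), lo, hi)
    if active_only:
        snr = snr[sig > sig.max() * 10 ** (active_db / 10)]
    return float(np.mean(snr))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    clean = np.zeros(3200)
    clean[:1600] = np.sin(2 * np.pi * 1000 * np.arange(1600) / 8000)
    noisy = clean + 0.05 * rng.standard_normal(3200)
    print("total        :", total_snr_db(clean, noisy))
    print("segmental    :", segmental_snr_db(clean, noisy))
    print("during speech:", segmental_snr_db(clean, noisy, active_only=True))
```

The three numbers differ noticeably on signals with pauses, so it matters which one is quoted. None of them replaces listening.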
Quality Metric – P.862 (PESQ/MOS-LQO) MOS-LQO (MOS Listening Quality Objective) Alg-1/2 – Wiener methods with 12 dB noise suppression • What can the best noise suppressor achieve?
Quality Metric – “My Rule of Thumb” • Ideal MOS (PESQ) performance bound is given by shifting the unprocessed PESQ-curve to the left • Example for 12 dB suppression • 12 dB shift to the left 12 dB
Convergence Rate Important performance criterion: Non-stationary noise conditions Frame loss Main objective: Maximize convergence rate while maintaining speech quality
Convergence Rate – A Useful Test Input sequence IS-127 Wiener Based A spectral subtraction m-script retrieved from the internet
Convergence Rate and MOS-LQO “Normal” “Fast” MOS-LQO
Current Applications and Drivers of NS Technology Where is NS going in industry now? Beyond “12 dB” of suppression Multi-microphone solutions Two- or more channel suppressors Linear beamforming Applications Mobile phones (a few two-microphone models have reached the market) Bluetooth headsets: great "new" application for signal processing (Ericsson BT headset 2000)
Background to Linear Beamforming • N: number of microphones • Broadside linear beamforming (e.g. delay-sum): • Directional gain: 10log(N) • White Noise Gain (WNG) > 0 • Practical size: “large” (~30cm) • Endfire differential beamforming: • Directional gain: 20log(N) • WNG < 0 • Practical size: “small” (1.5-5cm) • Differential beamformers are more suitable for small form-factors
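The two directional-gain rules can be compared directly. This sketch assumes “log” on the slide means log10 and that the result is in dB; the function names are illustrative.

```python
import math


def broadside_gain_db(n_mics):
    """Directional gain of a broadside (e.g. delay-sum) array: 10*log10(N)."""
    return 10 * math.log10(n_mics)


def endfire_differential_gain_db(n_mics):
    """Directional gain of an endfire differential array: 20*log10(N)."""
    return 20 * math.log10(n_mics)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(f"N = {n}: broadside {broadside_gain_db(n):.1f} dB, "
              f"endfire differential {endfire_differential_gain_db(n):.1f} dB")
```

For the same number of microphones the differential design has twice the gain in dB, at the cost of negative white noise gain, that is, amplified sensor noise.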
Background to Linear Beamforming What do we gain? Less reverberation (increased intelligibility) Less (environmental) noise No (or low) distortion on axis Possible interference rejection by spatial zero(s) Some Issues: Performance is given by critical distance! Increase in sensor noise (WNG, differential beamforming)
Beamforming: Critical Distance • Critical distance (reverberation radius): the distance at which the reverberant-to-direct path energy ratio is 0 dB • DI = Directivity Index: gain of direct to reverberant energy over an omni-directional microphone • Order of finite differences used (1st: 2 mics, 2nd: 3 mics, etc.)
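To show how the directivity index moves the critical distance, the sketch below uses a common textbook approximation based on Sabine's formula, r_c ≈ 0.057 * sqrt(Q * V / T60), with directivity factor Q = 10^(DI/10), room volume V in m^3 and reverberation time T60 in seconds. This approximation is an assumption brought in for illustration rather than a formula taken from the slide, and the function name and room values are hypothetical.

```python
import math


def critical_distance_m(volume_m3, t60_s, di_db=0.0):
    """Approximate critical distance in metres for a microphone with
    directivity index di_db (0 dB = omni-directional)."""
    q = 10 ** (di_db / 10)                 # directivity factor
    return 0.057 * math.sqrt(q * volume_m3 / t60_s)


if __name__ == "__main__":
    # Hypothetical room: 100 m^3 with T60 = 0.5 s.
    for name, di in (("omni", 0.0), ("DI = 4.8 dB", 4.8), ("DI = 6.0 dB", 6.0)):
        print(f"{name:12s}: {critical_distance_m(100.0, 0.5, di):.2f} m")
```

Under this model the critical distance grows by a factor of 10^(DI/20), so a DI of about 6 dB roughly doubles the distance at which direct and reverberant energy are equal.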
Classical First-Order Beamformer Responses Cardioid Hypercardioid Dipole
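The three responses named above can be generated from the usual first-order parameterization E(theta) = alpha + (1 - alpha) * cos(theta), with alpha = 0.5 for the cardioid, 0.25 for the hypercardioid and 0 for the dipole. The sketch below is illustrative: the function names are assumptions, and the directivity index uses the closed-form spherical average of E^2 for an axisymmetric pattern.

```python
import numpy as np

PATTERNS = {"cardioid": 0.5, "hypercardioid": 0.25, "dipole": 0.0}


def first_order_response(theta, alpha):
    """Normalized first-order directional response; theta in radians, 0 = on axis."""
    return alpha + (1.0 - alpha) * np.cos(theta)


def directivity_index_db(alpha):
    """DI = 10*log10(1 / spherical mean of E^2).
    Over the sphere, mean(cos) = 0 and mean(cos^2) = 1/3."""
    mean_power = alpha ** 2 + (1.0 - alpha) ** 2 / 3.0
    return float(10 * np.log10(1.0 / mean_power))


def null_angle_deg(alpha):
    """Angle of the spatial zero, where alpha + (1 - alpha)*cos(theta) = 0."""
    return float(np.degrees(np.arccos(-alpha / (1.0 - alpha))))


if __name__ == "__main__":
    for name, alpha in PATTERNS.items():
        print(f"{name:14s} DI = {directivity_index_db(alpha):.2f} dB, "
              f"null at {null_angle_deg(alpha):.1f} deg")
```

The hypercardioid's DI of about 6 dB matches the 20log(N) rule for N = 2 microphones, while the cardioid and dipole trade some directivity for a null at a different angle.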