Combination of Feature and Channel Compensation (1/2)

Combination of Feature and Channel Compensation (1/2) • It’s often the case that, in addition to environmental acoustic noise, there also exists linear channel distortion which may be caused by transducer mismatch

Combination of Feature and Channel Compensation (2/2)

Cepstral-Domain Acoustic Feature Compensation Based on Decomposition of Speech and Noise for ASR in Noisy EnvironmentsHong Kook Kim, Senior Member, IEEE, and Richard C. Rose, Senior Member, IEEE Graduate Institute of Computer Science and Information Engineering, National Taiwan Normal University, presented by Chen-Wei Liu 2005-03-31

References • [1] Tracking Speech-presence uncertainty to improve speech enhancement in non-stationary noise environments • D.Malah, R. V. Cox, and A. J. Accardi • ICASSP, 1999 • [2] Speech Enhancement Using a MMSE-LSA Estimator • Y.Ephraim and D. Malah • IEEE Trans, ASSP, 1985

Introduction (1/5) • All techniques for HMM and feature compensation in ASR are complicated by the fact that • The interaction between speech and acoustic background noise in cepstrum domain is highly nonlinear • Whereas environmental noise and speech are considered to be additive in the linear spectrum domain • Their interaction is considered to be more difficult to characterize in the log spectral amplitude and the cepstrrum domains • Consequently, the goal of decomposing a noise corrupted speech signal into clean speech and pure noise components has always been difficult to achieve

Introduction (2/5) • This paper presents an approach for cepstrum-domain feature compensation in ASR which exploits noisy speech decomposition • The approach relies on a minimum mean squared error log spectral amplitude estimator (MMSE-LSA) • The MMSE-LSA estimator of a clean speech magnitude spectrum is represented as • A noisy speech magnitude spectrum multiplied by a frequency dependent spectral gain function that is derived from the estimate of noise spectrum, SNR, and speech absence probability. • As a result, the estimated log spectrum of clean speech becomes a sum of the log spectra of noisy speech and the gain function

Introduction (3/5) • By converting these log spectra into cepstra, it turns out to be that estimated noise and clean speech can be considered to be additive in the cepstrum domain • The proposed approach performs frame level decomposition of noisy speech cepstrum and compensates for additive noise in the cepstrum domain • Furthermore, the cepstrum decomposition technique can be extended into a low complexity robust algorithm for implementing parallel model combination (PMC)

Introduction (4/5) • The PMC algorithm combines separately estimated speech and noise HMMs to obtain • a single HMM model that describes the noise corrupted cepstrum observation vectors • Owing to the highly nonlinear interaction of speech and noise in the cepstrum domain • Traditional approaches to PMC require a great deal of computational complexity to convert to a linear spectral representation • Where additive models of speech and noise can be assumed for combining speech and background model distributions

Introduction (5/5) • By applying the proposed approach, corrupted HMM models can be obtained by adding the means and variances of clean HMMs and those of estimated noise HMMs • While the cepstrum decomposition technique has the application to model compensation • This paper will consider the technique from only a feature compensation point of view

Review of Speech Enhancement Algorithm based on MMSE-LSA (1/6) • A nonlinear frequency depend gain function is applied to the spectral components of the noisy speech signal in an attempt to obtain estimates ofspectral components of corresponding clean speech • A modified MMSE-LSA estimation criterion with a soft-decision based modification for taking into account the speech presence is used to derive the gain function

Review of Speech Enhancement Algorithm based on MMSE-LSA (2/6)

Review of Speech Enhancement Algorithm based on MMSE-LSA (3/6) • The objective of the MMSE-LSA is to find the estimator that minimizes the distortion measure for a given noisy observation spectrum • The modified MMSE-LSA gives an estimate of clean speech spectrum that has the form Gain function Gain modification function

Review of Speech Enhancement Algorithm based on MMSE-LSA (4/6) Posteriori SNR Prior SNR

Review of Speech Enhancement Algorithm based on MMSE-LSA (5/6) Ratio between speech presence and absence

Review of Speech Enhancement Algorithm based on MMSE-LSA (6/6) • There are no unique optimum parameter settings because these parameters are also depend on • The characteristics of input noise and the efficiency of the noise psd estimation • Use of this speech enhancement algorithm as a preprocessor to feature extraction will be referred to in the next section as speech enhancement based front-end (SE)

Cepstrum-Domain Feature Compensation (1/3) • Cepstrum Subtraction Method • The speech enhancement algorithm works by multiplying the frequency dependent gain function and noisy magnitude spectrum • By applying the inverse Fourier transform to the above

Cepstrum-Domain Feature Compensation (2/3)

Cepstrum-Domain Feature Compensation (3/3) • Assuming that the enhanced speech signal is an estimate of the clean speech signal, the cepstrum for clean speech is approximated as • The above equation implies that the noisy speech cepstrum can be decomposed into a linear combination of the estimated clean speech cepstrum and noise cepstrum, and this is so called Cepstrum Subtraction Method (CSM)

Speech Recognition Experiments • Baseline front-end • Frame rate 100 Hz • 512 point FFT over the windowed speech segment • 24 filterbank log-magnitude • 13 MFCCS with 1st and 2nd difference MFCCs • Each word was modeled by 16-state 3-mixture HMM • Database • Aurora 2.0 and subset of Aurora 3.0 • Aurora 2.0 • Clean-condition training : 8440 digit strings • Multi-condition training : 8440 digit strings divided into 20 groups

Clean Condition finding

Clean Condition Baseline • CSM reduced AVG. WER. Except Clean • CSM increased WER under subway in 10db in SetA and SetC CSM

Clean Condition • CSM+CMS

Multi Condition

Multi Condition Baseline CSM

Multi Condition • CSM+CMS

Mismatched Transducer Condition • A total of 274 context-dependent sub-word models • Sub-word models contained a head-body-tail structure • Head and tail models were represented with 3 states • Body was represented with 4 states • Each state has 8 Gaussians • The recognition system • 274 HMM models, 831 states, 6672 mixtures • Recorded over PSTN

Mismatched Transducer Condition • Dominant sources of variabilities was transducer variability • Training data : vast array of transducers • But testing set was not (significant mismatch) • Table 5 only simulated channel mismatch

Real Adverse Environment CaseAurora 3.0 - Finnish

Conclusion • Advantages of CSM • The ability to make a soft-decision about whether a given frequency bin within an input frame corresponds to speech or noise • Providing the estimates of that are updated for each analysis frame • CSM gave a better performance than SE in any acoustic noise and different transducer conditions • The best performance was achieved by combined CSM and CMS

Experimental Results

Combination of Feature and Channel Compensation (1/2)