Survey of Robust Speech Techniques in ICASSP 2009

Shih-Hsiang Lin (林士翔)

Introduction • Stereo-based stochastic mapping (SSM) is a data-driven front-end technique for noise robustness • It assumes a joint GMM in the stereo feature space • The mapping between clean and noisy features is estimated from the GMM and used to compensate the noisy features • SSM can be estimated under various criteria • Maximum A Posteriori (MAP): iteratively optimized • Minimum Mean Square Error (MMSE): closed-form solution • Moreover, the SSM-compensated features are further modeled by multi-style MPE training

Noise Robustness in feature space (1/2) • Compared to model-space techniques for robust speech recognition, feature-space noise-robust techniques have the advantages of • low computational complexity • being easy to decouple from the acoustic model • Front-end computation of an IBM LVCSR system with MFCC features • The computation evolves through various feature spaces • linear spectral space, Mel spectral space, cepstral space, discriminatively trained feature space

Noise Robustness in feature space (2/2) • Depending on the nature of the algorithm, feature-space noise-robust techniques apply compensation in different spaces • spectral subtraction -> linear spectral • phase-sensitive feature enhancement -> log Mel spectral • data-driven approach -> can be flexibly applied to different feature spaces (e.g., MFCC, LDA or fMPE)

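SSM and Discriminative Training (1/6) • SSM is based on stereo features, which are concatenations of clean speech feature vectors $x$ and noisy speech feature vectors $y$ • Define $z = (x, y)$ as the joint stereo feature vector. A GMM $p(z) = \sum_k c_k \, \mathcal{N}(z; \mu_{z,k}, \Sigma_{zz,k})$ is assumed and trained by the EM algorithm, where $\mu_{z,k} = (\mu_{x,k}, \mu_{y,k})$ and $\Sigma_{zz,k}$ is partitioned into the blocks $\Sigma_{xx,k}$, $\Sigma_{xy,k}$, $\Sigma_{yx,k}$ and $\Sigma_{yy,k}$ • $x$ and $y$ are obtained by fMPE training on the LDA features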

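SSM and Discriminative Training (2/6): MMSE-based SSM • Given the observed noisy speech feature $y$, the MMSE estimate of the clean speech $x$ is given by $\hat{x} = E[x \mid y] = \sum_k p(k \mid y) \, \mu_{x|y,k}$, where $\mu_{x|y,k} = \mu_{x,k} + \Sigma_{xy,k} \Sigma_{yy,k}^{-1} (y - \mu_{y,k})$ is the conditional mean of Gaussian $k$ and $p(k \mid y)$ is the posterior probability against $p(y)$, the marginal noisy speech distribution of the joint stereo distribution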
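
As a minimal sketch of the closed-form MMSE mapping described above, the following Python function handles the scalar case, where the clean feature x and the noisy feature y are single numbers and each Gaussian of the joint GMM is described by a weight, two means, the variance of y and the covariance between x and y. The dictionary keys and function names are hypothetical and chosen for illustration only; a real system would presumably use full feature vectors and covariance matrices.

```python
import math


def gaussian_pdf(value, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at `value`."""
    return math.exp(-0.5 * (value - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mmse_ssm(y, components):
    """MMSE estimate of a scalar clean feature x given a scalar noisy feature y.

    Each component is a dict with the (hypothetical) keys
    weight, mu_x, mu_y, var_y and cov_xy.
    """
    # Posterior p(k | y), computed against the marginal noisy-speech GMM.
    likelihoods = [c["weight"] * gaussian_pdf(y, c["mu_y"], c["var_y"]) for c in components]
    total = sum(likelihoods)
    posteriors = [value / total for value in likelihoods]

    # Per-Gaussian conditional mean E[x | y, k] of the joint Gaussian.
    cond_means = [c["mu_x"] + c["cov_xy"] / c["var_y"] * (y - c["mu_y"]) for c in components]

    # Posterior-weighted sum of the per-Gaussian linear mappings.
    return sum(p * m for p, m in zip(posteriors, cond_means))
```

With a single Gaussian the posterior is one and the estimate reduces to ordinary linear regression of x on y; with several Gaussians the result is a posterior-weighted piece-wise linear mapping.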

SSM and Discriminative Training (3/6): MAP-based SSM • Given the observed noisy speech feature $y$, the MAP estimate of the clean speech is given by $\hat{x} = \arg\max_x p(x \mid y)$ • This can be solved using the EM algorithm, which results in an iterative estimation process
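
The iterative MAP estimate can be sketched in the same scalar setting. The update below is the standard EM fixed-point iteration for finding a mode of the conditional mixture p(x | y): the E-step computes the posterior of each Gaussian given the current estimate of x and the observed y, and the M-step takes a precision-weighted average of the per-Gaussian conditional means. This is an assumption about the form of the update, not a transcription of the slide's equations, and the dictionary keys and function names are hypothetical.

```python
import math


def gaussian_pdf(value, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at `value`."""
    return math.exp(-0.5 * (value - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def map_ssm(y, components, x_init, iterations=3):
    """Iterative MAP estimate of scalar clean feature x given noisy feature y.

    Each component is a dict with the (hypothetical) keys
    weight, mu_x, mu_y, var_x, var_y and cov_xy.
    """
    x = x_init
    for _ in range(iterations):
        stats = []
        for c in components:
            # Conditional mean and variance of x given y for this Gaussian.
            cond_mean = c["mu_x"] + c["cov_xy"] / c["var_y"] * (y - c["mu_y"])
            cond_var = c["var_x"] - c["cov_xy"] ** 2 / c["var_y"]
            # Unnormalised posterior p(k | x, y) = w_k p(y | k) p(x | y, k).
            joint = (c["weight"]
                     * gaussian_pdf(y, c["mu_y"], c["var_y"])
                     * gaussian_pdf(x, cond_mean, cond_var))
            stats.append((joint, cond_mean, cond_var))
        total = sum(joint for joint, _, _ in stats)
        # Precision-weighted average of the conditional means.
        numer = sum(joint / total * mean / var for joint, mean, var in stats)
        denom = sum(joint / total / var for joint, _, var in stats)
        x = numer / denom
    return x
```

The starting point `x_init` can be the noisy feature itself or an MMSE estimate. If every Gaussian had the same conditional variance, the variances would cancel and each iteration would reduce to a posterior-weighted sum of conditional means, which is the structure of the MMSE estimate.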

SSM and Discriminative Training (4/6): Mathematical Connections • The MMSE estimate of SSM is a special tied case of one iteration of the corresponding MAP estimate • It assumes that all Gaussians in the GMM share the same conditional covariance matrix • This is a reasonable result of the “averaging” effect of the expectation function in the MMSE estimate of SSM • Due to the iterative nature of the MAP estimate of SSM, an initial guess has to be made • A natural choice would be the noisy speech feature itself • Another is to use the MMSE estimate as the starting point

SSM and Discriminative Training (5/6): Mathematical Connections • SPLICE is a special case of the MMSE estimate of SSM under the assumption that $\Sigma_{xy,k} \Sigma_{yy,k}^{-1}$ (the clean-noisy cross-covariance times the inverse noisy covariance of Gaussian $k$) is an identity matrix, which is equivalent to the clean and noisy features having a perfect correlation • SPLICE estimates the bias terms under the ML criterion • Deng Li also gives a connection between SPLICE and fMPE • fMPE estimates the corresponding terms under the minimum phone error criterion • Both SPLICE and fMPE share a similar piece-wise linear structure weighted by posterior probabilities

SSM and Discriminative Training (6/6): Mathematical Connections • Therefore, the overall MAP-based SSM estimation in the fMPE space, with the MMSE-based SSM estimate as the starting point, amounts to applying a sequence of posterior-probability-weighted piece-wise linear mappings to the noisy LDA features • After the stochastic mapping, the compensated features can be directly decoded by the clean acoustic models • For better performance, an environment-adaptive multi-style discriminative re-training (e.g., MPE) can be further applied

Experimental Results (1/3) • LVCSR task (a vocabulary of 32k English words) • Back-End • 150 hrs / 55k Gaussians / 4.5k states (clean acoustic model) • 300 hrs / 90k Gaussians / 5k states (multi-style acoustic model) • Noisy data are generated by adding a mix of humvee, tank and babble noise to the clean data at around 15 dB • Front-End • 24-dim MFCCs (CMS) -> super-vector (9 frames: 216 dims) -> LDA to 40 dims • GMMs are trained on the noisy training data and the mapping is SNR-specific • In test, a GMM-based environment classifier is used to estimate the SNR of each sentence • The proposed technique is evaluated on two test sets • Set A: 2070 utterances (around 1.7 hrs) recorded in clean condition • Set B: 1421 utterances (around 1.2 hrs) recorded in a real-world noisy condition (with humvee noise running in the background, 5-8 dB)
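
The front-end chain above (24-dim MFCCs, a 9-frame super-vector of 216 dims, then a projection to 40 dims) can be sketched as follows. The sketch assumes that the 9 frames are centred on the current frame (4 on each side) and that edge frames are padded by repetition; neither detail is stated on the slide. The projection matrix would come from LDA training, which is not shown, and the function names are hypothetical.

```python
def splice_frames(frames, context=4):
    """Concatenate each frame with `context` neighbours on both sides.

    Edge frames are padded by repeating the first or last frame
    (an assumption made for this sketch).
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-context, context + 1):
            index = min(max(t + offset, 0), last)
            window.extend(frames[index])
        spliced.append(window)
    return spliced


def project(vector, matrix):
    """Apply a linear projection; each row of `matrix` is one output dimension."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]
```

For 24-dim input frames and `context=4`, each spliced vector has 9 x 24 = 216 entries, and a 40 x 216 matrix passed to `project` yields the 40-dim feature.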

Experimental Results (2/3) • All the MAP estimations are run for 3 iterations • SSM gives the same results for Set A after environment detection • As the acoustic model is discriminatively trained on clean speech, the baseline result on the noisy Set B data is very poor • But SSM is able to significantly improve the results • Compared to SSM MAP, SSM MMSE MAP reduces the WER by 50% relative

Experimental Results (3/3) • The baseline with multi-style training in Table 2 improves in the noisy condition (Set B) but degrades in the clean condition (Set A) • When compensated features are used for multi-style training, the performance improves for both Set A and Set B • It significantly reduces the WER in the noisy condition (Set B) while maintaining a decent performance in the clean condition (Set A)

Summary and Discussion • SSM is a data-driven feature-space noise-robust technique that exploits stereo data. Hence, it has its advantages and disadvantages • Since it is data-driven and does not rely on a model for feature computation, it is quite flexible to apply to various speech features • e.g., MFCC, PLP, linear or Mel-spectral space, cepstral space, LDA and fMPE spaces, etc. • However, stereo data is usually expensive to collect • A suboptimal alternative, as done in this paper, is to artificially generate data for the noisy channel • As a data-driven approach, SSM relies on the noise in the training data and may not handle unseen noise very well

Introduction (1/2) • Recently, several techniques have been proposed that aim to exploit properties of the speech signal • such as the spectral peaks being more robust to broad-band noise than the spectral valleys, or harmonicity information • Previous work has • locked the spectral peak-to-valley ratio, which alleviates the mismatch between clean and noisy features caused by the spectral valleys being buried by noise • appended the information on spectral peaks into the acoustic • modified the likelihood calculation with the aim of emphasizing the parts of the spectrum corresponding to peaks • In this paper, the authors investigate incorporating mask modeling into an HMM-based ASR system

Introduction (2/2) • As the mask expresses which spectro-temporal regions are uncorrupted by noise, it can also be seen as a generalized and soft incorporation of the spectral peak information • The mask model is associated with each HMM state and mixture • It expresses what mask information the state/mixture would expect to find in the signal • The mask modeling is performed by employing the Bernoulli distribution • The incorporation of the mask modeling is evaluated in a standard model and in two models that compensate for the effect of the noise: a missing-feature model and a multi-condition-trained model

Incorporating Mask Modeling into HMM-based ASR System (1/6) • The HMM-based ASR system with the incorporation of mask modeling is formulated as a combination of four terms: the emission probability, the mask-model probability, the HMM state transition probability and the language model probability • The emission probability term corresponds to the employment of missing-feature techniques • The mask-model probability term expresses how likely the given mask M is to be generated by the HMM state sequence S • It serves as a penalization factor for states whose mask model is not in agreement with the mask extracted from the given signal

Incorporating Mask Modeling into HMM-based ASR System (2/6) • How can we estimate the mask model? • When an example of the noise is available • The mask model can be estimated from masks obtained from the training data corrupted by the given noise • When no information about the noise is available • It can be estimated by using a mask reflecting some a-priori knowledge about speech • e.g., the fact that high-energy regions of speech spectra are less likely to be corrupted by noise • The mask model is estimated by a separate training procedure that is performed after the HMMs have been trained

Incorporating Mask Modeling into HMM-based ASR System (3/6) • Estimating the mask model for HMM states • Let $m = (m_1, \dots, m_B)$ denote the mask vector at a given frame, where $m_b \in \{0, 1\}$ is the binary mask information of channel $b$ • The mask-model probability for each HMM state and mixture is modeled by a multivariate Bernoulli distribution $P(m) = \prod_b \mu_b^{m_b} (1 - \mu_b)^{1 - m_b}$, where $\mu_b$ is the parameter of the distribution for channel $b$ (one set of parameters per state and mixture) • The parameters can be estimated by a Baum-Welch or Viterbi-style training procedure
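
A minimal sketch of evaluating such a mask model is shown below. It assumes the channels are treated as independent Bernoulli variables, so the log-probability of a binary mask vector is a sum over channels; the function name and the flooring constant are hypothetical choices for this illustration.

```python
import math


def mask_log_prob(mask, mu, floor=1e-6):
    """Log-probability of a binary mask vector under independent Bernoulli channels.

    mask: sequence of 0/1 values, one per frequency channel.
    mu:   sequence of Bernoulli parameters for one state/mixture, i.e. the
          modelled probability that each channel's mask value is one.
    """
    log_p = 0.0
    for m_b, mu_b in zip(mask, mu):
        # Keep the parameter away from 0 and 1 so that log() stays finite.
        p = min(max(mu_b, floor), 1.0 - floor)
        log_p += math.log(p) if m_b else math.log(1.0 - p)
    return log_p
```

A mask that agrees with the state's expectations (ones where `mu` is high, zeros where it is low) scores higher than one that disagrees, which is the penalizing behaviour the mask model is meant to provide.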

Incorporating Mask Modeling into HMM-based ASR System (4/6) • The Viterbi algorithm is used to obtain the state-time alignment of the sequence of feature vectors on the HMMs • The posterior probability that a given mixture component (at a given state) has generated the feature vector is then calculated • The parameters of the mask models are then estimated from the masks of the aligned frames, weighted by these posteriors
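
The Viterbi-style estimation can be sketched for a single HMM state as a posterior-weighted average of the binary masks of the frames aligned to that state, which is the maximum-likelihood estimate for Bernoulli parameters under soft mixture assignments. The exact formula on the slide is not preserved, so this form is an assumption, as are the function name and the data layout.

```python
def estimate_mask_parameters(masks, posteriors):
    """Estimate Bernoulli mask parameters for every mixture of one HMM state.

    masks:      masks[t][b] is the 0/1 mask of channel b at the t-th frame
                aligned to this state.
    posteriors: posteriors[t][j] is the posterior of mixture j at that frame.
    Returns mu, where mu[j][b] is the parameter of mixture j for channel b.
    Assumes every mixture has non-zero total occupancy.
    """
    num_mixtures = len(posteriors[0])
    num_channels = len(masks[0])
    mu = []
    for j in range(num_mixtures):
        # Total soft count of frames assigned to mixture j.
        occupancy = sum(post[j] for post in posteriors)
        mu.append([
            sum(post[j] * mask[b] for post, mask in zip(posteriors, masks)) / occupancy
            for b in range(num_channels)
        ])
    return mu
```

In practice the estimates would likely be floored away from 0 and 1 before use, so that a single disagreeing channel cannot drive the mask probability to zero.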

Incorporating Mask Modeling into HMM-based ASR System (5/6) • Regions with a high value of the mask model parameter indicate that the masks associated with the given state were often one in those regions, i.e., little affected by noise

Incorporating Mask Modeling into HMM-based ASR System (6/6) • When incorporated into the overall probability calculation, the value of the mask probability may need to be scaled (akin to language model scaling) • This is done by employing a sigmoid function • The larger the value of the sigmoid's scaling parameter, the greater the effect of the mask probability on the overall probability

Experimental Results (1/5) • The experiments were carried out on the Aurora-2 database • Frequency-filtered logarithmic filter-bank energies were used as the speech feature representation • due to their suitability for missing-feature-based recognition • The noisy speech data from Set A were used for the recognition experiments

Experimental Results (2/5)

Introduction • The idea of the feature mapping method is to obtain “enhanced” or “clean” features from the “noisy” features • In theory, the mapping need not be performed between equivalent domains • In this paper • First, the authors investigate feature mapping between different domains, taking into account the MMSE criterion and regression optimization • Second, they investigate data-driven filtering for speech separation using the neural-network-based mapping method

Mapping Approach (1/3) • Assume that the directions of both the target and the interfering sound sources are available through the use of a microphone array • The mapping approach takes the two corresponding features and maps them to “clean” recordings • To allow non-linear mapping, the authors use a generic multilayer perceptron (MLP) with one hidden layer, which estimates the feature vector of the clean speech

Mapping Approach (2/3) • The parameters are obtained by minimizing the mean squared error between the MLP output and the clean target features • The optimal parameters can be found through the error back-propagation algorithm • Note that training requires parallel recordings of clean and noisy data, while only the noisy features are required to estimate the clean data during testing
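
The training procedure can be illustrated with a small pure-Python MLP with one hidden layer, trained by stochastic gradient descent on the squared error. This is a sketch under stated assumptions: the tanh hidden units, linear output layer, learning rate, initialization range and all function names are choices made for this example and are not taken from the paper.

```python
import math
import random


def init_mlp(n_in, n_hidden, n_out, seed=0):
    """Create small random weights for a one-hidden-layer MLP."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"W1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": matrix(n_out, n_hidden), "b2": [0.0] * n_out}


def forward(params, x):
    """Return (hidden activations, output) for one input vector."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(params["W1"], params["b1"])]
    output = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(params["W2"], params["b2"])]
    return hidden, output


def mse(params, inputs, targets):
    """Mean (over examples) of the squared error between outputs and targets."""
    total = 0.0
    for x, target in zip(inputs, targets):
        _, output = forward(params, x)
        total += sum((o - t) ** 2 for o, t in zip(output, target))
    return total / len(inputs)


def sgd_step(params, x, target, lr=0.05):
    """One back-propagation update on 0.5 * squared error for a single example."""
    hidden, output = forward(params, x)
    # Error signal at the linear output layer.
    delta_out = [o - t for o, t in zip(output, target)]
    # Back-propagate through tanh, whose derivative is 1 - tanh(a)^2.
    delta_hid = [
        (1.0 - h * h) * sum(params["W2"][k][j] * d for k, d in enumerate(delta_out))
        for j, h in enumerate(hidden)
    ]
    for k, d in enumerate(delta_out):
        for j, h in enumerate(hidden):
            params["W2"][k][j] -= lr * d * h
        params["b2"][k] -= lr * d
    for j, d in enumerate(delta_hid):
        for i, v in enumerate(x):
            params["W1"][j][i] -= lr * d * v
        params["b1"][j] -= lr * d
```

During training each `x` would hold the features from the two enhanced signals and `target` the parallel clean features; at test time only `forward` is needed, so no clean data is required.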

Mapping Approach (3/3) • Under the assumption that the target data is Gaussian distributed, minimizing the mean square error follows from the principle of maximum likelihood • From the perspective of Blind Source Separation (BSS) and Independent Component Analysis (ICA) • The principle of maximum likelihood is closely related to the minimization of mutual information between the clean sources • Those methods, however, lead to a linear transformation, and the probability densities of the sources must be estimated correctly

Experimental Data and Setup (1/2) • The Multichannel Overlapping Numbers Corpus (MONC) was used to perform speech recognition experiments • There are four recording scenarios • S1 (no overlapping speech), S12 (with 1 competing speaker L2), S13 (with 1 competing speaker L3), S123 (with 2 competing speakers L2 and L3) • Training: 6049 utterances, development: 2026 utterances, testing: 2061 utterances • The MLP is trained on 2,000 utterances drawn from the development set (500 utterances from each recording scenario)

Experimental Data and Setup (2/2) • In this paper, two delay-and-sum (DS) beamformer-enhanced speech signals are used • The ASR front-end generates 12 MFCCs and log-energy with the corresponding delta and acceleration coefficients

Feature Mapping Between Different Domains (1/3) • Three domains are selected as the input • spectral amplitude, log Mel-filterbank energies (log MFBE), and Mel-frequency cepstral coefficients (MFCC) • As mentioned earlier, target data with a Gaussian distribution is optimal from the point of view of MMSE • The PDFs of the amplitudes of the clean speech are far from Gaussian • The PDFs of the log MFBEs are bi-modal (the lower mode may be due to low-SNR segments) • The PDFs of the MFCCs are approximately Gaussian

Feature Mapping Between Different Domains (2/3)

Feature Mapping Between Different Domains (3/3) • In fact, mapping to MFCCs is more straightforward in the context of an ASR system in which MFCCs are used as the features • Furthermore, MMSE in the MFCCs also results in MMSE in the delta coefficients (and likewise in the acceleration coefficients)
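
One way to see why an estimate of the static MFCCs carries over to the delta coefficients is that deltas are a linear function of the static coefficients, so the delta of an estimation error equals the difference between the deltas of the estimate and of the clean track. The sketch below uses the common regression formula for deltas with a window of 2 frames and edge padding by repetition; the window size, the padding and the function name are assumptions, since the slides do not give the exact configuration.

```python
def delta(track, window=2):
    """Regression-based delta coefficients of one coefficient track over time."""
    norm = 2.0 * sum(n * n for n in range(1, window + 1))
    last = len(track) - 1
    deltas = []
    for t in range(len(track)):
        acc = 0.0
        for n in range(1, window + 1):
            # Repeat the first/last value beyond the edges (an assumption).
            ahead = track[min(t + n, last)]
            behind = track[max(t - n, 0)]
            acc += n * (ahead - behind)
        deltas.append(acc / norm)
    return deltas
```

Applying `delta` a second time to its own output gives acceleration coefficients, which are therefore linear in the static coefficients as well.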

Experimental Results (1/2) • Mapping the log MFBEs of the two DS-enhanced speech signals to MFCCs yields the best ASR performance • The smaller dynamic range of the log MFBE vectors is advantageous for regression optimization • The gains from model adaptation are marginal • The mapping methods evaluated are already very effective at suppressing the influence of interfering speakers on the extracted features • [Results table: without model adaptation vs. with model adaptation]
