
Quantile Based Histogram Equalization for Noise Robust Speech Recognition




  1. Quantile Based Histogram Equalization for Noise Robust Speech Recognition by Diplom-Physiker Florian Erich Hilger from Bonn - Bad Godesberg Reviewer: Univ.-Prof. Dr.-Ing. Hermann Ney Presenter : Chen Hung_Bin December 2004

  2. Outline • Histogram Normalization • Quantile Based Histogram Equalization • Experiments • Conclusion

  3. Histogram Normalization • Histogram normalization is a general non-parametric method to make the cumulative distribution function (CDF) of some given data match a reference distribution. • The goal is to reduce a possible mismatch between the distribution of the incoming test data and the distribution of the training data, which is used as the reference.

  4. Histogram Normalization • The mismatch between the test and the training data distributions is caused by the different acoustic conditions. • The two CDFs can be used directly to define a transformation.

  5. Histogram Normalization • Example of the cumulative distribution functions of a clean and a noisy signal. The arrows show how an incoming noisy value is transformed based on these two cumulative distribution functions.
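As a rough illustration of this CDF-based mapping, the following Python sketch maps a noisy value to the value with the same cumulative probability under the training distribution. This is only a minimal illustration, not the thesis' implementation; the interpolation-based lookup and the function name are assumptions.

```python
import numpy as np

def histogram_normalize(noisy_value, test_samples, train_samples):
    """Map a noisy value onto the training distribution by matching CDFs:
    x_out = F_train^{-1}(F_test(x_in))."""
    test_sorted = np.sort(test_samples)
    train_sorted = np.sort(train_samples)
    # empirical cumulative probabilities of the sorted samples
    test_probs = np.arange(1, len(test_sorted) + 1) / len(test_sorted)
    train_probs = np.arange(1, len(train_sorted) + 1) / len(train_sorted)
    # F_test(x_in): cumulative probability of the incoming value
    p = np.interp(noisy_value, test_sorted, test_probs)
    # F_train^{-1}(p): training value with the same cumulative probability
    return np.interp(p, train_probs, train_sorted)
```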

  6. Histogram Normalization • Two pass method: • Two separate histograms, one for silence and the other for speech, can be estimated on the training data. • A first recognition pass can then be used to determine the amount of silence in the recognition utterances. • Based on that percentage, the appropriate target histogram can be determined. • This requires a sufficiently large amount of data from the same recording environment or noise condition to get reliable estimates for the high-resolution histograms.

  7. Histogram Normalization • Two pass method: • It cannot be used when a real-time response of the recognizer is required, as in command and control applications or spoken dialog systems. • A straightforward solution to this problem would be to reduce the number of histogram bins in order to get reliable estimates even with little data; this leads to quantile based equalization.

  8. Quantile Based Histogram Equalization • Quantiles are very easy to determine by just sorting the sample data set. • Cumulative distributions can be approximated using quantiles. • Example: two cumulative distribution functions with four 25% quantiles, NQ = 4.
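Quantile estimation by sorting can be sketched in a few lines; the function name and the index convention used here are illustrative choices, not the author's code.

```python
import numpy as np

def estimate_quantiles(samples, n_q=4):
    """Estimate the n_q quantiles of a data set by just sorting it.
    With n_q = 4 this returns the 25%, 50%, 75% and 100% quantiles;
    it works without modification even for very small data sets."""
    s = np.sort(np.asarray(samples, dtype=float))
    n = len(s)
    # take the sample sitting at the i/n_q position of the sorted array
    return np.array([s[min(n - 1, int(np.ceil(i / n_q * n)) - 1)]
                     for i in range(1, n_q + 1)])
```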

  9. Quantile Based Histogram Equalization • With NQ = 4, as shown in the example, about one second of data (100 time frames) is already sufficient to get a rough estimate of the cumulative distribution. • Another advantage of the quantiles: • Even if the data set to be considered consists of only a few samples, or in the extreme case just one sample, the quantiles can be calculated without any special modification of the algorithm.

  10. Quantile Based Histogram Equalization • The corresponding reference quantiles of the training data define a set of points that can be used to determine the parameters of a transformation function, which transforms the incoming data and thus reduces the mismatch between the test and training data quantiles. Applying a transformation function to make the four training and recognition quantiles match.

  11. Quantile Based Histogram Equalization • Within the context of this work the transformation is applied to the output of the Mel-scaled filter bank, after a 10th root has been applied to reduce the dynamic range; in the following, the filter-bank output vector and its individual components are considered. • The incoming filter output values are first scaled down to the interval [0, 1]. • After the power function transformation is applied, the values are scaled back to the original range.

  12. Quantile Based Histogram Equalization • A pure power function scales small values down even further towards zero, so small amplitude differences will be enhanced considerably if a logarithm is applied afterwards; this contradicts the desired compression of the signal to a smaller range. • The transformation function that will always be used within this work therefore combines the power function with a linear term.

  13. Quantile Based Histogram Equalization • Both transformation parameters are jointly optimized to minimize the squared distance between the current quantiles and the training quantiles. • The minimum is determined with a simple grid search over a suitable parameter range; the step size of the grid search can be set to a value on the order of 0.01 (see the sketch below).
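A sketch of how such a transformation and grid search could look. The exact functional form (interpolation between a power function and a linear term on [0, 1], scaled by the largest quantile), the parameter ranges, and the exclusion of the largest quantile from the error sum are assumptions based on the description above, not the author's exact formulation.

```python
import itertools
import numpy as np

def qe_transform(y, q_max, alpha, gamma):
    """Assumed transformation: scale to [0, 1] using the largest quantile,
    interpolate between a power function and a linear term, scale back."""
    x = np.clip(np.asarray(y, dtype=float) / q_max, 0.0, 1.0)
    return q_max * (alpha * x ** gamma + (1.0 - alpha) * x)

def fit_qe_parameters(test_q, train_q, step=0.01):
    """Joint grid search over (alpha, gamma) minimizing the squared distance
    between the transformed test quantiles and the training quantiles.
    The parameter ranges below are assumptions."""
    q_max = test_q[-1]                      # largest (100%) quantile
    alphas = np.arange(0.0, 1.0 + step, step)
    gammas = np.arange(1.0, 3.0 + step, step)
    best, best_err = (0.0, 1.0), np.inf
    for alpha, gamma in itertools.product(alphas, gammas):
        # the largest quantile maps to itself, so it is left out of the error
        mapped = qe_transform(test_q[:-1], q_max, alpha, gamma)
        err = float(np.sum((mapped - np.asarray(train_q[:-1])) ** 2))
        if err < best_err:
            best, best_err = (alpha, gamma), err
    return best
```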

  14. Quantile Based Histogram Equalization • Example: output of the 6th Mel-scaled filter over time for a sentence from the Aurora 4 test set, and the cumulative distributions of the signals.

  15. Quantile Based Histogram Equalization • Combine neighboring filter channels: • A linear combination of a filter with its left and right neighbor can be used to further reduce the remaining difference. • The combination is applied to the filter output values and the recognition quantiles after the preceding power function transformation. • Combination factors are defined for the left and for the right neighbors. • With these factors, the transformation step can be written as a weighted sum of each filter channel and its two neighbors (see the sketch below).
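A sketch of the neighbor combination, under the assumption that each channel keeps the weight left over after subtracting the two neighbor factors; the exact formulation and the edge handling are not shown on the slide and are choices made here.

```python
import numpy as np

def combine_neighbors(y, lam_left, lam_right):
    """Linear combination of each filter channel with its left and right
    neighbor; y, lam_left and lam_right are vectors over the filter channels.
    Edge channels simply reuse their own value as the missing neighbor."""
    y = np.asarray(y, dtype=float)
    left = np.roll(y, 1)    # y[k-1]
    right = np.roll(y, -1)  # y[k+1]
    left[0], right[-1] = y[0], y[-1]
    return (1.0 - lam_left - lam_right) * y + lam_left * left + lam_right * right
```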

  16. Quantile Based Histogram Equalization • Comparison of the RWTH baseline feature extraction front-end

  17. Experiment • Car Navigation • isolated German words recorded in cars • vocabulary consists of 2100 equally probable words • the training data was recorded in a quiet office environment • Aurora 3 – SpeechDat Car • continuous digit strings recorded in cars • four languages are available: Danish, Finnish, German, and Spanish • Aurora 4 – noisy WSJ 5k • utterances read from the Wall Street Journal with various artificially added noises • vocabulary consists of 5000 words

  18. Comparison of Logarithm and Root Functions • Comparison of the logarithm with different root functions on the isolated-word Car Navigation database. LOG: logarithm, CMN: cepstral mean normalization, 2nd - 20th: root instead of logarithm, FMN: filter mean normalization.

  19. Comparison of Logarithm and Root Functions • Comparison of the logarithm and the 10th root on the Aurora 3 database. WM: well matched, MM: medium mismatch, HM: high mismatch, FMN: filter mean normalization.

  20. Comparison of Logarithm and Root Functions • Comparison on the Aurora 4 noisy WSJ 16 kHz database. LOG: logarithm, CMN: cepstral mean normalization, 2nd - 20th: root instead of logarithm, FMN: filter mean normalization.

  21. Experiment - Quantile Equalization • Recognition results on the Car Navigation database with quantile equalization LOG: logarithm, CMN: cepstral mean normalization, 10th: root instead of logarithm, FMN: filter mean normalization, QE: quantile equalization, QEF(2): quantile equalization with filter combination (2 neighbors).

  22. Experiment - Quantile Equalization • Comparison of quantile equalization with histogram normalization on the Car Navigation database. QE train: applied during training and recognition. HN: speaker session wise histogram normalization, HN sil: histogram normalization dependent on the amount of silence, ROT: feature space rotation.

  23. Comparison of QE and HN • Cumulative distribution function of the 6th filter output. HN: after histogram normalization, QE: after quantile equalization. clean: data from test set 1, noisy: test set 12.

  24. Experiment - Quantile Equalization • Recognition results on the Car Navigation database for different numbers of quantiles. 10th: root instead of logarithm, FMN: filter mean (and variance) normalization, QE: quantile equalization with NQ quantiles, QEF: quantile equalization with filter combination.

  25. Experiment - Quantile Equalization • Comparison of the logarithm in the feature extraction with different root functions on the Car Navigation database. 2nd - 20th: root instead of logarithm, FMN: filter mean normalization, QE: quantile equalization, QEF: quantile equalization with filter combination.

  26. Conclusion • Replacing the logarithm in the feature extraction by a root function significantly increased the recognition performance on noisy data. • Using four quantiles (NQ = 4) can be recommended as the standard setup; it can be used on short windows as well as on complete utterances.

  27. Spectral Entropy Feature in Full-Combination Multi-Stream for Robust ASR Hemant Misra, Hervé Bourlard IDIAP Research Institute, Martigny, Switzerland Presenter : Chen Hung_Bin INTERSPEECH 2005

  28. Introduction • Spectral entropy features are computed from sub-bands of the spectrum in order to locate the spectral peaks. • The spectral entropy features are used along with PLP features in a multi-stream framework. • A separate multi-layered perceptron (MLP) is trained for each feature stream. • This gives a 9.2% relative error reduction compared to the baseline.

  29. Spectral entropy feature • Entropy measures can be used to capture the “peakiness” of a distribution: • a sharp peak will have low entropy • a flat distribution will have high entropy • The spectrum is converted into a probability mass function (PMF) like distribution by normalizing it.
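The computation can be summarized in a few lines. This is a sketch; the log base and the small epsilon for numerical safety are choices made here, not taken from the paper.

```python
import numpy as np

def spectral_entropy(spectrum, eps=1e-12):
    """Normalize the spectrum into a PMF-like distribution and compute its
    entropy: a sharp peak gives low entropy, a flat spectrum high entropy."""
    x = np.asarray(spectrum, dtype=float)
    p = x / (np.sum(x) + eps)              # PMF-like normalization
    return float(-np.sum(p * np.log2(p + eps)))
```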

  30. Spectral entropy feature • observe that entropy computed on full-band spectrum can be used as an estimate for speech/silence detection Entropy computed from the full-band spectrum. (a) Clean speech wave form, (b) Entropy contour for clean speech, (c) Speech corrupted with factory noise at 6 dB SNR, and (d) Entropy contour for speech corrupted with factory noise at 6 dB SNR.

  31. Multi-band/multi-resolution spectral entropy feature • The full-band spectral entropy feature can capture only the gross peakiness of the spectrum. • The best results were obtained by dividing the normalized full-band spectrum into 24 overlapping sub-bands defined by the Mel scale and computing the entropy of each sub-band.
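A sketch of the multi-band version. How the 24 overlapping Mel-scale bands are constructed is not shown here, and `bands` is simply assumed to be a list of index arrays, one per sub-band.

```python
import numpy as np

def multiband_spectral_entropy(spectrum, bands, eps=1e-12):
    """Normalize the full-band spectrum into a PMF-like distribution and
    compute one entropy value per (possibly overlapping) sub-band."""
    x = np.asarray(spectrum, dtype=float)
    p = x / (np.sum(x) + eps)
    return np.array([float(-np.sum(p[idx] * np.log2(p[idx] + eps)))
                     for idx in bands])
```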

  32. Entropy based full-combination multi-stream (FCMS) • Full-combination multi-stream : • All possible combinations of the two features are treated as separate streams. • An MLP expert is trained for each stream. The posteriors at the output of experts are weighted and combined. • The combined posteriors thus obtained are passed to an HMM decoder.

  33. Entropy based full-combination multi-stream • The combined output posterior probability for a class and a frame is a weighted sum of the posteriors produced by the individual expert MLPs (see the sketch below).
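The combination step can be sketched as follows, assuming the inverse entropy weighting mentioned in the results: per-frame weights proportional to one over each expert's posterior entropy, normalized to sum to one. This is an illustration, not the paper's exact equation.

```python
import numpy as np

def combine_streams(expert_posteriors, eps=1e-12):
    """expert_posteriors: (n_experts, n_classes) posterior distributions for
    one frame, one row per MLP expert.  Confident (low-entropy) experts get
    larger weights; the weighted posteriors are summed into one distribution."""
    post = np.asarray(expert_posteriors, dtype=float)
    entropy = -np.sum(post * np.log2(post + eps), axis=1)
    weights = 1.0 / (entropy + eps)
    weights /= np.sum(weights)
    return np.sum(weights[:, None] * post, axis=0)
```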

  34. Spectral entropy feature in Tandem framework • Exploiting the advantages of both HMM/ANN and HMM/GMM systems. Multi-stream Tandem: outputs from the different experts are weighted and combined. The combined output undergoes a KL transform before being fed as features into the HMM/GMM system.

  35. Spectral entropy feature in Tandem framework • In the Tandem framework we only have access to the ‘outputs before softmax’, therefore the entropy based weighting cannot be used directly. • To overcome this problem, the ‘outputs before softmax’ are converted into posteriors by applying the ‘softmax’ nonlinearity at this position (exponentials normalized to sum to 1), as sketched below.
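Converting the ‘outputs before softmax’ into posteriors is just the softmax nonlinearity, sketched here with the usual max-subtraction for numerical stability (the stabilization is a standard implementation detail, not something stated on the slide).

```python
import numpy as np

def softmax(pre_softmax_outputs):
    """Exponentials normalized to sum to one."""
    z = np.asarray(pre_softmax_outputs, dtype=float)
    e = np.exp(z - np.max(z))               # subtract the max for stability
    return e / np.sum(e)
```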

  36. Experimental • The Numbers95 database of US English connected-digit telephone speech is used. • There are 30 words in the database, represented by 27 phonemes. • Noise from the Noisex92 database was added at different signal-to-noise ratios (SNRs). • There were 3,330 utterances for training and 2,250 utterances for testing the system.

  37. Results • Hybrid system under different noise conditions: WERs for PLP features, 24 Mel-band spectral entropy features and their time derivatives (24-Mel), the two features appended (PLP + 24-Mel), and PLP and spectral entropy features in FCMS with inverse entropy weighting.

  38. Results • Tandem system under different noise conditions: WERs for PLP features, 24 Mel-band spectral entropy features and their time derivatives (24-Mel), the two features appended (PLP + 24-Mel), and PLP and spectral entropy features in FCMS with inverse entropy weighting.

  39. Conclusion • We demonstrated that better performance can be achieved by FCMS as compared to appending the multi-resolution entropy feature vector to the PLP feature vector.

  40. References • [4] Hemant Misra, Shajith Ikbal, Hervé Bourlard, and Hynek Hermansky, “Spectral entropy based feature for robust ASR,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004. • [5] Hemant Misra, Shajith Ikbal, Sunil Sivadas, and Hervé Bourlard, “Multi-resolution spectral entropy feature for robust ASR,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, U.S.A., Mar. 2005. • [7] Hynek Hermansky, Daniel P. W. Ellis, and Sangita Sharma, “TANDEM connectionist feature extraction for conventional HMM systems,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 2000. • [11] Astrid Hagen and Andrew Morris, “Recent advances in the multi-stream HMM/ANN hybrid approach to noise robust ASR,” Computer Speech and Language, vol. 19, pp. 3–30, 2005.
