
Presentation Transcript


1. Topic
• Why Speech Recognizers Make Errors? A Robustness View (ICSLP 2004)
• Weighting Observation Vectors for Robust Speech Recognition in Noisy Environments (ICSLP 2004)

2. Why Speech Recognizers Make Errors? A Robustness View
Hong Kook Kim and Mazin Rahim
Gwangju Institute of Science and Technology (GIST), Korea
AT&T Labs-Research, USA
ICSLP 2004
Reporter: Shih-Hsiang

3. Introduction
• There are various kinds of robustness problems, attributed to background noise, coarticulation effects, channel distortion, and accents and dialects
• Several novel algorithms have been proposed to minimize the acoustic mismatch between the training model and the testing environment
  • Enhancement, normalization, or adaptation
  • Feature domain / model domain
• This paper tries to create a diagnostic tool that can provide better insight into why recognizers make errors

4. Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio
• Investigate the effect of environmental noise on ASR performance
• First step: detect and measure the background noise, via one of
  • Voice activity detection (VAD), according to the context of the speech
  • Energy detection, using a histogram and a threshold
  • Forced alignment, taking the speech/silence state segmentation from forced alignment against the recognized transcription
  • Or training a binary classifier or a Gaussian mixture model to separate speech from silence

5. Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio (cont.)
[Block diagram: the speech waveform feeds three parallel SNR measurements — voice activity detection → SNR_V; energy clustering with a speech/noise decision → SNR_E; and HMM forced alignment with speech/silence state decoding (using the transcription, dictionary generation, and acoustic model) → SNR_F.]

6. Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio (cont.)
• Second step: measure the SNR
  • I(n): identifier for the n-th analysis frame; I(n) = 1 for a speech interval, I(n) = 0 for a silent interval
  • E(n): log energy of the n-th frame
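The slide's SNR equation was not reproduced in the transcript; below is a minimal sketch of the presumable computation, assuming the stationary SNR is the mean log energy (in dB) of the speech frames minus the mean log energy of the silent frames, using the I(n) and E(n) defined above. The function and argument names are illustrative.

```python
import numpy as np

def stationary_snr(log_energy_db, speech_flags):
    """Stationary SNR sketch: mean log energy of speech frames minus
    mean log energy of silent frames, per the slide's definitions.
    Assumes both speech and silence frames are present."""
    E = np.asarray(log_energy_db, dtype=float)  # E(n), per-frame log energy (dB)
    I = np.asarray(speech_flags, dtype=int)     # I(n), 1 = speech, 0 = silence
    return E[I == 1].mean() - E[I == 0].mean()
```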

7. Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio (cont.)
• Example utterance: "I need to inquire about a bill that was sent." (laughing from 6 to 8 seconds)
  • VAD: 30.67 dB
  • Energy clustering: 36.29 dB
  • Forced alignment: 23.52 dB
• The forced-alignment approach proves to be more robust for speech/non-speech detection

8. Time-Varying Quantity of Noise: Nonstationary SNR
• A stationary SNR measurement does not reflect the local characteristics of environmental noise
• The nonstationary SNR (NSNR) instead uses the standard deviation of the noise power, normalized by the average signal power (see the sketch below)
• Smaller variations in the noise characteristics among different frames therefore result in a lower NSNR measurement
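A companion sketch for the NSNR, assuming it is the standard deviation of the silence-frame power normalized by the average speech-frame power, per the slide's description; the exact normalization is an assumption.

```python
import numpy as np

def nonstationary_snr(frame_power, speech_flags):
    """NSNR sketch: std. dev. of noise (silence-frame) power divided by
    the average signal (speech-frame) power. Low values indicate noise
    that varies little from frame to frame (near-stationary noise)."""
    P = np.asarray(frame_power, dtype=float)
    I = np.asarray(speech_flags, dtype=int)
    return P[I == 0].std() / P[I == 1].mean()
```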

9. Effect of Stationary and Nonstationary SNRs on ASR Performance
• Corpus
  • Telephone speech collected over 20 different datasets
  • 5,171 utterances (54,658 words) for testing
  • Trigram language model

  10. Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)

  11. Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)

  12. Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)

13. Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)
• Word accuracy is estimated with a linear regression model on the SNR measurements (see the sketch below)
• The estimation error is measured against the actual recognition accuracy
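The regression and error formulas were not reproduced on the slide; the sketch below shows one way such an estimate could be fit, assuming a linear model of word accuracy in the two SNR measures. The numeric arrays are invented placeholders, not the paper's data.

```python
import numpy as np

# Hypothetical per-dataset measurements (placeholders, not the paper's data):
snr  = np.array([30.1, 25.4, 18.7, 35.2, 22.9])  # stationary SNR (dB)
nsnr = np.array([0.12, 0.25, 0.41, 0.08, 0.30])  # nonstationary SNR
acc  = np.array([91.3, 85.6, 74.2, 93.8, 82.1])  # measured word accuracy (%)

# Fit acc ~ a*SNR + b*NSNR + c by least squares
X = np.column_stack([snr, nsnr, np.ones_like(snr)])
coef, *_ = np.linalg.lstsq(X, acc, rcond=None)

est = X @ coef                      # estimated word accuracy per dataset
err = np.abs(est - acc).mean()      # mean absolute estimation error
print("a, b, c =", coef, "| mean abs error =", err)
```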

  14. Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)

15. Weighting Observation Vectors for Robust Speech Recognition in Noisy Environments
Zhenyu Xiong, Thomas Fang Zheng, and Wenhu Wu
Center for Speech Technology, Tsinghua University, China
ICSLP 2004
Reporter: Shih-Hsiang

  16. Front-end Module

17. Quantile-Based Speech/Non-Speech Detection
• Based on order statistics (OS) filters to obtain an estimate of the local SNR of the speech signal
• Two OS filters are applied to the log energy of the signal
  • A median filter tracks the background noise level (B)
  • A 0.9-quantile filter (Q(0.9)) tracks the signal level
• The difference is called the quantile-based estimate of the instantaneous SNR (QSNR) of the signal

18. Quantile-Based Speech/Non-Speech Detection (cont.)
• Let E_{t-L}, ..., E_{t+L} be the log energy values of the 2L+1 frames around the frame t being analyzed
• Let E_(r), where r = 1, ..., 2L+1, be the corresponding values sorted in ascending order
• Then E_(L+1) is the output of the median filter
• The other filter outputs the order statistic nearest the 0.9 quantile, i.e. E_(r) with r ≈ 0.9(2L+1)
• The speech/non-speech decision is made by comparing the estimated SNR with a threshold (see the sketch below)
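A minimal sketch of the detector described on this slide, assuming the 0.9-quantile filter outputs the order statistic of rank ⌈0.9(2L+1)⌉ and that a frame is labeled speech when its QSNR exceeds the threshold; the window length and threshold values are illustrative, not the paper's settings.

```python
import numpy as np

def qsnr_detect(log_energy, L=10, threshold=6.0):
    """Quantile-based speech/non-speech decision per frame.

    For each frame t, sort the log energies of the 2L+1 surrounding
    frames; the median E_(L+1) tracks the noise level B, the 0.9
    quantile tracks the signal level, and QSNR = Q(0.9) - B is
    compared with a threshold."""
    E = np.asarray(log_energy, dtype=float)
    decisions = np.zeros(len(E), dtype=bool)
    for t in range(len(E)):
        lo, hi = max(0, t - L), min(len(E), t + L + 1)
        window = np.sort(E[lo:hi])
        B = window[len(window) // 2]                     # median filter output
        Q = window[int(np.ceil(0.9 * len(window))) - 1]  # 0.9-quantile output
        decisions[t] = (Q - B) > threshold
    return decisions
```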

  19. Quantile based speech/non-speech Detection (cont.)

20. Noise Estimation
• Let S(ω, t) be the power spectrum at frequency ω in the t-th frame of the input signal
• Let N(ω, t) be the power spectrum of the estimated noise at frequency ω in the t-th frame
• The estimate is updated recursively with forgetting factor λ = 0.05: adapted toward S(ω, t) for non-speech frames, held fixed for speech frames
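The update equations were not reproduced in the transcript; a plausible reconstruction is a first-order recursive average with the stated forgetting factor, frozen during speech:

```python
import numpy as np

LAMBDA = 0.05  # forgetting factor from the slide

def update_noise(noise_prev, power_spec, is_speech):
    """Recursive noise-spectrum update, reconstructed from the slide:
    smooth toward the current power spectrum during non-speech, hold
    the previous estimate during speech. The exact recursion form is
    an assumption consistent with the stated forgetting factor."""
    if is_speech:
        return noise_prev
    return (1.0 - LAMBDA) * noise_prev + LAMBDA * np.asarray(power_spec)
```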

21. Spectral Subtraction
• A traditional non-linear spectral subtraction algorithm in the power spectrum domain is used for noise reduction
  • α = 1.1: the over-subtraction factor
  • β = 0.1: the spectral floor
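A sketch of the subtraction rule with the slide's constants, assuming the usual form: subtract α times the noise estimate from the power spectrum and floor the result at β times the noise estimate.

```python
import numpy as np

ALPHA = 1.1  # over-subtraction factor (from the slide)
BETA = 0.1   # spectral floor (from the slide)

def spectral_subtract(power_spec, noise_spec):
    """Power-domain non-linear spectral subtraction: remove an
    over-estimate of the noise, clamping at a fraction of the noise
    spectrum to avoid negative (or musical-noise-prone) values."""
    cleaned = np.asarray(power_spec) - ALPHA * np.asarray(noise_spec)
    return np.maximum(cleaned, BETA * np.asarray(noise_spec))
```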

22. Frame SNR Estimation
• Based on the results of noise estimation and spectral subtraction
• Indicates the degree to which the current speech frame is uncorrupted by noise

23. Weighting Algorithm
• In conventional HMM-based speech recognition, every observation frame contributes equally to the likelihood
• The idea is to emphasize the observations of slightly corrupted speech frames (see the sketch below)
  • r_j is an observation weighting vector for the emphasis
  • δ is a factor used to adjust the degree of emphasis
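The weighting equations were not reproduced in the transcript; the sketch below illustrates the general idea of scaling per-frame acoustic log-likelihoods by an SNR-dependent weight. The mapping from frame SNR to the weight r_t is an invented placeholder; only the role of δ as an emphasis factor comes from the slide.

```python
import numpy as np

def weighted_log_likelihood(frame_log_likelihoods, frame_snrs, delta=0.1):
    """Weight each frame's acoustic log-likelihood so that slightly
    corrupted (high-SNR) frames are emphasized. The weight mapping
    r_t = 1 + delta * snr_norm is a hypothetical illustration."""
    ll = np.asarray(frame_log_likelihoods, dtype=float)
    snr = np.asarray(frame_snrs, dtype=float)
    # Normalize frame SNRs to [0, 1], then build per-frame weights
    rng = snr.max() - snr.min()
    snr_norm = (snr - snr.min()) / rng if rng > 0 else np.zeros_like(snr)
    r = 1.0 + delta * snr_norm
    return float(np.sum(r * ll))
```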

  24. Weighting factor

  25. Experimental Result • Clean Speech • isolate word database by 20 speaker • Each speakers speak 100 Chinese names for 4 times • Dataset contain 7,893 isolate word utterances • Four different kinds of noises • Babble noise, Factory noise, Pink noise, White noise • Recognition System • di-IFs corpus • 3 states and a mixture of 8 Gaussian pdfs per states • Acoustic model employs 42-dimension features

  26. Experimental Result (cont.)

  27. Experimental Result (cont.)
