
Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization



  1. Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000

  2. Outline • Introduction and Motivation • Speech Analysis with Frequency Warping • Speaker Normalization with Convex Optimization • Experimental Results • Conclusions

  3. Problem Definition • Devise effective and robust features for speech recognition that are insensitive to mismatches in individual speaker acoustics and environment • How can we process the signal such that the acoustic mismatch is minimized?

  4. Robust Signal Modeling • Feature Extraction • Derives a compact, yet effective representation • Feature Normalization • Compensates for the acoustic mismatch between the training and testing conditions

  5. Part I: Feature Extraction for Speech Recognition

  6. Cepstral Analysis of Speech • Most popular choice for speech recognition • Cepstrum is defined as the inverse Fourier transform of the log spectrum • Truncated to length L (smooths the log spectrum)
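The truncated real cepstrum described above can be sketched as follows (a minimal illustration, not the exact front end used in the experiments; the frame length and FFT size are assumptions):

```python
import numpy as np

def real_cepstrum(frame, L=13, n_fft=512):
    """Real cepstrum: inverse FFT of the log magnitude spectrum,
    truncated to L coefficients. Keeping only the low-quefrency
    terms smooths the log spectrum, retaining the envelope."""
    spectrum = np.fft.rfft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # guard against log(0)
    cep = np.fft.irfft(log_mag, n_fft)
    return cep[:L]
```

Truncation to L coefficients acts as a low-pass "lifter": fine harmonic ripple lives at high quefrency and is discarded.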

  7. FFT-Based Feature Extraction • Perceptually motivated FFT filterbank is used to emulate the auditory system • Analysis is directly affected by fine harmonics • Examples • Mel Frequency Cepstral Analysis • Perceptual Linear Prediction (PLP)

  8. LP-Based Feature Extraction • Linear prediction provides a smooth spectrum mostly containing vocal-tract information • Frequency warping is not straightforward • Examples • Frequency-Warped Linear Prediction • Time-domain Warped Linear Prediction

  9. Part I: Non-uniform Linear Predictive Analysis of Speech

  10. Basic Ideas of the NLP Analysis • Frequency warping of the vocal-tract spectrum using non-uniform DFT (NDFT) • Bark-frequency scale is used for warping • Pre- and post-warp linear prediction smoothing

  11. Bark Bilinear Transform • For an appropriately chosen ρ, the mapping closely resembles a Bark mapping
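In code, the bilinear (first-order all-pass) warping of the frequency axis might look like this; the phase formula is the standard all-pass one, and the value of ρ is an assumption (roughly appropriate for 10 kHz sampling), not a value taken from the slides:

```python
import numpy as np

def bilinear_warp(omega, rho=0.55):
    """Frequency warping induced by the first-order all-pass
    (bilinear) transform:
        omega_w = omega + 2*arctan(rho*sin(omega) / (1 - rho*cos(omega)))
    For a suitable rho the mapping approximates the Bark scale."""
    return omega + 2.0 * np.arctan(
        rho * np.sin(omega) / (1.0 - rho * np.cos(omega)))
```

The warp fixes 0 and π and is monotonic for |ρ| < 1, so it is a valid invertible frequency mapping.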

  12. Figure: Bark-Frequency Warping

  13. Pre-Warp Linear Prediction • Vocal-tract transfer function H(z) can be represented by an all-pole model

  14. NDFT Frequency Warping • NDFT of the vocal-tract impulse response • ωk : Frequency grid of Bark bilinear transform
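A direct NDFT evaluated on an arbitrary (e.g. Bark-warped) frequency grid can be sketched as:

```python
import numpy as np

def ndft(h, omegas):
    """Non-uniform DFT: evaluate
        H(e^{j*omega_k}) = sum_n h[n] * e^{-j*omega_k*n}
    at an arbitrary frequency grid omegas (here, a Bark-warped grid),
    rather than the uniform grid 2*pi*k/N of the ordinary DFT."""
    n = np.arange(len(h))
    return np.exp(-1j * np.outer(omegas, n)) @ h
```

On the uniform grid ωk = 2πk/N this reduces exactly to the ordinary DFT, which gives a convenient sanity check.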

  15. Post-Warp Linear Prediction • Take the IDFT of the power spectrum to get the warped autocorrelation coefficients • Durbin recursion to get new LP coefficients
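The two post-warp steps can be sketched as follows; `warped_autocorr` assumes the power spectrum samples the full 2π grid (symmetric spectrum), and `durbin` is the standard Levinson-Durbin recursion:

```python
import numpy as np

def warped_autocorr(power, order):
    """Inverse DFT of the (warped) power spectrum gives the warped
    autocorrelation coefficients r[0..order]."""
    r = np.fft.ifft(power).real
    return r[:order + 1]

def durbin(r, order):
    """Levinson-Durbin recursion: solve the Toeplitz normal equations
    for LP coefficients a (with a[0] = 1) from autocorrelation r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])
        k = -acc / err                    # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)              # prediction error update
    return a, err
```

For an AR(1)-like autocorrelation r = [1, 0.5, 0.25] the recursion recovers a single predictor tap of -0.5, a quick correctness check.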

  16. Conversion to Cepstrum • Convert warped LP parameters to a set of L cepstral parameters via recursion
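The recursion referred to above is the standard LP-to-cepstrum conversion; a sketch, assuming A(z) = 1 + Σ a_k z^-k with a[0] = 1:

```python
import numpy as np

def lpc_to_cepstrum(a, L):
    """Convert LP coefficients (a[0] = 1) to L cepstral coefficients
    of the all-pole model 1/A(z), gain term ignored, via
        c[n] = -a[n] - (1/n) * sum_{k=1}^{n-1} k * c[k] * a[n-k],
    where a[n] = 0 for n beyond the LP order."""
    p = len(a) - 1
    c = np.zeros(L + 1)
    for n in range(1, L + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```

For a single pole at 0.5 (a = [1, -0.5]) the cepstrum is known in closed form, c[n] = 0.5^n / n, which checks the recursion.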

  17. NDFT Warping: Vowel /u/

  18. NDFT Warping: Vowel /u/

  19. Clustering Measures • Derive meaningful measures to assess how well the feature clusters of each class (vowel) can be separated and discriminated • Three measures were considered • Determinant measure • Trace measure • Inverse trace measure

  20. Scatter Matrices • SW: Within-class scatter matrix • SB : Between-class scatter matrix • ST : Total scatter matrix
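The three scatter matrices can be computed as, for example:

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (S_W), between-class (S_B), and total
    (S_T = S_W + S_B) scatter matrices for feature matrix X
    (n_samples x n_dims) with class labels y."""
    mu = X.mean(axis=0)                      # global mean
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for cls in np.unique(y):
        Xc = X[y == cls]
        mu_c = Xc.mean(axis=0)               # class mean
        Sw += (Xc - mu_c).T @ (Xc - mu_c)    # scatter about class mean
        diff = (mu_c - mu).reshape(-1, 1)
        Sb += len(Xc) * (diff @ diff.T)      # class-mean scatter
    return Sw, Sb, Sw + Sb
```

The identity S_T = S_W + S_B means the total scatter equals the scatter of all samples about the global mean, a useful consistency check.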

  21. Determinant Measure • Ratio of the between-class and within-class scattering volume • The larger the value, the better the clustering

  22. Trace Measure • Ratio of the sum of scattering radii of between-class and within-class scatter • The larger, the better

  23. Inverse Trace Measure • Sum of within-class scattering radii normalized by the total scatter • The smaller, the better
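Under one common reading of the three verbal descriptions above (the exact formulas used in the thesis are not reproduced on the slides, so these expressions are assumptions), the measures could be computed as:

```python
import numpy as np

def clustering_measures(Sw, Sb, St):
    """Three scatter-based separability measures:
    - determinant measure: ratio of between- to within-class scattering
      volume, det(S_B)/det(S_W) (note: det(S_B) is zero when the number
      of classes does not exceed the feature dimension);
    - trace measure: ratio of scattering radii, tr(S_B)/tr(S_W);
    - inverse trace measure: within-class radii normalized by the
      total scatter, tr(S_W)/tr(S_T), smaller is better."""
    J_det = np.linalg.det(Sb) / np.linalg.det(Sw)
    J_tr = np.trace(Sb) / np.trace(Sw)
    J_inv = np.trace(Sw) / np.trace(St)
    return J_det, J_tr, J_inv
```

With Sw = I and Sb = diag(4, 1), for instance, the three measures evaluate to 4, 2.5, and 2/7.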

  24. Vowel Clustering Performance • We compared the values of the scattering measures discussed to assess the clustering performance of the NLP cepstrum • Mel, PLP and LP techniques were also tested for comparison

  25. Steady-State Vowel Database • Eleven steady-state English vowels from 23 speakers (12 male, 9 female, 2 children) • Sampling rate: 10 kHz • Each speaker provided 6 frames of steady-state vowel segments

  26. Results: Vowel Clustering

  27. 2-D Vowel Clusters: /a/ /i/ /o/

  28. 2-D Vowel Clusters: /a/ /e/ /i/

  29. Part II: Feature Normalization for Speaker Acoustics Matching

  30. Speech Recognition Problem • Given a sequence of acoustic feature vectors X extracted from speech, find the most likely word string that could have been uttered

  31. HMM Acoustic Model • Hidden Markov Models (HMMs): Each phone unit is modeled as a sequence of hidden states • Speech dynamics modeled as transitions from one state to another • Each state has a feature probability distribution • Goal: Guess the underlying state sequence (phone string) from the observable features

  32. Example: HMM Word Model • Digit “one” modeled as a five-state left-to-right HMM, states 1–5: pause, /w/, /ʌ/, /n/, pause

  33. Why Speaker Normalization ? • Most speech recognition systems use statistical models trained using a large database with the hope that the testing conditions will be similar • Acoustic mismatches between the speakers used in training and testing result in unacceptable degradation of recognition performance

  34. Prior Work in Speaker Normalization • Normalization usually refers to modification of the features to fit a statistical model • Vocal-tract length normalization (VTLN) • Attempts to alter the resonant frequencies of the vocal-tract by warping the frequency axis • Linear warping • All-pass warping (bilinear transform)

  35. Prior Work: Speaker Adaptation • Adaptation usually refers to modification of the model parameters to fit the data • Maximum Likelihood Bias • ML Linear Regression (MLLR)

  36. Part II: Speaker Normalization with Maximum-Likelihood Affine Cepstral Filtering

  37. Linear Cepstral Filtering (LCF) • We propose the following linear, Toeplitz transformation of the cepstral feature vectors

  38. Linear Cepstral Filtering (LCF) • H represents the linear cepstral transformation for normalizing speaker acoustics. • The matrix operation corresponds to • Convolution in the cepstral domain • Log spectral filtering in the frequency domain
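A sketch of the Toeplitz transformation: the matrix H has entries H[i, j] = h[i - j], so H·c is the convolution of h and c truncated to L terms.

```python
import numpy as np

def lcf(c, h):
    """Linear Cepstral Filtering: multiply the cepstral vector c by a
    lower-triangular Toeplitz matrix built from the filter h. This is
    exactly truncated convolution in the cepstral domain, i.e.
    log-spectral filtering in the frequency domain."""
    L = len(c)
    H = np.zeros((L, L))
    for i in range(L):
        H[i, :i + 1] = h[i::-1]   # row i holds h[i], h[i-1], ..., h[0]
    return H @ c
```

Because convolution commutes, lcf(c, h) == lcf(h, c): the transformed cepstrum is equally a linear function of the filter h, which is what makes the ML estimation on the following slides tractable.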

  39. Maximum-Likelihood Estimation • Find the optimal normalization H such that the transformed features yield maximum likelihood with respect to a given model Λ • Only L parameters for estimation (instead of L²)

  40. Commutative Property of LCF • Due to the commutative property of the convolution, the transformed cepstrum can also be expressed as a linear function of the filter h

  41. Solution: Single Gaussian Case • Let c(i) be the i-th feature of the data (i=0,…,N-1) • Let the distribution corresponding to c(i) be Gaussian with mean μi and covariance Σi • Total log-likelihood of transformed feature data set is a concave, quadratic function of the filter h

  42. Solution: Single Gaussian Case • Since the negative of the log-likelihood is convex in h, there exists a unique ML solution h*
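For the single-Gaussian case the convex quadratic has a closed-form minimizer via the normal equations; the sketch below assumes per-frame Gaussian means and covariances are given, and the helper and argument names are illustrative, not the thesis notation:

```python
import numpy as np

def toeplitz_from(c):
    """Lower-triangular Toeplitz matrix C with C[i, j] = c[i - j],
    so that C @ h is the truncated cepstral convolution c * h."""
    L = len(c)
    C = np.zeros((L, L))
    for i in range(L):
        C[i, :i + 1] = c[i::-1]
    return C

def ml_filter(frames, mus, sigmas):
    """Closed-form ML filter, single-Gaussian case. The negative
    log-likelihood sum_i (C_i h - mu_i)^T S_i^{-1} (C_i h - mu_i)
    is convex quadratic in h; the unique minimizer h* solves
        (sum_i C_i^T S_i^{-1} C_i) h = sum_i C_i^T S_i^{-1} mu_i."""
    L = frames[0].shape[0]
    A = np.zeros((L, L))
    b = np.zeros(L)
    for c, mu, S in zip(frames, mus, sigmas):
        C = toeplitz_from(c)
        Sinv = np.linalg.inv(S)
        A += C.T @ Sinv @ C
        b += C.T @ Sinv @ mu
    return np.linalg.solve(A, b)
```

As a sanity check, if each frame already matches its model mean, the estimated filter is the identity (a unit impulse in the cepstral domain).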

  43. Case: Gaussian Mixture • Log-likelihood is no longer a convex function • Approximation: We use the single Gaussian density for ML filter estimation • Past studies support the validity of the approximation

  44. Case: Log-Concave PDFs • For any distribution that is log-concave, ML estimation can be posed as a convex problem • Examples • Laplace: p(x) = (1/(2a)) exp(−|x|/a) • Uniform: p(x) = 1/(2a) on [−a, a] • Rayleigh: p(x) = (2x/b) exp(−x²/b), x > 0

  45. Affine Cepstral Filtering (ACF) • We can extend the linear transformation to an affine form by adding a cepstral bias term v • Bias models channel and other additive effects • Joint optimization of filter and bias leads to a more flexible transformation of the cepstral space

  46. Solution: Affine Transformation • By combining the filter h and bias v into an augmented design vector x, the joint ML solution can be easily attained by extending the linear case
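The augmented-design idea can be sketched as follows: since ĉ = C h + v = [C | I] x with x = [h; v], the same normal equations apply in the augmented variable. As above, the single-Gaussian assumption and the helper names are illustrative:

```python
import numpy as np

def lower_toeplitz(c):
    """Lower-triangular Toeplitz matrix with entries c[i - j]."""
    L = len(c)
    C = np.zeros((L, L))
    for i in range(L):
        C[i, :i + 1] = c[i::-1]
    return C

def ml_affine(frames, mus, sigmas):
    """Joint ML filter-and-bias estimate (affine cepstral filtering).
    With design D_i = [C_i | I] and x = [h; v], solve
        (sum_i D_i^T S_i^{-1} D_i) x = sum_i D_i^T S_i^{-1} mu_i.
    Returns x; first L entries are the filter h, last L the bias v."""
    L = frames[0].shape[0]
    A = np.zeros((2 * L, 2 * L))
    b = np.zeros(2 * L)
    eye = np.eye(L)
    for c, mu, S in zip(frames, mus, sigmas):
        D = np.hstack([lower_toeplitz(c), eye])
        Sinv = np.linalg.inv(S)
        A += D.T @ Sinv @ D
        b += D.T @ Sinv @ mu
    return np.linalg.solve(A, b)
```

With at least two sufficiently distinct frames the augmented normal matrix is invertible, and the solution separates cleanly into filter and bias parts.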

  47. Example: Vowel /ah/: No Warping, No Normalization

  48. Vowel /ah/: With NLP Warping, No Normalization

  49. Vowel /ah/: With NLP Warping and LCF Normalization

  50. Example: Vowel /oh/: No Warping, No Normalization
