
Presentation Transcript


  1. LORIA. Irina Illina, Dominique Fohr. Chania Meeting, May 9-10, 2007

  2. Missing Data: previous approach
  • Hypothesis: some coefficients of the feature vector are masked by noise
  • Marginalization: replace p(Y|M) by an integration over the possible values of the masked coefficients
  • Approach presented before: y = x + n (additive case, because we work in the spectral domain)
  • Two cases:
    • If SNR > 0 (x > n), then y/2 < x < y
    • If SNR < 0 (x < n), then 0 < x < y/2
  [Figure: the two marginalization intervals, [y/2, y] and [0, y/2], shown on the spectral energy axis]
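The two SNR cases above can be sketched as a small helper that returns the interval known to contain the clean coefficient x, given only the noisy observation y (the function name is illustrative, not from the slides):

```python
def clean_speech_interval(y, snr_positive):
    """Interval containing the clean spectral coefficient x, given y = x + n.

    Additive case (spectral domain):
      SNR > 0 (x > n): y/2 < x < y
      SNR < 0 (x < n): 0 < x < y/2
    """
    return (y / 2.0, y) if snr_positive else (0.0, y / 2.0)

# Example: noisy energy 8.0 in a band believed to be speech-dominated
lo, hi = clean_speech_interval(8.0, snr_positive=True)
print(lo, hi)  # 4.0 8.0
```

Whatever the sign of the SNR, the interval always has width y/2, which is what the new approach below tries to shrink.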

  3. WP1: Missing Data
  • Modified version of the approach presented before
  • Better approximation of the marginalization interval

  4. Missing Data: new approach
  • Choose the integration limits as a function of the mask estimation
  • The marginalization interval becomes smaller

  5. Proposed masks
  [Figure: noisy speech spectrum Y and clean speech spectrum X]
  • Each time-frequency unit is a scalar in [0; 1] giving the relative contribution of speech energy to the observed signal.
  • Different from SNR-based masks, where each unit gives the probability that the corresponding pixel is missing.
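When both the clean and noisy spectra are available (oracle conditions, as used for training the masks later), the proposed mask values can be computed as the ratio of speech energy to observed energy, clipped to [0, 1]. A minimal sketch, with hypothetical names:

```python
def oracle_mask(X, Y, eps=1e-10):
    """Relative contribution of speech energy per time-frequency unit.

    X, Y: lists of lists (frames x frequency bands) of clean / noisy
    spectral energies. Each mask value lies in [0, 1]:
    1.0 = unit dominated by speech, 0.0 = unit dominated by noise.
    """
    return [[min(1.0, x / max(y, eps)) for x, y in zip(xf, yf)]
            for xf, yf in zip(X, Y)]

# One frame, two bands: half speech energy in band 0, pure speech in band 1
print(oracle_mask([[2.0, 1.0]], [[4.0, 1.0]]))  # [[0.5, 1.0]]
```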

  6. Proposed masks
  [Figure: cluster 1 to cluster 4 mask prototypes]
  • Each cluster k is represented by:
    • a mean vector μk = (μ1, …, μN)
    • a diagonal covariance matrix Σk = diag(σ1, …, σN)
  • Clusters can be seen as pdfs of the contribution of speech energy to the noisy observed signal.
  • We propose to use these clusters as potential missing-data masks for any noisy input frame.

  7. Missing data: training
  • For each mask k, a GMM model is trained on the noisy frames Y aligned with Mk
  • An ergodic HMM is built from the previous GMMs

  8. Missing data: recognition
  • Use the ergodic HMM to find the mask k for each frame
  • Each frame y(t) -> one state -> one mask
  • Use μik and σik of Mk to define the marginalization interval:
    [μik - 2σik, μik + 2σik]
  • Marginalization: [integral over this interval, shown on the slide]
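For a Gaussian acoustic-model component, the bounded marginalization over [μik - 2σik, μik + 2σik] reduces to a difference of Gaussian CDFs. A sketch under that assumption (the slide's exact integral is not shown, and the function names are illustrative):

```python
import math

def gauss_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bounded_marginal(mu_state, sigma_state, mu_mask, sigma_mask):
    """Probability mass of the state's Gaussian inside the mask-derived
    interval [mu_mask - 2*sigma_mask, mu_mask + 2*sigma_mask]."""
    a = mu_mask - 2.0 * sigma_mask
    b = mu_mask + 2.0 * sigma_mask
    return gauss_cdf(b, mu_state, sigma_state) - gauss_cdf(a, mu_state, sigma_state)

# State Gaussian N(0, 1) marginalized over a mask interval centred on it:
# the +/-2 sigma interval captures about 95% of the mass
print(round(bounded_marginal(0.0, 1.0, 0.0, 1.0), 4))
```

When the mask interval is narrow and well placed, this mass discriminates between states much better than integrating over the full [0, y] or [y/2, y] range of the previous approach.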

  9. Missing Data: experiments
  • Parameterization: spectral domain, 12 Mel bands + Δ + ΔΔ
  • Training:
    • HMM models on clean Aurora4 + adaptation with the 50 first sentences of clean HIWIRE
    • Mk: trained on noisy HIWIRE (50 first sentences), LN + MN + HN + clean
  • Test: noisy HIWIRE (50 last sentences)

  10. Visualisation of the marginalisation intervals on an example
  • One spectral coefficient for the word "standby"
  [Figure: intervals for the new method and the previous method, clean and LN conditions]

  11. Visualisation of the marginalisation intervals on an example
  [Figure: intervals for the new method and the previous method, MN and HN conditions]

  12. WER evaluation
  [Table: WER of the new vs. previous method]

  13. WER-based evaluation
  • Comparison with ETSI AFE
  [Table: WER of the new method vs. the ETSI Advanced Front-End]

  14. Results
  [Table: WER (%) for the previous and new methods]
  • Oracle: X/Y -> Mk -> marginalisation

  15. New method: high-noise problem
  • The true value falls outside the marginalization interval

  16. Conclusion
  • A better approximation of the marginalization interval gives better recognition results, especially in the LN and MN conditions
  • But mask estimation must be improved in the MN and HN conditions

  17. WP2: Non-native speech recognition
  • Previous work:
    • 2 sets of models:
      • TIMIT HMM models
      • Native (Fr, It, Gr, Sp) HMM models
    • Confusion rules
    • Integration of the rules in the HMM
  • New study: different sets of models

  18. Different sets of models
  • TIMIT models (canonical English models)
  • Native models, L = {Fr, It, Sp, Gr}
  • MLLR-adapted models: TIMIT HMM adapted on HIWIREL (the HIWIRE data for language L)
  • MAP-adapted models: TIMIT HMM adapted on HIWIREL
  • Re-estimated models: TIMIT HMM + Baum-Welch iterations using HIWIREL
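The slides do not give the MAP update formula, but a common form of the MAP mean update interpolates the prior (TIMIT) mean with the sample mean of the adaptation data, weighted by a relevance factor. A sketch under that assumption (the relevance factor τ and the function name are not from the slides):

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP update of a Gaussian mean from adaptation frames (scalar sketch).

    tau is the prior weight (relevance factor): with no data the prior
    mean is kept; with much data the estimate moves toward the sample mean.
    """
    n = len(frames)
    if n == 0:
        return prior_mean
    sample_mean = sum(frames) / n
    return (tau * prior_mean + n * sample_mean) / (tau + n)

# Prior mean 0.0, ten adaptation frames all at 1.0, tau = 10:
# the adapted mean sits halfway between prior and data
print(map_adapt_mean(0.0, [1.0] * 10, tau=10.0))  # 0.5
```

MLLR, by contrast, estimates a shared linear transform of the means, which is why it needs less adaptation data than MAP but saturates sooner.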

  19. Experimental conditions
  • Adaptation and re-estimation:
    • Cross-validation (leave-one-out):
      • All speakers except one for adaptation or re-estimation
      • The remaining speaker for testing
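The leave-one-out scheme above can be sketched as a generator over speaker splits (names are illustrative):

```python
def leave_one_out(speakers):
    """Yield (train, test) splits: all speakers except one for
    adaptation/re-estimation, the remaining speaker for testing."""
    for i, test_spk in enumerate(speakers):
        train = speakers[:i] + speakers[i + 1:]
        yield train, test_spk

for train, test in leave_one_out(["spk1", "spk2", "spk3"]):
    print(test, "tested on models built from", train)
```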

  20. Results
  [Table: HMM TIMIT, TIMIT + native, MLLR adaptation with HIWIRE, MAP adaptation with HIWIRE, and retraining on HIWIRE; HIWIRE grammar and word-loop grammar]

  21. Results with confusion rules integrated in HMM (HIWIRE grammar)
  [Table, WER / SER per system: Baseline 7.2 / 14.6; then 5.3 / 10.2, 5.8 / 11.8, 4.8 / 10.9, 3.5 / 8.1, 2.8 / 6.4, 2.8 / 6.5, 2.1 / 5.0]
  • Best result with TIMIT HMM models (canonical English) + retrained models

  22. Results with speaker adaptation
  • Using the best system of the previous slide (confusion rules integrated in TIMIT HMM + re-estimation), we add a speaker-adaptation step:
    • 50 first sentences per speaker for adaptation
    • MAP adaptation
    • HIWIRE grammar
  • WER: 1.4%
  • SER: 3.2%

  23. Conclusion
  • Different sets of models have been tested
  • Baseline results: WER 7.2%, SER 14.6%
  • Best result is obtained with confusion rules integrated in TIMIT HMM + re-estimation + MAP speaker adaptation: WER 1.4%, SER 3.2%

  24. Extracted rules
  • Example of acoustic model modification for the English phone /t/
  [Figure: extracted rules mapping the English phone /t/ to French phone sequences (some phone symbols are lost in the transcript), and the modified HMM structure for model /t/, with English phones on one side and the corresponding French models on the other]
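The idea of confusion rules, mapping an English phone to the French phone sequences a non-native speaker may produce, can be sketched as a pronunciation expansion. The rule shown here (/t/ realised as /t k/) is purely hypothetical, since the actual phone symbols did not survive the transcript:

```python
def expand_pronunciations(phones, rules):
    """Apply phone-confusion rules: each phone may be realised as itself
    or as any of the confusable sequences listed for it in `rules`.
    Returns all alternative pronunciations (cartesian product)."""
    prons = [[]]
    for p in phones:
        alts = [[p]] + [list(alt) for alt in rules.get(p, [])]
        prons = [pron + alt for pron in prons for alt in alts]
    return prons

# Hypothetical rule: English /t/ may also be realised as /t k/
rules = {"t": [("t", "k")]}
print(expand_pronunciations(["s", "t"], rules))
# -> [['s', 't'], ['s', 't', 'k']]
```

Integrating such alternatives directly into the HMM topology, as on the slide, amounts to adding parallel state paths through the French models rather than expanding the lexicon.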
