
ICCS-NTUA : WP1+WP2



Presentation Transcript


  1. ICCS-NTUA: WP1+WP2 (HIWIRE) Prof. Petros Maragos, NTUA, School of ECE. Computer Vision, Speech Communication and Signal Processing Research Group. URL: http://cvsp.cs.ntua.gr

  2. ICCS-NTUA in HIWIRE: 1st Year • Evaluation • Databases Completed • Baseline Completed • WP1 • Noise Robust Features Results 1st Year • Audio-Visual ASR Baseline + Visual Features • Multi-microphone array Exploratory Phase • VAD Prelim. Results • WP2 • Speaker Normalization Baseline • Non-native Speech Database Completed HIWIRE Meeting, Granada June 2005

  3. ICCS-NTUA in HIWIRE: 1st Year • Evaluation • Databases Completed • Baseline Completed • WP1 • Noise Robust Features Results 1st Year • Modulation Features Results 1st Year • Fractal Features Results 1st Year • Audio-Visual ASR Baseline + Visual Features • Multi-microphone array Exploratory Phase • VAD Prelim. Results • WP2 • Speaker Normalization Baseline • Non-native Speech Database Completed

  4. WP1: Noise Robustness • Platform: HTK • Baseline + Evaluation: • Aurora 2, Aurora 3, TIMIT+NOISE • Modulation Features • AM-FM Modulations • Teager Energy Cepstrum • Fractal Features • Dynamical Denoising • Correlation Dimension • Multiscale Fractal Dimension • Hybrid-Merged Features • Improvements: up to +62% (Aurora 3), up to +36% (Aurora 2), up to +61% (Aurora 2)

  5. ICCS-NTUA in HIWIRE: 1st Year • Evaluation • Databases Completed • Baseline Completed • WP1 • Noise Robust Features Results 1st Year • Speech Modulation Features Results 1st Year • Fractal Features Results 1st Year • Audio-Visual ASR Baseline + Visual Features • Multi-microphone array Exploratory Phase • VAD Prelim. Results • WP2 • Speaker Normalization Baseline • Non-native Speech Database Completed

  6. Modulation Acoustic Features • Pipeline: Speech → Regularization + Multiband Filtering → Nonlinear Processing / Demodulation → Statistical Processing → Robust Feature Transformation / Selection (with V.A.D.) • AM-FM Modulation Features: Mean Inst. Ampl. (IA-Mean), Mean Inst. Freq. (IF-Mean), Freq. Mod. Percent. (FMP) • Energy Features: Teager Energy Cepstrum Coeff. (TECC)
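The demodulation block above can be illustrated with the discrete Teager-Kaiser energy operator and a DESA-1a-style amplitude/frequency estimator. This is a minimal sketch, not the authors' actual TECC/AM-FM front-end (which applies multiband filtering and statistics first); function names are ours:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager-Kaiser energy operator:
    Psi[x](n) = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa1(x, fs):
    """DESA-1a-style instantaneous amplitude and frequency estimates
    from the Teager energies of x and its first difference."""
    y = np.diff(x)                       # first difference of the signal
    psi_x = teager_energy(x)             # Psi[x]
    psi_y = teager_energy(y)             # Psi[y]
    n = min(len(psi_x), len(psi_y))
    psi_x, psi_y = psi_x[:n], psi_y[:n]
    # cos(Omega) ~ 1 - Psi[y] / (2 Psi[x]); Omega in rad/sample
    cos_om = 1.0 - psi_y / (2.0 * psi_x)
    omega = np.arccos(np.clip(cos_om, -1.0, 1.0))
    f_inst = omega * fs / (2.0 * np.pi)  # instantaneous frequency in Hz
    # |a(n)| ~ sqrt(Psi[x] / sin^2(Omega))
    a_inst = np.sqrt(psi_x / np.maximum(np.sin(omega) ** 2, 1e-12))
    return a_inst, f_inst
```

For a pure tone the estimates are exact: a 1 kHz cosine of amplitude 2 at 8 kHz sampling yields a constant instantaneous frequency of 1000 Hz and amplitude 2. IA-Mean, IF-Mean and FMP would then be frame statistics of these tracks.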

  7. TIMIT-based Speech Databases • TIMIT Database: • Training Set: 3696 sentences, ~35 phonemes/utterance • Testing Set: 1344 utterances, 46680 phonemes • Sampling Frequency: 16 kHz • Feature Vectors: • MFCC+C0+AM-FM + 1st + 2nd Time Derivatives • Stream Weights: (1) for MFCC and (2) for AM-FM • 3-state left-right HMMs, 16 mixtures • All-pair, unweighted grammar • Performance Criterion: Phone Accuracy Rates (%) • Back-end System: HTK v3.2.0

  8. Results: TIMIT+Noise Up to +106%

  9. Aurora 3 - Spanish • Connected-Digits, Sampling Frequency 8 kHz • Training Set: • WM (Well-Matched): 3392 utterances (quiet 532, low 1668 and high noise 1192) • MM (Medium-Mismatch): 1607 utterances (quiet 396 and low noise 1211) • HM (High-Mismatch): 1696 utterances (quiet 266, low 834 and high noise 596) • Testing Set: • WM: 1522 utterances (quiet 260, low 754 and high noise 508), 8056 digits • MM: 850 utterances (quiet 0, low 0 and high noise 850), 4543 digits • HM: 631 utterances (quiet 0, low 377 and high noise 254), 3325 digits • 2 Back-end ASR Systems (HTK and BLasr) • Feature Vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC • All-Pair, Unweighted Grammar (or Word-Pair Grammar) • Performance Criterion: Word (digit) Accuracy Rates

  10. Results: Aurora 3 (HTK) Up to +62%

  11. Databases: Aurora 2 • Task: Speaker-Independent Recognition of Digit Sequences • TI-Digits at 8 kHz • Training (8440 Utterances per scenario, 55M/55F) • Clean (8 kHz, G712) • Multi-Condition (8 kHz, G712) • 4 Noises (artificial): subway, babble, car, exhibition • 5 SNRs: 5, 10, 15, 20 dB, clean • Testing, artificially added noise • 7 SNRs: [-5, 0, 5, 10, 15, 20 dB, clean] • A: noises as in multi-condition training, G712 (28028 Utterances) • B: restaurant, street, airport, train station, G712 (28028 Utterances) • C: subway, street (MIRS) (14014 Utterances)
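The "artificially added noise" step above amounts to scaling a noise recording so that the mixture hits a target SNR. A minimal sketch (it omits the G.712/MIRS channel filtering that the Aurora tooling also applies; the function name is ours):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio of the
    mixture equals `snr_db` (in dB), then add it to `speech`.
    Assumes `noise` has the same length as `speech`."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain g such that 10*log10(p_speech / (g^2 * p_noise)) == snr_db
    g = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + g * noise
```

In practice one would also pick a random segment of a longer noise file and, for the Aurora condition sets, pass the mixture through the appropriate channel filter.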

  12. Results: Aurora 2 Up to +12%

  13. Work To Be Done on Modulation Features • Refinements w.r.t. AM-FM Features • Fusion w. Other Features

  14. ICCS-NTUA in HIWIRE: 1st Year • Evaluation • Databases Completed • Baseline Completed • WP1 • Noise Robust Features Results 1st Year • Speech Modulation Features Results 1st Year • Fractal Features Results 1st Year • Audio-Visual ASR Baseline + Visual Features • Multi-microphone array Exploratory Phase • VAD Prelim. Results • WP2 • Speaker Normalization Baseline • Non-native Speech Database Completed

  15. Fractal Features • Block diagram: noisy speech signal → N-d signal (noisy embedding) → Local SVD + Geometrical Filtering → filtered embedding / cleaned N-d signal → Filtered Dynamics - Correlation Dimension (FDCD, 8 features) and Multiscale Fractal Dimension (MFD, 6 features)
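The embedding and correlation-dimension blocks can be sketched with a time-delay embedding and a Grassberger-Procaccia-style correlation sum. This is a toy illustration only (the SVD-based geometrical filtering step is omitted, and all names and parameter choices are ours):

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Time-delay embedding of a scalar series into R^dim."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau: i * tau + n] for i in range(dim)], axis=1)

def correlation_sum(points, r):
    """Fraction of distinct point pairs closer than radius r
    (the correlation integral C(r))."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    mask = ~np.eye(len(points), dtype=bool)
    return np.mean(d[mask] < r)

def correlation_dimension(x, dim=3, tau=1, radii=(0.05, 0.1, 0.2)):
    """Estimate the correlation dimension as the slope of
    log C(r) versus log r over a few radii."""
    pts = delay_embed(np.asarray(x, dtype=float), dim, tau)
    log_r = np.log(radii)
    log_c = np.log([correlation_sum(pts, r) for r in radii])
    slope, _ = np.polyfit(log_r, log_c, 1)
    return slope
```

A slowly sampled sinusoid embeds onto a closed curve, so its estimated correlation dimension is close to 1; noise raises the estimate, which is what makes the feature informative for noisy speech.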

  16. Databases: Aurora 2 • Task: Speaker-Independent Recognition of Digit Sequences • TI-Digits at 8 kHz • Training (8440 Utterances per scenario, 55M/55F) • Clean (8 kHz, G712) • Multi-Condition (8 kHz, G712) • 4 Noises (artificial): subway, babble, car, exhibition • 5 SNRs: 5, 10, 15, 20 dB, clean • Testing, artificially added noise • 7 SNRs: [-5, 0, 5, 10, 15, 20 dB, clean] • A: noises as in multi-condition training, G712 (28028 Utterances) • B: restaurant, street, airport, train station, G712 (28028 Utterances) • C: subway, street (MIRS) (14014 Utterances)

  17. Results: Aurora 2 Up to +40%

  18. Results: Aurora 2 Up to +27%

  19. Results: Aurora 2 Up to +61%

  20. Future Directions on Fractal Features • Refine Fractal Feature Extraction. • Application to Aurora 3. • Fusion with other features.

  21. ICCS-NTUA in HIWIRE: 1st Year • Evaluation • Databases Completed • Baseline Completed • WP1 • Noise Robust Features Results 1st Year • Audio-Visual ASR Baseline + Visual Features • Multi-microphone array Exploratory Phase • VAD Prelim. Results • WP2 • Speaker Normalization Baseline • Non-native Speech Database Completed

  22. Visual Front-End • Aim: • Extract a low-dimensional visual speech feature vector from video • Visual front-end modules: • Speaker's face detection • ROI tracking • Facial model fitting • Visual feature extraction • Challenges: • Very high-dimensional signal: which features are appropriate? • Robustness • Computational efficiency

  23. Face Modeling • A well-studied problem in Computer Vision: • Active Appearance Models, Morphable Models, Active Blobs • Both shape & appearance can enhance lipreading • The shape and appearance of human faces “live” in low-dimensional manifolds
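The low-dimensional manifold idea can be illustrated with a PCA shape model, the linear building block of an Active Appearance Model. A minimal sketch under simplifying assumptions (no Procrustes alignment or appearance stream; class and method names are ours):

```python
import numpy as np

class LinearShapeModel:
    """PCA model of landmark shapes: shape ~ mean + P.T @ b,
    with a short parameter vector b controlling the modes."""

    def __init__(self, shapes, var_kept=0.95):
        X = np.asarray(shapes, dtype=float)   # (n_samples, 2*n_landmarks)
        self.mean = X.mean(axis=0)
        # Principal modes of the centered training shapes
        _, s, Vt = np.linalg.svd(X - self.mean, full_matrices=False)
        var = s ** 2
        k = int(np.searchsorted(np.cumsum(var) / var.sum(), var_kept)) + 1
        self.P = Vt[:k]                       # (k, 2*n_landmarks)

    def project(self, shape):
        """Shape -> low-dimensional parameter vector b."""
        return self.P @ (np.asarray(shape, dtype=float) - self.mean)

    def reconstruct(self, b):
        """Parameters b -> shape in the original landmark space."""
        return self.mean + self.P.T @ b
```

A full AAM adds an analogous PCA model of shape-normalized appearance and an iterative fitting loop; the projection/reconstruction pair above is also what produces the "reconstructed face" shown later in the deck.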

  24. Image Fitting Example (fitting iterations shown at steps 2, 6, 10, 14, 18)

  25. Example: Face Interpretation Using AAM • Generative models like AAMs allow us to evaluate the output of the visual front-end: the original video, the shape track superimposed on the original video, and the reconstructed face. This is what the visual-only speech recognizer “sees”!

  26. Evaluation on the CUAVE Database

  27. Audio-Visual ASR: Database • Subset of CUAVE database used: • 36 speakers (30 training, 6 testing) • 5 sequences of 10 connected digits per speaker • Training set: 1500 digits (30x5x10) • Test set: 300 digits (6x5x10) • CUAVE database also contains more complex data sets: speaker moving around, speaker shows profile, continuous digits, two speakers (to be used in future evaluations) • CUAVE was kindly provided by Clemson University

  28. Recognition Results (Word Accuracy) • Data • Training: ~500 digits (29 speakers) • Testing: ~100 digits (4 speakers)

  29. Future Work • Visual Front-end • Better trained AAM • Temporal tracking • Feature fusion • Experimentation with alternative DBN architectures • Automatic stream weight determination • Integration with non-linear acoustic features • Experiments on other audio-visual databases • Systematic evaluation of visual features

  30. ICCS-NTUA in HIWIRE: 1st Year • Evaluation • Databases Completed • Baseline Completed • WP1 • Noise Robust Features Results 1st Year • Modulation Features Results 1st Year • Fractal Features Results 1st Year • Audio-Visual ASR Baseline + Visual Features • Multi-microphone array Exploratory Phase • VAD Prelim. Results • WP2 • Speaker Normalization Baseline • Non-native Speech Database Completed

  31. User Robustness, Speaker Adaptation • VTLN Baseline • Platform: HTK • Database: AURORA 4 • Fs = 8 kHz • Scenarios: Training, Testing • Comparison with MLLR • Collection of non-Native Speech Data Completed • 10 Speakers • 100 Utterances/Speaker

  32. Frequency Warping: Vocal Tract Length Normalization • Implementation: HTK • Warping Factor Estimation: Maximum Likelihood (ML) criterion • Figures from Hain99, Lee96
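The frequency warping used in VTLN is commonly a piecewise-linear map: frequencies below a cut point are scaled by the warping factor alpha, and the rest is mapped linearly so the band edge stays fixed. A sketch of that idea (the 0.8 cut fraction and 4 kHz band edge are illustrative assumptions, not values from the slides):

```python
import numpy as np

def piecewise_linear_warp(f, alpha, f_cut=0.8, f_max=4000.0):
    """Piecewise-linear VTLN frequency warp: below f_cut*f_max the
    frequency axis is scaled by alpha; above it, a second linear
    segment maps the remainder so that f_max stays fixed."""
    f = np.asarray(f, dtype=float)
    f0 = f_cut * f_max
    return np.where(
        f <= f0,
        alpha * f,                                           # scaled region
        alpha * f0 + (f_max - alpha * f0) / (f_max - f0) * (f - f0),
    )
```

With alpha > 1 the low band is stretched (shorter vocal tract), with alpha < 1 compressed; in an MFCC front-end the warp is applied to the filterbank center frequencies before feature extraction.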

  33. VTLN • Training • AURORA 4 Baseline Setup • Clean (SIC), Multi-Condition (SIM), Noisy (SIN) • Testing • Estimate warping factor using adaptation utterances (Supervised VTLN) • Per-speaker warping factor (1, 2, 10, 20 Utterances) • 2-pass Decoding • 1st pass • Get a hypothesis transcription • Alignment and ML to estimate a per-utterance warping factor • 2nd pass • Decode the properly normalized utterance
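The ML warping-factor estimation above reduces to a grid search: warp the adaptation utterances with each candidate alpha, score them by forced-alignment likelihood against the (given or first-pass) transcription, and keep the best. A sketch with a stand-in scorer (`loglik` would be an HMM toolkit's forced-alignment score; the grid range is an assumption):

```python
import numpy as np

def estimate_warp_factor(utterances, loglik,
                         alphas=np.arange(0.88, 1.13, 0.02)):
    """Grid search for the ML warping factor: total the per-utterance
    alignment log likelihoods for each candidate alpha and return the
    maximizer. `loglik(utterance, alpha)` is a stand-in for scoring
    the alpha-warped utterance against its transcription."""
    totals = [sum(loglik(u, a) for u in utterances) for a in alphas]
    return float(alphas[int(np.argmax(totals))])
```

For per-speaker supervised VTLN the sum runs over that speaker's 1-20 adaptation utterances; in the 2-pass scheme it is applied per utterance with the first-pass hypothesis as the transcription.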

  34. Databases: Aurora 4 • Task: 5000-Word, Continuous Speech Recognition • WSJ0: (16 / 8 kHz) + Artificially Added Noise • 2 microphones: Sennheiser, Other • Filtering: G712, P341 • Noises: Car, Babble, Restaurant, Street, Airport, Train Station • Training (7138 Utterances per scenario) • Clean: Sennheiser mic. • Multi-Condition: Sennheiser + Other mic., 75% w. artificially added noise @ SNR: 10-20 dB • Noisy: Sennheiser, artificially added noise @ SNR: 10-20 dB • Testing (330 Utterances, or 166-Utterance short sets; 8 speakers) • SNR: 5-15 dB • Sets 1-7: Sennheiser microphone • Sets 8-14: Other microphone

  35. VTLN Results, Clean Training

  36. VTLN Results, Multi-Condition Training

  37. VTLN Results, Noisy Training

  38. Future Directions for Speaker Normalization • Estimate warping transforms at the signal level • Exploit instantaneous amplitude or frequency signals to estimate the warping parameters; normalize the signal • Effective integration with model-based adaptation techniques (collaboration with TSI)

  39. ICCS-NTUA in HIWIRE: 1st Year • Evaluation • Databases Completed • Baseline Completed • WP1 • Noise Robust Features Results 1st Year • Audio-Visual ASR Baseline + Visual Features • Multi-microphone array Exploratory Phase • VAD Prelim. Results • WP2 • Speaker Normalization Baseline • Non-native Speech Database Completed

  40. WP1: Appendix Slides: Aurora 3

  41. TIMIT-Based Speech Databases: Correct Phone Accuracies (%) (ASR Results I)

  Features         TIMIT   NTIMIT  +Babble  +White  +Pink   +Car    Av. Rel. Improv.
  MFCC*            58.40   42.42   27.71    17.72   18.60   52.75   -
  TEner. CC        58.89   42.40   41.61    34.74   38.40   54.35   24.26
  MFCC*+IA-Mean    59.61   43.53   39.25    26.03   31.05   56.50   17.62
  MFCC*+IF-Mean    59.34   43.70   36.87    25.38   30.92   55.30   15.58
  MFCC*+FMP        59.92   43.69   38.60    26.15   32.84   55.97   18.17

  * MFCC+C0+D+DD, # states = 3, # mixtures = 16

  42. Aurora 3 (Spanish Task): Correct Word Accuracies (%) (Experimental Results IIa, HTK)

  Features                  WM      MM      HM      Average  Av. Rel. Improv.
  Aurora Front-End (WI007)  92.94   80.31   51.55   74.93    -
  MFCC*                     93.68   92.73   65.18   83.86    +35.62 %
  TEnerCC+log(Ener)+CMS     93.64   91.61   86.85   90.70    +62.90 %
  MFCC*+IA-Mean             94.05   92.22   77.70   87.99    +52.09 %
  MFCC*+IF-Mean             90.71   89.52   72.36   84.20    +36.98 %
  MFCC*+FMP                 94.41   92.46   82.73   89.87    +59.59 %

  * MFCC+log(Ener)+D+DD+CMS, # states = 14, # mixtures = 14

  43. Aurora 3 Configs • HM: 14 states, 12 mixtures • MM: 16 states, 6 mixtures • WM: 16 states, 16 mixtures

  44. WP1: Appendix Slides: Aurora 2

  45. Baseline: Aurora 2 • Database Structure: • 2 Training Scenarios, 3 Test Sets, [4+4+2] Conditions, 7 SNRs per Condition: Total of 2x70 Tests • Presentation of Selected Results: • Average over SNR. • Average over Condition. • Training Scenarios: Clean vs. Multi-Condition Training. • Noise Level: Low vs. High SNR. • Condition: Worst vs. Easy Conditions. • Features: MFCC+D+A vs. MFCC+D+A+CMS • Setup: # states 18 [10-22], # mixtures [3-32], MFCC+D+A+CMS

  46. Average Baseline Results: Aurora 2 (average over all SNRs and all Conditions)

                                 Developed Baseline
  Training Scenario  Baseline*   PLAIN   CMS     Best PLAIN  Best CMS
  Clean              58.26       58.26   55.48   59.10       57.58
  Multi              78.66       81.90   84.04   82.04       84.17

  Plain: MFCC+D+A; CMS: MFCC+D+A+CMS. Mixture #: Clean train (both Plain and CMS) 3; Multi train Plain: 22, CMS: 32. Best: select for each condition/noise the # mixtures with the best result. * Average HTK results as reported with the database.

  47. Results: Aurora 2 Up to +12%

  48. Results: Aurora 2 Up to +40%

  49. Results: Aurora 2 Up to +27%

  50. Results: Aurora 2 Up to +61%
