ICCS-NTUA: WP1+WP2
Prof. Petros Maragos, NTUA, School of ECE
URL: http://cvsp.cs.ntua.gr
HIWIRE
Computer Vision, Speech Communication and Signal Processing Research Group
ICCS-NTUA in HIWIRE: 1st Year
• Evaluation
  • Databases: Completed
  • Baseline: Completed
• WP1
  • Noise Robust Features: Results 1st Year
    • Modulation Features: Results 1st Year
    • Fractal Features: Results 1st Year
  • Audio-Visual ASR: Baseline + Visual Features
  • Multi-microphone array: Exploratory Phase
  • VAD: Prelim. Results
• WP2
  • Speaker Normalization: Baseline
  • Non-native Speech Database: Completed
HIWIRE Meeting, Granada, June 2005
WP1: Noise Robustness
• Platform: HTK
• Baseline + Evaluation: Aurora 2, Aurora 3, TIMIT+NOISE
• Modulation Features
  • AM-FM Modulations
  • Teager Energy Cepstrum
• Fractal Features
  • Dynamical Denoising
  • Correlation Dimension
  • Multiscale Fractal Dimension
• Hybrid-Merged Features
Improvements: up to +62% (Aurora 3), up to +36% (Aurora 2), up to +61% (Aurora 2)
Modulation Acoustic Features
Processing chain: Speech → Regularization + Multiband Filtering → Nonlinear Processing / Demodulation → Statistical Processing → Robust Feature Transformation/Selection → Features / V.A.D.
• AM-FM Modulation Features:
  • Mean Instantaneous Amplitude (IA-Mean)
  • Mean Instantaneous Frequency (IF-Mean)
  • Frequency Modulation Percentage (FMP)
• Energy Features: Teager Energy Cepstrum Coefficients (TECC)
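The demodulation step above can be instantiated with the Teager-Kaiser energy operator and the discrete energy separation algorithm (DESA-2), the standard route to instantaneous amplitude/frequency signals that the IA-Mean, IF-Mean, and FMP statistics are computed from. This is a minimal single-band sketch; the actual front-end applies it per band after multiband filtering.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1]**2 - x[:-2]*x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]      # replicate edges
    return np.maximum(psi, 1e-12)          # clamp tiny negatives for stability

def desa2(x):
    """DESA-2 energy separation: estimate the instantaneous amplitude and
    frequency (rad/sample) of a bandpassed AM-FM signal."""
    psi_x = teager_energy(x)
    y = np.empty_like(x)
    y[1:-1] = x[2:] - x[:-2]               # symmetric difference
    y[0], y[-1] = y[1], y[-2]
    psi_y = teager_energy(y)
    omega = 0.5*np.arccos(np.clip(1.0 - psi_y/(2.0*psi_x), -1.0, 1.0))
    amp = 2.0*psi_x/np.sqrt(psi_y)
    return amp, omega
```

For a pure sinusoid the estimates are exact away from the edges; frame-level statistics of `amp` and `omega` then yield the IA-Mean, IF-Mean, and FMP features.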
TIMIT-Based Speech Databases
• TIMIT Database:
  • Training Set: 3696 sentences, ~35 phonemes/utterance
  • Testing Set: 1344 utterances, 46680 phonemes
  • Sampling Frequency: 16 kHz
• Feature Vectors:
  • MFCC+C0+AM-FM + 1st and 2nd Time Derivatives
  • Stream Weights: (1) for MFCC and (2) for AM-FM
• 3-state left-right HMMs, 16 mixtures
• All-pair, unweighted grammar
• Performance Criterion: Phone Accuracy Rates (%)
• Back-end System: HTK v3.2.0
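The stream weights listed above enter HTK-style multi-stream HMMs as exponents on the per-stream output likelihoods, log b(o) = Σ_s w_s log b_s(o_s). A minimal sketch with diagonal-Gaussian streams (the single-Gaussian form is a simplification; the actual models use 16-component mixtures):

```python
import numpy as np

def stream_log_likelihood(obs, means, variances, weights):
    """Weighted combination of per-stream diagonal-Gaussian log-likelihoods:
        log b(o) = sum_s w_s * log b_s(o_s)
    obs/means/variances are per-stream vectors, weights the stream exponents."""
    total = 0.0
    for o, mu, var, w in zip(obs, means, variances, weights):
        ll = -0.5*np.sum(np.log(2.0*np.pi*var) + (o - mu)**2/var)
        total += w*ll
    return total
```

Raising a stream's weight scales its log-likelihood, i.e. sharpens that stream's influence on the state score.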
Results: TIMIT+Noise (up to +106%)
Aurora 3 - Spanish
• Connected digits, Sampling Frequency 8 kHz
• Training Set:
  • WM (Well-Matched): 3392 utterances (quiet 532, low 1668, high noise 1192)
  • MM (Medium-Mismatch): 1607 utterances (quiet 396, low noise 1211)
  • HM (High-Mismatch): 1696 utterances (quiet 266, low 834, high noise 596)
• Testing Set:
  • WM: 1522 utterances (quiet 260, low 754, high noise 508), 8056 digits
  • MM: 850 utterances (high noise 850), 4543 digits
  • HM: 631 utterances (low 377, high noise 254), 3325 digits
• 2 back-end ASR systems (HTK and BLasr)
• Feature Vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC
• All-pair, unweighted grammar (or word-pair grammar)
• Performance Criterion: Word (digit) Accuracy Rates
Results: Aurora 3, HTK (up to +62%)
Databases: Aurora 2
• Task: speaker-independent recognition of digit sequences
• TI-Digits at 8 kHz
• Training (8440 utterances per scenario, 55 male / 55 female speakers):
  • Clean (8 kHz, G.712)
  • Multi-Condition (8 kHz, G.712)
    • 4 noises (artificially added): subway, babble, car, exhibition
    • 5 SNRs: 5, 10, 15, 20 dB, clean
• Testing (artificially added noise):
  • 7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean
  • Set A: noises as in multi-condition training, G.712 (28028 utterances)
  • Set B: restaurant, street, airport, train station, G.712 (28028 utterances)
  • Set C: subway, street (MIRS) (14014 utterances)
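The "artificially added noise at a given SNR" used throughout Aurora-style corpora amounts to scaling a noise recording so the speech-to-noise power ratio hits the target before mixing. A sketch of that step (Aurora additionally applies G.712/MIRS channel filtering, which is omitted here):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech at a target global SNR in dB.
    The noise is tiled if shorter than the speech, then scaled so that
    10*log10(P_speech / P_noise_scaled) == snr_db."""
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech)/len(noise))))
    noise = noise[:len(speech)]
    p_s = np.mean(speech**2)
    p_n = np.mean(noise**2)
    gain = np.sqrt(p_s / (p_n * 10.0**(snr_db/10.0)))
    return speech + gain*noise
```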
Results: Aurora 2 (up to +12%)
Work To Be Done on Modulation Features
• Refinements w.r.t. AM-FM features
• Fusion with other features
Fractal Features
Block diagram: the speech signal is embedded into an N-d signal (noisy embedding), which is then filtered along two paths:
• Local SVD → filtered embedding / cleaned N-d signal → Filtered Dynamics - Correlation Dimension (FDCD, 8 features)
• Geometrical Filtering → Multiscale Fractal Dimension (MFD, 6 features)
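The correlation-dimension branch above rests on delay-coordinate embedding plus the Grassberger-Procaccia correlation sum, whose log-log slope estimates the dimension. A brute-force sketch under those standard definitions (the SVD denoising step and the exact feature extraction of the slide are omitted; real signals need a neighbor tree rather than a full distance matrix):

```python
import numpy as np

def embed(x, dim, tau):
    """Delay-coordinate (Takens) embedding of a 1-D signal."""
    n = len(x) - (dim - 1)*tau
    return np.stack([x[i*tau:i*tau + n] for i in range(dim)], axis=1)

def correlation_dimension(x, dim=4, tau=2, radii=None):
    """Grassberger-Procaccia estimate: slope of log C(r) vs log r,
    where C(r) is the fraction of point pairs closer than r."""
    X = embed(x, dim, tau)
    d = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(-1))
    d = d[np.triu_indices(len(X), k=1)]            # unique pairs
    if radii is None:
        radii = np.logspace(np.log10(np.percentile(d, 5)),
                            np.log10(np.percentile(d, 50)), 10)
    C = np.array([np.mean(d < r) for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
    return slope
```

A noise-free sinusoid embeds onto a closed curve, so its estimated dimension is near 1; broadband noise pushes the estimate toward the embedding dimension, which is what makes the measure useful for noise-robust features.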
Results: Aurora 2 (up to +40%)
Results: Aurora 2 (up to +27%)
Results: Aurora 2 (up to +61%)
Future Directions on Fractal Features
• Refine fractal feature extraction
• Application to Aurora 3
• Fusion with other features
Visual Front-End
• Aim: extract a low-dimensional visual speech feature vector from video
• Visual front-end modules:
  • Speaker's face detection
  • ROI tracking
  • Facial model fitting
  • Visual feature extraction
• Challenges:
  • Very high-dimensional signal: which features are proper?
  • Robustness
  • Computational efficiency
Face Modeling
• A well-studied problem in Computer Vision: Active Appearance Models, Morphable Models, Active Blobs
• Both shape and appearance can enhance lipreading
• The shape and appearance of human faces "live" in low-dimensional manifolds
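The "low-dimensional manifold" idea is realized in AAMs as a linear PCA model, s ≈ s0 + P·b, fit separately to shape and appearance. A minimal shape-model sketch (real AAMs first remove similarity transforms via Procrustes alignment, which is skipped here):

```python
import numpy as np

def train_shape_model(shapes, keep=0.95):
    """PCA shape model: mean shape s0 plus the top modes P such that
    s ~= s0 + P.T @ b, keeping `keep` of the total variance.
    `shapes` is (num_examples, num_landmarks, 2)."""
    X = shapes.reshape(len(shapes), -1).astype(float)
    s0 = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - s0, full_matrices=False)
    var = S**2
    k = int(np.searchsorted(np.cumsum(var)/var.sum(), keep)) + 1
    return s0, Vt[:k]                      # mean shape, k principal modes

def project(shape, s0, P):
    """Encode a shape as low-dimensional parameters b, and reconstruct it."""
    b = P @ (shape.ravel() - s0)
    return b, s0 + P.T @ b
```

Fitting then amounts to searching over `b` (plus pose and appearance parameters) to match the image, and the fitted `b` trajectory is itself a candidate visual feature.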
Image Fitting Example: AAM fitting iterations shown at steps 2, 6, 10, 14, 18.
Example: Face Interpretation Using AAM
• Generative models like the AAM allow us to evaluate the output of the visual front-end: the original video, the shape track superimposed on the original video, and the reconstructed face.
• The reconstructed face is what the visual-only speech recognizer "sees"!
Evaluation on the CUAVE Database
Audio-Visual ASR: Database
• Subset of the CUAVE database used:
  • 36 speakers (30 training, 6 testing)
  • 5 sequences of 10 connected digits per speaker
  • Training set: 1500 digits (30x5x10)
  • Test set: 300 digits (6x5x10)
• CUAVE also contains more complex data sets: speaker moving around, speaker in profile, continuous digits, two speakers (to be used in future evaluations)
• CUAVE was kindly provided by Clemson University
Recognition Results (Word Accuracy)
• Data
  • Training: ~500 digits (29 speakers)
  • Testing: ~100 digits (4 speakers)
Future Work
• Visual front-end:
  • Better-trained AAM
  • Temporal tracking
• Feature fusion:
  • Experimentation with alternative DBN architectures
  • Automatic stream weight determination
  • Integration with non-linear acoustic features
• Experiments on other audio-visual databases
• Systematic evaluation of visual features
User Robustness, Speaker Adaptation
• VTLN Baseline
  • Platform: HTK
  • Database: AURORA 4, Fs = 8 kHz
  • Scenarios: Training, Testing
  • Comparison with MLLR
• Collection of non-native speech data: Completed
  • 10 speakers
  • 100 utterances/speaker
Vocal Tract Length Normalization: Frequency Warping
• Implementation: HTK
• Warping factor estimation: Maximum Likelihood (ML) criterion
(Figures from Hain99, Lee96)
VTLN
• Training
  • AURORA 4 baseline setup
  • Clean (SIC), Multi-Condition (SIM), Noisy (SIN)
• Testing
  • Estimate warping factor using adaptation utterances (supervised VTLN)
    • Per-speaker warping factor (1, 2, 10, 20 utterances)
  • 2-pass decoding
    • 1st pass: get a hypothesized transcription; alignment and ML estimation of a per-utterance warping factor
    • 2nd pass: decode the properly normalized utterance
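The ML warping-factor estimation above is typically a grid search: warp the filterbank with each candidate factor, rescore the features against the aligned models, and keep the best. A sketch under common assumptions (HTK-style piecewise-linear warp, an 8 kHz Nyquist, and an illustrative 0.85 cutoff; `loglik_fn` stands in for the aligned-model likelihood):

```python
import numpy as np

def warp_mel_filterbank_freqs(freqs, alpha, f_cut=0.85*4000.0):
    """Piecewise-linear VTLN warp: scale by alpha below f_cut, then map
    linearly so the Nyquist frequency (4 kHz assumed) maps to itself."""
    f_nyq = 4000.0
    return np.where(freqs <= f_cut,
                    alpha*freqs,
                    alpha*f_cut + (f_nyq - alpha*f_cut)*(freqs - f_cut)/(f_nyq - f_cut))

def pick_warping_factor(loglik_fn, alphas=np.arange(0.88, 1.13, 0.02)):
    """ML warping-factor estimation by grid search: loglik_fn(alpha) should
    return the likelihood of the alpha-warped features under the models
    aligned with the (first-pass) transcription."""
    scores = [loglik_fn(a) for a in alphas]
    return alphas[int(np.argmax(scores))]
```

In the 2-pass scheme, `loglik_fn` is evaluated per utterance against the first-pass alignment; in the supervised per-speaker variant, it pools the adaptation utterances.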
Databases: Aurora 4
• Task: 5000-word continuous speech recognition
• WSJ0 (16 / 8 kHz) + artificially added noise
  • 2 microphones: Sennheiser, other
  • Filtering: G.712, P.341
  • Noises: car, babble, restaurant, street, airport, train station
• Training (7138 utterances per scenario):
  • Clean: Sennheiser mic.
  • Multi-Condition: Sennheiser and other mics., 75% with artificially added noise at SNR 10-20 dB
  • Noisy: Sennheiser, artificially added noise at SNR 10-20 dB
• Testing (330 utterances per set, or 166 in the reduced set; 8 speakers):
  • SNR: 5-15 dB
  • Sets 1-7: Sennheiser microphone
  • Sets 8-14: other microphone
VTLN Results, Clean Training
VTLN Results, Multi-Condition Training
VTLN Results, Noisy Training
Future Directions for Speaker Normalization
• Estimate warping transforms at the signal level
  • Exploit instantaneous amplitude or frequency signals to estimate the warping parameters and normalize the signal
• Effective integration with model-based adaptation techniques (collaboration with TSI)
WP1: Appendix Slides Aurora 3
ASR Results I: TIMIT-Based Speech Databases (Correct Phone Accuracies, %)

| Features       | TIMIT | NTIMIT | TIMIT+Babble | TIMIT+White | TIMIT+Pink | TIMIT+Car | Av. Rel. Improv. |
|----------------|-------|--------|--------------|-------------|------------|-----------|------------------|
| MFCC*          | 58.40 | 42.42  | 27.71        | 17.72       | 18.60      | 52.75     | -                |
| TEner.CC       | 58.89 | 42.40  | 41.61        | 34.74       | 38.40      | 54.35     | 24.26            |
| MFCC*+IA-Mean  | 59.61 | 43.53  | 39.25        | 26.03       | 31.05      | 56.50     | 17.62            |
| MFCC*+IF-Mean  | 59.34 | 43.70  | 36.87        | 25.38       | 30.92      | 55.30     | 15.58            |
| MFCC*+FMP      | 59.92 | 43.69  | 38.60        | 26.15       | 32.84      | 55.97     | 18.17            |

* MFCC+C0+D+DD, # states = 3, # mixtures = 16
ASR Results IIa (HTK): Aurora 3 Spanish Task (Correct Word Accuracies, %)

| Features                 | WM    | MM    | HM    | Average | Av. Rel. Improv. |
|--------------------------|-------|-------|-------|---------|------------------|
| Aurora Front-End (WI007) | 92.94 | 80.31 | 51.55 | 74.93   | -                |
| MFCC*                    | 93.68 | 92.73 | 65.18 | 83.86   | +35.62%          |
| TEnerCC+log(Ener)+CMS    | 93.64 | 91.61 | 86.85 | 90.70   | +62.90%          |
| MFCC*+IA-Mean            | 94.05 | 92.22 | 77.70 | 87.99   | +52.09%          |
| MFCC*+IF-Mean            | 90.71 | 89.52 | 72.36 | 84.20   | +36.98%          |
| MFCC*+FMP                | 94.41 | 92.46 | 82.73 | 89.87   | +59.59%          |

* MFCC+log(Ener)+D+DD+CMS, # states = 14, # mixtures = 14
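The "Av. Rel. Improv." column is consistent with relative error-rate reduction over the Aurora front-end average accuracy, e.g. for TEnerCC+log(Ener)+CMS:

```latex
\text{Rel. Improv.} = \frac{\mathrm{Acc} - \mathrm{Acc}_{\text{base}}}{100 - \mathrm{Acc}_{\text{base}}} \times 100\%
\qquad \frac{90.70 - 74.93}{100 - 74.93} = \frac{15.77}{25.07} \approx 62.9\%
```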
Aurora 3 Configurations
• HM: 14 states, 12 mixtures
• MM: 16 states, 6 mixtures
• WM: 16 states, 16 mixtures
WP1: Appendix Slides Aurora 2
Baseline: Aurora 2
• Database structure: 2 training scenarios, 3 test sets, [4+4+2] conditions, 7 SNRs per condition: a total of 2x70 tests
• Presentation of selected results:
  • Average over SNR; average over condition
  • Training scenarios: clean vs. multi-condition training
  • Noise level: low vs. high SNR
  • Condition: worst vs. easy conditions
  • Features: MFCC+D+A vs. MFCC+D+A+CMS
• Setup: 18 states [10-22], [3-32] mixtures, MFCC+D+A+CMS
Average Baseline Results: Aurora 2 (average over all SNRs and all conditions)

| Training Scenario | Baseline* | Developed PLAIN | Developed CMS | Best PLAIN | Best CMS |
|-------------------|-----------|-----------------|---------------|------------|----------|
| Clean             | 58.26     | 58.26           | 55.48         | 59.10      | 57.58    |
| Multi             | 78.66     | 81.90           | 84.04         | 82.04      | 84.17    |

Plain: MFCC+D+A; CMS: MFCC+D+A+CMS. Mixture #: clean training (both Plain and CMS) 3; multi training Plain 22, CMS 32. Best: for each condition/noise, select the # mixtures with the best result.
* Average HTK results as reported with the database.