
Towards speaker and environmental robustness in ASR: the HIWIRE project




  1. Towards speaker and environmental robustness in ASR: the HIWIRE project. A. Potamianos1, G. Bouselmi2, D. Dimitriadis3, D. Fohr2, R. Gemello4, I. Illina2, F. Mana4, P. Maragos3, M. Matassoni5, V. Pitsikalis3, J. Ramírez6, E. Sanchez-Soto1, J. Segura6, and P. Svaizer5. 1 Dept. of E.C.E., Tech. Univ. of Crete, Chania, Greece; 2 Speech Group, LORIA, Nancy, France; 3 School of E.C.E., Natl. Tech. Univ. of Athens, Athens, Greece; 4 Loquendo, via Valdellatorre 4, 10149 Torino, Italy; 5 ITC-irst, via Sommarive 18, Povo (TN), Italy; 6 Dept. of Signal Theory, Univ. of Granada, Spain

  2. Outline • Introduction: the HIWIRE project • Goals and objectives • Research areas: • Environmental robustness • Speaker robustness • Experimental results • Ongoing work

  3. HIWIRE project • http://www.hiwire.org • Goals: environment- and speaker-robust ASR • Showcase: fixed cockpit platform, PDA platform • Industrial partners: Thales Avionics, Loquendo • Research partners: LORIA, TUC, NTUA, UGR, ITC-IRST, Thales Research • FP6 project: 6/2004 to 5/2007

  4. Research areas • Environmental robustness • Multi-microphone ASR • Robust feature extraction • Feature fusion and audio-visual ASR • Feature equalization • Voice-activity detection • Speech enhancement • Speaker robustness • Model transformation • Acoustic modeling for non-native speech

  5. Multi-microphone ASR: Outline • Beamforming and Adaptive Noise Cancellation • Environmental Acoustics Estimation

  6. Beamforming: D&S. The availability of multi-channel signals allows the desired source to be captured selectively. • Issues: • estimation of reliable TDOAs • Method: • CSP analysis over multiple frames • Advantages: • robustness • reduced computational cost
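Slide 6 in code form: a minimal numpy sketch of CSP-based TDOA estimation (CSP, the cross-power spectrum phase, is the whitened cross-correlation also known as GCC-PHAT) followed by delay-and-sum alignment. The function names, the pairing of each channel against a fixed reference, and the whole-sample shifts are illustrative assumptions; the slide's averaging of the CSP over multiple frames is omitted for brevity.

```python
import numpy as np

def csp_tdoa(ref, ch, fs, max_delay_s=1e-3):
    """TDOA via CSP / GCC-PHAT: whiten the cross-power spectrum by its
    magnitude so only phase (i.e. delay) information remains, then pick
    the peak of its inverse transform over physically plausible lags."""
    n = 2 * max(len(ref), len(ch))
    R, C = np.fft.rfft(ref, n), np.fft.rfft(ch, n)
    cross = R * np.conj(C)
    csp = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    m = int(max_delay_s * fs)
    lags = np.concatenate([np.arange(m + 1), np.arange(-m, 0)])
    peaks = np.concatenate([csp[:m + 1], csp[-m:]])
    return lags[np.argmax(peaks)] / fs

def delay_and_sum(channels, fs, ref=0):
    """Shift every channel onto the reference by its estimated TDOA
    (rounded to whole samples) and average the aligned signals."""
    out = np.asarray(channels[ref], dtype=float).copy()
    for i, ch in enumerate(channels):
        if i != ref:
            shift = int(round(csp_tdoa(channels[ref], ch, fs) * fs))
            out += np.roll(ch, shift)
    return out / len(channels)
```

The PHAT whitening is what buys the robustness claimed on the slide: discarding spectral magnitude leaves only delay information, which degrades gracefully in noise and reverberation.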

  7. D&S with MarkIII • Test set: • set N1_SNR0 of MC-TIDIGITS (cockpit noise), MarkIII channels • clean models, trained on original TIDIGITS • Results reported as relative WER reduction (WERR, %)

  8. Robust Features for ASR • Modulation Features • AM-FM Modulations • Teager Energy Cepstrum • Fractal Features • Dynamical Denoising • Correlation Dimension • Multiscale Fractal Dimension • Hybrid-Merged Features • Improvements: up to +62% (Aurora 3), up to +36% (Aurora 2), up to +61% (Aurora 2)

  9. Speech Modulation Features • Filterbank Design • Short-Term AM-FM Modulation Features • Short-Term Mean Inst. Amplitude IA-Mean • Short-Term Mean Inst. Frequency IF-Mean • Frequency Modulation Percentages FMP • Short-Term Energy Modulation Features • Average Teager Energy Cepstrum Coefficients TECC

  10. Modulation Acoustic Features (block diagram): speech → regularization + multiband filtering → demodulation (nonlinear processing) → statistical processing → robust feature transformation/selection, with V.A.D. Outputs: AM-FM modulation features (Mean Inst. Ampl. IA-Mean, Mean Inst. Freq. IF-Mean, Freq. Mod. Percent. FMP) and energy features (Teager Energy Cepstrum Coeff. TECC)
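To make the demodulation stage concrete, here is a minimal numpy sketch of the Teager-Kaiser energy operator and DESA-2 energy separation, one standard way (due to Maragos, Kaiser, and Quatieri) of recovering instantaneous amplitude and frequency from a bandpass signal. The filterbank in front of it and the short-term averaging into IA-Mean/IF-Mean are only indicated in comments, as assumptions about the surrounding pipeline.

```python
import numpy as np

def teager(x):
    """Discrete Teager-Kaiser energy operator:
    Psi[x](n) = x(n)^2 - x(n-1) * x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x, eps=1e-12):
    """DESA-2 energy separation: estimate instantaneous amplitude |a(n)|
    and frequency Omega(n) (rad/sample) of a bandpass signal from Teager
    energies of the signal and of its symmetric difference
    y(n) = x(n+1) - x(n-1)."""
    x = np.asarray(x, dtype=float)
    y = x[2:] - x[:-2]
    psi_x = teager(x)[1:-1]          # trimmed so both arrays align
    psi_y = teager(y)
    omega = 0.5 * np.arccos(np.clip(1.0 - psi_y / (2.0 * psi_x + eps),
                                    -1.0, 1.0))
    amp = 2.0 * np.abs(psi_x) / np.sqrt(np.abs(psi_y) + eps)
    return amp, omega

# Assumed per-frame statistics, as named on the slide: split speech into
# subbands (e.g. a Gabor filterbank), demodulate each band with desa2,
# then take short-term means -> IA-Mean, IF-Mean (and FMP from the
# spread of the instantaneous frequency).
```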

  11. TIMIT-based Speech Databases • TIMIT Database: • Training Set: 3696 sentences, ~35 phonemes/utterance • Testing Set: 1344 utterances, 46680 phonemes • Sampling Frequency 16 kHz • Feature Vectors: • MFCC+C0+AM-FM + 1st and 2nd Time Derivatives • Stream Weights: (1) for MFCC and (2) for AM-FM • 3-state left-right HMMs, 16 mixtures • All-pair, unweighted grammar • Performance Criterion: Phone Accuracy Rates (%) • Back-end System: HTK v3.2.0

  12. Results: TIMIT+Noise. Up to +106%

  13. Aurora 3 - Spanish • Connected digits, sampling frequency 8 kHz • Training Set: • WM (Well-Matched): 3392 utterances (quiet 532, low 1668, and high noise 1192) • MM (Medium-Mismatch): 1607 utterances (quiet 396 and low noise 1211) • HM (High-Mismatch): 1696 utterances (quiet 266, low 834, and high noise 596) • Testing Set: • WM: 1522 utterances (quiet 260, low 754, and high noise 508), 8056 digits • MM: 850 utterances (quiet 0, low 0, and high noise 850), 4543 digits • HM: 631 utterances (quiet 0, low 377, and high noise 254), 3325 digits • 2 back-end ASR systems (HTK and BLasr) • Feature Vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC • All-pair, unweighted grammar (or word-pair grammar) • Performance Criterion: Word (digit) Accuracy Rates

  14. Results: Aurora 3. Up to +62%

  15. Fractal Features (block diagram): N-d signal → noisy embedding → local SVD geometrical filtering → filtered embedding → N-d cleaned speech signal. Features: Filtered Dynamics Correlation Dimension (FDCD) and Multiscale Fractal Dimension (MFD)
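To anchor the correlation-dimension feature, here is a Grassberger-Procaccia style sketch on a plain delay embedding. The SVD-based geometrical filtering of the embedding (the "dynamical denoising" of slide 8) is omitted, and every parameter choice (embedding dimension, lag, radii range, point cap) is an illustrative assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist

def delay_embed(x, dim, lag):
    """Time-delay embedding of a scalar series into R^dim."""
    n = len(x) - (dim - 1) * lag
    return np.stack([x[i * lag : i * lag + n] for i in range(dim)], axis=1)

def correlation_dimension(x, dim=5, lag=2, n_radii=10, max_points=1000):
    """Grassberger-Procaccia estimate: the correlation sum C(r) is the
    fraction of embedded-point pairs closer than r; the correlation
    dimension is the slope of log C(r) vs log r in the scaling region."""
    pts = delay_embed(np.asarray(x, float), dim, lag)[:max_points]
    d = pdist(pts)                                   # all pairwise distances
    radii = np.logspace(np.log10(np.percentile(d, 5)),
                        np.log10(np.percentile(d, 50)), n_radii)
    c = np.array([np.mean(d < r) for r in radii])
    return np.polyfit(np.log(radii), np.log(c + 1e-12), 1)[0]
```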

  16. Databases: Aurora 2 • Task: Speaker-Independent Recognition of Digit Sequences • TI-Digits at 8 kHz • Training (8440 utterances per scenario, 55M/55F) • Clean (8 kHz, G712) • Multi-Condition (8 kHz, G712) • 4 noises (artificial): subway, babble, car, exhibition • 5 SNRs: 5, 10, 15, 20 dB, clean • Testing, artificially added noise • 7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean • A: noises as in multi-condition training, G712 (28028 utterances) • B: restaurant, street, airport, train station, G712 (28028 utterances) • C: subway, street (MIRS) (14014 utterances)

  17. Results: Aurora 2. Up to +61%

  18. Feature Fusion • Merge synchronous feature streams • Investigate both supervised and unsupervised algorithms

  19. Feature Fusion: Multi-Stream • Compute "optimal" exponent weights for each stream [HMM Gaussian-mixture formulation; similar expressions for MM, naïve Bayes, Euclidean/Mahalanobis classifiers] • Optimality in the sense of minimizing "total classification error"

  20. Multi-Stream Classification • Two-class problem w1, w2 • Feature vector x is broken up into two independent streams x1 and x2 • Stream weights s1 and s2 are used to "equalize" the "probabilities"

  21. Multi-Stream Classification • Bayes classification decision • Non-unity weights increase Bayes error, but estimation/modeling error may decrease • Stream weights can decrease total error • "Optimal" weights minimize the estimation error variance σz²
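A minimal sketch of the weighted decision rule of slides 20-21, assuming a single Gaussian per class and stream and equal class priors (the slides use HMM Gaussian mixtures; the class labels, means, and test points below are made up for illustration). Raising each stream likelihood to its weight is simply a linear combination in the log domain.

```python
from scipy.stats import multivariate_normal as mvn

def two_stream_decision(x1, x2, models, s1, s2):
    """Weighted two-stream Bayes decision:
    argmax_w  s1*log p(x1|w) + s2*log p(x2|w),
    i.e. argmax_w  p(x1|w)^s1 * p(x2|w)^s2 with equal priors.
    `models` maps class label -> (stream-1 Gaussian, stream-2 Gaussian)."""
    score = {w: s1 * g1.logpdf(x1) + s2 * g2.logpdf(x2)
             for w, (g1, g2) in models.items()}
    return max(score, key=score.get)

# Illustrative two-class setup (w1, w2 as on the slide):
models = {
    "w1": (mvn(mean=[0.0], cov=1.0), mvn(mean=[0.0, 0.0], cov=1.0)),
    "w2": (mvn(mean=[2.0], cov=1.0), mvn(mean=[1.5, 1.5], cov=1.0)),
}
print(two_stream_decision([1.8], [1.2, 1.4], models, s1=0.7, s2=0.3))
```

Setting s1 = s2 = 1 recovers the plain Bayes decision; the point of slides 21-23 is that unequal weights can trade a little Bayes error for a larger reduction in estimation error.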

  22. Optimal Stream Weights • Equal error rate in single-stream classifiers → optimal stream weights are inversely proportional to the total stream estimation error variance

  23. Optimal Stream Weights • Equal estimation error variance in each stream → optimal weights are approximately inversely proportional to the single-stream classification error (both relations in symbols below)
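In symbols (notation assumed, not given on the slides: s_i the weight of stream i, sigma_i^2 its estimation-error variance, e_i its single-stream classification error), slides 22 and 23 amount to:

```latex
% Slide 22: equal single-stream error rates (e_1 = e_2)
\frac{s_1}{s_2} \;=\; \frac{\sigma_2^2}{\sigma_1^2}
% Slide 23: equal estimation-error variances (\sigma_1^2 = \sigma_2^2)
\frac{s_1}{s_2} \;\approx\; \frac{e_2}{e_1}
```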

  24. Experimental Results • Subset of CUAVE database used: • 36 speakers (30 training, 6 testing), 5 sequences of 10 digits per speaker • Training set: 1500 digits (30x5x10) • Test set: 300 digits (6x5x10) • Features: • Audio: 39 features (MFCC_D_A) • Visual: 105 features (ROIDCT_D_A) • Multi-stream HMM models, middle integration: • 8-state, left-to-right whole-digit HMMs • Single Gaussian mixture • AV-HMM uses separate audio and video feature streams

  25. Optimal Stream Weights: Results • Assumption: σV²/σA² = 2, SNR-independent • Correlation 0.96

  26. Parametric non-linear equalization • Parametric histogram equalization • Smoother estimates • Bi-modal transformation (speech vs. non-speech)
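A minimal sketch of parametric histogram equalization for one feature dimension, assuming a single-Gaussian model of the test distribution mapped onto a reference Gaussian by CDF matching; this is what makes the estimate smoother than a raw histogram. The bi-modal speech/non-speech variant on the slide would fit a two-component mixture and equalize each mode, which is not shown here.

```python
import numpy as np
from scipy.stats import norm

def parametric_heq(feat, ref_mean=0.0, ref_std=1.0):
    """Parametric histogram equalization of one feature dimension:
    model the observed distribution with a Gaussian (a smoother estimate
    than a raw histogram) and map it onto a reference Gaussian by CDF
    matching. In the Gaussian-to-Gaussian case this reduces to an affine
    transform, but the CDF form generalizes (e.g. to mixtures)."""
    feat = np.asarray(feat, dtype=float)
    mu, sigma = feat.mean(), feat.std() + 1e-8
    u = norm.cdf(feat, loc=mu, scale=sigma)           # test-data CDF
    u = np.clip(u, 1e-6, 1.0 - 1e-6)                  # keep ppf finite
    return norm.ppf(u, loc=ref_mean, scale=ref_std)   # reference quantiles
```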

  27. Voice Activity Detection • Bi-spectrum based VAD • Support vector machine based VAD • Combination of VAD with speech enhancement
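The bispectrum- and SVM-based VADs above are the project partners' published techniques; the toy sketch below only illustrates the SVM framing (frame-level features plus a binary classifier), with deliberately simple features (log energy, zero-crossing rate) standing in for the long-term spectral features actually used.

```python
import numpy as np
from sklearn.svm import SVC

def vad_features(x, frame=256, hop=128):
    """Per-frame VAD features: log energy and zero-crossing rate.
    (Stand-ins; the HIWIRE VADs use richer spectral cues.)"""
    feats = []
    for i in range(0, len(x) - frame + 1, hop):
        f = x[i:i + frame]
        log_e = np.log(np.sum(f ** 2) + 1e-12)
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2
        feats.append([log_e, zcr])
    return np.array(feats)

# Usage: train on frames with speech/non-speech labels, then gate the
# recognizer, or feed the decisions to the enhancement stage as on the
# slide's third bullet:
# clf = SVC(kernel="rbf").fit(vad_features(train_wav), frame_labels)
# speech_mask = clf.predict(vad_features(test_wav))
```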

  28. Speech Enhancement • Modified Wiener filtering with filter depending on global SNR • Modified Ephraim-Malah enhancement: based on the E-M spectral attenuation rule
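A sketch in the spirit of the two items above: STFT analysis, a noise PSD from leading frames, the decision-directed a priori SNR estimate that underlies the Ephraim-Malah rule, and a plain Wiener gain. The actual E-M spectral-attenuation gain, and the global-SNR-dependent modification of the Wiener filter mentioned on the slide, are more involved; all frame parameters here are illustrative.

```python
import numpy as np

def wiener_enhance(x, frame=256, hop=128, noise_frames=10, alpha=0.98):
    """Spectral-domain enhancement sketch: noise PSD from the first
    frames (assumed speech-free), decision-directed a priori SNR,
    Wiener gain H = xi / (1 + xi), normalized overlap-add resynthesis."""
    x = np.asarray(x, dtype=float)
    win = np.hanning(frame)
    starts = range(0, len(x) - frame + 1, hop)
    spec = np.array([np.fft.rfft(x[s:s + frame] * win) for s in starts])
    noise_psd = np.mean(np.abs(spec[:noise_frames]) ** 2, axis=0)
    out, wsum = np.zeros(len(x)), np.zeros(len(x))
    xi_prev = np.ones_like(noise_psd)
    for s, spectrum in zip(starts, spec):
        gamma = np.abs(spectrum) ** 2 / (noise_psd + 1e-12)  # a posteriori SNR
        xi = alpha * xi_prev + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)
        gain = xi / (1.0 + xi)                               # Wiener gain
        xi_prev = gain ** 2 * gamma                          # decision-directed memory
        out[s:s + frame] += np.fft.irfft(gain * spectrum, frame) * win
        wsum[s:s + frame] += win ** 2
    return out / np.maximum(wsum, 1e-8)
```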

  29. Non-Native Speech Recognition • Build non-native models by combining English and native models • Use phone confusion between English phones and native acoustic models to add alternate model paths • Extract the confusion matrix automatically by running phone recognition with the native models • Phone pronunciation depends on the word's graphemes: (English phone, grapheme) → French phone
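A minimal sketch of the confusion-matrix extraction step: run the native-model phone recognizer on English speech, time-align its output with the canonical English transcription, and count which native phone is recognized against each English phone. The alignment itself and the pruning threshold are assumptions, not specified by the slide.

```python
from collections import Counter, defaultdict

def confusion_rules(aligned_pairs, min_prob=0.1):
    """Count native phones recognized against each canonical English
    phone and keep alternatives above a pruning threshold.
    `aligned_pairs` is an iterable of (english_phone, native_phone)
    pairs from time-aligning the native recognizer's output with the
    canonical English transcript (alignment step not shown)."""
    counts = defaultdict(Counter)
    for eng, nat in aligned_pairs:
        counts[eng][nat] += 1
    rules = {}
    for eng, ctr in counts.items():
        total = sum(ctr.values())
        rules[eng] = [(nat, n / total) for nat, n in ctr.most_common()
                      if n / total >= min_prob]
    return rules

# e.g. rules["t"] might come out as [("t", 0.7), ("k", 0.2)] -- the
# /t/ -> {/t/, /k/} alternate model paths shown on the next slide.
```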

  30. Extracted rules: example for the English phone /t/. The English model /t/ is given alternate paths through French models, e.g. /t/ → /t/ and /t/ → /k/

  31. Graphemic constraints • Example: • APPROACH /ah p r ow ch/ • APPROACH (A, ah) (PP, p) (R, r) (OA, ow) (CH, ch) • Alignment between graphemes and phones for each word of the lexicon • Lexicon modification: add graphemes for each word • Confusion rules extraction • (grapheme, English phone) → list of non-native phones • Example: (A, ah) → a
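A sketch of how the graphemically constrained rules might be applied when expanding the lexicon: each (grapheme, English phone) slot of an aligned entry is replaced by its rule-given alternatives, falling back to the canonical phone when no rule fires. The rule format mirrors the (A, ah) → a example from the slide; everything else (function name, the rule including the canonical phone) is an assumption.

```python
from itertools import product

def expand_pronunciations(gp_seq, rules):
    """Expand one grapheme-phone aligned lexicon entry into non-native
    variants: each (grapheme, english_phone) slot is replaced by its
    rule-given alternatives (assumed to include the canonical phone when
    it survives pruning), falling back to the canonical phone alone."""
    options = [rules.get((g, p), [p]) for g, p in gp_seq]
    return [list(pron) for pron in product(*options)]

# The slide's APPROACH example:
approach = [("A", "ah"), ("PP", "p"), ("R", "r"), ("OA", "ow"), ("CH", "ch")]
rules = {("A", "ah"): ["ah", "a"]}        # from (A, ah) -> a on the slide
print(expand_pronunciations(approach, rules))
# [['ah', 'p', 'r', 'ow', 'ch'], ['a', 'p', 'r', 'ow', 'ch']]
```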

  32. Experiments: HIWIRE Database

  33. Ongoing Work • Front-end • combination and integration of algorithms • Fixed-platform demonstration • non-native speech demo • PDA-platform demonstration • Ongoing research
