
Towards speaker and environmental robustness in ASR: the HIWIRE project




  1. Towards speaker and environmental robustness in ASR: the HIWIRE project. A. Potamianos1, G. Bouselmi2, D. Dimitriadis3, D. Fohr2, R. Gemello4, I. Illina2, F. Mana4, P. Maragos3, M. Matassoni5, V. Pitsikalis3, J. Ramírez6, E. Sanchez-Soto1, J. Segura6, and P. Svaizer5. 1 Dept. of E.C.E., Tech. Univ. of Crete, Chania, Greece; 2 Speech Group, LORIA, Nancy, France; 3 School of E.C.E., Natl. Tech. Univ. of Athens, Athens, Greece; 4 Loquendo, via Valdellatorre 4, 10149 Torino, Italy; 5 ITC-irst, via Sommarive 18, Povo (TN), Italy; 6 Dept. of Signal Theory, Univ. of Granada, Spain

  2. Outline • Introduction: the HIWIRE project • Goals and objectives • Research areas: • Environmental robustness • Speaker robustness • Experimental results • Ongoing work

  3. HIWIRE project • http://www.hiwire.org • Goals: environment- and speaker-robust ASR • Showcase: fixed cockpit platform, PDA platform • Industrial partners: Thales Avionics, Loquendo • Research partners: LORIA, TUC, NTUA, UGR, ITC-IRST, Thales Research • FP6 project: 6/2004 to 5/2007

  4. Research areas • Environmental robustness • Multi-microphone ASR • Robust feature extraction • Feature fusion and audio-visual ASR • Feature equalization • Voice-activity detection • Speech enhancement • Speaker robustness • Model transformation • Acoustic modeling for non-native speech

  5. Multi-microphone ASR: Outline • Beamforming and Adaptive Noise Cancellation • Environmental Acoustics Estimation

  6. Beamforming: D&S. The availability of multi-channel signals allows the desired source to be captured selectively. • Issues: • estimation of reliable TDOAs • Method: • CSP analysis over multiple frames • Advantages: • robustness • reduced computational cost
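Slide 6 in code form: a minimal numpy sketch of CSP-based TDOA estimation (CSP, the cross-power spectrum phase, is the whitened cross-correlation also known as GCC-PHAT) followed by delay-and-sum alignment. The function names, the pairing of each channel against a fixed reference, and the whole-sample shifts are illustrative assumptions; the slide's averaging of the CSP over multiple frames is omitted for brevity.

```python
import numpy as np

def csp_tdoa(ref, ch, fs, max_delay_s=1e-3):
    """TDOA via CSP / GCC-PHAT: whiten the cross-power spectrum by its
    magnitude so only phase (i.e. delay) information remains, then pick
    the peak of its inverse transform over physically plausible lags."""
    n = 2 * max(len(ref), len(ch))
    R, C = np.fft.rfft(ref, n), np.fft.rfft(ch, n)
    cross = R * np.conj(C)
    csp = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    m = int(max_delay_s * fs)
    lags = np.concatenate([np.arange(m + 1), np.arange(-m, 0)])
    peaks = np.concatenate([csp[:m + 1], csp[-m:]])
    return lags[np.argmax(peaks)] / fs

def delay_and_sum(channels, fs, ref=0):
    """Shift every channel onto the reference by its estimated TDOA
    (rounded to whole samples) and average the aligned signals."""
    out = np.asarray(channels[ref], dtype=float).copy()
    for i, ch in enumerate(channels):
        if i != ref:
            shift = int(round(csp_tdoa(channels[ref], ch, fs) * fs))
            out += np.roll(ch, shift)
    return out / len(channels)
```

The PHAT whitening is what buys the robustness claimed on the slide: discarding spectral magnitude leaves only delay information, which degrades gracefully in noise and reverberation.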

  7. D&S with MarkIII • Test set: • set N1_SNR0 of MC-TIDIGITS (cockpit noise), MarkIII channels • clean models, trained on original TIDIGITS • Results reported as relative WER reduction (WERR, %)

  8. Robust Features for ASR • Modulation Features • AM-FM Modulations • Teager Energy Cepstrum • Fractal Features • Dynamical Denoising • Correlation Dimension • Multiscale Fractal Dimension • Hybrid-Merged Features • Improvements: up to +62% (Aurora 3), up to +36% (Aurora 2), up to +61% (Aurora 2)

  9. Speech Modulation Features • Filterbank Design • Short-Term AM-FM Modulation Features • Short-Term Mean Inst. Amplitude IA-Mean • Short-Term Mean Inst. Frequency IF-Mean • Frequency Modulation Percentages FMP • Short-Term Energy Modulation Features • Average Teager Energy Cepstrum Coefficients TECC

  10. Modulation Acoustic Features (block diagram): speech → regularization + multiband filtering → demodulation (nonlinear processing) → statistical processing → robust feature transformation/selection, with V.A.D. Outputs: AM-FM modulation features (Mean Inst. Ampl. IA-Mean, Mean Inst. Freq. IF-Mean, Freq. Mod. Percent. FMP) and energy features (Teager Energy Cepstrum Coeff. TECC)
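To make the demodulation stage concrete, here is a minimal numpy sketch of the Teager-Kaiser energy operator and DESA-2 energy separation, one standard way (due to Maragos, Kaiser, and Quatieri) of recovering instantaneous amplitude and frequency from a bandpass signal. The filterbank in front of it and the short-term averaging into IA-Mean/IF-Mean are only indicated in comments, as assumptions about the surrounding pipeline.

```python
import numpy as np

def teager(x):
    """Discrete Teager-Kaiser energy operator:
    Psi[x](n) = x(n)^2 - x(n-1) * x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x, eps=1e-12):
    """DESA-2 energy separation: estimate instantaneous amplitude |a(n)|
    and frequency Omega(n) (rad/sample) of a bandpass signal from Teager
    energies of the signal and of its symmetric difference
    y(n) = x(n+1) - x(n-1)."""
    x = np.asarray(x, dtype=float)
    y = x[2:] - x[:-2]
    psi_x = teager(x)[1:-1]          # trimmed so both arrays align
    psi_y = teager(y)
    omega = 0.5 * np.arccos(np.clip(1.0 - psi_y / (2.0 * psi_x + eps),
                                    -1.0, 1.0))
    amp = 2.0 * np.abs(psi_x) / np.sqrt(np.abs(psi_y) + eps)
    return amp, omega

# Assumed per-frame statistics, as named on the slide: split speech into
# subbands (e.g. a Gabor filterbank), demodulate each band with desa2,
# then take short-term means -> IA-Mean, IF-Mean (and FMP from the
# spread of the instantaneous frequency).
```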

  11. TIMIT-based Speech Databases • TIMIT Database: • Training Set: 3696 sentences, ~35 phonemes/utterance • Testing Set: 1344 utterances, 46680 phonemes • Sampling Frequency 16 kHz • Feature Vectors: • MFCC+C0+AM-FM + 1st and 2nd Time Derivatives • Stream Weights: (1) for MFCC and (2) for AM-FM • 3-state left-right HMMs, 16 mixtures • All-pair, unweighted grammar • Performance Criterion: Phone Accuracy Rates (%) • Back-end System: HTK v3.2.0

  12. Results: TIMIT+Noise. Up to +106%

  13. Aurora 3 - Spanish • Connected digits, sampling frequency 8 kHz • Training Set: • WM (Well-Matched): 3392 utterances (quiet 532, low 1668, and high noise 1192) • MM (Medium-Mismatch): 1607 utterances (quiet 396 and low noise 1211) • HM (High-Mismatch): 1696 utterances (quiet 266, low 834, and high noise 596) • Testing Set: • WM: 1522 utterances (quiet 260, low 754, and high noise 508), 8056 digits • MM: 850 utterances (quiet 0, low 0, and high noise 850), 4543 digits • HM: 631 utterances (quiet 0, low 377, and high noise 254), 3325 digits • 2 back-end ASR systems (HTK and BLasr) • Feature Vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC • All-pair, unweighted grammar (or word-pair grammar) • Performance Criterion: Word (digit) Accuracy Rates

  14. Results: Aurora 3. Up to +62%

  15. Fractal Features (block diagram): N-d signal → noisy embedding → local SVD geometrical filtering → filtered embedding → N-d cleaned speech signal. Features: Filtered Dynamics Correlation Dimension (FDCD) and Multiscale Fractal Dimension (MFD)
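To anchor the correlation-dimension feature, here is a Grassberger-Procaccia style sketch on a plain delay embedding. The SVD-based geometrical filtering of the embedding (the "dynamical denoising" of slide 8) is omitted, and every parameter choice (embedding dimension, lag, radii range, point cap) is an illustrative assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist

def delay_embed(x, dim, lag):
    """Time-delay embedding of a scalar series into R^dim."""
    n = len(x) - (dim - 1) * lag
    return np.stack([x[i * lag : i * lag + n] for i in range(dim)], axis=1)

def correlation_dimension(x, dim=5, lag=2, n_radii=10, max_points=1000):
    """Grassberger-Procaccia estimate: the correlation sum C(r) is the
    fraction of embedded-point pairs closer than r; the correlation
    dimension is the slope of log C(r) vs log r in the scaling region."""
    pts = delay_embed(np.asarray(x, float), dim, lag)[:max_points]
    d = pdist(pts)                                   # all pairwise distances
    radii = np.logspace(np.log10(np.percentile(d, 5)),
                        np.log10(np.percentile(d, 50)), n_radii)
    c = np.array([np.mean(d < r) for r in radii])
    return np.polyfit(np.log(radii), np.log(c + 1e-12), 1)[0]
```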

  16. Databases: Aurora 2 • Task: Speaker-Independent Recognition of Digit Sequences • TI-Digits at 8 kHz • Training (8440 utterances per scenario, 55M/55F) • Clean (8 kHz, G712) • Multi-Condition (8 kHz, G712) • 4 noises (artificial): subway, babble, car, exhibition • 5 SNRs: 5, 10, 15, 20 dB, clean • Testing, artificially added noise • 7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean • A: noises as in multi-condition training, G712 (28028 utterances) • B: restaurant, street, airport, train station, G712 (28028 utterances) • C: subway, street (MIRS) (14014 utterances)

  17. Results: Aurora 2. Up to +61%

  18. Feature Fusion • Merge synchronous feature streams • Investigate both supervised and unsupervised algorithms

  19. Feature Fusion: Multi-Stream • Compute "optimal" exponent weights for each stream [HMM Gaussian-mixture formulation; similar expressions for MM, naïve Bayes, Euclidean/Mahalanobis classifiers] • Optimality in the sense of minimizing "total classification error"

  20. Multi-Stream Classification • Two-class problem w1, w2 • Feature vector x is broken up into two independent streams x1 and x2 • Stream weights s1 and s2 are used to "equalize" the "probabilities"

  21. Multi-Stream Classification • Bayes classification decision • Non-unity weights increase Bayes error, but estimation/modeling error may decrease • Stream weights can decrease total error • "Optimal" weights minimize the estimation error variance σz²
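A minimal sketch of the weighted decision rule of slides 20-21, assuming a single Gaussian per class and stream and equal class priors (the slides use HMM Gaussian mixtures; the class labels, means, and test points below are made up for illustration). Raising each stream likelihood to its weight is simply a linear combination in the log domain.

```python
from scipy.stats import multivariate_normal as mvn

def two_stream_decision(x1, x2, models, s1, s2):
    """Weighted two-stream Bayes decision:
    argmax_w  s1*log p(x1|w) + s2*log p(x2|w),
    i.e. argmax_w  p(x1|w)^s1 * p(x2|w)^s2 with equal priors.
    `models` maps class label -> (stream-1 Gaussian, stream-2 Gaussian)."""
    score = {w: s1 * g1.logpdf(x1) + s2 * g2.logpdf(x2)
             for w, (g1, g2) in models.items()}
    return max(score, key=score.get)

# Illustrative two-class setup (w1, w2 as on the slide):
models = {
    "w1": (mvn(mean=[0.0], cov=1.0), mvn(mean=[0.0, 0.0], cov=1.0)),
    "w2": (mvn(mean=[2.0], cov=1.0), mvn(mean=[1.5, 1.5], cov=1.0)),
}
print(two_stream_decision([1.8], [1.2, 1.4], models, s1=0.7, s2=0.3))
```

Setting s1 = s2 = 1 recovers the plain Bayes decision; the point of slides 21-23 is that unequal weights can trade a little Bayes error for a larger reduction in estimation error.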

  22. Optimal Stream Weights • Equal error rate in single-stream classifiers → optimal stream weights are inversely proportional to the total stream estimation error variance

  23. Optimal Stream Weights • Equal estimation error variance in each stream → optimal weights are approximately inversely proportional to the single-stream classification error (both relations in symbols below)
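In symbols (notation assumed, not given on the slides: s_i the weight of stream i, sigma_i^2 its estimation-error variance, e_i its single-stream classification error), slides 22 and 23 amount to:

```latex
% Slide 22: equal single-stream error rates (e_1 = e_2)
\frac{s_1}{s_2} \;=\; \frac{\sigma_2^2}{\sigma_1^2}
% Slide 23: equal estimation-error variances (\sigma_1^2 = \sigma_2^2)
\frac{s_1}{s_2} \;\approx\; \frac{e_2}{e_1}
```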

  24. Experimental Results • Subset of CUAVE database used: • 36 speakers (30 training, 6 testing), 5 sequences of 10 digits per speaker • Training set: 1500 digits (30x5x10) • Test set: 300 digits (6x5x10) • Features: • Audio: 39 features (MFCC_D_A) • Visual: 105 features (ROIDCT_D_A) • Multi-stream HMM models, middle integration: • 8-state, left-to-right whole-digit HMMs • Single Gaussian mixture • AV-HMM uses separate audio and video feature streams

  25. Optimal Stream Weights: Results • Assumption: σV²/σA² = 2, SNR-independent • Correlation 0.96

  26. Parametric non-linear equalization • Parametric histogram equalization • Smoother estimates • Bi-modal transformation (speech vs. non-speech)
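A minimal sketch of parametric histogram equalization for one feature dimension, assuming a single-Gaussian model of the test distribution mapped onto a reference Gaussian by CDF matching; this is what makes the estimate smoother than a raw histogram. The bi-modal speech/non-speech variant on the slide would fit a two-component mixture and equalize each mode, which is not shown here.

```python
import numpy as np
from scipy.stats import norm

def parametric_heq(feat, ref_mean=0.0, ref_std=1.0):
    """Parametric histogram equalization of one feature dimension:
    model the observed distribution with a Gaussian (a smoother estimate
    than a raw histogram) and map it onto a reference Gaussian by CDF
    matching. In the Gaussian-to-Gaussian case this reduces to an affine
    transform, but the CDF form generalizes (e.g. to mixtures)."""
    feat = np.asarray(feat, dtype=float)
    mu, sigma = feat.mean(), feat.std() + 1e-8
    u = norm.cdf(feat, loc=mu, scale=sigma)           # test-data CDF
    u = np.clip(u, 1e-6, 1.0 - 1e-6)                  # keep ppf finite
    return norm.ppf(u, loc=ref_mean, scale=ref_std)   # reference quantiles
```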

  27. Voice Activity Detection • Bi-spectrum based VAD • Support vector machine based VAD • Combination of VAD with speech enhancement
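The bispectrum- and SVM-based VADs above are the project partners' published techniques; the toy sketch below only illustrates the SVM framing (frame-level features plus a binary classifier), with deliberately simple features (log energy, zero-crossing rate) standing in for the long-term spectral features actually used.

```python
import numpy as np
from sklearn.svm import SVC

def vad_features(x, frame=256, hop=128):
    """Per-frame VAD features: log energy and zero-crossing rate.
    (Stand-ins; the HIWIRE VADs use richer spectral cues.)"""
    feats = []
    for i in range(0, len(x) - frame + 1, hop):
        f = x[i:i + frame]
        log_e = np.log(np.sum(f ** 2) + 1e-12)
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2
        feats.append([log_e, zcr])
    return np.array(feats)

# Usage: train on frames with speech/non-speech labels, then gate the
# recognizer, or feed the decisions to the enhancement stage as on the
# slide's third bullet:
# clf = SVC(kernel="rbf").fit(vad_features(train_wav), frame_labels)
# speech_mask = clf.predict(vad_features(test_wav))
```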

  28. Speech Enhancement • Modified Wiener filtering with filter depending on global SNR • Modified Ephraim-Malah enhancement: based on the E-M spectral attenuation rule
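A sketch in the spirit of the two items above: STFT analysis, a noise PSD from leading frames, the decision-directed a priori SNR estimate that underlies the Ephraim-Malah rule, and a plain Wiener gain. The actual E-M spectral-attenuation gain, and the global-SNR-dependent modification of the Wiener filter mentioned on the slide, are more involved; all frame parameters here are illustrative.

```python
import numpy as np

def wiener_enhance(x, frame=256, hop=128, noise_frames=10, alpha=0.98):
    """Spectral-domain enhancement sketch: noise PSD from the first
    frames (assumed speech-free), decision-directed a priori SNR,
    Wiener gain H = xi / (1 + xi), normalized overlap-add resynthesis."""
    x = np.asarray(x, dtype=float)
    win = np.hanning(frame)
    starts = range(0, len(x) - frame + 1, hop)
    spec = np.array([np.fft.rfft(x[s:s + frame] * win) for s in starts])
    noise_psd = np.mean(np.abs(spec[:noise_frames]) ** 2, axis=0)
    out, wsum = np.zeros(len(x)), np.zeros(len(x))
    xi_prev = np.ones_like(noise_psd)
    for s, spectrum in zip(starts, spec):
        gamma = np.abs(spectrum) ** 2 / (noise_psd + 1e-12)  # a posteriori SNR
        xi = alpha * xi_prev + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)
        gain = xi / (1.0 + xi)                               # Wiener gain
        xi_prev = gain ** 2 * gamma                          # decision-directed memory
        out[s:s + frame] += np.fft.irfft(gain * spectrum, frame) * win
        wsum[s:s + frame] += win ** 2
    return out / np.maximum(wsum, 1e-8)
```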

  29. Non-Native Speech Recognition • Build non-native models by combining English and native models • Use phone confusion between English phones and native acoustic models to add alternate model paths • Extract the confusion matrix automatically by running phone recognition with the native models • Phone pronunciation depends on the word's graphemes: (English phone, grapheme) → French phone
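A minimal sketch of the confusion-matrix extraction step: run the native-model phone recognizer on English speech, time-align its output with the canonical English transcription, and count which native phone is recognized against each English phone. The alignment itself and the pruning threshold are assumptions, not specified by the slide.

```python
from collections import Counter, defaultdict

def confusion_rules(aligned_pairs, min_prob=0.1):
    """Count native phones recognized against each canonical English
    phone and keep alternatives above a pruning threshold.
    `aligned_pairs` is an iterable of (english_phone, native_phone)
    pairs from time-aligning the native recognizer's output with the
    canonical English transcript (alignment step not shown)."""
    counts = defaultdict(Counter)
    for eng, nat in aligned_pairs:
        counts[eng][nat] += 1
    rules = {}
    for eng, ctr in counts.items():
        total = sum(ctr.values())
        rules[eng] = [(nat, n / total) for nat, n in ctr.most_common()
                      if n / total >= min_prob]
    return rules

# e.g. rules["t"] might come out as [("t", 0.7), ("k", 0.2)] -- the
# /t/ -> {/t/, /k/} alternate model paths shown on the next slide.
```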

  30. Extracted rules: example for the English phone /t/. The English model /t/ is given alternate paths through French models, e.g. /t/ → /t/ and /t/ → /k/

  31. Graphemic constraints • Example: • APPROACH /ah p r ow ch/ • APPROACH (A, ah) (PP, p) (R, r) (OA, ow) (CH, ch) • Alignment between graphemes and phones for each word of the lexicon • Lexicon modification: add graphemes for each word • Confusion rules extraction • (grapheme, English phone) → list of non-native phones • Example: (A, ah) → a
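A sketch of how the graphemically constrained rules might be applied when expanding the lexicon: each (grapheme, English phone) slot of an aligned entry is replaced by its rule-given alternatives, falling back to the canonical phone when no rule fires. The rule format mirrors the (A, ah) → a example from the slide; everything else (function name, the rule including the canonical phone) is an assumption.

```python
from itertools import product

def expand_pronunciations(gp_seq, rules):
    """Expand one grapheme-phone aligned lexicon entry into non-native
    variants: each (grapheme, english_phone) slot is replaced by its
    rule-given alternatives (assumed to include the canonical phone when
    it survives pruning), falling back to the canonical phone alone."""
    options = [rules.get((g, p), [p]) for g, p in gp_seq]
    return [list(pron) for pron in product(*options)]

# The slide's APPROACH example:
approach = [("A", "ah"), ("PP", "p"), ("R", "r"), ("OA", "ow"), ("CH", "ch")]
rules = {("A", "ah"): ["ah", "a"]}        # from (A, ah) -> a on the slide
print(expand_pronunciations(approach, rules))
# [['ah', 'p', 'r', 'ow', 'ch'], ['a', 'p', 'r', 'ow', 'ch']]
```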

  32. Experiments: HIWIRE Database

  33. Ongoing Work • Front-end • combination and integration of algorithms • Fixed-platform demonstration • non-native speech demo • PDA-platform demonstration • Ongoing research
