
Landmark-Based Speech Recognition




  1. Landmark-Based Speech Recognition
  Mark Hasegawa-Johnson, Carol Espy-Wilson, Jim Glass, Steve Greenberg, Katrin Kirchhoff, Mark Liberman, Partha Niyogi, Ken Stevens

  2. What are Landmarks?
  • Time-frequency regions of high mutual information between phone and signal (maxima of I(q; X(t,f)))
  • Acoustic events of similar importance in all languages and across all speaking styles
  • Acoustic events that can be detected even in extremely noisy environments
  Where do these things happen?
  • Syllable Onset ≈ Consonant Release
  • Syllable Nucleus ≈ Vowel Center
  • Syllable Coda ≈ Consonant Closure
  I(q;X(t,f)) experiment: Hasegawa-Johnson, 2000
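
The mutual information criterion above can be estimated directly from labeled data. Below is a minimal sketch of a plug-in estimator, assuming frame-level phone labels and a quantized time-frequency observation; it is not the exact procedure of Hasegawa-Johnson (2000). Scanning this quantity over (t, f) offsets and locating its maxima yields the landmark regions.

```python
import numpy as np

def mutual_information_bits(phones, spectral_bins):
    """Plug-in estimate of I(q; X): mutual information (in bits) between
    frame-level phone labels and one quantized spectral observation."""
    phones = np.asarray(phones)
    spectral_bins = np.asarray(spectral_bins)
    mi = 0.0
    for q in np.unique(phones):
        p_q = np.mean(phones == q)                             # P(q)
        for x in np.unique(spectral_bins):
            p_x = np.mean(spectral_bins == x)                  # P(x)
            p_qx = np.mean((phones == q) & (spectral_bins == x))  # P(q, x)
            if p_qx > 0.0:
                mi += p_qx * np.log2(p_qx / (p_q * p_x))
    return mi
```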

  3. Landmark-Based Speech Recognition
  [Figure: a search space of competing word hypotheses ("buck up," "big dope," "backed up," "bagged up," "big doowop," …) organized by syllable structure (ONSET, NUCLEUS, CODA), from which the MAP transcription "… backed up …" is selected.]

  4. Stop Detection using Support Vector Machines
  False Acceptance vs. False Rejection Errors, TIMIT, per 10 ms frame.
  SVM Landmark Detector: Half the Error of an HMM
  (1) Delta-Energy ("Deriv"): Equal Error Rate = 0.2%
  (2) HMM: False Rejection Error = 0.3%
  (3) Linear SVM: EER = 0.15%
  (4) Kernel SVM: EER = 0.13%
  (Niyogi & Burges, 1999, 2002)
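
As a rough illustration of the framewise detection task above, a stop detector and its equal error rate might be set up as follows; scikit-learn is an assumed stand-in for the SVM implementations actually used in this work.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve

# X: (n_frames, n_dims) acoustic features for each 10 ms frame
# y: 1 if the frame contains a stop landmark, else 0
def train_stop_detector(X, y):
    return SVC(kernel="rbf").fit(X, y)   # the kernel-SVM case on this slide

def equal_error_rate(detector, X_test, y_test):
    scores = detector.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))  # operating point where the errors cross
    return (fpr[i] + fnr[i]) / 2.0       # false acceptance = false rejection here
```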

  5. Manner Class Recognition Accuracy in TIMIT (errors per phoneme)

  6. Small-Vocabulary Word Recognition Using Landmarks: Results on TIDIGITS
  TIDIGITS recognition, using models trained on TIMIT. Word recognition accuracy given only MANNER CLASS FEATURES:
  • Manner-Class HMMs: 53% WRA
  • SVM Landmark Detectors: 76% WRA
  (Juneja and Espy-Wilson, 2003)

  7. Lexical Notation: What are "Distinctive Features?"
  MANNER FEATURES:
  • +sonorant +continuant = Vowel, Glide
  • +sonorant –continuant = Nasal, /l/
  • –sonorant +continuant = Fricative
  • –sonorant –continuant = Stop
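
The four-way manner partition above is small enough to state as a lookup table; a trivial sketch:

```python
# Manner classes from the two binary features on this slide.
MANNER = {
    (True,  True):  "vowel / glide",   # +sonorant +continuant
    (True,  False): "nasal, /l/",      # +sonorant -continuant
    (False, True):  "fricative",       # -sonorant +continuant
    (False, False): "stop",            # -sonorant -continuant
}

def manner_class(sonorant: bool, continuant: bool) -> str:
    return MANNER[(sonorant, continuant)]
```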

  8. Distinctive Feature Lexicon
  • Based on ICSI train-ws97 Switchboard transcriptions
  • Compiled to a lexicon using Fosler-Lussier's babylex lexical compiler
  • Converted to landmarks using Hasegawa-Johnson's perl transcription tools
  Landmarks in blue; place and voicing features in green.
  AGO (0.441765):
  +syllabic +reduced +back AX
  ↓continuant ↓sonorant +velar +voiced G closure
  ↑continuant ↑sonorant +velar +voiced G release
  +syllabic –low –high +back +round +tense OW
  AGO (0.294118):
  +syllabic +reduced –back IX
  ↓continuant ↓sonorant +velar +voiced G closure
  ↑continuant ↑sonorant +velar +voiced G release
  +syllabic –low –high +back +round +tense OW
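
For illustration, the closure/release expansion visible in the AGO entries can be sketched as a hypothetical helper (this is not Hasegawa-Johnson's actual perl code):

```python
def stop_to_landmarks(phone, place_voicing):
    """Expand one stop phone into its closure and release landmarks,
    each carrying the stop's place and voicing features."""
    return [
        f"↓continuant ↓sonorant {place_voicing} {phone} closure",
        f"↑continuant ↑sonorant {place_voicing} {phone} release",
    ]

# stop_to_landmarks("G", "+velar +voiced") reproduces the two /G/ landmark
# lines in both AGO entries above.
```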

  9. Noise Robustness of MLP-Based Distinctive Feature Detectors: OGI Numbers95, Continuous Numbers WRA
  • Each distinctive feature relies on different acoustic observations.
  • Acoustic diversity can improve word recognition accuracy in noise: 10% WRA improvement at 0 dB SNR.
  (Kirchhoff, 1999)

  10. Noise Robustness of Distinctive Features: Pink Noise
  Articulatory feature classification is more robust than phone classification at low SNRs. (Chang, Shastri and Greenberg, 2001)

  11. Noise Robustness of Distinctive Features: White Noise
  Some features fare better in white noise, some better in pink noise; diversity improves overall performance. (Chang, Shastri and Greenberg, 2001)

  12. Research Goals: Summer 2004
  • Switchboard:
    • Train landmark detectors
    • Test: manner-class recognition
    • Word lattice rescoring using landmark detection probabilities
  • Noise:
    • Manner-class recognition in babble noise, 0 dB SNR
    • Word lattice rescoring with noisy observations

  13. Experiment #1: Training and Manner Class Recognition on Switchboard
  • Currently existing infrastructure (11/2003):
    • SVM training code (libsvm)
    • Forced alignment of landmarks to phonetically untranscribed data (Espy-Wilson)
    • Landmark-based dictionaries for Switchboard (Hasegawa-Johnson)
    • TIMIT-trained SVMs (Espy-Wilson, Hasegawa-Johnson)
    • Phonetic transcriptions of WS97 test data (Greenberg)
    • Interactive code for viewing transcriptions and observations (xwaves, matlab)
  • Infrastructure prepared prior to Summer 2004:
    • Diverse acoustic observations for all of Switchboard, including MFCC-based features, broadband spectral energies, and sub-band periodicity
  • Experiment schedule, Summer 2004:
    • Week 1: Test TIMIT-trained MFCC-based SVMs on WS97 data. Retrain and re-test. Error analysis.
    • Week 2: Train and test using alternative acoustic observations.

  14. Experiment #2: Lattice Rescoring
  • Infrastructure prepared prior to Summer 2004:
    • Word recognition lattices for the Switchboard test corpus (Byrne and Makhoul have both tentatively offered lattices)
    • "Pinched" lattices (time-aligned to the ML transcription)
    • Code to learn SVM-MLP landmark detection probabilities
    • Efficient code for lattice rescoring (a toy sketch follows this slide)
    • Code for lattice error analysis: locate phoneme and landmark differences between the ML path and the correct path, and tabulate by syllable position and manner features
  • Experiment schedule, Summer 2004:
    • Weeks 1-2: Train landmark detection probabilities, both MFCC-based and acoustically diverse landmark detectors. Refine landmark-based dictionaries; retrain if necessary.
    • Weeks 3-4: Test lattice rescoring results as a function of (1) acoustic observations, (2) dictionary type.
    • Weeks 5-6: Test lattice rescoring results using landmarks computed from noisy observations.
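
A toy version of the rescoring step itself, with all names hypothetical and the real workshop code certainly more elaborate: each arc's recognizer score is augmented with a weighted landmark log-probability for its word over its time span, and the best path is recomputed by dynamic programming.

```python
def rescore_lattice(arcs, landmark_logprob, weight=1.0):
    """arcs: (t0, t1, word, recognizer_score) tuples over topologically
    ordered nodes whose ids double as frame times; node 0 starts the lattice.
    landmark_logprob(word, t0, t1): hypothetical summed log probability of
    the landmarks the dictionary predicts for `word` between t0 and t1."""
    best = {0: (0.0, [])}                # node -> (best score, word sequence)
    for t0, t1, word, score in sorted(arcs):
        if t0 not in best:
            continue                     # arc leaves an unreachable node
        total = best[t0][0] + score + weight * landmark_logprob(word, t0, t1)
        if t1 not in best or total > best[t1][0]:
            best[t1] = (total, best[t0][1] + [word])
    return best[max(best)][1]            # word sequence into the final node
```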

  15. Lattice Rescoring – Oracle Experiment
  Suppose all landmarks in the ICSI transcription were correctly recognized by the SVM. How much could the lattices be improved?
  • Rescoring given a full correct phone transcription: 9.4% 1-best WRA improvement
  • Rescoring given only correct landmark times and manner features: 2.2% 1-best WRA improvement
  Content-word WRA improves more than function-word WRA. Example (1-best WRA = 1/10 both before and after):
  REF: HOW DID THIS WORK A MALE RAT HAD BEEN BOUGHT
  BEFORE: HIGH KIDS ARE TERM YOU'RE AT A DO BY
  AFTER: HIGH TO HIS WORK IN YOUR AT A DO BY
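
The 1-best WRA figures above follow from a standard word-level Levenshtein alignment; a sketch, assuming WRA is defined as one minus the word error rate:

```python
def word_recognition_accuracy(ref, hyp):
    """WRA = 1 - (substitutions + deletions + insertions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # sub/match
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return 1.0 - d[len(r)][len(h)] / len(r)

# word_recognition_accuracy(
#     "HOW DID THIS WORK A MALE RAT HAD BEEN BOUGHT",
#     "HIGH TO HIS WORK IN YOUR AT A DO BY")  ->  0.1, the 1/10 above
```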

  16. Experiment #3: Noise
  • Infrastructure prepared prior to Summer 2004:
    • Switchboard waveform data in babble noise, at 10 dB and 0 dB SNR
    • Acoustic observation files (MFCC and diverse observations) created from all noisy waveform files
  • Experiment schedule, Summer 2004:
    • Weeks 3-4: Train landmark detectors using noisy speech data. Test landmark detectors on the task of manner class recognition.
    • Weeks 5-6: Test lattice rescoring results using landmarks computed from noisy observations.
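
The noisy waveform data would presumably be prepared by scaling a babble recording to a target SNR before mixing; a minimal sketch (function name and details assumed):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the target SNR."""
    speech = np.asarray(speech, dtype=float)
    noise = np.resize(np.asarray(noise, dtype=float), speech.shape)  # loop/trim
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# mix_at_snr(utterance, babble, 0.0) gives the 0 dB babble condition above.
```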

  17. Summary
  • Landmarks: a somewhat different view of the speech signal.
  • Integration with existing systems via lattice rescoring.
  • Probable benefits:
    • Low parameter count
    • High manner-class recognition accuracy
    • Acoustic diversity → noise robustness
  • Costs: novel theory is required, e.g.:
    • Label-sequence SVM: convergence not yet guaranteed.
    • Landmark detection probabilities are discriminant; the pronunciation model is a likelihood.
  • The costs are also benefits: a successful workshop could spawn important research.
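
One standard bridge across the discriminant-vs-likelihood gap noted above (assumed here, not stated on the slide) is Platt scaling: map raw SVM outputs to posteriors with a fitted sigmoid, then divide by the landmark prior so the result can stand in for a likelihood under the generative pronunciation model.

```python
import numpy as np

def platt_posterior(f, A, B):
    """P(landmark | x) from a raw SVM discriminant f(x), using sigmoid
    parameters A, B assumed already fitted on held-out data (Platt scaling)."""
    return 1.0 / (1.0 + np.exp(A * f + B))

def scaled_likelihood(posterior, prior):
    """Posterior divided by the class prior: proportional to p(x | landmark),
    so it can be plugged into a likelihood-based pronunciation model."""
    return posterior / prior
```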

  18. Citations
  S. Chang, L. Shastri, and S. Greenberg, "Robust phonetic feature extraction under a wide range of noise backgrounds and signal-to-noise ratios," Workshop on Consistent and Reliable Acoustic Cues for Sound Analysis, Aalborg, Denmark, 2001.
  M. Hasegawa-Johnson, "Time-Frequency Distribution of Partial Phonetic Information Measured Using Mutual Information," ICSLP, 2000.
  A. Juneja, Speech Recognition Using Acoustic Landmarks and Binary Phonetic Feature Classifiers, PhD thesis proposal, University of Maryland, August 2003.
  A. Juneja and C. Espy-Wilson, "Speech segmentation using probabilistic phonetic feature hierarchy and support vector machines," International Joint Conference on Neural Networks, 2003.
  K. Kirchhoff, G. Fink, and G. Sagerer, "Combining acoustic and articulatory feature information for robust speech recognition," Speech Communication, May 2002.
  K. Kirchhoff, Robust Speech Recognition Using Articulatory Information, PhD thesis, University of Bielefeld, Germany, July 1999.
  P. Niyogi, C. Burges, and P. Ramesh, "Distinctive Feature Detection Using Support Vector Machines," ICASSP, 1999.
  P. Niyogi and C. Burges, Detecting and Interpreting Acoustic Features by Support Vector Machines, University of Chicago Technical Report TR-2002-02, http://www.cs.uchicago.edu/research/publications/techreports/TR-2002-02.
  K. N. Stevens, S. Y. Manuel, S. Shattuck-Hufnagel, and S. Liu, "Implementation of a Model for Lexical Access Based on Features," ICSLP, 1992.
