
Landmark-Based Speech Recognition


Presentation Transcript


  1. Landmark-Based Speech Recognition
Mark Hasegawa-Johnson, Carol Espy-Wilson, Jim Glass, Steve Greenberg, Katrin Kirchhoff, Mark Liberman, Partha Niyogi, Ken Stevens

  2. What are Landmarks?
• Instants of perceptual importance: human speech recognition accuracy drops if a 50 ms segment is deleted. [Reviewer note: not according to Miller and Licklider, 1950, or Huggins, 1975; drop this argument about short instants in time and focus on landmarks of variable duration (30-150 ms).]
• Instants of high mutual information between phone and signal: maxima of I(q; X(t,f)) (sketched after this slide).
• Potential universality of certain acoustic landmarks: cross-linguistic and speaking-style transfer; noise robustness.
Where do these things happen?
• Syllable Onset ≈ Consonant Release
• Syllable Nucleus ≈ Vowel Center
• Syllable Coda ≈ Consonant Closure
Perceptual experiments: Strange, 1989. I(q; X(t,f)) experiment: Hasegawa-Johnson, 2000.
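Below is a minimal sketch of the mutual-information criterion from this slide: a plug-in (histogram) estimate of I(q; X(t,f)) for each time-frequency cell, whose maxima mark landmark-bearing regions. The data layout (a stack of token-aligned spectrograms plus one phone label per token) and all names are assumptions for illustration, not the code behind Hasegawa-Johnson (2000).

```python
import numpy as np

def mutual_information(labels, values, n_bins=8):
    """Plug-in estimate of I(q; X) in bits for a discrete label q and a
    scalar observation X, discretized into quantile bins."""
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))
    x = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)
    _, q = np.unique(labels, return_inverse=True)
    joint = np.zeros((q.max() + 1, n_bins))
    np.add.at(joint, (q, x), 1.0)          # joint histogram of (q, X)
    joint /= joint.sum()
    pq = joint.sum(axis=1, keepdims=True)  # marginal P(q)
    px = joint.sum(axis=0, keepdims=True)  # marginal P(X)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pq @ px)[nz])).sum())

def mi_map(spectrograms, phones):
    """I(q; X(t,f)) for every (t, f) cell; `spectrograms` is (tokens, T, F),
    `phones` gives one label per token. Landmarks sit at the maxima."""
    _, T, F = spectrograms.shape
    return np.array([[mutual_information(phones, spectrograms[:, t, f])
                      for f in range(F)] for t in range(T)])
```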

  3. Landmark-Based Speech Recognition
[Figure: a search space of candidate transcriptions ("buck up", "big dope", "backed up", "bagged up", "big doowop", …), each scored against syllable structure (ONSET, NUCLEUS, CODA); the MAP transcription is "… backed up …".]

  4. Stop Detection using Support Vector Machines
False acceptance vs. false rejection errors per 10 ms frame, four types of stop detectors:
(1) Delta-energy ("Deriv"): Equal Error Rate = 0.2%
(2) HMM: False Rejection Error = 0.3%
(3) Linear SVM: EER = 0.15%
(4) Kernel SVM: EER = 0.13%
[Slide note: add a "take home message" to this slide.]
(Niyogi & Burges, 1999, 2002)
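As a concrete illustration of the EER numbers above, here is a sketch of a frame-level stop detector and its equal error rate, using scikit-learn's SVM in place of the detectors Niyogi & Burges actually used; the feature matrix X (one row per 10 ms frame) and binary labels y are assumed inputs.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve

def train_stop_detector(X_train, y_train):
    # The RBF kernel stands in for the "kernel SVM" row above;
    # kernel="linear" would correspond to the linear-SVM row.
    return SVC(kernel="rbf", probability=True).fit(X_train, y_train)

def equal_error_rate(detector, X_test, y_test):
    scores = detector.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)  # false-acceptance rate vs. hits
    fnr = 1.0 - tpr                          # false-rejection rate
    i = int(np.argmin(np.abs(fpr - fnr)))    # operating point where FA ≈ FR
    return (fpr[i] + fnr[i]) / 2.0
```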

  5. Manner Class Recognition Accuracy (Juneja and Espy-Wilson, 2003)

  6. Small-Vocabulary Word Recognition Using Landmarks: Results on TIDIGITS
TIDIGITS recognition, using SVMs trained on TIMIT:
• Manner-Class HMM: 53% WRA
• SVM Landmark Detectors: 76% WRA
(Juneja and Espy-Wilson, 2003)

  7. Lexical Notation: What are "Distinctive Features"?
MANNER FEATURES:
+sonorant +continuant = Vowel, Glide
+sonorant –continuant = Nasal, /l/
–sonorant +continuant = Fricative
–sonorant –continuant = Stop
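The four manner classes above are just a two-bit code, so a recognizer can store them as a lookup table; a trivial illustration (the boolean encoding and class-name strings are assumed):

```python
# (sonorant, continuant) -> manner class, following the slide's table.
MANNER = {
    (True,  True):  "vowel/glide",   # +sonorant +continuant
    (True,  False): "nasal or /l/",  # +sonorant -continuant
    (False, True):  "fricative",     # -sonorant +continuant
    (False, False): "stop",          # -sonorant -continuant
}

def manner_class(sonorant: bool, continuant: bool) -> str:
    return MANNER[(sonorant, continuant)]

assert manner_class(sonorant=False, continuant=False) == "stop"
```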

  8. Distinctive Feature Lexicon
• Based on ICSI train-ws97 Switchboard transcriptions
• Compiled to a lexicon using Fosler-Lussier's babylex lexical compiler
• Converted to landmarks using Hasegawa-Johnson's perl transcription tools
Landmarks in blue; place and voicing features in green.
AGO (0.441765):
+syllabic +reduced +back (syllable nucleus)
↓continuant ↓sonorant +velar +voiced (stop closure)
↑continuant ↑sonorant +velar +voiced (stop release)
+syllabic –low –high +back +round +tense (syllable nucleus)
AGO (0.294118):
+syllabic +reduced –back (syllable nucleus)
↓continuant ↓sonorant +velar +voiced (stop closure)
↑continuant ↑sonorant +velar +voiced (stop release)
+syllabic –low –high +back +round +tense (syllable nucleus)
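One possible in-memory form for such an entry is sketched below: each pronunciation variant is a prior probability plus an ordered sequence of landmarks carrying their feature bundles. The field names are illustrative assumptions, not the babylex output format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Landmark:
    kind: str             # "nucleus", "closure", or "release"
    features: frozenset   # distinctive features active at this landmark

# First variant of AGO from the slide, prior probability 0.441765.
AGO_VARIANT_1 = (0.441765, (
    Landmark("nucleus", frozenset({"+syllabic", "+reduced", "+back"})),
    Landmark("closure", frozenset({"↓continuant", "↓sonorant", "+velar", "+voiced"})),
    Landmark("release", frozenset({"↑continuant", "↑sonorant", "+velar", "+voiced"})),
    Landmark("nucleus", frozenset({"+syllabic", "-low", "-high", "+back", "+round", "+tense"})),
))
```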

  9. Noise Robustness of MLP-Based Distinctive Feature Detectors
• Each distinctive feature relies on different acoustic observations.
• Acoustic diversity can improve word recognition accuracy in noise: 10% WRA improvement at 0 dB SNR.
(Kirchhoff, 1999)

  10. Noise Robustness of Distinctive Features: Pink Noise
Articulatory feature classification is more robust than phone classification at low SNRs.
(Chang, Shastri and Greenberg, 2001)

  11. Noise Robustness of Distinctive Features: White Noise (Chang, Shastri and Greenberg, 2001)

  12. Research Goals: Summer 2004
• Switchboard:
  - Train landmark detectors
  - Test: manner-class recognition
  - Word lattice rescoring using landmark detection probabilities
• Noise:
  - Manner class recognition in babble noise, 0 dB
  - Word lattice rescoring with noisy observations

  13. Experiment #1: Training and Manner Class Recognition on Switchboard
• Currently existing infrastructure (11/2003):
  - SVM training code: libsvm
  - Forced alignment of landmarks to phonetically untranscribed data: Espy-Wilson
  - Landmark-based dictionaries for Switchboard: Hasegawa-Johnson
  - TIMIT-trained SVMs: Espy-Wilson, Hasegawa-Johnson
  - Phonetic transcriptions of WS97 test data: Greenberg
  - Interactive code for viewing transcriptions and observations: xwaves, matlab
• Infrastructure prepared prior to Summer 2004:
  - Diverse acoustic observations for all of Switchboard, including MFCC-based features, broadband spectral energies, and sub-band periodicity (a feature-extraction sketch follows this slide).
• Experiment schedule, Summer 2004:
  - Week 1: Test TIMIT-trained MFCC-based SVMs on WS97 data. Retrain and re-test. Error analysis.
  - Week 2: Train and test using alternative acoustic observations.
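A sketch of the "diverse acoustic observations" item above, under assumed frame settings (10 ms hop, 8 kHz Switchboard audio): MFCCs via librosa plus log energies in four broad frequency bands. Sub-band periodicity is omitted, and this is not the workshop's actual feature code.

```python
import numpy as np
import librosa

def observations(wav, sr=8000, hop=80):    # 10 ms hop at 8 kHz (assumed)
    # 13 MFCCs per frame.
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13, hop_length=hop)
    # Broadband spectral energies: power summed over four equal FFT bands.
    S = np.abs(librosa.stft(wav, n_fft=256, hop_length=hop)) ** 2
    edges = np.linspace(0, S.shape[0], 5, dtype=int)
    bands = np.stack([S[a:b].sum(axis=0)
                      for a, b in zip(edges[:-1], edges[1:])])
    return np.vstack([mfcc, np.log(bands + 1e-10)])  # (13 + 4, n_frames)
```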

  14. Reminder Slide: Manner Class Recognition Accuracy on TIMIT

  15. Experiment #2: Lattice Rescoring
• Infrastructure prepared prior to Summer 2004:
  - Word recognition lattices for the Switchboard test corpus (Byrne and Makhoul have both tentatively offered lattices)
  - "Pinched" lattices (time-aligned to the ML transcription)
  - Code to learn SVM-MLP landmark detection probabilities
  - Efficient code for lattice rescoring (a minimal sketch follows this slide)
  - Code for lattice error analysis: locate phoneme and landmark differences between the ML path and the correct path, and tabulate by syllable position and manner features
• Experiment schedule, Summer 2004:
  - Weeks 1-2: Train landmark detection probabilities for both MFCC-based and acoustically diverse landmark detectors. Refine landmark-based dictionaries; retrain if necessary.
  - Weeks 3-4: Test lattice rescoring results as a function of (1) acoustic observations and (2) dictionary type.
  - Weeks 5-6: Test lattice rescoring results using landmarks computed from noisy observations.
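A minimal sketch of the rescoring step named above: each lattice arc keeps its first-pass score, a weighted landmark log-probability is added, and the best path is found by dynamic programming over the DAG. The arc tuple layout and the landmark_logprob callback are assumptions, not the workshop's lattice format.

```python
import math
from collections import defaultdict

def rescore(arcs, start, end, landmark_logprob, weight=1.0):
    """arcs: (src, dst, word, first_pass_score, frames) tuples, assumed to
    be in topological order; returns (best score, best word sequence)."""
    best = defaultdict(lambda: (-math.inf, None))   # node -> (score, backptr)
    best[start] = (0.0, None)
    for src, dst, word, score, frames in arcs:
        total = best[src][0] + score + weight * landmark_logprob(word, frames)
        if total > best[dst][0]:
            best[dst] = (total, (src, word))
    words, node = [], end                           # trace back the best path
    while best[node][1] is not None:
        node, word = best[node][1]
        words.append(word)
    return best[end][0], words[::-1]
```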

  16. Lattice Rescoring – Oracle Experiment
Suppose all landmarks in the ICSI transcription were correctly recognized by the SVM. How much could the lattices be improved?
N-best lattices, misc-ws97, 3-mixture 3-state HTK monophones, 19.8% WRA, 23000 arcs/second.
Result: WRA not improved (19.5%). Example (WRA = 1/10 both before and after):
REF: HOW DID THIS WORK A MALE RAT HAD BEEN BOUGHT
BEFORE: HIGH KIDS ARE TERM YOU'RE AT A DO BY
AFTER: HIGH TO HIS WORK IN YOUR AT A DO BY
Possible resolution: preprocess the lattices using Byrne's method, reducing ambiguity to a level that can be addressed using manner features.
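For reference, a sketch of how a WRA number like those above can be computed, assuming the common definition WRA = (N - substitutions - deletions - insertions) / N from a word-level Levenshtein alignment; the workshop's exact scorer may differ.

```python
def wra(ref: str, hyp: str) -> float:
    R, H = ref.split(), hyp.split()
    d = list(range(len(H) + 1))        # rolling DP row of edit distances
    for i, r in enumerate(R, 1):
        prev, d[0] = d[0], i           # prev holds the diagonal d[i-1][j-1]
        for j, h in enumerate(H, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution or match
    return (len(R) - d[-1]) / len(R)

print(wra("HOW DID THIS WORK A MALE RAT HAD BEEN BOUGHT",
          "HIGH TO HIS WORK IN YOUR AT A DO BY"))     # 0.1, as on the slide
```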

  17. Experiment #3: Noise
• Infrastructure prepared prior to Summer 2004:
  - Switchboard waveform data in babble noise, at 10 dB and 0 dB SNR (a mixing sketch follows this slide)
  - Acoustic observation files (MFCC and diverse observations) created from all noisy waveform files
• Experiment schedule, Summer 2004:
  - Weeks 3-4: Train landmark detectors using noisy speech data. Test landmark detectors on the task of manner class recognition.
  - Weeks 5-6: Test lattice rescoring results using landmarks computed from noisy observations.
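A sketch of the waveform preparation named above: mixing a babble-noise track into clean speech at a target SNR. The file names and the use of the soundfile package are assumptions for illustration, and mono input is assumed.

```python
import numpy as np
import soundfile as sf

def mix_at_snr(speech, noise, snr_db):
    noise = np.resize(noise, speech.shape)   # loop or trim noise to length
    gain = np.sqrt(np.mean(speech ** 2) /
                   (np.mean(noise ** 2) * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

speech, sr = sf.read("sw02001.wav")          # hypothetical file names
noise, _ = sf.read("babble.wav")
for snr_db in (10, 0):                       # the slide's two SNR conditions
    sf.write(f"sw02001_babble_{snr_db}dB.wav",
             mix_at_snr(speech, noise, snr_db), sr)
```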

  18. Summary
• Landmarks: a somewhat different view of the speech signal.
• Integration with existing systems via lattice rescoring.
• Probable benefits:
  - Low parameter count
  - High manner-class recognition accuracy
  - Acoustic diversity → noise robustness
• Costs: novel theory is required, e.g.,
  - Label-sequence SVM: convergence not yet guaranteed.
  - Landmark detection probabilities are discriminant, while the pronunciation model is a likelihood.
• The costs are also benefits: a successful workshop could spawn important research.

  19. Citations
S. Chang, L. Shastri, and S. Greenberg, "Robust phonetic feature extraction under a wide range of noise backgrounds and signal-to-noise ratios," Workshop on Consistent and Reliable Acoustic Cues for Sound Analysis, Aalborg, Denmark, 2001.
M. Hasegawa-Johnson, "Time-Frequency Distribution of Partial Phonetic Information Measured Using Mutual Information," ICSLP, 2000.
A. Juneja, Speech Recognition Using Acoustic Landmarks and Binary Phonetic Feature Classifiers, PhD thesis proposal, University of Maryland, August 2003.
A. Juneja and C. Espy-Wilson, "Speech segmentation using probabilistic phonetic feature hierarchy and support vector machines," International Joint Conference on Neural Networks, 2003.
K. Kirchhoff, G. Fink, and G. Sagerer, "Combining acoustic and articulatory feature information for robust speech recognition," Speech Communication, May 2002.
K. Kirchhoff, Robust Speech Recognition Using Articulatory Information, PhD thesis, University of Bielefeld, Germany, July 1999.
P. Niyogi, C. Burges, and P. Ramesh, "Distinctive Feature Detection Using Support Vector Machines," ICASSP, 1999.
P. Niyogi and C. Burges, Detecting and Interpreting Acoustic Features by Support Vector Machines, University of Chicago Technical Report TR-2002-02, http://www.cs.uchicago.edu/research/publications/techreports/TR-2002-02.
K. N. Stevens, S. Y. Manuel, S. Shattuck-Hufnagel, and S. Liu, "Implementation of a Model for Lexical Access Based on Features," ICSLP, 1992.
W. Strange, J. J. Jenkins, and T. L. Johnson, "Dynamic Specification of Coarticulated Vowels," Journal of the Acoustical Society of America 74(3):695-705, 1983.
