
Landmark-Based Speech Recognition




  1. Landmark-Based Speech Recognition
  Mark Hasegawa-Johnson, Carol Espy-Wilson, Jim Glass, Steve Greenberg, Katrin Kirchhoff, Mark Liberman, Partha Niyogi, Ken Stevens

  2. What are Landmarks?
  • Time-frequency regions of high mutual information between phone and signal (maxima of I(q; X(t,f)))
  • Acoustic events of similar importance in all languages and across all speaking styles
  • Acoustic events that can be detected even in extremely noisy environments
  Where do these things happen?
  • Syllable Onset ≈ Consonant Release
  • Syllable Nucleus ≈ Vowel Center
  • Syllable Coda ≈ Consonant Closure
  I(q;X(t,f)) experiment: Hasegawa-Johnson, 2000
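
The mutual information criterion above can be estimated directly from labeled data. Below is a minimal sketch of a plug-in estimator, assuming frame-level phone labels and a quantized time-frequency observation; it is not the exact procedure of Hasegawa-Johnson (2000). Scanning this quantity over (t, f) offsets and locating its maxima yields the landmark regions.

```python
import numpy as np

def mutual_information_bits(phones, spectral_bins):
    """Plug-in estimate of I(q; X): mutual information (in bits) between
    frame-level phone labels and one quantized spectral observation."""
    phones = np.asarray(phones)
    spectral_bins = np.asarray(spectral_bins)
    mi = 0.0
    for q in np.unique(phones):
        p_q = np.mean(phones == q)                             # P(q)
        for x in np.unique(spectral_bins):
            p_x = np.mean(spectral_bins == x)                  # P(x)
            p_qx = np.mean((phones == q) & (spectral_bins == x))  # P(q, x)
            if p_qx > 0.0:
                mi += p_qx * np.log2(p_qx / (p_q * p_x))
    return mi
```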

  3. Landmark-Based Speech Recognition
  [Figure: a search space of competing word hypotheses ("buck up," "big dope," "backed up," "bagged up," "big doowop," …) organized by syllable structure (ONSET, NUCLEUS, CODA), from which the MAP transcription "… backed up …" is selected.]

  4. Stop Detection using Support Vector Machines
  False Acceptance vs. False Rejection Errors, TIMIT, per 10 ms frame.
  SVM Landmark Detector: Half the Error of an HMM
  (1) Delta-Energy ("Deriv"): Equal Error Rate = 0.2%
  (2) HMM: False Rejection Error = 0.3%
  (3) Linear SVM: EER = 0.15%
  (4) Kernel SVM: EER = 0.13%
  (Niyogi & Burges, 1999, 2002)
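
As a rough illustration of the framewise detection task above, a stop detector and its equal error rate might be set up as follows; scikit-learn is an assumed stand-in for the SVM implementations actually used in this work.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve

# X: (n_frames, n_dims) acoustic features for each 10 ms frame
# y: 1 if the frame contains a stop landmark, else 0
def train_stop_detector(X, y):
    return SVC(kernel="rbf").fit(X, y)   # the kernel-SVM case on this slide

def equal_error_rate(detector, X_test, y_test):
    scores = detector.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))  # operating point where the errors cross
    return (fpr[i] + fnr[i]) / 2.0       # false acceptance = false rejection here
```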

  5. Manner Class Recognition Accuracy in TIMIT (errors per phoneme)

  6. Small-Vocabulary Word Recognition Using Landmarks: Results on TIDIGITS
  TIDIGITS recognition, using models trained on TIMIT. Word recognition accuracy given only MANNER CLASS FEATURES:
  • Manner-Class HMMs: 53% WRA
  • SVM Landmark Detectors: 76% WRA
  (Juneja and Espy-Wilson, 2003)

  7. Lexical Notation: What are "Distinctive Features?"
  MANNER FEATURES:
  • +sonorant +continuant = Vowel, Glide
  • +sonorant –continuant = Nasal, /l/
  • –sonorant +continuant = Fricative
  • –sonorant –continuant = Stop
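
The four-way manner partition above is small enough to state as a lookup table; a trivial sketch:

```python
# Manner classes from the two binary features on this slide.
MANNER = {
    (True,  True):  "vowel / glide",   # +sonorant +continuant
    (True,  False): "nasal, /l/",      # +sonorant -continuant
    (False, True):  "fricative",       # -sonorant +continuant
    (False, False): "stop",            # -sonorant -continuant
}

def manner_class(sonorant: bool, continuant: bool) -> str:
    return MANNER[(sonorant, continuant)]
```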

  8. Distinctive Feature Lexicon
  • Based on ICSI train-ws97 Switchboard transcriptions
  • Compiled to a lexicon using Fosler-Lussier's babylex lexical compiler
  • Converted to landmarks using Hasegawa-Johnson's perl transcription tools
  Landmarks in blue; place and voicing features in green.
  AGO (0.441765):
  +syllabic +reduced +back AX
  ↓continuant ↓sonorant +velar +voiced G closure
  ↑continuant ↑sonorant +velar +voiced G release
  +syllabic –low –high +back +round +tense OW
  AGO (0.294118):
  +syllabic +reduced –back IX
  ↓continuant ↓sonorant +velar +voiced G closure
  ↑continuant ↑sonorant +velar +voiced G release
  +syllabic –low –high +back +round +tense OW
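
For illustration, the closure/release expansion visible in the AGO entries can be sketched as a hypothetical helper (this is not Hasegawa-Johnson's actual perl code):

```python
def stop_to_landmarks(phone, place_voicing):
    """Expand one stop phone into its closure and release landmarks,
    each carrying the stop's place and voicing features."""
    return [
        f"↓continuant ↓sonorant {place_voicing} {phone} closure",
        f"↑continuant ↑sonorant {place_voicing} {phone} release",
    ]

# stop_to_landmarks("G", "+velar +voiced") reproduces the two /G/ landmark
# lines in both AGO entries above.
```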

  9. Noise Robustness of MLP-Based Distinctive Feature Detectors: OGI Numbers95, Continuous Numbers WRA
  • Each distinctive feature relies on different acoustic observations.
  • Acoustic diversity can improve word recognition accuracy in noise: 10% WRA improvement at 0 dB SNR.
  (Kirchhoff, 1999)

  10. Noise Robustness of Distinctive Features: Pink Noise
  Articulatory feature classification is more robust than phone classification at low SNRs. (Chang, Shastri and Greenberg, 2001)

  11. Noise Robustness of Distinctive Features: White Noise
  Some features fare better in white noise, some better in pink noise; diversity improves overall performance. (Chang, Shastri and Greenberg, 2001)

  12. Research Goals: Summer 2004
  • Switchboard:
    • Train landmark detectors
    • Test: manner-class recognition
    • Word lattice rescoring using landmark detection probabilities
  • Noise:
    • Manner-class recognition in babble noise, 0 dB SNR
    • Word lattice rescoring with noisy observations

  13. Experiment #1: Training and Manner Class Recognition on Switchboard
  • Currently existing infrastructure (11/2003):
    • SVM training code (libsvm)
    • Forced alignment of landmarks to phonetically untranscribed data (Espy-Wilson)
    • Landmark-based dictionaries for Switchboard (Hasegawa-Johnson)
    • TIMIT-trained SVMs (Espy-Wilson, Hasegawa-Johnson)
    • Phonetic transcriptions of WS97 test data (Greenberg)
    • Interactive code for viewing transcriptions and observations (xwaves, matlab)
  • Infrastructure prepared prior to Summer 2004:
    • Diverse acoustic observations for all of Switchboard, including MFCC-based features, broadband spectral energies, and sub-band periodicity
  • Experiment schedule, Summer 2004:
    • Week 1: Test TIMIT-trained MFCC-based SVMs on WS97 data. Retrain and re-test. Error analysis.
    • Week 2: Train and test using alternative acoustic observations.

  14. Experiment #2: Lattice Rescoring
  • Infrastructure prepared prior to Summer 2004:
    • Word recognition lattices for the Switchboard test corpus (Byrne and Makhoul have both tentatively offered lattices)
    • "Pinched" lattices (time-aligned to the ML transcription)
    • Code to learn SVM-MLP landmark detection probabilities
    • Efficient code for lattice rescoring (a toy sketch follows this slide)
    • Code for lattice error analysis: locate phoneme and landmark differences between the ML path and the correct path, and tabulate by syllable position and manner features
  • Experiment schedule, Summer 2004:
    • Weeks 1-2: Train landmark detection probabilities, both MFCC-based and acoustically diverse landmark detectors. Refine landmark-based dictionaries; retrain if necessary.
    • Weeks 3-4: Test lattice rescoring results as a function of (1) acoustic observations, (2) dictionary type.
    • Weeks 5-6: Test lattice rescoring results using landmarks computed from noisy observations.
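
A toy version of the rescoring step itself, with all names hypothetical and the real workshop code certainly more elaborate: each arc's recognizer score is augmented with a weighted landmark log-probability for its word over its time span, and the best path is recomputed by dynamic programming.

```python
def rescore_lattice(arcs, landmark_logprob, weight=1.0):
    """arcs: (t0, t1, word, recognizer_score) tuples over topologically
    ordered nodes whose ids double as frame times; node 0 starts the lattice.
    landmark_logprob(word, t0, t1): hypothetical summed log probability of
    the landmarks the dictionary predicts for `word` between t0 and t1."""
    best = {0: (0.0, [])}                # node -> (best score, word sequence)
    for t0, t1, word, score in sorted(arcs):
        if t0 not in best:
            continue                     # arc leaves an unreachable node
        total = best[t0][0] + score + weight * landmark_logprob(word, t0, t1)
        if t1 not in best or total > best[t1][0]:
            best[t1] = (total, best[t0][1] + [word])
    return best[max(best)][1]            # word sequence into the final node
```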

  15. Lattice Rescoring – Oracle Experiment
  Suppose all landmarks in the ICSI transcription were correctly recognized by the SVM. How much could the lattices be improved?
  • Rescoring given a full correct phone transcription: 9.4% 1-best WRA improvement
  • Rescoring given only correct landmark times and manner features: 2.2% 1-best WRA improvement
  Content-word WRA improves more than function-word WRA. Example (1-best WRA = 1/10 both before and after):
  REF: HOW DID THIS WORK A MALE RAT HAD BEEN BOUGHT
  BEFORE: HIGH KIDS ARE TERM YOU'RE AT A DO BY
  AFTER: HIGH TO HIS WORK IN YOUR AT A DO BY
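
The 1-best WRA figures above follow from a standard word-level Levenshtein alignment; a sketch, assuming WRA is defined as one minus the word error rate:

```python
def word_recognition_accuracy(ref, hyp):
    """WRA = 1 - (substitutions + deletions + insertions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # sub/match
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return 1.0 - d[len(r)][len(h)] / len(r)

# word_recognition_accuracy(
#     "HOW DID THIS WORK A MALE RAT HAD BEEN BOUGHT",
#     "HIGH TO HIS WORK IN YOUR AT A DO BY")  ->  0.1, the 1/10 above
```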

  16. Experiment #3: Noise
  • Infrastructure prepared prior to Summer 2004:
    • Switchboard waveform data in babble noise, at 10 dB and 0 dB SNR
    • Acoustic observation files (MFCC and diverse observations) created from all noisy waveform files
  • Experiment schedule, Summer 2004:
    • Weeks 3-4: Train landmark detectors using noisy speech data. Test landmark detectors on the task of manner class recognition.
    • Weeks 5-6: Test lattice rescoring results using landmarks computed from noisy observations.
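
The noisy waveform data would presumably be prepared by scaling a babble recording to a target SNR before mixing; a minimal sketch (function name and details assumed):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the target SNR."""
    speech = np.asarray(speech, dtype=float)
    noise = np.resize(np.asarray(noise, dtype=float), speech.shape)  # loop/trim
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# mix_at_snr(utterance, babble, 0.0) gives the 0 dB babble condition above.
```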

  17. Summary
  • Landmarks: a somewhat different view of the speech signal.
  • Integration with existing systems via lattice rescoring.
  • Probable benefits:
    • Low parameter count
    • High manner-class recognition accuracy
    • Acoustic diversity → noise robustness
  • Costs: novel theory is required, e.g.:
    • Label-sequence SVM: convergence not yet guaranteed.
    • Landmark detection probabilities are discriminant; the pronunciation model is a likelihood.
  • The costs are also benefits: a successful workshop could spawn important research.
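
One standard bridge across the discriminant-vs-likelihood gap noted above (assumed here, not stated on the slide) is Platt scaling: map raw SVM outputs to posteriors with a fitted sigmoid, then divide by the landmark prior so the result can stand in for a likelihood under the generative pronunciation model.

```python
import numpy as np

def platt_posterior(f, A, B):
    """P(landmark | x) from a raw SVM discriminant f(x), using sigmoid
    parameters A, B assumed already fitted on held-out data (Platt scaling)."""
    return 1.0 / (1.0 + np.exp(A * f + B))

def scaled_likelihood(posterior, prior):
    """Posterior divided by the class prior: proportional to p(x | landmark),
    so it can be plugged into a likelihood-based pronunciation model."""
    return posterior / prior
```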

  18. Citations
  S. Chang, L. Shastri, and S. Greenberg, "Robust phonetic feature extraction under a wide range of noise backgrounds and signal-to-noise ratios," Workshop on Consistent and Reliable Acoustic Cues for Sound Analysis, Aalborg, Denmark, 2001.
  M. Hasegawa-Johnson, "Time-Frequency Distribution of Partial Phonetic Information Measured Using Mutual Information," ICSLP, 2000.
  A. Juneja, Speech Recognition Using Acoustic Landmarks and Binary Phonetic Feature Classifiers, PhD thesis proposal, University of Maryland, August 2003.
  A. Juneja and C. Espy-Wilson, "Speech segmentation using probabilistic phonetic feature hierarchy and support vector machines," International Joint Conference on Neural Networks, 2003.
  K. Kirchhoff, G. Fink, and G. Sagerer, "Combining acoustic and articulatory feature information for robust speech recognition," Speech Communication, May 2002.
  K. Kirchhoff, Robust Speech Recognition Using Articulatory Information, PhD thesis, University of Bielefeld, Germany, July 1999.
  P. Niyogi, C. Burges, and P. Ramesh, "Distinctive Feature Detection Using Support Vector Machines," ICASSP, 1999.
  P. Niyogi and C. Burges, Detecting and Interpreting Acoustic Features by Support Vector Machines, University of Chicago Technical Report TR-2002-02, http://www.cs.uchicago.edu/research/publications/techreports/TR-2002-02.
  K. N. Stevens, S. Y. Manuel, S. Shattuck-Hufnagel, and S. Liu, "Implementation of a Model for Lexical Access Based on Features," ICSLP, 1992.
