
Voicing Features






  1. Voicing Features • Horacio Franco, Martin Graciarena • Andreas Stolcke, Dimitra Vergyri, Jing Zheng • STAR Lab, SRI International

  2. Phonetically Motivated Features • Problem: • Cepstral coefficients fail to capture many discriminative cues. • Front-end optimized for traditional Mel cepstral features. • Front-end parameters are a compromise solution for all phones.

  3. Phonetically Motivated Features • Proposal: • Enrich Mel cepstral feature representation with phonetically motivated features from independent front-ends. • Optimize each specific front-end to improve discrimination. • Robust broad class phonetic features provide “anchor points” in acoustic phonetic decoding. • General framework for multiple phonetic features. First approach: voicing features.

  4. Voicing Features • Voicing feature algorithms: • Normalized peak autocorrelation (PA): for each time frame x, take the autocorrelation normalized by its zero-lag value and keep its maximum, computed over the pitch region 80 Hz to 450 Hz. • Entropy of the high-order cepstrum (EC) and of the linear spectrum (ES): if Y is the spectrum (or cepstrum) normalized to a probability distribution and H is the entropy of Y, then H is the feature, with the entropy computed over the pitch region 80 Hz to 450 Hz.
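The two measures above can be sketched as follows. This is a minimal Python illustration, not SRI's implementation: the 8 kHz sampling rate, 256-sample frame, and the use of a plain FFT magnitude spectrum for the entropy feature are assumptions.

```python
import numpy as np

def peak_autocorrelation(frame, sr, f_lo=80.0, f_hi=450.0):
    """Normalized peak autocorrelation in the pitch lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                          # normalize: lag 0 becomes 1
    lo, hi = int(sr / f_hi), int(sr / f_lo)  # lags for 450 Hz .. 80 Hz
    return float(ac[lo:hi + 1].max())

def spectral_entropy(frame, sr, f_lo=80.0, f_hi=450.0):
    """Entropy of the normalized magnitude spectrum in the pitch band.
    Low entropy = energy concentrated at few frequencies (voiced-like)."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    band = spec[(freqs >= f_lo) & (freqs <= f_hi)]
    p = band / band.sum()                    # treat the band as a distribution
    return float(-(p * np.log(p + 1e-12)).sum())

# A periodic (voiced-like) frame vs. a noise (unvoiced-like) frame.
sr = 8000
t = np.arange(256) / sr
voiced = np.sin(2 * np.pi * 120 * t)
noise = np.random.default_rng(0).standard_normal(len(t))
```

Voiced frames should give PA near 1 and comparatively low entropy; noise-like frames give low PA and high entropy, which is why both serve as voicing cues.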

  5. Voicing Features • Correlation with a template plus DP alignment [Arcienega, ICSLP’02]: the Discrete Logarithm Fourier Transform (DLFT) is computed for the speech signal over the pitch frequency band. The template is the DLFT of an impulse train; the correlation for frame j is the correlation of the signal’s DLFT with the template, and dynamic programming yields the optimal correlation, with the maximum computed over the pitch region 80 Hz to 450 Hz.
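A rough sketch of the template idea: correlate the frame's spectrum with the spectrum of an impulse train at each candidate pitch, and keep the best match. The published method uses the DLFT and DP alignment across frames; this simplified single-frame FFT version and the 5 Hz candidate-pitch grid are assumptions.

```python
import numpy as np

def template_voicing(frame, sr, f_lo=80.0, f_hi=450.0):
    """Max normalized correlation between the frame's magnitude spectrum
    and impulse-train template spectra over candidate pitch values."""
    n = len(frame)
    win = np.hanning(n)
    spec = np.abs(np.fft.rfft(frame * win))
    spec = spec / (np.linalg.norm(spec) + 1e-12)
    best = 0.0
    for f0 in np.arange(f_lo, f_hi, 5.0):        # candidate pitch grid
        period = int(round(sr / f0))
        train = np.zeros(n)
        train[::period] = 1.0                    # impulse train at ~f0
        tspec = np.abs(np.fft.rfft(train * win))
        tspec = tspec / (np.linalg.norm(tspec) + 1e-12)
        best = max(best, float(spec @ tspec))    # cosine similarity
    return best

sr, n = 8000, 256
pulses = np.zeros(n)
pulses[::67] = 1.0                               # pulse train near 119 Hz
noise = np.random.default_rng(0).standard_normal(n)
```

A harmonic-rich pulse train matches one of the templates almost exactly, while noise correlates only weakly with any template.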

  6. Voicing Features • Preliminary exploration of voicing features: • Best feature combination: peak autocorrelation + entropy of cepstrum. • The autocorrelation and entropy features behave complementarily for high and low pitch: • Low pitch: the time periods are well separated, so the autocorrelation peak is well defined. • High pitch: the harmonics are well separated, so the cepstrum is well defined.

  7. Voicing Features • [Figure: voicing feature trajectories over an utterance, aligned with the phone sequence: w er k ay n d ax f s: aw th ax v dh ey ax r]

  8. Voicing Features • Integration of voicing features: • 1 - Juxtaposing voicing features: • Append the two voicing features to the traditional Mel cepstral feature vector (MFCC) plus delta and delta-delta features (MFCC+D+DD). • Voicing feature front-end: use the same frame rate as the MFCC front-end and optimize the temporal window duration.
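Juxtaposition is simply per-frame concatenation at a common frame rate. A toy sketch: the 39-dim MFCC+D+DD vector and the two voicing features match the slide, but the values here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100                                     # number of frames
mfcc_d_dd = rng.standard_normal((T, 39))    # stand-in MFCC+D+DD features
voicing = rng.random((T, 2))                # stand-in PA and EC per frame

# Juxtapose: same frame rate, so concatenate along the feature axis.
feats = np.concatenate([mfcc_d_dd, voicing], axis=1)
print(feats.shape)                          # 39 + 2 = 41 dims per frame
```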

  9. Voicing Features • Trained on a small Switchboard set (64 hours); tested on dev 2001. WER reported for both sexes. • Features: MFCC+D+DD, 25.6 ms frames every 10 ms. • VTL normalization and speaker mean/variance normalization. Genone acoustic model: non-cross-word, MLE trained, gender dependent. Bigram LM.

  10. Voicing Features • 2 - Voiced/unvoiced posterior features: • Use a posterior voicing probability as the feature, computed from a 2-state HMM. The juxtaposed feature dimension is 40. • Similar setup as before; males-only results. • Soft V/UV transitions may not be captured, because the posterior feature behaves much like a binary feature.
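The 2-state V/UV posterior can be sketched with a standard forward-backward pass. The Gaussian emission models on a peak-autocorrelation input and the sticky transition probability below are assumptions, not the system's actual settings.

```python
import numpy as np

def voicing_posterior(pa, trans=0.9):
    """P(voiced | PA sequence) from a 2-state (UV/V) HMM, forward-backward."""
    pa = np.asarray(pa, float)
    # Assumed emissions: voiced frames have high PA, unvoiced low.
    means, std = np.array([0.2, 0.8]), 0.2          # [unvoiced, voiced]
    lik = np.exp(-0.5 * ((pa[:, None] - means) / std) ** 2)
    A = np.array([[trans, 1 - trans], [1 - trans, trans]])
    T = len(pa)
    alpha = np.zeros((T, 2))
    beta = np.ones((T, 2))
    alpha[0] = 0.5 * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                           # forward pass
        alpha[t] = (alpha[t - 1] @ A) * lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):                  # backward pass
        beta[t] = A @ (beta[t + 1] * lik[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma[:, 1] / gamma.sum(axis=1)          # P(voiced) per frame

post = voicing_posterior([0.1, 0.15, 0.7, 0.9, 0.85, 0.2])
```

Note how the posterior saturates toward 0 or 1, consistent with the slide's observation that it behaves much like a binary feature.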

  11. Voicing Features • 3 - Window of voicing features + HLDA: • Juxtapose the MFCC features with a window of voicing features around the current frame. • Apply dimensionality reduction with HLDA; the final feature has 39 dimensions. • Same setup as before, with MFCC+D+DD+3rd diffs. Both sexes. • The new baseline is 1.5% abs. better; voicing improves it by a further 1%.
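The stacking-plus-projection step can be sketched as below. PCA via SVD stands in for HLDA here, an assumption: HLDA additionally uses the acoustic-model class labels, which a toy example lacks. The ±2-frame window yields 10 voicing values per frame, matching the "10 voicing features" in the summary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, W = 200, 2                           # frames, half-window size
mfcc = rng.standard_normal((T, 39))     # stand-in MFCC features
voicing = rng.random((T, 2))            # stand-in voicing features

# Stack voicing features over a +/-2 frame window: 5 frames x 2 = 10 values.
padded = np.pad(voicing, ((W, W), (0, 0)), mode="edge")
window = np.hstack([padded[i:i + T] for i in range(2 * W + 1)])  # (T, 10)
stacked = np.hstack([mfcc, window])                              # (T, 49)

# Project 49 -> 39 dims; PCA stands in for the HLDA transform.
centered = stacked - stacked.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:39].T                                   # (T, 39)
print(reduced.shape)
```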

  12. Voicing Features • 4 - Delta of voicing features + HLDA: • Use delta and delta-delta features instead of a window of voicing features; apply HLDA to the juxtaposed feature. • Same setup as before, with MFCC+D+DD+3rd diffs. Males only. • The reason may be that variability in the voicing features produces noisy deltas. • The HLDA weighting of the “window of voicing features” is similar to an average. • ---------------------------------------------------------------------------------- •  The best overall configuration was MFCC+D+DD+3rd diffs. and 10 voicing features + HLDA.
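The deltas tried here are the standard regression deltas; a sketch follows (the ±2-frame regression window is an assumption). On a step-like, near-binary voicing contour the deltas come out as sharp spikes, one way frame-to-frame variability turns into noisy deltas.

```python
import numpy as np

def deltas(x, n=2):
    """Regression-based delta features over a +/- n frame window."""
    x = np.asarray(x, float)
    denom = 2.0 * sum(i * i for i in range(1, n + 1))
    pad = np.pad(x, ((n, n),) + ((0, 0),) * (x.ndim - 1), mode="edge")
    d = np.zeros_like(x)
    for i in range(1, n + 1):
        d += i * (pad[n + i:n + i + len(x)] - pad[n - i:n - i + len(x)])
    return d / denom

step = np.array([[0.1], [0.1], [0.1], [0.9], [0.9], [0.9]])
print(deltas(step).ravel())   # spikes around the V/UV transition
```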

  13. Voicing Features • Voicing features in the SRI CTS Eval Sept ’03 system: • Adaptation of MMIE cross-word models with/without voicing features. • Used the best configuration of voicing features. • Trained on the full SWBD+CTRANS data; tested on EVAL’02. • Feature: MFCC+D+DD+3rd diffs.+HLDA. • Adaptation: 9 full-matrix MLLR transforms. • Adaptation hypotheses from an MLE non-cross-word model, PLP front end with voicing features.

  14. Voicing Features • Hypothesis Examples: • REF: OH REALLY WHAT WHAT KIND OF PAPER • HYP BASELINE: OH REALLY WHICH WAS KIND OF PAPER • HYP VOICING: OH REALLY WHAT WHAT KIND OF PAPER • REF: YOU KNOW HE S JUST SO UNHAPPY • HYP BASELINE: YOU KNOW YOU JUST I WANT HAPPY • HYP VOICING: YOU KNOW HE S JUST SO I WANT HAPPY

  15. Voicing Features • Error analysis: • In one experiment, 54% of speakers got a WER reduction (some up to 4% abs.); the remaining 46% showed a small WER increase. • A more detailed study of speaker-dependent performance is still needed. • Implementation: • Implemented a voicing feature engine in the DECIPHER system. • Fast computation: one FFT and two IFFTs per frame for both voicing features.

  16. Voicing Features • Conclusions: • Explored how to represent and integrate the voicing features for best performance. • Achieved a 1% abs. (~2% rel.) gain in the first pass (small training set), and >0.5% abs. (2% rel.) in higher rescoring passes of the DECIPHER LVCSR system (full training set). • Future work: • Further explore feature combination/selection. • Develop more reliable voicing features; the current features do not always reflect actual voicing activity. • Develop other phonetically derived features (vowels/consonants, occlusion, nasality, etc.).
