Voicing Features

Voicing Features

  • Horacio Franco, Martin Graciarena

  • Andreas Stolcke, Dimitra Vergyri, Jing Zheng

  • STAR Lab. SRI International



Phonetically Motivated Features

  • Problem:

    • Cepstral coefficients fail to capture many discriminative cues.

    • Front-end optimized for traditional Mel cepstral features.

    • Front-end parameters are a compromise solution for all phones.



Phonetically Motivated Features

  • Proposal:

    • Enrich Mel cepstral feature representation with phonetically motivated features from independent front-ends.

    • Optimize each specific front-end to improve discrimination.

    • Robust broad class phonetic features provide “anchor points” in acoustic phonetic decoding.

    • General framework for multiple phonetic features. First approach: voicing features.



Voicing Features

  • Voicing feature algorithms:

    • Normalized peak autocorrelation (PA): for each time frame, the maximum of the frame's normalized autocorrelation, taken over the lags corresponding to the pitch region 80 Hz to 450 Hz.

    • Entropy of the high-order cepstrum (EC) and of the linear spectrum (ES): normalize the cepstral (or spectral) magnitudes to a probability distribution Y, and compute its entropy H over the pitch region 80 Hz to 450 Hz; a voiced frame concentrates its energy at the pitch, giving low entropy.
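The slide gives only the 80-450 Hz band, so the frame length, windowing, FFT size, and normalization in this minimal NumPy sketch are assumptions:

```python
import numpy as np

def peak_autocorrelation(frame, sr, fmin=80.0, fmax=450.0):
    """Normalized peak autocorrelation (PA): the maximum autocorrelation
    over pitch-region lags, normalized by the zero-lag value."""
    x = frame - frame.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:]  # r[k] for k >= 0
    if r[0] <= 0:
        return 0.0
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(r) - 1)
    return float(r[lo:hi + 1].max() / r[0])

def spectral_entropy(frame, sr, fmin=80.0, fmax=450.0, nfft=1024):
    """Entropy of the linear spectrum (ES) in the pitch band: a voiced
    frame gives a peaky, low-entropy distribution."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), nfft))
    freqs = np.fft.rfftfreq(nfft, 1.0 / sr)
    band = spec[(freqs >= fmin) & (freqs <= fmax)]
    p = band / band.sum()
    return float(-(p * np.log(p + 1e-12)).sum())
```

A strongly periodic frame scores near 1 on PA and low on the entropy; a noise frame does the opposite.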



Voicing Features

  • Correlation with a template and DP alignment [Arcienega, ICSLP’02]. Compute the Discrete Logarithm Fourier Transform (DLFT) of the speech signal over the frequency band of interest; on the log-frequency axis, a change of pitch becomes a shift.

    • The template is the DLFT of an impulse train IT.

    • For each frame j, correlate the signal's DLFT with shifted versions of the template.

    • The DP optimal correlation aligns the per-frame correlations across time.

    • The max is computed over shifts spanning the pitch region 80 Hz to 450 Hz.
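A rough single-frame sketch of the template idea follows (the DP alignment across frames from Arcienega et al. is omitted); the grid size, band, and shift range are assumptions, and `f0_template` is a hypothetical choice:

```python
import numpy as np

def dlft_mag(x, sr, fmin=60.0, fmax=1000.0, nbins=128):
    """Magnitude of a discrete log-frequency transform: the DTFT sampled
    on log-spaced frequencies, so a pitch change becomes a shift."""
    freqs = np.geomspace(fmin, fmax, nbins)
    n = np.arange(len(x))
    basis = np.exp(-2j * np.pi * freqs[:, None] * n[None, :] / sr)
    return np.abs(basis @ x)

def template_correlation(frame, sr, f0_template=100.0):
    """Correlate the frame's DLFT with shifted copies of the DLFT of an
    impulse train; the best shift corresponds to the best pitch match."""
    X = dlft_mag(frame, sr)
    it = np.zeros(len(frame))
    it[::int(round(sr / f0_template))] = 1.0          # impulse train
    T = dlft_mag(it, sr)
    Xn = (X - X.mean()) / (X.std() + 1e-12)
    Tn = (T - T.mean()) / (T.std() + 1e-12)
    return max(float(Xn @ np.roll(Tn, s)) / len(Xn)   # shift = pitch scaling
               for s in range(-32, 33))
```

A harmonic frame correlates well with some shifted template; an aperiodic frame does not.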



Voicing Features

  • Preliminary exploration of voicing features:

    • Best feature combination: peak autocorrelation + entropy of cepstrum.

    • The autocorrelation and entropy features behave complementarily for high and low pitch:

      • Low pitch: the time-domain periods are well separated, so the autocorrelation is well defined.

      • High pitch: the harmonics are well separated, so the cepstrum is well defined.



Voicing Features

  • Graph of voicing features: [figure omitted: voicing feature trajectories aligned with the phone sequence]

w er k ay n d ax f s: aw th ax v dh ey ax r



Voicing Features

  • Integration of Voicing Features:

    • 1 - Juxtaposing Voicing Features:

    • Append the two voicing features to the traditional Mel cepstral feature vector (MFCC) plus its delta and delta-delta features (MFCC+D+DD).

    • Voicing feature front end: use the same frame rate as the MFCC front end and optimize the temporal window duration.
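The juxtaposition itself is just feature concatenation at a shared frame rate; a sketch, assuming the usual 39-dimensional MFCC+D+DD layout:

```python
import numpy as np

def juxtapose(mfcc_d_dd, voicing):
    """Append the voicing features to each MFCC+D+DD frame.
    mfcc_d_dd: (T, 39), voicing: (T, 2)  ->  (T, 41)."""
    assert len(mfcc_d_dd) == len(voicing)   # same frame rate by design
    return np.hstack([mfcc_d_dd, voicing])
```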



Voicing Features

  • Training: small Switchboard database (64 hours). Test: dev 2001. WER reported for both sexes.

  • Features: MFCC+D+DD, 25.6 msec. frame every 10 msec.

  • VTL normalization plus speaker mean and variance normalization. Genone acoustic model: non-cross-word, MLE-trained, gender-dependent. Bigram LM.



Voicing Features

  • 2 – Voiced/Unvoiced Posterior Features:

  • Use the posterior voicing probability as a feature, computed from a 2-state HMM. The juxtaposed feature dimension is 40.

  • Similar setup as before; males-only results.

  • Soft V/UV transitions may not be captured, because the posterior feature behaves much like a binary feature.
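The slide does not give the HMM details; a minimal forward-backward sketch with assumed emission log-likelihoods and a symmetric transition prior shows how such a posterior is obtained, and why it tends to saturate near 0 or 1:

```python
import numpy as np

def voicing_posteriors(loglik_v, loglik_u, p_stay=0.95):
    """P(voiced | whole utterance) per frame, from a 2-state
    (voiced/unvoiced) HMM via scaled forward-backward."""
    T = len(loglik_v)
    B = np.exp(np.stack([loglik_v, loglik_u], axis=1))   # emission likelihoods
    A = np.array([[p_stay, 1 - p_stay],
                  [1 - p_stay, p_stay]])                 # transition matrix
    alpha = np.zeros((T, 2))
    beta = np.ones((T, 2))
    alpha[0] = 0.5 * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                                # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        alpha[t] /= alpha[t].sum()                       # scale for stability
    for t in range(T - 2, -1, -1):                       # backward pass
        beta[t] = A @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post[:, 0] / post.sum(axis=1)
```

With clearly voiced or clearly unvoiced evidence the posterior pins close to 1 or 0, which is why the feature ends up behaving almost like a binary one.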



Voicing Features

  • 3 – Window of Voicing Features + HLDA:

  • Append to the MFCC features a window of voicing features around the current frame.

  • Apply dimensionality reduction with HLDA; the final feature has 39 dimensions.

  • Same setup as before, with MFCC+D+DD+3rd diffs. Both sexes.

  • The baseline is 1.5% absolute better than before; voicing improves it by a further 1%.
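Stacking a +/-2 frame window of the two voicing features yields the 10 voicing dimensions mentioned later; the HLDA transform itself is estimated during training, so a placeholder projection matrix stands in for it here (the window size and edge padding are assumptions):

```python
import numpy as np

def stack_voicing_window(mfcc, voicing, k=2):
    """Append a +/-k frame window of voicing features to each MFCC frame;
    edges are padded by repetition. mfcc: (T, 39), voicing: (T, 2)."""
    T = len(voicing)
    padded = np.pad(voicing, ((k, k), (0, 0)), mode="edge")
    window = np.hstack([padded[i:i + T] for i in range(2 * k + 1)])  # (T, 10)
    return np.hstack([mfcc, window])                                 # (T, 49)

def hlda_project(features, hlda_matrix):
    """Reduce to the final 39 dimensions; hlda_matrix (39 x 49) would come
    from HLDA training, which is not shown here."""
    return features @ hlda_matrix.T
```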




Voicing Features

  • 4 – Delta of Voicing Features + HLDA:

  • Use delta and delta-delta features instead of a window of voicing features; apply HLDA to the juxtaposed feature.

  • Same setup as before, MFCC+D+DD+3rd diffs. Males only.

  • A likely reason for the worse result: variability in the voicing features produces noisy deltas.

    • The HLDA weighting of the “window of voicing features” is similar to an average.

  • ----------------------------------------------------------------------------------

  •  The best overall configuration was MFCC+D+DD+3rd diffs. and 10 voicing features + HLDA.
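For reference, the usual regression-window delta formula (the exact window the system used is not stated, so k=2 is an assumption); delta-deltas are deltas of the deltas:

```python
import numpy as np

def deltas(feat, k=2):
    """HTK-style regression deltas over a +/-k frame window,
    with edges padded by repetition. feat: (T, D)."""
    T = len(feat)
    padded = np.pad(feat, ((k, k), (0, 0)), mode="edge")
    num = sum(i * (padded[k + i:k + i + T] - padded[k - i:k - i + T])
              for i in range(1, k + 1))
    return num / (2.0 * sum(i * i for i in range(1, k + 1)))
```

On a slowly varying feature this is a smoothed slope; on a noisy voicing feature the same formula amplifies frame-to-frame jitter.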



Voicing Features

  • Voicing Features in SRI CTS Eval. Sept 03 System:

    • Adaptation of MMIE cross-word models with and without voicing features.

    • Used the best configuration of voicing features.

    • Training on the full SWBD+CTRANS data; test on EVAL’02.

    • Feature: MFCC+D+DD+3rd diffs.+HLDA

    • Adaptation: 9 full-matrix MLLR transforms.

    • Adaptation hypotheses from an MLE non-cross-word model and a PLP front end with voicing features.



Voicing Features

  • Hypothesis Examples:

  • REF: OH REALLY WHAT WHAT KIND OF PAPER

  • HYP BASELINE: OH REALLY WHICH WAS KIND OF PAPER

  • HYP VOICING: OH REALLY WHAT WHAT KIND OF PAPER

  • REF: YOU KNOW HE S JUST SO UNHAPPY

  • HYP BASELINE: YOU KNOW YOU JUST I WANT HAPPY

  • HYP VOICING: YOU KNOW HE S JUST SO I WANT HAPPY



Voicing Features

  • Error analysis:

    • In one experiment, 54% of speakers got a WER reduction (some up to 4% absolute); the remaining 46% had a small WER increase.

      • A more detailed study of speaker-dependent performance is still needed.

  • Implementation:

    • Implemented a voicing feature engine in the DECIPHER system.

      • Fast computation: one FFT and two inverse FFTs per frame yield both voicing features.
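That per-frame cost is plausible because the autocorrelation is the inverse FFT of the power spectrum (Wiener-Khinchin) and the real cepstrum is the inverse FFT of the log spectrum, so one forward FFT can feed both. A sketch of that pairing (the exact normalizations used in DECIPHER are not given, so these are assumptions):

```python
import numpy as np

def voicing_pair(frame, sr, fmin=80.0, fmax=450.0):
    """Both voicing features from one FFT and two inverse FFTs:
    |X|^2 -> IFFT = autocorrelation; log|X|^2 -> IFFT = real cepstrum."""
    n = 2 * len(frame)                       # zero-pad: linear, not circular
    X = np.fft.rfft(frame - frame.mean(), n)             # the one FFT
    r = np.fft.irfft(np.abs(X) ** 2, n)                  # IFFT no. 1
    c = np.fft.irfft(np.log(np.abs(X) ** 2 + 1e-12), n)  # IFFT no. 2
    lo, hi = int(sr / fmax), int(sr / fmin)  # pitch lags / quefrencies
    pa = float(r[lo:hi + 1].max() / r[0]) if r[0] > 0 else 0.0
    q = np.abs(c[lo:hi + 1])
    p = q / (q.sum() + 1e-12)
    ec = float(-(p * np.log(p + 1e-12)).sum())
    return pa, ec
```

A periodic frame gives a high autocorrelation peak and a concentrated (low-entropy) cepstrum in the pitch quefrency range; noise gives neither.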



Voicing Features

  • Conclusions:

    • Explored how best to represent and integrate the voicing features.

    • Achieved a 1% absolute (~2% relative) gain in the first pass (small training set), and a >0.5% absolute (2% relative) gain in the higher rescoring passes of the DECIPHER LVCSR system (full training set).

  • Future work:

    • Further explore feature combination and selection.

    • Develop more reliable voicing features; the current features do not always reflect actual voicing activity.

    • Develop other phonetically derived features (vowels/consonants, occlusion, nasality, etc.).

