slide1

Voicing Features

  • Horacio Franco, Martin Graciarena
  • Andreas Stolcke, Dimitra Vergyri, Jing Zheng
  • STAR Lab, SRI International
slide2
Phonetically Motivated Features
  • Problem:
    • Cepstral coefficients fail to capture many discriminative cues.
    • Front-end optimized for traditional Mel cepstral features.
    • Front-end parameters are a compromise solution for all phones.
slide3

Phonetically Motivated Features

  • Proposal:
    • Enrich Mel cepstral feature representation with phonetically motivated features from independent front-ends.
    • Optimize each specific front-end to improve discrimination.
    • Robust broad class phonetic features provide “anchor points” in acoustic phonetic decoding.
    • General framework for multiple phonetic features. First approach: voicing features.
slide4

Voicing Features

  • Voicing feature algorithms (a sketch follows this list):
    • Normalized peak autocorrelation (PA): for a time frame x, the maximum of the autocorrelation of x normalized by its zero-lag value, PA = max_t R_x(t) / R_x(0).
      • The max is computed over lags in the pitch region, 80 Hz to 450 Hz.
    • Entropy of the high-order cepstrum (EC) and of the linear spectrum (ES): if Y is the cepstrum (or spectrum) restricted to the pitch region and normalized to sum to one, and H is the entropy of Y, then the feature is H.
      • The entropy is computed in the pitch region, 80 Hz to 450 Hz.
      • Voiced frames concentrate their energy in a sharp pitch peak, so their entropy is low.
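
A minimal sketch of these two frame-level features, assuming a NumPy array frame holding one windowed speech frame at sampling rate sr; the frame length, windowing, and flooring constants are illustrative, not SRI's exact front end.

```python
import numpy as np

def voicing_features(frame, sr, f_lo=80.0, f_hi=450.0):
    """Normalized peak autocorrelation (PA) and spectral entropy (ES)
    for one windowed frame; pitch search region 80-450 Hz."""
    n = len(frame)

    # --- Normalized peak autocorrelation (PA) ---
    ac = np.correlate(frame, frame, mode="full")[n - 1:]   # R(0..n-1)
    lag_lo = int(sr / f_hi)              # shortest pitch period in samples
    lag_hi = min(int(sr / f_lo), n - 1)  # longest pitch period in samples
    pa = ac[lag_lo:lag_hi + 1].max() / (ac[0] + 1e-10)

    # --- Entropy of the linear spectrum (ES) in the pitch region ---
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    band = spec[(freqs >= f_lo) & (freqs <= f_hi)]
    p = band / (band.sum() + 1e-10)      # treat the band as a distribution Y
    es = -np.sum(p * np.log(p + 1e-10))  # entropy H(Y): low for voiced frames

    return pa, es
```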
slide5

Voicing Features

  • Correlation with a template and DP alignment [Arcienega, ICSLP’02] (a rough sketch follows this list):
    • The discrete logarithmic Fourier transform (DLFT) of the speech signal is computed over the pitch frequency band; on a logarithmic frequency axis a change in pitch appears as a shift of the harmonic pattern.
    • The template is the DLFT of an impulse train (IT).
    • For each frame j, the correlation of the frame’s DLFT with the template is computed.
    • Dynamic programming over frames yields the optimal correlation path.
    • The max is computed in the pitch region, 80 Hz to 450 Hz.
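
A rough sketch of the template-correlation idea, not the exact DLFT formulation: the magnitude spectrum is resampled onto a logarithmic frequency axis (so a pitch change becomes approximately a shift of the harmonic pattern), correlated with an impulse-train template, and the best score over candidate pitches in 80-450 Hz is kept. The DP alignment across frames is omitted, and n_bins, the candidate grid, and the normalization are illustrative.

```python
import numpy as np

def template_correlation(frame, sr, f_lo=80.0, f_hi=450.0, n_bins=128):
    """Best correlation of the log-frequency spectrum with a harmonic
    (impulse-train) template over candidate pitches in 80-450 Hz."""
    n = len(frame)
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)

    # Resample the spectrum onto a logarithmic frequency axis.
    log_grid = np.logspace(np.log10(f_lo), np.log10(sr / 2), n_bins)
    log_spec = np.interp(log_grid, freqs, spec)
    log_spec = (log_spec - log_spec.mean()) / (log_spec.std() + 1e-10)

    best = -np.inf
    for f0 in np.linspace(f_lo, f_hi, 60):            # candidate pitch values
        harmonics = np.arange(f0, sr / 2, f0)         # impulse train in frequency
        template = np.zeros(n_bins)
        idx = np.clip(np.searchsorted(log_grid, harmonics), 0, n_bins - 1)
        template[idx] = 1.0
        template = (template - template.mean()) / (template.std() + 1e-10)
        best = max(best, float(np.dot(log_spec, template)) / n_bins)
    return best
```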
slide6

Voicing Features

  • Preliminary exploration of voicing features:
    • Best feature combination: peak autocorrelation + entropy of cepstrum.
    • Complementary behavior of the autocorrelation and entropy features for high and low pitch:
      • Low pitch: time periods are well separated, therefore the autocorrelation is well defined.
      • High pitch: harmonics are well separated and the cepstrum is well defined.
slide7

Voicing Features

  • Graph of voicing features: [figure] voicing feature trajectories over an example utterance, aligned with the phone sequence “w er k ay n d ax f s: aw th ax v dh ey ax r”.

slide8

Voicing Features

  • Integration of Voicing Features:
    • 1 - Juxtaposing Voicing Features:
    • Juxtapose the two voicing features with the traditional Mel cepstral feature vector (MFCC) plus delta and delta-delta features (MFCC+D+DD), as sketched below.
    • Voicing feature front end: use the same frame rate as the MFCC front end and optimize the temporal window duration.
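
A minimal sketch of the juxtaposition itself, assuming hypothetical per-frame matrices mfcc_dd (MFCC+D+DD, 39 dimensions) and voicing (the two voicing features) computed at the same 10 ms frame rate; the random data stands in for real front-end output.

```python
import numpy as np

# Hypothetical feature matrices; both front ends run at the same 10 ms frame rate.
T = 500
mfcc_dd = np.random.randn(T, 39)   # MFCC + delta + delta-delta (39 dims)
voicing = np.random.randn(T, 2)    # [peak autocorrelation, cepstral entropy]

# Juxtapose: append the two voicing features to each cepstral frame -> 41 dims.
features = np.hstack([mfcc_dd, voicing])
assert features.shape == (T, 41)
```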
slide9

Voicing Features

  • Train on a small Switchboard database (64 hours). Test on dev2001. WER reported for both sexes.
  • Features: MFCC+D+DD, 25.6 msec frame every 10 msec.
  • VTL normalization and speaker mean and variance normalization. Genone acoustic model: non-cross-word, MLE-trained, gender-dependent. Bigram LM.
slide10

Voicing Features

  • 2 – Voiced/Unvoiced Posterior Features:
  • Use the posterior voicing probability as a feature, computed from a two-state (voiced/unvoiced) HMM; the juxtaposed feature dimension is 40 (a sketch of the posterior computation follows).
  • Similar setup as before; males-only results.
  • Soft voiced/unvoiced transitions may not be captured because the posterior feature behaves almost like a binary feature.
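
A sketch of how a per-frame voicing score can be turned into a posterior voicing probability with a two-state HMM via the forward-backward algorithm; the transition probabilities and Gaussian emission parameters below are illustrative placeholders, not the trained values used in the system.

```python
import numpy as np

def voicing_posterior(pa, trans=None):
    """Per-frame posterior P(voiced) from a 2-state HMM
    (state 0 = unvoiced, state 1 = voiced) via forward-backward.
    `pa` is the per-frame peak-autocorrelation feature."""
    T = len(pa)
    if trans is None:
        trans = np.array([[0.95, 0.05],     # illustrative transition probabilities
                          [0.05, 0.95]])
    init = np.array([0.5, 0.5])

    # Illustrative Gaussian emission likelihoods for the PA feature.
    means, stds = np.array([0.2, 0.8]), np.array([0.15, 0.15])
    lik = np.exp(-0.5 * ((pa[:, None] - means) / stds) ** 2) / stds   # (T, 2)

    # Scaled forward pass.
    alpha = np.zeros((T, 2)); scale = np.zeros(T)
    alpha[0] = init * lik[0]; scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * lik[t]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]

    # Scaled backward pass.
    beta = np.zeros((T, 2)); beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (trans @ (lik[t + 1] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma[:, 1]    # posterior probability of the voiced state
```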
slide11

Voicing Features

  • 3 – Window of Voicing Features + HLDA:
  • Juxtapose the MFCC features with a window of voicing features around the current frame (see the sketch below).
  • Apply dimensionality reduction with HLDA; the final feature has 39 dimensions.
  • Same setup as before, with MFCC+D+DD+3rd diffs. Both sexes.
  • The baseline is 1.5% absolute better than before; voicing improves it by a further 1%.
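
A sketch of the windowing and projection steps, assuming two voicing features per frame, a ±2-frame window (10 voicing dimensions), a 52-dimensional MFCC+D+DD+3rd-diffs vector, and a random matrix standing in for the HLDA transform, which in the real system is estimated from training data.

```python
import numpy as np

def stack_window(x, k=2):
    """Stack frames t-k .. t+k of x (T x d) into one vector per frame,
    repeating the first/last frame at the utterance edges."""
    T, d = x.shape
    padded = np.vstack([np.repeat(x[:1], k, axis=0), x,
                        np.repeat(x[-1:], k, axis=0)])
    return np.hstack([padded[i:i + T] for i in range(2 * k + 1)])  # (T, (2k+1)*d)

T = 500
mfcc = np.random.randn(T, 52)      # MFCC+D+DD+3rd diffs (hypothetical, 52 dims)
voicing = np.random.randn(T, 2)    # two voicing features per frame

windowed = stack_window(voicing, k=2)     # 5-frame window -> 10 dims
joint = np.hstack([mfcc, windowed])       # 52 + 10 = 62 dims

# Placeholder for the HLDA transform (estimated from training data in practice).
A = np.random.randn(62, 39)
final = joint @ A                         # 39-dimensional final feature
assert final.shape == (T, 39)
```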

slide12

Voicing Features

  • 4 – Delta of Voicing Features + HLDA:
  • Use delta and delta-delta voicing features instead of a window of voicing features (a sketch of the delta computation follows this list). Apply HLDA to the juxtaposed feature.
  • Same setup as before, with MFCC+D+DD+3rd diffs. Males only.
  • The reason this did not help may be that variability in the voicing features produces noisy deltas.
    • The HLDA weighting of the “window of voicing features” is similar to an average.
  • ----------------------------------------------------------------------------------
  • The best overall configuration was MFCC+D+DD+3rd diffs. and 10 voicing features + HLDA.
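
A sketch of the delta computation referred to above, using the standard regression formula over a ±2-frame window; frame-to-frame noise in the voicing features feeds directly into these deltas, which is the suspected problem.

```python
import numpy as np

def deltas(x, k=2):
    """Standard regression deltas over a +/-k frame window:
    d[t] = sum_i i * (x[t+i] - x[t-i]) / (2 * sum_i i^2)."""
    T = x.shape[0]
    padded = np.vstack([np.repeat(x[:1], k, axis=0), x,
                        np.repeat(x[-1:], k, axis=0)])
    denom = 2 * sum(i * i for i in range(1, k + 1))
    return sum(i * (padded[k + i:k + i + T] - padded[k - i:k - i + T])
               for i in range(1, k + 1)) / denom

voicing = np.random.randn(500, 2)   # per-frame voicing features (hypothetical)
d = deltas(voicing)                 # delta voicing features
dd = deltas(d)                      # delta-delta voicing features
```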
slide13

Voicing Features

  • Voicing Features in the SRI CTS Eval. Sept. ’03 System:
    • Adaptation of MMIE cross-word models with and without voicing features.
    • Used the best configuration of voicing features.
    • Train on the full SWBD+CTRANS data. Test on EVAL’02.
    • Feature: MFCC+D+DD+3rd diffs. + HLDA.
    • Adaptation: 9 full-matrix MLLR transforms.
    • Adaptation hypotheses from: MLE non-cross-word model, PLP front end with voicing features.
slide14

Voicing Features

  • Hypothesis Examples:
  • REF: OH REALLY WHAT WHAT KIND OF PAPER
  • HYP BASELINE: OH REALLY WHICH WAS KIND OF PAPER
  • HYP VOICING: OH REALLY WHAT WHAT KIND OF PAPER
  • REF: YOU KNOW HE S JUST SO UNHAPPY
  • HYP BASELINE: YOU KNOW YOU JUST I WANT HAPPY
  • HYP VOICING: YOU KNOW HE S JUST SO I WANT HAPPY
slide15

Voicing Features

  • Error analysis:
    • In one experiment, 54% of the speakers got a WER reduction (some up to 4% absolute); the remaining 46% showed a small WER increase.
      • A more detailed study of speaker-dependent performance is still needed.
  • Implementation:
    • Implemented a voicing feature engine in the DECIPHER system.
      • Fast computation: one FFT and two IFFTs per frame yield both voicing features (see the sketch below).
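
A sketch of how a single forward FFT and two inverse FFTs per frame can yield both voicing features, as stated above: the inverse FFT of the power spectrum gives the autocorrelation, and the inverse FFT of the log power spectrum gives the real cepstrum. The zero-padding, flooring constants, and entropy details are illustrative.

```python
import numpy as np

def fast_voicing_features(frame, sr, f_lo=80.0, f_hi=450.0):
    """Both voicing features from one FFT and two IFFTs per frame."""
    n = len(frame)
    spec = np.abs(np.fft.rfft(frame, n=2 * n)) ** 2       # one FFT (zero-padded)

    ac = np.fft.irfft(spec)[:n]                           # IFFT #1: autocorrelation
    cep = np.fft.irfft(np.log(spec + 1e-10))[:n]          # IFFT #2: real cepstrum

    lo, hi = int(sr / f_hi), min(int(sr / f_lo), n - 1)   # pitch lag/quefrency range
    pa = ac[lo:hi + 1].max() / (ac[0] + 1e-10)            # normalized peak autocorrelation

    band = np.abs(cep[lo:hi + 1])                         # cepstrum in the pitch region
    p = band / (band.sum() + 1e-10)
    ec = -np.sum(p * np.log(p + 1e-10))                   # entropy of the cepstrum
    return pa, ec
```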
slide16

Voicing Features

  • Conclusions:
    • Explored how to represent/integrate the voicing features for best performance.
    • Achieved a 1% absolute (~2% relative) gain in the first pass (using the small training set), and a >0.5% absolute (2% relative) gain in higher rescoring passes (using the full training set) of the DECIPHER LVCSR system.
  • Future work:
    • Further explore feature combination and selection.
    • Develop more reliable voicing features; the current features do not always reflect actual voicing activity.
    • Develop other phonetically derived features (vowels/consonants, occlusion, nasality, etc.).