Application of Shifted Delta Cepstral Features for GMM Language Identification. Masters Thesis Defense Jonathan J. Lareau Rochester Institute of Technology Department of Computer Science Tuesday, Oct. 24, 2006 12:00 pm. Full report can be found at: www.jonlareau.com/JonathanLareau.pdf.


Application of Shifted Delta Cepstral Features for GMM Language Identification

Masters Thesis Defense

Jonathan J. Lareau
Rochester Institute of Technology
Department of Computer Science

Tuesday, Oct. 24, 2006

12:00 pm

Full report can be found at: www.jonlareau.com/JonathanLareau.pdf

What’s the use of LID?
  • Pre-Processor for Automatic Speech Recognition Algorithms
  • Routing speech signals for:
    • Telecommunications
    • Multimedia
    • Human-Computer interfaces
    • Security applications.
Overview
  • Study the use of different types of Shifted Delta (SD) feature vectors for telephone speech language identification (LID).
    • 6 types of feature vectors
    • Uniform GMM pattern recognition algorithm.
Additionally…
  • Heuristic speech enhancement pre-processor
  • No phonemically labeled training data
  • Original code written in MATLAB 7
    • Also Uses:
      • NETLAB [Nabney, 2002]
      • RASTA-MAT [Ellis, 2005]
Methods for LID
  • Phonemic Recognition followed by Language Modeling (PRLM) [Zissman, 1993]
    • Predominant method for ASR/LID
    • Works well, but is laborious and time-consuming
    • Difficult to extend
Why use Gaussian Mixture Models?
  • Alternative to PRLM methods
    • Avoids the laborious phonemic labeling required by PRLM techniques
    • Comparatively easy to extend
Method - Feature Vectors
  • Pre-Processing
    • Pre-Emphasis
    • Cepstral Speech Enhancement
  • Base Feature Extraction
  • Post-Processing
    • Cepstral Mean Subtraction
  • Shifted Delta Operation
  • Silence Removal
Pre-Emphasis Filtering (Pre-Processor)
  • Speech has a natural attenuation of approximately 20 dB/decade [Picone, 1993]
  • Pre-emphasis filter, $H_{pre}(z)$, flattens the speech spectrum

$H_{pre}(z) = 1 + a_{pre} z^{-1}$
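
In the time domain the pre-emphasis filter is a two-tap FIR filter, y[n] = x[n] + a_pre * x[n-1]. A minimal sketch follows. The default value a_pre = -0.95 is an assumption (a negative coefficient is what boosts high frequencies); the slides do not state the value used in the thesis.

```python
def pre_emphasis(signal, a_pre=-0.95):
    """Apply H_pre(z) = 1 + a_pre * z^-1, i.e. y[n] = x[n] + a_pre * x[n-1].

    a_pre = -0.95 is an assumed, commonly used value, not one from the slides.
    """
    out = []
    prev = 0.0  # x[-1] is taken to be zero
    for x in signal:
        out.append(x + a_pre * prev)
        prev = x
    return out


# A constant (low-frequency) input is strongly attenuated after the first sample.
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```

With a negative coefficient the filter suppresses slowly varying content and passes rapid changes, which offsets the spectral roll-off of speech.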

Base Feature Vector Extraction
  • All based on Cepstral Coefficients
    • Linear Predictive (LP-CC)
      • All-pole filter models formant envelope
    • Mel-Frequency (MF-CC)
      • Psycho-Acoustic frequency scaling
        • Mimic Response of human ear
    • Perceptual Linear Prediction (PLP-CC)
      • Psycho-Acoustic scaling followed by Linear Prediction
Cepstral Mean Subtraction (Post-Processor)
  • A simplistic method to reduce channel effects.
  • Once Cepstral feature vectors are calculated using MF, LP, or PLP:
    • mean feature vector for the entire utterance is subtracted off
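
A stationary channel multiplies the speech spectrum, which becomes an additive constant in the cepstral domain, so subtracting the utterance mean removes it. The sketch below illustrates this on a toy utterance. The function name and the list-of-lists frame layout are illustrative choices, not the thesis code.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-utterance mean vector from every cepstral frame.

    frames: list of equal-length feature vectors (one per analysis frame).
    """
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(dim)]
    return [[f[i] - mean[i] for i in range(dim)] for f in frames]


utterance = [[1.0, 2.0], [3.0, 6.0]]
normalized = cepstral_mean_subtraction(utterance)
```

After the subtraction, each coefficient has zero mean over the utterance.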
Shifted Delta Cepstra
  • Pseudo-prosodic feature vectors from acoustic feature vectors
    • Quick approximation to true prosodic modeling
  • ‘Stacks’ blocks of evenly spaced and differenced feature vectors.
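
The stacking can be sketched as follows. This uses one common formulation from the SDC literature, which the slide does not spell out: for frame t, block i holds the difference c(t + i*P + d) - c(t + i*P - d), and k such blocks are concatenated. The parameter names d, P, k and their defaults are assumptions, and frames near the edges are simply skipped.

```python
def shifted_delta(frames, d=1, P=3, k=7):
    """Stack k delta blocks per frame (assumed SDC formulation).

    Block i for frame t is frames[t + i*P + d] - frames[t + i*P - d].
    Only frames for which every index is in range produce an output vector.
    """
    total = len(frames)
    out = []
    for t in range(d, total - (k - 1) * P - d):
        stacked = []
        for i in range(k):
            ahead = frames[t + i * P + d]
            behind = frames[t + i * P - d]
            stacked.extend(a - b for a, b in zip(ahead, behind))
        out.append(stacked)
    return out


# Toy input: a linear ramp, so every delta is constant.
ramp = [[float(n), 2.0 * n] for n in range(10)]
sdc = shifted_delta(ramp, d=1, P=2, k=3)
```

Each output vector has k times the base dimension, so it summarizes how the spectrum evolves over several frames ahead of the current one.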
Method - Classification Task
  • Each language uses a different distribution over the feature space.
  • Difficult because:
    • Shape alone doesn’t distinguish between languages
    • Density information along feature space surface needs to be included
…so we use Gaussian Mixture Models

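The multivariate normal (M-V-N) PDF for component j, with mean vector $\mu_j$ and covariance matrix $\Sigma_j$ in $d$ dimensions, is:

$$p(x|j) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_j|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_j)^{T}\,\Sigma_j^{-1}\,(x-\mu_j)\right)$$

A Gaussian Mixture Model (GMM) is then:

$$p(x) = \sum_{j=1}^{M} w(j)\,p(x|j)$$

With conditions:

$$\sum_{j=1}^{M} w(j) = 1, \qquad 0 \le w(j) \le 1$$

  • w(j) is the mixture weight (prior) for component j
  • M is the model order
  • p(x|j) is the M-V-N PDF for the jth component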
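
A minimal sketch of how such models could score an utterance: each language has its own set of weights w(j), means and variances, and the language with the highest total log-likelihood is chosen. The diagonal covariances, the maximum-likelihood decision rule, and the hand-set toy parameters are assumptions for illustration. Training (done with NETLAB in the thesis) is not shown.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of a multivariate normal PDF with diagonal covariance (assumed form)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, weights, means, variances):
    """Sum over frames of log p(x), where p(x) = sum_j w(j) * p(x|j).

    Weights are assumed to be strictly positive.
    """
    total = 0.0
    for x in frames:
        terms = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        peak = max(terms)  # log-sum-exp trick for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


def identify(frames, models):
    """Return the language whose GMM gives the highest log-likelihood."""
    return max(models, key=lambda lang: gmm_log_likelihood(frames, *models[lang]))


# Toy, hand-set models: (weights, means, variances) per hypothetical language.
models = {
    "lang_a": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "lang_b": ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
guess = identify([[0.2, 0.1], [0.9, 1.2]], models)
```

Working in the log domain avoids underflow when the per-frame likelihoods of a long utterance are combined.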
Results
  • OGI Multilanguage Telephone Speech Database [Muthusamy et al., 1992]
  • Mutually Exclusive training and test sets.
  • SD-LP-CC performed best on 3-Language and 10-Language tasks
  • SD coefficients increased accuracy by ~10% over standard LP and MF feature vectors
  • Results were consistent and repeatable
3-Language Task

Feature vector   Pre/Post Disabled   Pre/Post Enabled
LP-CC            59.40%              47.15%
SD-LP-CC         71.95%              65.49%
MF-CC            54.86%              52.50%
SD-MF-CC         67.92%              63.26%
PLP-CC           61.51%              47.64%
SD-PLP-CC        63.31%              61.20%

Accuracy Vs. Amount of Training Data

[Figure: accuracy versus amount of training data, with an approximate trend line]

Outliers are due to high mixture order, a low amount of training data, and the stochastic nature of data selection and the NETLAB training algorithm.

The cutoff point for the amount of training data that must be used in order to assure accurate results increases with the mixture order.

Comparisons
  • Results agree with previous work on SDC
    • Reported accuracies between 70% and 75% [Deller et al., 2002] [Kohler, 2002] [Reynolds, 2002]
  • This thesis specifically addresses effects of different derivations of SDC on LID
Comparisons, cont’d…
  • Also [Wang & Qu, 2003]
    • Used GMBM-UBBM
      • 70.128% accuracy
      • 128 Mixtures
    • Our algorithm - 71.13% with:
      • Reduced mixture order
      • Reduced training data
      • Without using bigram or universal background modeling
Conclusions
  • SDC Features improved LID performance over standard features across all categories
  • SD-LP-CC performed best overall on both the 3-language and 10-language tasks.
Future Work
  • Gender Specific Modeling
  • Perform hill climbing
  • Add UBM
  • KL - Divergence
Bibliography
  • Daniel P. W. Ellis. PLP and RASTA (and MFCC, and inversion) in Matlab. http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/, 2005.
  • Ian T. Nabney. NETLAB: Algorithms for Pattern Recognition. Advances in Pattern Recognition. Springer, 2002.
  • H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am., 87(4):1738-1752, Apr. 1990.
  • Y. K. Muthusamy, R. A. Cole, and B. T. Oshika. The OGI multi-language telephone speech corpus. Proceedings of the 1992 International Conference on Spoken Language Processing (ICSLP 92), Alberta, 1992.
  • Dan Qu and Bingxi Wang. Automatic language identification based on GMBM-UBBM. Proceedings of the 2003 International Conference on Natural Language Processing and Knowledge Engineering, 26-29 Oct. 2003, pages 722-727.
  • J. R. Deller Jr., P. A. Torres-Carrasquillo, and D. A. Reynolds. Language identification using Gaussian mixture model tokenization. Proc. International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, IEEE, 2002, pages 757-760.
  • M. K. Kohler and M. A. Kennedy. Language identification using shifted delta cepstra. The 2002 45th Midwest Symposium on Circuits and Systems (MWSCAS-2002), 2002, vol. 3, pages 69-72.
  • D. A. Reynolds, M. A. Kohler, R. J. Greene, J. R. Deller Jr., E. Singer, and P. A. Torres-Carrasquillo. Approaches to language identification using Gaussian mixture models and shifted delta cepstral features. Proc. International Conference on Spoken Language Processing, Denver, CO, ISCA, pages 33-36, 82-92, Sept. 2002.
  • M. Zissman. Automatic language identification using Gaussian mixture and hidden Markov models. ICASSP, 1993.
  • J. Picone. Signal modeling techniques in speech recognition. Proc. IEEE, 81:1215-1247, Sept. 1993.
Supplemental Slides
  • Speech Production
  • Cepstral Coefficients
  • Feature Calculation
    • Linear Prediction
    • Psycho-Acoustic Scaling (Mel-Frequency)
    • Perceptual Linear Prediction
  • 3-Language Confusion Matrices
Source-Filter Model

http://www.spectrum.uni-bielefeld.de/~thies/HTHS_WiSe2005-06/source-filter.jpg

Cepstral Coefficients
  • Compact way of representing the formant envelope of a speech signal.

Formant envelope information is encoded here in the first few Cepstral coefficients.

We use the first 12 coefficients, but omit the very first.

Linear Prediction
  • Models Formant Envelope as all pole filter.

Where S(z) is the Speech waveform, E(z) is the excitation signal and A(z) is the all-pole filter:

Linear Prediction
  • Linear Predictive coefficients can be found by solving:

where $R = [R_1, R_2, \ldots, R_{P+1}]$ is the auto-correlation vector, $a = [a_1, a_2, \ldots, a_{P+1}]$ is the Linear Predictive coefficient vector, $P$ denotes the model order, $[\cdots]^{-1}$ denotes the matrix inverse, and $*$ denotes the complex conjugate operation.
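
As an illustration, the autocorrelation normal equations can also be solved without forming an explicit matrix inverse by using the Levinson-Durbin recursion. This is a swapped-in standard technique, not necessarily what the thesis code does. The sketch assumes a real-valued signal and the predictor convention x_hat[n] = sum_k a[k] * x[n-k], and it returns P predictor coefficients rather than the P+1 entries of the slide's vector.

```python
def autocorrelation(x, max_lag):
    """Biased (unnormalized) autocorrelation r[0..max_lag] of a real signal."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for predictor coefficients.

    Assumed convention: x_hat[n] = sum_{k=1..order} a[k] * x[n-k].
    Returns (coefficients a[1..order], final prediction error).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        refl = acc / err  # reflection coefficient for this stage
        new_a = a[:]
        new_a[m] = refl
        for k in range(1, m):
            new_a[k] = a[k] - refl * a[m - k]
        a = new_a
        err *= 1.0 - refl * refl
    return a[1:], err


# Toy signal: a decaying exponential, which a first-order predictor models well.
signal = [0.5 ** n for n in range(50)]
coeffs, residual = levinson_durbin(autocorrelation(signal, 2), 2)
```

For this toy signal the first coefficient comes out close to 0.5 and the second close to zero, as expected for a single-pole model.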

Calculation of LP-CC’s
  • Create an N-point frequency spectrum by evaluating
  • Then find the Cepstral Coefficients
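
The formulas for these two steps did not survive in this transcript, so the following is only a sketch under stated assumptions: the all-pole model is taken as 1/A(z) with A(z) = 1 - sum_k a[k] z^-k, the spectrum is sampled at N points on the unit circle, and the cepstrum is taken as the inverse DFT of the log-magnitude spectrum (the real cepstrum). The function name and defaults are hypothetical.

```python
import cmath
import math


def lp_cepstrum(a, n_points=64, num_ceps=12):
    """Real cepstrum c[0..num_ceps] of the all-pole spectrum 1/A(z).

    a: predictor coefficients a[1..P], with A(z) = 1 - sum_k a[k] z^-k (assumed).
    """
    log_mag = []
    for k in range(n_points):
        w = 2.0 * math.pi * k / n_points
        big_a = 1.0 - sum(ak * cmath.exp(-1j * w * (i + 1)) for i, ak in enumerate(a))
        log_mag.append(-math.log(abs(big_a)))  # log |1 / A(e^{jw})|
    # The log magnitude is real and even, so the inverse DFT reduces to a cosine sum.
    return [
        sum(log_mag[k] * math.cos(2.0 * math.pi * k * m / n_points) for k in range(n_points))
        / n_points
        for m in range(num_ceps + 1)
    ]


# Single-pole example: A(z) = 1 - 0.5 z^-1.
ceps = lp_cepstrum([0.5], n_points=64, num_ceps=3)
```

For a single pole at a, the real cepstrum at quefrency n >= 1 is a^n / (2n), which gives a quick sanity check on the output.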
Calculation of MF-CC’s
  • Filter bank of Mel-Scaled Filters
    • Sum the energy in each channel

Power Spectrum of Input signal:

Energy in each channel:

Cont’d
  • Then take the inverse discrete cosine transform of the log10 of the channel energies.
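
The last two steps can be sketched as follows. The mel formula 2595 * log10(1 + f/700) and the particular cosine-transform form are common textbook choices assumed here. The slides do not show which variants the thesis (or RASTA-MAT) uses, and the filter bank itself is omitted.

```python
import math


def hz_to_mel(f_hz):
    """Map frequency in Hz to the mel scale (one common formula, assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def channel_energies_to_cepstra(energies, num_ceps):
    """Cosine transform of the log10 channel energies (assumed form).

    c[n] = sum_k log10(E[k]) * cos(pi * n * (k + 0.5) / K), for n = 0..num_ceps-1.
    """
    num_channels = len(energies)
    log_e = [math.log10(e) for e in energies]
    return [
        sum(log_e[k] * math.cos(math.pi * n * (k + 0.5) / num_channels)
            for k in range(num_channels))
        for n in range(num_ceps)
    ]


# A flat spectrum puts all of its energy into the zeroth coefficient.
cepstra = channel_energies_to_cepstra([10.0] * 8, num_ceps=4)
```

The cosine transform decorrelates the log channel energies, so a smooth spectral envelope is captured by the first few coefficients.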
Perceptual Linear Prediction
  • First use Perceptual Scaling, such as Mel, on the spectrum.
  • Then use Linear Prediction to derive the Cepstral Coefficients.
  • PLP was shown in studies by [Hermansky, 1990] to reduce speaker dependence.