An overview of the SPHINX Speech Recognition System

Jie Zhou, Zheng Gong

Lingli Wang, Tiantian Ding

M.Sc in CMHE

Spoken Language Processing Module

Presentation of the speech recognition system

27th February 2004


Abstract

  • SPHINX is a system that demonstrates the feasibility of accurate, large-vocabulary, speaker-independent, continuous speech recognition.

  • SPHINX is based on discrete hidden Markov models (HMMs) with LPC-derived parameters.

  • To provide speaker independence

  • To deal with co-articulation in continuous speech

  • To adequately represent a large vocabulary

  • SPHINX attained word accuracies of 71, 94, and 96 percent on a 997-word task.


Introduction

  • SPHINX is a system that tries to overcome three constraints:

    1) Speaker dependence

    2) Isolated words

    3) Small vocabularies


Introduction

  • Speaker independent

    • The system must be trained on less appropriate training data (speech from speakers other than the test speaker)

    • Much more data can be acquired, which may compensate for the less appropriate training material

  • Difficulties of continuous speech recognition

    • Word boundaries are difficult to locate

    • Coarticulatory effects are much stronger in continuous speech

    • Content words are often emphasized, while function words are poorly articulated

  • Large vocabulary

    • 1000 words or more


Introduction

  • To improve speaker independence

    • Present additional knowledge through the use of multiple vector-quantized codebooks

    • Enhance the recognizer with carefully designed models and word duration modeling.

  • To deal with coarticulation in continuous speech

    • Function-word-dependent phone models

    • Generalized triphone models

  • SPHINX achieved speaker-independent word recognition accuracies of 71, 94, and 96 percent on the 997-word DARPA resource management task with grammars of perplexity 997, 60, and 20.


The baseline SPHINX system

  • This system uses standard HMM techniques

  • Speech Processing

    • Sampling rate: 16 kHz

    • Frames span 20 ms and are taken every 10 ms (10 ms overlap between frames)

    • Each frame is multiplied by a Hamming window

    • LPC coefficients are computed for each frame

    • 12 LPC-derived cepstral coefficients are obtained

    • The 12 LPC cepstral coefficients are vector quantized into one of 256 prototype vectors (a minimal sketch of this front end follows)
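
The front end described above can be sketched as follows. This is a minimal illustration, not the original SPHINX code: the 14th-order LPC, the use of scipy's Toeplitz solver, and the k-means codebook training are assumptions made for the example.

```python
# Minimal sketch of the SPHINX-style front end described above (not the original code).
# Assumptions: 14th-order LPC, scipy for the Toeplitz solve and vector quantization.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.cluster.vq import kmeans, vq

SAMPLE_RATE = 16000                      # 16 kHz sampling
FRAME_LEN = int(0.020 * SAMPLE_RATE)     # 20 ms frames
FRAME_STEP = int(0.010 * SAMPLE_RATE)    # a new frame every 10 ms
LPC_ORDER = 14                           # assumed LPC order
NUM_CEPSTRA = 12                         # 12 LPC-derived cepstral coefficients

def lpc_cepstra(frame):
    """Hamming window -> autocorrelation -> LPC -> cepstral recursion."""
    windowed = frame * np.hamming(len(frame))
    r = np.correlate(windowed, windowed, mode="full")[len(windowed) - 1:]
    r = r[:LPC_ORDER + 1]
    # Predictor coefficients a_k such that x[n] ~ sum_k a_k * x[n-k]
    a = solve_toeplitz(r[:LPC_ORDER], r[1:LPC_ORDER + 1])
    # Standard LPC-to-cepstrum recursion
    c = np.zeros(NUM_CEPSTRA)
    for n in range(1, NUM_CEPSTRA + 1):
        acc = a[n - 1] if n <= LPC_ORDER else 0.0
        for k in range(1, n):
            if n - k <= LPC_ORDER:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

def extract_features(samples):
    """Slice the waveform into overlapping frames and compute cepstra."""
    frames = [samples[i:i + FRAME_LEN]
              for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_STEP)]
    return np.array([lpc_cepstra(f) for f in frames])

def train_codebook(all_cepstra, size=256):
    """Train 256 prototype vectors (k-means stands in for the VQ training)."""
    codebook, _ = kmeans(all_cepstra, size)
    return codebook

def quantize(cepstra, codebook):
    """Map each frame to the index of its nearest prototype vector."""
    codes, _ = vq(cepstra, codebook)
    return codes
```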


Task and Database

  • The Resource Management Task

    • SPHINX was evaluated on the DARPA resource management task

    • Three grammars of varying perplexity are used with SPHINX (see the note below):

      • Null grammar (perplexity 997)

      • Word-pair grammar (perplexity 60)

      • Bigram grammar (perplexity 20)

  • The TIRM Database

    • 80 “training” speakers

    • 40 “development test” speakers

    • 40 “evaluation” speakers
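
For reference, perplexity here is the standard language-model measure: roughly, the average number of words the recognizer must choose between at each point. The definition below is an addition for clarity, not from the slides.

```latex
% Perplexity of a grammar over a test word sequence w_1 ... w_n
\[
  PP = 2^{H}, \qquad
  H = -\frac{1}{n}\sum_{i=1}^{n}\log_2 P(w_i \mid w_1,\ldots,w_{i-1})
\]
% With the null grammar every one of the 997 words is equally likely at each
% step, so PP = 997; the word-pair and bigram grammars reduce the effective
% branching factor to roughly 60 and 20.
```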


Task and Database

  • Phonetic Hidden Markov Models

    • HMMs are parametric models particularly suitable for describing speech events

    • Each HMM represents a phone

    • A total of 46 phones is used for English

    • {s}: a set of states

    • {a_ij}: a set of transitions, where a_ij is the probability of a transition from state i to state j

    • {b_ij(k)}: the output probability matrix (the probability of emitting VQ codeword k on the transition from state i to state j)

    • The phonetic HMM topology is shown in the figure on the next slide (a minimal sketch of this structure follows)
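
The {s}, {a_ij}, {b_ij(k)} description above can be captured in a small data structure. This is a generic sketch: the state count and the left-to-right topology below are assumptions for illustration; the actual SPHINX topology is the one in the figure.

```python
# Sketch of a discrete phonetic HMM as described above: states {s}, transition
# probabilities {a_ij}, and discrete output probabilities {b_ij(k)} over the
# 256 VQ codewords. The topology and state count are illustrative assumptions.
import numpy as np

NUM_CODEWORDS = 256

class PhoneHMM:
    def __init__(self, name, num_states=5):
        self.name = name                        # e.g. "AE", one of the 46 phones
        n = num_states
        # a[i, j]: probability of taking the transition from state i to state j
        self.a = np.zeros((n, n))
        # b[i, j, k]: probability of emitting VQ codeword k on transition i -> j
        self.b = np.full((n, n, NUM_CODEWORDS), 1.0 / NUM_CODEWORDS)
        # Simple left-to-right topology with self-loops and skips (illustrative)
        for i in range(n):
            for j in (i, i + 1, i + 2):         # stay, advance, or skip one state
                if j < n:
                    self.a[i, j] = 1.0
            self.a[i] /= self.a[i].sum()        # normalize each row

# One model per phone; SPHINX uses 46 phones for English (subset shown here).
phone_models = {p: PhoneHMM(p) for p in ["AE", "IY", "S", "T"]}
```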


Phonetic HMM topology (figure)


Task and Database

  • Training

    • A set of 46 phone models was used to initialize the parameters

    • The forward-backward algorithm was run on the resource management training sentences

    • Sentence models were created from word models, which were in turn concatenated from phone models

    • The trained transition probabilities are used directly in recognition

    • The output probabilities are smoothed with a uniform distribution

    • The SPHINX recognition search is a standard time-synchronous Viterbi beam search (a minimal sketch follows this list)
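
A time-synchronous Viterbi beam search over a model like the PhoneHMM sketched earlier can look as follows. This is a deliberate simplification of the real search, which runs over a full sentence network built from word and phone models; the beam width is an illustrative parameter, not the SPHINX value.

```python
# Minimal sketch of a time-synchronous Viterbi beam search over one discrete HMM
# with transition-based emissions (see the PhoneHMM sketch above). In SPHINX the
# search runs over a network of word/phone models; the beam width here is assumed.
import numpy as np

def viterbi_beam(hmm, observations, beam=200.0):
    """observations: sequence of VQ codeword indices. Returns the best log score."""
    n = hmm.a.shape[0]
    log_a = np.log(np.where(hmm.a > 0, hmm.a, 1e-300))
    log_b = np.log(hmm.b + 1e-300)
    scores = np.full(n, -np.inf)
    scores[0] = 0.0                                   # start in the initial state
    for o in observations:
        new_scores = np.full(n, -np.inf)
        active = np.where(scores > scores.max() - beam)[0]   # beam pruning
        for i in active:
            for j in range(n):
                if hmm.a[i, j] > 0:
                    s = scores[i] + log_a[i, j] + log_b[i, j, o]
                    if s > new_scores[j]:
                        new_scores[j] = s
        scores = new_scores
    return scores[-1]                                 # score of the final state
```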


Task and Database

  • The results of the baseline SPHINX system, evaluated on 15 new speakers with 10 sentences each, are shown in Table I.

  • Without incorporating knowledge and contextual modeling, the baseline system is inadequate for any realistic large-vocabulary application.


Adding knowledge to SPHINX

  • Fixed-Width Speech Parameters

  • Lexical/Phonological Improvements

  • Word Duration Modeling

  • Results


Fixed-Width Speech Parameters

  • Bilinear Transform on the Cepstrum Coefficients

  • Differenced Cepstrum Coefficients

  • Power and Differenced Power

  • Integrating Fixed-Width Parameters in Multiple Codebooks (a minimal sketch of these parameters follows)
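
The differenced cepstrum, power, and differenced power parameters can be sketched as below (the bilinear-transform frequency warping is omitted). The ±2-frame spacing for the differences and the grouping of the four feature sets into three codebook streams are assumptions for illustration.

```python
# Sketch of the fixed-width parameters described above, computed from the
# per-frame LPC cepstra and the raw frames. The +/-2-frame spacing and the
# three-stream grouping are assumptions made for this example.
import numpy as np

def differenced(features, spread=2):
    """Delta features: x[t + spread] - x[t - spread], with clamped edges."""
    padded = np.pad(features, ((spread, spread), (0, 0)), mode="edge")
    return padded[2 * spread:] - padded[:-2 * spread]

def log_power(frames):
    """Per-frame log energy ('power'), one value per frame."""
    frames = np.asarray(frames)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]

def codebook_streams(cepstra, frames):
    """Group the four feature sets into three streams, one VQ codebook each."""
    power = log_power(frames)
    return {
        "cepstrum": cepstra,                                              # codebook 1
        "delta_cepstrum": differenced(cepstra),                           # codebook 2
        "power_and_delta_power": np.hstack([power, differenced(power)]),  # codebook 3
    }
```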


Lexical/Phonological Improvements

  • This set of improvements involved modifying the set of phones and the pronunciation dictionary. These changes led to more accurate assumptions about how words are articulated, without changing our assumption that each word has a single pronunciation.

  • The first step we took was to replace the baseform pronunciation with the most likely pronunciation. 

  • In order to improve the appropriateness of the word pronunciation dictionary, a small set of rules (illustrated in the sketch after this list) was created to

    • modify closure-stop pairs into optional compound phones when appropriate

    • modify /t/’s and /d/’s into /dx/ when appropriate

    • reduce nasal /t/’s when appropriate

    • perform other mappings such as /t s/ to /ts/. 

  • Finally, there is the issue of what HMM topology is optimal for phones in general, and what topology is optimal for each phone.
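
As a purely illustrative example of such dictionary rewrites, rules like /t s/ → /ts/ can be expressed as mappings over phone sequences. The rule list and phone symbols below are hypothetical stand-ins, not the actual SPHINX rule set, and real rules fire only "when appropriate" (context-dependently).

```python
# Hypothetical illustration of rule-based pronunciation rewrites like those
# described above (e.g. /t s/ -> /ts/, flapping /t/ -> /dx/). The rules and
# phone symbols are stand-ins, not the actual SPHINX rules.
RULES = [
    (("t", "s"), ("ts",)),     # compound phone: /t s/ -> /ts/
    (("t",), ("dx",)),         # flapping (in SPHINX, applied only when appropriate)
]

def apply_rules(phones, rules=RULES):
    """Greedy left-to-right application of phone-sequence rewrite rules."""
    out, i = [], 0
    while i < len(phones):
        for lhs, rhs in rules:
            if tuple(phones[i:i + len(lhs)]) == lhs:
                out.extend(rhs)
                i += len(lhs)
                break
        else:
            out.append(phones[i])
            i += 1
    return out

# e.g. apply_rules(["ih", "t", "s"]) -> ["ih", "ts"]
```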


Word Duration Modeling

  • HMMs model the duration of events with transition probabilities, which leads to a geometric distribution for the duration of state residence.

  • We incorporated word duration into SPHINX as part of the Viterbi search. The duration of a word is modelled by a univariate Gaussian distribution, with the mean and variance estimated from a supervised Viterbi segmentation of the training set (a minimal sketch follows).
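
A minimal sketch of how such a Gaussian word-duration score could be folded into the search is shown below; the scaling weight is an assumed tuning parameter, not a value from the paper.

```python
# Sketch of a Gaussian word-duration score added to a path's log score when a
# word ends in the Viterbi search. Mean and variance come from a supervised
# Viterbi segmentation of the training data; the weight is an assumed parameter.
import math

def duration_log_score(num_frames, mean, variance, weight=1.0):
    """Log of a univariate Gaussian evaluated at the word's duration (in frames)."""
    return weight * (-0.5 * math.log(2 * math.pi * variance)
                     - (num_frames - mean) ** 2 / (2 * variance))

# At a word exit during the search:
#   path_score += duration_log_score(frames_in_word, mean[word], var[word])
```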


Results

  • We have presented various strategies for adding knowledge to SPHINX. 

  • Consistent with earlier results, we found that bilinear transformed coefficients improved the recognition rates. An even greater improvement came from the use of differential coefficients, power, and differenced power in three separate codebooks. 

  • Next, we enhanced the dictionary and the phone set, a step that led to an appreciable improvement.

  • Finally, the addition of durational information significantly improved SPHINX’s accuracy when no grammar was used, but was not helpful with a grammar.


Context Modeling in SPHINX

  • Previously Proposed Units of Speech

  • Function-Word Dependent Phones

  • Generalized Triphones

  • Smoothing Detailed Models


Previously Proposed Units of Speech

  • Because they allow no sharing of training across words, word models are not practical for large-vocabulary speech recognition

  • In order to improve trainability, some subword unit has to be used

  • Word-dependent phones: a compromise between word modeling and phone modeling

  • Context-dependent phones (triphone models): instead of modeling phone-in-word, they model phone-in-context


Function-Word Dependent Phones

  • Function words are particularly problematic in continuous speech recognition since they are typically unstressed

  • The phones in function words are distorted

  • Function-word-dependent phones are the same as word-dependent phones, except they are only used for function words


Generalized Triphones

  • Triphone models are sparsely trained and consume substantial memory

  • Combining similar triphones improves trainability and reduces memory usage

  • Generalized triphones are created by merging contexts with an agglomerative clustering procedure

  • To determine the similarity between two models, we use the following distance metric:
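
The formula itself is not reproduced in the slide. The following is a best-effort reconstruction of the metric from the referenced Lee, Hon, and Reddy overview paper, with notation assumed here: P_a(k) is the output probability of codeword k in model a, N_a(k) its count in a's training data, and m the model obtained by merging a and b.

```latex
% Reconstructed from the referenced SPHINX overview paper (notation assumed):
% the ratio of the likelihood of the training data under the two separate
% models to its likelihood under the merged model m.
\[
  D(a, b) \;=\;
  \frac{\displaystyle\Bigl(\prod_{k} P_a(k)^{N_a(k)}\Bigr)
        \Bigl(\prod_{k} P_b(k)^{N_b(k)}\Bigr)}
       {\displaystyle\prod_{k} P_m(k)^{N_m(k)}}
\]
% Contexts whose merge costs the least training-data likelihood (i.e. the most
% similar models) are merged by the clustering procedure.
```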


Generalized Triphones

  • In measuring the distance between the two models, we only consider the output probabilities and ignore the transition probabilities, which are of secondary importance

  • This context generalization algorithm provides an ideal means for finding the equilibrium between trainability and sensitivity.


Smoothing Detailed Models

  • Detailed models are accurate, but less robust, since many output probabilities will be zeros, which can be disastrous for recognition.

  • The remedy is to combine these detailed models with other, more robust ones.

  • An ideal solution for weighting different estimates of the same event is deleted interpolation.

  • This procedure is used to combine the detailed models with the robust models (a minimal sketch follows this list)

  • A uniform distribution is also used to smooth the distributions
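
A minimal sketch of this kind of interpolation is given below. Real deleted interpolation estimates the weights automatically from held-out ("deleted") portions of the training data; here they appear as assumed constants.

```python
# Sketch of smoothing a detailed (e.g. generalized-triphone) output distribution
# with a more robust context-independent one and a uniform floor. In deleted
# interpolation the weights are estimated from held-out ("deleted") training
# data; the constants below are placeholders for illustration.
import numpy as np

def interpolate(detailed, robust, weights=(0.6, 0.3, 0.1)):
    """Weighted mix of detailed, robust, and uniform distributions."""
    w_det, w_rob, w_uni = weights
    uniform = np.full_like(detailed, 1.0 / len(detailed))
    mixed = w_det * detailed + w_rob * robust + w_uni * uniform
    return mixed / mixed.sum()          # renormalize against rounding error

# Every zero in the detailed model now receives some probability mass, so an
# unseen codeword can no longer drive a path's score to minus infinity.
```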


Entire training procedure

  • A summary of the entire training procedure is illustrated in Figure 2


Summary of Results

  • The six versions correspond to the following descriptions with incremental improvements:

    • the baseline system, which uses only LPC cepstral parameters in one codebook;

    • the addition of differenced LPC cepstral coefficients, power, and differenced power in one codebook;

    • the use of all four feature sets in three separate codebooks;

    • tuning of the phone models and the pronunciation dictionary, and the use of word duration modelling;

    • function-word-dependent phone modelling;

    • generalized triphone modelling.


Results of five versions of SPHINX


Conclusion

  • Given a fixed amount of training, model specificity and model trainability pose two incompatible goals.

  • More specificity usually reduces trainability, while increased trainability usually results in over-generality.

  • Our work lies in finding an equilibrium between specificity and trainability


Conclusion

  • To improve trainability, we used one of the largest speaker-independent speech databases.

  • To facilitate sharing between models, we used deleted interpolation to combine robust models with detailed ones.

  • Trainability was further improved through sharing, by combining poorly trained models with well-trained models.

  • To improve specificity, we used multiple codebooks of various LPC-derived features and integrated external knowledge sources into the system.

  • We improved the phone set to include multiple representations of some phones, and introduced function-word-dependent phone modeling and generalized triphone modeling.


Reference

  • Kai-Fu Lee, Hsiao-Wuen Hon, and Raj Reddy, "An Overview of the SPHINX Speech Recognition System," IEEE, 1989.

  • Kai-Fu Lee, Hsiao-Wuen Hon, Mei-Yuh Hwang, Sanjoy Mahajan, and Raj Reddy, "The SPHINX Speech Recognition System," 1989.


Thank you very much!
