
Modeling Speech using POMDPs

Modeling Speech using POMDPs. In this work we apply a new model, the POMDP, in place of the traditional HMM to acoustically model the speech signal. We use state-of-the-art techniques to build and decode our new model. We demonstrate improved recognition results on a small data set.


Presentation Transcript


  1. Modeling Speech using POMDPs • In this work we apply a new model, the POMDP, in place of the traditional HMM to acoustically model the speech signal. • We use state-of-the-art techniques to build and decode our new model. • We demonstrate improved recognition results on a small data set.

  2. Description of a POMDP • A Markov Decision Process (MDP) is a mathematical formalization of problems in which a decision maker, an agent, must decide which actions to choose to maximize its expected reward as it interacts with its environment. • MDPs have been used in modeling an agent's behavior in • planning problems • robot navigation problems • In a fully observable MDP an agent always knows precisely what state it is in.

  3. If an agent cannot determine its state, its world is said to be partially observable. • In such a situation we use a generalization of MDPs, called a Partially Observable Markov Decision Process (POMDP). • POMDP vs. HMM • differs from an HMM: • multiple transitions between two states, representing actions • a reward added to each state • as with an HMM: • you do not know which state you are in
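The contrast between the two models can be made concrete in code. Below is a minimal, illustrative POMDP container; all field and variable names (and the example probabilities) are my own, not from the presentation. The HMM difference shows up in the types: transitions are indexed by a (state, action) pair instead of by state alone, and each state carries a reward.

```python
from dataclasses import dataclass, field

@dataclass
class POMDP:
    states: list
    actions: list
    transitions: dict = field(default_factory=dict)   # (state, action) -> {next_state: prob}
    observations: dict = field(default_factory=dict)  # state -> {observation: prob}
    rewards: dict = field(default_factory=dict)       # state -> reward

# Hypothetical three-state model with two actions; note the two distinct
# transition distributions between the same states, one per action.
m = POMDP(states=["beg", "mid", "end"], actions=["a1", "a2"])
m.transitions[("beg", "a1")] = {"beg": 0.6, "mid": 0.4}
m.transitions[("beg", "a2")] = {"beg": 0.3, "mid": 0.7}
m.rewards["beg"] = 0.0
```

Dropping the action index from `transitions` (and the `rewards` table) recovers an ordinary HMM.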

  4. POMDP in Speech • As with HMMs: • left-to-right topology with 3 to 5 states • states represent pronunciation tasks: beginning, middle, end of phoneme • observed acoustic features are associated with each state • Randomness in state transitions still accounts for time stretching in the phoneme: short, long, hurried pronunciations • Randomness in the observations still accounts for the variability in pronunciations
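The left-to-right topology described above can be sketched as a transition matrix; the self-loop probabilities here are hypothetical, chosen only to illustrate how self-loops model time stretching (a state with self-loop probability p is occupied for 1 / (1 − p) frames in expectation, by the geometric distribution).

```python
import numpy as np

# Row i gives the transition probabilities out of state i
# (beginning, middle, end of phoneme). Zeros below the diagonal
# enforce the left-to-right constraint.
A = np.array([
    [0.6, 0.4, 0.0],   # beginning: stay, or move to middle
    [0.0, 0.7, 0.3],   # middle: stay, or move to end
    [0.0, 0.0, 1.0],   # end (model exit handled separately in a real system)
])

expected_frames_in_beg = 1.0 / (1.0 - A[0, 0])  # geometric mean dwell time
```

Raising a self-loop probability lengthens the expected pronunciation of that portion of the phoneme; lowering it models hurried pronunciations.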

  5. Differs from HMMs • In theory: • model all possible context classes (infinite number) • model all contexts of a particular context class • In practice: • model three context classes: triphone, biphone, monophone • model all contexts of a particular context class • Use actions of our model to represent context • (diagram: Beg. → Mid. → End state topology)

  6. Training a POMDP • We train each context class independently on the same training data • each is treated as an HMM model and trained using standard EM • We then collect all context models for each phoneme over the four different context classes and combine them into a single, unified POMDP model • we label each action with both the context and the context class that the particular HMM model belongs to
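The combination step can be sketched as follows, assuming each trained context HMM is reduced to its per-state transition rows; the function name, dictionary layout, and example labels are illustrative, not from the presentation. Each transition set of the unified POMDP is labeled with an action identifying the (context class, context) of the HMM it came from.

```python
def combine_context_hmms(context_hmms):
    """context_hmms: {(context_class, context): list of per-state transition rows}.
    Returns {(state_index, action): row} for the unified POMDP, where the
    action is the (context_class, context) label of the source HMM."""
    unified = {}
    for action, rows in context_hmms.items():
        for i, row in enumerate(rows):
            unified[(i, action)] = row
    return unified

# Two independently trained models for the phoneme 'ow', merged into one POMDP.
unified = combine_context_hmms({
    ("triphone", "t-ow+m"): [[0.6, 0.4], [0.0, 1.0]],
    ("monophone", "ow"):    [[0.5, 0.5], [0.0, 1.0]],
})
```

During decoding, choosing an action at a state then amounts to choosing which context model's dynamics to follow.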

  7. Decoding a POMDP • We look at 3 decoding strategies based on Viterbi: • Uniform Mixed Model (UMM) Viterbi • Weighted Mixed Model (WMM) Viterbi • Cross-Context Mixed Model (CMM) Viterbi

  8. UMM Viterbi • From the Viterbi point of view: • add all context classes to the mix and allow Viterbi to choose the best path through the entire search space • relax context rules by matching up all partial-context phonemes • wild-card all monophones to match up with all biphones and triphones sharing the same center phone • wild-card all biphones to match up with triphones whose other context they share • add a class weight, Wc, to each context class c • applied to each model as we enter it • From the POMDP point of view: • the model constrains actions • add the constraint that we leave a state with the same action with which we entered it • ensures the model's context, as in an HMM
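The wild-card matching rules described above can be sketched in a few lines, assuming the usual l-c+r context notation used in the slides (e.g. 't-ow+m', 'ow+m', 'ow'); the function names are my own.

```python
def parse_context(label):
    """Split an 'l-c+r' label into (left, center, right); missing parts are None.
    'ow' -> (None, 'ow', None); 'ow+m' -> (None, 'ow', 'm'); 't-ow+m' -> ('t', 'ow', 'm')."""
    left = right = None
    if "-" in label:
        left, label = label.split("-", 1)
    if "+" in label:
        label, right = label.split("+", 1)
    return left, label, right

def contexts_match(a, b):
    """Wild-card rule: a monophone matches any biphone/triphone with the same
    center phone; a biphone matches any triphone sharing its known context."""
    la, ca, ra = parse_context(a)
    lb, cb, rb = parse_context(b)
    if ca != cb:
        return False
    left_ok = la is None or lb is None or la == lb
    right_ok = ra is None or rb is None or ra == rb
    return left_ok and right_ok
```

A missing context slot acts as the wild card, so relaxation is monotone: the less context a model specifies, the more models it pairs with.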

  9. • relax the constraint by allowing different context classes to be chosen within the model • differs from HMM • the class weight is a reward given at the start state for entering the model • Viterbi expansion of “tomato” having two spellings: • “t-ow-m-ey-t-ow” • “t-ow-m-aa-t-ow” • (figure: (a) standard Viterbi and (b) UMM Viterbi lattices over the triphone, biphone, and monophone models)

  10. WMM Viterbi • Similar to UMM Viterbi, except now we weight each context model of each context class individually, based on frequency counts of its occurrence in the training data: wcm = Lc + min(fcm / Kc, 1) * (Wc - Lc) • fcm – frequency count for model m of context class c • Lc – lower bound for context class c • Wc – upper bound for context class c • Kc – frequency-count cutoff threshold for context class c
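The slide's weight formula transcribes directly; argument names follow the symbols defined above, while the example bound and cutoff values are hypothetical.

```python
def context_model_weight(f_cm, L_c, W_c, K_c):
    """wcm = Lc + min(fcm / Kc, 1) * (Wc - Lc).
    Models seen at least K_c times in training receive the full class weight
    W_c; rarer models are interpolated linearly down toward the floor L_c."""
    return L_c + min(f_cm / K_c, 1.0) * (W_c - L_c)

# hypothetical bounds: floor 0.2, ceiling 1.0, cutoff 100 occurrences
w_rare   = context_model_weight(f_cm=50,  L_c=0.2, W_c=1.0, K_c=100)  # halfway: 0.6
w_common = context_model_weight(f_cm=500, L_c=0.2, W_c=1.0, K_c=100)  # capped: 1.0
```

The `min(..., 1)` cap means frequency evidence beyond the cutoff K_c earns no additional weight, which keeps very common models from dominating purely by count.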

  11. CMM Viterbi • Similar to WMM Viterbi, except now our POMDP model relaxes the constraint on actions • allows cross-model jumps • jumps are now weighted by the model weight wcm • the constraint is relaxed to a sub-class of context models as follows: • models can jump between a triphone and the associated biphone and monophone whose partial context they share

  12. (diagram: cross-model jumps between t-ow+m, ow+m, and ow) • Various strategies for relaxing cross-model jump constraints: • Maximum cross context • for each cross-context model jump, add the weight to the likelihood score and choose the jump that yields the highest score • Expanded cross context • choose all context model jumps at every state, adding the weight to the likelihood score of each jump • Restricted form of both Maximum and Expanded • add the constraint that once we choose a lower-order context class model, we cannot go back to a higher-order context class model, only stay within our own class or a lower one • the idea is to abandon higher-order models that perform poorly
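The restricted-jump rule can be sketched as a small predicate; the ordering table and function name are my own, assuming triphone > biphone > monophone as the context-class order implied by the slides.

```python
ORDER = {"triphone": 3, "biphone": 2, "monophone": 1}

def jump_allowed(current_class, target_class, restricted=True):
    """Under the restricted strategy, once the decoder drops to a lower-order
    context class it may only stay within that class or drop further, never
    return to a higher-order class."""
    if not restricted:
        return True  # plain Maximum / Expanded strategies permit any direction
    return ORDER[target_class] <= ORDER[current_class]
```

Filtering candidate jumps through this predicate before scoring them implements the "abandon poorly performing higher-order models" idea: a path that has fallen back to a biphone or monophone commits to the lower-order model for the rest of the phoneme.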

  13. Experiments • Tested our model on the TIMIT data set: • TIMIT – read English sentences • 45 phonemes, ~8000-word dictionary • 3 hours of training data: 3869 utterances by 387 speakers • 6 minutes of decoding data: 110 utterances by 11 speakers • independent of the training data • trigram language model built from the training data and outside sources (OGI Stories and NatCell)

  14. Baseline • Found the best system configuration for each corpus. • created 16-mixture SCTM models for each HMM context class using the ISIP prototype system (v5.10) • ran the baseline for all 3 HMM models

  15. Results • Results for all three modified Viterbi algorithms are similar to those on the development set • the POMDP model shows robustness to different test sets • it is not tuned to the data

  16. Future Work • Apply the new model to a larger data set • Find a better method to generate individual context model weights • e.g. the linear interpolation and backoff techniques used in language modeling • Find a better method for adjusting the overall POMDP model context class weights for the various decoding strategies • the current method of experimentation is inefficient • For CMM Viterbi, look for better ways to constrain cross-model jumps outside of partial context classes • e.g. use linguistic information, as is done when tying mixtures at the state level
