
Pronunciation Modeling


Presentation Transcript


  1. Pronunciation Modeling Lecture 11 Spoken Language Processing Prof. Andrew Rosenberg

  2. What is a pronunciation model? [Diagram: Audio Features → Acoustic Model → Phone Hypotheses → Pronunciation Model → Word Hypotheses → Language Model → Word Hypotheses]

  3. Why do we need one? • The pronunciation model defines the mapping between sequences of phones and words. • The acoustic model can deliver a one-best hypothesis – a "best guess". • From this single guess, converting phones to words can be done with dynamic programming alignment. • Or it can be viewed as a finite state automaton.

  4. Simplest Pronunciation "model" • A dictionary. • Associate a word (lexical item, orthographic form) with a pronunciation.
  ACHE EY K
  ACHES EY K S
  ADJUNCT AE JH AH NG K T
  ADJUNCTS AE JH AH NG K T S
  ADVANTAGE AH D V AE N T IH JH
  ADVANTAGE AH D V AE N IH JH
  ADVANTAGE AH D V AE N T AH JH
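In code, such a dictionary is little more than a lookup table. A minimal Python sketch (entries taken from the slide, with pronunciation variants stored as a list) might look like this:

```python
# A minimal sketch of a dictionary-style pronunciation "model": each word
# maps to one or more phone sequences (entries copied from the slide).
PRON_DICT = {
    "ACHE":     [["EY", "K"]],
    "ACHES":    [["EY", "K", "S"]],
    "ADJUNCT":  [["AE", "JH", "AH", "NG", "K", "T"]],
    "ADJUNCTS": [["AE", "JH", "AH", "NG", "K", "T", "S"]],
    "ADVANTAGE": [
        ["AH", "D", "V", "AE", "N", "T", "IH", "JH"],
        ["AH", "D", "V", "AE", "N", "IH", "JH"],
        ["AH", "D", "V", "AE", "N", "T", "AH", "JH"],
    ],
}

def pronunciations(word):
    """Return every listed phone sequence for a word (empty list if OOV)."""
    return PRON_DICT.get(word.upper(), [])

print(pronunciations("advantage"))   # three variant phone sequences
print(pronunciations("truthiness"))  # [] -- out of vocabulary
```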

  5. Example of a pronunciation dictionary

  6. Finite State Automata view • Each word is an automaton over phones. [Diagram: linear FSAs for ACHE (EY K), ACHES (EY K S), and ADVANTAGE (AH D V AE N T IH JH)]

  7. Size of whole word models • These models get very big, very quickly. [Diagram: the word FSAs for ACHE, ACHES, and ADVANTAGE merged between shared START and END states]
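To make the growth concrete, here is an illustrative sketch that builds one chain of phone arcs per pronunciation between shared START and END states and counts the resulting states and arcs; the representation (plain tuples for arcs) is an assumption for illustration only:

```python
# Illustrative sketch: build a single automaton with shared START/END states
# and one linear chain of phone arcs per pronunciation, then count its size.
# The lexicon reuses dictionary entries from the earlier slide.
def build_word_loop(lexicon):
    arcs = []                      # (src_state, dst_state, phone_label)
    next_state = 2                 # 0 = START, 1 = END
    for word, prons in lexicon.items():
        for phones in prons:
            src = 0
            for i, phone in enumerate(phones):
                dst = 1 if i == len(phones) - 1 else next_state
                if dst != 1:
                    next_state += 1
                arcs.append((src, dst, phone))
                src = dst
    return arcs, next_state

lexicon = {
    "ACHE":  [["EY", "K"]],
    "ACHES": [["EY", "K", "S"]],
    "ADVANTAGE": [["AH", "D", "V", "AE", "N", "T", "IH", "JH"]],
}
arcs, n_states = build_word_loop(lexicon)
print(n_states, "states,", len(arcs), "arcs")
# States and arcs grow with the total number of phones in the lexicon,
# so a 100k-word vocabulary yields a very large graph.
```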

  8. Potential problems • Every word in the training material and test vocabulary must be in the dictionary. • The dictionary is generally written by hand. • Prone to errors and inconsistencies.
  ACHE EY K
  ACHES EY K S
  ADJUNCT AE JH AH NG K T
  ADJUNCTS AE JH AH NG K T S
  ADVANTAGE AH D V AE N T IH JH
  ADVANTAGE AH D V AE N IH JH
  ADVANTAGE AH D V AE N T AH JH

  9. Baseforms represented by graphs

  10. Composition • From the word graph, we can replace each phone by its Markov model.
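A sketch of that composition step, assuming a 3-state left-to-right HMM per phone (the topology and the state names are assumptions for illustration, not something the slide fixes):

```python
# Sketch of composition: expand each phone in a baseform into the states of
# its (assumed) 3-state left-to-right HMM.
def expand_to_hmm_states(phones, states_per_phone=3):
    """Replace each phone with named HMM sub-states, e.g. EY -> EY_1 EY_2 EY_3."""
    return [f"{p}_{s}" for p in phones for s in range(1, states_per_phone + 1)]

print(expand_to_hmm_states(["EY", "K"]))
# ['EY_1', 'EY_2', 'EY_3', 'K_1', 'K_2', 'K_3']
```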

  11. Automating the construction • Do we need to write a rule for every word? • pluralizing? • Where is it +[Z]? +[IH Z]? • prefixes, unhappy, etc. • +[UH N] • How can you tell the difference between "unhappy", "unintelligent", and "under"?

  12. Is every pronunciation equally likely? • Different phonetic realizations can be weighted. • The FSA view of the pronunciation model makes this easy.
  ACAPULCO AE K AX P AH L K OW
  ACAPULCO AA K AX P UH K OW
  THE TH IY
  THE TH AX
  PROBABLY P R AA B AX B L IY
  PROBABLY P R AA B L IY
  PROBABLY P R AA L IY

  13. Is every pronunciation equally likely? • Different phonetic realizations can be weighted. • The FSA view of the pronunciation model makes this easy.
  ACAPULCO AE K AX P AH L K OW 0.75
  ACAPULCO AA K AX P UH K OW 0.25
  THE TH IY 0.15
  THE TH AX 0.85
  PROBABLY P R AA B AX B L IY 0.5
  PROBABLY P R AA B L IY 0.4
  PROBABLY P R AA L IY 0.1
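The weighted dictionary maps directly onto code. A minimal sketch with the slide's weights, returning the highest-weighted variant (in an FSA these weights would sit on the arcs):

```python
# Weighted pronunciation lexicon (weights taken from the slide).
WEIGHTED_DICT = {
    "ACAPULCO": [(0.75, "AE K AX P AH L K OW"), (0.25, "AA K AX P UH K OW")],
    "THE":      [(0.15, "TH IY"), (0.85, "TH AX")],
    "PROBABLY": [(0.5, "P R AA B AX B L IY"),
                 (0.4, "P R AA B L IY"),
                 (0.1, "P R AA L IY")],
}

def most_likely_pronunciation(word):
    """Return the highest-weighted phone string for a word."""
    weight, phones = max(WEIGHTED_DICT[word.upper()])
    return phones, weight

print(most_likely_pronunciation("probably"))  # ('P R AA B AX B L IY', 0.5)
```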

  14. Collecting pronunciations • Collect a lot of data. • Ask a phonetician to phonetically transcribe the data. • Count how many times each production is observed. • This is very expensive – time-consuming, and it requires finding linguists.

  15. Collecting pronunciations • Start with equal likelihoods of all pronunciations • Run the recognizer on transcribed speech • forced alignment • See how many times the recognizer uses each pronunciation. • Much cheaper, but less reliable
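The counting step is simple normalization. A sketch, assuming we already know which pronunciation the forced alignment chose for each token (the `aligned` list below is made-up data in the spirit of the earlier slides):

```python
# Sketch: turn forced-alignment counts into pronunciation probabilities.
# `aligned` is an assumed input: one (word, chosen pronunciation) pair per
# token in the force-aligned training data.
from collections import Counter, defaultdict

def estimate_pron_probs(aligned_tokens):
    counts = defaultdict(Counter)
    for word, pron in aligned_tokens:
        counts[word][pron] += 1
    return {
        word: {pron: n / sum(c.values()) for pron, n in c.items()}
        for word, c in counts.items()
    }

aligned = [("THE", "TH AX"), ("THE", "TH AX"), ("THE", "TH IY"),
           ("PROBABLY", "P R AA B L IY")]
print(estimate_pron_probs(aligned))
# {'THE': {'TH AX': 0.66..., 'TH IY': 0.33...}, 'PROBABLY': {'P R AA B L IY': 1.0}}
```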

  16. Out of Vocabulary Words • A major problem for dictionary-based pronunciation modeling is out-of-vocabulary terms. • If you've never seen a name, or a new word, how do you know how to pronounce it? • Person names • Organization and company names • New words: "truthiness", "hypermiling", "woot", "app" • Medical, scientific and technical terms

  17. Collecting Pronunciations from the web • Newspapers, blog posts etc. often use new names and unknown terms. • For example: • Flickeur (pronounced like Voyeur) randomly retrieves images from Flickr.com and creates an infinite film with a style that can vary between stream-of-consciousness, documentary or video clip. • Our group traveled to Peterborough (pronounced like “Pita-borough”)... • The web can be mined for pronunciations [Riley, Jansche, Ramabhadran 2009]

  18. Grapheme to Phoneme Conversion • Given a new word, how do you pronounce it? • Grapheme is a language-independent term for things like "letters", "characters", "kanji", etc. • With a grapheme-to-phoneme converter, dictionaries can be augmented with any word. • Some languages are more ambiguous than others.

  19. Grapheme to Phoneme conversion • Goal: Learn an alignment between graphemes (letters) and phonemes (sounds) • Find the lowest cost alignment. • Weight rules, and learn contextual variants.
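Finding the lowest-cost alignment is a dynamic-programming problem, much like edit distance. The sketch below aligns a letter sequence to a phone sequence; the substitution and gap costs are toy stand-ins for the learned, weighted rules the slide refers to:

```python
# Sketch of lowest-cost grapheme-phoneme alignment by dynamic programming
# (edit-distance style). The toy substitution cost just checks whether the
# letter matches the first character of the phone symbol.
def align(letters, phones,
          sub_cost=lambda g, p: 0.0 if g.upper() == p[0] else 1.0,
          gap_cost=1.0):
    n, m = len(letters), len(phones)
    # cost[i][j] = best cost of aligning letters[:i] with phones[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_cost          # letters mapped to no phone
    for j in range(1, m + 1):
        cost[0][j] = j * gap_cost          # phones with no letter
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(letters[i - 1], phones[j - 1]),
                cost[i - 1][j] + gap_cost,   # silent letter
                cost[i][j - 1] + gap_cost,   # inserted phone
            )
    return cost[n][m]

print(align("kit", ["K", "IH", "T"]))   # 0.0: each letter aligns to one phone
print(align("ache", ["EY", "K"]))       # higher: silent letters pay the gap cost
```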

  20. Grapheme to Phoneme Difficulties • How to deal with abbreviations? • US CENSUS • NASA, scuba vs. AT&T, ASR • LOL • IEEE • What about misspellings? • Should "teh" have an entry in the dictionary? • If we're collecting new terms from the web, or other unreliable sources, how do we know what is a new word?

  21. Application of Grapheme to Phoneme Conversion • This pronunciation model is used much more often in Speech Synthesis than in Speech Recognition. • In Speech Recognition we're trying to do phoneme-to-grapheme conversion. • This is a very tricky problem. • "ghoti" -> F IH SH • "ghoti" -> silence

  22. Approaches to Grapheme to Phoneme conversion • “Instance Based Learning” • Lookup based on a sliding window of 3 letters • Helps with sounds like “ch” and “sh” • Hidden Markov Model • Observations are phones • States are letters
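A toy version of the instance-based idea: memorize the phone produced by the middle letter of every 3-letter window seen in (already aligned) training data, then look up the windows of a new word. The tiny training set here is invented for illustration:

```python
# Toy sketch of instance-based G2P: a lookup table from 3-letter windows to
# the phone produced by the middle letter ("" means the letter is silent).
def windows(word):
    padded = "#" + word + "#"
    return [padded[i - 1:i + 2] for i in range(1, len(padded) - 1)]

TRAIN = {
    # from "ship"
    "#sh": "SH", "shi": "", "hip": "IH", "ip#": "P",
    # from "chin"
    "#ch": "CH", "chi": "", "hin": "IH", "in#": "N",
}

def g2p_lookup(word):
    return [TRAIN.get(w, "?") for w in windows(word)]

print(g2p_lookup("chip"))   # ['CH', '', 'IH', 'P']
print(g2p_lookup("shin"))   # ['SH', '', 'IH', 'N'] -- the window captures "sh" vs "ch"
```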

  23. Machine Learning for Grapheme to Phoneme Conversion • Input: • A letter, and surrounding context, e.g. 2 previous and 2 following letters • Output: • Phoneme

  24. Decision Trees • Decision trees are intuitive classifiers. • Classifier: supervised machine learning, generating categorical predictions. [Diagram: a single split node, "Feature > threshold?", leading to Class A or Class B]

  25. Decision Trees Example

  26. Decision Tree Training • How does the letter "p" sound? • Training data:
  P: loophole, peanuts, pay, apple
  F: physics, telephone, graph, photo
  ø: apple, psycho, pterodactyl, pneumonia
  • Pronunciation depends on context.

  27. Decision Trees example • Context: L1, L2, p, R1, R2
  R1 = "h"?
    Yes: P loophole; F physics, F telephone, F graph, F photo
    No: P peanut, P pay, P apple; ø apple, ø psycho, ø pterodactyl, ø pneumonia

  28. Decision Trees example • Context: L1, L2, p, R1, R2
  R1 = "h"?
    Yes → L1 = "o"?
      Yes: P loophole
      No: F physics, F telephone, F graph, F photo
    No → R1 = consonant?
      Yes: P apple, ø apple, ø psycho, ø pterodactyl, ø pneumonia
      No: P peanut, P pay

  29. Decision Trees example • Context: L1, L2, p, R1, R2 • Try "PARIS": for its "p", R1 = "a", so it fails both the R1 = "h" and the R1 = consonant tests, lands in the leaf with "peanut" and "pay", and the tree from the previous slide predicts P.

  30. Decision Trees example • Context: L1, L2, p, R1, R2 • Now try "GOPHER": for its "p", R1 = "h" and L1 = "o", so it follows the same path as "loophole" and the tree predicts P, even though the "p" in "gopher" is actually part of an F sound.

  31. Training a Decision Tree • At each node, decide what the most useful split is. • Consider all features. • Select the one that improves the performance the most. • There are a few ways to calculate improved performance: • Information Gain is typically used. • Accuracy is less common. • Can require many evaluations.
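Putting the last few slides together, here is a sketch that trains a decision tree on the letter-"p" data from slide 26 using scikit-learn with criterion="entropy" (i.e., information-gain splits). The context features follow slide 23; the exact splits the learned tree chooses may differ from the hand-drawn tree on slides 27-28:

```python
# Sketch: learn how the letter "p" sounds from letter-context features,
# using the training words on slide 26. Feature names and padding are
# illustrative choices, not part of the slides.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# (word, index of the letter "p", phoneme label) -- "ø" marks a silent p.
TRAIN = [
    ("loophole", 3, "P"), ("peanuts", 0, "P"), ("pay", 0, "P"), ("apple", 1, "P"),
    ("physics", 0, "F"), ("telephone", 4, "F"), ("graph", 3, "F"), ("photo", 0, "F"),
    ("apple", 2, "ø"), ("psycho", 0, "ø"), ("pterodactyl", 0, "ø"), ("pneumonia", 0, "ø"),
]

def context(word, i, width=2):
    """L2, L1, R1, R2 letter context around position i, padded with '#'."""
    padded = "#" * width + word + "#" * width
    i += width
    return {"L2": padded[i - 2], "L1": padded[i - 1],
            "R1": padded[i + 1], "R2": padded[i + 2]}

vec = DictVectorizer(sparse=False)
X = vec.fit_transform([context(w, i) for w, i, _ in TRAIN])
y = [label for _, _, label in TRAIN]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Try the slides' test words: the "p" in "paris" and in "gopher".
# Predictions depend on this tiny training set; compare with the
# hand walk-through on slides 29-30.
for word, i in [("paris", 0), ("gopher", 2)]:
    pred = tree.predict(vec.transform([context(word, i)]))[0]
    print(word, "->", pred)
```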

  32. Pronunciation Models in TTS and ASR • In ASR, we have phone hypotheses from the acoustic model, and need word hypotheses. • In TTS, we have the desired word, but need a corresponding phone sequence to synthesize.

  33. Next Class • Language Modeling • Reading: J&M Chapter 4
