Construction of phoneme-to-phoneme converters




[Figure: construction of a P2P converter. Block labels: orthography, initial transcription, target transcription, high level features, alignment process (letter-to-sound), alignment process (sound-to-sound), transformation learning, learn morphological classes, example generation, stochastic rule induction.]

Towards improved proper name recognition

Bert Réveil and Jean-Pierre Martens

DSSP group, Ghent University, Department of Electronics and Information Systems

Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium

{breveil,martens}@elis.ugent.be

Topic description

  • Automatic proper name recognition is a key component of many speech-based applications (e.g. voice-driven navigation systems). It is hampered by the mismatch between the way names are represented in the recognizer and the way they are actually pronounced:
    • Incorrect phonemic name transcriptions: common grapheme-to-phoneme (G2P) converters cannot cope with archaic spelling and foreign name parts (e.g. Ugchelsegrensweg, Haînautlaan), and manual transcriptions are too costly
    • Multiple plausible name pronunciations: within or across languages (e.g. Roger)
    • Cross-lingual pronunciation variation: foreign names, foreign application users
  • To improve the phonemic transcriptions and capture the pronunciation variation, we adopt acoustic and lexical modeling approaches. Acoustic modeling aims at a better modeling of the expected utterance sounds. Lexical modeling tries to predict the most plausible phonemic transcription(s) for each name in the recognition lexicon.

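[Figure: the utterance "Please guide me towards 'A&u.stIn" is fed to the recognition system of a GPS, which consists of HMMs (e.g. for the phoneme "O") and a lexicon holding the entry Austin → 'O.stIn.]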

Experimental set-up

  • Database: Autonomata Spoken Name Corpus (ASNC)
  • 120 Dutch, 40 English, 20 French, 40 Moroccan and 20 Turkish speakers
  • Every speaker reads 181 names of Dutch, English, French, Moroccan or Turkish origin
  • Non-overlapping train and test sets (disjoint names and speakers)
  • Human expert transcriptions
    • TY: typical Dutch transcription (one for each name from TeleAtlas)
    • AV: auditory verified Dutch transcription (one for each name utterance)
  • This work uses only the utterances of Dutch native speakers plus the non-native utterances of Dutch names
  • Speech recognizer: state-of-the-art VoCon 3200 from Nuance
  • Grammar: name loop with 21K different names (3.5K names of ASNC + 17.5K others)
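
The poster does not say how the non-overlapping train and test sets were built. As a minimal sketch of one way to obtain a split in which neither speakers nor names overlap, the following assumes utterances are given as (speaker, name) pairs. The function name `disjoint_split` and the parameter `test_fraction` are hypothetical.

```python
import random

def disjoint_split(utterances, test_fraction=0.3, seed=0):
    """Split (speaker, name) utterances so that train and test share
    neither speakers nor names. Utterances that pair a train speaker
    with a test name (or the reverse) are dropped."""
    rng = random.Random(seed)
    speakers = sorted({spk for spk, _ in utterances})
    names = sorted({name for _, name in utterances})
    rng.shuffle(speakers)
    rng.shuffle(names)
    # Reserve a fraction of the speakers and of the names for testing.
    test_speakers = set(speakers[:max(1, int(len(speakers) * test_fraction))])
    test_names = set(names[:max(1, int(len(names) * test_fraction))])
    train, test = [], []
    for spk, name in utterances:
        if spk in test_speakers and name in test_names:
            test.append((spk, name))
        elif spk not in test_speakers and name not in test_names:
            train.append((spk, name))
    return train, test
```

The price of a doubly disjoint split is that the mixed utterances cannot be used on either side, so both sets are smaller than a simple speaker-only split would give.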

Acoustic and lexical modeling strategies

  • The modeling approaches are first conceived for the primary target users, also called the native (NAT) users (in our case Dutch natives). With respect to these users, two types of non-native languages are distinguished: foreign languages that most NAT speakers are familiar with (NN1), and other foreign languages (NN2).
  • Strategy 1: Incorporating NN1 language knowledge
  • Acoustic modeling: two model sets
    • AC-MONO : standard NAT Dutch model (trained on Dutch speech alone)
    • AC-MULTI : Dutch (20%) and NN1 training data (English, French and German)
  • Lexical modeling
    • G2P transcribers for NAT and NN1 languages (Nuance RealSpeak TTS)
      • Foreign transcriptions are nativized when used in combination with AC-MONO
    • Data-driven selection of one extra G2P converter per name origin
  • Strategy 2: Creating pronunciation variants (lexical modeling)
    • Computed per (speaker, name) combination
    • Created from initial G2P transcriptions by means of automatically learned phoneme-to-phoneme (P2P) converters

Construction of phoneme-to-phoneme converters

  • P2P learning requires the orthographic transcription, an initial G2P transcription and a target phonemic transcription (e.g. TY or AV) of a sufficiently large collection of name utterances. These triples are fed to a four-step training procedure:
  • Two-fold alignment: Orthography ↔ Initial transcription ↔ Target transcription
  • Transformation retrieval
  • Generation of training examples: describe linguistic context
      • Previous and next phonemes and graphemes
      • Lexical context (Part Of Speech)
      • Prosodic context (stressed syllable or not)
      • Morphological context (word prefix/suffix)
      • External features: e.g. name type, name source, speaker tongue
  • Rule induction
      • Learn one decision tree per input pattern, with stochastic rules in the leaf nodes
      • Rule formalism: if context → leaf node then [input pattern] → [output pattern] with probability Pfir
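
The first two training steps can be illustrated with a minimal sketch of the sound-to-sound alignment and the transformation retrieval. It is not the authors' implementation: it assumes a plain unit-cost edit-distance alignment, single-phoneme patterns rather than multi-phoneme ones, and it treats `A&u` as one phoneme symbol. The function names `align` and `transformations` are hypothetical.

```python
def align(initial, target):
    """Align two phoneme sequences with unit-cost edit distance.
    Returns (initial_phoneme, target_phoneme) pairs; None marks a
    deletion (no target phoneme) or an insertion (no initial phoneme)."""
    n, m = len(initial), len(target)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (initial[i - 1] != target[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right cell to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (initial[i - 1] != target[j - 1])):
            pairs.append((initial[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((initial[i - 1], None))
            i -= 1
        else:
            pairs.append((None, target[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs

def transformations(pairs):
    """Keep only the aligned positions where the target differs."""
    return [(src, tgt) for src, tgt in pairs if src != tgt]

initial = ["O", "s", "t", "I", "n"]    # initial G2P transcription
target = ["A&u", "s", "t", "I", "n"]   # target transcription
print(transformations(align(initial, target)))
```

Each retrieved transformation, together with the context features listed above, would then form one training example for the rule induction step.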

In generation mode, the rules are applied to the initial G2P transcription of an unseen name, yielding pronunciation variants with probabilities.
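
Generation mode can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' system: the decision trees are replaced by a lookup table keyed on the input phoneme and the previous phoneme only, and the rule and its probabilities in `RULES` are invented for the example. The name `generate_variants` is hypothetical.

```python
from itertools import product

# Hypothetical stochastic rules:
# (input phoneme, previous phoneme) -> [(output phoneme, probability)]
RULES = {
    ("O", "#"): [("O", 0.6), ("A&u", 0.4)],
}

def generate_variants(initial, rules, top_n=3):
    """Apply stochastic rules to an initial transcription and return
    the top_n variants as (phoneme tuple, probability) pairs."""
    options = []
    prev = "#"  # word-boundary marker used as left context
    for phoneme in initial:
        # Phonemes without a matching rule are copied with probability 1.
        options.append(rules.get((phoneme, prev), [(phoneme, 1.0)]))
        prev = phoneme
    variants = []
    for combo in product(*options):
        prob = 1.0
        for _, p in combo:
            prob *= p
        variants.append((tuple(out for out, _ in combo), prob))
    variants.sort(key=lambda v: v[1], reverse=True)
    return variants[:top_n]

for phonemes, prob in generate_variants(["O", "s", "t", "I", "n"], RULES):
    print(".".join(phonemes), prob)
```

Enumerating every combination grows exponentially with the number of rules that fire, which is acceptable for short names but would call for a beam search on longer inputs.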

Experimental assessment

  • Incorporating NN1 language knowledge
    • Including extra G2P transcriptions (acoustic model = AC-MONO)
      • Boost for (DU,-DU): NAT speakers use NN1 knowledge when reading foreign names, including NN2 names

      • Degradation for (DU,DU): reduced by selecting only one extra G2P
    • Decoding with multilingual acoustic model
      • NAT speakers: loss for NAT names, boost for English names only
        • Dutch sounds not as well modeled as before
        • English better known than French?
        • English and Dutch sound inventories differ more than French and Dutch?
      • Foreign speakers: boost for both NN1 name origins
        • mother tongue sounds better modeled
    • Plain multilingual G2P transcriptions bring no improvement
  • Creating pronunciation variants
    • Baseline P2Ps: Dutch G2P transcriptions as initials, AV transcriptions as targets
    • Alternative P2Ps for (DU,NN1) and (NN1,DU) cells
      • create additional P2P that starts from NN1 G2P transcriptions
      • combine most probable variants generated by both P2P converters
    • P2P variants lead to significant improvements for all (speaker, name) cells
      • 10 to 25% relative for NAT + foreign names, 5 to 17% for foreign speakers
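
The poster does not specify how the most probable variants of the two P2P converters are combined. One plausible reading, sketched here as an assumption, is to take the best few variants from each converter, merge them, and keep the higher probability when both converters propose the same transcription. The name `combine_variants` and the parameter `per_converter` are hypothetical.

```python
def combine_variants(variants_a, variants_b, per_converter=2):
    """Merge the most probable variants of two P2P converters.
    Each input is a list of (transcription, probability) pairs."""
    best = {}
    for variants in (variants_a, variants_b):
        ranked = sorted(variants, key=lambda v: v[1], reverse=True)
        for transcription, prob in ranked[:per_converter]:
            # Keep the higher probability for duplicate transcriptions.
            best[transcription] = max(prob, best.get(transcription, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

The probabilities of two separately trained converters are not necessarily comparable, so a real system might rescale them before ranking the merged list.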

Acknowledgments

The presented work was carried out in the Autonomata TOO project, granted under the Dutch-Flemish STEVIN program (http://taalunieversum.org/taal/technologie/stevin/), with partners RU Nijmegen, Universiteit Utrecht, Nuance and TeleAtlas.

