Construction of phoneme-to-phoneme converters

High level features

[Figure: training pipeline. Orthography, an initial transcription and a target transcription feed a two-fold alignment process (letter-to-sound and sound-to-sound), followed by transformation learning, learning of morphological classes, example generation and stochastic rule induction.]

Towards improved proper name recognition

Bert Réveil and Jean-Pierre Martens

DSSP group, Ghent University, Department of Electronics and Information Systems

Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium

{breveil,martens}@elis.ugent.be

  • Topic description


  • Automatic proper name recognition is a key component of multiple speech-based applications (e.g. voice-driven navigation systems). This recognition is challenged by the mismatch between the way the names are represented in the recognizer and the way they are actually pronounced:

    • Incorrect phonemic name transcriptions: common grapheme-to-phoneme (G2P) converters can’t cope with archaic spelling and foreign name parts, and manual transcriptions are too costly (e.g. Ugchelsegrensweg, Haînautlaan)

    • Multiple plausible name pronunciations: within or across languages (e.g. Roger)

    • Cross-lingual pronunciation variation: foreign names, foreign application users

  • To improve the phonemic transcriptions and capture the pronunciation variation, we adopt acoustic and lexical modeling approaches. Acoustic modeling targets a better modeling of the expected utterance sounds; lexical modeling tries to foresee the most plausible phonemic transcription(s) for each name in the recognition lexicon.

[Figure: in-car GPS example. The user asks the recognition system "Please guide me towards ‘A&u.stIn", but the system (HMMs + lexicon) lists Austin only with the typical transcription 'O.stIn.]
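To make the mismatch concrete, here is a minimal Python sketch (the lexicon content and helper function are hypothetical, not from the poster): a recognition lexicon stores one or more phonemic transcriptions per name and can only match a realization it actually lists, so the English-like realization of Austin is covered only once it is added as a variant.

    # Hypothetical mini-lexicon: name -> list of phonemic transcriptions.
    lexicon = {
        "Austin": ["'O.stIn"],            # typical Dutch (TY) transcription only
    }

    def covers(lexicon, name, realized):
        # A realization can only be recognized if the lexicon lists it.
        return realized in lexicon.get(name, [])

    print(covers(lexicon, "Austin", "'A&u.stIn"))  # False: likely misrecognition
    lexicon["Austin"].append("'A&u.stIn")          # lexical modeling: add a variant
    print(covers(lexicon, "Austin", "'A&u.stIn"))  # True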

  • Experimental set-up


  • Database: Autonomata Spoken Name Corpus (ASNC)

  • 120 Dutch, 40 English, 20 French, 40 Moroccan and 20 Turkish speakers

  • Every speaker reads 181 names of Dutch, English, French, Moroccan or Turkish origin

  • Non-overlapping train and test sets (disjoint names and speakers)

  • Human expert transcriptions

    • TY: typical Dutch transcription (one for each name from TeleAtlas)

    • AV: auditory verified Dutch transcription (one for each name utterance)

  • This work: only Dutch native utterances + non-native utterances of Dutch names

  • Speech recognizer: state-of-the-art VoCon 3200 from Nuance

  • Grammar: name loop with 21K different names (3.5K names of ASNC + 17.5K others)

  • Acoustic and lexical modeling strategies


  • The modeling approaches are first conceived for the primary targeted users, also called the native (NAT) users (in our case Dutch natives). With respect to these users, two types of non-native languages are distinguished: foreign languages that most NAT speakers are familiar with (NN1), and other foreign languages (NN2).

  • Strategy 1: Incorporating NN1 language knowledge

  • Acoustic modeling: two model sets

    • AC-MONO: standard NAT Dutch model (trained on Dutch speech alone)

    • AC-MULTI: Dutch (20%) and NN1 training data (English, French and German)

  • Lexical modeling

    • G2P transcribers for NAT and NN1 languages (Nuance RealSpeak TTS)

      • Foreign transcriptions are nativized in combination with AC-MONO (see the sketch after this list)

    • Data-driven selection of one extra G2P converter per name origin

  • Strategy 2: Creating pronunciation variants (lexical modeling)

    • Computed per (speaker, name) combination

    • Created from initial G2P transcriptions by means of automatically learned phoneme-to-phoneme (P2P) converters
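As an illustration of the nativization step mentioned under Strategy 1, the Python sketch below replaces foreign phonemes by close native counterparts; the mapping table is invented for illustration and is not the mapping used in this work.

    # Illustrative (invented) mapping from English phonemes to close Dutch ones.
    EN_TO_NL = {
        "{": "E",     # English ash vowel -> Dutch E
        "T": "t",     # English 'th' -> Dutch t
        "r=": "@r",   # syllabic r -> schwa + r
    }

    def nativize(phonemes):
        # Keep native phonemes, replace foreign ones by their closest native match.
        return [EN_TO_NL.get(p, p) for p in phonemes]

    print(nativize(["T", "I", "N"]))  # ['t', 'I', 'N']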

  • Construction of phoneme-to-phoneme converters


  • P2P learning requires the orthographic transcription, an initial G2P transcription and a target phonemic transcription (e.g. TY or AV) of a sufficiently large collection of name utterances. These 3-tuples are supplied to a four-step training procedure:

  • Two-fold alignment: Orthography ↔ Initial transcription ↔ Target transcription

  • Transformation retrieval

  • Generation of training examples: describe linguistic context

    • Previous and next phonemes and graphemes

    • Lexical context (Part Of Speech)

    • Prosodic context (stressed syllable or not)

    • Morphological context (word prefix/suffix)

    • External features: e.g. name type, name source, speaker tongue

  • Rule induction

    • Learn a decision tree per input pattern: stochastic rules in the leaf nodes

    • Rule formalism: if the context leads to a given leaf node, then [input pattern] → [output pattern] with firing probability P_fir

      In generation mode, the rules are applied to the initial G2P transcription of an unseen name, yielding pronunciation variants with probabilities (see the sketch below)
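A minimal Python sketch of the generation mode, under assumed data structures: each learned rule rewrites an input phoneme pattern into an output pattern with a firing probability P_fir, guarded here by a context test on the previous phoneme only. The rule content and probabilities are invented for illustration; the real converters condition on the much richer context features listed above.

    from itertools import product

    # Hypothetical rule set: (input pattern, required previous phoneme or None,
    # output pattern, firing probability P_fir).
    RULES = [
        (("O",), None, ("A&u",), 0.3),   # 'O' may be realized as 'A&u'
    ]

    def variants(initial):
        # For each position, collect the identity plus every firing rule output,
        # then combine the per-position alternatives into weighted variants.
        options = []
        for i, ph in enumerate(initial):
            alts = [((ph,), 1.0)]        # identity keeps the leftover probability mass
            for pattern, prev, out, p_fir in RULES:
                if (ph,) == pattern and (prev is None or (i > 0 and initial[i - 1] == prev)):
                    alts[0] = (alts[0][0], alts[0][1] - p_fir)
                    alts.append((out, p_fir))
            options.append(alts)
        for combo in product(*options):
            phones = [p for segment, _ in combo for p in segment]
            prob = 1.0
            for _, p in combo:
                prob *= p
            yield phones, prob

    for trans, prob in variants(["O", "s", "t", "I", "n"]):
        print(" ".join(trans), round(prob, 2))
    # O s t I n 0.7
    # A&u s t I n 0.3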

  • Experimental assessment


    • Incorporating NN1 language knowledge

      • Including extra G2P transcriptions (acoustic model = AC-MONO)

        • Boost for (DU,-DU): NAT speakers use NN1 knowledge when reading foreign names, including NN2 names

        • Degradation for (DU,DU): reduced by selecting only one extra G2P

      • Decoding with multilingual acoustic model

        • NAT speakers: loss for NAT names, boost for English names only

          • Dutch sounds not as well modeled as before

          • English better known than French?

          • English and Dutch sound inventories differ more than French and Dutch?

        • Foreign speakers: boost for both NN1 name origins

          • mother tongue sounds better modeled

      • Plain multilingual G2P transcriptions bring no improvement

    • Creating pronunciation variants

      • Baseline P2Ps: Dutch G2P transcriptions as initials, AV transcriptions as targets

      • Alternative P2Ps for (DU,NN1) and (NN1,DU) cells

        • create additional P2P that starts from NN1 G2P transcriptions

        • combine the most probable variants generated by both P2P converters (see the sketch after this list)

      • P2P variants lead to significant improvements for all (speaker, name) cells

        • 10 to 25% relative for NAT speakers on foreign names, 5 to 17% for foreign speakers
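A minimal sketch of the variant combination, assuming each P2P converter returns weighted (transcription, probability) pairs; pooling duplicates by their maximum probability and keeping the n best is an illustrative choice, not necessarily the exact scheme used here.

    # Hypothetical outputs of two P2P converters for the same name.
    from_dutch_g2p = [(["'O.stIn"], 0.7), (["'A&u.stIn"], 0.3)]
    from_nn1_g2p = [(["'O:.stIn"], 0.6), (["'A&u.stIn"], 0.4)]

    def combine(variants_a, variants_b, n=3):
        # Pool both variant lists, merge duplicates, keep the n most probable.
        pooled = {}
        for trans, prob in variants_a + variants_b:
            key = tuple(trans)
            pooled[key] = max(pooled.get(key, 0.0), prob)
        best = sorted(pooled.items(), key=lambda kv: kv[1], reverse=True)[:n]
        return [(list(key), prob) for key, prob in best]

    print(combine(from_dutch_g2p, from_nn1_g2p))
    # [(["'O.stIn"], 0.7), (["'O:.stIn"], 0.6), (["'A&u.stIn"], 0.4)]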

  • Acknowledgments

  The presented work was carried out in the Autonomata TOO project, granted under the Dutch-Flemish STEVIN program (http://taalunieversum.org/taal/technologie/stevin/), with partners RU Nijmegen, Universiteit Utrecht, Nuance and TeleAtlas.


