  • Internationalizing W3C's Speech Synthesis Markup Language, Workshop II, Heraklion, Crete, May 2006

  • Considerations on using PLS for Slovenian Pronunciation Lexicon Construction

  • Jerneja Žganec Gros

  • Alpineon d.o.o., Ljubljana, Slovenia

  • [email protected]


Presentation outline

  • Introduction

  • SI-PRONlexicon:

    • word list

    • lexicon format

    • phonetic transcription

    • morpho-syntactic descriptions

  • Proposed extensions to PLS, SSML

  • Conclusions


Introduction

  • Speech technology applications:

    • automatic speech recognition (ASR)

    • text-to-speech synthesis (TTS)

    • both require a consistent specification of pronunciation

    • Slovenian: the position of lexical stress is not fixed, so a pronunciation lexicon is crucial

  • Pronunciation lexicons:

    • general

    • application-specific

      • word/phrase pronunciations

      • application-specific proper nouns: personal and location names


Slovenian pron lex

  • General:

    • S5 (Gros et al., 1996)

    • Onomastica (Derlić and Kačič, 1997)

    • SImlex/SIflex (Verdonik et al., 2002)

    • SI-LC-STAR (Verdonik and Rojc, 2004)

    • AlpSynth (Gros et al., 2002)

    • SI-BN (Žibert, 2005; Žgank, 2005)

  • Application-specific:

    • Gopolis, SpeechDAT, etc.


Word-list

  • SI-PRON wordlist:

    (a) 93,154 lemmas from SSKJ

    (b) over 1,000,000 word forms derived from (a) by morphological derivation

    (c) additional word list:

      • corpus-based search

      • 20,000 most frequent inflected word forms not covered by SSKJ lemmas

    (d) collocations, multi-word expressions

  • SSKJ: Slovar slovenskega knjižnega jezika (Dictionary of the Slovenian Standard Language)


Phonetic transcriptions

  • SSKJ lemmas:

    • automatic derivation, based on dynamic/tonemic accent information

    • manual corrections for about 2,500 lemmas (words of foreign origin)

  • Word forms derived from SSKJ:

    • automatic: SSKJ lemma pronunciation look-up, inflectional paradigms

  • Additional corpus-based word list:

    • automatic lexical stress assignment

    • AlpSynth grapheme-to-phoneme rule set


GTP rules

  • 193 context-dependent grapheme-to-phoneme rules:

    Left context   Grapheme string   Right context   Phonetic transcr.   Example     Rule explanation
    $              er                _               [@r]                Gaber       @ occurs before each -r not followed by a vowel (Toporišič 91, p. 49)
    =              m                 f               [F]                 Simfonija   <m> in front of <f> and <v> is pronounced as a labiodental (Pravopis90, p. 145)
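
A minimal Python sketch of how context-dependent rules of this shape could be applied, one grapheme string at a time. It is not the AlpSynth rule engine. The slide does not define the context symbols, so the readings used here are assumptions: "$" as "any consonant", "_" as "end of word" and "=" as "any context". The consonant class, the rule encoding and the copy-through fallback are likewise illustrative.

```python
import re

# ASSUMPTION: "$" on the slide means "a consonant"; this letter class is illustrative.
CONSONANT = "[bcčdfghjklmnprsštvzž]"

# Each rule: (left-context regex, grapheme string, right-context regex, transcription).
# Only the two rules shown on the slide are encoded; the real set has 193 rules.
RULES = [
    (CONSONANT, "er", r"$", "@r"),  # slide row: $  er  _  [@r]   (ASSUMPTION: "_" = end of word)
    (r"", "m", r"f", "F"),          # slide row: =  m   f  [F]    (ASSUMPTION: "=" = any context)
]


def transcribe(word):
    """Apply the first matching rule at each position, scanning left to right."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        for left, graph, right, phon in RULES:
            if (word.startswith(graph, i)
                    and re.search("(?:" + left + ")$", word[:i])
                    and re.match(right, word[i + len(graph):])):
                out.append(phon)
                i += len(graph)
                break
        else:
            # No rule matched: copy the grapheme unchanged (placeholder behaviour).
            out.append(word[i])
            i += 1
    return "".join(out)


print(transcribe("Gaber"))      # rule 1 fires on the word-final "er"
print(transcribe("Simfonija"))  # rule 2 fires on "m" before "f"
```

Keeping the rules as data rather than code means the rule table can grow without touching the matching loop, which is presumably why such rule sets are usually stored in tabular form.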


Transcription accuracy experiment

  • reference: hand-crafted pron lex, 30K lexemes

  • automatic lexical stress assignment: 25% error rate

  • lexical stress and o/e pronunciation known in advance:

    • transcription success rate 99.01% (0.6% handcrafting errors)

  • conclusion:

    • for semi-automatic derivation of Slovenian phonetic transcriptions with a 0.03% error rate, only the lexical stress positions and the e/o pronunciation need to be manually validated
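
A small sketch of how a success rate of this kind could be computed against a hand-crafted reference lexicon. The exact-match criterion and the toy entries are assumptions for illustration; they are not data from the experiment.

```python
def success_rate(reference, predicted):
    """Share of reference words whose predicted transcription matches exactly."""
    if not reference:
        raise ValueError("reference lexicon is empty")
    hits = sum(1 for word, trans in reference.items() if predicted.get(word) == trans)
    return hits / len(reference)


# Toy entries, NOT taken from SI-PRON; the strings only need to be comparable.
reference = {"gaber": "gab@r", "dober": "dob@r", "simfonija": "siFfonija"}
predicted = {"gaber": "gab@r", "dober": "dober", "simfonija": "siFfonija"}

rate = success_rate(reference, predicted)
print(f"success rate: {rate:.2%}")
```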


SI-PRON format

  • LC-STAR lexicon specs – STTS (Shamas & v Heuvel, 2004)

  • Pronunciation Lexicon Specification (PLS)

    • W3C Voice Browser Activity

    • Pronunciation lexicon markup language

    • Version 1.0, W3C Last Call Working Draft 31 January 2006

      • http://www.w3.org/TR/pronunciation-lexicon/

  • Two main applications:

    • Speech Synthesis (SSML documents)

      • PLS improves SSML on text normalization, GTP

    • Speech Recognition (SRGS grammars)

  • W3C standard! recommendation


PLS in SSML

  • SSML document references an external pron lexicon:

    • TTS engine loads the PLS documents and applies them to the SSML document

    • applications may specify contextual PLS documents, which are to be used in different points of the interaction (like airports.pls, carriers.pls, …)

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="sl">
  <lexicon uri="http://www.alpineon.com/airports.pls"/>
  <!-- English: The British Airlines plane arriving from Manchester will be 5 minutes late. -->
  Letalo letalske družbe British Airlines, ki prihaja iz
  Manchestra, bo imelo 5 minut zamude.
</speak>
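
As a rough illustration of the first step a TTS engine would take, the following Python sketch collects the lexicon references from an SSML document using the standard library. The function name and the shortened sample sentence are assumptions; fetching and applying the PLS documents is not shown.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def lexicon_uris(ssml_text):
    """Return the uri attribute of every <lexicon> element, in document order."""
    root = ET.fromstring(ssml_text)
    return [el.get("uri") for el in root.iter("{" + SSML_NS + "}lexicon")]


# Shortened version of the slide example (the XML declaration is omitted here).
SAMPLE = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="sl">'
    '<lexicon uri="http://www.alpineon.com/airports.pls"/>'
    'Letalo bo imelo 5 minut zamude.'
    '</speak>'
)

uris = lexicon_uris(SAMPLE)
print(uris)
```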


Phonetic alphabet

  • SI-SAMPA (Zemljak et al., 2002)

    • Speech Assessment Methods Phonetic Alphabet

    • only ASCII characters, not the IPA extended char set

    • augmented with additional markers for tonemic accents (tonemic acute and tonemic circumflex) and lexical stress accents (acute, circumflex and grave)


PLS

  • The <lexeme> element - container of a lexicon entry:

    • usually only one <grapheme> element

    • several <phoneme> or <alias> elements

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" xml:lang="sl-SI" alphabet="x-sampa-SI-reduced">
  <lexeme>
    <grapheme>dober</grapheme>
    <phoneme>"d/o:-[email protected]</phoneme>
    <!-- This is an example of the x-sampa-SI-reduced string
         for the pronunciation of the Slovenian word: "dober",
         meaning "good" in English -->
  </lexeme>
</lexicon>
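
A lexicon of this shape can also be generated programmatically. The sketch below uses Python's standard ElementTree; the helper name and the placeholder transcription are assumptions, and, as on the slide, the XML namespace declaration that the published PLS specification expects on the <lexicon> element is left out.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialized as xml:lang


def build_lexicon(entries, lang, alphabet):
    """entries: dict mapping a grapheme to a list of phoneme strings."""
    lexicon = ET.Element("lexicon", {"version": "1.0", XML_LANG: lang, "alphabet": alphabet})
    for grapheme, phonemes in entries.items():
        lexeme = ET.SubElement(lexicon, "lexeme")
        ET.SubElement(lexeme, "grapheme").text = grapheme
        for phoneme in phonemes:
            ET.SubElement(lexeme, "phoneme").text = phoneme
    return ET.tostring(lexicon, encoding="unicode")


# The transcription is a placeholder, NOT the SI-PRON entry for "dober".
xml_text = build_lexicon({"dober": ["PLACEHOLDER-TRANSCRIPTION"]}, "sl-SI", "x-sampa-SI-reduced")
print(xml_text)
```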


Pronunciation variations

  • multiple pronunciations:

    • several <phoneme> elements

    • preferred pronunciation:

      • indicated by the prefer attribute

      • usually the 1st pronunciation from the SSKJ

      • for some words, 2 pronunciations are equally preferred

      • example: masculine Slovenian nouns ending in "ilec", like /borilec/, /darovalec/

        • "iUts"/"ilts", "ilts"/"iUts", "ilts", or "iUts"

        • the variants typically account for a more fluent "iUts" or an overarticulated "ilts" pronunciation
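
One possible way to read the preference information back out of a lexeme, sketched in Python. Returning every pronunciation marked prefer="true" is a design choice made here to represent "equally preferred" variants; it is an assumption, not behaviour prescribed by PLS. The phoneme strings are only the word endings quoted above, not full transcriptions.

```python
import xml.etree.ElementTree as ET


def preferred_pronunciations(lexeme):
    """Return the phonemes marked prefer="true"; if none is marked,
    fall back to the first pronunciation listed."""
    phonemes = lexeme.findall("phoneme")
    marked = [p.text for p in phonemes if p.get("prefer") == "true"]
    if marked:
        return marked
    return [phonemes[0].text] if phonemes else []


# Illustrative entries: only the endings from the slide, not real lexicon data.
BOTH_PREFERRED = ET.fromstring(
    '<lexeme><grapheme>borilec</grapheme>'
    '<phoneme prefer="true">iUts</phoneme>'
    '<phoneme prefer="true">ilts</phoneme></lexeme>'
)
NONE_MARKED = ET.fromstring(
    '<lexeme><grapheme>darovalec</grapheme>'
    '<phoneme>ilts</phoneme><phoneme>iUts</phoneme></lexeme>'
)

print(preferred_pronunciations(BOTH_PREFERRED))
print(preferred_pronunciations(NONE_MARKED))
```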


Extensions…

  • proposed extension:

    • a new optional attribute for the <phoneme> element:

      • pron-style attribute

      • values: "fluent", "overarticulated"

    • pron-style also for other elements:

      • <voice>, <speak>, <p>, <s>

      • another optional attribute for the above elements: emotion, for expressive TTS?
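
To make the proposal concrete, here is a hedged Python sketch of how a synthesizer might use such an attribute. pron-style is the extension proposed on this slide and is not part of PLS; the selection function, its fallback to the first pronunciation and the sample entry are assumptions.

```python
import xml.etree.ElementTree as ET


def pick_by_style(lexeme, style):
    """Return the first phoneme whose (proposed) pron-style matches;
    otherwise fall back to the first pronunciation listed."""
    phonemes = lexeme.findall("phoneme")
    for phoneme in phonemes:
        if phoneme.get("pron-style") == style:
            return phoneme.text
    return phonemes[0].text if phonemes else None


# Hypothetical entry using the proposed attribute; endings only, as on the slides.
LEXEME = ET.fromstring(
    '<lexeme><grapheme>borilec</grapheme>'
    '<phoneme pron-style="fluent">iUts</phoneme>'
    '<phoneme pron-style="overarticulated">ilts</phoneme></lexeme>'
)

print(pick_by_style(LEXEME, "overarticulated"))
print(pick_by_style(LEXEME, "whispered"))  # unknown style: falls back to the first entry
```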


Extensions…

  • dialects:

    • user-friendly apps require dialect/sociolect pronunciation variations

    • another optional attribute for the following elements:

      <phoneme>, <voice>, <speak>, <p>, <s>

      - RFC 3066-like identifiers may be used to indicate dialects
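
If dialects were labelled with such identifiers, an engine would need a fallback when the requested dialect is missing. The Python sketch below borrows the truncation idea of the lookup scheme in RFC 4647 (drop trailing subtags until something matches). The dialect tags are invented for illustration and are not registered identifiers.

```python
def lookup_dialect(requested, available):
    """Drop trailing subtags from the requested tag until an available tag matches."""
    known = {tag.lower(): tag for tag in available}
    parts = requested.lower().split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in known:
            return known[candidate]
        parts.pop()
        # A dangling single-character subtag such as "x" is dropped as well.
        if parts and len(parts[-1]) == 1:
            parts.pop()
    return None


# HYPOTHETICAL tags: the private-use dialect subtags are made up for this example.
AVAILABLE = ["sl", "sl-SI", "sl-SI-x-gorenjsko"]

print(lookup_dialect("sl-SI-x-gorenjsko", AVAILABLE))  # exact match
print(lookup_dialect("sl-SI-x-primorsko", AVAILABLE))  # falls back to the standard variety
```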


Extensions…

  • source/creator:

    • only the <metadata> element is available

    • source of multiple pronunciations:

      • useful info when merging multiple PLS documents

      • some sources/creators may be more reliable than others…

        - additional optional attribute pron-source for the <phoneme> element
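
A sketch of the merging scenario, in Python, assuming each pronunciation carries its source as the proposed pron-source attribute would allow. The in-memory data layout, the ranking table and the source names are hypothetical.

```python
def merge_lexicons(lexicons, source_rank):
    """lexicons: list of (source, {grapheme: [phonemes]}) pairs.
    Returns {grapheme: [(phoneme, source), ...]} with pronunciations from more
    reliable sources first; a duplicate keeps its most reliable source."""
    merged = {}
    # Unranked sources sort after all ranked ones; sorted() is stable.
    ordered = sorted(lexicons, key=lambda item: source_rank.get(item[0], len(source_rank)))
    for source, entries in ordered:
        for grapheme, phonemes in entries.items():
            slot = merged.setdefault(grapheme, [])
            seen = {phoneme for phoneme, _ in slot}
            for phoneme in phonemes:
                if phoneme not in seen:
                    slot.append((phoneme, source))
                    seen.add(phoneme)
    return merged


# Hypothetical sources and toy data (word endings only).
RANK = {"SSKJ": 0, "auto-gtp": 1}
LEXICONS = [
    ("auto-gtp", {"borilec": ["ilts"]}),
    ("SSKJ", {"borilec": ["iUts", "ilts"]}),
]

merged = merge_lexicons(LEXICONS, RANK)
print(merged)
```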


Extensions…

  • part-of-speech tags:

    • Slovenian language – complex inflectional paradigm

      • including "dual" – like ancient Greek!

    • morphological, syntactic and semantic descriptors welcome in future revisions of the PLS document

  • proprietary <lemma>, <MSD> elements used in SI-PRON

    • MULTEXT-East MSDs (Erjavec, 2004)
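
For readers unfamiliar with MSDs: they are positional codes in which each character stands for one attribute value. The Python sketch below decodes the first four noun attributes only, following the usual MULTEXT-East layout (type, gender, number, case); it is a simplified assumption and does not cover the full specification or other parts of speech.

```python
# Simplified noun layout; later positions (e.g. animacy) are ignored in this sketch.
NOUN_ATTRS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd):
    """Expand a noun MSD such as "Ncmdn" into a feature dictionary."""
    if not msd.startswith("N"):
        raise ValueError("only noun MSDs are handled in this sketch")
    features = {"category": "noun"}
    for code, (name, values) in zip(msd[1:], NOUN_ATTRS):
        if code != "-":  # "-" marks an attribute that does not apply
            features[name] = values[code]
    return features


print(decode_noun_msd("Ncmdn"))  # a common masculine noun in the dual, nominative
```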


Conclusion

  • SI-PRON pronunciation lexicon for Slovenian

  • proposed extensions to PLS, SSML

    • pron-style attribute

    • emotion attribute

    • annotating dialects/sociolects

    • source/creator attribute

    • morpho-syntactic, semantic descriptors


Project Partners

  • L6-5405 project

    • Research of Slovenian Language in Lexicography and Lexicology based on Digital Language Resources

    • Spoken representation of Slovenian words:

      • http://bos.zrc-sazu.si/sskj.html

  • Alpineon

  • ZRC-SAZU

    • Fran Ramovš Institute of the Slovenian Language


SSKJ speech files


THANK YOU FOR YOUR ATTENTION!

