
Chinese Romanization for Chinese Voice Browsing

IBM China Research Lab


Presentation Transcript


  1. Chinese Romanization for Chinese Voice Browsing • IBM China Research Lab

  2. Index • Motivations & Proposals • IPA vs. Chinese Romanization • Chinese Romanization Standards • Implementations of Chinese Romanization in SSML • Extensions for other languages

  3. Motivations & Proposals

  4. IBM Speech Synthesis System • The IBM speech synthesis system supports about 20 languages. • For Asian languages, we cover: • Mandarin, • Cantonese, • Korean, • Japanese, • Thai.

  5. Pronunciation annotations are important for Chinese • A Chinese character represents a meaning more than a pronunciation. • Homographs (one character with several pronunciations) are very common among Chinese characters. • So it is very helpful if the pronunciation can be given explicitly.

  6. Proposals • We propose to use Chinese Romanization to annotate Chinese pronunciation in the “phoneme” element. • We also propose that SSML allow the diverse predefined and widely used pronunciation annotation standards of different languages. • SSML can then be more easily accepted and used around the world. • Note: Chinese Romanization = Hanyu Pinyin in this presentation.

  7. IPA vs. Chinese Romanization

  8. Comparison Rule: Goal of SSML • The goal of SSML is to “provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications”. • To reach that goal, more and more users, such as ordinary Web application developers, need to be able to learn and use SSML easily. • So SSML should be defined in terms of ordinary people’s knowledge and skills rather than the knowledge of professional linguists. • Otherwise, it will be a long time before SSML is widely accepted and used around the world.

  9. IPA is not well suited to Chinese • IPA tries to collect an exhaustive set of pronunciations for all languages. • As a result, it has become complicated and difficult to input. • A well-educated Chinese adult cannot annotate Chinese pronunciation in IPA without special training. • IPA is not widely used in China. • Special linguistic phenomena in Chinese, such as tone and retroflexion, cannot be conveniently described in IPA.

  10. Chinese Romanization is well suited to Chinese • Chinese Romanization is designed specifically for Chinese rather than for all languages. • Adding ‘r’ at the end describes a “retroflex” syllable. • Adding a ‘tone’ attribute describes the tone. • Chinese Romanization is widely used and taught. • Chinese people learn Chinese Romanization in primary school. • Many foreigners begin to learn Chinese through Chinese Romanization. • Chinese Romanization is widely used to input Chinese characters on computers. • The Chinese government has put a standard for Chinese Romanization into effect. • It applies to education, publishing, information processing and other related industries in China.

  11. Chinese Romanization Standards

  12. Chinese Romanization Standard • The writing rules of Chinese Romanization conform to the P.R.C. state standard “Basic rules for Hanyu Pinyin Orthography” [1], published by CSBQTS in 1996. • This orthography is based on the “Hanyu Pinyin Schema” published in 1958. • Following the naming convention for alphabets (vendor-defined names start with “x-”), we propose “x-CSBQTS-96” to represent the Chinese Romanization alphabet. We also propose “x-Pinyin-96”, which is easier to remember. • CSBQTS: China State Bureau of Quality and Technical Supervision

  13. Hanyu Pinyin Schema (published in 1958) • Character set: 25 characters, all from ‘a’ to ‘z’ except ‘ü’. (For ease of input on a computer, ü is replaced by v.) • Initial set: b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s • Final set: i, u, ü, a, ia, ua, o, uo, e, ie, üe, ai, uai, ei, uei, ao, iao, ou, iou, an, ian, uan, üan, en, in, uen, ün, ang, iang, uang, eng, ing, ueng, ong, iong • Tone annotation: mā, má, mǎ, mà, ma • Separator: ’ (apostrophe), as in pi’ao
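The tone annotation on this slide (mā, má, mǎ, mà, ma) can be read mechanically. The following is a minimal sketch, not part of the original slides, that recovers the tone number from a tone-marked syllable by Unicode decomposition. The function name `split_tone` and the use of 0 for the neutral tone are assumptions made for illustration.

```python
import unicodedata
from typing import Tuple

# Combining marks that Hanyu Pinyin uses for tones 1-4.
TONE_MARKS = {
    "\u0304": 1,  # macron, as in mā
    "\u0301": 2,  # acute, as in má
    "\u030c": 3,  # caron, as in mǎ
    "\u0300": 4,  # grave, as in mà
}


def split_tone(syllable: str) -> Tuple[str, int]:
    """Return (toneless syllable, tone number); 0 stands for the neutral tone.

    Hypothetical helper: the name and the 0 convention are illustrative only.
    """
    tone = 0
    kept = []
    # NFD splits each accented letter into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)  # the diaeresis of ü (U+0308) is kept
    # NFC recombines u + diaeresis back into ü.
    return unicodedata.normalize("NFC", "".join(kept)), tone
```

The diaeresis of ü is not a tone mark, so it survives the stripping step and the toneless syllable still distinguishes lü from lu.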

  14. Pinyin vs. IPA

  15. Basic rules for Hanyu Pinyin Orthography (published in 1996) 1. Words are the basic units for spelling the Chinese Common Language. (Spaces separate words.) • rén (person/people), péngyou (friend[s]), túshūguǎn (library/libraries) • gōngrén hé nóngmín (workers and farmers) 2. Structures of two or three syllables that indicate a complete concept are linked: • quánguó (the whole nation), duìbuqǐ (sorry) 3. Terms of 4 or more syllables are separated if they can be split into words; otherwise all the syllables are linked: • wúfèng gāngguǎn (seamless steel tube), Hóngshízìhuì (Red Cross)

  16. Basic rules for Hanyu Pinyin Orthography (published in 1996) 4. Reduplicated monosyllabic words are linked, but reduplicated disyllabic words are separated: • rénrén (everybody), chángshi chángshi (give it a try) 5. In certain situations, a hyphen can be added to make the words easier to read and understand: • huán-bǎo (environmental protection), shíqī-bā suì (17 or 18 years old)

  17. Implementations of Chinese Romanization in SSML

  18. Implementation 1
      <?xml version="1.0"?>
      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
             xml:lang="zh-CN">
        <phoneme alphabet="x-CSBQTS-96" ph="duìbuqǐ">对不起</phoneme>
        <!-- This is an example of Chinese Romanization standard tone annotation -->
      </speak>
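A document like Implementation 1 can also be generated programmatically. The sketch below is an illustration rather than anything from the original slides: it builds the `speak` and `phoneme` elements with Python's standard `xml.etree.ElementTree`. The function name `build_ssml` is hypothetical, "x-CSBQTS-96" is the alphabet name proposed in this presentation (not a registered one), and the language tag is assumed to be zh-CN (Chinese as used in mainland China).

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text: str, pinyin: str, alphabet: str = "x-CSBQTS-96") -> str:
    """Wrap `text` in a <phoneme> element carrying its pinyin pronunciation.

    Hypothetical helper; the alphabet name is the one proposed in the slides.
    """
    # Serialize the SSML namespace as the default namespace (no prefix).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        "{%s}speak" % SSML_NS,
        {"version": "1.0", "{%s}lang" % XML_NS: "zh-CN"},
    )
    phoneme = ET.SubElement(
        speak,
        "{%s}phoneme" % SSML_NS,
        {"alphabet": alphabet, "ph": pinyin},
    )
    phoneme.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("对不起", "duìbuqǐ"))
```

Building the tree through an XML library, instead of concatenating strings, keeps attribute values such as `ph` correctly escaped whatever characters the romanization contains.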

  19. Implementation 2
      <?xml version="1.0"?>
      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
             xml:lang="zh-CN">
        <phoneme alphabet="x-CSBQTS-96" ph="dui4bu0qi3">对不起</phoneme>
        <!-- This is an example of Chinese Romanization
             using numbers to describe tone -->
      </speak>

  20. Comparison between the two implementations • Implementation 1: <phoneme alphabet="x-CSBQTS-96" ph="duìbuqǐ">对不起</phoneme> • Implementation 2: <phoneme alphabet="x-CSBQTS-96" ph="dui4bu0qi3">对不起</phoneme> • Note: "x-CSBQTS-96" may be replaced by "x-Pinyin-96"
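The two notations carry the same information, so one can be derived from the other. The following is a minimal sketch, not taken from the slides, that converts the tone-number form of Implementation 2 into the tone-mark form of Implementation 1. It assumes the commonly taught placement rule (mark a or e if present, else the o of "ou", else the last vowel), treats 0 as the neutral tone as in "bu0", and accepts v as a stand-in for ü. The function names are hypothetical.

```python
import re
import unicodedata

# Combining marks for tones 1-4, indexed by tone - 1.
_MARKS = ["\u0304", "\u0301", "\u030c", "\u0300"]
_VOWELS = "aeiou\u00fc"


def mark_syllable(base: str, tone: int) -> str:
    """Place the mark for `tone` (1-4) on `base`; tone 0 leaves it unmarked."""
    base = base.replace("v", "\u00fc")  # v is the keyboard stand-in for ü
    positions = [i for i, ch in enumerate(base) if ch in _VOWELS]
    if tone == 0 or not positions:
        return base
    if "a" in base:
        target = base.index("a")
    elif "e" in base:
        target = base.index("e")
    elif "ou" in base:
        target = base.index("ou")
    else:
        target = positions[-1]  # otherwise the last vowel carries the mark
    # NFC composes vowel + combining mark into a single precomposed letter.
    marked = unicodedata.normalize("NFC", base[target] + _MARKS[tone - 1])
    return base[:target] + marked + base[target + 1:]


def numbers_to_marks(text: str) -> str:
    """Convert e.g. 'dui4bu0qi3' to the tone-mark spelling of the same word."""
    pairs = re.findall(r"([a-z\u00fc]+)([0-4])", text)
    return "".join(mark_syllable(base, int(tone)) for base, tone in pairs)


if __name__ == "__main__":
    print(numbers_to_marks("dui4bu0qi3"))
```

The number form is easy to type on any keyboard, while the mark form matches the published orthography, so a converter of this kind would let an SSML processor accept either.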

  21. Extensions for other languages

  22. Extension for Cantonese • The Linguistic Society of Hong Kong published a simple, easy-to-learn and easy-to-use “LSHK Cantonese Romanization Scheme” in 1993. • This scheme is widely adopted in various areas: education, Cantonese information processing, computer input methods, etc. • So we also propose to use “The LSHK Cantonese Romanization Scheme” to annotate Cantonese pronunciation.

  23. Extension for more languages • Though it is possible to devise a general standard to annotate the pronunciation of all languages, such a standard may become very complex to use. • Another way is to use the predefined and widely accepted pronunciation annotation standards of the different languages. • At the least, these diverse standards should be an important complement to the general standard.
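The LSHK scheme (commonly known as Jyutping) writes each syllable in lowercase letters followed by a tone digit from 1 to 6, so it can be processed much like number-tone pinyin. The sketch below is an illustration under that assumption and does not come from the slides. The function name `parse_jyutping` and the sample word are chosen for the example only.

```python
import re
from typing import List, Tuple


def parse_jyutping(text: str) -> List[Tuple[str, int]]:
    """Split a romanized string into (syllable, tone) pairs.

    Hypothetical helper: assumes every syllable ends in a tone digit 1-6.
    """
    return [(syl, int(tone)) for syl, tone in re.findall(r"([a-z]+)([1-6])", text)]


if __name__ == "__main__":
    # "gwong2dung1waa2" is the romanization of 廣東話 (Cantonese).
    print(parse_jyutping("gwong2dung1waa2"))
```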

  24. Thank you!

  25. Korean Romanization • It is used in our Korean speech synthesis system.

  26. Japanese Romanization • Japanese: • まだ覚えているでしょう 波音に包まれて • Japanese Romanization: • mada oboeteiru deshou nami oto ni tsutsumarete • English meaning: • Do you remember being surrounded by the sound of the waves?

  27. Discussion of “Word” • What is the definition of “word” in Chinese? • Prosodic word or grammatical word? • 你来还是不来?nǐ lái háishi bù lái? • Is “不来” a word? • What is the difference between ‘word’ and ‘break’? • The misunderstanding problem can be solved by adding ‘break’. • Can word information be handled by ‘Hanyu Pinyin Orthography’? • In ‘Hanyu Pinyin Orthography’, spaces are used to separate words.
