Chinese Languages - What we are facing in speech synthesis

claire

Presentation Transcript


  1. Chinese Languages - What we are facing in speech synthesis

  2. Chinese Languages • Dialects • Minority languages (55 big families) • Official language: Putonghua or Standard Chinese (SC); the common writing system is based on SC

  3. Chinese Dialects: 9 or 10 dialectal areas: Guan 官话, Jin 晋语, Wu 吴语, Hui 徽语, Xiang 湘语, Gan 赣语, Hakka 客家话, Yue 粤语 and Min 闽语 (with PingHua 平话 as the tenth). Mandarin is referred to as Standard Chinese or the common language. • It is the language used by the government. • It is the language normally spoken by native speakers. • The Guan (Mandarin) dialects are different from Mandarin. • Mandarin is based on the Guan (Mandarin) dialects. • The Guan (Mandarin) dialects cover large regional areas, and each differs tremendously from the others. • The Guan (Mandarin) dialects exist objectively, without the limitations or conventions that define Mandarin. The differences among the Chinese dialects are comparable to the differences among European languages such as Portuguese, Spanish and French.

  4. Chinese Guan distribution (map labels): LanYin Guan, Northeast Guan, Beijing Guan, JiLu Guan, ZhongYuan Guan, JiaoLiao Guan, JiangHuai Guan, Southwest Guan. After the Chinese language map from the Institute of Linguistics, CASS.

  5. A dialect is usually spoken by people from different provinces. Example: the modern Wu dialects
     Population    Place
     16,400,000    JiangSu
     11,850,000    Shanghai
     36,650,000    ZheJiang
      1,850,000    Northeast part of Jiangxi
        270,000    North Pu of Fujian
      3,100,000    South Anhui
     Total area: 137,500 km²

  6. Sub-dialects of Wu (map labels): 太湖片 (TaiHu), 苏沪嘉小片 (SuHu), 杭州小片 (HangZhou), 临绍小片 (LinShao), 宣州片 (XuanZhou), 台州片 (TaiZhou), 处衢片 (ChuQu), 瓯江片 (OuJiang). Neighbouring areas shown on the same map: 江淮官话 (JiangHuai Guan), 徽语 (HuiYu). After the Chinese language map from the Institute of Linguistics, CASS.

  7. The Wu dialect is a group of dialects spoken in ShangHai, ZheJiang, southern JiangSu, and parts of FuJian and AnHui. • Wu has about 70 million speakers, which makes it the second largest dialect group after Mandarin. The dialect of interest in this paper is Shanghainese, the native dialect of Shanghai, spoken by a population of more than 11,850,000. • Although it is a rather young member of the Wu family, Shanghainese is attracting more and more interest from researchers because of its economic and political importance.

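  8. Can we synthesize these dialectal words? Do we need to synthesize Chinese dialects? • A special phenomenon in most dialects except Cantonese: a sound with no corresponding written character. • “口” is used for spoken syllables that have no corresponding Chinese character. • Alternatively, a homophonous character is substituted for the sound. • (?%) • 有音無字现象:現今所有的漢語方言中,只有粵語已經成功地發展出一套漢字書寫系統,而且深植於民眾的日常生活中。 (The "sound without character" phenomenon: among all present-day Chinese dialects, only Cantonese has successfully developed its own system of written characters, one that is deeply rooted in people's daily life.) After Phonology of FuJian ShiPo dialect - 福建石陂 (North Min)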

  9. Examples from the MinNan dialect • 赶紧去口9+50+103淡薄水来互伊啉。 kua~ ki_n k_hi tsa~ ta_m po? tsui lai hO i li_m (in SAMPA-C) • 口4+27+103使甲侬口12+33+103中指,无礼貌。 bue sai ka? lO_N kiau tiO_N tsai bo le bau (in SAMPA-C) • 字写甲歪歪口9+86+103口9+86+106真否看。 li sia ka? uai uai tsuai? tsuai? tsi_n p_ha~i k_hua~ (in SAMPA-C) • 阿瑛敢会困口5+37+102灶骹? a i_N ka_m e k_hu_n tia_m tsau k_ha (in SAMPA-C) (The numbers after each “口” are the initial, final and tone codes of the syllable that has no written character.) Such placeholders cause confusion when dialectal text is processed. Only 65% of the sounds in the MinNan dialect have correct written characters (correct meaning with correct sound).
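The placeholder notation on this slide (“口” followed by initial+final+tone codes) can be read mechanically by a text front end. The following is a minimal Python sketch under stated assumptions: the names `Placeholder`, `find_placeholders` and `strip_codes` are hypothetical, the pattern assumes every coded syllable is written as “口” immediately followed by three numbers joined with `+`, and it would misread a genuine 口 ("mouth") character that happens to be followed by such numbers.

```python
import re
from typing import List, NamedTuple


class Placeholder(NamedTuple):
    """Codes attached to one character-less syllable (hypothetical container)."""
    initial: int
    final: int
    tone: int


# Assumed notation: "口" + initial code + "+" + final code + "+" + tone code.
PLACEHOLDER = re.compile(r"口(\d+)\+(\d+)\+(\d+)")


def find_placeholders(text: str) -> List[Placeholder]:
    """Return the (initial, final, tone) codes of every coded 口 in the text."""
    return [Placeholder(*map(int, m.groups())) for m in PLACEHOLDER.finditer(text)]


def strip_codes(text: str) -> str:
    """Drop the numeric codes but keep the 口 placeholder itself."""
    return PLACEHOLDER.sub("口", text)


sentence = "赶紧去口9+50+103淡薄水来互伊啉。"
codes = find_placeholders(sentence)
plain = strip_codes(sentence)
```

Keeping the codes separate from the visible text lets a synthesizer look up the sound from the codes while still displaying a readable sentence.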

  10. Different pronunciations in written (literary) and spoken (colloquial) words (文白异读). Example in the MinNan dialect: “命” (life) • in written words such as “命令、命名” (command, to name) it is pronounced /mi/; • in spoken words such as “性命、好命、命运” (life, good fortune, fate) it is pronounced /mia/.
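One way a synthesis front end might handle this split is a word-level lexicon that records which reading a character takes in each word. The sketch below is only an illustration built from the five example words on this slide; the table `WORD_STYLE`, the function `reading_of_ming` and the style labels are hypothetical, and the transcriptions /mi/ and /mia/ are copied from the slide as given (without tone marks).

```python
from typing import Optional

# Reading style of 命 in each example word from the slide (hypothetical lexicon).
WORD_STYLE = {
    "命令": "literary",
    "命名": "literary",
    "性命": "colloquial",
    "好命": "colloquial",
    "命运": "colloquial",
}

# Pronunciations as transcribed on the slide.
READING = {"literary": "mi", "colloquial": "mia"}


def reading_of_ming(word: str) -> Optional[str]:
    """Return the slide's reading of 命 inside `word`, or None if the word is unlisted."""
    style = WORD_STYLE.get(word)
    return READING[style] if style is not None else None


written = reading_of_ming("命令")   # literary reading
spoken = reading_of_ming("性命")    # colloquial reading
```

Returning `None` for unlisted words makes the gap explicit: the correct reading cannot be guessed from the character alone, which is the difficulty the slide describes.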

  11. Xun du (训读). Among polyphonic characters (characters with more than one reading) there is a general phenomenon called XunDu: the character has the correct meaning but is read with the pronunciation of a different word. For instance, when Xiamen speakers see the monosyllabic word “书” /su/ (book), they say /tse/, but this sound corresponds to another word, “册” (a book). So in the XiaMen dialect “书” (book) has two pronunciations: /su/, as in the words “书法、书写、楷书” (calligraphy, writing, regular script), and /tse/, as in the words “书包、书皮、买书、书呆、书虫” (school bag, book cover, to buy books, and two words for bookworm).

  12. Examples from the MinNan dialect: CRI (China Radio International) news reports read with colloquial pronunciations (白话音). 胡錦濤指出,近年來,中朝各領域交流合作取得了豐碩成果,給兩國人民帶來了實實在在的利益。中方願繼續本著互惠互利、共同發展的原則,鼓勵和支援中國企業同北韓企業開展不同形式的投資合作,推動兩國經貿合作關係不斷取得新進展。 金永南說,胡錦濤總書記的訪問必將在傳統的朝中友好合作關係史上寫下新的篇章。朝方將同中方攜手努力,加強朝中傳統友誼,按照互利原則,採取有力措施推進兩國合作。 29日當天,胡錦濤在北韓勞動黨總書記、國防委員會委員長金正日的陪同下,參觀了象徵朝中友誼的朝中大安友誼玻璃廠。 If a text is written in the common official writing system, it can also be read aloud in the MinNan dialect, except for some lexical words.

  13. Examples from the MinNan dialect • 聽眾朋友,說起江西廬山相信許多人都不會陌生,它是我國著名的旅遊勝地,是一座集風景、文化、宗教、教育、政治為一體的千古名山。這裡是中國山水詩的搖籃,古往今來,無數文人墨客慕名登臨廬山,為其留下4000餘首詩詞歌賦。 (A MinNan speaker reading a paragraph selected from a CRI article introducing LuShan Mountain.) • 听众朋友,讲起江西庐山,相信真多人拢(勿会)生分,伊是咱中国有名的旅游圣地,是一座集風景、文化、宗教、教育、政治為一體的千古名山。遮是中国山水诗的弧(同音字)篮,古往今来,無數文人墨客慕名登臨廬山,為伊留落来4000外首詩詞歌賦。 (The same content as it is actually spoken in the MinNan dialect; the parts marked in red on the slide differ greatly from the SC-based text above.) Although they can read an SC text aloud, MinNan speakers do not speak like that in real life.

  14. Spontaneous speech with a script. From CRI, title: Finding the real LuShan Mountain (看看廬山真面目) • The CRI script: 聽眾朋友,說起江西廬山相信許多人都不會陌生,它是我國著名的旅遊勝地,是一座集風景、文化、宗教、教育、政治為一體的千古名山。這裡是中國山水詩的搖籃,古往今來,無數文人墨客慕名登臨廬山,為其留下4000餘首詩詞歌賦。晉代高僧慧遠(西元334~416年)在山中建立東林寺,開創了佛教中的“凈土宗”,使廬山成為中國封建時代重要的宗教勝地。遺存至今的白鹿洞書院,是中國古代教育和理學的中心學府。廬山上還薈萃了各種風格迥異的建築傑作,包括羅馬式與哥特式的教堂、融合東西方藝術形式的拜佔庭式建築,以及日本式建築和伊斯蘭教清真寺等,堪稱廬山風景名勝區的精華部分。廬山不但擁有“秀甲天下”的自然風光,更有著豐厚燦爛的文化內涵。在今天的《中國百姓生活遊》節目中,今天我們就帶各位到廬山趴趴走。《中國百姓生活遊》節目和國家旅遊局共同主辦。 • What the two announcers actually say, in a more spontaneous speech style. The lexical words and the grammar differ seriously from the SC-based script above (transcription of the spontaneous dialogue): 嗯,神州抛抛走。今囝日咱要去走的即位所在呢,是足介赞的。即就是江西的庐山。啊讲起许个庐山哦,我是勿八去过,但是呃自小汉有讲读甲真多即的诗啊词啊。而且真多课文当中嘛有写遘介绍即的庐山。是啊庐山呢,是即的,呃,已经互联合国科教文组织号做世界自然遗产甲世界文化遗产。在咱中国安尼三十统个的即个世界遗产当中哦,象伊即个号做,咱叫做双宜哦,双宜的即的并无真多。是啊,我知影讲伊阁是汇集真多,呃,风景啊文化、宗教、教育、政治為一體,着是讲,伊也是真多年的一的古山啊。是啊。所以无伊要叫讲,阁要自然遗产,阁要文化遗产。你看哦,在咧即的为古代到现主时是诚多文人墨客拢来遘即的庐山。所以庐山伊着留落偌多即的诗词甲歌賦你知唔?4000外首啊。哇,看势即的所在一定是有伊真水的所在,无敢会有遮多人想要留落来。阿阁而且哦,你4000外条无可能过过共款啊……

  15. What is the problem or contradiction for speech synthesis (ML)? (Diagram labels: Common language, written: common writing system; spoken: Standard Chinese. Dialects, written: regional writing systems; spoken: regional dialects.) Most dialects except Cantonese lack a writing system, a dialectal grammar and a lexicon for dialectal sounds.

  16. What is our present task in speech synthesis? • To synthesize speech from the common written words (easier), OR • to synthesize the speech actually spoken by local people (more complicated).

  17. Coping with tones in Chinese in SSML • The number of phonological tones can reach 9 (Cantonese), whereas SC has only 4 lexical tones. • Many dialects have complicated tone sandhi rules, which apply not only within words but also between words, depending on syntactic or semantic relations.
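To make "tone sandhi" concrete, here is a deliberately simple Python sketch of one well-known rule from Standard Chinese, which the slide itself does not spell out: a third tone followed by another third tone is pronounced as a second tone (for example ni3 hao3 is spoken as ni2 hao3). The function name `third_tone_sandhi` is hypothetical, and the uniform treatment of runs of three or more third tones is a simplifying assumption; in real speech the result depends on word grouping and syntax, which is exactly the complication the slide points out for dialects.

```python
from typing import List


def third_tone_sandhi(tones: List[int]) -> List[int]:
    """Apply a simplified SC third-tone sandhi rule to a list of lexical tones (1-4).

    Each third tone that is immediately followed by a lexical third tone
    becomes a second tone. Grouping by words or syntax is ignored here.
    """
    surface = list(tones)
    for i in range(len(tones) - 1):
        # Compare against the lexical (input) tones, not the already-changed ones.
        if tones[i] == 3 and tones[i + 1] == 3:
            surface[i] = 2
    return surface


ni_hao = third_tone_sandhi([3, 3])  # ni3 hao3 -> ni2 hao3
```

A rule of this shape covers only one sandhi process in one variety; dialects with between-word sandhi would need the syntactic or semantic relation as an extra input.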

  18. New specifications relating to Chinese dialects proposed by our Institute • ISO/IEC 10646 accepted 24 tonal symbols (proposed by our institute in June 2004). • A phonetic alphabet for use in China, covering minority languages and Mandarin Chinese (“中国通用音标符号集”). This specification has been submitted to the Education Ministry of China and is intended to become a standard for Chinese language surveys, teaching and study, and also for speech information processing.

  19. Tones: 25 tones when the traditional five-level tone-letter scale is used in this specification: • 5 level tones • 10 rising tones • 10 falling tones. If tone sandhi variants (whose tonal icons are drawn on the right side) and short tones (drawn with a shorter line) are also counted, (25+25)*2 = 100 tones are obtained. • 20 long-short tones • 20 short-long tones • 30 concave tones • 30 vaulted (convex) tones
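The counts on this slide can be checked by enumeration. The Python sketch below assumes that a simple tone is a (start, end) pair of pitch levels 1 to 5, which reproduces the 5 level, 10 rising and 10 falling tones. It further assumes, as an interpretation not stated on the slide, that concave and vaulted tones are three-point contours whose middle level is strictly lower or strictly higher than both ends; that reading happens to give 30 of each. The 20 long-short and 20 short-long tones are not derived here because the slide does not define them.

```python
from itertools import product

LEVELS = range(1, 6)  # five-level pitch scale, 1 = lowest, 5 = highest

# Two-point contours: (start level, end level).
two_point = list(product(LEVELS, repeat=2))
level = [(a, b) for a, b in two_point if a == b]
rising = [(a, b) for a, b in two_point if a < b]
falling = [(a, b) for a, b in two_point if a > b]

# Three-point contours (assumed interpretation of "concave" and "vaulted").
three_point = list(product(LEVELS, repeat=3))
concave = [(a, b, c) for a, b, c in three_point if b < a and b < c]
vaulted = [(a, b, c) for a, b, c in three_point if b > a and b > c]

simple_total = len(level) + len(rising) + len(falling)  # 25 simple tones
# Slide arithmetic: 25 citation tones plus 25 sandhi variants, each long or short.
with_sandhi_and_short = (simple_total + simple_total) * 2
```

Under these assumptions the simple tones correspond to all 5 x 5 ordered pairs of levels, so every two-point contour on the scale has a symbol.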

  20. Sub-dialect coding • Surveyor's Handbook for the Survey of Language Use in China (《中国语言文字使用情况调查 调查员手册》), edited by the Office of the Leading Group for the Survey of Language Use in China (中国语言文字使用情况调查领导小组办公室编), YuWen Publishing House (语文出版社).
