Thomas Fang Zheng Oct. 29, 2004. Workshop of KFIS ( Korea Fuzzy Logic and Intelligent Systems Society ) Oct. 29-30, 2004, Kyungnam Univ. , Masan, Korea. Dialectal Chinese Speech Recognition. Motivation.

Thomas Fang Zheng

Oct. 29, 2004

Workshop of KFIS (Korea Fuzzy Logic and Intelligent Systems Society)

Oct. 29-30, 2004, Kyungnam Univ., Masan, Korea

Dialectal Chinese Speech Recognition

Motivation
  • Chinese ASR faces an issue that is bigger than in any other language: dialect.
  • There are 8 major dialectal regions in addition to Mandarin (Northern China):
    • Wu (Southern Jiangsu, Zhejiang, and Shanghai);
    • Yue (Guangdong, Hong Kong, Nanning Guangxi);
    • Min (Fujian, Shantou Guangdong, Haikou Hainan, Taipei Taiwan);
    • Hakka (Meixian Guangdong, Hsin-chu Taiwan);
    • Gan (Jiangxi);
    • Xiang (Hunan);
    • Hui (Anhui);
    • Jin (Shanxi, Hohhot Inner Mongolia).
  • These can be further divided into over 40 sub-categories.
  • Chinese dialects share the same written language:
    • The same Chinese pinyin set (canonically),
    • The same Chinese character set (canonically), and
    • The same vocabulary (canonically).
  • Standard Chinese (known as Putonghua, or PTH) is widely spoken in most regions of China.
  • However, speech is strongly influenced by the native dialects.
  • Most Chinese people speak both standard Chinese and their own dialect, resulting in dialectal Chinese: Putonghua influenced by the native dialect.
  • In dialectal Chinese:
    • Word usage, pronunciation, syntax, and grammar vary depending on the speaker's dialect.
    • ASR relies to a great extent on the consistent pronunciation and usage of words within a language.
    • As a result, ASR systems built to process PTH perform poorly for the great majority of the population.
Project Goal
  • To develop a general framework for modeling the following in dialectal Chinese ASR tasks:
    • Phonetic variability,
    • Lexical variability, and
    • Pronunciation variability.
  • To find suitable methods for modifying the baseline PTH recognizer to obtain a dialectal Chinese recognizer for the specific dialect of interest, employing:
    • dialect-related knowledge (syllable mapping, cross-dialect synonyms, …), and
    • training/adaptation data (in relatively small quantities, or even none).
  • Expectation: the resulting recognizer should also work for PTH; in other words, it should perform well on a mixture of PTH and dialectal Chinese.
[Diagram: Dialectal Chinese Related Knowledge & Resources + Standard Chinese Speech Recognizer, passed through the Dialectal Chinese Speech Recognition Framework, yield the Dialectal Chinese Speech Recognizer.]

[Diagram: the Dialectal Chinese Speech Recognition Framework applied to the Standard Chinese Speech Recognizer. Labels in the figure: LM Adapter; AM Adapter; Acoustic Regulator; Dialect-Related Lexical Entry Replacement Rules; Toned-Syllable Mappings (Word-Independent/-Dependent); Pronunciation Modeling (PM) Techniques (Accents & Spontaneous Speech); Language Post-Processing Algorithms.]

  • This proposal was selected as one of three projects for the 2003 Johns Hopkins University Summer Workshop, from dozens of proposals submitted by universities and companies around the world;
  • The workshop was postponed to 2004 due to SARS;
  • For practical reasons, during the summer we focused on only one specific dialect, the Wu dialect (Shanghai area), and the target language was Wu dialectal Chinese (WDC for short).
  • Why Wu dialect (1)?
    • Population: more than 70 million people speak the Wu dialect, the second most widely spoken dialect in China;
    • Economy: Shanghai is one of the most advanced cities in China.
  • Why Wu dialect (2)?
    • The Wu dialect is a fully developed language:
      • The syntax of the Wu dialect is very complex;
      • Its vocabulary is even larger than that of Mandarin;
      • Historically, many literary masterpieces were influenced by the Wu dialect.
Useful Dialect-Related Knowledge
  • Chinese Syllable Mapping (CSM)
    • This CSM is dialect-related.
    • Two types:
      • Word-independent CSM: e.g. in Southern Chinese, Initial mappings include zh→z, ch→c, sh→s, n→l, and so on, and Final mappings include eng→en, ing→in, and so on;
      • Word-dependent CSM: e.g. in Chuan dialectal Chinese, the pinyin 'guo2' changes to 'gui0' in the word '中国' (China), but only the tone changes in the word '过去' (past).
  • The CSM is not exact. For any mapping A→B, the resulting pronunciation is usually not exactly B but something quite similar to B, closer to B than to any other syllable.
  • The CSM could be N→1, 1→N, or crossed.
  • Bi is a variation of B, such as:
    • nasalization, centralization, voiced, voiceless, rounding, syllabic, pharyngealization, aspiration.
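
The word-independent mappings listed above can be illustrated with a small sketch. It assumes toneless pinyin input and uses only the example mappings named on the slides (zh→z, ch→c, sh→s, n→l, eng→en, ing→in); the helper names `split_syllable` and `map_syllable` and the list of initials are illustrative, not part of the original work.

```python
# Word-independent Chinese Syllable Mapping (CSM), minimal sketch.
# The tables hold only the examples from the slides; real mapping sets
# are dialect-specific and larger. Helper names are hypothetical.

INITIAL_MAP = {"zh": "z", "ch": "c", "sh": "s", "n": "l"}
FINAL_MAP = {"eng": "en", "ing": "in"}

# Two-letter initials come first so that "zh" is not read as "z" + "h...".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an"


def map_syllable(syllable):
    """Apply the Initial and Final mappings to one syllable."""
    initial, final = split_syllable(syllable)
    return INITIAL_MAP.get(initial, initial) + FINAL_MAP.get(final, final)


if __name__ == "__main__":
    for s in ["zheng", "shi", "ning", "an"]:
        print(s, "->", map_syllable(s))
```

Because both zh and z end up as z, this table is an instance of the N→1 case; a 1→N or crossed mapping would need a list of targets per source instead of a single value.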
  • Lexicon:
    • Linguists say the vocabulary similarity rate between PTH and the Wu dialect is about 60~70%.

  • A dialect-related lexicon contains two parts:
    • a common part shared by standard Chinese and most dialectal Chinese languages (over 50k words), and
    • a dialect-related part (several hundred words).
  • In this lexicon:
    • each word has one pinyin string for its standard Chinese pronunciation and some representation of its dialectal Chinese pronunciation, and
    • each dialect-related word corresponds to a word in the common part with the same meaning.
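
One way to picture the two-part lexicon described above is as a table of entries in which dialect-related words point back to their common-part synonym. The sketch below is only an assumed layout: the field names are hypothetical, and the dialect pronunciation strings are placeholders because the slides do not give the actual representation.

```python
# Assumed data layout for a dialect-related lexicon; field names are
# hypothetical and the Wu pronunciations are placeholders, not real data.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LexiconEntry:
    word: str
    pth_pron: str                          # pinyin string (standard Chinese)
    dialect_pron: str                      # representation of the dialectal pronunciation
    common_synonym: Optional[str] = None   # set only for dialect-related words


def build_lexicon(entries: List[LexiconEntry]) -> Dict[str, LexiconEntry]:
    """Index entries by word and check that every synonym link resolves."""
    lexicon = {entry.word: entry for entry in entries}
    for entry in entries:
        if entry.common_synonym is not None:
            assert entry.common_synonym in lexicon, entry.word
    return lexicon


def to_common_word(lexicon: Dict[str, LexiconEntry], word: str) -> str:
    """Map a dialect-related word to its common-part synonym."""
    entry = lexicon[word]
    return entry.common_synonym or entry.word


lexicon = build_lexicon([
    LexiconEntry("做饭", "zuo4 fan4", "<wu-pron-placeholder>"),
    LexiconEntry("烧饭", "shao1 fan4", "<wu-pron-placeholder>", common_synonym="做饭"),
])
print(to_common_word(lexicon, "烧饭"))
```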
  • Language
    • Although it is difficult to collect dialect texts, dialect-related lexical entry replacement rules can be learned in advance, and therefore
    • language post-processing or language model adaptation techniques can be adopted.
  • Dialectal words substitute for some words:
    • 我 做饭 给 你 吃 (PTH) vs. 我 烧饭 给 你 吃 (Wu): "I cook for you"
  • Word-order changes:
    • 你 先 走 (PTH) vs. 你 走 先 (Wu): "you go first"
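
The two kinds of variation shown above (word substitution and word-order change) suggest what a language post-processing step could look like. The following is a toy sketch under stated assumptions: the rule table contains only the one substitution from the slide, and `move_trailing_xian` is a hypothetical, deliberately narrow reordering rule that covers only the slide's example.

```python
# Toy language post-processing: normalize Wu-influenced word sequences
# toward PTH. Both rules are illustrative and cover only the slide examples.

REPLACEMENT_RULES = {"烧饭": "做饭"}  # Wu word -> PTH synonym


def normalize_words(words, rules=REPLACEMENT_RULES):
    """Replace dialect-related words with their PTH synonyms."""
    return [rules.get(word, word) for word in words]


def move_trailing_xian(words):
    """Rewrite a sentence-final 'V 先' as '先 V' (你 走 先 -> 你 先 走)."""
    if len(words) >= 2 and words[-1] == "先":
        return words[:-2] + ["先", words[-2]]
    return list(words)


if __name__ == "__main__":
    print(" ".join(normalize_words("我 烧饭 给 你 吃".split())))
    print(" ".join(move_trailing_xian("你 走 先".split())))
```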

Pre-workshop Work

[Diagram: Data Creation for WDC. Labels in the figure: e-Dictionary; Database; IF & Syllable Set Definition; Speech; Transcription; Database Collection; PTH Words; Wu Dialect Words; Read Speech; Spontaneous Speech; C-Chars; Syllables; IFs/GIFs; PTH Pron.; Wu Dialect Pron.; PTH Words Only; PTH + Wu Words; PTH Synonym; Topics; Misc Info.]

IF: a Chinese Initial or Final; GIF: generalized IF; PTH: Putonghua (standard Chinese); WDC: Wu Dialectal Chinese

  • Wu Dialectal Chinese (WDC) Database Collection (1)
    • Collection:
      • 11 hours in total, half read (R) and half spontaneous (S):
        • 100 Shanghai speakers * (3R + 3S) minutes / speaker
        • 10 Beijing speakers * 6S minutes / speaker
      • Read speech with well-balanced prompting sentences:
        • Type I: each sentence contains PTH words only (5-6k)
        • Type II: each sentence contains one or two of the most commonly used Wu dialectal words, while the others are PTH words
      • Spontaneous speech with pre-defined talking topics:
        • Conversations with a PTH speaker on a self-selected topic from: sports, policy/economy, entertainment, lifestyles, technology
      • Balanced speakers (gender, age, education, PTH level, …)
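
As a quick check of the collection figures above, the per-speaker minutes can be added up. This is plain arithmetic on the numbers given on the slide; the variable names are illustrative.

```python
# Sanity check of the database size from the per-speaker figures.
shanghai_read = 100 * 3          # minutes of read speech
shanghai_spontaneous = 100 * 3   # minutes of spontaneous speech
beijing_spontaneous = 10 * 6     # minutes of spontaneous speech

total_minutes = shanghai_read + shanghai_spontaneous + beijing_spontaneous
total_hours = total_minutes / 60

read_hours = shanghai_read / 60
spontaneous_hours = (shanghai_spontaneous + beijing_spontaneous) / 60

print(total_hours, read_hours, spontaneous_hours)
```

The total comes to 11 hours, split into 5 hours of read and 6 hours of spontaneous speech, so "half read, half spontaneous" holds approximately.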
[Figure: WDC data diversity, goal vs. actual.]

[Figure: Accent assessment by experts. Levels: 1A. State-level radio broadcaster; 1B. Province-level radio broadcaster; 2A. Quite good; 2B. Less accented; 3A. More accented; 3B. Hard to understand, but recognizable as PTH.]

  • Wu Dialectal Chinese (WDC) Database Collection (2)
    • Transcriptions include:
      • For the 100 Wu Dialectal Chinese speakers:
        • Canonical Chinese Initial/Final labels, and
        • Generalized IF (GIF) labels.
      • For the 10 Beijing speakers:
        • Chinese character and pinyin transcriptions only.
  • Dialectal Lexicon Construction
    • Establish a 50k-word electronic dialect dictionary in which each word has:
      • a PTH pronunciation as a PTH IF string;
      • a Wu dialect pronunciation as a Wu IF string.
    • Purpose: summarizing dialect-related knowledge
      • Figure out Chinese syllable mappings:
        • Same written form (character), different pronunciations;
        • Both word-independent and word-dependent;
      • Find dialect-related word variations:
        • Same meaning in the Chinese language;
        • Different written forms (characters);
        • Uttered in the standard Chinese manner;
        • For LM adaptation/modification.
Workshop Experiments
  • Experiment Conditions (1):
    • Using HTK 3.2.1 (the latest version downloadable from the web);
    • Data set division:
      • Using spontaneous speech data only;
      • Data were split according to age (younger, older), education (higher, lower), and PTH level into:
        • Train Set: 80 speakers
        • devTest Set: 20 speakers (a part of devTrain)
        • Test Set: 20 speakers
  • Experiment Conditions (2):
    • Acoustic model:
      • Trained on Mandarin Broadcast News (MBN);
      • 39-dimensional MFCC_E_D_A_Z features;
      • Diagonal covariance matrices;
      • 4 states per unit;
      • 103,041 units (triIF), 10,641 real units (triIF);
      • 3,063 different states (after state tying);
      • 16 mixtures per state, 28 mixtures per state for the silence unit;
    • Language model:
      • Built on HKUST 100-hour CTS data, plus Hub5, plus Wu-accented training data transcriptions.
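
The 39-dimensional figure for MFCC_E_D_A_Z can be reproduced from the HTK qualifiers: _E appends log energy, _D and _A append delta and acceleration coefficients, and _Z (cepstral mean subtraction) leaves the dimension unchanged. The sketch below assumes 12 cepstral coefficients, which is the usual HTK setting but is not stated on the slide; `feature_dim` is a hypothetical helper.

```python
# Feature dimension implied by an HTK-style parameter kind.
# num_ceps=12 is an assumption (common HTK setting), not taken from the slides.

def feature_dim(num_ceps=12, energy=True, delta=True, accel=True):
    """Static coefficients (+ energy), replicated for delta and acceleration."""
    static = num_ceps + (1 if energy else 0)
    streams = 1 + int(delta) + int(accel)
    return static * streams


print(feature_dim())  # MFCC_E_D_A_Z with 12 cepstra
```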
  • Observation on WDC Data
    • IF-mapping / Syllable-mapping:
      • Influenced by the Wu dialect, a Wu dialectal Chinese (WDC) speaker often pronounces an IF from a certain set as another IF, following rules such as zh -> z, ch -> c, sh -> s, and so on.
    • Observations on three sets, Train (80 speakers), devTest (20), and Test (20):
      • Mapping pairs are almost the same across all three sets;
      • Mapping pairs are almost identical to experts' knowledge;
      • Mapping probabilities are also almost equal.
    • Remarks:
      • Experts' knowledge could be useful;
      • Mapping rules can be learned from a small amount of data.
  • Using only the devTest set + dialect-based knowledge
    • Step 1: Apply PTH-IF mapping rules;
    • Step 2: Apply WDC-IF mapping rules;
    • Step 3: Apply syllable-dependent mapping rules;
    • Step 4: Perform multi-pronunciation expansion (MPE) based on unigram probability.
  • Why try this method?
    • "IF-mapping" in dialectal Chinese is a fact (humans use it);
    • "In-domain data training" will surely give good results, but collecting data is a huge task, especially for 40 sub-dialects of Chinese;
    • "Mere adaptation" will be easier and better, but might make it hard to distinguish the mapping pairs, since each pair tends to merge into a single IF;
    • This is not practical in applications such as call centers, where no further information about the speakers is available and a mixture of WDC and PTH is used;
    • A knowledge-based method is expected to give good overall performance for both WDC and PTH.
  • Step 1: Applying PTH-IF mapping rules
    • Rules are based on experts' knowledge (with the AM unchanged):
      • (zh, z) (z, zh)
      • (ch, c) (c, ch)
      • (sh, s) (s, sh)
      • (eng, en) (en, eng)
      • (ing, in) (in, ing)
      • (r, l)
    • The gain is not significant: a 0.5% reduction in Chinese Character Error Rate (CER);
    • Pronunciation entry probabilities do not help improve performance.
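
To make the rule pairs above concrete, the sketch below generates alternative pronunciations for a word from its canonical IF string. The slides do not say how the rules are applied to the lexicon, so this assumes that each rule may optionally apply at each position; `alternatives` and `expand_pronunciation` are hypothetical names.

```python
# Expanding a canonical IF string with the PTH-IF mapping rules.
# Assumption: every rule applies optionally and independently per position.
from itertools import product

PTH_IF_RULES = [
    ("zh", "z"), ("z", "zh"),
    ("ch", "c"), ("c", "ch"),
    ("sh", "s"), ("s", "sh"),
    ("eng", "en"), ("en", "eng"),
    ("ing", "in"), ("in", "ing"),
    ("r", "l"),
]


def alternatives(unit):
    """The canonical IF followed by every IF it may be mapped to."""
    return [unit] + [dst for src, dst in PTH_IF_RULES if src == unit]


def expand_pronunciation(ifs):
    """All pronunciation variants of one IF string, canonical form first."""
    return [list(variant) for variant in product(*(alternatives(u) for u in ifs))]


if __name__ == "__main__":
    for variant in expand_pronunciation(["sh", "eng"]):
        print(" ".join(variant))
```

The number of variants grows multiplicatively with the number of mappable IFs in a word, which is one reason the later multi-pronunciation expansion step restricts which words receive variants.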
  • Step 2: Applying WDC-IF mapping rules
    • There are indeed some WDC-specific IFs, such as iao -> io^;
    • Rules are learned from devTest;
    • Newly introduced WDC-specific IFs are trained from devTest using an adaptation method;
    • 8.66% absolute CER reduction;
    • MLLR adaptation outperforms MLLR+MAP:
      • About 10% difference;
      • Possibly due to the small amount of data.
    • We refer to this as surface-form (WDC) MLLR adaptation; for comparison, base-form (PTH) MLLR adaptation, in which only canonical IFs are used, is also evaluated.
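
Since the results are reported as CER reductions, it may help to spell out how a character error rate and an absolute versus relative reduction are computed. This is a generic sketch of the standard edit-distance definition, not the workshop's scoring tool; the baseline CER itself is not given in the transcript, so no experimental numbers are reproduced here.

```python
# Character error rate via Levenshtein distance, plus reduction helpers.

def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of hyp against a non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)


def absolute_reduction(baseline_cer, new_cer):
    return baseline_cer - new_cer


def relative_reduction(baseline_cer, new_cer):
    return (baseline_cer - new_cer) / baseline_cer


if __name__ == "__main__":
    print(cer("我做饭给你吃", "我烧饭给你吃"))
```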
  • Step 3: Apply syllable-dependent mapping rules
    • Assumption: most IF-mappings are context-independent, but some are syllable-dependent, such as iii|(sh iii) -> ii|(s ii), and we believe there are others;
    • Rules are learned from devTest;
    • We did not succeed in improving accuracy; on the contrary, character accuracy dropped by about 6%;
    • We do not have a clear explanation yet;
    • So we keep using context-free mapping rules.
  • Step 4: Multi-pronunciation expansion (MPE) based on unigram probability
    • Motivation: more pronunciations help model pronunciation variation but also lead to more confusion, so there should be a tradeoff;
    • Accumulated unigram probability (AccProb) is used as the criterion:
      • Only words with higher unigram probabilities have multiple pronunciations each;
      • Words with lower unigram probabilities have a single standard pronunciation each.
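
One plausible reading of the AccProb criterion is: sort words by unigram probability, then give multiple pronunciations to the most frequent words until their accumulated probability reaches the threshold. The sketch below implements that reading; the exact selection procedure is not spelled out on the slides, and the function name and the toy probabilities are hypothetical.

```python
# Selecting words for multi-pronunciation expansion (MPE) by accumulated
# unigram probability. The selection rule is an assumed interpretation.

def select_words_for_expansion(unigram_probs, acc_prob):
    """Return the most frequent words whose cumulative probability reaches acc_prob."""
    selected = set()
    total = 0.0
    ranked = sorted(unigram_probs.items(), key=lambda kv: kv[1], reverse=True)
    for word, prob in ranked:
        if total >= acc_prob:
            break  # threshold reached: remaining words keep one pronunciation
        selected.add(word)
        total += prob
    return selected


if __name__ == "__main__":
    toy_probs = {"w1": 0.5, "w2": 0.3, "w3": 0.15, "w4": 0.05}  # toy values
    print(sorted(select_words_for_expansion(toy_probs, 0.7)))
```

With this reading, a threshold of 0 selects no words (no expansion) and a threshold of 1 selects every word (full expansion), matching the two endpoints described for AccProb.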
AccProb: 0% means no multiple-pronunciation expansion, while 100% means full expansion.

[Results: Base-form MLLR + PTH-IF mapping + MPE (CER). Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio=1.10.]

[Results: Surface-form MLLR + WDC-IF mapping + MPE (CER). Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio=1.24.]

[Results: Base-form MLLR + PTH-IF mapping + MPE (CER) compared with Surface-form MLLR + WDC-IF mapping + MPE (CER). Best result achieved at a suitable AccProb value, say 94%, with VocSizeRatio=1.24.]

Q: How about recognizing PTH using the resulting WDC recognizer?
  • We obtained the WDC recognizer from the PTH recognizer;
  • On average, we get a CER reduction of over 10% when recognizing WDC;
  • How well does it recognize PTH?
[Diagram: treatment of the sh/s mapping pair under the conventional method (Adaptation) and under our method (Rule + MPE).]

  • We can expect performance to degrade when the WDC recognizer is used to recognize PTH;
  • But we would expect it not to degrade too much;
  • Results: using the WDC recognizer, you get
    • Over 10% CER reduction when recognizing WDC;
    • 0.62% CER increase when recognizing PTH.
Discussions
  • Using dialect-related knowledge is useful and effective.
  • In this project, there are several problems to solve: channel, speaking-style, dialect-background, and domain problems.
    • It is easier to address all of these problems at once by simply using the adaptation method;
    • Our method focuses only on the dialect problem;
    • The results of our method could be better if we integrated methods that address channel and speaking style.
Future Plan
  • Continue on the current project, including:
    • Investigating the syllable-dependent mapping;
    • Rank-based Rescoring;
    • Language Model Adaptation;
  • Rank-based AM Rescoring
    • Assumption: when the recognizer derived from the PTH one is used to recognize WDC speech, the ranks in the lattice have a relatively stable distribution.
    • Generate a lattice (“SIL” marks pauses) for each sentence in devTest.
    • Turn the lattice into a multiple alignment (“-” marks deletions), following Lidia Mangu et al. [1999]; information about the arcs in the lattice is retained for later back-tracking.

    • Learn P(a | a, rank): the probability of a if it is seen in the rank-th position.
    • Learning:
      • Count(B)++; Count(B | B, 1)++
      • Count(AI)++; Count(AI | AI, 2)++
      • Count(T)++; Count(T | T, 1)++
      • Count(IAN)++; Count(IAN | IAN, 1)++
      • Count(E)++; Count(E | E, 2)++
    • Post-processing:
      • P(a | a, rank) = Count(a | a, rank) / Count(a)
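
The counting and normalization above can be sketched as follows. This assumes that each slot of the multiple alignment provides the reference unit together with the ranked list of competing candidates; that input format, the function name, and the competing candidates in the toy data are assumptions made for illustration.

```python
# Learning P(a | a, rank) from aligned slots. Each slot is assumed to be
# (reference_unit, candidates_in_rank_order); the toy competitors are invented.
from collections import Counter


def learn_rank_probs(slots):
    """Estimate P(a | a, rank) = Count(a | a, rank) / Count(a)."""
    unit_count = Counter()
    rank_count = Counter()
    for ref, candidates in slots:
        unit_count[ref] += 1                   # Count(a)++
        if ref in candidates:
            rank = candidates.index(ref) + 1   # ranks start at 1
            rank_count[(ref, rank)] += 1       # Count(a | a, rank)++
    return {key: count / unit_count[key[0]] for key, count in rank_count.items()}


if __name__ == "__main__":
    toy_slots = [
        ("B", ["B", "P"]),      # B seen at rank 1
        ("AI", ["EI", "AI"]),   # AI seen at rank 2
        ("B", ["P", "B"]),      # B seen at rank 2
    ]
    for key, prob in sorted(learn_rank_probs(toy_slots).items()):
        print(key, prob)
```

During rescoring, a table like this could be looked up with the rank of each candidate in the multiple alignment to adjust the probability of the corresponding arc.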

    • Rescoring during recognition:
    • Original lattice
    • Multi-alignment lattice
    • Original lattice rescoring: using the ranks in this multiple alignment and the back-tracking information, modify the probability of the WDC-IF in each arc in the lattice.
  • Language Model Adaptation
    • Different word form with same meaning
      • Such as: 喜欢 vs. 欢喜 - like; 做饭 vs. 烧饭 - cook
      • Linguists say the vocabulary similarity rate between Putonghua and Wu dialect is about 60~70%.
    • Different word order
      • 你先走 (you first go) vs. 你走先 (you go first)
Thanks!

http://www.clsp.jhu.edu/ws04/

http://cst.cs.tsinghua.edu.cn/~fzheng