






  1. Speech and Language Processing @ CSLT. Thomas Fang Zheng. 04 Oct 2007. NUS-Tsinghua Workshop, National University of Singapore, Singapore

  2. Outline • Brief Introduction to CSLT • Speech and Language Processing @ CSLT • Database Creation and Standardization Activities

  3. Brief Introduction to CSLT

  4. Mission and History • Mission: • To develop advanced speech and language processing technology to meet the growing demand for human-computer interaction anywhere, anytime, and in any way. • To focus on multi-lingual and multi-platform speech recognition, pattern recognition of multi-modal biometric features, and natural language processing. • History: • Founded in February 2007, with faculty members from research groups including: • the Center for Speech Technology (CST, founded in 1979, the second earliest in China) and the State Key Laboratory of Intelligent Technology and Systems (SKLITS, ranked A in all three of its evaluations), Department of Computer Science and Technology; • the Speech Processing Technology Group (founded in 1986) and the Speech-on-Chip Group, Department of Electronic Engineering; • the Future Information Technology (FIT) R&D Center, Research Institute of Information Technology (RIIT); as well as • the Division of Computer Science and Artificial Intelligence of Tsinghua National Laboratory for Information Science and Technology.

  5. Guideline • Bearing in mind "application, innovation, focus, and accumulation", CSLT directs its research efforts toward automatic speech recognition (ASR), voiceprint recognition (VPR), and natural language processing (NLP): • 面向应用 -- base work on applications, • 推进创新 -- advocate innovation, • 突出重点 -- focus on key emphases, and • 厚积薄发 -- build on long accumulation. • By exploring an effective operational mode combining study, research, and production (产学研), CSLT aims to develop technology and applications with independent intellectual property rights, and to push forward applied basic research and technology innovation.

  6. Organization Chart • Led by a Director with an Advisory Board, three Deputy Directors (Executive, R&D, Students), and assistants. • 6 research groups: • Speech Recognition (语音识别实验室), • Speaker Recognition (声纹识别实验室), • Speech-on-Chip (语音芯片实验室), • Intelligent Searching (智能搜索实验室), • Language Understanding (语言理解实验室), and • Resource and Standardization (资源与标准实验室). • 1 joint lab (得意升文实验室) + 1 joint institute (金融工程研究所, Institute of Financial Engineering)

  7. Advisory Board: • Victor Zue (MIT, IEEE Fellow, NAE member) • B.-H. (Fred) Juang (Georgia Tech, IEEE Fellow, NAE member) • William Byrne (Cambridge) • Dan Jurafsky (Stanford) • Richard Stern (CMU) • FANG Ditang (Tsinghua) • WU Wenhu (Tsinghua) • LIU Runsheng (Tsinghua) • Directors: • Director: Prof. Thomas Fang Zheng • Deputy Director: Assoc. Prof. LIU Yi (Executive) • Deputy Director: Assoc. Prof. XIAO Xi (R&D) • Deputy Director: Assoc. Prof. XU Mingxing (Students)

  8. Faculty Members and Others • Speech Processing (ASR & VPR): • Associate Professor: LIU Yi • Associate Professor: XIAO Xi • Associate Professor: XU Mingxing • Assistant Professor: LIANG Weiqian • Assistant Professor: OU Zhijian • Natural Language Processing (NLP): • Associate Professor: SUN Jiasong • Associate Professor: ZHOU Qiang • Assistant Professor: WU Xiaojun • Assistant Professor: XIA Yunqing • 2 Research Associates + 2 Postdocs • 3 PhD Students + 13 Master's Students

  9. I. Automatic Speech Recognition (ASR) Speech and Language Processing @CSLT

  10. Current Focuses • Large-vocabulary Chinese speech recognition (Chinese dictation machine) • Voice command, and embedded speech recognition on chip • Keyword spotting with confidence measures and semantic templates • Spontaneous speech recognition (starting from the JHU Summer Workshop 2000) • Dialectal Chinese speech recognition (starting from the JHU Summer Workshop 2004) -- in this talk

  11. Motivation • Chinese ASR faces an issue bigger than that of any other language: dialect. • There are 8 major dialectal regions in addition to Mandarin (Northern China): • Wu (southern Jiangsu, Zhejiang, and Shanghai); • Yue (Guangdong, Hong Kong, Nanning Guangxi); • Min (Fujian, Shantou Guangdong, Haikou Hainan, Taipei Taiwan); • Hakka (Meixian Guangdong, Hsin-chu Taiwan); • Gan (Jiangxi); • Xiang (Hunan); • Hui (Anhui); and • Jin (Shanxi, Hohhot Inner Mongolia). • These can be further divided into over 40 sub-categories.


  13. Background • Chinese dialects share the same written language: • the same Chinese pinyin set (canonically), • the same Chinese character set (canonically), and • the same vocabulary (canonically). • Standard Chinese (known as Putonghua, or PTH) is widely spoken in most regions of China. • However, speech is strongly influenced by native dialects; most Chinese people speak both standard Chinese and their own dialect, resulting in dialectal Chinese: Putonghua influenced by the native dialect. • In dialectal Chinese: • word usage, pronunciation, syntax, and grammar vary depending on the speaker's dialect; • ASR relies to a great extent on consistent pronunciation and word usage within a language; hence • ASR systems built to process PTH perform poorly for the great majority of the population.

  14. Research Goal • To develop a general framework for modeling, in dialectal Chinese ASR tasks: • phonetic variability, • lexical variability, and • pronunciation variability. • To find suitable methods to modify the baseline PTH recognizer into a dialectal Chinese recognizer for the specific dialect of interest, employing: • dialect-related knowledge (syllable mapping, cross-dialect synonyms, ...), and • training/adaptation data (in relatively small quantities). • Expectation: the resulting recognizer should also work for PTH; in other words, it should handle a mixture of PTH and dialectal Chinese. • This proposal was selected as one of three projects for the 2003 Johns Hopkins University Summer Workshop from tens of proposals collected from universities and companies worldwide, and was postponed to 2004 due to SARS.

  15. Framework (figure): dialectal-Chinese-related knowledge and resources, combined with a standard Chinese speech recognizer under the dialectal Chinese speech recognition framework, yield a dialectal Chinese speech recognizer.

  16. Useful Dialect-Related Knowledge • Chinese Syllable Mapping (CSM), which is dialect-related. • Two types: • Word-independent CSM: e.g., in Southern Chinese, initial mappings include zh→z, ch→c, sh→s, n→l, and so on, and final mappings include eng→en, ing→in, and so on; • Word-dependent CSM: e.g., in dialectal Chuan Chinese, the pinyin 'guo2' is changed into 'gui0' in the word '中国 (China)', but only the tone changes in the word '过去 (past)'.
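To make the word-independent mappings concrete, here is a minimal sketch (hypothetical Python, not from the slides) of expanding a toneless-pinyin pronunciation with such surface forms:

```python
# Sketch: expanding a pinyin pronunciation with word-independent
# syllable mappings (rules and data are illustrative only).

# Canonical -> dialectal surface-form mappings (Southern Chinese example).
INITIAL_MAP = {"zh": "z", "ch": "c", "sh": "s", "n": "l"}
FINAL_MAP = {"eng": "en", "ing": "in"}

def map_syllable(syllable: str) -> str:
    """Apply initial and final mappings to one toneless pinyin syllable."""
    # Longest-match on the initial (two-letter initials first).
    for init in sorted(INITIAL_MAP, key=len, reverse=True):
        if syllable.startswith(init):
            syllable = INITIAL_MAP[init] + syllable[len(init):]
            break
    for fin, repl in FINAL_MAP.items():
        if syllable.endswith(fin):
            syllable = syllable[: -len(fin)] + repl
            break
    return syllable

def expand_pronunciations(pron: list[str]) -> set[tuple[str, ...]]:
    """Return the canonical pronunciation plus its dialectal surface form."""
    return {tuple(pron), tuple(map_syllable(s) for s in pron)}

# e.g. {('zheng', 'sheng'), ('zen', 'sen')}
print(expand_pronunciations(["zheng", "sheng"]))
```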

  17. Lexicon • Linguists say the vocabulary similarity rate between PTH and the Wu dialect is about 60-70%. • A dialect-related lexicon contains two parts: • a common part shared by standard Chinese and most dialectal Chinese (over 50k words), and • a dialect-related part (several hundred words). • In this lexicon: • each word has one pinyin string for the standard Chinese pronunciation and a representation for the dialectal Chinese pronunciation, and • each dialect-related word corresponds to a word in the common part with the same meaning.

  18. Language • Though it is difficult to collect dialect texts, dialect-related lexical entry replacement rules can be learned in advance, and therefore • language post-processing or language model adaptation techniques can be adopted.

  19. Examples of dialect-related replacement rules (word-lattice figure): • Dialectal word substitution: 我 做饭 给 你 吃 (PTH) → 我 烧饭 给 你 吃 (Wu) ("I cook a meal for you"). • Word-order change: 你 先 走 (PTH) → 你 走 先 (Wu) ("You go first").
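A minimal sketch (hypothetical Python; the rule set is just the two slide examples) of how such replacement rules could be applied to PTH text to synthesize dialect-flavored LM adaptation data:

```python
# Sketch: applying dialect-related lexical replacement rules to
# PTH word sequences (rules are illustrative only).

# Substitution rule: PTH word -> Wu synonym ("cook a meal").
SUBST_RULES = {"做饭": "烧饭"}

def reorder_adverb(words: list[str]) -> list[str]:
    """Move '先' ("first") one position right, after the verb,
    as in Wu '你 走 先' vs. PTH '你 先 走'."""
    out = list(words)
    for i in range(len(out) - 1):
        if out[i] == "先":
            out[i], out[i + 1] = out[i + 1], out[i]
            break
    return out

def adapt_sentence(words: list[str]) -> list[str]:
    words = [SUBST_RULES.get(w, w) for w in words]
    return reorder_adverb(words)

print(adapt_sentence("我 做饭 给 你 吃".split()))  # 我 烧饭 给 你 吃
print(adapt_sentence("你 先 走".split()))          # 你 走 先
```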

  20. Data Creation for WDC (flowchart): an e-Dictionary database (Chinese characters, syllables, IFs/GIFs, PTH and Wu dialect pronunciations, PTH synonyms, and miscellaneous information) supports IF & syllable set definition; speech database collection covers read speech (prompts with PTH words only, and prompts with PTH + Wu words) and spontaneous speech (pre-defined topics), followed by speech transcription.

  21. Dialectal Chinese database – Wu (Shanghai): goal vs. actual WDC data diversity (figure).

  22. • 11 hours in total; half read (R) + half spontaneous (S): • 100 Shanghai speakers × (3R + 3S) minutes per speaker • 10 Beijing speakers × 6S minutes per speaker • Read speech with well-balanced prompting sentences: • Type I: each sentence contains PTH words only (5-6k) • Type II: each sentence contains one or two of the most commonly used Wu dialectal words, while the others are PTH words • Spontaneous speech with pre-defined talking topics: • conversations with a PTH speaker on a self-selected topic from: sports, policy/economy, entertainment, lifestyles, technology • Balanced speakers (gender, age, education, PTH level, ...)

  23. Accent assessment for the Wu dialectal Chinese database by experts: • 1A: CCTV-level radio broadcaster; 1B: province-level radio broadcaster; • 2A: quite good; 2B: less accented; • 3A: more accented; 3B: hard to understand, though recognizably PTH.

  24. Dialectal Chinese database – Min (Xiamen)

  25. Dialectal Chinese database – Chuan (Chengdu)

  26. Accent distribution for the Min/Chuan dialectal Chinese corpora (figure).

  27. The effect of multi-pronunciation expansion (MPE) with CSM • AccProb: 0% means no multiple-pronunciation expansion, while 100% means full expansion. • The best result is achieved at a suitable AccProb value, say 94%, with VocSizeRatio = 1.24. • Figure: surface-form MLLR + WDC-IF mapping + MPE (CER).
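One plausible reading of the AccProb knob, sketched below (function and probabilities are hypothetical): variants are added in decreasing order of probability until their accumulated mass first reaches the threshold.

```python
# Sketch: pruning multi-pronunciation expansion by accumulated
# probability (AccProb); numbers are illustrative only.

def select_variants(variants: list[tuple[str, float]],
                    acc_prob: float) -> list[str]:
    """Keep the most probable variants whose cumulative probability
    first reaches acc_prob (0.0 = canonical only, 1.0 = full expansion)."""
    kept, total = [], 0.0
    for pron, p in sorted(variants, key=lambda v: v[1], reverse=True):
        if total >= acc_prob and kept:
            break
        kept.append(pron)
        total += p
    return kept

variants = [("zhong guo", 0.80), ("zong guo", 0.15), ("zhong gue", 0.05)]
print(select_variants(variants, acc_prob=0.94))  # ['zhong guo', 'zong guo']
```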

  28. Performance improvement comparison: overall, and in terms of speaker clusters (figure). * J. of Computer Science and Technology, May 2002

  29. State-Dependent Phoneme-Based Model Merging (SDPBMM) • At the acoustic level, existing approaches include: • retraining the AM on standard speech plus a certain amount of dialectal speech; • interpolation between standard-speech HMMs and their corresponding dialectal-speech HMMs; • combination of the AM with state-level pronunciation modeling; • adaptation of the standard-speech AM with a certain amount of dialectal speech. • Existing problems: • a large amount of dialectal speech is needed to build dialect-specific acoustic models; • the acoustic model cannot perform well on both standard and dialectal speech; • some acoustic modeling methods are too complicated to be deployed readily.

  30. What we proposed: • taking a precise context-dependent HMM from the standard speech and its corresponding less precise context-independent HMM from dialectal speech into consideration simultaneously, and • merging the HMMs on a state-level basis according to certain criteria.
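A minimal sketch of that state-level merge, assuming the two states' Gaussian mixtures are simply pooled with rescaled weights (the 50/50 interpolation is an assumption for illustration, not from the slides):

```python
# Sketch: merging a context-dependent (standard-speech) HMM state
# with its context-independent (dialectal) counterpart by pooling
# their Gaussians; the interpolation weight lam is an assumption.

from dataclasses import dataclass

@dataclass
class Gaussian:
    weight: float
    mean: list[float]
    var: list[float]  # diagonal covariance

def merge_states(std_state: list[Gaussian],
                 dia_state: list[Gaussian],
                 lam: float = 0.5) -> list[Gaussian]:
    """Pool both mixtures; rescale weights so they still sum to one."""
    merged = [Gaussian(lam * g.weight, g.mean, g.var) for g in std_state]
    merged += [Gaussian((1 - lam) * g.weight, g.mean, g.var)
               for g in dia_state]
    return merged
```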

  31. Illustration of SDPBMM (figure).

  32. The disadvantage seen so far: • the number of Gaussian mixtures in the merged state is expanded. • Is it possible to downsize the scale? A straightforward criterion is a distance measure: the larger the distance, the more acoustic coverage gained by merging. • Merge if distance(d, s) ≥ threshold; do not merge if distance(d, s) < threshold.
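The slides do not name the distance measure; as one plausible stand-in, a sketch that gates the merge with the symmetric KL divergence between diagonal Gaussians:

```python
# Sketch: distance-gated merging. Symmetric KL divergence between
# diagonal Gaussians is used here as an illustrative distance only.

def sym_kl_diag(m1, v1, m2, v2) -> float:
    """Symmetric KL divergence between two diagonal Gaussians
    (means m1, m2 and per-dimension variances v1, v2)."""
    d = 0.0
    for a, va, b, vb in zip(m1, v1, m2, v2):
        d += 0.5 * ((va / vb + vb / va - 2)
                    + (a - b) ** 2 * (1 / va + 1 / vb))
    return d

def should_merge(dist: float, threshold: float) -> bool:
    """Merge only if the dialectal state is far enough from the
    standard one to add acoustic coverage."""
    return dist >= threshold

print(should_merge(sym_kl_diag([0.0], [1.0], [2.0], [1.5]), 1.0))  # True
```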

  33. Distinguishable states (figure).

  34. Dataset division

  35. Evaluations on Putonghua and Wu-dialectal Chinese (figure).

  36. Integration of SDPBMM with adaptation (figure). • Results are similar for Min dialectal Chinese. * To appear in Speech Communication, 2007

  37. II. Voiceprint Recognition (VPR) Speech and Language Processing @CSLT

  38. Current Focuses • Cross-channel (channel-mismatch) -- in this talk, • multi-speaker (such as in telephone conversations), • text- and language-independent recognition, • very short speech segments (such as verification in monitoring for public security), • background noise issues, and …

  39. Cross-channel (1) -- IEEE Trans. on Audio, Speech, and Language Processing '07 • A cohort-based speaker model synthesis (SMS) algorithm, designed for synthesizing robust speaker models without requiring channel-specific enrollment data. • Assumption: if two speakers' voices are similar in one channel, their voices will also be similar in another channel.

  40. Exceptions always exist • We therefore propose to use a cohort set of speaker models, instead of a single speaker model, to perform the SMS.
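A minimal sketch of the cohort idea, under simplifying assumptions (speaker models treated as plain vectors; synthesis as an average over the K nearest cohort speakers; the published algorithm operates on GMM parameters):

```python
# Sketch: cohort-based speaker model synthesis (SMS). The vector
# representation, distance, and averaging are illustrative
# assumptions, not the published algorithm.

import numpy as np

def synthesize_model(target_a: np.ndarray,   # target model, channel A
                     cohort_a: np.ndarray,   # (N, D) cohort models, channel A
                     cohort_b: np.ndarray,   # (N, D) same speakers, channel B
                     k: int = 5) -> np.ndarray:
    """Estimate the target's channel-B model from its K nearest cohort
    speakers, assuming similarity in channel A carries over to B."""
    dists = np.linalg.norm(cohort_a - target_a, axis=1)
    nearest = np.argsort(dists)[:k]
    return cohort_b[nearest].mean(axis=0)
```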

  41. Cross-channel (2) -- IEEE ICASSP '07 • We propose a new method based on: • the idea of projection in nuisance attribute projection (NAP) (designed for GMM-SVM systems), and • the idea of model compensation in factor analysis. • Called session variability subspace projection (SVSP), its idea is to use the session variability in a test utterance to compensate speaker models whose session variability has been removed during training. • SVSP consists of four modules: • estimation of the session variability subspace (during training); • speaker model training by adaptation from the UBM, with session variability removed; • speaker model compensation with the test utterance (during recognition); • test utterance verification.
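A rough sketch of the projection step (illustrative simplification: the subspace U comes from a plain PCA over mean-centered supervectors, and removal is an orthogonal projection; the paper's estimation and compensation procedures are more involved):

```python
# Sketch: estimating and removing a session-variability subspace.
# Simplified: PCA over globally mean-centered supervectors; a real
# system would use within-speaker session differences.

import numpy as np

def estimate_subspace(supervectors: np.ndarray, dim: int) -> np.ndarray:
    """Return U with shape (D, dim): top principal directions."""
    centered = supervectors - supervectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:dim].T

def remove_session_variability(sv: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Project out the session subspace: sv - U U^T sv."""
    return sv - U @ (U.T @ sv)
```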

  42. Performance comparison for different methods

  43. Performance comparison for different test utterance lengths

  44. NIST SRE'06 post-evaluation results • On the left: • DET curve of the best result (STBU3) in 1c4w-1c4w • STBU = Spescom DataVoice, South Africa + TNO, Netherlands + Brno Univ. of Tech., Czech Republic + Univ. Stellenbosch, South Africa • Fusion of 11 systems. • On the right: • DET curve of our system in 1c4w-1c4w • Fusion of 2 systems (GMM-UBM and SVM).

  45. Applications of VPR • Identification for network security (w/ Ministry of Public Security) • Verification for user authentication (w/ Ministry of Public Security and People’s Armed Police College) • Verification for user authentication (w/ China Mobile, Commercial Bank of Baotou, ...)

  46. Demos -- speaker recognition application in passport control • Supported by the Ministry of Public Security and the People's Armed Police College.

  47. III. Natural Language Processing (NLP) Speech and Language Processing @CSLT
