
Prosodic Modeling for Detecting Edit Disfluencies in Transcribing Spontaneous Mandarin Speech



  1. Prosodic Modeling for Detecting Edit Disfluencies in Transcribing Spontaneous Mandarin Speech Che-Kuang Lin, Shu-Chuan Tseng* & Lin-Shan Lee College of EECS, National Taiwan University, Institute of Linguistics, Academia Sinica*, Taipei, Taiwan

  2. Outline • Introduction • Prosodic features • IP detection models • Latent prosodic modeling (LPM) • LPM-based detection models • Experiment results & further analysis • Conclusion

  3. Examples of disfluency considered in this paper (1/2)
  In each example, * marks the disfluency interruption point (IP); a disfluency consists of a reparandum, the IP, an optional editing term, and the resumption.
  • Overt repair: 是(shi4) 進口(jin4kou3) 嗯(EN) 出口(chu1kou3) 嗎(ma1)
  (is / import / [discourse particle] / export / [interrogative particle])
  "Do you import * uhn export products?"
  • Abandoned utterance: 它(ta1) 有(you3) 一個(yi2ge5) 呃(E) 那邊(ne4bian1) 有個(you3ge5) 度假村(du4jia4cun1) 嘛(MA)
  (it / has / one / [discourse particle] / there / has a / resort / [discourse particle])
  "It has a * eh there is a resort there."

  4. Examples of disfluency considered in this paper (2/2)
  Again, * marks the disfluency interruption point (IP) between the reparandum and the resumption.
  • Direct repetition: 因為(yin1wei4) 因為(yin1wei4) 它(ta1) 有(you3) 健身(jian4shen1) 中心(zhong1xin1)
  (because / because / it / has / fitness / center)
  "Because * because it has a fitness center."
  • Partial repetition: 看(kan4) 電(dian4) 看(kan4) 電視(dian4shi4) 最近(zui4jin4) 有(you3) 新(xin1) 電影(dian4ying3)
  (watch / electricity / watch / television / recently / has / new / movie)
  "On the tele- * on the television, there is a new film recently."

  5. Introduction • One of the primary problems in spontaneous speech recognition is the presence of disfluencies • Accurate identification of the various types of disfluencies can • help the recognition process • provide structural information about the utterances • Purpose of this study • To identify useful and important features for interruption point (IP) detection • To analyze how these features are helpful for spontaneous Mandarin speech

  6. Define a whole set of prosodic features for spontaneous Mandarin speech (1/2)
  (Illustration: a bi-character (bi-syllabic) word boundary vs. a mono-character (mono-syllabic) syllable boundary)
  • A set of features has been proposed for English (Shriberg, 2000) • Spontaneous Mandarin is quite different from western languages • Tonal language nature • The same syllable with different tones represents different characters • PCA-based pitch contour smoothing and pitch-related features • Mono-syllabic structure • Every character has its own meaning and is pronounced as a monosyllable • A word is composed of one to several characters (or syllables)

  7. Define a whole set of prosodic features for spontaneous Mandarin speech (2/2)
  (Illustration: a bi-character (bi-syllabic) word boundary vs. a mono-character (mono-syllabic) syllable boundary)
  • Every syllable boundary (rather than every word boundary) is considered a candidate for an IP • A whole set of prosodic features is defined for each syllable boundary and used to detect the IPs

  8. Syllable-wise pitch contour smoothing • We first proposed using Principal Component Analysis (PCA) for efficient pitch contour smoothing: conversion of each syllable's pitch contour to a vector, PCA, and projection onto the leading principal components (V' = w1PC1)
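The PCA smoothing step can be sketched as follows. This is a minimal illustration under the assumption that every syllable's pitch contour has already been resampled to a fixed length; the function name and the toy contour data are our own, not from the paper.

```python
import numpy as np

def pca_smooth_contours(contours, n_components=1):
    """Smooth syllable pitch contours by projecting each contour onto
    the leading principal components and reconstructing.

    contours: (n_syllables, n_points) array of per-syllable F0 samples,
              each contour resampled to a fixed length beforehand.
    """
    X = np.asarray(contours, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data gives the principal directions in rows of Vt
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components]          # top principal components
    weights = Xc @ V.T             # projection coefficients per contour
    return weights @ V + mean      # smoothed reconstruction

# toy example: 30 noisy rising contours, 20 F0 samples each
rng = np.random.default_rng(0)
base = np.linspace(100.0, 150.0, 20)
X = base + rng.normal(0, 5, size=(30, 20))
X_smooth = pca_smooth_contours(X, n_components=1)
```

Reconstructing from only the leading components discards the high-variance jitter while keeping the overall contour shape, which is the point of the smoothing step.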

  9. Feature Definitions – Pitch-related Prosodic Features (1/2)
  (Illustration: pitch values P1, P2 and pitch variations d1, d2 at adjacent syllable boundaries)
  • The average pitch value within the syllable • The maximum difference of pitch values within the syllable • The average of the absolute values of pitch variations within the syllable • The magnitude of pitch reset at the boundary • The differences of such feature values at adjacent syllable boundaries (P1-P2, d1-d2, etc.) • A total of 54 pitch-related features were obtained
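The first four pitch-related features can be sketched directly; this is a minimal illustration (the function name and the toy F0 values are assumptions, and the paper's full set of 54 features also includes cross-boundary differences and other variants):

```python
import numpy as np

def pitch_features(prev_syl, next_syl):
    """Pitch-related features around one syllable boundary.
    prev_syl, next_syl: 1-D arrays of F0 samples (Hz) for the syllables
    on either side of the boundary."""
    prev_syl = np.asarray(prev_syl, dtype=float)
    next_syl = np.asarray(next_syl, dtype=float)
    feats = {}
    for name, f0 in (("prev", prev_syl), ("next", next_syl)):
        feats[f"{name}_mean_f0"] = f0.mean()                 # average pitch in the syllable
        feats[f"{name}_max_diff"] = f0.max() - f0.min()      # max pitch difference
        feats[f"{name}_mean_abs_delta"] = np.abs(np.diff(f0)).mean()  # avg abs pitch variation
    # magnitude of pitch reset: onset of the next syllable vs offset of the previous
    feats["pitch_reset"] = next_syl[0] - prev_syl[-1]
    return feats

# toy example: a falling syllable followed by a reset to a higher onset
f = pitch_features([120, 118, 115, 112], [150, 148, 146, 145])
```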

  10. Feature Definitions – Duration-related Prosodic Features (2/2)
  (Illustration: syllables A, B, C, D, E around the candidate boundary between C and D, with pauses a and b; b is the pause at the candidate boundary)
  • Deviation from the normal rhythmic structure of speech is important • Pause duration: b • Average syllable duration: (B+C+D+E)/4 or ((D+E)/2 + C)/2 • Average syllable duration ratio: (D+E)/(B+C) or ((D+E)/2)/C • Combinations of pause & syllable features (ratio or product): C*b, D*b, C/b, D/b • Lengthening: C/((A+B)/2) • Standard deviation of feature values • A total of 38 duration-related features were obtained
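The formulas above translate directly into code. The sketch below uses the slide's A–E / b notation; the function name and toy durations are our own, and only a representative subset of the 38 features is shown:

```python
def duration_features(syl_dur, pause_after):
    """Duration-related features at the candidate boundary after syllable C.
    syl_dur: durations (s) of syllables A..E around the boundary;
    pause_after: pause b at the candidate boundary (between C and D)."""
    A, B, C, D, E = syl_dur
    b = pause_after
    eps = 1e-6  # guard against a zero-length pause in the ratio
    return {
        "pause_dur": b,                                # pause duration b
        "avg_syl_dur": (B + C + D + E) / 4.0,          # average syllable duration
        "avg_syl_dur_ratio": (D + E) / (B + C),        # post- vs pre-boundary durations
        "syl_pause_product": C * b,                    # combination: product
        "syl_pause_ratio": C / (b + eps),              # combination: ratio
        "lengthening": C / ((A + B) / 2.0),            # pre-boundary lengthening
    }

# toy example: syllable C is lengthened before a 0.1 s pause
d = duration_features((0.2, 0.2, 0.4, 0.2, 0.2), 0.1)
```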

  11. Detection model

  12. Decision Tree • Trees are grown on the training data based on the maximum entropy reduction criterion • The probability of an IP is found by traversing the tree down to a leaf node
  (Illustrative decision tree for IP detection, with internal nodes such as "pitch offset < 12.99?", "have pause?", "syl_dur_ratio < 4.5?" and leaves giving [P(IP), P(non-IP)] pairs such as (0.2, 0.8), (0.6, 0.4), (0.8, 0.2), (0.3, 0.7))
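Such a tree is just a nested-threshold function. The sketch below is one plausible reading of the slide's toy tree; the topology is assumed, and the thresholds and leaf probabilities are the slide's illustrative values, not learned ones:

```python
def p_ip(feats):
    """Return P(IP) at a syllable boundary by traversing the illustrative
    tree down to a leaf.  Topology is an assumed reading of the figure."""
    if feats["pitch_offset"] < 12.99:
        # left subtree: split on the presence of a pause at the boundary
        return 0.6 if feats["have_pause"] else 0.2
    # right subtree: split on the syllable duration ratio
    return 0.8 if feats["syl_dur_ratio"] < 4.5 else 0.3

# a boundary with a large pitch offset and a moderate duration ratio
p = p_ip({"pitch_offset": 20.0, "have_pause": False, "syl_dur_ratio": 2.0})
```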

  13. Maximum Entropy Model (1/2) • Various problem-specific knowledge can be incorporated into the model through properly designed feature functions • Feature functions fi (can be binary or real valued), where x: prosodic features at the syllable boundary, y: IP or non-IP, i: indexes the set of features • Binary feature function example: fi(x, y) = 1 if there is a pause at the boundary (x) and the boundary is an IP (y), and 0 otherwise

  14. Maximum Entropy Model (2/2) • Known statistics: the expectation of each feature function fi obtained from the training data • Constraint: the expectation of each feature function fi with respect to the desired model must match these known statistics • Among all the distributions that satisfy this set of constraints, choose the one with the highest entropy
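A minimal sketch of such a model for the binary IP / non-IP case: with two labels, the conditional maxent solution takes logistic form P(IP|x) = sigmoid(w·f(x)), and gradient ascent on the log-likelihood drives the model expectation of each feature function toward its empirical expectation. The feature matrix and labels below are toy assumptions:

```python
import numpy as np

def train_maxent(F, y, n_iter=500, lr=0.5):
    """Tiny conditional maxent model for binary labels (IP / non-IP).
    F: (n_samples, n_feats) values of the feature functions f_i(x);
    the gradient is exactly (empirical - model) expectation of each f_i,
    so training enforces the maxent constraints."""
    w = np.zeros(F.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(F @ w)))   # model P(IP | x)
        grad = F.T @ (y - p) / len(y)        # empirical minus model expectation
        w += lr * grad
    return w

# toy data: a single binary feature function "pause present at boundary"
F = np.array([[1.0], [1.0], [0.0], [0.0], [1.0], [0.0]])
y = np.array([1, 1, 0, 0, 1, 0])             # pauses coincide with IPs here
w = train_maxent(F, y)
```

At convergence, the model's expectation of the pause feature matches the empirical one, which is precisely the maxent constraint being satisfied.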

  15. Integrating DT & Maxent (DT-ME) • We use decision trees built from the training data to derive the feature functions for the maximum entropy model • First, grow deep and bushy trees from the training data • Then, for each sample (training or testing) • Traverse the trees down to certain leaves • Each leaf serves as a single binary feature function • i.e. whether the sample falls in this leaf (1) or not (0) • The result is a binary indicator feature vector, e.g. 0 1 0 0 1 1 0 0 0 0 1 0
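The leaf-to-feature-function step can be sketched as follows. For brevity this uses a single scikit-learn tree standing in for the paper's "deep and bushy" trees, and random toy data standing in for the prosodic features; both are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def leaf_indicator_features(tree, X, leaf_ids):
    """One binary feature function per leaf: 1 if the sample falls in
    that leaf, 0 otherwise -- the 0/1 vector fed to the maxent model."""
    leaves = tree.apply(X)  # leaf node id for every sample
    return (leaves[:, None] == leaf_ids[None, :]).astype(float)

# toy data standing in for prosodic feature vectors at syllable boundaries
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] > 0).astype(int)      # stand-in IP / non-IP labels

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
leaf_ids = np.unique(tree.apply(X_train))      # the leaves seen in training
F_train = leaf_indicator_features(tree, X_train, leaf_ids)
```

With several trees, the per-tree indicator vectors are simply concatenated, giving a sparse binary vector with one 1 per tree.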

  16. Latent Prosodic Modeling

  17. Latent prosodic modeling (LPM) • Model the probabilistic behavior of prosodic features in terms of latent factors • Prosodic characters and prosodic terms are derived from the prosodic features
  (Pipeline: speech → 1st-pass recognition → recognized syllables; prosodic features x → VQ → prosodic characters (e.g. 23, 5, 0, 14, 31, …) → N-grams → prosodic terms (e.g. (23,5), (5,0), (23,5,0), …))
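The character/term derivation can be sketched as vector quantization followed by n-gram extraction. In this illustration the codebook is given directly (the paper trains it by VQ), and the tiny 2-D feature vectors are toy assumptions:

```python
import numpy as np

def to_prosodic_terms(feature_vectors, codebook, n=2):
    """Quantize per-syllable prosodic feature vectors to 'prosodic
    characters' (index of the nearest codeword), then form n-gram
    'prosodic terms' over the character sequence."""
    X = np.asarray(feature_vectors, dtype=float)
    # nearest-codeword assignment by squared Euclidean distance
    d = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    chars = d.argmin(axis=1)
    # sliding n-grams over the character sequence
    terms = [tuple(chars[i:i + n]) for i in range(len(chars) - n + 1)]
    return chars.tolist(), terms

# toy codebook with three codewords and three syllables' feature vectors
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
chars, terms = to_prosodic_terms([[0.1, 0.1], [0.9, 1.1], [0.1, 0.9]], codebook)
```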

  18. Latent prosodic modeling (LPM) • Prosodic documents of three levels (segment, utterance & speaker): collections of prosodic terms • (The segments are obtained from the best fitting piece-wise linear function for the pitch contour)

  19. Latent prosodic modeling (LPM) • The relationship between the prosodic terms and the prosodic documents is modeled via latent prosodic states in the probabilistic framework of Probabilistic Latent Semantic Analysis (PLSA)
  (Graph: prosodic documents d1 … dN connect to latent prosodic states z1 … zL with probabilities P(zl | di), which in turn connect to prosodic terms t1 … tN' with probabilities P(tk | zl))

  20. Latent prosodic modeling (LPM) • The probabilities were trained with the EM algorithm by maximizing the total likelihood function • The complicated behavior of the prosodic features can then be analyzed based on these probabilities • For instance: similarity measures
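The EM training can be sketched with the standard PLSA updates over a term-document count matrix. The block-separable toy counts below are an assumption for illustration; the paper's documents are collections of prosodic terms at the segment, utterance, or speaker level:

```python
import numpy as np

def plsa_em(counts, n_states=2, n_iter=50, seed=0):
    """EM training for PLSA: counts[i, k] = count of prosodic term t_k
    in prosodic document d_i; learns P(z|d) and P(t|z) by maximizing
    the total likelihood of the counts."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = counts.shape
    p_z_d = rng.random((n_docs, n_states))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_t_z = rng.random((n_states, n_terms))
    p_t_z /= p_t_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z | d, t) for every document-term pair
        joint = p_z_d[:, :, None] * p_t_z[None, :, :]            # (d, z, t)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from expected counts
        w = counts[:, None, :] * post                            # (d, z, t)
        p_t_z = w.sum(axis=0)
        p_t_z /= p_t_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = w.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_t_z

# toy counts: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3
counts = np.array([[10., 10., 0., 0.],
                   [ 9., 11., 0., 0.],
                   [ 0., 0., 10., 10.],
                   [ 0., 0., 11., 9.]])
p_z_d, p_t_z = plsa_em(counts)
```

The learned P(z|d) rows then serve as low-dimensional descriptions of the documents, on which similarity measures can be defined.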

  21. LPM-based Detection model

  22. LPM for IP detection • LPM-based model adaptation • The LPM model is trained on the raw data and used for actively selecting relevant training data for a specific testing condition
  (Pipeline: LPM training and latent prosodic space construction from a corpus of training prosodic documents; testing prosodic documents are projected into the latent prosodic space, compared against the training documents, and relevant ones selected (HAC-based or KNN-based selection); the selected prosodic documents are used to train the LPM-adapted detection models)

  23. LPM for IP detection • Anchor model training • The prosodic documents associated with the different classes of disfluency IPs (class 1 … class c) are merged into super-documents used to train the Anchor models • The prosody of each IP candidate is then compared against these Anchors for classification • Can be used together with the training data selection mentioned above

  24. LPM for IP detection • Integration of LPM-adapted classification models with SVM • Two classification models: DT-ME or Anchor model • Adapted at segment, utterance, or speaker level

  25. LPM for IP detection • LPM-based feature expansion for DT-ME • Two sets of features can be used • (F1) The probabilities of each prosodic state given the prosodic document • (F2) The likelihood of the prosodic terms given the prosodic document

  26. Experiment Results

  27. Corpus Used in the Research • Mandarin Conversational Dialogue Corpus (MCDC) • 30 conversational dialogues (27 hours in total) • 8 of the 30 dialogues were annotated with disfluencies (8.2 hours in total; 9 female & 7 male speakers) • (Table: summary of the experiment data)

  28. IP detection results • The decision tree achieved moderate and balanced recall and precision rates • The integrated approach trades degraded recall for significantly better precision • For the purposes of speech recognition, the integrated approach is more appropriate • An incorrectly detected IP may cause recognition errors • A missed IP can simply be processed as usual

  29. Analysis of Importance of Roles for Different Features

  30. Identify important features for IP detection • Exclude each single feature from the full set and then perform the complete IP detection process • The detection performance degradation caused by removing each single feature is obtained (chart: original vs. degraded performance, plotted per feature)
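This leave-one-feature-out ablation can be sketched as below. A scikit-learn decision tree and F1 stand in for the paper's detector and its evaluation metric, and the toy data (one informative feature, one pure-noise feature) is an assumption for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def ablation_importance(X_tr, y_tr, X_te, y_te):
    """Leave-one-feature-out ablation: retrain without each single
    feature and record the drop in detection score (a larger drop
    means a more important feature)."""
    n_feats = X_tr.shape[1]
    def score(cols):
        clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:, cols], y_tr)
        return f1_score(y_te, clf.predict(X_te[:, cols]))
    full = score(list(range(n_feats)))
    drops = {j: full - score([c for c in range(n_feats) if c != j])
             for j in range(n_feats)}
    return full, drops

# toy data: feature 0 is informative, feature 1 is pure noise
rng = np.random.default_rng(2)
y = rng.integers(0, 2, 200)
X = np.column_stack([y + rng.normal(0, 0.1, 200), rng.normal(size=200)])
full, drops = ablation_importance(X[:150], y[:150], X[150:], y[150:])
```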

  31. Investigation of how the two feature categories are related to IP detection • The most serious performance degradation caused by removing one single feature from each of the two categories is shown • For overt repair and partial repetition, pitch-related features play a relatively more important role in IP detection • For direct repetition IP detection, the duration-related features are more important • For abandoned-utterance IP detection, both feature categories have equally important impact

  32. Importance of each individual pitch-related feature • The most serious performance degradation caused by removing one single pitch-related feature ((a)–(f) denote specific features in the chart; the numbers denote the disfluency types concerned) • Average pitch value within a syllable: (b), (d) — types 1, 3, 4 • Maximum difference of pitch values within a syllable: (e), (f) — types 3, 4 • Magnitude of pitch reset at boundaries: (a) — types 1, 2

  33. Importance of each individual duration-related feature • The most serious performance degradation caused by removing one single duration-related feature ((g)–(m) denote specific features in the chart; the numbers denote the disfluency types concerned) • Jointly considering both the syllable & pause durations is useful • The ratio of syllable duration to pause duration: (g), (h), (k) — types 1, 3, 4 • The product of the two: (i), (j), (m) — types 2, 4 • The character duration ratio across the boundary: (l) — type 3 • Standard deviation of the product of syllable and pause duration: (m) — type 4

  34. Results for IP detection

  35. IP detection experiment • Three different feature sets tested (using decision trees) • [feature set 1] The same as used in previous work • [feature set 2] The same as the above, but extracted at syllable boundaries • [proposed feature set] The proposed features extracted at syllable boundaries • (ktr: known transcription; rec: recognition results with errors)

  36. IP detection experiment • Comparison of different IP detection approaches (including DT-ME), using the new feature set proposed here

  37. LPM for IP detection • IP detection accuracy using the LPM-based DT-ME or anchor models

  38. LPM for IP detection • Comparison of the LPM-based feature expansions (F1), (F2), and (F1)+(F2) • (e) yielded the best result, in which the final enhanced DT-ME model was combined with the anchor model by SVM

  39. Conclusion • A whole set of prosodic features for disfluency IP detection was developed, tested, and analyzed • The most important features for each disfluency type were identified and discussed

  40. Conclusion • A new disfluency IP detection model that incorporates decision trees into a maximum entropy model was developed • Latent prosodic modeling, which analyzes speech prosody in a probabilistic framework of latent prosodic states, was proposed to adapt the IP detection models

  41. Thanks for your attention
