
Prosody in Spoken Language Understanding



Presentation Transcript


  1. Prosody in Spoken Language Understanding Gina Anne Levow University of Chicago January 4, 2008 NLP Winter School 2008

  2. U: Give me the price for AT&T. • U: Give me the price for AT&T. • U: Give me the price for AT&T. • U: Give me the price for American Telephone and Telegraph.

  3. Error Correction Spiral • U: Give me the price for AT&T. • S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday. • U: Give me the price for AT&T. • S: Data General was at 10 ½ up a quarter. • U: Give me the price for AT&T. • S: Hewlett Packard was 83 ¾, up 2 ½. • U: Give me the price for American Telephone and Telegraph. • S: Excuse me?

  4. Roadmap • Corrections: A motivating example • Defining prosody • Why prosody? • Challenges in prosody • Prosody in language understanding • Recognizing tone and pitch accent • Spoken corrections, Topic segmentation • Conclusions

  5. Defining Prosody • Prosody: phonetic phenomena in speech that span more than a single segment – “suprasegmental” • Prosody includes: • Stress, focus, tone, intonation, length/pause, rhythm • Prosodic features include: • Pitch: perceptual correlate of fundamental frequency • f0: rate of vocal fold vibration • Loudness/intensity, duration, segment quality

  6. Why Prosody? • Prosody plays a crucial role • At all levels of language • Lexical, syntactic, pragmatic/discourse • Establishes meaning • Disambiguates sense and structure • Across languages families • Common physiological, articulatory basis • In synthesis and recognition of fluent speech

  7. Prosody and the Lexicon • Lexical: Determines word identity • Prosodic effect at the syllable level (minimal unit) • Lexical stress: syllable prominence • Combination of length, pitch movement, loudness • REcord (N) vs reCORD (V) • Pitch accent can differentiate words in some languages • Lexical tone: tone languages, e.g. Chinese, Punjabi • Pitch height (register) and/or shape (contour) • Ma (high): mother; Ma (rising): hemp; Ma (low): horse; Ma (falling): scold

  8. Prosody and Syntax • Prosody can disambiguate structure • Associated with chunking and attachment • Not identical with syntactic phrase boundaries • “Prosody is predictable from syntax, except when it isn’t” • Prosodic phrasing indicated by: • Some combination of pause, change in pitch

  9. Chunking, or “phrasing” A1: I met Mary and Elena’s mother at the mall yesterday. A2: I met Mary and Elena’s mother at the mall yesterday. [Example from Jennifer Venditti]

  10. Punctuation & Prosody Humor • A panda goes into a restaurant and has a meal. Just before he leaves, he takes out a gun and fires it. The irate restaurant owner says, ‘Why did you do that?’ The panda replies, ‘I'm a panda. Look it up.’ The restaurateur goes to his dictionary and under ‘panda’ finds: ‘black and white arboreal, bear-like creatures; eats, shoots and leaves.’

  11. Prosody in Pragmatics & Discourse • Focus: • Prominence, new information: pitch accent • “October eleventh” • Sentence type, dialogue act: • Statement vs. declarative question: “It’s raining(?)” • Discourse structure (topic), emotion [figure from Shih, Prosody Learning and Generation]

  12. Challenges in Prosody I • Highly variable • Actual realization differs from ideal • Speaker variation: • Gender, vocal tract differences, idiosyncrasy • Tonal coarticulation • Neighboring tones influence realization (as with segments) • Underlying fall can become rise • Parallel encoding • Effects at multiple levels realized simultaneously

  13. Challenges in Prosody II • Challenges for learning • Lack of training data • Sparseness: • Many prosodic phenomena are infrequent • E.g., non-declarative utterances, topic boundaries, contrastive accents, etc. • Challenging for machine learning methods • Costs of labeling: • Many prosodic events require expert labeling • Need a large corpus to attest rare phenomena • Time-consuming, expensive

  14. Context and Learning in Multilingual Tone and Pitch Accent Recognition

  15. Strategy: Context • Common model across languages • Pure acoustic-prosodic model • No word label, POS, lexical stress info • English, Mandarin Chinese (also Cantonese, isiZulu) • Exploit contextual information • Features from adjacent syllables, phrase contour • Analyze impact of • Context position, context encoding, context type • > 12.5% reduction in error over no context

  16. Data Collections • English: (Ostendorf et al, 95) • Boston University Radio News Corpus, f2b • Manually annotated, aligned, syllabified • 4 Pitch accent labels, aligned to syllables • Mandarin: • TDT2 Voice of America Mandarin Broadcast News • Automatically aligned, syllabified • 4 main tones, neutral

  17. Local Feature Extraction • Uniform representation for tone, pitch accent • Motivated by Pitch Target Approximation Model • Tone/pitch accent target exponentially approached • Linear target: height, slope (Xu et al, 99) • Base features: • Pitch, Intensity max, mean, min, range • (Praat, speaker normalized) • Pitch at 5 points across voiced region • Duration • Initial, final in phrase • Slope: • Linear fit to last half of pitch contour
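
The per-syllable feature scheme on this slide can be sketched in Python. This is an illustration, not the talk's actual code: the talk extracted pitch and intensity with Praat, while `frame_rate` and the dict representation here are assumptions.

```python
import numpy as np

def local_features(f0, frame_rate=100.0):
    """Sketch of slide-17-style base features for one syllable.
    f0: pitch values (Hz) across the syllable's voiced region, one per frame."""
    f0 = np.asarray(f0, dtype=float)
    feats = {
        "pitch_max": f0.max(),
        "pitch_mean": f0.mean(),
        "pitch_min": f0.min(),
        "pitch_range": f0.max() - f0.min(),
        "duration": len(f0) / frame_rate,  # seconds
    }
    # Pitch at 5 evenly spaced points across the voiced region
    idx = np.linspace(0, len(f0) - 1, 5).round().astype(int)
    for i, p in enumerate(f0[idx]):
        feats[f"pitch_pt{i}"] = p
    # Slope: linear fit to the last half of the pitch contour
    half = f0[len(f0) // 2:]
    x = np.arange(len(half)) / frame_rate
    feats["slope"] = np.polyfit(x, half, 1)[0]  # Hz per second
    return feats

# Example: a steadily rising 100 ms contour
feats = local_features([200, 210, 220, 230, 240, 250, 260, 270, 280, 290])
```

Speaker normalization (e.g. z-scoring pitch per speaker) would be applied before or after this step; it is omitted here.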

  18. Context Features • Local context: • Extended features • Pitch max, mean, adjacent points of preceding, following syllables • Difference features • Difference between • Pitch max, mean, mid, slope • Intensity max, mean • Of preceding, following and current syllable • Phrasal context: • Compute collection average phrase slope • Compute scalar pitch values, adjusted for slope
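
The "difference" encoding above can be illustrated with a small helper. This is hypothetical code: the feature names and per-syllable dicts are assumptions, not the talk's implementation.

```python
def difference_features(prev, cur, nxt, keys=("pitch_max", "pitch_mean", "slope")):
    """Sketch of the slide-18 difference encoding: the current syllable's
    values minus those of the preceding and following syllables.
    prev/cur/nxt are per-syllable feature dicts (e.g. from local_features)."""
    feats = dict(cur)  # keep the current syllable's local features
    for k in keys:
        feats[f"d_prev_{k}"] = cur[k] - prev[k]
        feats[f"d_next_{k}"] = cur[k] - nxt[k]
    return feats

# Example with invented values for three adjacent syllables
prev = {"pitch_max": 200.0, "pitch_mean": 190.0, "slope": 0.0}
cur  = {"pitch_max": 260.0, "pitch_mean": 240.0, "slope": 500.0}
nxt  = {"pitch_max": 230.0, "pitch_mean": 210.0, "slope": -100.0}
ctx = difference_features(prev, cur, nxt)
```

The "extended" encoding would instead append the neighbors' raw values; phrasal compensation would subtract the collection-average phrase slope from the pitch values first.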

  19. Classification Experiments • Classifier: Support Vector Machine • Linear kernel • Multiclass formulation • SVMlight (Joachims), LibSVM (Cheng & Lin 01) • 4:1 training / test splits • Experiments: Effects of • Context position: preceding, following, none, both • Context encoding: Extended/Difference • Context type: local, phrasal
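
A hedged sketch of this setup: the talk used SVMlight and LibSVM, so scikit-learn's `LinearSVC` merely stands in, and the random vectors below are stand-ins for the real syllable feature vectors.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                    # stand-in: 8 prosodic features/syllable
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # stand-in accent labels

# 4:1 training/test split, as on the slide
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

For the 4-way tone task, the multiclass formulation on the slide would train n(n-1)/2 pairwise binary classifiers (LibSVM's one-vs-one scheme).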

  20.–22. Results: Local Context (results charts not preserved in this transcript)

  23. Discussion: Local Context • Any context information improves over none • Preceding context information consistently improves over none or following context information • English: Generally more context features are better • Mandarin: Following context can degrade • Little difference in encoding (Extend vs Diffs) • Consistent with phonetic analysis (Xu) that carryover coarticulation is greater than anticipatory

  24. Results & Discussion: Phrasal Context • Phrase contour compensation enhances recognition • Simple strategy • Using non-linear slope compensation may improve results further

  25. Strategy: Training • Challenge: • Can we use the underlying acoustic structure of the language – through unlabeled examples – to reduce the need for expensive labeled training data? • Exploit semisupervised and unsupervised learning • Semi-supervised Laplacian SVM • K-means and asymmetric k-lines clustering • Substantially outperform baselines • Can approach supervised levels

  26. Semi-supervised Learning • Approach: • Employ small amount of labeled data • Exploit information from additional – presumably more available – unlabeled data • Few prior examples; several weakly supervised (Wong et al, ’05) • Classifier: • Laplacian SVM (Sindhwani, Belkin & Niyogi ’05) • Semi-supervised variant of SVM • Exploits unlabeled examples • RBF kernel, typically 6 nearest neighbors, transductive

  27. Experiments • Pitch accent recognition: • Binary classification: Unaccented/Accented • 1000 instances, proportionally sampled • Labeled training: 200 unacc, 100 acc • 80% accuracy (cf. 84% with SVM and 15x the labeled data) • Mandarin tone recognition: • 4-way classification: n(n-1)/2 binary classifiers • 400 instances: balanced; 160 labeled • Clean lab speech (in-focus): 94% • cf. 99% with SVM trained on 1000s of samples; 85% with SVM on 160 training samples • Broadcast news: 70% • cf. < 50% with SVM on 160 training samples
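
Laplacian SVM is not in standard toolkits, so the sketch below uses scikit-learn's graph-based `LabelSpreading` to illustrate the same idea: propagating a handful of labels through unlabeled points. The synthetic 2-D "tone" clusters are assumptions standing in for the acoustic feature vectors.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(1)
# Two Gaussian clusters standing in for two tone categories in acoustic space
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

y_train = np.full(400, -1)        # -1 marks unlabeled examples
labeled = np.r_[0:10, 200:210]    # only 20 labeled points
y_train[labeled] = y[labeled]

model = LabelSpreading(kernel="rbf", gamma=1.0).fit(X, y_train)
accuracy = (model.transduction_ == y).mean()  # transductive: score all points
```

As on the slide, when the unlabeled acoustic structure is clean, a few labels can approach fully supervised accuracy.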

  28. Unsupervised Learning • Question: • Can we identify the tone structure of a language from the acoustic space without training? • Analogous to language acquisition • Significant recent research in unsupervised clustering • Established approaches: k-means • Spectral clustering (Shi & Malik ’97; Fischer & Poland 2004): asymmetric k-lines • Little research for tone • Self-organizing maps (Gauthier et al, 2005) • Tones identified in lab speech using f0 velocities • Cluster-based bootstrapping (Narayanan et al, 2006) • Prominence clustering (Tamburini ’05)

  29. Contrasting Clustering • Contrasts: • Clustering: 2-16 clusters, label with most frequent class • 3 spectral approaches: • Perform spectral decomposition of affinity matrix • Asymmetric k-lines (Fischer & Poland 2004) • Symmetric k-lines (Fischer & Poland 2004) • Laplacian Eigenmaps (Belkin, Niyogi & Sindhwani 2004) • Binary weights, k-lines clustering • K-means: standard Euclidean distance • # of clusters: 2-16 • Best results: > 78% • 2 clusters: asymmetric k-lines; > 2 clusters: k-means • Larger # of clusters: all similar
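
The evaluation scheme on this slide (cluster, then label each cluster with its most frequent class) can be sketched as follows. Only k-means is shown; the spectral variants are omitted, and the toy data is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_accuracy(X, y, n_clusters):
    """Cluster X, give each cluster its majority class label, score accuracy."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    correct = 0
    for c in range(n_clusters):
        members = y[labels == c]
        if len(members):
            correct += np.bincount(members).max()  # count of the majority class
    return correct / len(y)

# Toy two-class data standing in for tone/accent feature vectors
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
acc = cluster_accuracy(X, y, n_clusters=2)
```

With more clusters than classes, several clusters can map to the same class, which is why the slide reports k-means catching up at larger cluster counts.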

  30. Contrasting Learners (comparison chart not preserved in this transcript)

  31. Tone Clustering: I • Mandarin four tones: • 400 samples: balanced • 2-phase clustering: 2-5 clusters each • Asymmetric k-lines, k-means clustering • Clean read speech: • In-focus syllables: 87% (cf. 99% supervised) • In-focus and pre-focus: 77% (cf. 93% supervised) • Broadcast news: 57% (cf. 74% supervised) • K-means requires more clusters to reach k-lines level

  32. Tone Structure • First phase of clustering splits high/rising from low/falling tones by slope • Second phase splits by pitch height

  33. Conclusions • Common prosodic framework for tone and pitch accent recognition • Contextual modeling enhances recognition • Local context and broad phrase contour • Carryover coarticulation has larger effect for Mandarin • Exploiting unlabeled examples for recognition • Semi- and Un-supervised approaches • Best cases approach supervised levels with less training • Exploits acoustic structure of tone and accent space

  34. Error Correction Spiral • U: Give me the price for AT&T. • S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday. • U: Give me the price for AT&T. • S: Data General was at 10 ½ up a quarter. • U: Give me the price for AT&T. • S: Hewlett Packard was 83 ¾, up 2 ½. • U: Give me the price for American Telephone and Telegraph. • S: Excuse me?

  35. Recognizing Spoken Corrections • Spoken Corrections • Recognize user attempts to correct ASR failures • Compare original input to repeat corrections • Significant differences: • Corrections: increases in duration, pause #/length, final fall • Increases in pitch accent for misrecognitions • Automatic recognition with decision trees, boosting • Distinguish corrective/not (human level) • Key features: raw/normalized duration, pause • Identify specific word being corrected • Key features: highest pitch, widest pitch range
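
A minimal sketch of the decision-tree step on this slide, assuming invented duration/pause values in place of the real correction data (the boosting step is omitted):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
n = 300
# Invented stand-ins: corrections tend to be longer, with more/longer pauses
duration = np.r_[rng.normal(0.4, 0.1, n), rng.normal(0.6, 0.1, n)]
pause    = np.r_[rng.normal(0.1, 0.05, n), rng.normal(0.3, 0.05, n)]
X = np.c_[duration, pause]
y = np.r_[np.zeros(n), np.ones(n)].astype(int)   # 1 = correction

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
accuracy = tree.score(X, y)
```

The real system also normalized durations per speaker and per word, which is what makes raw/normalized duration the key feature pair.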

  36. The Problem: Speech Topic Segmentation • Separate audio stream into component topics: On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock Texas prepares for catastrophe, Bangalore, in India, sees only profit. ||

  37. Is It Possible in Mandarin?

  38. Recognizing Shifts in Topic & Turn • Topic & Turn boundaries in English & Mandarin • Initial syllables: • Significantly higher pitch, loudness than final • Lexical and prosodic cues: • Cue words, tf*idf similarity; pitch, loudness, silence • Automatic recognition with decision trees, boosting • Voting to combine text, prosody, silence: 97% accuracy • Key features: • Pause; pitch, loudness contrast between syllables
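
The tf*idf lexical cue on this slide can be sketched as cosine similarity between adjacent text windows, which should dip at a topic boundary. The windows and vectorizer settings below are assumptions for illustration, not the talk's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy adjacent windows: the first two share a topic, as do the last two
windows = [
    "stock markets fell global economic anxiety",
    "markets dropped worldwide investors anxious",
    "massacre kosovo allies prepare response",
    "nato kosovo military action planned",
]
tfidf = TfidfVectorizer().fit_transform(windows)
sims = [cosine_similarity(tfidf[i], tfidf[i + 1])[0, 0]
        for i in range(len(windows) - 1)]
boundary = sims.index(min(sims)) + 1  # lowest similarity marks the topic shift
```

In the full system this lexical score is combined, by voting, with the prosodic cues (pause, initial-syllable pitch and loudness contrast).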

  39. Conclusions & Opportunities • Prosody • Rich source of information across languages • Challenging due to variation, paucity of data • Can be successfully employed, with learning, to improve language understanding • Pitch accent, tone, dialogue act, turn, topic, … • Unrestricted conversational, multi-party, multimodal speech much more challenging • Increased variability, interaction with non-verbal evidence

  40. Thanks • Dinoj Surendran, Siwei Wang, Yi Xu • V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C-C. Cheng & C. Lin • This work supported by NSF Grant #0414919 • http://people.cs.uchicago.edu/~levow/tai

  41. Phrasing can disambiguate • Grouping: [Mary & Elena’s mother] [mall] • “I met Mary and Elena’s mother at the mall yesterday.” • One intonation phrase with relatively flat overall pitch range.

  42. Phrasing can disambiguate • Grouping: [Mary] [Elena’s mother] [mall] • “I met Mary and Elena’s mother at the mall yesterday.” • Separate phrases, with expanded pitch movements.

  43. Lists of numbers, nouns twenty.eight.five ninety.four.three seventy.three.seven forty.seven.seven seventy.seven.seven coffee cake and cream chocolate ice cream and cake fish fingers and bottles cheese sandwiches and milk cream buns and chocolate [from Prosody on the Web tutorial on chunking]

  44. Clustering • Pitch accent clustering: • 4-way distinction: 1000 samples, proportional • 2-16 clusters constructed • Assign most frequent class label to each cluster • Classifier: • Asymmetric k-lines: • Context-dependent kernel radii, non-spherical • > 78% accuracy: • 2 clusters: asymmetric k-lines best • Context effects: • Vectors with preceding context vs. vectors with no context perform comparably
