
Modeling Prosodic Sequences with K-Means and Dirichlet Process GMMs


Presentation Transcript


  1. Modeling Prosodic Sequences with K-Means and Dirichlet Process GMMs Andrew Rosenberg Queens College / CUNY Interspeech 2013 August 26, 2013

  2. Prosody • Prosody – Pitch, Intensity, Rhythm, Silence • Prosody carries information about a speaker’s intent and identity. • Here: prosodic recognition of • Speaking Style • Nativeness • Speaker

  3. Approach • Unsupervised clustering of acoustic/prosodic features. • Sequence modeling of cluster identities

  4. K-Means • K-means is a simple distance-based clustering algorithm. • Iterative and non-deterministic (sensitive to initialization). • Must specify K. • We evaluate K between 2 and 100; the optimal value is selected by cross-validation for each task.
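The K-means step on this slide can be sketched as follows. This is a minimal illustration, not the paper's code: it fits K-means to stand-in 7-dimensional syllable feature vectors (random here) and reads off one cluster ID per syllable, which later slides feed into the sequence models.

```python
# Sketch: cluster 7-dim prosodic feature vectors with K-means and
# return a cluster-ID sequence. Random data stands in for the real
# syllable features; k would be tuned by cross-validation per task.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 7))  # stand-in for syllable features

def cluster_ids(features, k):
    """Fit K-means (sensitive to initialization, so n_init restarts
    and a fixed random_state) and return one cluster ID per syllable."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    return km.fit_predict(features)

ids = cluster_ids(features, k=8)
```

Fixing `random_state` addresses the non-determinism the slide mentions; in practice one would sweep `k` over 2..100 and pick by downstream task performance.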

  5. Dirichlet Process GMMs • Non-parametric infinite mixture model • Needs a prior over π – the Dirichlet process • and a prior over the Gaussian component means – a zero-mean Gaussian (the base distribution G0) • Still need to set hyperparameters α and G0 • Stick-breaking & Chinese Restaurant metaphors • Blei and Jordan 2005 variational inference • “Rich get Richer” • Plate notation from M. Jordan 2005 NIPS tutorial

  6. DPGMM “Rich get Richer” • Artificially omit the largest cluster • α = 0.25
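A sketch of the DPGMM and the DPGMM' variant above, using scikit-learn's truncated variational approximation (`BayesianGaussianMixture` with a Dirichlet-process weight prior) rather than the exact model; the data is random stand-in features, and the concentration value mirrors the slide's α = 0.25.

```python
# Sketch: truncated variational DP-GMM (an approximation to the
# slide's model), plus DPGMM' — drop the largest cluster and
# reassign its points to their next most likely component.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))  # stand-in for syllable features

dpgmm = BayesianGaussianMixture(
    n_components=30,  # truncation level for the "infinite" mixture
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.25,  # alpha from the slide
    random_state=0,
).fit(X)
ids = dpgmm.predict(X)

# DPGMM': zero out the largest ("rich get richer") cluster's
# responsibilities and re-take the argmax.
largest = np.bincount(ids).argmax()
probs = dpgmm.predict_proba(X)
probs[:, largest] = 0.0
ids_prime = probs.argmax(axis=1)
```

The `n_components=30` truncation is an assumption for the sketch; the stick-breaking construction itself allows unboundedly many components.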

  7. Prosodic Event Distribution • ToBI Prosodic Labels • Pitch Accents, Phrase Accent/Boundary Tones (Figures: Accent Type Distribution; Phrase Ending Distribution)

  8. Sequence Modeling • SRILM 3-gram model • Backoff & Good-Turing (GT) smoothing • Clusters learned over all material • Sequence models trained on the training sets only
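The sequence-modeling step can be illustrated with a small stand-in for SRILM: a trigram model over cluster-ID sequences with simple add-one smoothing (not SRILM's Good-Turing/backoff), just to make the perplexity computation behind the later classification experiments concrete.

```python
# Sketch (illustrative stand-in for SRILM, not the paper's setup):
# a 3-gram model over cluster-ID sequences with add-one smoothing,
# and the per-token perplexity used to score test sequences.
import math
from collections import Counter

def train_trigram(sequences, vocab_size):
    """Return a log-probability function log p(w | h) for a trigram
    model trained on the given token sequences."""
    tri, bi = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>", "<s>"] + list(seq) + ["</s>"]
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1

    def logprob(w, h):
        # Add-one smoothing in place of Good-Turing/backoff.
        return math.log((tri[(*h, w)] + 1) / (bi[h] + vocab_size))

    return logprob

def perplexity(logprob, seq):
    padded = ["<s>", "<s>"] + list(seq) + ["</s>"]
    lp = sum(logprob(padded[i], tuple(padded[i - 2:i]))
             for i in range(2, len(padded)))
    return math.exp(-lp / (len(padded) - 2))
```

For the classification experiments on the next slide, one such model would be trained per class and a test sequence assigned to the class whose model gives the lowest perplexity; for outlier detection, a single model's perplexity is compared against a learned threshold.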

  9. Experiments • Classification • Train one SRILM model per class. • Classify by lowest perplexity. • Outlier Detection • Train a single model. • Classifier learns a perplexity threshold. • Tasks: Speaking Style, Nativeness, Speaker Recognition • Evaluation • 500 samples of 10–100 syllables (~2–20 seconds) • ToBI, K-Means, DPGMM, DPGMM’ (removing the largest cluster) • 5-fold cross-validation to learn hyperparameters

  10. Data • Boston Directions Corpus • READ, SPONTANEOUS • 4 speakers (used for Speaker Classification) • Boston University Radio News Corpus • BROADCAST NEWS • 6 speakers • Columbia Games Corpus • SPONTANEOUS DIALOG • 13 speakers • Native Mandarin Chinese Speakers reading BURNC stories. • 4 speakers • All ToBI Labeled

  11. Features • Villing (2004) pseudo-syllabification • Syllables with mean intensity below 10 dB are considered “silent” • 7 Features • Mean range-normalized intensity • Mean range-normalized delta intensity • Mean z-score normalized log f0 • Mean z-score normalized delta log f0 • Syllable duration • Duration of previous silence (if any) • Duration of following silence (if any)
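The seven features above can be sketched for a single syllable as below. This is a hypothetical helper, not the paper's extraction code: it assumes frame-wise intensity and log-f0 tracks for the syllable plus utterance-level normalization statistics, and treats "delta" as the mean frame-to-frame difference of the normalized track.

```python
# Sketch (hypothetical helper): the 7 syllable-level prosodic
# features, given frame-wise tracks and utterance-level stats.
import numpy as np

def syllable_features(intensity, log_f0, duration,
                      prev_silence, next_silence,
                      utt_int_min, utt_int_max,
                      utt_f0_mean, utt_f0_std):
    """intensity / log_f0: frame values within the syllable;
    utt_* arguments are utterance-level normalization statistics;
    prev/next_silence are 0.0 when there is no adjacent silence."""
    norm_int = (intensity - utt_int_min) / (utt_int_max - utt_int_min)
    z_f0 = (log_f0 - utt_f0_mean) / utt_f0_std
    return np.array([
        norm_int.mean(),           # mean range-normalized intensity
        np.diff(norm_int).mean(),  # mean delta intensity
        z_f0.mean(),               # mean z-score normalized log f0
        np.diff(z_f0).mean(),      # mean delta log f0
        duration,                  # syllable duration
        prev_silence,              # duration of previous silence
        next_silence,              # duration of following silence
    ])
```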

  12. Consistency with ToBI labels • V-Measure between • ToBI Accent Types and clusters • ToBI Intonational Phrase-ending Tones and clusters • K-means: solid line • DPGMM: gray line for reference (doesn’t vary by more than 0.001) (Figures: Accenting; Phrasing)
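The V-Measure comparison above can be computed with scikit-learn. The labels here are a toy example, not the corpus data: when clusters align perfectly with the ToBI categories, the score is 1.0.

```python
# Sketch: V-measure between ToBI labels and cluster IDs, using
# scikit-learn's implementation on toy (hypothetical) labels.
from sklearn.metrics import v_measure_score

tobi = ["H*", "H*", "L*", "L*", "none", "none"]  # hypothetical labels
clusters = [0, 0, 1, 1, 2, 2]
score = v_measure_score(tobi, clusters)  # 1.0 for a perfect match
```

V-measure is the harmonic mean of homogeneity and completeness, so it penalizes both clusters that mix ToBI categories and categories that are split across clusters.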

  13. Speaking Style Recognition • 4 styles: READ, SPON, BN, DIALOG • Single speaker for evaluation. (Figures: Outlier Detection – Dialog; Classification)

  14. Nativeness Recognition • Native (BURNC) vs. Non-Native • Single speaker for evaluation. (Figures: Outlier Detection – Native; Classification)

  15. Speaker Recognition • 6 BURNC Speakers • Detect f2b vs. others • 4 BDC Speakers • 6 tasks for training, 3 for testing (Figures: Outlier Detection; Classification)

  16. Conclusions • K-means works well to represent prosodic information • DPGMM does not work so well out-of-the-box. • Despite being non-parametric, hyperparameter setting is still critically important • Future Work • Larger acoustic/prosodic feature set. • requires pre-processing • Evaluating the universality of prosodic representations • Integration of K-means and DPGMM. • Use one to seed the other.

  17. Thank you andrew@cs.qc.cuny.edu http://speech.cs.qc.cuny.edu
