
Unsupervised and Semi-Supervised Learning of Tone and Pitch Accent





  1. Unsupervised and Semi-Supervised Learning of Tone and Pitch Accent Gina-Anne Levow University of Chicago June 6, 2006

  2. Roadmap • Challenges for Tone and Pitch Accent • Variation and Learning • Data collections & processing • Learning with less • Semi-supervised learning • Unsupervised clustering • Approaches, structure, and context • Conclusion

  3. Challenges: Tone and Variation • Tone and Pitch Accent Recognition • Key component of language understanding • Lexical tone carries word meaning • Pitch accent carries semantic, pragmatic, discourse meaning • Non-canonical form (Shen 90, Shih 00, Xu 01) • Tonal coarticulation modifies surface realization • In extreme cases, fall becomes rise • Tone is relative • To speaker range • High for male may be low for female • To phrase range, other tones • E.g. downstep

  4. Challenges: Training Demands • Tone and pitch accent recognition • Exploit data-intensive machine learning • SVMs (Thubthong 01, Levow 05, SLX 05) • Boosted and bagged decision trees (X. Sun, 02) • HMMs (Wang & Seneff 00, Zhou et al 04, Hasegawa-Johnson et al 04, …) • Can achieve good results with large sample sets • ~10K lab syllable samples -> >90% accuracy • Training data expensive to acquire • Time – pitch accent labeling takes tens of times real-time • Money – requires skilled labelers • Limits investigation across domains, styles, etc. • Human language acquisition doesn't use labels

  5. Strategy: Training • Challenge: • Can we use the underlying acoustic structure of the language – through unlabeled examples – to reduce the need for expensive labeled training data? • Exploit semi-supervised and unsupervised learning • Semi-supervised Laplacian SVM • K-means and asymmetric k-lines clustering • Substantially outperform baselines • Can approach supervised levels

  6. Data Collections I: English • English: (Ostendorf et al, 95) • Boston University Radio News Corpus, f2b • Manually ToBI annotated, aligned, syllabified • Pitch accent aligned to syllables • 4-way: Unaccented, High, Downstepped High, Low • (Sun 02, Ross & Ostendorf 95) • Binary: Unaccented vs Accented

  7. Data Collections II: Mandarin • Mandarin: • Lexical tones: • High, Mid-rising, Low, High falling, Neutral

  8. Data Collections III: Mandarin • Mandarin Chinese: • Lab speech data: (Xu, 1999) • 5 syllable utterances: vary tone, focus position • In-focus, pre-focus, post-focus • TDT2 Voice of America Mandarin Broadcast News • Automatically force aligned to anchor scripts • Automatically segmented, pinyin pronunciation lexicon • Manually constructed pinyin-ARPABET mapping • CU Sonic – language porting • 4-way: High, Mid-rising, Low, High falling

  9. Local Feature Extraction • Motivated by Pitch Target Approximation Model • Tone/pitch accent target exponentially approached • Linear target: height, slope (Xu et al, 99) • Scalar features: • Pitch, Intensity max, mean (Praat, speaker normalized) • Pitch at 5 points across voiced region • Duration • Initial, final in phrase • Slope: • Linear fit to last half of pitch contour
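The scalar features above can be sketched in a few lines of NumPy. This is a minimal sketch assuming per-frame, speaker-normalized pitch and intensity arrays for one voiced region; the function name, feature order, and frame handling are illustrative assumptions, not the talk's implementation:

```python
import numpy as np

def local_features(f0, intensity, duration):
    """Sketch of the slide's scalar feature vector for one syllable.

    f0, intensity: per-frame arrays over the voiced region (already
    speaker-normalized, as in the talk); duration: scalar.
    """
    f0 = np.asarray(f0, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    # Pitch sampled at 5 evenly spaced points across the voiced region
    idx = np.linspace(0, len(f0) - 1, 5).round().astype(int)
    five_points = f0[idx]
    # Slope: least-squares linear fit to the last half of the pitch contour
    half = f0[len(f0) // 2:]
    slope = np.polyfit(np.arange(len(half)), half, 1)[0]
    return np.concatenate([
        [f0.max(), f0.mean()],                # pitch max, mean
        five_points,                          # 5-point pitch sampling
        [intensity.max(), intensity.mean()],  # intensity max, mean
        [duration, slope],
    ])
```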

  10. Context Features • Local context: • Extended features • Pitch max, mean, adjacent points of adjacent syllable • Difference features wrt adjacent syllable • Difference between • Pitch max, mean, mid, slope • Intensity max, mean • Phrasal context: • Compute collection average phrase slope • Compute scalar pitch values, adjusted for slope
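The two context encodings compared in the talk can be sketched as follows; the dict keys are illustrative names for the slide's features, not identifiers from the original system:

```python
import numpy as np

# Scalar features compared against the adjacent syllable; names are
# illustrative labels for the features the slide lists.
KEYS = ["pitch_max", "pitch_mean", "pitch_mid", "slope",
        "int_max", "int_mean"]

def context_features(cur, prev):
    """Context encodings w.r.t. the preceding syllable, per the slide:
    extended features append the neighbor's values directly; difference
    features subtract them from the current syllable's values."""
    extended = [prev[k] for k in KEYS]          # extended encoding
    diffs = [cur[k] - prev[k] for k in KEYS]    # difference encoding
    return np.array(extended + diffs)
```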

  11. Experimental Configuration • English Pitch Accent: • Proportionally sampled: 1000 examples • 4-way and binary classification • Contextualization representation, preceding syllables • Mandarin Tone: • Balanced tone sets: 400 examples • Vary data set difficulty: clean lab -> broadcast • 4 tone classification • Simple local pitch only features • Prior lab speech experiments effective with local features

  12. Semi-supervised Learning • Approach: • Employ a small amount of labeled data • Exploit information from additional – presumably more available – unlabeled data • Few prior examples: EM, co- & self-training (Ostendorf '05) • Classifier: • Laplacian SVM (Sindhwani, Belkin & Niyogi '05) • Semi-supervised variant of the SVM • Exploits unlabeled examples • RBF kernel, typically 6 nearest neighbors

  13. Experiments • Pitch accent recognition: • Binary classification: Unaccented/Accented • 1000 instances, proportionally sampled • Labeled training: 200 unaccented, 100 accented • >80% accuracy (cf. 84% for an SVM with 15x the labeled data) • Mandarin tone recognition: • 4-way classification: n(n-1)/2 binary classifiers • 400 instances, balanced; 160 labeled • Clean lab speech, in-focus: 94% • cf. 99% for an SVM with 1000s of training samples; 85% for an SVM with 160 • Broadcast news: 70% • cf. <50% for a supervised SVM with 160 training samples; 74% with 4x the training data

  14. Unsupervised Learning • Question: • Can we identify the tone structure of a language from the acoustic space without training? • Analogous to language acquisition • Significant recent research in unsupervised clustering • Established approaches: k-means • Spectral clustering: eigenvector decomposition of the affinity matrix • (Shi & Malik 2000, Fischer & Poland 2004, BNS 2004) • Little research for tone • Self-organizing maps (Gauthier et al, 2005) • Tones identified in lab speech using f0 velocities

  15. Unsupervised Pitch Accent • Pitch accent clustering: • 4-way distinction: 1000 samples, proportional • 2-16 clusters constructed • Assign the most frequent class label to each cluster • Learner: • Asymmetric k-lines clustering (Fischer & Poland '05): • Context-dependent kernel radii, non-spherical clusters • >78% accuracy • Context effects: • Feature vectors with and without context perform comparably
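The evaluation step used throughout these clustering experiments – assign each cluster its most frequent gold label, then score accuracy – can be sketched as (the function name is illustrative):

```python
import numpy as np

def cluster_majority_accuracy(cluster_ids, labels):
    """Score a clustering as the slides do: give every cluster the most
    frequent gold label among its members, then measure accuracy."""
    cluster_ids = np.asarray(cluster_ids)
    labels = np.asarray(labels)
    pred = np.empty_like(labels)
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        vals, counts = np.unique(labels[members], return_counts=True)
        pred[members] = vals[counts.argmax()]   # majority label
    return (pred == labels).mean()
```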

  16. Contrasting Clustering • Approaches • 3 spectral approaches: • Asymmetric k-lines (Fischer & Poland 2004) • Symmetric k-lines (Fischer & Poland 2004) • Laplacian Eigenmaps (Belkin, Niyogi & Sindhwani 2004) • Binary weights, k-lines clustering • K-means: standard Euclidean distance • # of clusters: 2-16 • Best results: >78% • 2 clusters: asymmetric k-lines; >2 clusters: k-means • With larger numbers of clusters, the approaches perform more similarly

  17. Contrasting Learners

  18. Tone Clustering • Mandarin four tones: • 400 samples: balanced • 2-phase clustering: 2-3 clusters each • Asymmetric k-lines • Clean read speech: • In-focus syllables: 87% (cf. 99% supervised) • In-focus and pre-focus: 77% (cf. 93% supervised) • Broadcast news: 57% (cf. 74% supervised) • Contrast: • K-means: In-focus syllables: 74.75% • Requires more clusters to reach asymm. k-lines level

  19. Tone Structure • First phase of clustering splits high/rising from low/falling by slope • Second phase splits by pitch height or slope

  20. Conclusions • Exploiting unlabeled examples for tone and pitch accent • Semi- and Un-supervised approaches • Best cases approach supervised levels with less training • Leveraging both labeled & unlabeled examples best • Both spectral approaches and k-means effective • Contextual information less well-exploited than in supervised case • Exploit acoustic structure of tone and accent space

  21. Future Work • Additional languages, tone inventories • Cantonese - 6 tones, • Bantu family languages – truly rare data • Language acquisition • Use of child directed speech as input • Determination of number of clusters

  22. Thanks • V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C-C. Cheng & C. Lin • Dinoj Surendran, Siwei Wang, Yi Xu • This work supported by NSF Grant #0414919 • http://people.cs.uchicago.edu/~levow/tai

  23. Spectral Clustering in a Nutshell • Basic spectral clustering • Build affinity matrix • Determine dominant eigenvectors and eigenvalues of the affinity matrix • Compute clustering based on them • Approaches differ in: • Affinity matrix construction • Binary weights, conductivity, heat weights • Clustering: cut, k-means, k-lines
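The three nutshell steps above can be sketched directly. This uses heat (Gaussian) weights and a k-means final step, one of the variant combinations the slide lists; sigma, the symmetric normalization, and the farthest-first initialization are illustrative choices, not the talk's settings:

```python
import numpy as np

def spectral_cluster(X, n_clusters, sigma=1.0, n_iter=50):
    """Nutshell spectral clustering: affinity matrix -> dominant
    eigenvectors -> cluster in the embedded space."""
    # 1. Build the affinity matrix with heat-kernel weights
    sq = ((X[:, None] - X[None]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma ** 2))
    # 2. Dominant eigenvectors of the symmetrically normalized affinity
    d = A.sum(1)
    M = A / np.sqrt(d[:, None] * d[None])
    vals, vecs = np.linalg.eigh(M)          # eigenvalues ascending
    emb = vecs[:, -n_clusters:]             # top eigenvectors as embedding
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # 3. Cluster the embedded rows: k-means, farthest-first init
    centers = [emb[0]]
    while len(centers) < n_clusters:
        d2 = np.min([((emb - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(emb[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        assign = ((emb[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(n_clusters):
            if (assign == c).any():
                centers[c] = emb[assign == c].mean(0)
    return assign
```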

  24. K-Lines Clustering Algorithm • Due to Fischer & Poland 2005 • 1. Initialize vectors m1...mK (e.g. randomly, or as the first K eigenvectors of the spectral data yi) • 2. For j = 1...K: define Pj as the set of indices of all points yi that are closest to the line defined by mj, and create the matrix Mj whose columns are the corresponding vectors yi, i in Pj • 3. Compute the new value of every mj as the first eigenvector of Mj MjT • 4. Repeat from 2 until the mj's do not change
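The four steps above can be sketched directly in NumPy: each cluster is the line through the origin spanned by a unit vector m_j, and a point belongs to the line with the largest absolute projection. Random initialization and a fixed iteration cap (standing in for the convergence test in step 4) are simplifying assumptions:

```python
import numpy as np

def k_lines(Y, k, n_iter=50, seed=0):
    """K-lines clustering (after Fischer & Poland 2005) on the rows of Y,
    e.g. the spectral embedding of the data."""
    rng = np.random.RandomState(seed)
    n, d = Y.shape
    # 1. Initialize unit vectors m_1..m_K randomly
    M = rng.randn(k, d)
    M /= np.linalg.norm(M, axis=1, keepdims=True)
    for _ in range(n_iter):
        # 2. Assign each y_i to its closest line: for unit m_j, distance
        # to span(m_j) is minimized when |<y_i, m_j>| is maximized
        assign = np.abs(Y @ M.T).argmax(1)
        # 3. New m_j = first (top) eigenvector of M_j M_j^T, where M_j
        # stacks the points assigned to line j as columns
        for j in range(k):
            pts = Y[assign == j]
            if len(pts) == 0:
                continue                      # leave an empty line as-is
            S = pts.T @ pts                   # = M_j M_j^T
            vals, vecs = np.linalg.eigh(S)
            M[j] = vecs[:, -1]                # eigenvector of largest value
        # 4. (Fixed iteration cap stands in for the convergence check)
    return assign
```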

  25. Asymmetric Clustering • Replace the fixed-width Gaussian kernel with one using context-dependent, per-point radii • (Fischer & Poland, TR-IDSIA-12-04, p. 12) • where tau = 2d+1 or 10; results are largely insensitive to tau

  26. Laplacian SVM • Manifold regularization framework • Hypothesize intrinsic (true) data lies on a low dimensional manifold, • Ambient (observed) data lies in a possibly high dimensional space • Preserves locality: • Points close in ambient space should be close in intrinsic • Use labeled and unlabeled data to warp function space • Run SVM on warped space
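The objective this framework optimizes, in the standard form from Belkin, Niyogi & Sindhwani (reproduced here from the literature; the slide itself showed no formula), balances loss on the labeled points, ambient smoothness, and intrinsic smoothness along the graph:

```latex
f^{*} = \arg\min_{f \in \mathcal{H}_{K}}
    \frac{1}{l} \sum_{i=1}^{l} V\!\left(x_i, y_i, f\right)
    + \gamma_A \lVert f \rVert_{K}^{2}
    + \frac{\gamma_I}{(l+u)^{2}} \, \mathbf{f}^{\top} L \mathbf{f}
```

Here V is the hinge loss for Laplacian SVM, the middle term penalizes complexity in the ambient space, and the last term – with f the vector of values on all l+u points and L the graph Laplacian – enforces smoothness along the estimated manifold.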

  27. Laplacian SVM (Sindhwani)

  28. Input: l labeled and u unlabeled examples • Output: estimated function f • Algorithm: • Construct the adjacency graph; compute the graph Laplacian • Choose kernel K(x,y); compute the Gram matrix K • Compute the expansion coefficients by solving the SVM problem in the warped space • And output f*(x) as the kernel expansion over all l+u points

  29. Current and Future Work • Interactions of tone and intonation • Recognition of topic and turn boundaries • Effects of topic and turn cues on tone realization • Child-directed speech & tone learning • Support for computer-assisted tone learning • Structured sequence models for tone • Sub-syllable segmentation & modeling • Feature assessment • Band energy and intensity in tone recognition

  30. Related Work • Tonal coarticulation: • Xu & Sun,02; Xu 97;Shih & Kochanski 00 • English pitch accent • X. Sun, 02; Hasegawa-Johnson et al, 04; Ross & Ostendorf 95 • Lexical tone recognition • SVM recognition of Thai tone: Thubthong 01 • Context-dependent tone models • Wang & Seneff 00, Zhou et al 04

  31. Pitch Target Approximation Model • Pitch target: • A linear model • Exponentially approached by the surface pitch • In practice, assume the target is well-approximated by its mid-point (Sun, 02)
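The slide's formulas were images; in the usual statement of the model (Xu et al.), the underlying pitch target is linear in time and the surface F0 approaches it exponentially. This is a reconstruction from the standard literature, not the slide itself:

```latex
% Linear pitch target: slope a, height b
T(t) = a\,t + b
% Surface F0 decays exponentially toward the target
f_{0}(t) = T(t) + \bigl(f_{0}(0) - T(0)\bigr)\, e^{-\lambda t}
```

Here a and b correspond to the slope and height features of the local representation, and lambda sets the rate of approach to the target.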

  32. Classification Experiments • Classifier: Support Vector Machine • Linear kernel • Multiclass formulation • SVMlight (Joachims), LibSVM (Cheng & Lin 01) • 4:1 training / test splits • Experiments: Effects of • Context position: preceding, following, none, both • Context encoding: Extended/Difference • Context type: local, phrasal
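A minimal sketch of this configuration, with scikit-learn standing in for the SVMlight/LibSVM tools named on the slide. The data here is synthetic and all parameter values are illustrative; only the linear kernel, one-vs-one multiclass scheme, and 4:1 split come from the talk:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Synthetic stand-in for 4-class tone feature vectors
X = np.vstack([rng.randn(100, 8) + 3.0 * c for c in range(4)])
y = np.repeat(np.arange(4), 100)

# 4:1 training / test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
clf = SVC(kernel="linear")   # one-vs-one multiclass by default
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```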

  33. Results: Local Context

  34. Results: Local Context

  35. Results: Local Context

  36. Discussion: Local Context • Any context information improves over none • Preceding context information consistently improves over none or following context information • English: Generally more context features are better • Mandarin: Following context can degrade • Little difference in encoding (Extend vs Diffs) • Consistent with phonological analysis (Xu) that carryover coarticulation is greater than anticipatory

  37. Results & Discussion: Phrasal Context • Phrase contour compensation enhances recognition • Simple strategy • Use of non-linear slope compensation may improve results further

  38. Context: Summary • Employ common acoustic representation • Tone (Mandarin), pitch accent (English) • SVM classifiers - linear kernel: 76%, 81% • Local context effects: • Up to > 20% relative reduction in error • Preceding context greatest contribution • Carryover vs anticipatory • Phrasal context effects: • Compensation for phrasal contour improves recognition

  39. Aside: More Tones • Cantonese: • CUSENT corpus of read broadcast news text • Same feature extraction & representation • 6 tones: • High level, high rise, mid level, low fall, low rise, low level • SVM classification: • Linear kernel: 64%; Gaussian kernel: 68% • Tones 3 and 6: 50% pairwise – mutually indistinguishable • Human levels: no context: 50%; with context: 68% • Augment with the syllable's phone sequence • 86% accuracy: for 90% of syllables with tone 3 or 6, one of the two dominates

  40. Aside: Voice Quality & Energy • By Dinoj Surendran • Assess local voice quality and energy features for tone • Not typically associated with Mandarin • Considered: • VQ: NAQ, AQ, etc; Spectral balance; Spectral Tilt; Band energy • Useful: Band energy significantly improves • Esp. neutral tone • Supports identification of unstressed syllables • Spectral balance predicts stress in Dutch

  41. Roadmap • Challenges for Tone and Pitch Accent • Contextual effects • Training demands • Modeling Context for Tone and Pitch Accent • Data collections & processing • Integrating context • Context in Recognition • Reducing Training demands • Data collections & structure • Semi-supervised learning • Unsupervised clustering • Conclusion

  42. Strategy: Context • Exploit contextual information • Features from adjacent syllables • Height, shape: direct, relative • Compensate for phrase contour • Analyze impact of • Context position, context encoding, context type • > 20% relative improvement over no context
