
Automatic Phoneme Alignment


Presentation Transcript


  1. Automatic Phoneme Alignment Hung-Yi Lo, Jen-Wei Kuo and Hsin-Min Wang NGASR Summer Meeting 2011 July 12, 2011

  2. The Automatic Phoneme Alignment Problem • Word: "attitude" • Context: in "A good attitude is unbeatable" • Phonetic transcription: ae dx ax tcl t ux dcl • [Figure: acoustic signal and spectrogram aligned with the phone sequence]

  3. Outline • A Minimum Boundary Error Framework for Automatic Phoneme Alignment • Phonetic Boundary Refinement Using Support Vector Machine • Phoneme Boundary Refinement Using Ranking Methods

  4. A Minimum Boundary Error Framework for Automatic Phoneme Alignment

  5. MBE Discriminative Training • Let $\Psi$ be the set of possible phonetic alignments for a given observation utterance $O$, and let $S$ be the true phonetic alignment in the canonical transcription. • Then the expected boundary error is expressed as $\bar{E} = \sum_{S' \in \Psi} P(S' \mid O)\, BE(S', S)$, where the summation is over all possible alignments in $\Psi$, $P(S' \mid O)$ is the posterior probability of alignment $S'$, and $BE(S', S)$ is the boundary error of $S'$ compared to $S$.

  6. MBE Forced Alignment (N-best Rescoring) • A set $\Psi$ of hypothesized alignments is generated by a conventional HMM-based forced alignment approach • We want to select from $\Psi$ the alignment with the minimal expected boundary error • We examine each hypothesized alignment one by one

  7.–8. MBE Forced Alignment (cont.) • For each hypothesized alignment $S_i$: if $S_j \in \Psi$ were the true alignment, then selecting $S_i$ would incur the boundary error $BE(S_i, S_j)$

  9. MBE Forced Alignment (cont.) • The expected boundary error of $S_i$ can be expressed as $\bar{E}(S_i) = \sum_{S_j \in \Psi} P(S_j \mid O)\, BE(S_i, S_j)$ • The best alignment is found by minimizing the expected error: $\hat{S} = \arg\min_{S_i \in \Psi} \bar{E}(S_i)$
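A minimal sketch of this N-best rescoring rule in Python. The boundary-error definition (sum of absolute frame differences), the toy boundary sequences, and the posteriors are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def boundary_error(align_a, align_b):
    # BE(S_i, S_j): sum of absolute frame differences between
    # corresponding boundaries of two alignments of the same phone string.
    return sum(abs(a - b) for a, b in zip(align_a, align_b))

def mbe_rescore(alignments, posteriors):
    # S_hat = argmin_i  sum_j P(S_j|O) * BE(S_i, S_j)
    expected = [sum(p * boundary_error(s_i, s_j)
                    for s_j, p in zip(alignments, posteriors))
                for s_i in alignments]
    return alignments[int(np.argmin(expected))]

# Toy usage: three hypothesized boundary sequences (frame indices)
# with assumed posteriors P(S_j | O).
hyps = [[10, 25, 40], [12, 24, 41], [11, 26, 39]]
post = [0.5, 0.3, 0.2]
print(mbe_rescore(hyps, post))  # -> [10, 25, 40]
```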

  10. Experimental Setup • TIMIT corpus • Training set: 4,546 utterances (3.87 hrs) • Test set: 1,646 utterances (1.41 hrs) • Context-independent HMMs are used • 50 phone models (left-to-right topology) • 3 states per HMM • 5,265 Gaussian mixtures in total • Frame window size and shift are 20 ms and 5 ms, respectively • 12 MFCCs and log energy, plus their first and second differences • CMS and CVN are applied • Evaluation metric: percentage of boundaries placed within a given tolerance of the true boundaries
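For concreteness, a sketch of a comparable front-end using librosa. The exact parameters of the original system are not given beyond the slide, so the FFT size, the use of the 0th cepstral coefficient as the energy term, and per-utterance CMS/CVN are assumptions:

```python
import librosa
import numpy as np

def extract_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr,
        n_mfcc=13,                      # 12 MFCCs + 0th coeff as energy term
        n_fft=512,                      # assumed FFT size
        win_length=int(0.020 * sr),     # 20 ms window
        hop_length=int(0.005 * sr),     # 5 ms shift
    )
    d1 = librosa.feature.delta(mfcc)             # first differences
    d2 = librosa.feature.delta(mfcc, order=2)    # second differences
    feat = np.vstack([mfcc, d1, d2])             # 39 x T
    # CMS and CVN: per-coefficient mean subtraction and variance norm.
    feat = (feat - feat.mean(axis=1, keepdims=True)) / (
        feat.std(axis=1, keepdims=True) + 1e-8)
    return feat.T                                # T x 39
```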

  11. Experimental Results • Boundary Accuracy

  12. Phonetic Boundary Refinement Using Support Vector Machine

  13. Boundary Refinement Using SVM • [Figure: acoustic signal and spectrogram; the boundaries detected by HMM-based forced alignment are taken as hypothesized boundaries and fed to the SVM boundary classifier, whose output gives the refined boundaries]

  14. Feature Extraction • Each frame is represented by a 45-dim vector: • 39 MFCC-based coefficients • Zero crossing rate • Bisector frequency • Burst degree • Spectral entropy • General weighted entropy • Subband energy • Two additional features are extracted for each hypothesized boundary: • Symmetrical Kullback-Leibler distance • Spectral feature transition rate • Each hypothesized boundary is thus represented by a 92-dim augmented vector: the feature vectors of the frames immediately to its left and right, plus the two boundary features
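A minimal sketch of assembling the 92-dim boundary representation (45-dim left frame + 45-dim right frame + 2 boundary features). The helper names and the convention that the boundary lies between frames b-1 and b are assumptions:

```python
import numpy as np

def boundary_vector(frames, b, skld, transition_rate):
    """frames: T x 45 per-frame feature matrix; b: hypothesized boundary
    index (the boundary is taken to lie between frames b-1 and b);
    skld, transition_rate: the two boundary features, computed elsewhere."""
    left, right = frames[b - 1], frames[b]
    return np.concatenate([left, right, [skld, transition_rate]])  # (92,)
```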

  15. Phone-Transition-Dependent Classifier • Training data is always limited, so we cannot train a classifier for every type of phone transition • Many phone transitions have similar acoustic characteristics, so we can partition them into clusters • Phone transitions with little training data are then covered by the classifiers of the clusters they belong to • Two methods for phone transition clustering: • K-means-based clustering (KM) • Decision-tree-based clustering (DT)
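One plausible realization of the K-means variant (KM), sketched in Python: represent each phone-transition type by the mean of its boundary feature vectors, cluster those means, and train one classifier per cluster. The representation and the number of clusters are assumptions, not details from the slides:

```python
from collections import defaultdict
import numpy as np
from sklearn.cluster import KMeans

def cluster_transitions(samples, n_clusters=40):
    """samples: list of ((left_phone, right_phone), feature_vector).
    Returns a map from transition type to cluster id."""
    by_type = defaultdict(list)
    for trans, vec in samples:
        by_type[trans].append(vec)
    types = sorted(by_type)
    # One mean vector per transition type, then K-means over the means.
    means = np.stack([np.mean(by_type[t], axis=0) for t in types])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(means)
    # Rare transitions share the classifier of the cluster they fall in.
    return dict(zip(types, labels))
```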

  16. Training Data for Boundary Classifier • Positive samples: the feature vectors associated with the true phone boundaries • Negative samples: randomly selected feature vectors at least 20 ms away from the true boundaries • [Figure: positive and negative instances along the utterance feeding the classification model]
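A sketch of this sampling scheme, assuming the 5 ms frame shift from the experimental setup (so 20 ms = 4 frames) and a hypothetical negative-to-positive ratio:

```python
import random

def make_training_indices(true_boundaries, n_frames,
                          n_neg_per_pos=2, min_dist=4):
    """Returns (positive, negative) frame indices: positives at the true
    boundaries, negatives sampled at least min_dist frames (20 ms at a
    5 ms shift) from every true boundary."""
    positives = list(true_boundaries)
    far = [t for t in range(n_frames)
           if all(abs(t - b) >= min_dist for b in true_boundaries)]
    negatives = random.sample(
        far, min(len(far), n_neg_per_pos * len(positives)))
    return positives, negatives
```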

  17. Experimental Results

  18. Phonetic Boundary Refinement Using Ranking Methods

  19. Training Data for Boundary Classifier • Positive samples: the feature vectors associated with the true phone boundaries • Negative samples: randomly selected feature vectors at least 20 ms away from the true boundaries • However, this classification-based method has two drawbacks • [Figure: positive and negative instances feeding the classification model]

  20. First Drawback: Losing Information (1/2) • Only the signal characteristics at the boundary and far away from it are used • What about the information near the boundary? • [Figure: positive and negative instances feeding the classification model]

  21. First Drawback: Losing Information (2/2) • Preference ranking: • Instances extracted at the true boundaries: high preference • Nearby instances: medium preference • Far-away instances: low preference • [Figure: instances relabeled from positive/negative to High / Mid / Low preference]

  22. Second Drawback: Imbalanced Training • There are many negative instances but only a limited number of positive instances • General classification algorithms will be biased toward predicting all instances as negative, since they are trained to minimize the number of incorrectly classified instances

  23. Boundary Refinement as a Ranking Problem • Learn a function H: X → R, where H(xi) > H(xj) means that instance xi is preferred to xj • A hypothesized boundary closer to the true boundary should receive a higher score • We only care about the relative order • Correct order: A-B-C-D • OK: {A: -100, B: -10, C: 0, D: 1000} • OK: {A: 0.1, B: 0.3, C: 0.4, D: 0.41} • We exploit two learning-to-rank methods: • Ranking SVM • RankBoost
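A common way to realize Ranking SVM is the pairwise-difference reduction: every preference "x_i should outrank x_j" becomes the sample (x_i - x_j) with label +1. The sketch below uses that reduction with a linear SVM; it is one standard realization, not necessarily the authors' exact formulation:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_ranker(pairs):
    """pairs: list of (preferred_vec, other_vec). Trains a linear SVM on
    pairwise difference vectors; the learned weights w define the
    scoring function H(x) = w . x."""
    X, y = [], []
    for hi, lo in pairs:
        X.append(hi - lo); y.append(+1)   # preferred minus other -> +1
        X.append(lo - hi); y.append(-1)   # mirrored pair for symmetry
    svm = LinearSVC(C=1.0).fit(np.array(X), np.array(y))
    return svm.coef_.ravel()

def score(w, x):
    # Higher score = hypothesized boundary ranked closer to the truth.
    return float(w @ x)
```

Note that only differences of scores matter here, which is exactly the relative-order property the slide emphasizes.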

  24.–25. Generation of Training Pairs • Four ordered ranking lists, (1)–(4), are generated from each true boundary • [Figure: instances around a true boundary arranged into four ordered lists]

  26. Generation of Training Pairs (cont.) • Couple the preferred instances with each remaining instance to form training pairs • [Figure: pairings between preferred instances and the remaining instances in each list]
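A sketch of pair generation consistent with these slides: candidates nearer a true boundary are preferred to farther ones, and each preferred instance is coupled with every less-preferred instance in its list. The slides build four ordered lists per boundary; for simplicity this sketch builds one per side, and the window radius is an assumption:

```python
def make_pairs(frames, true_b, radius=4):
    """frames: T x D per-candidate feature vectors; true_b: index of the
    true boundary. Returns (preferred, other) pairs for the ranker."""
    pairs = []
    for side in (-1, +1):                 # one ordered list per side
        ranked = [frames[true_b + side * d]
                  for d in range(radius + 1)
                  if 0 <= true_b + side * d < len(frames)]
        for k, hi in enumerate(ranked):   # each instance outranks all
            for lo in ranked[k + 1:]:     # instances farther out
                pairs.append((hi, lo))
    return pairs
```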

  27. Experimental Results

  28. References • Jen-Wei Kuo and Hsin-Min Wang, "Minimum Boundary Error Training for Automatic Phonetic Segmentation," in Proc. ICSLP, 2006 • Jen-Wei Kuo and Hsin-Min Wang, "A Minimum Boundary Error Framework for Automatic Phonetic Segmentation," in Proc. ISCSLP, 2006 • Hung-Yi Lo and Hsin-Min Wang, "Phonetic Boundary Refinement Using Support Vector Machine," in Proc. ICASSP, 2007 • Hung-Yi Lo and Hsin-Min Wang, "Phone Boundary Refinement Using Ranking Methods," in Proc. ISCSLP, 2010

  29. Thank you!
