An Iterative Technique for Segmenting Speech and Text Alignment

Arthur R. Toth. Speech Seminar - 4/18/2003.

Presentation Transcript


  1. An Iterative Technique for Segmenting Speech and Text Alignment Arthur R. Toth Speech Seminar - 4/18/2003

  2. Basic Problem • Have Large Audio File, Associated Text • Want to Align Text With Audio • Useful for Synthesis • Useful for Acoustic Modeling • Doing this manually is tedious • What if it could be done automatically? • or even if part could be done automatically?

  3. Related Problem • Splitting the audio file can help • Phrases can be good candidates • Can only be so long (speakers have to breathe) • Short enough that forced alignment is feasible • Existing work on predicting break locations • But then you need to split the associated text

  4. Constraints • Different Data is available • Acoustic data, i.e. waveform • Supra-segmental information • For our first attempts, we are trying to see how far we can get using only waveform • Differs from strategies which use word info • cf. Wang & Hirschberg, Wightman et al.

  5. Data Set • Boston University Radio Corpus • Single speaker monologue • No dialogue turn information • Female newscaster • Some idiosyncrasies • Loud breathing • Broad f0 range, sometimes large dips

  6. Segmenting Strategy • Want to focus on Phrase Break Levels > 2 • Tool for first approximation: vad • end-pointer available from Mississippi State University • public domain • uses power and zero-crossings • lists beginnings and ends of found segments • http://www.isip.msstate.edu/projects/speech/software/legacy/signal_detector/index.html
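To make "uses power and zero-crossings" concrete, here is a minimal Python sketch of a frame-based end-pointer. It is not the vad tool itself: the function names, the frame length, both thresholds, and the simple "either feature exceeds its threshold" rule are all assumptions chosen for illustration.

```python
def frame_features(samples, frame_len):
    """Return (power, zero_crossings) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Count sign changes between neighbouring samples.
        zc = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        feats.append((power, zc))
    return feats


def find_segments(samples, frame_len=160, power_thresh=0.01, zc_thresh=40):
    """List (start_sample, end_sample) spans judged to contain speech.

    Assumed rule (not taken from vad): a frame is speech if its power or
    its zero-crossing count exceeds a hypothetical fixed threshold.
    """
    segments, start = [], None
    for i, (power, zc) in enumerate(frame_features(samples, frame_len)):
        is_speech = power > power_thresh or zc > zc_thresh
        if is_speech and start is None:
            start = i * frame_len            # segment begins
        elif not is_speech and start is not None:
            segments.append((start, i * frame_len))  # segment ends
            start = None
    if start is not None:                    # audio ended during speech
        segments.append((start, len(samples)))
    return segments
```

Like vad, the sketch only lists the beginnings and ends of the segments it finds; in practice the thresholds would have to be tuned to the recording, for example so that loud breathing is not counted as speech.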

  7. Splitting Text - First Pass • Use Festival to predict lengths of words • Linearly scale total predicted length to actual length • Look at positions of segment endpoints from vad and use scaled length predictions to predict the final word of each segment
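The first pass can be sketched in Python as follows. The per-word durations are assumed to have been obtained from Festival already (no Festival call is shown), and choosing the word whose scaled end time is nearest to each segment endpoint is an assumption; the slides do not state the exact matching rule. All function names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def scaled_word_ends(predicted_durations, actual_total):
    """Scale predicted word durations linearly so that they sum to the
    actual audio length, and return each word's cumulative end time."""
    scale = actual_total / sum(predicted_durations)
    return list(accumulate(d * scale for d in predicted_durations))


def predict_final_words(word_ends, segment_ends):
    """For each segment end time, return the index of the word whose
    scaled end time is nearest (assumed matching rule)."""
    finals = []
    for t in segment_ends:
        i = bisect_left(word_ends, t)
        # Only the words just before and just after t can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(word_ends)]
        finals.append(min(candidates, key=lambda j: abs(word_ends[j] - t)))
    return finals
```

For example, predicted durations of 1, 2, 1 and 4 units scaled to a 4-second file give word end times of 0.5, 1.5, 2.0 and 4.0 seconds, so a segment ending at 1.4 seconds is assigned the second word as its final word.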

  8. Iterations • Refine estimates iteratively as follows: • In each iteration, work left-to-right • Use sphinx-align to score forced alignments • for the words up through the initially predicted final word • also try final words up to 2 before and 2 after • take best scoring list of words as new estimate • Note: forced alignment can fail
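A rough Python sketch of this refinement loop is given below. The `score_fn` callback is a hypothetical stand-in for running sphinx-align on one segment; it is assumed to return a score where higher is better, or `None` when forced alignment fails. Keeping every segment at least one word long and keeping the old estimate when all candidates fail are also assumptions, not details from the slides.

```python
def refine_breaks(breaks, n_words, score_fn, window=2, iterations=5):
    """Refine the estimated final-word index of each audio segment.

    breaks[k] is the index of the last word assigned to segment k.
    score_fn(k, first_word, last_word) is a hypothetical scorer standing
    in for sphinx-align; it returns None if the alignment fails.
    """
    breaks = list(breaks)
    for _ in range(iterations):
        for k in range(len(breaks)):                 # work left to right
            first = breaks[k - 1] + 1 if k > 0 else 0
            # Leave at least one word for the following segment.
            upper = breaks[k + 1] - 1 if k + 1 < len(breaks) else n_words - 1
            best, best_score = breaks[k], None
            for cand in range(breaks[k] - window, breaks[k] + window + 1):
                if cand < first or cand > upper:
                    continue                         # would empty a segment
                score = score_fn(k, first, cand)
                if score is not None and (best_score is None or score > best_score):
                    best, best_score = cand, score
            breaks[k] = best                         # unchanged if all failed
    return breaks
```

Because each pass can move a break by at most `window` words, an estimate that starts more than two words from the true break needs several iterations to reach it, assuming the scores keep pointing in the right direction.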

  9. Experiment and Results • 5 iterations were run • Estimated word locations were compared with actual ones • Had to convert from times to words • Criterion: a break is associated with the most recent preceding word ending time • Most substantial improvement appeared to be in first iteration
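The time-to-word conversion can be illustrated with a short Python sketch. It assumes the reference word ending times are available as a sorted list; the function names are hypothetical, and treating a break that falls exactly on a word ending as belonging to that word is an assumption.

```python
from bisect import bisect_right


def break_time_to_word(break_time, word_end_times):
    """Return the index of the last word ending at or before break_time,
    or None if the break precedes every word ending."""
    i = bisect_right(word_end_times, break_time) - 1
    return i if i >= 0 else None


def word_offsets(estimated, actual):
    """Signed error in words (estimated minus actual) for each break."""
    return [e - a for e, a in zip(estimated, actual)]
```

With offsets in this form, a miss that is only one word off shows up as +1 or -1, and a systematic bias shows up as one sign dominating.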

  10. Discussion • Points close to correct improved quickly • Points further away didn’t improve as much • Window size probably too small • Need to expand window sizes, but keep other constraints in mind • Heuristic like Itakura rule might be handy • Many misses were only 1 word off, and biased • May result from measurement or labeling

  11. Further Work • More sophisticated phrase break detection • Using a general purpose tool • Want the option of using supra-segmental data, if available • Would a Switching State-Space Model help? (Ghahramani & Hinton) • Is left-to-right iteration approach best? • Non-iterative model for splitting text?