An Iterative Technique for Segmenting Speech and Text Alignment

An Iterative Technique for Segmenting Speech and Text Alignment • Arthur R. Toth • Speech Seminar - 4/18/2003

Basic Problem • Have Large Audio File, Associated Text • Want to Align Text With Audio • Useful for Synthesis • Useful for Acoustic Modeling • Doing this manually is tedious • What if it could be done automatically? • or even if part could be done automatically?

Related Problem • Splitting the audio file can help • Phrases can be good candidates • Can only be so long (speaker has to breathe) • Short enough that forced alignment is feasible • Existing work on predicting break locations • But then you need to split the associated text

Constraints • Different kinds of data may be available • Acoustic data, i.e. the waveform • Supra-segmental information • For our first attempts, we are trying to see how far we can get using only the waveform • Differs from strategies which use word info • cf. Wang & Hirschberg, Wightman et al.

Data Set • Boston University Radio Corpus • Single speaker monologue • No dialogue turn information • Female newscaster • Some idiosyncrasies • Loud breathing • Broad f0 range, sometimes large dips

Segmenting Strategy • Want to focus on Phrase Break Levels > 2 • Tool for first approximation: vad • end-pointer available from Mississippi State University • public domain • uses power and zero-crossings • lists beginnings and ends of found segments • http://www.isip.msstate.edu/projects/speech/software/legacy/signal_detector/index.html
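To make the "power and zero-crossings" idea concrete, here is a minimal sketch of an end-pointer in Python. It is not the vad tool itself: the frame length, thresholds, and function names are all assumptions chosen for illustration, and a real detector would tune them to the recording.

```python
def frame_features(samples, frame_len):
    """Return (mean power, zero-crossing count) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((power, crossings))
    return feats


def detect_segments(samples, rate, frame_len=80, power_thresh=1e-3,
                    zc_thresh=20, min_gap_frames=3):
    """List (start_sec, end_sec) for each detected stretch of speech.

    A frame counts as active if its power is high, or if its power is
    moderate and it has many zero crossings (a rough cue for unvoiced
    sounds). All thresholds here are illustrative assumptions.
    """
    feats = frame_features(samples, frame_len)
    active = [p > power_thresh or (p > 0.1 * power_thresh and zc > zc_thresh)
              for p, zc in feats]
    segments, start, silence = [], None, 0
    for i, is_active in enumerate(active):
        if is_active:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_gap_frames:
                end = i - silence + 1  # one past the last active frame
                segments.append((start * frame_len / rate, end * frame_len / rate))
                start, silence = None, 0
    if start is not None:  # signal ended while still inside a segment
        end = len(active) - silence
        segments.append((start * frame_len / rate, end * frame_len / rate))
    return segments
```

The output mirrors what the slide describes: a list of beginnings and ends of found segments, in seconds. Gaps shorter than `min_gap_frames` are absorbed into the surrounding segment, so short pauses do not produce a break.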

Splitting Text - First Pass • Use Festival to predict lengths of words • Linearly scale total predicted length to actual length • Look at positions of segment endpoints from vad and use scaled length predictions to predict the word at each endpoint
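The first pass can be sketched as follows. The durations would come from Festival, but the sketch takes them as a plain list of seconds; the function name is hypothetical, and picking the word whose scaled end time is closest to each endpoint is an assumption, since the slide does not say how ties or near misses are resolved.

```python
from bisect import bisect_left
from itertools import accumulate


def predict_break_words(predicted_durations, total_audio_sec, break_times):
    """Map each break time to the index of the word predicted to end there.

    predicted_durations: per-word duration predictions in seconds.
    total_audio_sec: actual length of the audio file.
    break_times: segment end points (seconds) from the end-pointer.
    """
    # Linearly scale the predictions so they sum to the real audio length.
    scale = total_audio_sec / sum(predicted_durations)
    end_times = list(accumulate(d * scale for d in predicted_durations))

    result = []
    for t in break_times:
        i = bisect_left(end_times, t)
        # Compare the word ending just before t with the one ending at or after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(end_times)]
        result.append(min(candidates, key=lambda j: abs(end_times[j] - t)))
    return result
```

For example, with predicted durations `[0.5, 0.5, 1.0, 1.0, 1.0]` and 8 seconds of audio, the scale factor is 2 and the scaled word end times are 1, 2, 4, 6, and 8 seconds, so breaks at 2.1 s and 5.9 s map to words 1 and 3.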

Iterations • Refine estimates iteratively as follows: • In each iteration, work left-to-right • Use sphinx-align to score forced alignments • for words up through the initially predicted final word • also try final words up to 2 before and 2 after • take best scoring list of words as new estimate • Note: forced alignment can fail
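The refinement loop can be sketched in Python as below. The forced aligner is abstracted as a caller-supplied `score` function, a hypothetical stand-in for a call to sphinx-align. Two further assumptions are that higher scores are better and that each segment's word list starts right after the final word chosen for the previous segment.

```python
def refine_once(finals, score, window=2):
    """One left-to-right pass over the predicted final word of each segment.

    finals[k] is the current guess for the index of the last word in
    segment k. score(k, first, last) returns an alignment score for
    aligning words first..last to segment k, or None if alignment fails.
    """
    new_finals = []
    first = 0
    for seg, guess in enumerate(finals):
        # Fall back to the current guess if every candidate fails to align.
        best, best_score = max(guess, first), None
        for cand in range(guess - window, guess + window + 1):
            if cand < first:
                continue  # a segment needs at least one word
            s = score(seg, first, cand)
            if s is None:
                continue  # forced alignment failed for this word list
            if best_score is None or s > best_score:
                best, best_score = cand, s
        new_finals.append(best)
        first = best + 1
    return new_finals


def refine(finals, score, iterations=5, window=2):
    """Repeat the left-to-right pass a fixed number of times."""
    for _ in range(iterations):
        finals = refine_once(finals, score, window)
    return finals
```

Because each pass can move a boundary by at most `window` words, an estimate that starts far from the correct word needs several passes to get there. The scorer is also expected to return None for candidates past the end of the text.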

Experiment and Results • 5 iterations were run • Estimated word locations were compared with actual ones • Had to convert from times to words • Criterion: a break is associated with the most recent preceding word ending time • Most substantial improvement appeared to be in first iteration
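The time-to-word conversion used for scoring can be sketched as a sorted lookup. The function names and the signed-offset error measure are assumptions for illustration; the slide only states the criterion itself.

```python
from bisect import bisect_right


def break_time_to_word(word_end_times, break_time):
    """Index of the last word ending at or before break_time, or -1 if none.

    word_end_times must be sorted in increasing order (seconds).
    """
    return bisect_right(word_end_times, break_time) - 1


def word_offsets(word_end_times, reference_break_times, estimated_words):
    """Signed offset, in words, of each estimate from its reference break."""
    reference_words = [break_time_to_word(word_end_times, t)
                       for t in reference_break_times]
    return [est - ref for est, ref in zip(estimated_words, reference_words)]
```

An offset of 0 means the estimated break word matches the reference, and the sign shows whether the estimate falls before or after it.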

Discussion • Points close to correct improved quickly • Points farther away didn’t improve as much • Window size probably too small • Need to expand window sizes, but keep other constraints in mind • A heuristic like the Itakura rule might be handy • Many misses were only 1 off, and biased • May result from measurement or labeling

Further Work • More sophisticated phrase break detection • Using a general purpose tool • Want the option of using supra-segmental data, if available • Would a Switching State-Space Model help? (Ghahramani & Hinton) • Is left-to-right iteration approach best? • Non-iterative model for splitting text?
