An Iterative Technique for Segmenting Speech and Text Alignment


Arthur R. Toth

Speech Seminar - 4/18/2003

Basic Problem
  • Have Large Audio File, Associated Text
  • Want to Align Text With Audio
    • Useful for Synthesis
    • Useful for Acoustic Modeling
  • Doing this manually is tedious
  • What if it could be done automatically?
    • or even if part could be done automatically?
Related Problem
  • Splitting audio file can help
  • Phrases can be good candidates
    • Can only be so long (the speaker has to breathe)
    • Short enough that forced alignment is feasible
    • Existing work on predicting break locations
  • But then you need to split associated text
Constraints
  • Different kinds of data are available
    • Acoustic data, i.e. waveform
    • Supra-segmental information
  • For our first attempts, we are trying to see how far we can get using only the waveform
  • Differs from strategies which use word info
    • cf. Wang & Hirschberg, Wightman et al.
Data Set
  • Boston University Radio Corpus
  • Single speaker monologue
    • No dialogue turn information
  • Female newscaster
  • Some idiosyncrasies
    • Loud breathing
    • Broad f0 range, sometimes large dips
Segmenting Strategy
  • Want to focus on Phrase Break Levels > 2
  • Tool for first approximation: vad
    • end-pointer available from Mississippi State University
    • public domain
    • uses power and zero-crossings
    • lists beginnings and ends of found segments
  • http://www.isip.msstate.edu/projects/speech/software/legacy/signal_detector/index.html
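The slide says only that the end-pointer uses power and zero-crossings and lists the beginnings and ends of the segments it finds. The sketch below illustrates that general idea and is not the ISIP tool itself: the frame length, both thresholds, and the rule "a frame is speech if either feature exceeds its threshold" are assumptions made for the example.

```python
def frame_features(samples, frame_len):
    """Return (power, zero_crossing_rate) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Count sign changes between neighbouring samples.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((power, crossings / (frame_len - 1)))
    return feats


def find_segments(samples, frame_len, power_thresh, zcr_thresh):
    """List (start_sample, end_sample) for each run of frames judged to be speech.

    Assumed decision rule: a frame is speech if its power OR its
    zero-crossing rate is above the given threshold.
    """
    feats = frame_features(samples, frame_len)
    segments = []
    seg_start = None
    for i, (power, zcr) in enumerate(feats):
        is_speech = power > power_thresh or zcr > zcr_thresh
        if is_speech and seg_start is None:
            seg_start = i * frame_len
        elif not is_speech and seg_start is not None:
            segments.append((seg_start, i * frame_len))
            seg_start = None
    if seg_start is not None:  # signal ended while still inside a segment
        segments.append((seg_start, len(feats) * frame_len))
    return segments


# Toy signal: silence, a loud alternating burst, silence.
toy = [0.0] * 100 + [1.0, -1.0] * 50 + [0.0] * 100
segments = find_segments(toy, frame_len=50, power_thresh=0.1, zcr_thresh=0.5)
```

On real recordings the thresholds would need tuning, and loud breathing of the kind noted for this speaker could easily be picked up as speech by a rule this simple.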
Splitting Text - First Pass
  • Use Festival to predict lengths of words
  • Linearly scale total predicted length to actual length
  • Look at positions of segment endpoints from vad and use scaled length predictions to predict the word at each endpoint
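The first pass above can be sketched as follows. The predicted word durations are taken as plain numbers (in the talk they come from Festival; no Festival call is shown here), and the choice of the word whose scaled end time lies nearest to each endpoint is an assumption, since the slide does not say how ties or near misses are resolved. Function names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def scale_word_ends(predicted_durations, actual_total):
    """Cumulative word end times after linearly scaling to the actual length."""
    scale = actual_total / sum(predicted_durations)
    return [t * scale for t in accumulate(predicted_durations)]


def predict_final_words(word_ends, segment_ends):
    """For each segment endpoint, pick the word whose scaled end time is nearest.

    Assumption: "nearest end time" is the matching rule.
    """
    finals = []
    for t in segment_ends:
        i = bisect_left(word_ends, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(word_ends)]
        finals.append(min(candidates, key=lambda j: abs(word_ends[j] - t)))
    return finals


# Five words predicted to last 4 s in total, in an 8 s recording.
word_ends = scale_word_ends([0.5, 0.5, 1.0, 1.0, 1.0], actual_total=8.0)
finals = predict_final_words(word_ends, segment_ends=[2.1, 5.8, 8.0])
```

Here `finals` holds, for each vad segment, the index of the word predicted to end that segment; these indices are the initial estimates that later refinement would adjust.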
Iterations
  • Refine estimates iteratively as follows:
  • In each iteration, work left-to-right
  • Use sphinx-align to score forced alignments
    • for the words up through the initially predicted final word
    • also try final words up to 2 before and 2 after
    • take the best-scoring list of words as the new estimate
  • Note: forced alignment can fail
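One left-to-right pass of the refinement described above might look like the sketch below. The scorer is passed in as `score_fn`, a hypothetical stand-in for running sphinx-align on one segment and reading back its score; it returns `None` when forced alignment fails. The fallback of keeping the old estimate when every candidate fails is an assumption, as is the convention that higher scores are better.

```python
def refine_once(finals, n_words, score_fn, window=2):
    """One left-to-right pass over the segments.

    finals   -- current estimate of each segment's final word index
    score_fn -- hypothetical callable (segment, first_word, last_word) -> score
                or None if forced alignment fails; higher is assumed better
    window   -- how many words before and after the estimate to try
    """
    new_finals = []
    prev = -1  # final word index of the previous segment
    for seg, guess in enumerate(finals):
        start = prev + 1
        # Assumed fallback if every candidate fails: keep the old estimate,
        # clamped so that segments stay in order.
        best = min(max(guess, start), n_words - 1)
        best_score = None
        for cand in range(guess - window, guess + window + 1):
            if cand < start or cand >= n_words:
                continue
            score = score_fn(seg, start, cand)
            if score is None:  # forced alignment failed for this candidate
                continue
            if best_score is None or score > best_score:
                best, best_score = cand, score
        new_finals.append(best)
        prev = best
    return new_finals


# Toy scorer: pretends the true final words are known and that one
# candidate makes the aligner fail.
truth = [2, 5, 9]

def toy_score(seg, first_word, last_word):
    if last_word == 6:
        return None
    return -abs(last_word - truth[seg])

estimate = [1, 7, 9]
for _ in range(5):  # the experiment in the talk ran 5 iterations
    estimate = refine_once(estimate, n_words=10, score_fn=toy_score)
```

Because each segment's first word depends on the final word chosen for the segment before it, an early mistake shifts every later search window, which is one reason to ask whether a strictly left-to-right pass is the best ordering.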
Experiment and Results
  • 5 iterations were run
  • Estimated word locations were compared with actual ones
    • Had to convert from times to words
    • Criterion: each break is associated with the most recent preceding word ending time
  • Most substantial improvement appeared to be in first iteration
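The time-to-word conversion used for the comparison can be sketched as below: a break time maps to the last word whose end time is at or before it. Tallying signed offsets between estimated and actual word indices is an added illustration of how such a comparison could be summarized; the talk does not say how its results were tabulated, and the numbers here are made up.

```python
from bisect import bisect_right
from collections import Counter


def break_time_to_word(word_end_times, break_time):
    """Index of the last word ending at or before break_time, or -1 if none."""
    return bisect_right(word_end_times, break_time) - 1


def offset_histogram(estimated_words, actual_words):
    """Count signed differences (estimated - actual) in word index."""
    return Counter(e - a for e, a in zip(estimated_words, actual_words))


# Made-up labeled word end times (seconds) and break times.
word_end_times = [0.4, 0.9, 1.5, 2.2]
actual = [break_time_to_word(word_end_times, t) for t in (0.9, 1.6)]
hist = offset_histogram([1, 3], actual)
```

A histogram concentrated at zero with a one-sided tail would indicate estimates that are close but systematically early or late.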
Discussion
  • Points close to correct improved quickly
  • Points further away didn’t improve as much
  • Window size probably too small
    • Need to expand window sizes, but keep other constraints in mind
    • Heuristic like Itakura rule might be handy
  • Many misses are only 1 word off, and biased
    • May result from measurement or labeling
Further Work
  • More sophisticated phrase break detection
    • Using a general purpose tool
    • Want the option of using supra-segmental data, if available
    • Would a Switching State-Space Model help? (Ghahramani & Hinton)
  • Is left-to-right iteration approach best?
  • Non-iterative model for splitting text?