Automatic Speech Recognition


Presentation Transcript


  1. Automatic Speech Recognition Julia Hirschberg CS 6998

  2. What is speech recognition? • Transcribing words? • Understanding meaning?

  3. It’s hard to recognize speech... • People speak in very different ways • Across-speaker variation • Within-speaker variation • Speech sounds vary according to their context • Environments vary with respect to noise • A transcription system must handle all of this and still produce a transcript of the spoken words

  4. Success: low Word Error Rate (WER) = (S + I + D) / N * 100, where S, I, and D count substitutions, insertions, and deletions against an N-word reference transcript • Example: the hypothesis “Thesis test” for the reference “This is a test” has 1 substitution and 2 deletions over 4 reference words: 75% WER (computed in the sketch below) • Progress driven by: • Very large training corpora • Fast machines and cheap storage • Bake-offs (shared evaluation benchmarks) • A market for real-time systems • New representations and algorithms: Finite State Transducers
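A minimal sketch of the WER computation via minimum edit distance, assuming whitespace tokenization; the function name is invented for illustration:

```python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # (S + I + D) / N * 100, with N the reference length
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# The slide's example: 1 substitution + 2 deletions over 4 words = 75%.
print(wer("this is a test", "thesis test"))  # 75.0
```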

  5. Varieties of Speech Recognition

  6. ASR and the Noisy Channel Model • Source --> noisy channel --> Hypothesis • Find the most likely sentence W in the language to have generated the observed “noisy” acoustic input O • W' = argmax_W P(W|O) • By Bayes' rule: W' = argmax_W P(O|W) P(W) / P(O)

  7. P(O) is the same for every hypothesized W, so • W' = argmax_W P(O|W) P(W) • P(W) is the prior (language model); P(O|W) is the (acoustic) likelihood
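As a toy illustration of this decision rule, the sketch below scores each candidate sentence by log P(O|W) + log P(W) and returns the argmax; the candidate strings and probabilities are invented, not from the slides:

```python
import math

# Noisy-channel decision rule: W' = argmax_W P(O|W) P(W), computed in
# log space to avoid underflow.  All numbers here are made up.
def best_hypothesis(hypotheses):
    # hypotheses: list of (W, log P(O|W), log P(W)) triples
    return max(hypotheses, key=lambda h: h[1] + h[2])[0]

candidates = [
    ("this is a test", math.log(0.02), math.log(1e-4)),
    ("thesis test",    math.log(0.03), math.log(1e-6)),
]
# The acoustic model slightly prefers "thesis test", but the language
# model prior P(W) tips the decision the other way.
print(best_hypothesis(candidates))  # this is a test
```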

  8. Simple Isolated Digit Recognition • Train 10 acoustic templates M_i: one per digit • Compare input x with each • Select the most similar template j according to some comparison function f, minimizing the difference: • j = argmin_i f(x, M_i)
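A sketch of this template-matching scheme, assuming fixed-length feature vectors and Euclidean distance as the comparison function f; real systems use dynamic time warping to compare utterances of different lengths:

```python
import numpy as np

def recognize(x, templates):
    # templates: feature vectors M_0..M_9, one per digit
    # j = argmin_i f(x, M_i) with f = Euclidean distance
    return int(np.argmin([np.linalg.norm(x - m) for m in templates]))

rng = np.random.default_rng(0)
templates = [rng.normal(size=40) for _ in range(10)]  # stand-in "models"
x = templates[7] + 0.1 * rng.normal(size=40)          # noisy "seven"
print(recognize(x, templates))  # 7
```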

  9. Scaling Up: Continuous Speech Recognition • Collect training and test corpora of • Speech + word transcriptions • Speech + phonetic transcriptions • Built by hand or using TTS • A text corpus • Determine a representation for the signal • Build probabilistic models: • Acoustic model: signal to phones

  10. Pronunciation model: phones to words • Language model: words to sentences • Select search procedures to decode new input given these trained models

  11. Representing the Signal • What parameters (features) of the waveform • Can be extracted automatically • Will preserve phonetic identity and distinguish one phone from others • Will be independent of speaker variability and channel conditions • Will not take up too much space • One answer: the power spectrum

  12. Speech is captured by a microphone and digitized • The signal is divided into frames • A power spectrum is computed to represent the energy in different frequency bands of the signal • Variants: LPC spectrum, cepstra, PLP • Each frame’s spectral features are represented by a small set of numbers (see the sketch below)
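A minimal sketch of this frame-based analysis; the 25 ms frame length, 10 ms shift, and Hamming window are typical values assumed here, not taken from the slides:

```python
import numpy as np

def power_spectrum_frames(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Slice a signal into overlapping frames and compute each frame's
    power spectrum (energy per frequency band)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)  # taper frame edges before the FFT
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    return np.array(frames)  # one row of spectral features per frame

sr = 16000
t = np.arange(sr) / sr
feats = power_spectrum_frames(np.sin(2 * np.pi * 440 * t), sr)
print(feats.shape)  # (number of frames, frequency bins)
```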

  13. Why does it work? • Different phonemes have different spectral characteristics • Why doesn’t it always work? • Phonemes can have different properties in different acoustic contexts, when spoken by different people, ...

  14. Acoustic Models • Model the likelihood of a phone given the spectral features and prior context • Usually represented as an HMM • A set of states representing phones or other subword units • Transition probabilities on states: how likely is one phone to follow another? • Observation/output likelihoods: how likely is a spectral feature vector to be observed in state i, given state i-1?
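A minimal sketch of such an HMM, assuming one state per phone and diagonal-Gaussian output densities over each frame's feature vector (a common modeling choice, assumed here rather than taken from the slides):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class PhoneHMM:
    phones: list          # state labels, e.g. ["b", "ah", "t"]
    trans: np.ndarray     # trans[i, j] = P(state j follows state i)
    means: np.ndarray     # means[i] = mean feature vector of state i
    var: np.ndarray       # var[i] = per-dimension variance of state i

    def log_output(self, i, x):
        # log P(feature vector x | state i), diagonal Gaussian
        return -0.5 * np.sum(np.log(2 * np.pi * self.var[i])
                             + (x - self.means[i]) ** 2 / self.var[i])

hmm = PhoneHMM(
    phones=["b", "ah", "t"],
    trans=np.array([[0.5, 0.5, 0.0],    # left-to-right topology:
                    [0.0, 0.6, 0.4],    # each phone loops or moves on
                    [0.0, 0.0, 1.0]]),
    means=np.zeros((3, 13)),
    var=np.ones((3, 13)),
)
print(hmm.log_output(0, np.zeros(13)))
```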

  15. Train an initial model on a small hand-labeled corpus to get estimates of the transition and observation probabilities • Tune the parameters on a large corpus with only word transcriptions • Iterate until no further improvement

  16. Pronunciation Model • Models the likelihood of a word given a network of candidate phone hypotheses (a weighted phone lattice) • Allophones: the flapped /t/ of butter vs. the released /t/ of but • The lexicon may be an HMM or a simple dictionary
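A sketch of the simple-dictionary case: each word maps to one or more phone strings, so allophonic variants can be listed explicitly. The ARPAbet-style phone symbols are illustrative, not from the slides:

```python
# Dictionary-style pronunciation model: word -> candidate phone strings.
lexicon = {
    "butter": [("b", "ah", "dx", "er")],     # "dx" = flapped /t/
    "but":    [("b", "ah", "t")],            # released /t/
    "the":    [("dh", "ah"), ("dh", "iy")],  # multiple pronunciations
}

def phone_sequences(word):
    return lexicon.get(word, [])  # empty list for out-of-lexicon words

print(phone_sequences("butter"))
```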

  17. Language Models • Model the likelihood of a word sequence given candidate word hypotheses • Grammars • Finite-state or CFG • N-grams • Trained on a corpus • Smoothing issues • The Out-of-Vocabulary (OOV) problem
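A sketch of a corpus-trained bigram model with add-one smoothing and an <unk> token for OOV words; the toy corpus is invented:

```python
from collections import Counter

corpus = "the train leaves at eight . the train arrives at noon .".split()
vocab = set(corpus) | {"<unk>"}
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    # Map out-of-vocabulary words to <unk>
    w1 = w1 if w1 in vocab else "<unk>"
    w2 = w2 if w2 in vocab else "<unk>"
    # Add-one (Laplace) smoothing gives unseen bigrams nonzero mass
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

print(p_bigram("the", "train"))  # seen bigram: relatively high
print(p_bigram("the", "zebra"))  # OOV word falls back to <unk>
```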

  18. Search • Find the best hypothesis given • A lattice of subword units (acoustic model) • Segmentations of all paths into possible words (pronunciation model) • Probabilities of word sequences (language model) • Huge search space • Viterbi decoding • Beam search
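A sketch of Viterbi decoding over log-space scores, with the inputs assumed to come from the models above; beam search would additionally prune, at each frame, states scoring too far below that frame's best:

```python
import numpy as np

def viterbi(log_init, log_trans, log_out):
    # log_init: (S,) start scores; log_trans: (S, S); log_out: (T, S)
    T, S = log_out.shape
    score = log_init + log_out[0]
    back = np.zeros((T, S), dtype=int)       # best predecessor per state
    for t in range(1, T):
        cand = score[:, None] + log_trans    # score of every transition
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_out[t]
    path = [int(np.argmax(score))]           # best final state...
    for t in range(T - 1, 0, -1):            # ...then trace it back
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(1)
S, T = 3, 6
print(viterbi(np.log(np.full(S, 1 / S)),
              np.log(rng.dirichlet(np.ones(S), size=S)),
              np.log(rng.dirichlet(np.ones(S), size=T))))
```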

  19. Challenges for Transcription • Robustness to channel characteristics and noise • Portability to new applications • Adaptation: to speakers, to environments • LMs: simple n-grams need help • Confidence measures • OOV words • New speaking styles/genres • New applications

  20. Challenges for Understanding • Recognizing communicative ‘problems’ • ASR errors • User corrections • Disfluencies and self-repairs

  21. An Unsuccessful Dialogue
  S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
  U: I'd like to find a train from Ch- to Chicago from Baltimore.
  S: Sorry, I can't understand you. Please repeat your utterance.
  U: I'd like to find a train from Baltimore to Chicago.
  S: I heard you say “leave today around 5 thirty in the morning.” Which city do you want to go to?
  U: I would like to leave Saturday at eight in the morning.

  22. S: You can say the arrival city name, such as “New York City.”
  U: Baltimore to Chicago.
  S: You can say the arrival city name, such as “New York City.”
  U: Baltimore to Chicago. (I hate it.)
  U: Train from Baltimore ... S: ...I heard you s... U: to Chicago.... Hello? (overlapping speech)
  S: You can say the name of your departure city, such as "New York City."
  U: Baltimore.

  23. U: Bal-ti-more....
  S: You can say...
  U: Baltimore.
  S: Do you want to continue and find a new set of trains?
  U: Yes.
  S: You can say "yes" or "I am done here" to find a new set of trains.
  U: Yes.
  S: Thank you for using AT&T Amtrak train time table system. See you next time.
  U: I nev-

  24. Disfluencies and Self-Repairs • Disfluencies abound in spontaneous speech: one every 4.6 s in radio call-in speech (Blackmer & Mitton ‘91) • hesitation: Ch- change strategy. • filled pause: Um Baltimore. • self-repair: Ba- uh Chicago. • They are hard to recognize: • Ch- change strategy. --> to D C D C today ten fifteen. • Um Baltimore. --> From Baltimore ten. • Ba- uh Chicago. --> For Boston Chicago.

  25. Possibilities for Understanding • Recognizing speaker emotion • Identifying speech acts (e.g., the many functions of okay) • Locating topic boundaries for topic tracking, audio browsing, and speech data mining

  26. Next Week
