Automatic Speech Recognition


Presentation Transcript


  1. Automatic Speech Recognition Julia Hirschberg CS 6998

  2. What is speech recognition? • Transcribing words? • Understanding meaning?

  3. It’s hard to recognize speech... • People speak in very different ways • Across-speaker variation • Within-speaker variation • Speech sounds vary according to their context • Environments vary with respect to noise • A transcription system must handle all of this and still produce a transcript of the spoken words

  4. Success: low Word Error Rate (WER) = (S + I + D) / N * 100, where S, I, and D count substitutions, insertions, and deletions against an N-word reference transcript • Example: the hypothesis “Thesis test” for the reference “This is a test” has 1 substitution and 2 deletions over 4 reference words: 75% WER (computed in the sketch below) • Progress driven by: • Very large training corpora • Fast machines and cheap storage • Bake-offs (shared evaluation benchmarks) • A market for real-time systems • New representations and algorithms: Finite State Transducers
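A minimal sketch of the WER computation via minimum edit distance, assuming whitespace tokenization; the function name is invented for illustration:

```python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # (S + I + D) / N * 100, with N the reference length
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# The slide's example: 1 substitution + 2 deletions over 4 words = 75%.
print(wer("this is a test", "thesis test"))  # 75.0
```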

  5. Varieties of Speech Recognition

  6. ASR and the Noisy Channel Model • Source --> noisy channel --> Hypothesis • Find the most likely sentence W in the language to have generated the observed “noisy” acoustic input O • W' = argmax_W P(W|O) • By Bayes' rule: W' = argmax_W P(O|W) P(W) / P(O)

  7. P(O) is the same for every hypothesized W, so • W' = argmax_W P(O|W) P(W) • P(W) is the prior (language model); P(O|W) is the (acoustic) likelihood
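As a toy illustration of this decision rule, the sketch below scores each candidate sentence by log P(O|W) + log P(W) and returns the argmax; the candidate strings and probabilities are invented, not from the slides:

```python
import math

# Noisy-channel decision rule: W' = argmax_W P(O|W) P(W), computed in
# log space to avoid underflow.  All numbers here are made up.
def best_hypothesis(hypotheses):
    # hypotheses: list of (W, log P(O|W), log P(W)) triples
    return max(hypotheses, key=lambda h: h[1] + h[2])[0]

candidates = [
    ("this is a test", math.log(0.02), math.log(1e-4)),
    ("thesis test",    math.log(0.03), math.log(1e-6)),
]
# The acoustic model slightly prefers "thesis test", but the language
# model prior P(W) tips the decision the other way.
print(best_hypothesis(candidates))  # this is a test
```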

  8. Simple Isolated Digit Recognition • Train 10 acoustic templates M_i: one per digit • Compare input x with each • Select the most similar template j according to some comparison function f, minimizing the difference: • j = argmin_i f(x, M_i)
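A sketch of this template-matching scheme, assuming fixed-length feature vectors and Euclidean distance as the comparison function f; real systems use dynamic time warping to compare utterances of different lengths:

```python
import numpy as np

def recognize(x, templates):
    # templates: feature vectors M_0..M_9, one per digit
    # j = argmin_i f(x, M_i) with f = Euclidean distance
    return int(np.argmin([np.linalg.norm(x - m) for m in templates]))

rng = np.random.default_rng(0)
templates = [rng.normal(size=40) for _ in range(10)]  # stand-in "models"
x = templates[7] + 0.1 * rng.normal(size=40)          # noisy "seven"
print(recognize(x, templates))  # 7
```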

  9. Scaling Up: Continuous Speech Recognition • Collect training and test corpora of • Speech + word transcriptions • Speech + phonetic transcriptions • Built by hand or using TTS • A text corpus • Determine a representation for the signal • Build probabilistic models: • Acoustic model: signal to phones

  10. Pronunciation model: phones to words • Language model: words to sentences • Select search procedures to decode new input given these trained models

  11. Representing the Signal • What parameters (features) of the waveform • Can be extracted automatically • Will preserve phonetic identity and distinguish one phone from others • Will be independent of speaker variability and channel conditions • Will not take up too much space • One answer: the power spectrum

  12. Speech is captured by a microphone and digitized • The signal is divided into frames • A power spectrum is computed to represent the energy in different frequency bands of the signal • Variants: LPC spectrum, cepstra, PLP • Each frame’s spectral features are represented by a small set of numbers (see the sketch below)
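A minimal sketch of this frame-based analysis; the 25 ms frame length, 10 ms shift, and Hamming window are typical values assumed here, not taken from the slides:

```python
import numpy as np

def power_spectrum_frames(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Slice a signal into overlapping frames and compute each frame's
    power spectrum (energy per frequency band)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)  # taper frame edges before the FFT
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    return np.array(frames)  # one row of spectral features per frame

sr = 16000
t = np.arange(sr) / sr
feats = power_spectrum_frames(np.sin(2 * np.pi * 440 * t), sr)
print(feats.shape)  # (number of frames, frequency bins)
```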

  13. Why does it work? • Different phonemes have different spectral characteristics • Why doesn’t it always work? • Phonemes can have different properties in different acoustic contexts, when spoken by different people, ...

  14. Acoustic Models • Model the likelihood of a phone given the spectral features and prior context • Usually represented as an HMM • A set of states representing phones or other subword units • Transition probabilities on states: how likely is one phone to follow another? • Observation/output likelihoods: how likely is a spectral feature vector to be observed in state i, given state i-1?
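A minimal sketch of such an HMM, assuming one state per phone and diagonal-Gaussian output densities over each frame's feature vector (a common modeling choice, assumed here rather than taken from the slides):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class PhoneHMM:
    phones: list          # state labels, e.g. ["b", "ah", "t"]
    trans: np.ndarray     # trans[i, j] = P(state j follows state i)
    means: np.ndarray     # means[i] = mean feature vector of state i
    var: np.ndarray       # var[i] = per-dimension variance of state i

    def log_output(self, i, x):
        # log P(feature vector x | state i), diagonal Gaussian
        return -0.5 * np.sum(np.log(2 * np.pi * self.var[i])
                             + (x - self.means[i]) ** 2 / self.var[i])

hmm = PhoneHMM(
    phones=["b", "ah", "t"],
    trans=np.array([[0.5, 0.5, 0.0],    # left-to-right topology:
                    [0.0, 0.6, 0.4],    # each phone loops or moves on
                    [0.0, 0.0, 1.0]]),
    means=np.zeros((3, 13)),
    var=np.ones((3, 13)),
)
print(hmm.log_output(0, np.zeros(13)))
```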

  15. Train an initial model on a small hand-labeled corpus to get estimates of the transition and observation probabilities • Tune the parameters on a large corpus with only word transcriptions • Iterate until no further improvement

  16. Pronunciation Model • Models the likelihood of a word given a network of candidate phone hypotheses (a weighted phone lattice) • Allophones: the flapped /t/ of butter vs. the released /t/ of but • The lexicon may be an HMM or a simple dictionary
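A sketch of the simple-dictionary case: each word maps to one or more phone strings, so allophonic variants can be listed explicitly. The ARPAbet-style phone symbols are illustrative, not from the slides:

```python
# Dictionary-style pronunciation model: word -> candidate phone strings.
lexicon = {
    "butter": [("b", "ah", "dx", "er")],     # "dx" = flapped /t/
    "but":    [("b", "ah", "t")],            # released /t/
    "the":    [("dh", "ah"), ("dh", "iy")],  # multiple pronunciations
}

def phone_sequences(word):
    return lexicon.get(word, [])  # empty list for out-of-lexicon words

print(phone_sequences("butter"))
```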

  17. Language Models • Model the likelihood of a word sequence given candidate word hypotheses • Grammars • Finite-state or CFG • N-grams • Trained on a corpus • Smoothing issues • The Out-of-Vocabulary (OOV) problem
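A sketch of a corpus-trained bigram model with add-one smoothing and an <unk> token for OOV words; the toy corpus is invented:

```python
from collections import Counter

corpus = "the train leaves at eight . the train arrives at noon .".split()
vocab = set(corpus) | {"<unk>"}
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    # Map out-of-vocabulary words to <unk>
    w1 = w1 if w1 in vocab else "<unk>"
    w2 = w2 if w2 in vocab else "<unk>"
    # Add-one (Laplace) smoothing gives unseen bigrams nonzero mass
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

print(p_bigram("the", "train"))  # seen bigram: relatively high
print(p_bigram("the", "zebra"))  # OOV word falls back to <unk>
```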

  18. Search • Find the best hypothesis given • A lattice of subword units (acoustic model) • Segmentations of all paths into possible words (pronunciation model) • Probabilities of word sequences (language model) • Huge search space • Viterbi decoding • Beam search
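A sketch of Viterbi decoding over log-space scores, with the inputs assumed to come from the models above; beam search would additionally prune, at each frame, states scoring too far below that frame's best:

```python
import numpy as np

def viterbi(log_init, log_trans, log_out):
    # log_init: (S,) start scores; log_trans: (S, S); log_out: (T, S)
    T, S = log_out.shape
    score = log_init + log_out[0]
    back = np.zeros((T, S), dtype=int)       # best predecessor per state
    for t in range(1, T):
        cand = score[:, None] + log_trans    # score of every transition
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_out[t]
    path = [int(np.argmax(score))]           # best final state...
    for t in range(T - 1, 0, -1):            # ...then trace it back
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(1)
S, T = 3, 6
print(viterbi(np.log(np.full(S, 1 / S)),
              np.log(rng.dirichlet(np.ones(S), size=S)),
              np.log(rng.dirichlet(np.ones(S), size=T))))
```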

  19. Challenges for Transcription • Robustness to channel characteristics and noise • Portability to new applications • Adaptation: to speakers, to environments • LMs: simple n-grams need help • Confidence measures • OOV words • New speaking styles/genres • New applications

  20. Challenges for Understanding • Recognizing communicative ‘problems’ • ASR errors • User corrections • Disfluencies and self-repairs

  21. An Unsuccessful Dialogue
  S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?
  U: I'd like to find a train from Ch- to Chicago from Baltimore.
  S: Sorry, I can't understand you. Please repeat your utterance.
  U: I'd like to find a train from Baltimore to Chicago.
  S: I heard you say “leave today around 5 thirty in the morning.” Which city do you want to go to?
  U: I would like to leave Saturday at eight in the morning.

  22. S: You can say the arrival city name, such as “New York City.”
  U: Baltimore to Chicago.
  S: You can say the arrival city name, such as “New York City.”
  U: Baltimore to Chicago. (I hate it.)
  U: Train from Baltimore ... S: ...I heard you s... U: to Chicago.... Hello? (overlapping speech)
  S: You can say the name of your departure city, such as "New York City."
  U: Baltimore.

  23. U: Bal-ti-more....
  S: You can say...
  U: Baltimore.
  S: Do you want to continue and find a new set of trains?
  U: Yes.
  S: You can say "yes" or "I am done here" to find a new set of trains.
  U: Yes.
  S: Thank you for using AT&T Amtrak train time table system. See you next time.
  U: I nev-

  24. Disfluencies and Self-Repairs • Disfluencies abound in spontaneous speech: one every 4.6 s in radio call-in speech (Blackmer & Mitton ‘91) • hesitation: Ch- change strategy. • filled pause: Um Baltimore. • self-repair: Ba- uh Chicago. • They are hard to recognize: • Ch- change strategy. --> to D C D C today ten fifteen. • Um Baltimore. --> From Baltimore ten. • Ba- uh Chicago. --> For Boston Chicago.

  25. Possibilities for Understanding • Recognizing speaker emotion • Identifying speech acts (e.g., the many functions of okay) • Locating topic boundaries for topic tracking, audio browsing, and speech data mining

  26. Next Week
