Overview of Speech Recognition: Acoustic Modeling and Statistical Approaches

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.
CS 479, section 1: Natural Language Processing
Lecture #16: Speech Recognition Overview (cont.)
Thanks to Alex Acero (Microsoft Research), Jeff Adams (Nuance), Simon Arnfield (Sheffield), Dan Klein (UC Berkeley), and Mazin Rahim (AT&T Research) for many of the materials used in this lecture.

Announcements
• Reading Report #6 on Young’s Overview
  • Due: now
• Reading Report #7 on M&S 7
  • Due: Friday
• Review Questions
  • Typed list of 5 questions for Mid-term exam review
  • Due next Wednesday

Objectives
• Continue our overview of an approach to speech recognition, picking up at acoustic modeling
• See other examples of the source / channel (noisy channel) paradigm for modeling interesting processes
• Apply language models

Recall: Front End
• We want to predict a sentence given a feature vector
[Figure: source / noisy-channel diagram with the labels Source, Noisy Channel, Text, Speech, FE, Features, ASR, Text]

Acoustic Modeling
[Figure: system block diagram with the components Feature Extraction, Acoustic Model, Word Lexicon, Language Model, and Decoder / Search]
• Goal:
  • Map acoustic feature vectors into distinct linguistic units
  • Such as phones, syllables, words, etc.

Acoustic Trajectories
[Figure: phoneme labels (f, sh, s, k, AW, OW, th, t, p, AH, AA, AY, and others) scattered across the acoustic feature space]

Acoustic Models: Neighborhoods are not Points
• How do we describe what points in our “feature space” are likely to come from a given phoneme?
  • It’s clearly more complicated than just identifying a single point.
  • Also, the boundaries are not “clean”.
• Use the normal distribution:
  • Points are likely to lie near the center.
  • We describe the distribution with the mean & variance.
  • Easy to compute with
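The "points are likely to lie near the center" idea can be sketched in one dimension. This is a minimal illustration, not the lecture's code: the function name `gaussian_log_pdf` and the feature values, mean, and variance below are all assumptions chosen for the example.

```python
import math

def gaussian_log_pdf(x, mean, variance):
    """Log density of a 1-D normal distribution evaluated at x."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)

# Hypothetical 1-D feature values scored against one phoneme's Gaussian.
near = gaussian_log_pdf(1.1, mean=1.0, variance=0.25)  # close to the mean
far = gaussian_log_pdf(3.0, mean=1.0, variance=0.25)   # far from the mean
print(near > far)
```

Working with log densities rather than raw densities is a common choice because products of many small probabilities become sums.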

Acoustic Models: Neighborhoods are not Points (2)
• Normal distributions in M dimensions are analogous
  • A.k.a. “Gaussians”
• Specify the mean point in M dimensions
  • Like an M-dimensional “hill” centered around the mean point
• Specify the variances (as a covariance matrix)
  • Diagonal gives the “widths” of the distribution in each direction
  • Off-diagonal values describe the “orientation”
  • “Full covariance”: possibly “tilted”
  • “Diagonal covariance”: not “tilted”
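For the diagonal-covariance case, the M-dimensional density factors into a product of M one-dimensional Gaussians, so the log density is just a sum over dimensions. The sketch below assumes that case only (a full covariance matrix would also need a determinant and an inverse); the name `diag_gaussian_log_pdf` is hypothetical.

```python
import math

def diag_gaussian_log_pdf(x, mean, variances):
    """Log density of an M-dimensional Gaussian with diagonal covariance.

    `variances` holds the diagonal of the covariance matrix, i.e. the
    "width" of the distribution along each feature dimension.
    """
    if not (len(x) == len(mean) == len(variances)):
        raise ValueError("x, mean, and variances must have the same length")
    total = 0.0
    for xi, mi, vi in zip(x, mean, variances):
        if vi <= 0:
            raise ValueError("variances must be positive")
        # Each dimension contributes an independent 1-D Gaussian term.
        total += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total

# Hypothetical 2-D feature vector: wide in dimension 0, narrow in dimension 1.
score = diag_gaussian_log_pdf([0.5, 0.1], mean=[0.0, 0.0], variances=[4.0, 0.25])
```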

AMs: Gaussians don’t really cut it
• Consider the “AY” frames in our example. How can we describe these with an (elliptical) Gaussian?
• A single (diagonal) Gaussian is too big to be helpful.
• Full-covariance Gaussians are hard to train.
• We often use multiple Gaussians (a.k.a. Gaussian mixture models)

(1 dimensional) Gaussian Mixture Models
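A one-dimensional mixture can be sketched as a weighted sum of Gaussian densities. The example below is an illustration under assumed values: the two-component mixture and the single "too big" Gaussian (mean 0, variance 5, which matches the mixture's overall mean and variance) are made up, and the function names are hypothetical.

```python
import math

def gaussian_log_pdf(x, mean, variance):
    """Log density of a 1-D normal distribution evaluated at x."""
    return -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)

def gmm_log_pdf(x, weights, means, variances):
    """Log density of a 1-D Gaussian mixture, using log-sum-exp for stability."""
    terms = [math.log(w) + gaussian_log_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical bimodal data: two clusters around -2 and +2.
weights, means, variances = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]

at_mode_mixture = gmm_log_pdf(2.0, weights, means, variances)
at_mode_single = gaussian_log_pdf(2.0, 0.0, 5.0)
between_mixture = gmm_log_pdf(0.0, weights, means, variances)
between_single = gaussian_log_pdf(0.0, 0.0, 5.0)
```

With these values the mixture scores the cluster centers higher than the single broad Gaussian does, and scores the empty region between the clusters lower, which is the behavior the single Gaussian cannot provide.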

AMs: Phonemes are a path, not a destination
• Phonemes, like stories, have beginnings, middles, and ends.
  • This might be clear if you think of how the “AY” sound moves from a sort of “EH” to an “EE”.
  • Even non-diphthongs show these properties.
• We often represent a phoneme with multiple “states”.
  • E.g. in our AY model, we might have 4 states.
  • And each of these states is modeled by a mixture of Gaussians.
[Figure: four-state model with states labeled STATE 1, STATE 2, STATE 3, STATE 4]

AMs: Whence & Whither
• It matters where you come from (whence) and where you are going (whither).
  • Phonetic contextual effects
• A way to model this is to use triphones
  • I.e., depend on the previous & following phonemes
  • E.g., our “AY” model should really be a silence-AY-S model
  • (… or pentaphones: use 2 phonemes before & after)
• So what we really need for our “AY” model is a:
  • Mixture of Gaussians
  • For each of multiple states
  • For each possible set of predecessor & successor phonemes
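Expanding a phoneme sequence into triphone labels can be sketched as below. The `left-center+right` label format follows the `sil-AY+S` labels used elsewhere in these slides; the function name `to_triphones` and the choice to pad both ends of the utterance with `sil` are assumptions for illustration.

```python
def to_triphones(phones, boundary="sil"):
    """Turn a phoneme sequence into context-dependent 'left-center+right' labels.

    The utterance is padded with `boundary` on both sides so that the first
    and last phonemes also get a left and right context.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

print(to_triphones(["AY", "S"]))  # ['sil-AY+S', 'AY-S+sil']
```

Each distinct label would then get its own set of states and Gaussian mixtures, which is why the number of models grows quickly with context width.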

Hidden Markov Model (HMM)
• Captures:
  • Transitions between hidden states
  • Feature emissions as mixtures of Gaussians
  • Spectral properties modeled by a parametric random process
  • i.e., a directed graphical model!
• Advantages:
  • Powerful statistical method for a wide range of data and conditions
  • Highly reliable for recognizing speech
• A collection of HMMs for each:
  • sub-word unit type
  • extraneous event: cough, um, sneeze, …
• More on HMMs coming up in the course after classification!

Anatomy of an HMM
• HMM for /AY/ in context of preceding silence, followed by /S/
[Figure: three-state HMM with states sil-AY+S[1], sil-AY+S[2], sil-AY+S[3]; transition probabilities shown in the figure: 0.2, 0.3, 0.2, 0.8, 0.7, 0.8, 0.5]

HMMs as Phone Models
[Figure: three-state HMM with states sil-AY+S[1], sil-AY+S[2], sil-AY+S[3]; transition probabilities shown in the figure: 0.2, 0.3, 0.2, 0.8, 0.7, 0.8, 0.5]
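A three-state phone HMM of this kind can be sketched as a transition table. The assignment of the figure's numbers to particular arcs is not recoverable from the slide text, so the left-to-right topology and the probabilities below are assumptions chosen only for illustration, as are the `exit` state and the function names.

```python
# Hypothetical left-to-right topology: each state either repeats (self-loop)
# or advances to the next state; "exit" stands for leaving the phone model.
TRANSITIONS = {
    "sil-AY+S[1]": {"sil-AY+S[1]": 0.8, "sil-AY+S[2]": 0.2},
    "sil-AY+S[2]": {"sil-AY+S[2]": 0.7, "sil-AY+S[3]": 0.3},
    "sil-AY+S[3]": {"sil-AY+S[3]": 0.8, "exit": 0.2},
}

def rows_are_normalized(transitions, tol=1e-9):
    """Check that the outgoing probabilities of every state sum to 1."""
    return all(abs(sum(row.values()) - 1.0) <= tol for row in transitions.values())

def expected_frames(transitions, state):
    """Expected number of frames spent in `state`: 1 / (1 - self-loop probability)."""
    return 1.0 / (1.0 - transitions[state][state])

total_frames = sum(expected_frames(TRANSITIONS, s) for s in TRANSITIONS)
```

The self-loop probabilities control how long the model tends to stay in each part of the phone: a higher self-loop value means a longer expected duration.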

Words and Phones
How do we know how to segment words into phones?

Word Lexicon
[Figure: system block diagram with the components Feature Extraction, Acoustic Model, Word Lexicon, Language Model, and Decoder / Search]
• Goal:
  • Map sub-word units into words
  • Usual sub-word units are phone(me)s
• Lexicon: (CMUDict, ARPABET)

  Phoneme   Example   Translation
  AA        odd       AA D
  AE        at        AE T
  AH        hut       HH AH T
  AO        ought     AO T
  AW        cow       K AW
  AY        hide      HH AY D
  B         be        B IY
  CH        cheese    CH IY Z
  …

• Properties:
  • Simple
  • Typically knowledge-engineered (not learned – shock!)
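A lexicon of this kind is essentially a lookup table from words to phoneme sequences. The sketch below uses only the entries listed on the slide; the names `LEXICON` and `pronounce` are hypothetical, and a real pronouncing dictionary also has to deal with multiple pronunciations per word, which this toy version ignores.

```python
# Word -> ARPABET phoneme sequence, copied from the slide's example entries.
LEXICON = {
    "odd": ["AA", "D"],
    "at": ["AE", "T"],
    "hut": ["HH", "AH", "T"],
    "ought": ["AO", "T"],
    "cow": ["K", "AW"],
    "hide": ["HH", "AY", "D"],
    "be": ["B", "IY"],
    "cheese": ["CH", "IY", "Z"],
}

def pronounce(words, lexicon=LEXICON):
    """Concatenate the phoneme sequences of the given words."""
    phones = []
    for word in words:
        key = word.lower()
        if key not in lexicon:
            raise KeyError(f"out-of-vocabulary word: {word!r}")
        phones.extend(lexicon[key])
    return phones

print(pronounce(["hide", "at", "cow"]))  # ['HH', 'AY', 'D', 'AE', 'T', 'K', 'AW']
```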

Decoder
• Predict a sentence given a feature vector
[Figure: source / noisy-channel diagram with the labels Source, Noisy Channel, Text, Speech, FE, Features, ASR, Text]

Decoding: as State-Space Search
[Figure: system block diagram with the components Feature Extraction, Acoustic Model, Word Lexicon, Language Model, and Pattern Classification]

Decoding as Search
• Viterbi – Dynamic Programming
• Multi-pass
• A* (“stack decoding”)
• N-best
• …

Viterbi: DP
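Viterbi decoding can be sketched as the dynamic program below. It is a toy version under stated assumptions: a two-state HMM with discrete outputs, all probabilities nonzero, and made-up numbers. A real recognizer would use Gaussian-mixture emissions and a pruned search over a far larger state space.

```python
import math

def viterbi(observations, states, start, trans, emit):
    """Most likely state path for a discrete-output HMM (log domain).

    Returns (log probability of the best path, list of states).
    Assumes every probability that gets looked up is greater than zero.
    """
    if not observations:
        raise ValueError("need at least one observation")
    # best[t][s]: log-probability of the best path ending in state s at time t.
    best = [{s: math.log(start[s]) + math.log(emit[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states,
                       key=lambda p: best[t - 1][p] + math.log(trans[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans[prev][s])
                          + math.log(emit[s][observations[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path

# Hypothetical toy model.
STATES = ["A", "B"]
START = {"A": 0.6, "B": 0.4}
TRANS = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMIT = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}

log_prob, path = viterbi(["x", "y", "y"], STATES, START, TRANS, EMIT)
print(path)  # ['A', 'B', 'B']
```

The table `best` is what makes this dynamic programming: each cell reuses the best scores from the previous time step instead of enumerating every possible path.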

Noisy Channel Applications
• Speech recognition (dictation, commands, etc.)
  • text → neurons, acoustic signal, transmission → acoustic waveforms → text
• OCR
  • text → print, smudge, scan → image → text
• Handwriting recognition
  • text → neurons, muscles, ink, smudge, scan → image → text
• Spelling correction
  • text → your spelling → mis-spelled text → text
• Machine Translation (?)
  • text in target language → translation in head → text in source language → text in target language
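All of these applications share one decision rule: choose the source text that maximizes P(text) × P(observation | text), i.e. a language model times a channel model. The sketch below applies that rule to a toy spelling-correction case; the candidate words, the probabilities, and the function name `decode` are made up for illustration.

```python
import math

def decode(observed, prior, channel):
    """Return the source w that maximizes log P(w) + log P(observed | w)."""
    scores = {}
    for w, p_w in prior.items():
        p_obs = channel.get(w, {}).get(observed, 0.0)
        if p_w > 0 and p_obs > 0:
            scores[w] = math.log(p_w) + math.log(p_obs)
    if not scores:
        raise ValueError("no candidate can explain the observation")
    return max(scores, key=scores.get)

# Hypothetical language model P(w) and channel model P(observed | w).
PRIOR = {"the": 0.05, "tea": 0.001, "ten": 0.002}
CHANNEL = {
    "the": {"teh": 0.010},
    "tea": {"teh": 0.020},
    "ten": {"teh": 0.015},
}

print(decode("teh", PRIOR, CHANNEL))  # the
```

In this example the channel model alone slightly prefers "tea", but the language model prior makes "the" the overall winner, which is the role the language model plays in speech recognition as well.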

Noisy-Channel Models
• OCR
• Handwriting recognition
• Spelling Correction
• Translation?

What’s Next
• Upcoming lectures:
  • Classification / categorization
  • Naïve-Bayes models
  • Class-conditional language models

Extra

Milestones in Speech Recognition
[Figure: timeline with year axis 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2003]
• Era labels:
  • Small Vocabulary, Acoustic Phonetics-based
  • Medium Vocabulary, Template-based
  • Large Vocabulary, Statistical-based
  • Large Vocabulary; Syntax, Semantics
  • Very Large Vocabulary; Semantics, Multimodal Dialog
• Task labels: Isolated Words; Connected Digits; Connected Words; Continuous Speech; Speech Understanding; Spoken dialog; Multiple modalities
• Technique labels: Filter-bank analysis; Time-normalization; Dynamic programming; Pattern recognition; LPC analysis; Clustering algorithms; Level building; Hidden Markov models; Stochastic language modeling; Stochastic language understanding; Finite-state machines; Statistical learning; Concatenative synthesis; Machine learning; Mixed-initiative dialog

Dragon Dictate Progress
• WERR* from Dragon NaturallySpeaking version 7 to version 8 to version 9:

  DOMAIN        7→8    8→9
  US English    27%    23%
  UK English    21%    10%
  German        16%    10%
  French        24%    14%
  Dutch         27%    18%
  Italian       22%    14%
  Spanish       26%    17%

* WERR means relative word error rate reduction on an in-house evaluation set.
Results from Jeff Adams, ca. 2006
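Relative word error rate reduction compares the error rate before and after a change: (old WER − new WER) / old WER. The sketch below shows that arithmetic and how two successive reductions combine. The absolute WER values are hypothetical (the slide reports only relative reductions), the function names are made up, and the combination assumes both reductions were measured on comparable evaluation sets, which the slide does not state.

```python
def relative_wer_reduction(old_wer, new_wer):
    """Relative word error rate reduction: (old - new) / old."""
    if old_wer <= 0:
        raise ValueError("old_wer must be positive")
    return (old_wer - new_wer) / old_wer

def compound_reduction(reductions):
    """Combine successive relative reductions into one overall reduction."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r  # fraction of the errors that survives this step
    return 1.0 - remaining

# Hypothetical absolute WERs: 10.0% falling to 7.3% is a 27% relative reduction.
single_step = relative_wer_reduction(0.100, 0.073)

# US English row: 27% (version 7 to 8) followed by 23% (version 8 to 9).
overall = compound_reduction([0.27, 0.23])
print(round(overall, 4))  # 0.4379
```

Note that successive relative reductions multiply rather than add: 27% followed by 23% removes about 44% of the original errors, not 50%.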

Crazy Speech Marketplace
[Figure: timeline of speech companies, ca. 1980 to ca. 2004; names shown: Philips, IBM, Inso, Articulate, MedRemote, Kurzweil, ScanSoft, Nuance, L&H, Dragon, Dictaphone, Speechworks, Voice Signal, Tegic, etc.]

Speech vs. text: tokens vs. characters
• Speech recognition recognizes a sequence of “tokens” taken from a discrete & finite set, called the lexicon.
• Informally, tokens correspond to words, but the correspondence is inexact. In dictation applications, where we have to worry about converting between speech & text, we need to sort out a “token philosophy”:
  • Do we recognize “forty-two” or “forty two” or “42” or “40 2”?
  • Do we recognize “millimeters” or “mms” or “mm”?
  • What about common words which can also be names, e.g. “Brown” and “brown”?
  • What about capitalized phrases like “Nuance Communications” or “The White House” or “Main Street”?
  • What multi-word tokens should be in the lexicon, like “of_the”?
  • What do we do with complex morphologies or compounding?

Converting between tokens & text
• TOKENS: profits rose to twenty eight million dollars .\period see figure one a\a on page one twenty four .\period
• TEXT: Profits rose to $28 million. See fig. 1a on p. 124.
[Figure: TOKENIZATION maps text to tokens and ITN (inverse text normalization) maps tokens back to text; also labeled: TOKEN PHILOSOPHY, LEXICON]

Three examples (Tokenization)
TEXT
• P.J. O’Rourke said, "Giving money and power to government is like giving whiskey and car keys to teenage boys."
• The 18-speed I bought sold on www.eBay.com for $611.00, including 8.5% sales tax.
• From 1832 until August 15, 1838 they lived at No. 235 Main Street, "opposite the Academy," and from there they could see it all.
TOKENS
• PJ O'Rourke said ,\comma "\open-quotes giving money and power to government is like giving whiskey and car keys to teenage boys .\period "\close-quotes
• the eighteen speed I bought sold on www.\WWW_dot eBay .com\dot_com for six hundred and eleven dollars zero cents ,\comma including eight .\point five percent sales tax .\period
• from one eight three two until the fifteenth of August eighteen thirty eight they lived at number two thirty five Main_Street ,\comma "\open-quotes opposite the Academy ,\comma "\close-quotes and from there they could see it all .\period
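Going from tokens back to text can be sketched in a very reduced form. The toy function below only handles three things: taking the written half of `written\spoken` tokens, attaching sentence punctuation to the preceding word, and capitalizing sentence starts. Number formatting (e.g. "twenty eight million dollars" to "$28 million"), quotes, and abbreviations are deliberately left out, and the name `render_tokens` is hypothetical rather than anything from Dragon.

```python
def render_tokens(tokens):
    """Render a token sequence as text (toy inverse text normalization)."""
    attach_left = {".", ",", "?", "!"}   # punctuation glued to the previous word
    sentence_end = {".", "?", "!"}       # punctuation that ends a sentence
    words = []
    capitalize_next = True
    for token in tokens:
        # Tokens of the form 'written\spoken' contribute only the written part.
        written = token.split("\\", 1)[0]
        if written in attach_left:
            if words:
                words[-1] += written
            else:
                words.append(written)
            if written in sentence_end:
                capitalize_next = True
            continue
        if capitalize_next:
            written = written[:1].upper() + written[1:]
            capitalize_next = False
        words.append(written)
    return " ".join(words)

tokens = ["see", "figure", "one", "a\\a", ".\\period"]
print(render_tokens(tokens))  # See figure one a.
```

The gap between this output and the slide's "See fig. 1a" shows how much of the conversion depends on the chosen token philosophy and on rules beyond simple punctuation handling.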

Missing from speech: punctuation
• When people speak they don’t explicitly indicate phrase and section boundaries instead listeners rely on prosody and syntax to know where these boundaries belong in dictation applications we normally rely on speakers to speak punctuation explicitly how can we remove that requirement
• When people speak, they don’t explicitly indicate phrase and section boundaries.
• Instead, listeners rely on prosody and syntax to know where these boundaries belong.
• In dictation applications, we normally rely on speakers to speak punctuation explicitly.
• How can we remove that requirement?

Punctuation Guessing Example
• Punctuation Guessing
  • As currently shipping in Dragon
  • Targeted towards free, unpunctuated speech
