Presentation Transcript


  1. Landmark-Based Speech Recognition: The Marriage of High-Dimensional Machine Learning Techniques with Modern Linguistic Representations. Mark Hasegawa-Johnson, jhasegaw@uiuc.edu. Research performed in collaboration with James Baker (Carnegie Mellon), Sarah Borys (Illinois), Ken Chen (Illinois), Emily Coogan (Illinois), Steven Greenberg (Berkeley), Amit Juneja (Maryland), Katrin Kirchhoff (Washington), Karen Livescu (MIT), Srividya Mohan (Johns Hopkins), Jen Muller (Dept. of Defense), Kemal Sonmez (SRI), and Tianyu Wang (Georgia Tech)

  2. Goal of this Talk
  • Experiments with human subjects (since 1910 at Bell Labs, since 1950 at Harvard) give us detailed knowledge of human speech perception.
  • Human speech perception is multi-resolution, like progressive JPEG: syllables and prosody → distinctive features → words.
  • Automatic speech recognition (ASR) works best if all parameters in the system can be simultaneously learned in order to adjust a global optimality criterion.
  • In 1967, it became possible to globally optimize all parameters of a very simple recognition model called the hidden Markov model.
  • Multi-resolution speech models could not be globally optimized; therefore, from 1985 to 1999, standard ASR ignored results from speech psychology.
  • In the 1990s, new results in machine learning made it possible to globally optimize a multi-resolution model of speech psychology, and to use the resulting model as an automatic speech recognizer.
  • We do not yet know how best to “marry” speech psychology with new machine learning technology.
  • Goal of this talk: to test globally optimized computational models of speech psychology as automatic speech recognizers.

  3. Talk Outline
  • History and Overview
  • Acoustics → Landmarks
    • Psychological Results: Landmark-Based Speech Perception
    • Psychological Results: Perceptual Space ≠ Acoustic Space
    • Computational Model: Landmark Detection and Classification
    • Algorithm: Support Vector Machines
  • Landmarks → Words
    • The Pronunciation Modeling Problem
    • Psychological Model #1: An Underspecified Distinctive Feature Lexicon
    • Computational Model: Discriminative Selection of Landmarks
    • Psychological Model #2: Articulatory Phonology
    • Computational Model: Dynamic Bayesian Network (DBN)
  • Technological Evaluation
    • Landmark Detection and Classification
    • Forced Alignment using the DBN
    • Rescoring of word lattice output from an HMM-based recognizer
  • Error Analysis and Future Plans

  4. History
  • Human Speech Recognition Models
    • 1955, Miller and Nicely: Distinctive Features
    • 1955, Delattre, Liberman, and Cooper: Landmarks
    • 1975, Goldsmith: Underspecified Lexicon
    • 1992, Stevens: Landmark-Based Speech Perception Model
    • 1990, Browman and Goldstein: Articulatory Phonology
  • Automatic Speech Recognition
    • 1999, Niyogi and Ramesh: Support Vector Machines for Landmark Detection
    • 2003, Livescu and Glass: Dynamic Bayesian Network implementation of Articulatory Phonology
    • 2004, Hasegawa-Johnson et al., WS04 Summer Workshop at the Johns Hopkins Center for Language and Speech Processing
      • Underspecified Lexicon with Discriminative Landmark Selection
      • Hybrid SVM-DBN implementation of Articulatory Phonology

  5. Landmark-Based Speech Recognition (system overview figure)
  • Lattice hypothesis (words, times, scores): “… backed up …”
  • Pronunciation variants: “… backed up …”, “… backtup …”, “… back up …”, “… backt ihp …”, “… wackt ihp …”
  • Syllable structure: ONSET, NUCLEUS, CODA

  6. Acoustics → Landmarks: Results and Models from Psychology and Linguistics

  7. Spectral Dynamics (Delattre, Liberman and Cooper, 1955)
  • To recognize a stop consonant, one spectrum is not enough.
  • Recognition depends on the pattern of spectral change over a 50 ms period following the release “landmark.”

  8. Landmarks are Redundant (many authors, including Stevens, 1999)
  • To recognize a stop consonant, it is necessary and sufficient to hear any one of these: release into vowel, closure from vowel, or “ejective” burst
  • … three “acoustic landmarks” with very different spectral patterns (example word: “backed”).

  9. Recognition Depends on Rhythm (Warren, Healy, and Chalikia, 1996)
  • Heard as one voice saying “aa iy uw ow ae iy”
  • Heard as two voices: one says “hi uw,” one says “iowa”

  10. Nonlinear Map from Acoustic Features to Perceptual Features (Kuhl et al., 1992)

  11. In the Perceptual Space, Distinctive Feature Errors are Independent (Miller and Nicely, 1955)
  • Experimental method:
    • Subjects listen to nonsense syllables mixed with noise (white noise or BPF)
    • Subjects write the consonant they hear
  • Result: p(q* | q, SNR, BPF) ≈ ∏i p(fi* | fi, SNR, BPF), where
    • q* = consonant label heard by the listener, q = true consonant label
    • F* = [f1*, …, f6*] = perceived distinctive feature labels, F = [f1, …, f6] = true distinctive feature labels
    • the features are [±nasal, ±voiced, ±fricated, ±strident, ±lips, ±blade]
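To make the independence approximation concrete, here is a minimal Python sketch (not from the talk) that predicts a consonant confusion probability as the product of per-feature confusion probabilities. The feature vectors and per-feature error rates are illustrative placeholders, not Miller and Nicely's measurements.

```python
# Miller & Nicely's independence approximation: p(q*|q) is modeled as the
# product over features of p(fi*|fi).  All numbers below are made up.

# Distinctive feature vectors: [nasal, voiced, fricated, strident, lips, blade]
FEATURES = {
    "p": (0, 0, 0, 0, 1, 0),
    "b": (0, 1, 0, 0, 1, 0),
    "m": (1, 1, 0, 0, 1, 0),
}

# Hypothetical probability that each feature is misheard at some SNR
FEATURE_ERROR = [0.05, 0.20, 0.10, 0.08, 0.15, 0.15]

def p_confusion(true_c: str, heard_c: str) -> float:
    """p(q* = heard_c | q = true_c) under feature-error independence."""
    prob = 1.0
    for f_true, f_heard, err in zip(FEATURES[true_c], FEATURES[heard_c], FEATURE_ERROR):
        prob *= err if f_true != f_heard else (1.0 - err)
    return prob

print(p_confusion("p", "b"))   # differs only in [voiced]
print(p_confusion("p", "m"))   # differs in [nasal] and [voiced]
```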

  12. Consonant Confusions at -6 dB SNR. Distinctive features: ±nasal, ±voiced, ±fricative, ±strident

  13. In the Acoustic Space, Distinctive Features are Not Independent (Volaitis and Miller, 1992)

  14. Acoustics → Landmarks: A Computational Model

  15. Landmark Detection and Explanation (based on Stevens, Manuel, Shattuck-Hufnagel, and Liu, 1992)
  • MAP understanding: “… backed up …”
  • Search space of competing hypotheses: “… buck up …”, “… big dope …”, “… backed up …”, “… bagged up …”, “… big doowop …”
  • Syllable structure: ONSET, NUCLEUS, CODA

  16. Landmark Detector Inputs: Acoustic, Phonetic, and Auditory Features (total feature vector dimension: 483/frame)
  • MFCCs, 25 ms window (standard ASR features)
  • Spectral shape: energy, spectral tilt, and spectral compactness, once/millisecond
  • Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths (Zheng & Hasegawa-Johnson, ICSLP 2004)
  • Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures; Bitar & Espy-Wilson, 1996)
  • Rate-place model of neural response fields in the cat auditory cortex (Carlyon & Shamma, JASA 2003)
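As a rough illustration of how such a per-frame observation vector might be assembled, the sketch below stacks MFCCs with simple spectral-shape measures using librosa (an assumed dependency; the talk does not name a toolkit). The MUSIC-based formant tracks, acoustic-phonetic parameters, and rate-place auditory features require specialized front ends and are omitted here.

```python
# Sketch: stack MFCCs with coarse spectral-shape measures into one observation
# vector per frame.  librosa is an assumption; the formant and auditory-model
# features listed on the slide are not included.
import numpy as np
import librosa

def frame_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)
    hop = sr // 100                                   # one frame every 10 ms
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    S = np.abs(librosa.stft(y, hop_length=hop))
    energy = np.log(S.sum(axis=0) + 1e-8)             # frame energy
    centroid = librosa.feature.spectral_centroid(S=S, sr=sr)  # compactness proxy
    n = min(mfcc.shape[1], S.shape[1])                # align frame counts
    feats = np.vstack([mfcc[:, :n], energy[np.newaxis, :n], centroid[:, :n]])
    return feats.T                                    # shape: (n_frames, n_dims)
```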

  17. Cues for Place of Articulation: MFCCs + formants + rate-scale, within 150 ms of the landmark

  18. Landmark Detection using Support Vector Machines (SVMs)
  False acceptance vs. false rejection errors, TIMIT, per 10 ms frame. SVM stop release detector: half the error of an HMM (Niyogi, Ramesh & Burges, 1999, 2002).
  1. Delta-energy (“Deriv”): equal error rate = 0.2%
  2. HMM (*): false rejection error = 0.3%
  3. Linear SVM: EER = 0.15%
  4. Radial basis function SVM: equal error rate = 0.13%

  19. What is a Support Vector Machine?
  • SVM = hyperplane, RBF, or kernel-based classifier, trained to minimize an upper bound on the EXPECTED TEST CORPUS ERROR
  • EXPECTED TEST CORPUS ERROR ≤ (TRAINING CORPUS ERROR) + λ · (distance between the hyperplane and the nearest data point)^(-2)
  • Classifier on the right (in the slide’s figure): higher TRAINING CORPUS ERROR, but lower EXPECTED TEST CORPUS ERROR
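A minimal sketch of a frame-level landmark SVM in the spirit of slides 18-19, using scikit-learn (an assumption; the workshop's own tools are not named here). X is an (n_frames, n_dims) acoustic feature matrix and y marks which frames are stop releases; both are hypothetical.

```python
# Sketch of a frame-level stop-release detector as a kernel SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_landmark_svm(X: np.ndarray, y: np.ndarray):
    """y[i] = 1 if frame i is a stop-release landmark, else 0."""
    clf = make_pipeline(
        StandardScaler(),
        # C trades training error against margin width, in the spirit of the
        # bound on the slide: test error <= training error + lambda / margin^2.
        SVC(kernel="rbf", C=1.0, gamma="scale"),
    )
    clf.fit(X, y)
    return clf

# Usage (hypothetical data):
# clf = train_landmark_svm(X_train, y_train)
# y_hat = clf.predict(X_test)
```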

  20. What are the SVMs trained to detect?
  Simple answer: any binary distinction that will be useful to the recognizer. Hard answer (as implemented in July 2004):
  • SVMs trained to be correct in every frame:
    • Articulatory-free features: speech vs. silence, vowel vs. consonant, sonorant vs. obstruent, nasal vs. non-nasal, fricative vs. non-fricative
    • Landmarks: stop release vs. any other frame, fricative release vs. any other frame, stop closure vs. any other frame, …
  • SVMs trained to be correct given a specified context, and meaningless otherwise:
    • Primary articulator: lips vs. tongue blade vs. tongue body
    • Secondary articulators: voiced vs. unvoiced, nasal vs. not

  21. Why are we studying binary distinctive features? By focusing on binary distinctions, and using regularized learners (SVMs), we can “push the limit” of classifier complexity … in order to get high binary classification accuracy.

  22. Perceptual Space Encodes Distinctive Features: Errors Independent even if Acoustics Not
  • Nonlinear transform: implicit in the SVM kernel

  23. “Phonetic Features” = Nonlinear Transform followed by a One-Dimensional Cut
  • Nonlinear transform: implicit in the SVM kernel
  • SVM discriminant dimension = argmin(error(margin) + 1/width(margin))
  • The SVM extracts a discriminant dimension (Niyogi & Burges, 2002: posterior PDF = sigmoid model in the discriminant dimension)
  • An equivalent model: likelihoods = Gaussian in the discriminant dimension

  24. Soft Decisions once/5 ms: p(manner feature d(t) | Y(t)) and p(place feature d(t) | Y(t), t is a landmark)
  Processing chain: 2000-dimensional acoustic feature vector → SVM discriminant yi(t) → histogram → posterior probability of distinctive feature p(di(t)=1 | yi(t))
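One way to realize the discriminant-to-posterior step is a Platt-style sigmoid fit, consistent with the sigmoid posterior model cited on slide 23 (slide 24 mentions a histogram instead). The sketch below is illustrative; variable names are not from the talk.

```python
# Map an SVM discriminant value y_i(t) to a posterior p(d_i(t)=1 | y_i(t))
# with a sigmoid fit on held-out scores.  A histogram of scores, as on the
# slide, would be a non-parametric alternative.
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(scores: np.ndarray, labels: np.ndarray):
    """Fit p(d=1 | y) = 1 / (1 + exp(a*y + b)) by maximum likelihood."""
    def nll(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(a * scores + b))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    a, b = minimize(nll, x0=np.array([-1.0, 0.0])).x
    return lambda y: 1.0 / (1.0 + np.exp(a * y + b))

# posterior = fit_sigmoid(dev_scores, dev_labels)
# p_feature = posterior(svm.decision_function(frame_vector))
```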

  25. Landmarks → Words: The Problem of Pronunciation Variability

  26. The Problem of Pronunciation Variability (Livescu, 2004)
  Observed pronunciations of “probably” (with counts): p r aa b iy (2), p r ay (1), p r aw l uh (1), p r ah b iy (1), p r aa lg iy (1), p r aa b uw (1), p ow ih (1), p aa iy (1), p aa b uh b l iy (1), p aa ah iy (1)

  27. Landmarks → Words. Phonological Model #1: Underspecified Lexicon

  28. Underspecified Lexicon (Goldsmith, 1975)
  • Once the listener hears [+vowel], the features sonorant, continuant, strident, lips, blade, and voiced are meaningless and redundant.
  • There are no [+sonorant,+strident] or [+sonorant,-voiced] phonemes. Given [+strident], the features [sonorant, strident] are meaningless and redundant.
  • If /s/ is in a consonant cluster, the listener only needs to hear [+strident]; no other features are necessary, because no word could have anything but an /s/ in this position.

  29. Computational Model: Select Landmarks to Distinguish Confusable Word Pairs
  • Rationale: baseline HMM-based system already provides high-quality hypotheses
    • 1-best error rate from N-best lists: 24.4% (RT-03 dev set)
    • Oracle error rate: 16.2%
  • Method:
    • Use an HMM-NN hybrid system to generate a first-pass word lattice
    • Use landmark detection only where necessary, to correct errors made by the baseline recognition system
  • Example (fsh_60386_1_0105420_0108380):
    • Ref: that cannot be that hard to sneak onto an airplane
    • Hyp: they can be a that hard to speak on an airplane

  30. Identifying Confusable Hypotheses
  (Figure: confusion network built from the lattice for the slide-29 example, with competing words such as that/they, can/can’t, sneak/speak, on/onto, and *DEL*)
  • Use existing alignment algorithms for converting lattices into confusion networks (Mangu, Brill & Stolcke, 2000)
  • Hypotheses ranked by posterior probability
  • Generated from n-best lists without 4-gram or pronunciation model scores (→ higher WER compared to lattices)
  • Multi-words (“I_don’t_know”) were split prior to generating confusion networks

  31. Identifying Confusable Hypotheses
  • How much can be gained from fixing confusions?
  • Baseline error rate: 25.8%
  • Oracle error rates when selecting the correct word from the confusion set: (table on slide)

  32. Selecting Relevant Landmarks
  • Convert each word into a fixed-length vector
  • Dimensions of the vector = frequencies of occurrence, in the word, of selected binary landmark-pair relationships:
    • Manner landmarks: precedence, e.g. V ≺ Son. Cons.
    • Manner & place features: overlap, e.g. Stop ○ +blade
    • Not all possible relations are used; dimensionality of the feature space is 40-60
  • The vector for each word should be derived from actual pronunciation data, e.g., from landmarks automatically detected in a very large speech corpus; unfortunately, due to time constraints, that experiment hasn’t been run yet. In the meantime, the vector for each word was derived from a standard pronunciation dictionary (Pronlex).
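A minimal sketch of the vector construction, assuming a toy landmark inventory and only precedence relations; the workshop's actual 40-60 dimensional relation set and landmark labels are not reproduced here.

```python
# Count selected precedence relations "a < b" in a word's landmark string.
# The landmark labels and the relation list below are illustrative only.
from itertools import combinations
from collections import Counter

RELATIONS = [("FR", "SC"), ("FR", "SIL"), ("SIL", "ST"), ("V", "SC")]

def word_vector(landmarks: list[str]) -> list[int]:
    counts = Counter()
    for a, b in combinations(landmarks, 2):   # every pair with a before b
        counts[(a, b)] += 1
    return [counts[rel] for rel in RELATIONS]

# A rough landmark string (fricative, sonorant consonant, vowel, closure, stop):
print(word_vector(["FR", "SC", "V", "SIL", "ST"]))   # -> [1, 1, 1, 0]
```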

  33. Vector-Space Word Representation

  34. Maximum-Entropy Discrimination
  • Use a maxent classifier; here: y = words, x = acoustics, f = landmark relationships
  • Why a maxent classifier?
    • Discriminative classifier
    • Possibly large set of confusable words
    • Later addition of non-binary features
  • Training: ideally on real landmark detection output; here: on entries from the lexicon (includes pronunciation variants)
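A sketch of one confusion-set classifier, realized here as multinomial logistic regression (a standard maxent formulation) over the landmark-relation vectors; scikit-learn and the dictionary-of-variants input format are assumptions, not the workshop's implementation.

```python
# Train a maxent-style classifier for one confusion set (e.g. sneak vs. speak)
# from the landmark-relation vectors of each word's pronunciation variants.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_confusion_set_model(word_vectors: dict[str, list[list[int]]]):
    """word_vectors maps each word in the confusion set to the feature
    vectors of its pronunciation variants (stand-ins for lexicon entries)."""
    X, y = [], []
    for word, vectors in word_vectors.items():
        X.extend(vectors)
        y.extend([word] * len(vectors))
    model = LogisticRegression(max_iter=1000)
    model.fit(np.array(X), np.array(y))
    return model

# model = train_confusion_set_model({"sneak": [...], "speak": [...]})
# model.predict_proba([detected_landmark_vector])
```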

  35. Maximum-Entropy Discrimination
  • Example: sneak vs. speak
  • A different model is trained for each confusion set → landmarks can have different weights in different contexts
  • Example weights:
    • speak: SC ○ +blade -2.47, FR < SC -2.47, FR < SIL 2.11, SIL < ST 1.75, …
    • sneak: SC ○ +blade 2.47, FR < SC 2.47, FR < SIL -2.11, SIL < ST -1.75, …

  36. Landmark Queries
  • Select the N landmarks with the highest weights
  • Ask the landmark detection module to produce scores for the selected landmarks within the word boundaries given by the baseline system
  • Example: for “sneak” within boundaries 1.70-1.99, the confusion network asks the landmark detectors “SC ○ +blade?” and receives back scores (0.75, 0.56)

  37. Landmarks → Words. Phonological Model #2: Articulatory Phonology

  38. Articulatory Phonology: Lips and Tongue Have Different Variability

  39. Articulatory Phonology (Browman and Goldstein, 1990; slide from Livescu and Glass, 2004)
  (Figure: articulatory feature tiers TB-LOC, VELUM, TT-LOC, LIP-OP, TB-OPEN, TT-OPEN, VOICING)
  • Many pronunciation phenomena can be parsimoniously described as resulting from asynchrony and reduction of quasi-independent speech articulators.
  • warmth [w ao r m p th]: phone insertion?
  • I don’t know [ah dx uh_n ow_n]: phone deletion??
  • several [s eh r v ax l]: exchange of two phones???
  • instruments [ih_n s ch em ih_n n s], everybody [eh r uw ay]

  40. Brief Review: Bayesian Networks
  • Each node in the graph is a random variable (example graph with nodes G, H, L)
  • An arrow represents dependent probabilities
  • Probability distributions: one per variable
    • Number of columns = number of different values the variable can take
    • Number of rows = number of different values the variable’s parents can take
  • Modularity of the graph → modularity of computation; very complicated models can be used for speech recognition with not-so-bad computational cost
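A minimal sketch of the point being made, using a chain G → H → L; the dependency structure and all numbers are invented for illustration (the slide only names the nodes G, H, L).

```python
# Each variable gets one table, with one row per combination of parent values,
# and the joint probability factorizes over the graph: P(g,h,l) = P(g)P(h|g)P(l|h).
P_G = {0: 0.6, 1: 0.4}                      # P(G)
P_H_given_G = {0: {0: 0.9, 1: 0.1},         # P(H | G): one row per value of G
               1: {0: 0.3, 1: 0.7}}
P_L_given_H = {0: {0: 0.8, 1: 0.2},         # P(L | H): one row per value of H
               1: {0: 0.25, 1: 0.75}}

def joint(g: int, h: int, l: int) -> float:
    """P(G=g, H=h, L=l) under the factorization G -> H -> L."""
    return P_G[g] * P_H_given_G[g][h] * P_L_given_H[h][l]

# Marginal P(L=1), obtained by summing the joint over the other variables:
print(sum(joint(g, h, 1) for g in (0, 1) for h in (0, 1)))
```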

  41. Dynamic Bayesian Network Model (Livescu and Glass, 2004)
  • The model is implemented as a dynamic Bayesian network (DBN): a representation, via a directed graph, of a distribution over a set of variables that evolve through time
  • Example DBN with three articulators
  • Asynchrony model: Pr(async^{1;2} = a) = Pr(|ind^1 - ind^2| = a), where ind^i is given by the baseform pronunciations
  • Example table from the slide (columns a = 0, 1, 2, 3, 4, …):
    row 0: .7 .2 .1 0 0 …
    row 1: 0 .7 .2 .1 0 …
    row 2: 0 0 .7 .2 .1 …
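The asynchrony equation can be sketched directly: given distributions over the two articulators' index variables, the probability of each degree of asynchrony is the probability that the indices differ by that amount. The index distributions below are placeholders, and treating ind1 and ind2 as independent is an assumption made only for this illustration.

```python
# Pr(async = a) = Pr(|ind1 - ind2| = a), computed by marginalizing over
# placeholder distributions on the two index variables.
import numpy as np

def async_distribution(p_ind1: np.ndarray, p_ind2: np.ndarray) -> np.ndarray:
    """Return Pr(async = a) for a = 0 .. n-1, assuming ind1 and ind2 are
    independent (an assumption made only for this illustration)."""
    n = max(len(p_ind1), len(p_ind2))
    p_async = np.zeros(n)
    for i, pi in enumerate(p_ind1):
        for j, pj in enumerate(p_ind2):
            p_async[abs(i - j)] += pi * pj
    return p_async

# Two articulators whose index variables are each spread over 3 positions:
print(async_distribution(np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.3, 0.2])))
```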

  42. A DBN Model of Articulatory Phonology for Speech Recognition (Livescu and Glass, 2004)
  • word_t: word ID at frame #t
  • wdTr_t: word transition?
  • ind_t^i: which gesture, from the canonical word model, should articulator i be trying to implement?
  • async_t^{i;j}: how asynchronous are articulators i and j?
  • U_t^i: canonical setting of articulator #i
  • S_t^i: surface setting of articulator #i

  43. Incorporating the SVMs: An SVM-DBN Hybrid Model
  Example for the words “LIKE A”:
  • Canonical form: tongue closed, tongue mid, tongue front, tongue open, …
  • Surface form: tongue front, semi-closed, tongue front, tongue open, …
  • Manner: glide, front vowel, …; Place: palatal, …
  • SVM outputs: p(g_PGR(x) | palatal glide release), p(g_GR(x) | glide release)
  • x: multi-frame observation including spectrum, formants, & auditory model
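A sketch of one common way to feed classifier posteriors into a generative model as soft evidence: divide each posterior by the class prior to get a scaled likelihood. This is a standard hybrid-system device, not necessarily the exact mechanism used at the workshop; the numbers are placeholders.

```python
# Convert an SVM posterior p(feature | x) into a scaled likelihood
# p(x | feature) / p(x) by dividing by the feature's prior, so it can
# multiply into a DBN's probability tables as soft evidence.
import numpy as np

def soft_evidence(posteriors: np.ndarray, priors: np.ndarray) -> np.ndarray:
    """Return scaled likelihoods, one entry per value of the hidden feature."""
    scaled = posteriors / priors
    return scaled / scaled.sum()     # renormalize for numerical convenience

# e.g. SVM says p(+blade | x) = 0.9, training prior p(+blade) = 0.4
print(soft_evidence(np.array([0.9, 0.1]), np.array([0.4, 0.6])))
```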

  44. Technological Evaluation

  45. Acoustic Feature Selection
  1. Accuracy per frame (%), stop releases only, NTIMIT
  2. Word error rate, lattice rescoring, RT03-devel, one talker (WARNING: this talker is atypical):
    • Baseline: 15.0% (113/755)
    • Rescoring, place based on MFCCs + formant-based params: 14.6% (110/755)
    • Rescoring, place based on rate-scale + formant-based params: 14.3% (108/755)

  46. SVM Training: Mixed vs. Targeted Data

  47. DBN-SVM: Models Nonstandard Phones
  Example: “I don’t know,” in which /d/ becomes a flap and /n/ becomes a creaky nasal glide

  48. DBN-SVM Design Decisions
  • What kind of SVM outputs should be used in the DBN?
    • Method 1 (EBS/DBN): Generate a landmark segmentation with EBS using the manner SVMs, then apply the place SVMs at appropriate points in the segmentation
      • Force the DBN to use the EBS segmentation, or
      • Allow the DBN to stray from the EBS segmentation, using place/voicing SVM outputs whenever available
    • Method 2 (SVM/DBN): Apply all SVMs in all frames, allow the DBN to consider all possible segmentations
      • In a single pass, or
      • In two passes: (1) manner-based segmentation; (2) place+manner scoring
  • How should we take into account the distinctive feature hierarchy?
  • How do we avoid “over-counting” evidence?
  • How do we train the DBN (feature transcriptions vs. SVM outputs)?

  49. DBN-SVM Rescoring Experiments
  For each lattice edge:
  • SVM probabilities are computed over the edge duration and used as soft evidence in the DBN
  • The DBN computes a score S ≈ P(word | evidence)
  • The final edge score is a weighted interpolation of the baseline scores and the EBS/DBN or SVM/DBN score

  50. Discriminative Pronunciation Model
  RT-03 dev set: 35,497 words, 2,930 segments, 36 speakers (Switchboard and Fisher data)
  • Rescored: product combination of old and new probability distributions, with weights 0.8 (old) and 0.2 (new)
  • Correct/incorrect decision changed in about 8% of all cases
  • Slightly higher number of fixed errors vs. new errors
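A sketch of the rescoring combination described on slides 49-50: a weighted product (log-linear) combination of the baseline probability and the new model's probability, using the 0.8/0.2 weights quoted above. Whether the workshop combined raw probabilities exactly this way is an assumption.

```python
# Weighted product combination of the baseline edge probability and the
# new model's probability, computed in the log domain.
import math

def rescore_edge(baseline_prob: float, new_prob: float,
                 w_old: float = 0.8, w_new: float = 0.2) -> float:
    """Return exp(w_old * log p_old + w_new * log p_new)."""
    log_score = w_old * math.log(baseline_prob) + w_new * math.log(new_prob)
    return math.exp(log_score)

# An edge the baseline likes but the new pronunciation model doubts:
print(rescore_edge(baseline_prob=0.6, new_prob=0.2))
```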
