



Presentation Transcript


  1. Speech Analysis and Cognition Using Category-Dependent Features in a Model of the Central Auditory System Woojay Jeon Research Advisor: Fred Juang School of Electrical and Computer Engineering Georgia Institute of Technology October 8, 2006

  2. Synopsis of Project
• One of very few attempts to address auditory modeling beyond the periphery (ear, cochlea, even auditory nerve fibers) for ASR;
• Implemented a model (periphery + 3D cortical model) to calculate the cortical response to stimuli;
• Investigated cortical representations in ASR; conducted a comprehensive comparative study to understand robustness in auditory representations;
• Developed a methodology for analyzing robustness based on matched-filter theory;
• Spawned a new development based on category-dependent feature selection and hierarchical pattern recognition.

  3. Matched Filtering
• Cortical response: r(l) = ∫_{R(l)} p(y) w(y; l) dy
• p(y): power (auditory) spectrum
• w(y; l): response area at cortical location l = {x, s, f}
• r(l): cortical response
• R(l): non-zero frequency range of w(y; l)
• By the Cauchy-Schwarz inequality, r(l)² is maximized when p(y) ∝ w(y; l) over R(l).
• If R(l) includes enough spectral peaks, the spectral envelope v(y) can also be used in place of p(y).
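The Cauchy-Schwarz argument above can be illustrated numerically (a minimal sketch; the response area `w` and all numbers are invented for illustration, not taken from the cortical model): over spectra of equal energy, the inner-product response is largest when the spectrum is proportional to the response area.

```python
import numpy as np

def cortical_response(p, w):
    # Discrete analogue of r(l) = integral over R(l) of p(y) w(y; l) dy
    return float(np.dot(p, w))

rng = np.random.default_rng(0)
w = rng.random(64)            # hypothetical response area w(y; l)
w /= np.linalg.norm(w)        # unit energy

p_matched = 3.0 * w           # spectrum proportional to w: Cauchy-Schwarz equality case
best = cortical_response(p_matched, w)   # equals 3.0 since ||w|| = 1

# Any other spectrum of equal energy responds no more strongly.
for _ in range(200):
    p_other = rng.random(64)
    p_other *= np.linalg.norm(p_matched) / np.linalg.norm(p_other)
    assert cortical_response(p_other, w) <= best + 1e-9
```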

  4. Signal-Respondent Neurons (figure: panels (a)-(d); all points differ in phase)

  5. Noise-Respondent Neurons (figure: panels (a)-(b); all points differ in phase)

  6. Noise Robustness
• Assuming a conventional Fourier power spectrum with stationary white noise as the distortion, it can be shown mathematically that S_{r,l} > S_{p,l} > S_{r,q}, where:
• S_{r,l}: SNR of a signal-respondent neuron
• S_{p,l}: SNR of the auditory spectrum in R(l)
• S_{r,q}: SNR of a noise-respondent neuron, where R(q) = R(l)
• Modeling inhibition can increase S_{r,l} even further.
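Part of this ordering can be checked in a toy setting (a sketch: the Gaussian spectral peak, noise level, and response areas are all invented and do not come from the actual model): a neuron whose response area matches the signal's spectral shape keeps a higher output SNR under additive white noise than a neuron with a flat response area.

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.arange(64)
s = np.exp(-0.5 * ((y - 20) / 3.0) ** 2)   # hypothetical spectral peak of the signal
w_sig = s / np.linalg.norm(s)              # signal-respondent: matched to s
w_noise = np.ones(64) / np.sqrt(64)        # noise-respondent: flat response area

def output_snr(w, trials=2000, sigma=0.5):
    # Ratio of squared signal response to mean squared white-noise response
    sig = np.dot(w, s) ** 2
    noise_power = np.mean([np.dot(w, rng.normal(0.0, sigma, 64)) ** 2
                           for _ in range(trials)])
    return sig / noise_power

snr_sig = output_snr(w_sig)
snr_noise = output_snr(w_noise)
assert snr_sig > snr_noise   # matched response area is more noise-robust
```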

  7. Noise Robustness Experiments
• S_r(A_i): overall SNR of signal-respondent neurons of phoneme w_i
• S_r(U): overall SNR of the entire cortical response
• S_p: overall SNR of the auditory spectrum
(figures: Vowel /iy/, Fricative /dh/, Affricate /jh/, Plosive /p/)

  8. Category-Dependent Feature Selection • LVF: Low Variance Filter; HAF: High Activation Filter; NR: Neuron Reduction (via Clustering and Remapping); PCA: Principal Component Analysis

  9. Hierarchical Classification
• Single-Layer Classifier: uses standard Bayesian decision theory to classify a test observation into one of N classes, using class-wise discriminants that estimate the a posteriori probabilities.
• Hierarchical (Two-Layer) Classifier: a two-stage process that first classifies a test observation into one of M categories, then into one of |C_n| classes within that category.
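The two-stage decision can be sketched as follows (the posterior values are toy numbers, purely illustrative): layer 1 picks the most probable of M categories, and layer 2 picks the most probable class within it.

```python
import numpy as np

def two_layer_classify(p_category, p_class_given_cat):
    # Layer 1: classify into 1 of M categories
    m = int(np.argmax(p_category))
    # Layer 2: classify into 1 of |C_m| classes within category m
    c = int(np.argmax(p_class_given_cat[m]))
    return m, c

p_category = np.array([0.3, 0.7])                  # M = 2 category posteriors
p_class_given_cat = [np.array([0.6, 0.4]),         # classes in category 0
                     np.array([0.2, 0.5, 0.3])]    # classes in category 1
m, c = two_layer_classify(p_category, p_class_given_cat)
assert (m, c) == (1, 1)   # category 1, then its class 1
```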

  10. Searching for a Categorization • The phoneme-wise variances are arranged into N orderings (each ordering with a different “seed” phoneme). • For each ordering, a CART-style splitting routine is applied to create a “phoneme tree,” from which a list of candidate categorizations is obtained. • We search for the categorization with the best hierarchical classification performance over the training data (using initial models).
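A single CART-style split from the routine above might look like the following minimal sketch (the variance values and phoneme labels are invented, and this greedy criterion is an assumed simplification of the actual splitting rule): given an ordering of phonemes, try every contiguous split point and keep the one minimizing the summed within-group variance.

```python
import numpy as np

def best_split(ordering, var):
    # ordering: phonemes sorted by similarity to a seed phoneme
    # var: phoneme-wise variance values; split into two contiguous groups
    vals = np.array([var[p] for p in ordering])
    best_k, best_cost = 1, np.inf
    for k in range(1, len(vals)):
        # weighted within-group variance of the two candidate groups
        cost = vals[:k].var() * k + vals[k:].var() * (len(vals) - k)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return ordering[:best_k], ordering[best_k:]

var = {"iy": 0.11, "ih": 0.12, "eh": 0.13, "p": 0.60, "t": 0.62}
left, right = best_split(["iy", "ih", "eh", "p", "t"], var)
assert left == ["iy", "ih", "eh"] and right == ["p", "t"]
```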

  11. Model Training • CI features are used to construct category models, which are refined with MCE training

  12. Hierarchical Classification

  13. Phoneme Categorization (figure: categorization)

  14. Phoneme Classification Results • Classification rates (%) for clean speech in TIMIT database (48 phonemes) • Classification rates (%) for varying SNR, features, and classifier configurations (*74.51 when 48 phonemes are mapped down to 39 according to convention) • SL: Single-Layer Classifier; CI: Category-Independent Features; CD: Category-Dependent Features; TL: Two-Layer (Hierarchical) Classifier (results produced after MCE training)

  15. Generalization of the MCE Method Qiang Fu Research Advisor: Fred Juang School of Electrical and Computer Engineering Georgia Institute of Technology October 8, 2006

  16. Synopsis
• Excellent detector results (6-class, 14-class, 44-class) reported; use of detector results as "independent" information for rescoring.
• Generalization of the minimum-error principle to large-vocabulary continuous speech recognition:
• Definition of competing events
• Selection of training units (state, phone, ...)
• Use of word graph
• Unequal error weights.

  17. Rescoring Using MVE Detectors
• We investigate the effects of combining the conventional ASR paradigm with phonetic-class detectors trained using MVE.
• We keep the segmentation information from the Viterbi decoder, which may affect the final improvement.
• The rescoring algorithm is flexible and can be adapted to different tasks.

  18. Minimum Verification Error
Assume there are M classes and K training tokens. A token labeled in the ith class may generate one type I (miss) error and M-1 type II (false alarm) errors. Key scores related to these two types of error are combined into an overall performance objective. Here, 1(·) is the indicator function, ℓ(·) is a sigmoid function, and κ_I and κ_II are penalty weights for miss and false-alarm errors. A descent algorithm is then applied to minimize the overall error objective.
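The per-token objective described above might be sketched as follows (the detector scores are invented, and treating each detector's log-likelihood ratio as the raw score passed through the sigmoid is an assumption): the labeled class contributes one smoothed miss term, and each of the other M-1 detectors contributes a smoothed false-alarm term.

```python
import numpy as np

def sigmoid(d, alpha=1.0):
    # Smooth surrogate l(.) for the 0/1 error indicator
    return 1.0 / (1.0 + np.exp(-alpha * d))

def mve_loss(scores, label, kappa_I=1.0, kappa_II=1.0):
    # scores[j]: log-likelihood ratio of detector j for one training token
    miss = sigmoid(-scores[label])         # type I: low target score -> miss
    false_alarms = sum(sigmoid(scores[j])  # type II: high non-target score
                       for j in range(len(scores)) if j != label)
    return kappa_I * miss + kappa_II * false_alarms

# Token labeled as class 0; detector 0 fires strongly, the others do not.
loss = mve_loss(np.array([3.0, -2.0, -1.5]), label=0)
```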

  19. Rescoring Paradigm (block diagram)
Speech Signals → Conventional Decoder → Decoding Scores, Rescoring Candidates → Rescoring Algorithm
Speech Signals → MVE Detectors 1 ... M → Detector Scores → Rescoring Algorithm
Neyman-Pearson Decision Criteria & Thresholds → Rescoring Algorithm

  20. Rescoring Methods (I)
Suppose there are M classes of sub-word units, and hence M sets of detectors, each consisting of a target model and an anti-model. For a segment decoded as the ith class with a given log likelihood, its jth (j = 1, 2, ..., M) detector scores are the target-model and anti-model log likelihoods, respectively; the likelihood ratio for the jth detector is their difference. The score for the test segment belonging to class i after combination is denoted accordingly.
Method 1: Naive-Adding (NA). We simply add the decoder score and the detector score together. The reason for subtracting the anti-model score is to scale the decoding score into a dynamic range close to that of the likelihood ratio. This step is also taken in the following two methods.

  21. Rescoring Methods (II)
Method 2: Competitive Rescoring (CR). We add the decoder score and a "competitive" score, which is a "distance measure" between the claimed class and its competitors.
Method 3: Remodeled Posterior Probability (RPP). We compute the "remodeled posterior probability".
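The three methods might be sketched as below. The slides do not fully specify the competitive distance in CR or the posterior remodeling in RPP, so the best-competitor distance and the softmax-style normalization here are assumptions for illustration only; all score values are invented.

```python
import numpy as np

def rescore(decoder_ll, target_ll, anti_ll, i, method="NA"):
    # decoder_ll: decoder log likelihood for a segment decoded as class i
    # target_ll[j], anti_ll[j]: detector j's target / anti-model log likelihoods
    lr = target_ll - anti_ll          # per-detector log-likelihood ratio
    scaled = decoder_ll - anti_ll[i]  # scale decoder score toward the LR range
    if method == "NA":                # Method 1: naive-adding
        return scaled + lr[i]
    if method == "CR":                # Method 2: competitive rescoring (assumed form)
        return scaled + (lr[i] - np.delete(lr, i).max())
    if method == "RPP":               # Method 3: remodeled posterior (assumed form)
        return scaled + lr[i] - np.log(np.sum(np.exp(lr)))
    raise ValueError(method)

# Segment decoded as class 0, with two detectors.
s_na = rescore(-5.0, np.array([-1.0, -3.0]), np.array([-2.0, -2.0]), 0, "NA")
```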

  22. Experimental Setup
• Experiments are conducted on the TIMIT database (3696 training utterances and 1344 test utterances; 119,580 training tokens for the MVE detectors) using three-state HMMs.
• Rescoring candidates are generated using HVite. The decoder models are trained by the Maximum Likelihood (ML) method, and the detectors are trained by MVE. Performance is examined on 6-class (Rabiner and Juang, 1993), 14-class (Deller et al., 1999), and 48-class (Lee and Hon, 1989) broad phonetic categories, respectively.
• Models for both the decoder and the detectors are trained on 39-dimensional MFCC features (12 MFCC + 12 delta + 12 acceleration + 3 log energy).
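The 39-dimensional layout can be reproduced with a standard regression-delta computation (a sketch: the static cepstra below are random placeholders, and the +/-2-frame delta window is an assumption rather than a setting stated in the slides).

```python
import numpy as np

def delta(feats, window=2):
    # Regression deltas over +/- `window` frames (HTK-style formula)
    T = len(feats)
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    num = sum(t * (padded[window + t: window + t + T] -
                   padded[window - t: window - t + T])
              for t in range(1, window + 1))
    den = 2 * sum(t * t for t in range(1, window + 1))
    return num / den

T = 50
c = np.random.default_rng(2).random((T, 12))                 # 12 static cepstra
e = np.log(np.random.default_rng(3).random((T, 1)) + 1e-3)   # log energy
static = np.hstack([c, e])                                   # 13 static dims
d1, d2 = delta(static), delta(delta(static))                 # deltas, accelerations
feats39 = np.hstack([static, d1, d2])
assert feats39.shape == (T, 39)   # 12 MFCC + 12 delta + 12 acc + 3 log energy
```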

  23. Rescoring Performance
• Need to perform phone or word rescoring

  24. Conclusions and Future Work
• Three different rescoring methods are introduced; the experimental results show that creating a pseudo-phone graph and re-computing the posterior probability achieves the best performance enhancement.
• MVE-trained detectors show promising results in helping conventional ASR techniques. The detectors can be optimized in terms of features or attributes (e.g., features representing articulatory knowledge), and used for re-ranking the decoded candidates.
• Bottom-up event detection and information fusion will be conducted on continuous speech signals in the future.

  25. MCE Generalization
MCE criterion formulation:
1. Define the performance objective and the corresponding task evaluation measure;
2. Specify the target event (i.e., the correct label), the competing events (i.e., the incorrect hypotheses from the recognizer), and the corresponding models;
3. Construct the objective function and set its hyper-parameters;
4. Choose a suitable optimization method to update the parameters.
In this presentation, only the first step, which is also the most fundamental one, is discussed due to limited space. This work is the first part of an extensive generalization of the MCE training criterion.
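Step 3 above is commonly realized with the classic MCE misclassification measure: the target discriminant against a soft maximum over competitors, passed through a sigmoid. A sketch with invented scores (η and α are the usual hyper-parameters; their values here are arbitrary):

```python
import numpy as np

def mce_loss(g, label, eta=2.0, alpha=1.0):
    # g[j]: discriminant (e.g. log likelihood) of class j for one token
    competitors = np.delete(g, label)
    # Soft maximum over competing classes (eta -> inf approaches the hard max)
    anti = np.log(np.mean(np.exp(eta * competitors))) / eta
    d = -g[label] + anti                     # d > 0 means misclassified
    return 1.0 / (1.0 + np.exp(-alpha * d))  # smooth surrogate for 0/1 error

loss_correct = mce_loss(np.array([2.0, -1.0, 0.0]), label=0)  # target wins
loss_wrong = mce_loss(np.array([-1.0, 2.0, 0.0]), label=0)    # competitor wins
assert loss_correct < 0.5 < loss_wrong
```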

  26. Strict Boundary and Relaxed Boundary (figure: labeled word A, target words A A B ..., and competing words between start and end markers, shown for the strict and relaxed boundary cases)

  27. Experimental Setup
• Experiments are conducted on the WSJ0 database (7077 training utterances and 330 test utterances);
• All models are three-state HMMs with 8 Gaussian mixtures per state; there are 7385 physical models, 19075 logical models, and 2329 tied states in total;
• The models are constructed on 39-dimensional MFCC feature vectors (12 MFCC + 12 delta + 12 acceleration + 3 log energy);
• The baseline recognizer follows the standard large-vocabulary continuous speech recognition recipe using HTK;
• We investigated three cases of maximizing the GPP at different training levels (word, phone, state).

  28. Results Table 1: Word Error Rate (WER) and Sentence Error Rate (SER) for WSJ0-eval using different training levels

  29. Conclusion & Future Work
• We generalize the criterion for minimum classification error (MCE) training and investigate its impact on recognition performance. This paper is the first part of an extensive generalization of MCE training.
• The experiments are conducted within the framework of "maximizing posterior probability". The impact of different training levels is investigated, and the phone level achieved the best performance;
• Further investigation of various tasks based on this generalized framework is in progress.
