Automatic speech recognition on the articulation index corpus


Presentation Transcript


  1. Automatic speech recognition on the articulation index corpus
  Guy J. Brown and Amy Beeston
  Department of Computer Science, University of Sheffield
  g.brown@dcs.shef.ac.uk

  2. Aims
  • Eventual aim is to develop a ‘perceptual constancy’ front-end for automatic speech recognition (ASR).
  • Should be compatible with the Watkins et al. findings, but also validated on a ‘real world’ ASR task:
    • wider vocabulary
    • range of reverberation conditions
    • variety of speech contexts
    • naturalistic speech, rather than interpolated stimuli
    • consider phonetic confusions in reverberation in general
  • Initial ASR studies use the articulation index corpus.
  • Aim is to compare human performance (Amy’s experiment) and machine performance on the same task.

  3. Articulation index (AI) corpus
  • Recorded by Jonathan Wright (University of Pennsylvania).
  • Intended for speech-recognition-in-noise experiments similar to those of Fletcher.
  • Suggested to us by Hynek Hermansky; utterances are similar to those used by Watkins:
    • American English
    • Target syllables are mostly nonsense, but some correspond to real words (including “sir” and “stir”)
    • Target syllables are embedded in a context sentence drawn from a limited vocabulary

  4. Grammar for Amy’s subset of AI corpus
  $cw1 = YOU | I | THEY | NO-ONE | WE | ANYONE | EVERYONE | SOMEONE | PEOPLE;
  $cw2 = SPEAK | SAY | USE | THINK | SENSE | ELICIT | WITNESS | DESCRIBE | SPELL | READ | STUDY | REPEAT | RECALL | REPORT | PROPOSE | EVOKE | UTTER | HEAR | PONDER | WATCH | SAW | REMEMBER | DETECT | SAID | REVIEW | PRONOUNCE | RECORD | WRITE | ATTEMPT | ECHO | CHECK | NOTICE | PROMPT | DETERMINE | UNDERSTAND | EXAMINE | DISTINGUISH | PERCEIVE | TRY | VIEW | SEE | UTILIZE | IMAGINE | NOTE | SUGGEST | RECOGNIZE | OBSERVE | SHOW | MONITOR | PRODUCE;
  $cw3 = ONLY | STEADILY | EVENLY | ALWAYS | NINTH | FLUENTLY | PROPERLY | EASILY | ANYWAY | NIGHTLY | NOW | SOMETIME | DAILY | CLEARLY | WISELY | SURELY | FIFTH | PRECISELY | USUALLY | TODAY | MONTHLY | WEEKLY | MORE | TYPICALLY | NEATLY | TENTH | EIGHTH | FIRST | AGAIN | SIXTH | THIRD | SEVENTH | OFTEN | SECOND | HAPPILY | TWICE | WELL | GLADLY | YEARLY | NICELY | FOURTH | ENTIRELY | HOURLY;
  $test = SIR | STIR | SPUR | SKUR;
  ( !ENTER $cw1 $cw2 $test $cw3 !EXIT )
  Audio demos
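
To make the sentence structure defined by this grammar concrete, here is a minimal Python sketch (not part of the original experiments) that samples utterance transcriptions by filling each slot at random; the word lists are abbreviated for brevity.

    import random

    # Abbreviated versions of the word lists above; the full lists are on the slide.
    CW1 = ["YOU", "I", "THEY", "NO-ONE", "WE"]
    CW2 = ["SPEAK", "SAY", "USE", "THINK", "SENSE"]
    CW3 = ["ONLY", "STEADILY", "EVENLY", "ALWAYS", "NOW"]
    TEST = ["SIR", "STIR", "SPUR", "SKUR"]

    def sample_sentence():
        """Fill the $cw1 $cw2 $test $cw3 slots of the grammar at random."""
        return " ".join(random.choice(words) for words in (CW1, CW2, TEST, CW3))

    print(sample_sentence())  # e.g. "THEY SAY STIR ONLY"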

  5. ASR system
  • HMM-based phone recogniser:
    • implemented in HTK
    • monophone models
    • 20 Gaussian mixtures per state
    • adapted from scripts by Tony Robinson/Dan Ellis
  • Bootstrapped by training on TIMIT, then a further 10-12 iterations of embedded training on the AI corpus.
  • Word-level transcripts in the AI corpus were expanded to phones using the CMU pronunciation dictionary.
  • All of the AI corpus was used for training, except the 80 utterances in Amy’s experimental stimuli.
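
As a rough illustration of the transcript-expansion step (not the original scripts), the sketch below maps word-level transcripts to phone sequences using a local copy of the CMU pronunciation dictionary in its classic cmudict-0.7b text format; the file name and the stress-stripping detail are assumptions.

    def load_cmudict(path="cmudict-0.7b"):
        """Parse a CMU-dictionary-format file into {WORD: [phones]}, keeping the first pronunciation."""
        lexicon = {}
        with open(path, encoding="latin-1") as f:
            for line in f:
                if line.startswith(";;;"):        # skip comment lines
                    continue
                word, *phones = line.split()
                word = word.split("(")[0]          # drop alternate-pronunciation markers, e.g. READ(1)
                if word and word not in lexicon:
                    # strip lexical stress digits so AH0/AH1/AH2 all map to AH
                    lexicon[word] = [p.rstrip("012") for p in phones]
        return lexicon

    def expand_to_phones(transcript, lexicon):
        """Expand a word-level transcript (e.g. 'THEY SAY STIR ONLY') to a flat phone sequence."""
        return [ph for word in transcript.split() for ph in lexicon[word.upper()]]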

  6. MFCC features
  • Baseline system trained using mel-frequency cepstral coefficients (MFCCs):
    • 12 MFCCs + energy + delta + acceleration (39 features per frame in total)
    • cepstral mean normalization
  • Baseline system performance on Amy’s clean subset of the AI corpus (80 utterances, no reverberation):
    • 98.75% context words correct
    • 96.25% test words correct
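
A rough Python equivalent of this 39-dimensional front end (using librosa rather than HTK, with cepstral coefficient c0 standing in for the energy term) might look like the sketch below; frame and filterbank settings are left at library defaults rather than matched to the original configuration.

    import librosa
    import numpy as np

    def mfcc_39(wav_path, sr=16000):
        """13 static MFCCs (c0 as an energy stand-in) + deltas + accelerations, with CMN."""
        y, _ = librosa.load(wav_path, sr=sr)
        static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
        static -= static.mean(axis=1, keepdims=True)           # cepstral mean normalization
        delta = librosa.feature.delta(static)
        accel = librosa.feature.delta(static, order=2)
        return np.vstack([static, delta, accel]).T              # shape (n_frames, 39)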

  7. Amy’s experiment
  • Amy’s first experiment used 80 utterances:
    • 20 instances each of the “sir”, “skur”, “spur” and “stir” test words
  • Overall confusion rate was controlled by lowpass filtering at 1, 1.5, 2, 3 and 4 kHz.
  • Same reverberation conditions as in the Watkins et al. experiments.
  • Stimuli were presented to the ASR system exactly as in Amy’s human studies.
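
The lowpass filtering of the stimuli can be sketched as follows; the Butterworth filter and its order are assumptions made for illustration, not details taken from the stimulus preparation.

    from scipy.signal import butter, filtfilt

    def lowpass(y, sr, cutoff_hz, order=8):
        """Zero-phase lowpass filter a signal at the given cutoff (1, 1.5, 2, 3 or 4 kHz here)."""
        b, a = butter(order, cutoff_hz / (sr / 2.0), btype="low")
        return filtfilt(b, a, y)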

  8. Baseline ASR: context words
  • Performance falls as the cutoff frequency decreases.
  • Performance falls as the level of reverberation increases.
  • Near context is substantially better than far context at most cutoffs.

  9. Baseline ASR: test words
  • No particular pattern of confusions in the 2 kHz near-near case, but skur/spur/stir errors become more frequent.

  10. Baseline ASR: human comparison
  • Data are for the 4 kHz cutoff.
  • Even mild reverberation (near-near) causes substantial errors in the baseline ASR system.
  • Human listeners exhibit compensation in the AIC task; the baseline ASR system does not (as expected).
  [Chart: percentage error on far and near test words, baseline ASR system vs. human data (20 subjects)]

  11. Training on auditory features
  [Block diagram of the auditory periphery front end: stimulus, OME, DRNL, hair cell, frame & DCT, recogniser, with an efferent system controlling attenuation (ATT)]
  • 80 channels between 100 Hz and 8 kHz.
  • 15 DCT coefficients + delta + acceleration (45 features per frame).
  • Efferent attenuation set to zero for initial tests.
  • Performance of auditory features on Amy’s clean subset of the AI corpus (80 utterances, no reverberation):
    • 95% context words correct
    • 97.5% test words correct
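
Assuming the auditory model delivers an 80-channel rate map per utterance, the conversion to the 45-dimensional feature vectors described above can be sketched as below; the DCT settings are an assumption, and the original framing and smoothing parameters are not given on the slide.

    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def auditory_features(rate_map):
        """rate_map: array of shape (80 channels, n_frames) from the auditory front end (assumed given)."""
        cepstra = dct(rate_map, type=2, norm="ortho", axis=0)[:15]   # keep 15 DCT coefficients per frame
        delta = librosa.feature.delta(cepstra)
        accel = librosa.feature.delta(cepstra, order=2)
        return np.vstack([cepstra, delta, accel]).T                  # shape (n_frames, 45)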

  12. Auditory features: context words
  • Performance takes a big hit when using auditory features:
    • saturation in the auditory nerve (AN) is likely to be an issue
    • mean normalization
  • Performance falls sharply with decreasing cutoff.
  • As expected, the best performance is in the least reverberated conditions.

  13. Auditory features: test words

  14. Effect of efferent suppression
  [Block diagram of the auditory periphery front end: stimulus, OME, DRNL, hair cell, frame & DCT, recogniser, with an efferent system controlling attenuation (ATT)]
  • The full closed-loop model has not yet been used in ASR experiments.
  • An indication of likely performance is obtained by increasing the efferent attenuation in ‘far’ context conditions.
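
In this open-loop approximation, the efferent effect reduces to applying a fixed attenuation within the auditory model. A minimal sketch of that fixed attenuation is below; where exactly the gain is applied inside the model is an assumption of this sketch, not a detail from the slides.

    import numpy as np

    def efferent_gain(signal, attenuation_db):
        """Scale a signal by a fixed (open-loop) efferent attenuation given in dB."""
        gain = 10.0 ** (-attenuation_db / 20.0)
        return gain * np.asarray(signal)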

  15. Auditory features: human comparison
  • 4 kHz cutoff.
  • Efferent suppression is effective for mild reverberation.
  • Detrimental for the far test word.
  • Currently unable to model the human data, but note that the model is:
    • not closed loop
    • using the same efferent attenuation in all bands
  [Chart: percentage error on far and near test words for no efferent suppression, 10 dB efferent suppression, and human data (20 subjects)]

  16. Confusion analysis: far-near condition
  [Confusion matrices: far-near condition with 0 dB and 10 dB efferent attenuation]
  • Without efferent attenuation, “skur”, “spur” and “stir” are frequently confused as “sir”.
  • These confusions are reduced by more than half when 10 dB of efferent attenuation is applied.
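
The confusion analysis itself is straightforward to reproduce from the recognition output; a minimal sketch (variable names are illustrative) tallies how often each true test word is reported as each of the four alternatives.

    from collections import Counter

    TEST_WORDS = ["sir", "skur", "spur", "stir"]

    def confusion_counts(true_words, recognised_words):
        """Count (true, recognised) pairs, e.g. counts[("stir", "sir")] is stir reported as sir."""
        counts = Counter(zip(true_words, recognised_words))
        return {(t, r): counts[(t, r)] for t in TEST_WORDS for r in TEST_WORDS}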

  17. Confusion analysis: far-far condition
  [Confusion matrices: far-far condition with 0 dB and 10 dB efferent attenuation]
  • Again, “skur”, “spur” and “stir” are commonly reported as “sir”.
  • These confusions are somewhat reduced by 10 dB efferent attenuation, but:
    • the gain is outweighed by more frequent “skur”/“spur”/“stir” confusions
  • Efferent attenuation recovers the dip in the temporal envelope, but not the cues to /k/, /p/ and /t/.

  18. Summary
  • ASR framework is in place for the AI corpus experiments.
  • We can compare human and machine performance on the AIC task.
  • Reasonable performance from the baseline MFCC system.
  • Need to address the shortfall in performance when using auditory features.
  • Haven’t yet tried the full within-channel model as a front end.
