Automatic speech recognition on the articulation index corpus


Presentation Transcript


  1. Automatic speech recognition on the articulation index corpus
  Guy J. Brown and Amy Beeston
  Department of Computer Science, University of Sheffield
  g.brown@dcs.shef.ac.uk

  2. Aims
  • Eventual aim is to develop a ‘perceptual constancy’ front-end for automatic speech recognition (ASR).
  • Should be compatible with the Watkins et al. findings, but also validated on a ‘real world’ ASR task:
    • wider vocabulary
    • range of reverberation conditions
    • variety of speech contexts
    • naturalistic speech, rather than interpolated stimuli
    • consider phonetic confusions in reverberation in general
  • Initial ASR studies use the articulation index corpus.
  • Aim is to compare human performance (Amy’s experiment) and machine performance on the same task.

  3. Articulation index (AI) corpus
  • Recorded by Jonathan Wright (University of Pennsylvania).
  • Intended for speech-recognition-in-noise experiments similar to those of Fletcher.
  • Suggested to us by Hynek Hermansky; utterances are similar to those used by Watkins:
    • American English
    • Target syllables are mostly nonsense, but some correspond to real words (including “sir” and “stir”)
    • Target syllables are embedded in a context sentence drawn from a limited vocabulary

  4. Grammar for Amy’s subset of AI corpus
  $cw1 = YOU | I | THEY | NO-ONE | WE | ANYONE | EVERYONE | SOMEONE | PEOPLE;
  $cw2 = SPEAK | SAY | USE | THINK | SENSE | ELICIT | WITNESS | DESCRIBE | SPELL | READ | STUDY | REPEAT | RECALL | REPORT | PROPOSE | EVOKE | UTTER | HEAR | PONDER | WATCH | SAW | REMEMBER | DETECT | SAID | REVIEW | PRONOUNCE | RECORD | WRITE | ATTEMPT | ECHO | CHECK | NOTICE | PROMPT | DETERMINE | UNDERSTAND | EXAMINE | DISTINGUISH | PERCEIVE | TRY | VIEW | SEE | UTILIZE | IMAGINE | NOTE | SUGGEST | RECOGNIZE | OBSERVE | SHOW | MONITOR | PRODUCE;
  $cw3 = ONLY | STEADILY | EVENLY | ALWAYS | NINTH | FLUENTLY | PROPERLY | EASILY | ANYWAY | NIGHTLY | NOW | SOMETIME | DAILY | CLEARLY | WISELY | SURELY | FIFTH | PRECISELY | USUALLY | TODAY | MONTHLY | WEEKLY | MORE | TYPICALLY | NEATLY | TENTH | EIGHTH | FIRST | AGAIN | SIXTH | THIRD | SEVENTH | OFTEN | SECOND | HAPPILY | TWICE | WELL | GLADLY | YEARLY | NICELY | FOURTH | ENTIRELY | HOURLY;
  $test = SIR | STIR | SPUR | SKUR;
  ( !ENTER $cw1 $cw2 $test $cw3 !EXIT )
  Audio demos
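
To make the sentence structure defined by this grammar concrete, here is a minimal Python sketch (not part of the original experiments) that samples utterance transcriptions by filling each slot at random; the word lists are abbreviated for brevity.

    import random

    # Abbreviated versions of the word lists above; the full lists are on the slide.
    CW1 = ["YOU", "I", "THEY", "NO-ONE", "WE"]
    CW2 = ["SPEAK", "SAY", "USE", "THINK", "SENSE"]
    CW3 = ["ONLY", "STEADILY", "EVENLY", "ALWAYS", "NOW"]
    TEST = ["SIR", "STIR", "SPUR", "SKUR"]

    def sample_sentence():
        """Fill the $cw1 $cw2 $test $cw3 slots of the grammar at random."""
        return " ".join(random.choice(words) for words in (CW1, CW2, TEST, CW3))

    print(sample_sentence())  # e.g. "THEY SAY STIR ONLY"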

  5. ASR system
  • HMM-based phone recogniser:
    • implemented in HTK
    • monophone models
    • 20 Gaussian mixtures per state
    • adapted from scripts by Tony Robinson/Dan Ellis
  • Bootstrapped by training on TIMIT, then a further 10-12 iterations of embedded training on the AI corpus.
  • Word-level transcripts in the AI corpus were expanded to phones using the CMU pronunciation dictionary.
  • All of the AI corpus was used for training, except the 80 utterances in Amy’s experimental stimuli.
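
As a rough illustration of the transcript-expansion step (not the original scripts), the sketch below maps word-level transcripts to phone sequences using a local copy of the CMU pronunciation dictionary in its classic cmudict-0.7b text format; the file name and the stress-stripping detail are assumptions.

    def load_cmudict(path="cmudict-0.7b"):
        """Parse a CMU-dictionary-format file into {WORD: [phones]}, keeping the first pronunciation."""
        lexicon = {}
        with open(path, encoding="latin-1") as f:
            for line in f:
                if line.startswith(";;;"):        # skip comment lines
                    continue
                word, *phones = line.split()
                word = word.split("(")[0]          # drop alternate-pronunciation markers, e.g. READ(1)
                if word and word not in lexicon:
                    # strip lexical stress digits so AH0/AH1/AH2 all map to AH
                    lexicon[word] = [p.rstrip("012") for p in phones]
        return lexicon

    def expand_to_phones(transcript, lexicon):
        """Expand a word-level transcript (e.g. 'THEY SAY STIR ONLY') to a flat phone sequence."""
        return [ph for word in transcript.split() for ph in lexicon[word.upper()]]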

  6. MFCC features
  • Baseline system trained using mel-frequency cepstral coefficients (MFCCs):
    • 12 MFCCs + energy + delta + acceleration (39 features per frame in total)
    • cepstral mean normalization
  • Baseline system performance on Amy’s clean subset of the AI corpus (80 utterances, no reverberation):
    • 98.75% context words correct
    • 96.25% test words correct
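
A rough Python equivalent of this 39-dimensional front end (using librosa rather than HTK, with cepstral coefficient c0 standing in for the energy term) might look like the sketch below; frame and filterbank settings are left at library defaults rather than matched to the original configuration.

    import librosa
    import numpy as np

    def mfcc_39(wav_path, sr=16000):
        """13 static MFCCs (c0 as an energy stand-in) + deltas + accelerations, with CMN."""
        y, _ = librosa.load(wav_path, sr=sr)
        static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
        static -= static.mean(axis=1, keepdims=True)           # cepstral mean normalization
        delta = librosa.feature.delta(static)
        accel = librosa.feature.delta(static, order=2)
        return np.vstack([static, delta, accel]).T              # shape (n_frames, 39)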

  7. Amy’s experiment
  • Amy’s first experiment used 80 utterances:
    • 20 instances each of the “sir”, “skur”, “spur” and “stir” test words
  • Overall confusion rate was controlled by lowpass filtering at 1, 1.5, 2, 3 and 4 kHz.
  • Same reverberation conditions as in the Watkins et al. experiments.
  • Stimuli were presented to the ASR system exactly as in Amy’s human studies.
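
The lowpass filtering of the stimuli can be sketched as follows; the Butterworth filter and its order are assumptions made for illustration, not details taken from the stimulus preparation.

    from scipy.signal import butter, filtfilt

    def lowpass(y, sr, cutoff_hz, order=8):
        """Zero-phase lowpass filter a signal at the given cutoff (1, 1.5, 2, 3 or 4 kHz here)."""
        b, a = butter(order, cutoff_hz / (sr / 2.0), btype="low")
        return filtfilt(b, a, y)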

  8. Baseline ASR: context words
  • Performance falls as the cutoff frequency decreases.
  • Performance falls as the level of reverberation increases.
  • Near context is substantially better than far context at most cutoffs.

  9. Baseline ASR: test words
  • No particular pattern of confusions in the 2 kHz near-near case, but skur/spur/stir errors become more frequent.

  10. Baseline ASR: human comparison
  • Data are for the 4 kHz cutoff.
  • Even mild reverberation (near-near) causes substantial errors in the baseline ASR system.
  • Human listeners exhibit compensation in the AIC task; the baseline ASR system does not (as expected).
  [Chart: percentage error on far and near test words, baseline ASR system vs. human data (20 subjects)]

  11. Training on auditory features
  [Block diagram of the auditory periphery front end: stimulus, OME, DRNL, hair cell, frame & DCT, recogniser, with an efferent system controlling attenuation (ATT)]
  • 80 channels between 100 Hz and 8 kHz.
  • 15 DCT coefficients + delta + acceleration (45 features per frame).
  • Efferent attenuation set to zero for initial tests.
  • Performance of auditory features on Amy’s clean subset of the AI corpus (80 utterances, no reverberation):
    • 95% context words correct
    • 97.5% test words correct
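
Assuming the auditory model delivers an 80-channel rate map per utterance, the conversion to the 45-dimensional feature vectors described above can be sketched as below; the DCT settings are an assumption, and the original framing and smoothing parameters are not given on the slide.

    import numpy as np
    import librosa
    from scipy.fftpack import dct

    def auditory_features(rate_map):
        """rate_map: array of shape (80 channels, n_frames) from the auditory front end (assumed given)."""
        cepstra = dct(rate_map, type=2, norm="ortho", axis=0)[:15]   # keep 15 DCT coefficients per frame
        delta = librosa.feature.delta(cepstra)
        accel = librosa.feature.delta(cepstra, order=2)
        return np.vstack([cepstra, delta, accel]).T                  # shape (n_frames, 45)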

  12. Auditory features: context words
  • Performance takes a big hit when using auditory features:
    • saturation in the auditory nerve (AN) is likely to be an issue
    • mean normalization
  • Performance falls sharply with decreasing cutoff.
  • As expected, the best performance is in the least reverberated conditions.

  13. Auditory features: test words

  14. Effect of efferent suppression
  [Block diagram of the auditory periphery front end: stimulus, OME, DRNL, hair cell, frame & DCT, recogniser, with an efferent system controlling attenuation (ATT)]
  • The full closed-loop model has not yet been used in ASR experiments.
  • An indication of likely performance is obtained by increasing the efferent attenuation in ‘far’ context conditions.
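
In this open-loop approximation, the efferent effect reduces to applying a fixed attenuation within the auditory model. A minimal sketch of that fixed attenuation is below; where exactly the gain is applied inside the model is an assumption of this sketch, not a detail from the slides.

    import numpy as np

    def efferent_gain(signal, attenuation_db):
        """Scale a signal by a fixed (open-loop) efferent attenuation given in dB."""
        gain = 10.0 ** (-attenuation_db / 20.0)
        return gain * np.asarray(signal)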

  15. Auditory features: human comparison
  • 4 kHz cutoff.
  • Efferent suppression is effective for mild reverberation.
  • Detrimental for the far test word.
  • Currently unable to model the human data, but note that the model is:
    • not closed loop
    • using the same efferent attenuation in all bands
  [Chart: percentage error on far and near test words for no efferent suppression, 10 dB efferent suppression, and human data (20 subjects)]

  16. Confusion analysis: far-near condition
  [Confusion matrices: far-near condition with 0 dB and 10 dB efferent attenuation]
  • Without efferent attenuation, “skur”, “spur” and “stir” are frequently confused as “sir”.
  • These confusions are reduced by more than half when 10 dB of efferent attenuation is applied.
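
The confusion analysis itself is straightforward to reproduce from the recognition output; a minimal sketch (variable names are illustrative) tallies how often each true test word is reported as each of the four alternatives.

    from collections import Counter

    TEST_WORDS = ["sir", "skur", "spur", "stir"]

    def confusion_counts(true_words, recognised_words):
        """Count (true, recognised) pairs, e.g. counts[("stir", "sir")] is stir reported as sir."""
        counts = Counter(zip(true_words, recognised_words))
        return {(t, r): counts[(t, r)] for t in TEST_WORDS for r in TEST_WORDS}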

  17. Confusion analysis: far-far condition
  [Confusion matrices: far-far condition with 0 dB and 10 dB efferent attenuation]
  • Again, “skur”, “spur” and “stir” are commonly reported as “sir”.
  • These confusions are somewhat reduced by 10 dB efferent attenuation, but:
    • the gain is outweighed by more frequent “skur”/“spur”/“stir” confusions
  • Efferent attenuation recovers the dip in the temporal envelope, but not the cues to /k/, /p/ and /t/.

  18. Summary
  • ASR framework is in place for the AI corpus experiments.
  • We can compare human and machine performance on the AIC task.
  • Reasonable performance from the baseline MFCC system.
  • Need to address the shortfall in performance when using auditory features.
  • Haven’t yet tried the full within-channel model as a front end.
