
Auditory Scene Analysis and Automatic Speech Recognition in Adverse Conditions


Presentation Transcript


  1. Auditory Scene Analysis and Automatic Speech Recognition in Adverse Conditions
  Phil Green, Speech and Hearing Research Group, Department of Computer Science, University of Sheffield.
  With thanks to Martin Cooke, Guy Brown and Jon Barker.
  HCSNet, December 2005

  2. Overview
  • Visual and Auditory Scene Analysis
  • ‘Glimpsing’ in Speech Perception
  • Missing Data ASR
  • Finding the glimpses
  • Current Sheffield Work:
    • Dealing with Reverberation
    • Identifying Musical Instruments
    • Multisource Decoding
    • Speech Separation Challenge

  3. Visual Scenes and Auditory Scenes
  Auditory scenes:
  • Sound is additive
  • Each time/frequency pixel receives contributions from many sound sources
  • Sound source recognition apparently requires reconstruction
  Visual scenes:
  • Objects are opaque
  • Each spatial pixel images a single object
  • Object recognition has to cope with occlusion

  4. ‘Glimpsing’ in auditory scenes: the dominance effect (Cooke)
  Although audio signals mix additively, the occlusion metaphor is a good approximation because of the log-like compression in the auditory system. Consequently, most regions in a mixture are dominated by one source or the other, leaving very few ambiguous regions, even for a pair of speech signals mixed at 0 dB.
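A minimal numerical sketch of the dominance effect described above, assuming two equal-level source waveforms and using a plain log-magnitude STFT as a stand-in for auditory log-like compression (all parameters here are illustrative): count the fraction of time/frequency cells in which one source exceeds the other by a few dB and so effectively occludes it.

```python
import numpy as np

def log_spectrogram(x, n_fft=512, hop=160):
    """Log-magnitude STFT in dB, computed with plain numpy (compressive, like the ear)."""
    frames = np.array([x[i:i + n_fft] * np.hanning(n_fft)
                       for i in range(0, len(x) - n_fft, hop)])
    return 20 * np.log10(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)

def dominance_fraction(x1, x2, margin_db=3.0):
    """Fraction of time/frequency cells in which one of the two (0 dB mixed) sources
    exceeds the other by more than margin_db, i.e. effectively 'occludes' it.
    For pairs of speech signals this fraction is typically close to 1, which is
    the dominance effect the slide describes."""
    s1, s2 = log_spectrogram(x1), log_spectrogram(x2)
    n = min(len(s1), len(s2))
    return float(np.mean(np.abs(s1[:n] - s2[:n]) > margin_db))
```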

  5. Can listeners handle glimpses?

  6. The robustness problem in Automatic Speech Recognition
  [Figure: clean speech; speech + noise; missing data mask (oracle)]
  • Current ASR devices cannot tolerate additive noise, particularly if it is unpredictable
  • Listeners' noise tolerance is 1 or 2 orders of magnitude better in equivalent conditions (Lippmann 97)
  • Can glimpsing be used as the basis for robust ASR? Requirements:
    • Adapt statistical ASR to the incomplete-data case
    • Identify the glimpses

  7. Classification with Missing Data
  A common problem: visual occlusion, sensor failure, transmission losses...
  • Need to evaluate the likelihood that observation vector x was generated by class C, f(x|C)
  • Assume x has been partitioned into reliable and unreliable parts, (x_r, x_u)
  • Two approaches:
    • Imputation: estimate x_u, then proceed as normal
    • Marginalisation: integrate over the possible range of x_u
  • Marginalisation is preferable if there is no need to reconstruct x
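A minimal sketch of the two approaches for a single diagonal-covariance Gaussian class model (the function names, the hard 0/1 reliability mask and the deliberately crude mean-imputation choice are mine, not the original formulation):

```python
import numpy as np
from scipy.stats import norm

def log_lik_imputation(x, reliable, mu, var):
    """Imputation: replace unreliable components with an estimate (here simply the
    class mean, a crude illustrative choice), then score the full vector."""
    x_hat = np.where(reliable, x, mu)
    return norm.logpdf(x_hat, mu, np.sqrt(var)).sum()

def log_lik_marginal(x, reliable, mu, var):
    """Marginalisation: integrate the unreliable components out. With diagonal
    covariance this is just the likelihood of the reliable components alone."""
    r = np.asarray(reliable, dtype=bool)
    return norm.logpdf(x[r], mu[r], np.sqrt(var[r])).sum()

# Usage with made-up numbers: a 4-dimensional 'spectral' vector, two reliable bins.
# x        = np.array([2.0, 5.0, 1.0, 0.5])
# reliable = np.array([1, 1, 0, 0])
# mu, var  = np.array([2.1, 4.8, 3.0, 2.5]), np.ones(4)
# print(log_lik_imputation(x, reliable, mu, var), log_lik_marginal(x, reliable, mu, var))
```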

  8. The Missing Data Likelihood Computation
  • In ASR by continuous-density HMMs, state distributions are Gaussian mixtures with diagonal covariance
  • The marginal is just the reduced-dimensionality distribution
  • The integral over the unreliable components can be approximated by error functions (erf)
  • This is computed independently for each mixture component in the state distribution
  Cooke et al 2001
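A sketch of the bounded-marginal computation for one GMM state, under the usual missing data assumption that an unreliable spectral value only bounds the speech energy from above, so each unreliable component contributes the probability mass between 0 and the observed value, computed via the normal CDF (an erf). Names and the 0-to-observed bounds follow common practice rather than the exact formulation in Cooke et al 2001:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def bounded_marginal_loglik(x, reliable, weights, means, vars_):
    """Log-likelihood of one GMM state under bounded marginalisation.
    x: observed spectral vector; reliable: boolean mask of reliable components;
    weights, means, vars_: (M,), (M, D), (M, D) diagonal-covariance GMM parameters."""
    r = np.asarray(reliable, dtype=bool)
    sd = np.sqrt(vars_)
    comp = np.log(np.asarray(weights, dtype=float))
    for m in range(len(weights)):
        # reliable part: ordinary Gaussian density on the present dimensions
        comp[m] += norm.logpdf(x[r], means[m, r], sd[m, r]).sum()
        # unreliable part: P(0 <= X_u <= x_u) under this component, via the normal
        # CDF (the erf approximation of slide 8); this supplies the counter-evidence
        lo = norm.cdf(0.0,   means[m, ~r], sd[m, ~r])
        hi = norm.cdf(x[~r], means[m, ~r], sd[m, ~r])
        comp[m] += np.sum(np.log(np.maximum(hi - lo, 1e-30)))
    return logsumexp(comp)    # combine the mixture components
```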

  9. Counter-evidence from bounds
  [Figure: observed spectrum x (energy vs frequency) against the mean spectrum for class C, with reliable and unreliable regions marked]
  Class C matches the reliable evidence well, but there is insufficient energy in the unreliable components.

  10. Finding the glimpses
  • Auditory scene analysis identifies spectral regions dominated by a single source (Cooke 91):
    • Harmonicity
    • Common amplitude modulation
    • Sound source location
  • Local SNR estimates can be used to compensate for predictable noise sources
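A minimal sketch of an SNR-derived mask for a predictable, roughly stationary noise source (my own illustration: it assumes the first few frames are noise-only, which is not stated on the slide):

```python
import numpy as np

def snr_mask(mixture_spec, noise_frames=10, threshold_db=0.0):
    """mixture_spec: (T, F) linear power spectrogram or auditory rate map.
    Returns a boolean (T, F) mask, True where the cell is judged speech-dominated.
    The noise spectrum is estimated from the leading frames, assumed noise-only."""
    noise_est = mixture_spec[:noise_frames].mean(axis=0) + 1e-10
    local_snr_db = 10 * np.log10(mixture_spec / noise_est + 1e-10)
    return local_snr_db > threshold_db
```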

  11. Harmonicity Masks
  • Only meaningful in voiced segments
  • Can be combined with SNR masks
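A rough sketch of a harmonicity mask for a single voiced frame (a simplification: real systems estimate the pitch per auditory filterbank channel or with a pitch tracker, and often produce soft rather than binary masks):

```python
import numpy as np

def harmonicity_mask(frame, sr, n_fft=512, tol_hz=40.0, f0_range=(80.0, 400.0)):
    """frame: one windowed time-domain frame, assumed voiced (in practice a voicing
    decision would gate this, since the mask is only meaningful in voiced segments).
    Returns (f0_estimate, boolean mask over the rfft bins of an n_fft analysis)."""
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(sr / f0_range[1]), int(sr / f0_range[0])
    lag = lo + int(np.argmax(ac[lo:hi]))          # largest autocorrelation peak
    f0 = sr / lag
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    nearest_harmonic = np.maximum(np.round(freqs / f0), 1.0) * f0
    return f0, np.abs(freqs - nearest_harmonic) < tol_hz
```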

  12. Aurora Results (Sept 2001)
  • Average gain over the clean baseline under all conditions: 65% (Barker et al 2001)

  13. Missing data masks from spatial location (Sue Harding, Guy Brown)
  • Cues for spatial location are used to separate a target source from masking sources:
    • Interaural Time Difference (ITD) from cross-correlation between the left and right binaural signals
    • Interaural Level Difference (ILD) from the ratio of energy in the left and right ears
    • Soft masks
  • Task:
    • Target source: male speaker straight ahead
    • One or two masking sources (also male speakers) at other positions
    • Added reverberation
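A per-channel sketch of the two binaural cues (my simplification; the actual system works on a gammatone filterbank and derives soft masks from these cues): ITD from the lag of the cross-correlation peak between the ears, ILD from the log energy ratio.

```python
import numpy as np

def itd_ild(left, right, sr, max_itd_s=1e-3):
    """left, right: equal-length signals from one auditory filterbank channel.
    Returns (itd_seconds, ild_db): the lag of the cross-correlation peak within a
    physically plausible range, and the left/right log energy ratio."""
    n = len(left)
    full = np.correlate(left, right, mode='full')      # lags -(n-1) .. (n-1)
    lags = np.arange(-(n - 1), n)
    keep = np.abs(lags) <= int(max_itd_s * sr)
    itd = lags[keep][np.argmax(full[keep])] / sr
    ild = 10 * np.log10((np.sum(left ** 2) + 1e-12) / (np.sum(right ** 2) + 1e-12))
    return itd, ild

# A localisation mask could then keep channels/frames whose (itd, ild) lie close to
# the values expected for the target direction (straight ahead: itd ~ 0, ild ~ 0 dB).
```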

  14. Missing data masks from spatial location (2)
  [Figures: localisation masks (frequency channel vs time in frames) for ITD only, ILD only and combined ILD/ITD; plot of % accuracy vs azimuth of masker (degrees) for oracle, ITD-only, ILD-only and combined ITD+ILD masks]
  • Best performance is with combined ITD and ILD

  15. MD for reverberant conditions (1)
  • Palomäki, Brown and Barker have applied MD to the problem of room reverberation:
    • Use spectral normalization to deal with distortion caused by early reflections
    • Treat late reverberation as additive noise, and apply standard MD techniques
    • Select features which are uncontaminated by reverberation and contain strong speech energy
  • Approach based on modulation filtering:
    • Each rate map channel is passed through a modulation filter
    • Identify periods with enough energy in the filtered output
    • Use these to define a mask on the original rate map
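A rough sketch of the modulation-filtering step under stated assumptions (the passband, filter order and relative threshold below are illustrative choices, not the published settings): band-pass each rate-map channel at speech-like modulation rates, then keep cells where the filtered output is strong relative to that channel's peak.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def reverb_mask(rate_map, frame_rate=100.0, band_hz=(1.0, 8.0), rel_thresh_db=-12.0):
    """rate_map: (T, F) auditory rate map, frame_rate in frames per second.
    Band-pass each channel at roughly syllabic modulation rates, then mark cells
    whose filtered envelope is within rel_thresh_db of that channel's maximum."""
    b, a = butter(2, [band_hz[0] / (frame_rate / 2), band_hz[1] / (frame_rate / 2)],
                  btype='band')
    filtered = filtfilt(b, a, rate_map, axis=0)        # modulation-filtered channels
    env = np.abs(filtered) + 1e-10
    thresh = env.max(axis=0, keepdims=True) * 10 ** (rel_thresh_db / 10)
    return env > thresh
```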

  16. MD for reverberant conditions (2)
  • Recognition of connected digits (Aurora 2)
  • Reverberated using recorded room impulse responses
  • Performance comparable with Brian Kingsbury's hybrid HMM-MLP recognizer
  K. J. Palomäki, G. J. Brown and J. Barker (2004) Speech Communication 43 (1-2), pp. 123-142

  17. MD for music analysis (1)
  • Eggink and Brown have used MD techniques to identify concurrent musical instrument sounds
  • Part of a system for transcribing chamber music
  • Identify the F0 of the target note, and keep only its harmonics in the MD mask
  • Uses a GMM classifier for each instrument, trained on isolated tones and short phrases
  • Tested on tones, phrases and commercial CD recordings
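Tying this to the missing data likelihood of slide 8, instrument identification can be sketched as an argmax over per-instrument diagonal-covariance GMMs scored only on the harmonics of the target note's F0. The function and parameter names are mine, and plain marginalisation is used here for brevity:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def gmm_marginal_loglik(x, reliable, weights, means, vars_):
    """Marginal log-likelihood of a diagonal-covariance GMM over reliable dims only."""
    r = np.asarray(reliable, dtype=bool)
    comp = [np.log(w) + norm.logpdf(x[r], mu[r], np.sqrt(v[r])).sum()
            for w, mu, v in zip(weights, means, vars_)]
    return logsumexp(comp)

def identify_instrument(x, harmonic_mask, instrument_gmms):
    """x: spectral features for one note; harmonic_mask: boolean vector marking the
    target note's harmonics as reliable; instrument_gmms: dict name -> (weights,
    means, vars). The harmonic mask plays the role of the missing data mask."""
    scores = {name: gmm_marginal_loglik(x, harmonic_mask, *params)
              for name, params in instrument_gmms.items()}
    return max(scores, key=scores.get)
```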

  18. MD for music analysis (2)
  • Example: duet for flute and clarinet
  • All instrument tones correctly identified in this example
  [Figure: identified flute and clarinet tones, fundamental frequency (Hz) vs time (frames)]
  J. Eggink and G. J. Brown (2003) Proc. ICASSP, Hong Kong, IV, pp. 553-556
  J. Eggink and G. J. Brown (2004) Proc. ICASSP, Montreal, V, pp. 217-220

  19. Multisource Decoding (Barker, Cooke & Ellis 2003)
  • Use primitive ASA and local SNR to identify time-frequency regions (fragments) dominated by a single source, i.e. possible segregations S…
  • …but NOT to decide what the best segregation is
  • Instead, jointly optimise over the word sequence W and the segregation S
  • The decoding algorithm finds the best subset of fragments to match the speech source
  • Based on missing data techniques: regions hypothesised as non-speech are missing

  20. Multisource decoding algorithm
  • Work forward in time, maintaining a set of alternative decodings: Viterbi searches based on a choice of speech fragments
  • When a new fragment arrives, split the decodings: is the fragment speech or non-speech?
  • When a fragment ends, merge decoders which differ only in its interpretation
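A bookkeeping-only sketch of this split/merge search (my simplification: the per-frame Viterbi scoring over word sequences is collapsed into a placeholder frame_score callable, and pruning is omitted). Hypotheses are distinguished by the speech/background labels of fragments still active; they are split when a fragment starts and merged, keeping the better score, when it ends.

```python
def multisource_decode(fragments, n_frames, frame_score):
    """fragments: list of (start_frame, end_frame) time-frequency fragments (inclusive).
    frame_score(t, speech_frag_ids) -> log-likelihood of frame t when exactly the
    fragments in speech_frag_ids are treated as speech (reliable) and the rest as
    missing. Returns the best total score over all segregations."""
    hyps = {frozenset(): 0.0}            # open-fragment labellings -> best score so far
    for t in range(n_frames):
        # split: a fragment starting now may be labelled speech (True) or background (False)
        for i, (s, _) in enumerate(fragments):
            if s == t:
                hyps = {key | {(i, lab)}: sc
                        for key, sc in hyps.items() for lab in (True, False)}
        # score this frame under each hypothesis's current set of speech fragments
        hyps = {key: sc + frame_score(t, {i for i, lab in key if lab})
                for key, sc in hyps.items()}
        # merge: a fragment ending now no longer distinguishes hypotheses
        for i, (_, e) in enumerate(fragments):
            if e == t:
                merged = {}
                for key, sc in hyps.items():
                    k = frozenset(item for item in key if item[0] != i)
                    merged[k] = max(merged.get(k, float('-inf')), sc)
                hyps = merged
    return max(hyps.values())
```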

  21. Multisource Decoding on Aurora 2

  22. Multisource decoding with a competing speaker (Andre Coy and Jon Barker)
  • Utterances of male and female speakers mixed at 0 dB
  • Voiced regions: soft harmonicity masks from autocorrelation peaks
  • Voiceless regions: fragments from ‘image processing’
  • Gender-dependent HMMs; separate decoding for male and female
  • 73.7% accuracy on a connected digit task

  23. Informing Multisource Decoding – work in progress (Ning Ma, Andre Coy, Phil Green)
  • HMM duration constraints
  • Links between fragments: pitch continuity
  • ‘Speechiness’

  24. Speech separation challenge
  Organisers: Martin Cooke (University of Sheffield, UK), Te-Won Lee (UCSD, USA); see http://www.dcs.shef.ac.uk/~martin
  • Global comparison of techniques for separating and recognising speech
  • Special session of Interspeech 2006 in Pittsburgh (USA), 17-21 September 2006
  • Task: recognise speech from a target talker in the presence of either stationary noise or other speech
  • Training and test data supplied
  • One signal per mixture (i.e. the task is "single microphone")
  • Speech material: simple sentences from the ‘Grid Task’, e.g. "place white at L 3 now"
