
Speech Discrimination Based on Multiscale Spectro–Temporal Modulations


Presentation Transcript


  1. Speech Discrimination Based on Multiscale Spectro–Temporal Modulations Nima Mesgarani, Shihab Shamma, University of Maryland; Malcolm Slaney, IBM Reporter: Chen, Hung-Bin

  2. Outline • Introduction: VAD (Voice Activity Detection and Speech Segmentation) • Discriminating speech from non-speech (noise) sounds • Multiscale spectro-temporal modulation features extracted using a model of the auditory cortex • Two state-of-the-art systems • Robust Multifeature Speech/Music Discriminator • Robust Speech Recognition in Noisy Environments • Auditory model • Experimental results • Summary and Conclusions

  3. Introduction - VAD • Significance • For speech recognition systems designed for real-world conditions, robust discrimination of speech from other sounds is a crucial step. • Advantage • Speech discrimination can also be used for coding and telecommunication applications. • Proposed system • A feature set inspired by investigations of various stages of the auditory system

  4. Two state-of-the-art systems • Multi-feature System • Features • Thirteen features in the time, frequency, and cepstral domains are used to model speech and music (noise). • Classification • A Gaussian mixture model (GMM) models each class of data as the union of several Gaussian clusters in the feature space (see the sketch below). • Reference: • [1] E. Scheirer, M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator", ICASSP'97, 1997.
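
As a rough illustration of the GMM classification step in [1], the sketch below fits one diagonal-covariance mixture per class and labels each frame by the higher likelihood. The feature matrices X_speech / X_music and the mixture count are hypothetical placeholders, not values from the paper.

```python
# Minimal GMM speech/music classifier sketch (scikit-learn stands in for the
# original implementation). X_speech / X_music: hypothetical (n_frames, 13)
# arrays of the thirteen per-frame features.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(X_speech, X_music, n_components=8):
    # 8 diagonal-covariance mixtures per class is an assumption for illustration.
    gmm_speech = GaussianMixture(n_components, covariance_type="diag").fit(X_speech)
    gmm_music = GaussianMixture(n_components, covariance_type="diag").fit(X_music)
    return gmm_speech, gmm_music

def classify_frames(X, gmm_speech, gmm_music):
    # Pick, per frame, the class whose mixture assigns the higher log-likelihood.
    return np.where(gmm_speech.score_samples(X) > gmm_music.score_samples(X),
                    "speech", "music")
```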

  5. Two state-of-the-art systems (cont) • Voicing-energy System • Features • Frame-by-frame maximum autocorrelation and log-energy features are used to make the speech/non-speech decision (see the sketch below). • PLP (perceptual linear prediction) features • LDA + MLLT feature transforms • Segmentation • Uses an HMM-based segmentation procedure with two models, one for speech segments and one for non-speech segments. • Reference: • [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan and R. Sarikaya, "Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system", ICASSP 2002, vol. I, pp. 53–56, 2002.
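
The two frame-level features named above, maximum autocorrelation (a voicing cue) and log-energy, could be computed roughly as below. The frame length, pitch range, and normalization are illustrative assumptions, not the exact recipe of [2].

```python
# Per-frame voicing/energy features: maximum normalized autocorrelation over
# plausible pitch lags, plus log-energy. Assumes a frame of ~25-30 ms at 16 kHz.
import numpy as np

def voicing_energy_features(frame, fs=16000, f0_min=60.0, f0_max=400.0):
    frame = frame - np.mean(frame)
    log_energy = np.log(np.sum(frame ** 2) + 1e-10)

    # Normalized autocorrelation at non-negative lags.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-10)
    lo, hi = int(fs / f0_max), int(fs / f0_min)   # lag range for 60-400 Hz pitch
    max_autocorr = float(np.max(ac[lo:hi]))

    return max_autocorr, log_energy
```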

  6. Auditory model • The computational auditory model is based on neurophysiological, biophysical, and psychoacoustical investigations at various stages of the auditory system. • Its early stage transforms the acoustic signal into an internal neural representation (the auditory spectrogram)

  7. Auditory model (cont) • The acoustic signal produces a complex spatiotemporal pattern of vibrations along the basilar membrane of the cochlea • These vibrations are transduced into auditory-nerve responses by a 3-step process (sketched below) • a highpass filter, followed by an instantaneous nonlinear compression • a lowpass filter (hair cell membrane leakage) • A lateral inhibitory stage then detects discontinuities in the responses across the tonotopic axis of the auditory nerve array • The resulting representation is analyzed computationally via a bank of modulation-selective filters centered at each frequency along the tonotopic axis.
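
A loose sketch of the transduction and lateral-inhibition steps just listed, applied to the output of a cochlear filterbank. The filter orders, cutoffs, tanh compression, and 8 ms integration window are illustrative assumptions, not the model's exact parameters.

```python
# Early auditory stage sketch: hair-cell transduction followed by lateral
# inhibition across channels, yielding an auditory-spectrogram-like output.
import numpy as np
from scipy.signal import butter, lfilter

def early_auditory_stage(cochlear_out, fs=16000):
    """cochlear_out: (n_channels, n_samples) cochlear filterbank output."""
    # 1) Highpass (fluid-cilia coupling) + instantaneous compressive nonlinearity.
    b, a = butter(1, 20.0 / (fs / 2), btype="high")
    y = np.tanh(lfilter(b, a, cochlear_out, axis=1))
    # 2) Lowpass (hair cell membrane leakage).
    b, a = butter(1, 4000.0 / (fs / 2), btype="low")
    y = lfilter(b, a, y, axis=1)
    # 3) Lateral inhibition: difference across the tonotopic (channel) axis,
    #    half-wave rectified, then short-term integration over ~8 ms frames.
    y = np.maximum(np.diff(y, axis=0), 0.0)
    frame = int(0.008 * fs)
    n_frames = y.shape[1] // frame
    return y[:, :n_frames * frame].reshape(y.shape[0], n_frames, frame).mean(axis=2)
```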

  8. Auditory model (cont) • Sound is analyzed by a model of the cochlea (depicted on the left) consisting of a bank of 128 constant-Q bandpass filters with center frequencies equally spaced on a logarithmic frequency axis
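
For reference, 128 log-spaced constant-Q center frequencies can be generated as below; the starting frequency and channels-per-octave density are assumptions, since the slide states only the channel count and the logarithmic spacing.

```python
import numpy as np

def center_frequencies(n_channels=128, f_low=180.0, channels_per_octave=24):
    # Equally spaced on a log-frequency axis: f_low, f_low * 2**(1/24), ...
    return f_low * 2.0 ** (np.arange(n_channels) / channels_per_octave)
```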

  9. Multilinear Analysis of Cortical Representation • The output of the auditory model is a multidimensional array. • The time dimension is averaged over a given time window, which results in a three-mode tensor for each window; each element represents the overall modulation at the corresponding frequency, rate, and scale (128 frequency channels × 26 rates × 6 scales) (see the sketch below).
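
A minimal sketch of collapsing the time axis into one frequency × rate × scale tensor per analysis window, assuming a hypothetical cortical output array of shape (time, 128, 26, 6) holding modulation magnitudes.

```python
import numpy as np

def window_tensors(cortical, frames_per_window):
    """cortical: (time, 128 freqs, 26 rates, 6 scales) -> (windows, 128, 26, 6)."""
    t = cortical.shape[0] // frames_per_window
    windows = cortical[:t * frames_per_window].reshape(
        t, frames_per_window, *cortical.shape[1:])
    # Average over the time frames inside each window.
    return windows.mean(axis=1)
```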

  10. Multilinear Analysis of Cortical Representation (cont) • Multi-dimensional PCA is used to tailor the amount of reduction in each subspace independently. • To handle the multidimensional tensors, we consider a generalization of the SVD (Singular Value Decomposition) to tensors: • D = S ×₁ U_frequency ×₂ U_rate ×₃ U_scale ×₄ U_samples • D: the data tensor • S: the core tensor, of size I1 × I2 × ... × IN • Original tensor size: 128 (frequency channels) × 26 (rates) × 6 (scales) • The reduced tensor, formed from the retained singular vectors in each mode (7 for frequency, 5 for rate, and 3 for scale), is used for classification. • Classification was performed using a Support Vector Machine (SVM); a sketch follows below.
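
The sketch below approximates this procedure with a HOSVD-style reduction: mode-wise SVDs yield the frequency, rate, and scale bases, each 128 × 26 × 6 tensor is projected onto the retained singular vectors (7, 5, 3), and the flattened result is fed to an SVM. numpy and scikit-learn stand in for whatever tools the authors used, and the RBF kernel is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

def mode_basis(T, mode, k):
    """Leading k left singular vectors of the mode-n unfolding of the tensor
    stack T, shaped (samples, frequency, rate, scale); mode is 1, 2, or 3."""
    unfolded = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
    U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return U[:, :k]

def reduce_tensors(T, Uf, Ur, Us):
    # Project each tensor onto the retained bases and flatten to a 7*5*3 vector.
    return np.einsum('nfrs,fi,rj,sk->nijk', T, Uf, Ur, Us).reshape(T.shape[0], -1)

def train_classifier(T_train, y_train):
    """T_train: (n_samples, 128, 26, 6); y_train: speech / non-speech labels."""
    Uf = mode_basis(T_train, 1, 7)   # frequency mode -> 7 singular vectors
    Ur = mode_basis(T_train, 2, 5)   # rate mode      -> 5 singular vectors
    Us = mode_basis(T_train, 3, 3)   # scale mode     -> 3 singular vectors
    clf = SVC(kernel="rbf").fit(reduce_tensors(T_train, Uf, Ur, Us), y_train)
    return clf, (Uf, Ur, Us)
```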

  11. Experimental Results • Speech data from TIMIT • Training data: 300 samples • Testing data: 150 different sentences spoken by 50 different speakers (25 male, 25 female) • Training and test sets were disjoint. • Non-speech class • Assembled from the BBC Sound Effects audio CD, the RWC Genre Database, and the Noisex and Aurora databases. • Training set: 300 speech and 740 non-speech samples • Testing set: 150 speech and 450 non-speech samples • All audio samples have equal length.

  12. Experimental Results (cont) • Speech detection/discrimination performance • Tables 1 and 2 show the resulting detection/discrimination performance.

  13. Experimental Results (cont) • In these tests, white and pink noise were added to the speech at specified signal-to-noise ratios (SNR), as in the sketch below.
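
The mixing itself can follow the standard energy-based SNR definition, as in this sketch (the exact scaling procedure is assumed here, not spelled out on the slide).

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech so that 10*log10(P_speech / P_noise) = snr_db.
    Assumes the noise signal is at least as long as the speech."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```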

  14. Experimental Results (cont) • The effect of different levels of reverberation on performance was also evaluated.

  15. Summary and Conclusions • This work is but one in a series of efforts at incorporating multi–scale cortical representations (and more broadly, perceptual insights) in a variety of audio and speech processing applications. • Applications such as • automatic classification • segmentation of animal sounds • an efficient encoding of speech and music

  16. References • Two state-of-the-art systems • [1] E. Scheirer, M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator", ICASSP'97, 1997. • [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan and R. Sarikaya, "Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system", ICASSP 2002, vol. I, pp. 53–56, 2002. • Central Auditory System • [4] K. Wang and S. A. Shamma, "Spectral shape analysis in the central auditory system", IEEE Trans. Speech Audio Proc., vol. 3 (5), pp. 382–395, 1995. • [6] M. Elhilali, T. Chi and S. A. Shamma, "A spectro-temporal modulation index (STMI) for assessment of speech intelligibility", Speech Comm., vol. 41, pp. 331–348, 2003. • S. A. Shamma, "Auditory cortical representation of complex acoustic spectra as inferred from the ripple analysis method" • http://www.isr.umd.edu/People/faculty/Shamma.html
