Supervised Speech Separation
DeLiang Wang
Perception & Neurodynamics Lab
Ohio State University & Northwestern Polytechnical University
http://www.cse.ohio-state.edu/pnl/
Outline of presentation • Introduction: Speech separation problem • Classifiers and learning machines • Training targets • Features • Separation algorithms • Concluding remarks
Real-world audition
What?
• Speech: message; speaker age, gender, linguistic origin, mood, …
• Music
• Car passing by
Where?
• Left, right, up, down
• How close?
Channel characteristics
Environment characteristics
• Room reverberation
• Ambient noise
Sources of intrusion and distortion
• Additive noise from other sound sources
• Channel distortion
• Reverberation from surface reflections
Cocktail party problem • Term coined by Cherry • “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry’57) • “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp’92) • Ball-room problem by Helmholtz • “Complicated beyond conception” (Helmholtz, 1863) • Speech separation problem • Separation and enhancement are used interchangeably when dealing with nonspeech interference
Listener performance
• Speech reception threshold (SRT): the speech-to-noise ratio needed for 50% intelligibility
• Each 1 dB gain in SRT corresponds to roughly a 10% increase in intelligibility, depending on the speech materials
Source: Steeneken (1992)
Effects of competing source
[Figure: SRT for different competing sources; the SRT difference across maskers reaches 23 dB. Source: Wang & Brown (2006)]
Some applications of speech separation • Robust automatic speech and speaker recognition • Noise reduction for hearing prosthesis • Hearing aids • Cochlear implants • Noise reduction for mobile communication • Audio information retrieval
Traditional approaches to speech separation
• Speech enhancement: monaural methods that analyze general statistics of speech and noise; they require a noise estimate
• Spatial filtering with a microphone array (beamforming): extract target sound from a specific spatial direction with a sensor array
• Independent component analysis: find a demixing matrix from multiple mixtures of sound sources
• Computational auditory scene analysis (CASA): based on auditory scene analysis principles; feature-based (e.g. pitch) versus model-based (e.g. speaker model)
Supervised approach to speech separation
• Data driven, i.e. dependent on a training set
• Born out of CASA: the time-frequency masking concept has led to the formulation of speech separation as a supervised learning problem
• A recent trend fueled in part by the success of deep learning
• Focus of this tutorial
Ideal binary mask as a separation goal
• Motivated by the auditory masking phenomenon and auditory scene analysis, we suggested the ideal binary mask as a main goal of CASA (Hu & Wang, 2001; 2004)
• The idea is to retain parts of a mixture where the target sound is stronger than the acoustic background, and discard the rest
• The definition of the ideal binary mask (IBM): IBM(t, f) = 1 if SNR(t, f) > θ, and 0 otherwise
• θ: a local SNR criterion (LC) in dB, typically chosen to be 0 dB
• Optimal SNR: under certain conditions the IBM with θ = 0 dB is the optimal binary mask in terms of SNR gain (Li & Wang'09)
• Maximal articulation index (AI) in a simplified version (Loizou & Kim'11)
• It does not actually separate the mixture!
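To make the definition concrete, here is a minimal numpy sketch (not from the slides; the helper name is my own) that computes the IBM from premixed target and noise power spectrograms:

```python
import numpy as np

def ideal_binary_mask(target_pow, noise_pow, lc_db=0.0):
    """Compute the IBM from premixed target and noise power spectrograms.

    target_pow, noise_pow: T-F matrices of target and noise power.
    lc_db: local SNR criterion (LC) in dB, typically 0 dB.
    """
    eps = np.finfo(float).eps
    local_snr_db = 10.0 * np.log10((target_pow + eps) / (noise_pow + eps))
    # 1 marks target-dominant T-F units; 0 marks noise-dominant units
    return (local_snr_db > lc_db).astype(np.float32)
```

In use, the mask would be applied elementwise to the mixture's T-F representation before resynthesis; note that computing it requires the premixed signals, which is why the IBM is an oracle goal rather than a separation algorithm.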
Subject tests of ideal binary masking • IBM separation leads to large speech intelligibility improvements • Improvement for stationary noise is above 7 dB for normal-hearing (NH) listeners (Brungart et al.’06; Li & Loizou’08; Cao et al.’11; Ahmadi et al.’13), and above 9 dB for hearing-impaired (HI) listeners (Anzalone et al.’06; Wang et al.’09) • Improvement for modulated noise is significantly larger than for stationary noise • With the IBM as the goal, the speech separation problem becomes a binary classification problem • This new formulation opens the problem to a variety of pattern classification methods
Speech perception of noise with binary gains
• Wang et al. (2008) found that, when LC is chosen to be the same as the input SNR, nearly perfect intelligibility is obtained even when the input SNR is -∞ dB (i.e. the mixture contains noise only, with no target speech)
• [Demo: IBM-modulated speech-shaped noise]
Part II: Classifiers and learning machines
• Multilayer perceptrons (MLPs)
• Deep neural networks (DNNs)
Multilayer perceptrons • Multilayer perceptrons are extended from perceptrons (or simple perceptrons) • The task is to learn from examples, and then generalize to unseen samples • A core learning approach in neural networks
Perceptrons
• First learning machine
• Aiming for classification (apples vs. oranges)
• Architecture: one-layer feedforward net
• Without loss of generality, consider a single-neuron perceptron
[Diagram: one-layer feedforward network with inputs x1, …, xm and outputs y1, y2, …]
Linear separability • A perceptron separates the input space into two halves via a linear boundary • Perceptrons cannot classify linearly inseparable classes
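As an illustration (the function name and interface are my own, not from the slides), the classical perceptron learning rule fits in a few lines; on a linearly inseparable set such as XOR this loop never converges, which motivates the multilayer extension on the next slide:

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Classical perceptron rule for a single-neuron perceptron.

    X: (n_samples, n_features) inputs; y: labels in {-1, +1}.
    Converges only if the two classes are linearly separable.
    """
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # sample is misclassified
                w += lr * yi * xi        # move the boundary toward it
                b += lr * yi
    return w, b
```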
Multilayer perceptrons • If a training set is not linearly separable, a multilayer perceptron can give a solution
Backpropagation algorithm • Square error (loss) function between the desired (teaching) signal and actual output • Key question: how to minimize the error function by updating weights in multilayer structure with nonlinear activation functions? • Solution: apply the technique of gradient descent over weights • Gradient points to the direction of maximum E increase • Gradient descent leads to errors back-propagated layer by layer • Hence the name of the algorithm
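To make the layer-by-layer error propagation concrete, here is a minimal numpy sketch (my own, with bias terms omitted for brevity) of one gradient-descent step for a two-layer network with sigmoid activations and squared error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, d, W1, W2, lr=0.1):
    """One step of gradient descent on E = 0.5 * ||d - y||^2.

    x: input vector; d: desired (teaching) signal;
    W1, W2: hidden- and output-layer weight matrices.
    """
    # Forward pass
    h = sigmoid(W1 @ x)                       # hidden activations
    y = sigmoid(W2 @ h)                       # actual output
    # Backward pass: error terms propagate from output to hidden layer
    delta2 = (y - d) * y * (1.0 - y)          # output-layer error
    delta1 = (W2.T @ delta2) * h * (1.0 - h)  # back-propagated hidden error
    # Gradient descent over the weights
    W2 -= lr * np.outer(delta2, h)
    W1 -= lr * np.outer(delta1, x)
    return W1, W2
```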
Deep neural networks • Why deep? • As the number of layers increases, more abstract features are learned and they tend to be more invariant to superficial variations • Superior performance in practice if properly trained (e.g., convolutional neural networks) • The backprop algorithm is applicable to an arbitrary number of layers • However, deep structure is hard to train from random initializations • Vanishing gradients: Error derivatives tend to become very small in lower layers
Restricted Boltzmann machines
• Hinton et al. (2006) suggested pretraining a DNN, unsupervised, using restricted Boltzmann machines (RBMs)
• RBMs simplify Boltzmann machines by allowing connections only between the visible and hidden layers, i.e. no intra-layer recurrent connections
• This restriction enables exact and efficient inference
DNN training • Unsupervised, layerwise RBM pretraining • Train the first RBM using unlabeled data • Fix the first layer weights. Use the resulting hidden activations as new data to train the second RBM • Continue until all layers are thus trained • Supervised fine-tuning • The weights from RBM pretraining provide the network initialization • Use standard backpropagation to fine tune all the weights for a particular task at hand • Recent practice suggests that RBM pretraining is not needed if large training data exists
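A sketch of this recipe using scikit-learn's BernoulliRBM (the helper function and hyperparameters are illustrative choices, not from the tutorial; inputs must be scaled to [0, 1]):

```python
from sklearn.neural_network import BernoulliRBM

def pretrain_layerwise(X, layer_sizes, n_iter=10):
    """Greedy, unsupervised layer-by-layer RBM pretraining.

    X: unlabeled training data scaled to [0, 1].
    layer_sizes: hidden-layer sizes, e.g. [1024, 1024, 1024].
    Returns trained RBMs whose weights would initialize a DNN
    that is then fine-tuned with backpropagation.
    """
    rbms, data = [], X
    for n_hidden in layer_sizes:
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                           n_iter=n_iter, random_state=0)
        rbm.fit(data)               # train this layer's RBM
        data = rbm.transform(data)  # hidden activations feed the next RBM
        rbms.append(rbm)
    return rbms
```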
Part III: Training targets
• What supervised training aims to learn is important for speech separation/enhancement
• Different training targets lead to different mapping functions from noisy features to separated speech
• Different targets may have different levels of generalization
• A recent study (Wang et al.'14) examines different training targets (objective functions):
• Masking-based targets
• Mapping-based targets
• Other targets
Background
• While the IBM was the first target used in supervised separation (see Part V), the quality of the separated speech is a persistent issue
• What are alternative targets?
• Target binary mask (TBM) (Kjems et al.'09; Gonzalez & Brookes'14)
• Ideal ratio mask (IRM) (Srinivasan et al.'06; Narayanan & Wang'13; Hummersone et al.'14)
• STFT spectral magnitude (Xu et al.'14; Han et al.'14)
Different training targets
• TBM: similar to the IBM except that the interference is fixed to speech-shaped noise (SSN)
• IRM: IRM(t, f) = [S²(t, f) / (S²(t, f) + N²(t, f))]^β, where S²(t, f) and N²(t, f) denote speech and noise energy within a T-F unit
• β is a tunable parameter, and a good choice is 0.5
• With β = 0.5, the IRM becomes a square-root Wiener filter, which is the optimal estimator of the power spectrum
Different training targets (cont.)
• Gammatone frequency power spectrum (GF-POW): the power spectrum of clean speech in the gammatone domain
• FFT-MAG: the STFT spectral magnitude of clean speech
• FFT-MASK: FFT-MASK(t, f) = |S(t, f)| / |Y(t, f)|, where S and Y denote the STFTs of clean and noisy speech
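For concreteness, a numpy sketch (helper name my own) computing the masking- and mapping-based targets from the premixed STFTs, assuming the mixture is Y = S + N:

```python
import numpy as np

def training_targets(S, N, beta=0.5):
    """Compute IRM, FFT-MAG, and FFT-MASK targets.

    S, N: complex STFTs of premixed clean speech and noise.
    """
    eps = np.finfo(float).eps
    Y = S + N                                    # noisy-speech STFT
    irm = (np.abs(S)**2 / (np.abs(S)**2 + np.abs(N)**2 + eps)) ** beta
    fft_mag = np.abs(S)                          # mapping target
    fft_mask = np.abs(S) / (np.abs(Y) + eps)     # masking target
    return irm, fft_mag, fft_mask
```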
[Figure: Illustration of the various training targets for a mixture with factory noise at -5 dB]
Evaluation methodology • Learning machine: DNN with 3 hidden layers, each with 1024 units • Target speech: TIMIT • Noises: SSN + four nonstationary noises from NOISEX • Training and testing on different segments of each noise • Trained at -5 and 0 dB, and tested at -5, 0, and 5 dB • Evaluation metrics • STOI: standard metric for predicted speech intelligibility • PESQ: standard metric for perceptual speech quality • SNR
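Both metrics have open-source implementations; as a hedged example, an evaluation sketch using the third-party pystoi and pesq packages (file names are placeholders):

```python
import soundfile as sf
from pystoi import stoi   # pip install pystoi
from pesq import pesq     # pip install pesq

clean, fs = sf.read('clean.wav')           # placeholder file names
separated, _ = sf.read('separated.wav')

stoi_score = stoi(clean, separated, fs)        # roughly 0 to 1; higher is better
pesq_score = pesq(fs, clean, separated, 'wb')  # wideband mode needs fs = 16 kHz
print(f'STOI: {stoi_score:.3f}, PESQ: {pesq_score:.2f}')
```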
Comparisons • Comparisons among different training targets • Trained and tested on each noise, but different segments • An additional comparison for the IRM target with multi-condition (MC) training of all noises (MC-IRM) • Comparisons with different approaches • Speech enhancement (Hendriks et al.’10) • Supervised NMF: ASNA-NMF (Virtanen et al.’13), trained and tested in the same way as supervised separation
[Figure: STOI comparison for factory noise, grouping masking-based, spectral-mapping, and NMF & speech enhancement (SPEH) methods]
[Figure: PESQ comparison for factory noise, with the same grouping]
Summary among different targets
• Of the two binary masks, IBM estimation performs better in PESQ than TBM estimation
• Ratio masking performs better than binary masking for speech quality
• IRM, FFT-MASK, and GF-POW produce comparable PESQ results
• FFT-MASK is a better estimation target than FFT-MAG
• Many-to-one mapping in FFT-MAG vs. one-to-one mapping in FFT-MASK; the latter should be easier to learn
• Estimation of spectral magnitudes or their compressed version tends to magnify estimation errors
Other targets: signal approximation
• In signal approximation (SA), training aims to estimate the IRM, but the error is measured against the spectral magnitude of clean speech (Weninger et al.'14):
SA(t, f) = [RM(t, f) |Y(t, f)| - |S(t, f)|]²
• RM(t, f) denotes an estimated IRM
• This objective function maximizes SNR
• Jin & Wang (2009) proposed an earlier version in conjunction with IBM estimation
• There is some improvement over direct IRM estimation
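A one-function numpy sketch of the SA loss as written above (helper name my own):

```python
import numpy as np

def signal_approximation_loss(rm_est, Y_mag, S_mag):
    """SA loss: score the estimated ratio mask rm_est by how well it
    reconstructs the clean magnitude, summing
    (RM(t, f) * |Y(t, f)| - |S(t, f)|)^2 over all T-F units."""
    return np.sum((rm_est * Y_mag - S_mag) ** 2)
```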
Phase-sensitive target
• The phase-sensitive target is an ideal ratio mask (FFT mask) that incorporates the phase difference, θ, between clean speech and noisy speech (Erdogan et al.'15):
PSM(t, f) = (|S(t, f)| / |Y(t, f)|) cos θ
• Because of phase sensitivity, this target leads to a better estimate of clean speech than the FFT mask
• It does not directly estimate phase
• We will come back to this issue later
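A numpy sketch of this target (helper name my own), computed from the complex STFTs of clean and noisy speech:

```python
import numpy as np

def phase_sensitive_mask(S, Y):
    """FFT mask scaled by cos(theta), where theta is the phase
    difference between clean speech S and noisy speech Y."""
    eps = np.finfo(float).eps
    theta = np.angle(S) - np.angle(Y)
    return (np.abs(S) / (np.abs(Y) + eps)) * np.cos(theta)
```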
Part IV: Features for supervised separation • For supervised learning, features and learning machines are two key components • Early studies only used a few features • ITD/ILD (Roman et al.’03) • Pitch (Jin et al.’09) • Amplitude modulation spectrogram (AMS) (Kim et al.’09) • A subsequent study expanded the list (Wang et al.’13) • Newly included: MFCC, GFCC, PLP, and RASTA-PLP • A complementary set is recommended: AMS+RASTA-PLP+MFCC (and PITCH)
A systematic feature study • Chen et al. (2014) have evaluated an extensive list of acoustic features, previously used for robust ASR, for classification-based speech separation • Evaluation done at the low SNR level of -5 dB, with implications for speech intelligibility improvements • In addition to existing features, a new feature called Multi-Resolution Cochleagram (MRCG) was introduced • Classifier is fixed to an MLP
Evaluation framework • Each frame of features is sent to an MLP to estimate a frame of the IBM • A single fullband classifier
Evaluation criteria
• Two criteria are used to measure the quality of IBM estimation:
• Classification accuracy
• HIT - FA rate, i.e. the hit rate (percent of target-dominant units correctly classified) minus the false-alarm rate (percent of noise-dominant units wrongly classified as target-dominant), which is well correlated with human speech intelligibility (Kim et al.'09)
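A small numpy sketch (helper name my own) of how HIT - FA could be computed from the IBM and an estimated binary mask:

```python
import numpy as np

def hit_fa(ibm, mask_est):
    """HIT - FA for binary mask estimation.

    HIT: fraction of target-dominant (1) units correctly labeled 1.
    FA:  fraction of noise-dominant (0) units wrongly labeled 1.
    """
    ibm, est = ibm.astype(bool), mask_est.astype(bool)
    hit = np.mean(est[ibm])    # hit rate
    fa = np.mean(est[~ibm])    # false-alarm rate
    return hit - fa
```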
Multi-resolution cochleagram • MRCG is constructed by combining multiple cochleagrams at different resolutions • A high-resolution cochleagram captures local information, while a low-resolution cochleagram captures information in a broader spectrotemporal context
MRCG derivation • Steps for computing MRCG: • Given an input mixture, compute the first 64-channel cochleagram, CG1, with the frame length of 20 ms and frame shift of 10 ms. This is a common form of cochleagram. A log operation is applied to the energy of each T-F unit. • Similarly, compute CG2 with the frame length of 200 ms and frame shift of 10 ms. • CG3 is derived by averaging CG1 across a square window of 11 frequency channels and 11 time frames centered at a given T-F unit. • If the window goes beyond the given cochleagram, the outside units take the value of zero (i.e. zero padding) • CG4 is computed in a similar way to CG3, except that a 23×23 square window is used • Concatenate CG1-CG4 to obtain the MRCG feature, which has 64×4 dimensions for each time frame
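A sketch of these steps in Python (my own; the cochleagram function is assumed rather than implemented here), with scipy's uniform_filter performing the zero-padded local averaging:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def mrcg(mixture, fs, cochleagram):
    """Sketch of MRCG extraction.

    cochleagram: assumed function returning a log 64-channel
    cochleagram (channels x frames) for a given frame length/shift.
    """
    cg1 = cochleagram(mixture, fs, frame_len=0.020, frame_shift=0.010)
    cg2 = cochleagram(mixture, fs, frame_len=0.200, frame_shift=0.010)
    # Local averaging of CG1; mode='constant' zero-pads beyond the edges
    cg3 = uniform_filter(cg1, size=(11, 11), mode='constant', cval=0.0)
    cg4 = uniform_filter(cg1, size=(23, 23), mode='constant', cval=0.0)
    # 64 x 4 = 256 feature dimensions per time frame
    return np.concatenate([cg1, cg2, cg3, cg4], axis=0)
```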
MRCG visualization • Left: MRCG from a noisy mixture (-5 dB babble) • Right: MRCG from clean speech
Experimental setup • Speech from the IEEE male corpus and six nonstationary NOISEX noises • 480 sentences for training and another 50 sentences for testing • First half of each 4-minute noise is used for training, and second half for testing • Mixture SNR is fixed to -5 dB • The hidden layer of MLP has 300 units
Features considered • As robust ASR is likely related to speech separation, Chen et al. constructed an extensive list of features from robust ASR research • A subset of promising features is selected based on their ASR performance at low SNRs and for nonstationary noises
Features selected for evaluation • In addition to those previously studied for speech separation, newly selected ones include • Gammatone frequency modulation coefficient (GFMC; Maganti & Matassoni’10) • Zero-crossing with peak-amplitude (ZCPA, Kim et al.’99) • Relative autocorrelation sequence MFCC (RAS-MFCC, Yuo & Wang’99) • Autocorrelation sequence MFCC (AC-MFCC, Shannon & Paliwal’06) • Phase autocorrelation MFCC (PAC-MFCC, Ikbal et al.’03) • Power normalized cepstral coefficient (PNCC, Kim & Stern’12) • Gabor filterbank (GFB) feature (Schadler et al.’12) • Delta-spectral cepstral coefficient (DSCC, Kumar et al.’11) • Suppression of slowly-varying components and the falling edge of the power envelope (SSF, Kim & Stern’10)
Result summary
• The MRCG feature performs the best consistently
• Gammatone-domain features (MRCG, GF and GFCC) perform best
• Cepstral (DCT) compaction does not help: GF is better than GFCC
• Modulation-domain features do not help: GFCC is better than GFMC, even though the latter is derived from the former
• MFCC with ARMA filtering performs reasonably well
• Poor performance of pitch is largely due to pitch estimation errors at -5 dB SNR
Part V: Separation algorithms
• Monaural separation
  • Speech-nonspeech separation
  • Complex-domain separation
  • Two-talker separation
  • Separation of reverberant speech
• Binaural separation
Early monaural attempts at IBM estimation
• Jin & Wang (2009) proposed MLP-based classification to separate reverberant voiced speech
• A 6-dimensional pitch-based feature is extracted within each T-F unit
• Classification aims at the IBM, but with a training target that takes into account the relative energy of the T-F unit
• Kim et al. (2009) proposed GMM-based classification to perform speech separation in a masker-dependent way
• AMS features are extracted within each T-F unit
• Lower LCs are used at higher frequencies in the IBM definition
• This was the first monaural speech segregation algorithm to achieve speech intelligibility improvement for NH listeners