Supervised Speech Separation
DeLiang Wang
Perception & Neurodynamics Lab
Ohio State University & Northwestern Polytechnical University
http://www.cse.ohio-state.edu/pnl/
Outline of presentation • Introduction: Speech separation problem • Classifiers and learning machines • Training targets • Features • Separation algorithms • Concluding remarks
Real-world audition
What?
• Speech: message; speaker age, gender, linguistic origin, mood, …
• Music
• Car passing by
Where?
• Left, right, up, down
• How close?
Channel characteristics
Environment characteristics
• Room reverberation
• Ambient noise
Sources of intrusion and distortion
• Additive noise from other sound sources
• Channel distortion
• Reverberation from surface reflections
Cocktail party problem • Term coined by Cherry • “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry’57) • “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp’92) • Ball-room problem by Helmholtz • “Complicated beyond conception” (Helmholtz, 1863) • Speech separation problem • Separation and enhancement are used interchangeably when dealing with nonspeech interference
Listener performance
• Speech reception threshold (SRT): the speech-to-noise ratio needed for 50% intelligibility
• Each 1 dB gain in SRT corresponds to roughly a 10% increase in intelligibility, depending on the speech materials
Source: Steeneken (1992)
Effects of competing source
[Figure: SRT for different competing sources; the SRT difference across maskers reaches 23 dB. Source: Wang & Brown (2006)]
Some applications of speech separation • Robust automatic speech and speaker recognition • Noise reduction for hearing prosthesis • Hearing aids • Cochlear implants • Noise reduction for mobile communication • Audio information retrieval
Traditional approaches to speech separation
• Speech enhancement: monaural methods that analyze general statistics of speech and noise; they require a noise estimate
• Spatial filtering with a microphone array (beamforming): extract target sound from a specific spatial direction with a sensor array
• Independent component analysis: find a demixing matrix from multiple mixtures of sound sources
• Computational auditory scene analysis (CASA): based on auditory scene analysis principles; feature-based (e.g. pitch) versus model-based (e.g. speaker model)
Supervised approach to speech separation
• Data driven, i.e. dependent on a training set
• Born out of CASA: the time-frequency masking concept has led to the formulation of speech separation as a supervised learning problem
• A recent trend fueled in part by the success of deep learning
• Focus of this tutorial
Ideal binary mask as a separation goal
• Motivated by the auditory masking phenomenon and auditory scene analysis, we suggested the ideal binary mask as a main goal of CASA (Hu & Wang, 2001; 2004)
• The idea is to retain parts of a mixture where the target sound is stronger than the acoustic background, and discard the rest
• The definition of the ideal binary mask (IBM): IBM(t, f) = 1 if SNR(t, f) > θ, and 0 otherwise
• θ: a local SNR criterion (LC) in dB, typically chosen to be 0 dB
• Optimal SNR: under certain conditions the IBM with θ = 0 dB is the optimal binary mask in terms of SNR gain (Li & Wang'09)
• Maximal articulation index (AI) in a simplified version (Loizou & Kim'11)
• It does not actually separate the mixture!
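To make the definition concrete, here is a minimal numpy sketch (not from the slides; the helper name is my own) that computes the IBM from premixed target and noise power spectrograms:

```python
import numpy as np

def ideal_binary_mask(target_pow, noise_pow, lc_db=0.0):
    """Compute the IBM from premixed target and noise power spectrograms.

    target_pow, noise_pow: T-F matrices of target and noise power.
    lc_db: local SNR criterion (LC) in dB, typically 0 dB.
    """
    eps = np.finfo(float).eps
    local_snr_db = 10.0 * np.log10((target_pow + eps) / (noise_pow + eps))
    # 1 marks target-dominant T-F units; 0 marks noise-dominant units
    return (local_snr_db > lc_db).astype(np.float32)
```

In use, the mask would be applied elementwise to the mixture's T-F representation before resynthesis; note that computing it requires the premixed signals, which is why the IBM is an oracle goal rather than a separation algorithm.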
Subject tests of ideal binary masking • IBM separation leads to large speech intelligibility improvements • Improvement for stationary noise is above 7 dB for normal-hearing (NH) listeners (Brungart et al.’06; Li & Loizou’08; Cao et al.’11; Ahmadi et al.’13), and above 9 dB for hearing-impaired (HI) listeners (Anzalone et al.’06; Wang et al.’09) • Improvement for modulated noise is significantly larger than for stationary noise • With the IBM as the goal, the speech separation problem becomes a binary classification problem • This new formulation opens the problem to a variety of pattern classification methods
Speech perception of noise with binary gains
• Wang et al. (2008) found that, when LC is chosen to be the same as the input SNR, nearly perfect intelligibility is obtained even when the input SNR is -∞ dB (i.e. the mixture contains noise only, with no target speech)
• [Demo: IBM-modulated speech-shaped noise]
Part II: Classifiers and learning machines
• Multilayer perceptrons (MLPs)
• Deep neural networks (DNNs)
Multilayer perceptrons • Multilayer perceptrons are extended from perceptrons (or simple perceptrons) • The task is to learn from examples, and then generalize to unseen samples • A core learning approach in neural networks
Perceptrons
• First learning machine
• Aiming for classification (apples vs. oranges)
• Architecture: one-layer feedforward net
• Without loss of generality, consider a single-neuron perceptron
[Diagram: one-layer feedforward network with inputs x1, …, xm and outputs y1, y2, …]
Linear separability • A perceptron separates the input space into two halves via a linear boundary • Perceptrons cannot classify linearly inseparable classes
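As an illustration (the function name and interface are my own, not from the slides), the classical perceptron learning rule fits in a few lines; on a linearly inseparable set such as XOR this loop never converges, which motivates the multilayer extension on the next slide:

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Classical perceptron rule for a single-neuron perceptron.

    X: (n_samples, n_features) inputs; y: labels in {-1, +1}.
    Converges only if the two classes are linearly separable.
    """
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # sample is misclassified
                w += lr * yi * xi        # move the boundary toward it
                b += lr * yi
    return w, b
```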
Multilayer perceptrons • If a training set is not linearly separable, a multilayer perceptron can give a solution
Backpropagation algorithm • Square error (loss) function between the desired (teaching) signal and actual output • Key question: how to minimize the error function by updating weights in multilayer structure with nonlinear activation functions? • Solution: apply the technique of gradient descent over weights • Gradient points to the direction of maximum E increase • Gradient descent leads to errors back-propagated layer by layer • Hence the name of the algorithm
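To make the layer-by-layer error propagation concrete, here is a minimal numpy sketch (my own, with bias terms omitted for brevity) of one gradient-descent step for a two-layer network with sigmoid activations and squared error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, d, W1, W2, lr=0.1):
    """One step of gradient descent on E = 0.5 * ||d - y||^2.

    x: input vector; d: desired (teaching) signal;
    W1, W2: hidden- and output-layer weight matrices.
    """
    # Forward pass
    h = sigmoid(W1 @ x)                       # hidden activations
    y = sigmoid(W2 @ h)                       # actual output
    # Backward pass: error terms propagate from output to hidden layer
    delta2 = (y - d) * y * (1.0 - y)          # output-layer error
    delta1 = (W2.T @ delta2) * h * (1.0 - h)  # back-propagated hidden error
    # Gradient descent over the weights
    W2 -= lr * np.outer(delta2, h)
    W1 -= lr * np.outer(delta1, x)
    return W1, W2
```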
Deep neural networks • Why deep? • As the number of layers increases, more abstract features are learned and they tend to be more invariant to superficial variations • Superior performance in practice if properly trained (e.g., convolutional neural networks) • The backprop algorithm is applicable to an arbitrary number of layers • However, deep structure is hard to train from random initializations • Vanishing gradients: Error derivatives tend to become very small in lower layers
Restricted Boltzmann machines
• Hinton et al. (2006) suggested pretraining a DNN, unsupervised, using restricted Boltzmann machines (RBMs)
• RBMs simplify Boltzmann machines by allowing connections only between the visible and hidden layers, i.e. no intra-layer recurrent connections
• This restriction enables exact and efficient inference
DNN training • Unsupervised, layerwise RBM pretraining • Train the first RBM using unlabeled data • Fix the first layer weights. Use the resulting hidden activations as new data to train the second RBM • Continue until all layers are thus trained • Supervised fine-tuning • The weights from RBM pretraining provide the network initialization • Use standard backpropagation to fine tune all the weights for a particular task at hand • Recent practice suggests that RBM pretraining is not needed if large training data exists
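A sketch of this recipe using scikit-learn's BernoulliRBM (the helper function and hyperparameters are illustrative choices, not from the tutorial; inputs must be scaled to [0, 1]):

```python
from sklearn.neural_network import BernoulliRBM

def pretrain_layerwise(X, layer_sizes, n_iter=10):
    """Greedy, unsupervised layer-by-layer RBM pretraining.

    X: unlabeled training data scaled to [0, 1].
    layer_sizes: hidden-layer sizes, e.g. [1024, 1024, 1024].
    Returns trained RBMs whose weights would initialize a DNN
    that is then fine-tuned with backpropagation.
    """
    rbms, data = [], X
    for n_hidden in layer_sizes:
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                           n_iter=n_iter, random_state=0)
        rbm.fit(data)               # train this layer's RBM
        data = rbm.transform(data)  # hidden activations feed the next RBM
        rbms.append(rbm)
    return rbms
```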
Part III: Training targets
• What supervised training aims to learn is important for speech separation/enhancement
• Different training targets lead to different mapping functions from noisy features to separated speech
• Different targets may have different levels of generalization
• A recent study (Wang et al.'14) examines different training targets (objective functions):
• Masking-based targets
• Mapping-based targets
• Other targets
Background
• While the IBM was the first target used in supervised separation (see Part V), the quality of the separated speech is a persistent issue
• What are alternative targets?
• Target binary mask (TBM) (Kjems et al.'09; Gonzalez & Brookes'14)
• Ideal ratio mask (IRM) (Srinivasan et al.'06; Narayanan & Wang'13; Hummersone et al.'14)
• STFT spectral magnitude (Xu et al.'14; Han et al.'14)
Different training targets
• TBM: similar to the IBM except that the interference is fixed to speech-shaped noise (SSN)
• IRM: IRM(t, f) = [S²(t, f) / (S²(t, f) + N²(t, f))]^β, where S²(t, f) and N²(t, f) denote speech and noise energy within a T-F unit
• β is a tunable parameter, and a good choice is 0.5
• With β = 0.5, the IRM becomes a square-root Wiener filter, which is the optimal estimator of the power spectrum
Different training targets (cont.)
• Gammatone frequency power spectrum (GF-POW): the power spectrum of clean speech in the gammatone domain
• FFT-MAG: the STFT spectral magnitude of clean speech
• FFT-MASK: FFT-MASK(t, f) = |S(t, f)| / |Y(t, f)|, where S and Y denote the STFTs of clean and noisy speech
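For concreteness, a numpy sketch (helper name my own) computing the masking- and mapping-based targets from the premixed STFTs, assuming the mixture is Y = S + N:

```python
import numpy as np

def training_targets(S, N, beta=0.5):
    """Compute IRM, FFT-MAG, and FFT-MASK targets.

    S, N: complex STFTs of premixed clean speech and noise.
    """
    eps = np.finfo(float).eps
    Y = S + N                                    # noisy-speech STFT
    irm = (np.abs(S)**2 / (np.abs(S)**2 + np.abs(N)**2 + eps)) ** beta
    fft_mag = np.abs(S)                          # mapping target
    fft_mask = np.abs(S) / (np.abs(Y) + eps)     # masking target
    return irm, fft_mag, fft_mask
```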
[Figure: Illustration of the various training targets for a mixture with factory noise at -5 dB]
Evaluation methodology • Learning machine: DNN with 3 hidden layers, each with 1024 units • Target speech: TIMIT • Noises: SSN + four nonstationary noises from NOISEX • Training and testing on different segments of each noise • Trained at -5 and 0 dB, and tested at -5, 0, and 5 dB • Evaluation metrics • STOI: standard metric for predicted speech intelligibility • PESQ: standard metric for perceptual speech quality • SNR
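Both metrics have open-source implementations; as a hedged example, an evaluation sketch using the third-party pystoi and pesq packages (file names are placeholders):

```python
import soundfile as sf
from pystoi import stoi   # pip install pystoi
from pesq import pesq     # pip install pesq

clean, fs = sf.read('clean.wav')           # placeholder file names
separated, _ = sf.read('separated.wav')

stoi_score = stoi(clean, separated, fs)        # roughly 0 to 1; higher is better
pesq_score = pesq(fs, clean, separated, 'wb')  # wideband mode needs fs = 16 kHz
print(f'STOI: {stoi_score:.3f}, PESQ: {pesq_score:.2f}')
```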
Comparisons • Comparisons among different training targets • Trained and tested on each noise, but different segments • An additional comparison for the IRM target with multi-condition (MC) training of all noises (MC-IRM) • Comparisons with different approaches • Speech enhancement (Hendriks et al.’10) • Supervised NMF: ASNA-NMF (Virtanen et al.’13), trained and tested in the same way as supervised separation
[Figure: STOI comparison for factory noise, grouping masking-based, spectral-mapping, and NMF & speech enhancement (SPEH) methods]
[Figure: PESQ comparison for factory noise, with the same grouping]
Summary among different targets
• Of the two binary masks, IBM estimation performs better in PESQ than TBM estimation
• Ratio masking performs better than binary masking for speech quality
• IRM, FFT-MASK, and GF-POW produce comparable PESQ results
• FFT-MASK is a better estimation target than FFT-MAG
• Many-to-one mapping in FFT-MAG vs. one-to-one mapping in FFT-MASK; the latter should be easier to learn
• Estimation of spectral magnitudes or their compressed version tends to magnify estimation errors
Other targets: signal approximation
• In signal approximation (SA), training aims to estimate the IRM, but the error is measured against the spectral magnitude of clean speech (Weninger et al.'14):
SA(t, f) = [RM(t, f) |Y(t, f)| - |S(t, f)|]²
• RM(t, f) denotes an estimated IRM
• This objective function maximizes SNR
• Jin & Wang (2009) proposed an earlier version in conjunction with IBM estimation
• There is some improvement over direct IRM estimation
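A one-function numpy sketch of the SA loss as written above (helper name my own):

```python
import numpy as np

def signal_approximation_loss(rm_est, Y_mag, S_mag):
    """SA loss: score the estimated ratio mask rm_est by how well it
    reconstructs the clean magnitude, summing
    (RM(t, f) * |Y(t, f)| - |S(t, f)|)^2 over all T-F units."""
    return np.sum((rm_est * Y_mag - S_mag) ** 2)
```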
Phase-sensitive target
• The phase-sensitive target is an ideal ratio mask (FFT mask) that incorporates the phase difference, θ, between clean speech and noisy speech (Erdogan et al.'15):
PSM(t, f) = (|S(t, f)| / |Y(t, f)|) cos θ
• Because of phase sensitivity, this target leads to a better estimate of clean speech than the FFT mask
• It does not directly estimate phase
• We will come back to this issue later
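A numpy sketch of this target (helper name my own), computed from the complex STFTs of clean and noisy speech:

```python
import numpy as np

def phase_sensitive_mask(S, Y):
    """FFT mask scaled by cos(theta), where theta is the phase
    difference between clean speech S and noisy speech Y."""
    eps = np.finfo(float).eps
    theta = np.angle(S) - np.angle(Y)
    return (np.abs(S) / (np.abs(Y) + eps)) * np.cos(theta)
```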
Part IV: Features for supervised separation • For supervised learning, features and learning machines are two key components • Early studies only used a few features • ITD/ILD (Roman et al.’03) • Pitch (Jin et al.’09) • Amplitude modulation spectrogram (AMS) (Kim et al.’09) • A subsequent study expanded the list (Wang et al.’13) • Newly included: MFCC, GFCC, PLP, and RASTA-PLP • A complementary set is recommended: AMS+RASTA-PLP+MFCC (and PITCH)
A systematic feature study • Chen et al. (2014) have evaluated an extensive list of acoustic features, previously used for robust ASR, for classification-based speech separation • Evaluation done at the low SNR level of -5 dB, with implications for speech intelligibility improvements • In addition to existing features, a new feature called Multi-Resolution Cochleagram (MRCG) was introduced • Classifier is fixed to an MLP
Evaluation framework • Each frame of features is sent to an MLP to estimate a frame of the IBM • A single fullband classifier
Evaluation criteria
• Two criteria are used to measure the quality of IBM estimation:
• Classification accuracy
• HIT - FA rate, i.e. the hit rate (percent of target-dominant units correctly classified) minus the false-alarm rate (percent of noise-dominant units wrongly classified as target-dominant), which is well correlated with human speech intelligibility (Kim et al.'09)
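A small numpy sketch (helper name my own) of how HIT - FA could be computed from the IBM and an estimated binary mask:

```python
import numpy as np

def hit_fa(ibm, mask_est):
    """HIT - FA for binary mask estimation.

    HIT: fraction of target-dominant (1) units correctly labeled 1.
    FA:  fraction of noise-dominant (0) units wrongly labeled 1.
    """
    ibm, est = ibm.astype(bool), mask_est.astype(bool)
    hit = np.mean(est[ibm])    # hit rate
    fa = np.mean(est[~ibm])    # false-alarm rate
    return hit - fa
```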
Multi-resolution cochleagram • MRCG is constructed by combining multiple cochleagrams at different resolutions • A high-resolution cochleagram captures local information, while a low-resolution cochleagram captures information in a broader spectrotemporal context
MRCG derivation • Steps for computing MRCG: • Given an input mixture, compute the first 64-channel cochleagram, CG1, with the frame length of 20 ms and frame shift of 10 ms. This is a common form of cochleagram. A log operation is applied to the energy of each T-F unit. • Similarly, compute CG2 with the frame length of 200 ms and frame shift of 10 ms. • CG3 is derived by averaging CG1 across a square window of 11 frequency channels and 11 time frames centered at a given T-F unit. • If the window goes beyond the given cochleagram, the outside units take the value of zero (i.e. zero padding) • CG4 is computed in a similar way to CG3, except that a 23×23 square window is used • Concatenate CG1-CG4 to obtain the MRCG feature, which has 64×4 dimensions for each time frame
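A sketch of these steps in Python (my own; the cochleagram function is assumed rather than implemented here), with scipy's uniform_filter performing the zero-padded local averaging:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def mrcg(mixture, fs, cochleagram):
    """Sketch of MRCG extraction.

    cochleagram: assumed function returning a log 64-channel
    cochleagram (channels x frames) for a given frame length/shift.
    """
    cg1 = cochleagram(mixture, fs, frame_len=0.020, frame_shift=0.010)
    cg2 = cochleagram(mixture, fs, frame_len=0.200, frame_shift=0.010)
    # Local averaging of CG1; mode='constant' zero-pads beyond the edges
    cg3 = uniform_filter(cg1, size=(11, 11), mode='constant', cval=0.0)
    cg4 = uniform_filter(cg1, size=(23, 23), mode='constant', cval=0.0)
    # 64 x 4 = 256 feature dimensions per time frame
    return np.concatenate([cg1, cg2, cg3, cg4], axis=0)
```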
MRCG visualization • Left: MRCG from a noisy mixture (-5 dB babble) • Right: MRCG from clean speech
Experimental setup • Speech from the IEEE male corpus and six nonstationary NOISEX noises • 480 sentences for training and another 50 sentences for testing • First half of each 4-minute noise is used for training, and second half for testing • Mixture SNR is fixed to -5 dB • The hidden layer of MLP has 300 units
Features considered • As robust ASR is likely related to speech separation, Chen et al. constructed an extensive list of features from robust ASR research • A subset of promising features is selected based on their ASR performance at low SNRs and for nonstationary noises
Features selected for evaluation • In addition to those previously studied for speech separation, newly selected ones include • Gammatone frequency modulation coefficient (GFMC; Maganti & Matassoni’10) • Zero-crossing with peak-amplitude (ZCPA, Kim et al.’99) • Relative autocorrelation sequence MFCC (RAS-MFCC, Yuo & Wang’99) • Autocorrelation sequence MFCC (AC-MFCC, Shannon & Paliwal’06) • Phase autocorrelation MFCC (PAC-MFCC, Ikbal et al.’03) • Power normalized cepstral coefficient (PNCC, Kim & Stern’12) • Gabor filterbank (GFB) feature (Schadler et al.’12) • Delta-spectral cepstral coefficient (DSCC, Kumar et al.’11) • Suppression of slowly-varying components and the falling edge of the power envelope (SSF, Kim & Stern’10)
Result summary
• The MRCG feature performs the best consistently
• Gammatone-domain features (MRCG, GF and GFCC) perform best
• Cepstral (DCT) compaction does not help: GF is better than GFCC
• Modulation-domain features do not help: GFCC is better than GFMC, even though the latter is derived from the former
• MFCC with ARMA filtering performs reasonably well
• Poor performance of pitch is largely due to pitch estimation errors at -5 dB SNR
Part V: Separation algorithms
• Monaural separation
  • Speech-nonspeech separation
  • Complex-domain separation
  • Two-talker separation
  • Separation of reverberant speech
• Binaural separation
Early monaural attempts at IBM estimation
• Jin & Wang (2009) proposed MLP-based classification to separate reverberant voiced speech
• A 6-dimensional pitch-based feature is extracted within each T-F unit
• Classification aims at the IBM, but with a training target that takes into account the relative energy of the T-F unit
• Kim et al. (2009) proposed GMM-based classification to perform speech separation in a masker-dependent way
• AMS features are extracted within each T-F unit
• Lower LCs are used at higher frequencies in the IBM definition
• This was the first monaural speech segregation algorithm to achieve speech intelligibility improvement for NH listeners